Our recent SNIA Data, Networking & Storage Forum (DNSF) webinar, “AI Storage: The Critical Role of Storage in Optimizing AI Training Workloads,” was an insightful look at how AI workloads interact with storage at every stage of the AI data pipeline with a focus on data loading and checkpointing. Attendees gave this session a 5-star rating and asked a lot of wonderful questions. Our presenter, Ugur Kaynar, has answered them here. We’d love to hear your questions or feedback in the comments field.
Q. Great content on file and object storage. Are there any use cases for block storage in AI infrastructure?
A. Today, by default, AI frameworks cannot access block storage directly; they need a file system to interact with block storage during training. Block storage provides raw capacity, but it lacks the structure needed to manage files and directories. Like most AI frameworks, PyTorch depends on a file system to manage and access data stored on block storage.
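As a rough illustration, here is a minimal PyTorch sketch. The /mnt/data/train path is a hypothetical file system mount (e.g., ext4 or XFS) created on a block volume: the framework only ever sees files and directories, and the file system translates those reads into block I/O underneath.

```python
from pathlib import Path

import torch
from torch.utils.data import Dataset

class FileDataset(Dataset):
    """Reads pre-serialized samples through the file system namespace."""

    def __init__(self, root: str):
        # The framework enumerates files; it never addresses raw blocks.
        self.files = sorted(Path(root).glob("*.pt"))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        # A plain file read; the file system maps it to block I/O underneath.
        return torch.load(self.files[idx])

# Hypothetical mount point of a file system created on a block volume.
dataset = FileDataset("/mnt/data/train")
```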
Q. Do high-speed networks significantly enhance the I/O and checkpointing process?
A. High-speed networks enable faster data transfer and better utilization of I/O bandwidth, which can significantly reduce the time required to save checkpoints. This minimizes downtime and helps maintain system performance.
However, it is important to keep in mind that the performance of checkpointing depends on both the storage network and the storage system. It’s essential to maintain a balance between the two for optimal results.
If the network is fast but the storage system is slow, or vice versa, the slower component will create a bottleneck. This imbalance can lead to inefficiencies and longer checkpointing times. When both the network and storage systems are balanced, data can flow smoothly between them. This maximizes throughput, ensuring that data is written to storage as quickly as it is transferred over the network.
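As a toy back-of-envelope model (all numbers below are assumptions for illustration), the effective checkpoint bandwidth is simply the minimum of the two paths:

```python
# Toy model: effective checkpoint bandwidth is bounded by the slower of
# the network path and the storage system (numbers are assumptions).
checkpoint_gb = 500    # total checkpoint size
network_gb_s = 100     # e.g., aggregate NIC bandwidth
storage_gb_s = 40      # e.g., sustained write bandwidth of the array

effective_gb_s = min(network_gb_s, storage_gb_s)
bottleneck = "storage" if storage_gb_s < network_gb_s else "network"
print(f"checkpoint time: {checkpoint_gb / effective_gb_s:.1f} s, limited by {bottleneck}")
# -> checkpoint time: 12.5 s, limited by storage
```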
Q. What is the rule of thumb or range of storage throughput per GPU?
A. Please see the answer to the next question.
Q. What are the typical I/O performance requirements for AI training, in terms of I/Os per second and bytes per second?
A. The storage throughput per GPU can vary based on the specific workload and the performance requirements of the AI model being trained. Models processing text data typically need throughput ranging from a few MB/s to hundreds of MB/s per GPU.
In contrast, more demanding models that handle image or video data require higher throughput, often around a few GB/s, due to the larger sample sizes.
According to the latest MLPerf Storage benchmark results, the 3D-Unet medical image segmentation model requires approximately 2.8 GB/s of storage throughput to maintain 90% utilization of H100 GPUs (see the MLPerf Storage v1.1 benchmark results at MLCommons).
Storing the checkpoint data for large models also requires significant storage throughput, in the range of several GB/s per GPU.
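As a rough sizing sketch (the numbers below are illustrative assumptions, not benchmark results), the sustained read throughput one GPU needs is roughly its training rate in samples per second times the average sample size:

```python
# Required per-GPU read throughput ~= samples/s x average sample size.
# Both numbers below are illustrative assumptions.
samples_per_sec = 180   # training throughput of one GPU
avg_sample_mb = 16      # e.g., a preprocessed 3D image tile

required_mb_s = samples_per_sec * avg_sample_mb
print(f"~{required_mb_s / 1000:.1f} GB/s per GPU")  # -> ~2.9 GB/s
```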
Q. Do you see in this workflow a higher demand for throughput from the storage layer, or, with random operations, more demand for IOPS? How do the devices have to change to accommodate AI?
A. Typically, random I/O operations place higher demands on IOPS than on throughput.
Q. How frequently are checkpoints created and then immediately read from? Is there scope for a checkpointed cache for such immediate reads?
A. In AI training, checkpoints are usually generated at set intervals, which can differ based on the specific needs of the training process. For large-scale models, checkpoints might be saved every few minutes or after a certain number of iterations/steps to minimize data loss.
Immediate reads from these checkpoints often happen when resuming training after a failure or during model evaluation for validation checks.
Implementing a checkpoint cache can be highly advantageous given the frequency of these operations. By storing the most recent checkpoint data, such a cache can facilitate quicker recovery, reduce wait times, and enhance overall training efficiency.
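As a hedged sketch of that idea (the two-tier layout, paths, and naming scheme are assumptions, not a specific product's behavior): write each checkpoint to a node-local fast tier for immediate reads, keep a durable copy on shared storage, and prefer the local copy on resume.

```python
import shutil
from pathlib import Path

LOCAL_CACHE = Path("/local_nvme/ckpt_cache")  # fast tier (assumed path)
SHARED = Path("/mnt/shared/ckpts")            # durable tier (assumed path)

def step_of(path: Path) -> int:
    return int(path.stem.split("_")[1])       # "step_1000.ckpt" -> 1000

def save_checkpoint(step: int, blob: bytes) -> None:
    name = f"step_{step}.ckpt"
    (LOCAL_CACHE / name).write_bytes(blob)           # immediate-read copy
    shutil.copy2(LOCAL_CACHE / name, SHARED / name)  # durable copy

def load_latest() -> bytes:
    # Prefer the cache; fall back to shared storage (e.g., after a node swap).
    for tier in (LOCAL_CACHE, SHARED):
        ckpts = sorted(tier.glob("step_*.ckpt"), key=step_of)
        if ckpts:
            return ckpts[-1].read_bytes()
    raise FileNotFoundError("no checkpoint in either tier")
```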
Q. How does storage see the serialized checkpointing write? Is it a single thread/job to an individual storage device?
A. The serialized checkpoint data is written as large sequential blocks by a single writer.
Q. Do you see something like CXL memory helping with checkpointing?
A. Eventually. CXL provides two primary use cases: extending and sharing system memory, and integrating accelerators into the CPU’s coherent link. From an AI checkpointing perspective, CXL could act as an additional memory tier for storing checkpoints.
Q. Great presentation, thanks. Can you touch on how often these large models are checkpointed. Is it daily? Hourly? Also, what type of storage is preferred for checkpointing — SSD or HDD, or a mix of both — and are the checkpoints saved indefinitely? I’m trying to understand if checkpointing is a storage pain point with regard to having enough storage capacity on hand?
A. For checkpointing, the preferred storage type is high-speed flash storage (NVMe SSDs) due to its high performance during training.
The frequency of checkpointing for large AI models can vary based on several factors, including the model size, training duration, and the specific requirements of the training process. Therefore, it is difficult to generalize.
For example, Meta has reported that they perform checkpointing at 30-minute intervals for recommendation models (see the NSDI '22 paper by Eisenman et al.).
However, a common guideline is to keep the time spent on checkpointing to less than 5% of your training time. This ensures that the checkpointing process does not significantly impact the overall training efficiency while still providing sufficient recovery points in case of interruptions.
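To make that guideline concrete, here is a quick back-of-envelope check (every number is an assumption for illustration):

```python
# Checkpoint overhead = time per checkpoint / checkpoint interval.
checkpoint_size_gb = 1000   # full training state of a large model (assumed)
write_bw_gb_s = 50          # aggregate storage write bandwidth (assumed)
interval_min = 30           # checkpoint every 30 minutes (assumed)

ckpt_time_s = checkpoint_size_gb / write_bw_gb_s   # 20 s per checkpoint
overhead = ckpt_time_s / (interval_min * 60)
print(f"{ckpt_time_s:.0f} s per checkpoint -> {overhead:.1%} overhead")
# -> 20 s per checkpoint -> 1.1% overhead, comfortably under the 5% guideline
```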
If the model runs multiple epochs, checkpoints are typically saved after each epoch, which is one complete pass through the training dataset. This practice ensures that you have recovery points at the end of each epoch.
Another common approach is to checkpoint at regular intervals (a certain number of iterations or steps), such as every 500 iterations; see “Storage recommendations for AI workloads on Azure infrastructure (IaaS)” in the Microsoft Learn Cloud Adoption Framework.
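Here is a minimal, self-contained PyTorch sketch of interval-based checkpointing (the tiny model, step count, and file names are placeholders for illustration):

```python
import torch
from torch import nn

model = nn.Linear(512, 512)   # stand-in for a real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
CHECKPOINT_EVERY = 500        # iterations between checkpoints

for step in range(2000):
    x = torch.randn(32, 512)  # stand-in for a real batch
    loss = model(x).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step > 0 and step % CHECKPOINT_EVERY == 0:
        # One recovery point per interval: model + optimizer + progress.
        torch.save({"step": step,
                    "model": model.state_dict(),
                    "optimizer": optimizer.state_dict()},
                   f"ckpt_step_{step}.pt")
```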
Q. Why is serialization needed in checkpointing?
A. Serialization ensures that the model’s state, including its parameters and optimizer states, is captured in a consistent manner. By converting the model’s state into a structured format, serialization allows for efficient storage and retrieval.
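As a small, hedged PyTorch illustration (the tiny model is a stand-in): serialization captures the full training state in one structured file, and deserialization restores exactly that state into freshly built objects.

```python
import torch
from torch import nn

model = nn.Linear(8, 8)
opt = torch.optim.Adam(model.parameters())

# Serialize: one consistent, structured snapshot of all training state.
torch.save({"model": model.state_dict(), "optimizer": opt.state_dict()}, "ckpt.pt")

# Deserialize: rebuild the objects, then restore the identical state.
restored = nn.Linear(8, 8)
restored_opt = torch.optim.Adam(restored.parameters())
state = torch.load("ckpt.pt")
restored.load_state_dict(state["model"])
restored_opt.load_state_dict(state["optimizer"])
```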
Q. What is the difference between tensor cores and Matrix Multiplication Accelerator (MMA) Engines?
A. Tensor Cores are highly specialized for AI and deep learning tasks, providing significant performance boosts for these specific workloads, while MMA Engines are more general-purpose and used across a broader range of applications.
Q. Since the checkpoints are so large, do a lot of AI environments utilize tape to keep multiple copies of checkpoints?
A. During training, checkpoints are typically written to redundant fast storage solutions like all-flash arrays. This ensures that the expensive GPUs are not left idle, waiting for data to be written or read. The checkpoint data is replicated to ensure durability and prevent data loss.
Tape storage, on the other hand, is more suitable for archival purposes. Tape can be used to store checkpoints long term due to its cost-effectiveness, durability, and scalability. While it's not ideal for the high-speed demands of active training, it excels at preserving data for future reference or compliance reasons.
Q. Do you think S3/object will adopt something like RDMA for faster access to read/write data directly to GPU memory?
A. Currently, there is no RDMA support for S3. However, the increasing use of object storage suggests that object storage solutions will adopt optimizations similar to those of file systems, such as RDMA, for faster reads and writes directly to and from GPU memory.
Q. Are checkpoints stored after training, or are they deleted automatically?
A. Checkpoints are typically stored after training to allow for model recovery, fine-tuning, or further analysis. They are not deleted automatically unless explicitly configured to do so. This storage ensures that you can resume training from a specific point if needed, which is especially useful in long or complex training processes.
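If you do want automatic cleanup, it has to be configured explicitly. As a hedged sketch (the directory and naming scheme are assumptions), a simple retention policy might keep only the newest few checkpoints:

```python
from pathlib import Path

KEEP = 3                              # retention depth (assumed)
ckpt_dir = Path("/mnt/shared/ckpts")  # assumed checkpoint location

ckpts = sorted(ckpt_dir.glob("ckpt_step_*.pt"),
               key=lambda p: int(p.stem.rsplit("_", 1)[1]))
for old in ckpts[:-KEEP]:
    old.unlink()                      # deletion is explicit and opt-in
```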
To keep up with all that is going on at the SNIA Data, Networking & Storage Forum, follow us on LinkedIn and on X @SNIA.