Q&A for Accelerating Gen AI Dataflow Bottlenecks

Generative AI is front page news everywhere you look. With advancements happening so quickly, it is hard to keep up. The SNIA Networking Storage Forum recently convened a panel of experts from a wide range of backgrounds to talk about Gen AI in general and specifically discuss how dataflow bottlenecks can constrain Gen AI application performance well below optimal levels. If you missed this session, “Accelerating Generative AI: Options for Conquering the Dataflow Bottlenecks,” it’s available on-demand at the SNIA Educational Library.

We promised to provide answers to our audience questions, and here they are.

Q: If ResNet-50 is a dinosaur from 2015, which model would you recommend using instead for benchmarking?

A: Setting aside the unfair aspersions being cast on the venerable ResNet-50, which is still used for inferencing benchmarks 😊, we suggest checking out the MLCommons website. The benchmarks section covers multiple training and inference use cases, and those benchmarks can tell you more about how effectively your infrastructure will handle your intended workload.
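For a quick, informal sanity check of your own stack (as distinct from a compliant MLPerf run), a minimal sketch along these lines can measure inference throughput. It assumes PyTorch and a recent torchvision; the model, batch size, and iteration counts are illustrative placeholders you would swap for your actual workload.

```python
# Informal inference-throughput check -- not an MLPerf-compliant benchmark.
# Swap in whichever model and input shape match your intended workload.
import time
import torch
import torchvision.models as models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.resnet50(weights=None).eval().to(device)   # weights=None: skip the download
batch = torch.randn(32, 3, 224, 224, device=device)       # synthetic input, illustrative batch size

with torch.no_grad():
    for _ in range(5):                                     # warm-up iterations
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    iters = 20
    for _ in range(iters):
        model(batch)
    if device == "cuda":
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

print(f"{iters * batch.shape[0] / elapsed:.1f} images/sec on {device}")
```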

Q: Even if/when we use optics to connect clusters, there is a roughly 5ns/meter delay for the fiber between clusters. Seems like that physical distance limit almost mandates alternate ways of programming optimization to ‘stitch’ the interplay between data and compute?

A: With regard to the use of optics versus copper to connect clusters, signals propagate through fiber and copper at about the same speed, so moving to an all-optical cabling infrastructure purely to reduce latency is probably not the best use of capital. Even if there were a slight difference in propagation speed through a particular optical or copper-based medium, 5 ns/m is small compared to switch and NIC packet processing latencies (e.g., 200-800 ns per hop) until you get to full metro distances. In addition, software adds 2-6 µs on top of those physical latencies even on the most optimized systems, and because AI fabrics pipeline data/messages, the raw latency does not have much effect.

Interestingly, the time for data to travel between nodes is only one of the factors limiting AI performance, and it is not the biggest one. Along these lines, there is a phenomenal talk by Stephen Jones (NVIDIA), “How GPU computing works,” which explains how latency between the GPU and memory impacts overall system efficiency far more than anything else. That said, the various collective communication libraries (NCCL, RCCL, etc.) and in-network compute (e.g., SHARP) can have a big impact on overall system efficiency by helping to avoid network contention.
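To put those numbers in perspective, here is a back-of-the-envelope sketch using the figures quoted above; the link length, hop count, and the single “typical” values picked from the quoted ranges are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope latency budget using the figures quoted in the answer above.
# Link length, hop count, and the "typical" values chosen are illustrative assumptions.
FIBER_DELAY_NS_PER_M = 5       # ~5 ns/m propagation (copper is comparable)
SWITCH_NIC_NS_PER_HOP = 500    # per-hop packet processing, within the 200-800 ns range
SOFTWARE_OVERHEAD_NS = 4000    # 2-6 us software-stack latency on optimized systems

def one_way_latency_ns(distance_m: float, hops: int) -> dict:
    """Rough one-way latency contributions for a single message."""
    return {
        "propagation": distance_m * FIBER_DELAY_NS_PER_M,
        "switching":   hops * SWITCH_NIC_NS_PER_HOP,
        "software":    SOFTWARE_OVERHEAD_NS,
    }

# Example: a 100 m path across a cluster traversing 3 switch hops.
budget = one_way_latency_ns(distance_m=100, hops=3)
total = sum(budget.values())
for part, ns in budget.items():
    print(f"{part:12s}: {ns:7.0f} ns ({100 * ns / total:4.1f}% of total)")
# Propagation delay is a small slice next to switching and software until metro distances.
```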

Q: Does this mean that GPUs are more efficient to use than CPUs and DPUs?

A: GPUs, CPUs, AI accelerators, and DPUs all provide different functions and come with different tradeoffs. While a CPU is good at executing arbitrary streams of instructions from applications/programs, embarrassingly parallel workloads (e.g., the matrix multiplications that are common in deep learning) can be performed much more efficiently by GPUs or AI accelerators, thanks to their ability to execute linear algebra operations in parallel. Similarly, I wouldn’t use a GPU or AI accelerator as a general-purpose data mover; I’d use a CPU or an IPU/DPU for that.
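As a simple illustration of that division of labor, the sketch below times the same matrix multiplication on the CPU and, if one is present, on a GPU using PyTorch; the matrix size and iteration count are arbitrary, and the actual speedup will vary with hardware.

```python
# Illustrative only: time the same matrix multiplication on CPU and (if available) GPU.
import time
import torch

def time_matmul(device: str, n: int = 4096, iters: int = 10) -> float:
    """Average seconds per n x n matrix multiplication on the given device."""
    a = torch.randn(n, n, device=device)
    b = torch.randn(n, n, device=device)
    a @ b                                      # warm-up
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"CPU: {time_matmul('cpu'):.4f} s per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda'):.4f} s per matmul")
```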

Q: With regard to vector engines, are there DPUs or switches (IB or Ethernet) that contain vector engines?

A: There are commercially available vector engine accelerators but currently there are no IPUs/DPUs or switches that provide this functionality natively.

Q: One of the major bottlenecks in modern AI is GPU-to-GPU connectivity. For example, NVIDIA uses a proprietary GPU-GPU interconnect: with DGX-2 the focus was on 16 GPUs within a single box with NVSwitch, but with A100 NVIDIA pulled this back to 8 GPUs, then expanded on that with a SuperPOD and a second level of switching to get to 256 GPUs. How do NVLink and other proprietary GPU-to-GPU interconnects address bottlenecks? And why has the industry focused on 8-GPU rather than 16-GPU deployments, given that LLMs are not training on tens of thousands of GPUs?

A: GPU-GPU interconnects address bottlenecks in the same way other high-speed fabrics do: they provide direct connections with high bandwidth, optimized topologies (point-to-point or parallel paths), and lightweight protocols. These interconnects have so far been proprietary and not interoperable across GPU vendors. The number of GPUs in a server chassis depends on many practical factors; for example, 8 Gaudis per server leveraging standard RoCE ports provides a good balance to support training and inference.
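If you want to see how the GPUs in a particular server are connected, one small sketch (assuming PyTorch on a multi-GPU NVIDIA host) is to query which device pairs can access each other’s memory directly over the GPU-GPU interconnect; vendor tools such as `nvidia-smi topo -m` report the same topology in more detail.

```python
# Report which GPU pairs in this server can access each other's memory directly
# (peer-to-peer over NVLink or PCIe). Assumes PyTorch on a multi-GPU NVIDIA host.
import torch

count = torch.cuda.device_count()
for src in range(count):
    for dst in range(count):
        if src != dst:
            p2p = torch.cuda.can_device_access_peer(src, dst)
            print(f"GPU{src} -> GPU{dst}: peer access {'yes' if p2p else 'no'}")
```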

Q: How do you see the future of blending of memory and storage being enabled for generative AI workloads and the direction of “unified” memory between accelerators, GPUs, DPUs and CPUs?

A: If by unified memory, you mean centralized memory that can be treated like a resource pool and consumed by GPUs in place of HBM or by CPUs/DPUs in place of DRAM, then we do not believe we will see it in the foreseeable future. The primary reason is latency: unified memory would require centralization, and even if you constrained the distance between the end devices and the centralized memory to a single rack, the latency added by the extra circuitry and the physical length of the transport media (at 5 ns per meter) could be detrimental to performance. The other big problem with resource sharing is contention. Whether it is congestion in the network or contention at the centralized resource access point (interface), sharing resources requires special handling that will be challenging in the general case. For example, with 10 “compute” nodes attempting to access a pool of memory on a CXL Type 3 device, many of the nodes will end up waiting an unacceptably long time for a response.

If by unified memory, you mean creating a new “capacity” tier of memory that is more performant than SSD and less performant than DRAM, then CXL Type 3 devices appear to be the way the industry will address that use case, but it may be a while before we see mass adoption.
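As a rough illustration of the contention point above, the sketch below applies a textbook M/M/1 queueing approximation to a shared memory interface; the service time and per-node request rates are made-up numbers chosen to show the trend, not measurements of any real CXL device.

```python
# Rough M/M/1 queueing illustration of contention at a shared memory interface.
# Service time and per-node request rates are made-up numbers, not CXL measurements.
SERVICE_TIME_NS = 100          # assumed time for the interface to serve one request
PER_NODE_RATE = 0.5e6          # assumed requests/sec issued by each compute node

def mean_queueing_delay_ns(nodes: int) -> float:
    """Average wait in the queue (excluding service), W_q = rho / (mu - lambda)."""
    service_rate = 1e9 / SERVICE_TIME_NS       # requests/sec the interface can serve
    arrival_rate = nodes * PER_NODE_RATE
    utilization = arrival_rate / service_rate
    if utilization >= 1.0:
        return float("inf")                    # saturated: queues grow without bound
    return 1e9 * utilization / (service_rate - arrival_rate)

for n in (1, 2, 5, 10, 15, 19):
    print(f"{n:2d} nodes -> mean queueing delay {mean_queueing_delay_ns(n):8.1f} ns")
# Delay climbs sharply as the shared interface approaches saturation.
```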

Q: Do you see hardware design becoming more specialized for the different AI/ML phases (training, inference, etc.)? In today’s enterprise deployments you can have the same hardware performing several tasks in parallel.

A: Yes. Not only have specialized hardware offerings (e.g., accelerators) already been introduced, such as consumer laptops combining CPUs with inference engines, but we also expect specialized configurations optimized for specific use cases (e.g., inferencing) to be introduced as well. The reason is the diverse set of requirements for each use case. For more information, see the OCP Global Summit 23 presentation “Meta’s evolution of network AI” (specifically starting at time stamp 4:30), which describes how different use cases stress the infrastructure in different ways. That said, there is value in accelerators and hardware that can address any of the work types for AI, so that a given cluster can run whichever mix of jobs is required at a given time.

Q: Google leaders like Amin Vahdat have been casting doubts on the possibility of significant acceleration far from the CPU. Can you elaborate further on positioning data-centric compute in the face of that challenge?

A: This is a multi-billion-dollar question, and there isn’t an obvious answer today. You could imagine building a data processing pipeline with data transform accelerators ‘far’ from where the training and inferencing CPUs/accelerators are located, and you could build a full “accelerator only” training pipeline if you consider a GPU to be an accelerator rather than a CPU. The better way to think about this problem is that there is no single answer for how to build ML infrastructure, and no single definition of CPU versus accelerator that matters in constructing useful AI infrastructure solutions; the distinction comes down to the role of the device within the infrastructure. With emerging ‘chiplet’ and similar approaches, the lines and distinctions will blur further. What is significant in what Vahdat and others have been discussing is that fabric/network/memory construction, plus protocols to improve bandwidth, limit congestion, and reduce tail latency when connecting data to computational elements (CPUs, GPUs, AI accelerators, hybrids), will see significant evolution and development over the next few years.

 
