Ethernet in the Age of AI Q&A

AI is having a transformative impact on networking. It’s a topic that the SNIA Data, Storage & Networking Community covered in our live webinar, “Ethernet in the Age of AI: Adapting to New Networking Challenges.” The presentation explored various use cases of AI, the nature of traffic for different workloads, the network impact of these workloads, and how Ethernet is evolving to meet these demands. The webinar audience was highly engaged and asked many interesting questions. Here are the answers to them all.

Q. What is the biggest challenge when designing and operating an AI scale-out fabric?

A. The biggest challenge in designing and operating an AI scale-out fabric is achieving low latency and high bandwidth at scale. AI workloads, like training large neural networks, demand rapid, synchronized data transfers between thousands of GPUs or accelerators. This requires specialized interconnects, such as RDMA, InfiniBand, or NVLink, and optimized topologies like fat-tree or dragonfly to minimize communication delays and bottlenecks.

Balancing scalability with performance is critical; as the system grows, maintaining consistent throughput and minimizing congestion becomes increasingly complex. Additionally, ensuring fault tolerance, power efficiency, and compatibility with rapidly evolving AI workloads adds to the operational challenges.

Unlike standard data center networks, AI fabrics handle intensive east-west traffic patterns that require purpose-built infrastructure. Effective software integration for scheduling and load balancing is equally essential. The need to align performance, cost, and reliability makes designing and managing an AI scale-out fabric a multifaceted and demanding task.
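To make the interplay of latency, bandwidth, and scale more concrete, here is a back-of-the-envelope sketch in Python. It uses the standard ring all-reduce cost model, and the link speed, per-hop latency, and gradient size are purely illustrative assumptions, not measurements from any particular fabric.

```python
# Back-of-the-envelope model of a synchronized ring all-reduce.
# All numbers below (link speed, per-hop latency, gradient size) are
# illustrative assumptions, not measurements.

def ring_allreduce_time_us(num_gpus: int, payload_bytes: float,
                           link_gbps: float, per_hop_latency_us: float) -> float:
    """Classic ring all-reduce cost: 2*(N-1) steps, each moving payload/N bytes
    over one link and paying one hop of latency."""
    steps = 2 * (num_gpus - 1)
    bytes_per_step = payload_bytes / num_gpus
    bytes_per_us = link_gbps * 1e9 / 8 / 1e6      # link speed in bytes per microsecond
    return steps * (per_hop_latency_us + bytes_per_step / bytes_per_us)

if __name__ == "__main__":
    payload = 1e9                                  # ~1 GB of gradients (assumption)
    for n in (8, 64, 512, 4096):
        t_us = ring_allreduce_time_us(n, payload, link_gbps=400, per_hop_latency_us=5)
        print(f"{n:5d} GPUs -> ~{t_us / 1e3:7.1f} ms per all-reduce")
```

Under these assumptions the bandwidth term plateaus as the ring grows, while the accumulated per-hop latency keeps rising, which is why latency, bandwidth, and topology have to be engineered together rather than in isolation.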

Q. What are the most common misconceptions about AI scale-out fabrics?

A. The most common misconception about AI scale-out fabrics is that they are the same as standard data center networks. In reality, AI fabrics are purpose-built for high-bandwidth, low-latency, east-west communication between GPUs, essential for workloads like large language model (LLM) training and inference. Many believe increasing bandwidth alone solves performance issues, but factors like latency, congestion control, and topology optimization (e.g., fat-tree, dragonfly) are equally critical.

Another myth is that scaling out is straightforward—adding GPUs without addressing communication overhead or load balancing often leads to bottlenecks. Similarly, people assume all AI workloads can use a single fabric, overlooking differences in training and inference needs.

AI fabrics also aren’t plug-and-play; they require extensive tuning of hardware and software for optimal performance.

Q. How do you see the future of AI scale-out fabrics evolving over the next few years?

A. AI scale-out fabrics are going to include more and more Ethernet. Ethernet-based fabrics, enhanced with technologies like RoCE (RDMA over Converged Ethernet), will continue to evolve to deliver the low latency and high bandwidth required for large-scale AI applications, particularly in training and inference of LLMs.

Emerging standards such as 800GbE and beyond will provide the throughput needed for dense, GPU-intensive workloads. Advanced congestion management techniques, such as DCQCN, multipathing, and packet trimming, will improve performance in Ethernet-based fabrics by reducing packet loss and latency.
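As an illustration of what DCQCN-style congestion control does conceptually, here is a deliberately simplified Python sketch of a sender that cuts its rate multiplicatively when congestion notifications (CNPs) arrive and recovers gradually otherwise. Real DCQCN runs inside the RDMA NIC and carries additional state (fast recovery, additive and hyper increase timers); the class name and parameter values here are assumptions for illustration only.

```python
# Deliberately simplified, DCQCN-style sender rate control: multiplicative
# decrease when a Congestion Notification Packet (CNP) arrives, gradual
# recovery toward a target rate otherwise. Values are illustrative assumptions.

class DcqcnLikeSender:
    def __init__(self, line_rate_gbps: float, g: float = 1 / 16, rate_ai_gbps: float = 5.0):
        self.line_rate = line_rate_gbps
        self.current_rate = line_rate_gbps   # Rc: rate actually used for sending
        self.target_rate = line_rate_gbps    # Rt: rate to recover toward
        self.alpha = 1.0                     # running congestion estimate
        self.g = g                           # averaging gain for alpha
        self.rate_ai = rate_ai_gbps          # additive-increase step

    def on_cnp(self) -> None:
        """ECN-marked congestion reported by the receiver: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNPs seen this period: decay alpha and climb back toward target."""
        self.alpha = (1 - self.g) * self.alpha
        self.target_rate = min(self.line_rate, self.target_rate + self.rate_ai)
        self.current_rate = (self.current_rate + self.target_rate) / 2

sender = DcqcnLikeSender(line_rate_gbps=400)
for _ in range(3):
    sender.on_cnp()
print(f"after congestion: ~{sender.current_rate:.0f} Gbps")
for _ in range(10):
    sender.on_quiet_period()
print(f"after recovery:   ~{sender.current_rate:.0f} Gbps")
```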

Ethernet’s cost-effectiveness, ubiquity, and compatibility with hybrid environments will make it a key enabler for AI scale-out fabrics in both cloud and on-premises deployments.

The convergence of CXL over Ethernet may eventually enable memory pooling and shared memory access across components within scale-up systems, supporting the increasing memory demands of LLMs.

The need for Ethernet in scale-up fabrics is going to rise as well.

Q. What are the best practices for staying updated with the latest trends and developments? Can you recommend any additional resources or readings for further learning?

A. There are several papers and research articles available online; some of them are listed in the webinar slide deck. Following the Ultra Ethernet Consortium and SNIA is one of the best ways to stay current on networking-related developments.

Q. Is NVLink a standard?

A. No, NVLink is not an open standard. It is a proprietary interconnect technology developed by NVIDIA. It is specifically designed to enable high-speed, low-latency communication between NVIDIA GPUs and, in some cases, between GPUs and CPUs in NVIDIA systems.

Q. What’s the difference between collectives and multicast?

A. It is tempting to think that collectives and multicast are similar, especially for collectives such as Broadcast, but they are different in principle and address different requirements. Collectives are high-level operations for distributed computing, while multicast is a low-level network mechanism for efficient one-to-many data transmission.
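A small contrast may help. The first snippet below uses a collective, an MPI broadcast via mpi4py (chosen here purely for illustration): the application issues one high-level call and the library decides how the data actually moves (rings, trees, possibly multicast underneath). The second sends a single UDP datagram to a multicast group address, the low-level mechanism by which the network itself replicates one transmission to many subscribed receivers. The group address and port are arbitrary examples.

```python
# (1) Collective: an MPI broadcast via mpi4py. One high-level call; the
#     MPI/CCL library chooses the actual data movement.
#     Run with e.g.: mpirun -np 4 python collective_vs_multicast.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
obj = {"weights_version": 42} if comm.Get_rank() == 0 else None
obj = comm.bcast(obj, root=0)             # every rank returns with the same object
print(f"rank {comm.Get_rank()} received {obj}")

# (2) Multicast: a low-level network mechanism. One UDP datagram is sent to a
#     group address and the network replicates it to all subscribed receivers.
import socket
import struct

GROUP, PORT = "239.1.1.1", 5007           # arbitrary example group and port
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, struct.pack("b", 1))
sock.sendto(b"one datagram, delivered to every group member", (GROUP, PORT))
sock.close()
```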

Q. What’s the supporting library/tool/kernel module for enabling Node1 GPU1 -> Node2 GPU2 -> GPU fabric -> Node2 GPU2? It seems this requires host-level knowledge, not TOR-level.

A. Yes, topology discovery and selection of the optimal path for routing GPU messages from the source depend on the host software and are not TOR dependent.

The GPU applications end up using the MPI APIs for communication between the nodes in the cluster. These MPI APIs are made aware of the GPU topologies by the respective extension libraries provided by the GPU vendor.

For instance, NVIDIA’s NCCL and AMD’s RCCL libraries provide an option to specify the static GPU topology of the system through an XML file (via NCCL_TOPO_FILE or RCCL_TOPO_FILE) that is loaded when the stack is initialized. The GPU-aware MPI library extensions from NVIDIA/AMD then leverage this topology information to send messages to the appropriate GPU.

An example NCCL topology is here: https://github.com/nebius/nccl-topology/blob/main/nccl-topo-h100-v1.xml.
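As a minimal sketch of how such a file might be wired in, assuming PyTorch with the NCCL backend, a launcher such as torchrun or mpirun that sets the usual rank/world-size environment variables, and a hypothetical path to a topology XML like the example linked above:

```python
# Minimal sketch: pointing NCCL at a static topology file before the
# communicator is created. Assumes a launcher has set rank/world-size env vars.
import os
import torch.distributed as dist

os.environ["NCCL_TOPO_FILE"] = "/opt/topo/nccl-topo-h100.xml"  # hypothetical path
os.environ.setdefault("NCCL_DEBUG", "INFO")                    # log the topology NCCL selects

dist.init_process_group(backend="nccl")   # NCCL reads NCCL_TOPO_FILE during init
print(f"rank {dist.get_rank()} of {dist.get_world_size()} up with static topology")
dist.destroy_process_group()
```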

Utilities such as nvidia-smi/rocm-smi are used in the initial discovery. Automatic topology detection and calculation of optimal paths for MPI can also be provided as part of the GPU vendor’s CCL library. For instance, NCCL provides such functionality by reading /sys on the host and building the PCI topology of GPUs and NICs.
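For a rough feel of what that host-side discovery involves, here is an illustrative, Linux-only Python walk of /sys/bus/pci/devices that classifies devices by PCI class code. This is not NCCL’s actual code, which is considerably more detailed (NUMA affinity, PCI switch hierarchy, NVLink); it only shows the principle of building a device inventory from /sys.

```python
# Rough illustration of host-side discovery: walk /sys/bus/pci/devices and
# classify devices by PCI class code (GPU vs. NIC).
from pathlib import Path

PCI_CLASSES = {
    "0x0300": "GPU (VGA controller)",
    "0x0302": "GPU (3D controller)",        # typical for datacenter GPUs
    "0x0200": "NIC (Ethernet controller)",
    "0x0207": "NIC (InfiniBand controller)",
}

for dev in sorted(Path("/sys/bus/pci/devices").iterdir()):
    class_code = (dev / "class").read_text().strip()[:6]   # e.g. "0x0302"
    label = PCI_CLASSES.get(class_code)
    if label:
        # The directory name is the PCI address (domain:bus:device.function);
        # its place in the /sys hierarchy reveals the PCI tree being reconstructed.
        print(f"{dev.name}  {label}")
```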

The SNIA Data, Storage & Networking Community provides vendor-neutral education on a wide range of topics. Follow us on LinkedIn and @SNIA for upcoming webinars, articles, and content.
