Hidden Costs of AI Q&A

At our recent SNIA Networking Storage Forum webinar, “Addressing the Hidden Costs of AI,” our expert team explored the impacts of AI, including sustainability and areas where there are potentially hidden technical and infrastructure costs. If you missed the live event, you can watch it on-demand in the SNIA Educational Library. Questions from the audience ranged from training Large Language Models to fundamental infrastructure changes from AI and more. Here are answers to the audience’s questions from our presenters.

Q: Do you have an idea of where the best tradeoff is for high IO speed cost and GPU working cost? Is it always best to spend maximum and get highest IO speed possible?

A: It depends on what you are trying to do. If you are training a Large Language Model (LLM), then you’ll have a large collection of GPUs communicating with one another regularly (e.g., all-reduce) and doing so at throughput rates of up to 900GB/s per GPU! For this kind of use case, it makes sense to use the fastest network option available. Any money saved by using a cheaper, slightly less performant transport will be more than offset by the cost of GPUs that sit idle while waiting for data.
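For readers who want to see what that all-reduce step looks like in practice, here is a minimal sketch using PyTorch’s distributed package. The setup details (launcher-provided environment variables, NCCL backend) are assumptions about a typical training stack rather than anything specific to the webinar:

```python
# Minimal sketch of the gradient all-reduce that dominates GPU-to-GPU traffic
# during LLM training. Assumes one process per GPU and that the job launcher
# sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT.
import torch
import torch.distributed as dist

def allreduce_gradients(model):
    """Average gradients across all GPUs after the backward pass."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Every GPU sends and receives this tensor; for a multi-billion
            # parameter model this step moves gigabytes per iteration.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

# Typical one-time setup per process (hypothetical launcher environment):
# dist.init_process_group(backend="nccl")
# torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```

The point of the sketch is simply that every iteration of training repeats this exchange across all GPUs, which is why the fabric speed directly determines how long the GPUs sit idle.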

If you are more interested in Fine Tuning an existing model or using Retrieval Augmented Generation (RAG) then you won’t need quite as much network bandwidth and can choose a more economical connectivity option.

It’s worth noting that a group of companies have come together to work on the next generation of networking that will be well suited for use in HPC and AI environments. This group, the Ultra Ethernet Consortium (UEC), has agreed to collaborate on an open standard and has wide industry backing. This should allow even large clusters (1000+ nodes) to utilize a common fabric for all the network needs of a cluster.

Q: We (all industries) are trying to use AI for everything.  Is that cost effective?  Does it cost fractions of a penny to answer a user question, or is there a high cost that is being hidden or eaten by someone now because the industry is so new?

A: It does not make sense to try to use AI/ML to solve every problem. AI/ML should only be used when a more traditional, algorithmic technique cannot easily solve the problem (and there are plenty of these cases). Generative AI aside, one example where AI has historically provided an enormous benefit for IT practitioners is Multivariate Anomaly Detection. These models can learn what normal looks like for a given set of telemetry streams and then alert the user when something unexpected happens. A traditional approach (e.g., hand-writing source code for an anomaly detector) would be cost and time prohibitive and would probably not be anywhere near as good at detecting anomalies.
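As an illustration of the idea (not a production detector), here is a minimal multivariate anomaly detection sketch using scikit-learn’s IsolationForest; the telemetry metrics and values are invented for the example:

```python
# Minimal sketch: learn "normal" from multivariate telemetry, then flag
# unexpected points. The metric columns (IOPS, latency_ms, queue_depth,
# throughput_gbps) are illustrative only.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Pretend telemetry: 10,000 samples of four metrics under normal conditions.
normal_telemetry = rng.normal(loc=[5000, 2.0, 8, 40],
                              scale=[500, 0.3, 2, 5],
                              size=(10_000, 4))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal_telemetry)

# New samples: one plausible, one anomalous (latency spike + throughput drop).
new_samples = np.array([[5100, 2.1, 9, 41],
                        [5100, 25.0, 60, 3]])
print(model.predict(new_samples))  # 1 = normal, -1 = anomaly
```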

Q: Can you discuss typical data access patterns for model training or tuning? (sequential/random, block sizes, repeated access, etc)?

A: There is no simple answer, as the access patterns can vary from one type of training to the next. Assuming you’d like a better answer than that, I would suggest starting with two resources:

  1. Meta’s OCP Presentation: “Meta’s evolution of network for AI” includes a ton of great information about AI’s impact on the network.
  2. Blocks and Files article: “MLCommons publishes storage benchmark for AI” includes a table that provides an overview of benchmark results for one set of tests.

Q: Will this video be available after the talk? I would like to forward to my co-workers. Great info.

A: Yes. You can access the video and a PDF of the presentation slides here.

Q: Does this mean we’re moving to fewer updates or write once (or infrequently) read mostly storage model?  I’m excluding dynamic data from end-user inference requests.

A: For the active training and fine-tuning phases of an AI model, the data patterns are very read heavy. However, quite a lot of work is done before a training or fine-tuning job begins that is much more balanced between reads and writes. This is called the “data preparation” phase of an AI pipeline. Data prep takes existing data from a variety of sources (an in-house data lake, a dataset from a public repo, or a database) and performs data manipulation tasks, at a minimum data labeling and formatting. So, tuning for reads alone may not be optimal.
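To make that read-and-write balance concrete, here is a minimal, hypothetical data-prep sketch using pandas; the file names, columns, and labeling rule are invented for illustration:

```python
# Minimal sketch of a data-preparation step: read raw records, apply
# formatting and a simple labeling rule, and write the prepared dataset
# back out. File names and the labeling rule are hypothetical.
import pandas as pd

raw = pd.read_parquet("raw_documents.parquet")       # read-heavy input

# Formatting + labeling (illustrative rule only).
raw["text"] = raw["text"].str.strip().str.lower()
raw["label"] = (raw["text"].str.len() > 512).astype(int)

prepared = raw[["text", "label"]].dropna()
prepared.to_parquet("prepared_train.parquet")        # write-heavy output
```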

Q: Fibre Channel seems to have a lot of the characteristics required for the fabric. Could NVMe over a Fibre Channel fabric be utilized to handle data ingestion for the AI component on dedicated adapters for storage (disaggregated storage)?

A: Fibre Channel is not a great fit for AI use cases for a few reasons:

  • With AI, data is typically accessed as either Files or Objects, not Blocks, and FC is primarily used to access block storage.
  • If you wanted to use FC in place of IB (for GPU-to-GPU traffic) you’d need something like FC-RDMA to make FC suitable.
  • All of that said, FC currently maxes out at 128GFC and there are two reasons why this matters:
    1. AI-optimized storage starts at 200Gbps, and based on some end-user feedback, 400Gbps is already not fast enough.
    2. GPU-to-GPU traffic requires up to 900GB/s (7200Gbps) of throughput per GPU; that’s about 56 128GFC interfaces per GPU (see the quick arithmetic sketch below).
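For anyone who wants to check that arithmetic, here is a quick sketch using nominal line rates (encoding overhead ignored):

```python
# Back-of-the-envelope check of the figures above (nominal rates only).
gpu_throughput_gbytes = 900                         # GB/s per GPU
gpu_throughput_gbits = gpu_throughput_gbytes * 8    # = 7200 Gbps

gfc_128_link_gbits = 128                            # nominal 128GFC line rate
links_needed = gpu_throughput_gbits / gfc_128_link_gbits
print(f"{gpu_throughput_gbits} Gbps / {gfc_128_link_gbits} Gbps ≈ {links_needed:.0f} links per GPU")
```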

Q: Do you see something like GPUDirect storage from NVIDIA becoming the standard?  So does this mean NVMe will win? (over FC or TCP?)  Will other AI chip providers have to adopt their own GPUDirect-like protocol?

A: It’s too early to say whether or not GPUDirect storage will become a de facto standard or if alternate approaches (e.g., pNFS) will be able to satisfy the needs of most environments. The answer is likely to be “both”.

Q: You’ve mentioned demand for higher throughput for training, and lower latency for inference. Is there a demand for low cost, high capacity, archive tier storage?

A: Not specifically for AI. Depending on what you are doing, training and inference can be latency or throughput sensitive (sometimes both). Training an LLM (which most users will never actually attempt) requires massive throughput from storage for both reads and writes: the faster the better when loading data into the GPUs or when the GPUs are saving checkpoints. An inference workload wouldn’t require the same throughput as training, but to the extent that it needs to access storage, it would certainly benefit from low latency. If you are trying to optimize AI storage for anything but performance (e.g., cost), you are probably going to be disappointed with the overall performance of the system.
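As a small illustration of the checkpoint write bursts mentioned above, here is a minimal sketch in PyTorch; the storage path and checkpoint interval are hypothetical:

```python
# Minimal sketch: periodic checkpointing during training. Each call writes
# the full model and optimizer state, which for large models is a burst of
# many gigabytes; this is why storage write throughput matters.
import torch

def maybe_checkpoint(step, model, optimizer, every_n_steps=1000,
                     path_template="/mnt/ai_store/ckpt_step{step}.pt"):
    # Hypothetical path and interval; real jobs tune both to balance
    # recovery time against time spent writing checkpoints.
    if step % every_n_steps == 0:
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            path_template.format(step=step),
        )
```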

Q: What are the presenters’ views on the industry trend for where to run workloads or train models? Is it in cloud datacenters like AWS or GCP, or on-prem?

A: It truly depends on what you are doing. If you want to experiment with AI (e.g., an AI version of a “Hello World” program), or even something a bit more involved, there are lots of options that allow you to use the cloud economically. Check out this collection of colab notebooks for an example and give it a try for yourself. Once you get beyond simple projects, you’ll find that cloud-based services can become prohibitively expensive, and you’ll quickly want to start running your training jobs on-prem. The downside is the need to manage the infrastructure elements yourself, and that assumes you can even get the right GPUs, although there are reports that supply issues are easing in this space. The bottom line: whether to run on-prem or in the cloud still comes down to one question: can you realistically get the same ease of use and freedom from hardware maintenance from your own infrastructure as you could from a CSP? Sometimes the answer is yes.

Q: Does AI accelerator in PC (recently advertised for new CPUs) have any impact/benefit on using large public AI models?

A: AI accelerators in PCs will be a boon for all of us, as they will enable inference at the edge. They will also allow exploration and experimentation on your local system for building your own AI projects. You will, however, want to focus on small or mini models at this time. Without large amounts of dedicated GPU memory to help speed things up, only small models will run well on your local PC. That being said, we will continue to see improvements in this area, and PCs are a great starting point for AI projects.
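As a starting point for that kind of local experimentation, here is a minimal sketch using the Hugging Face transformers pipeline; distilgpt2 is just one example of a model small enough to run comfortably on a typical PC without a large dedicated GPU:

```python
# Minimal sketch: small-model inference on a local PC. Runs on CPU by
# default; a small model keeps memory use modest. The model choice and
# prompt are illustrative only.
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
out = generator("AI accelerators in PCs will", max_new_tokens=30)
print(out[0]["generated_text"])
```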

Q: Fundamentally — Is AI radically changing what is required from storage? Or is it simply accelerating some of the existing trends of reducing power, higher density SSDs, and pushing faster on the trends in computational storage, new NVMs transport modes (such as RDMA), and pushing for ever more file system optimizations?

A: From the point of view of a typical enterprise storage deployment (e.g., block storage being accessed over an FC SAN), AI storage is completely different. Storage is accessed as either Files or Objects, not as blocks, and the performance requirements already exceed the maximum speed that FC can deliver today (i.e., 128GFC). This means most AI storage uses either Ethernet or IB as a transport. Raw performance seems to be the primary driver in this space right now, rather than reducing power consumption or increasing density. You can expect protocols such as GPUDirect and pNFS to become increasingly important to meet performance targets.

Q: What are the innovations in HDDs relative to AI workloads? This was mentioned in the SSD + HDD slide.

A: The point of the SSD + HDD slide was that the introduction of SSDs:

  1. dramatically improved overall storage system efficiency, leading to a significant performance boost. This boost increased the amount of data that a single storage port could transmit onto a SAN, which in turn made it much more important to monitor for congestion and congestion spreading.
  2. didn’t completely displace the need for HDDs, just as GPUs won’t replace the need for CPUs. They provide different functions and excel at different types of jobs.

Q: What is the difference between (1) Peak Inference, (2) Mainstream Inference, (3) Baseline Inference, and (4) Endpoint Inference, specifically from a cost perspective?

A: This question was answered live during the webinar (see timestamp 44:27); the following is a summary of the response:
Endpoint inference is inference that happens on client devices (e.g., laptops, smartphones), where much smaller models, optimized for the very constrained power envelope of these devices, are used. Peak inference can be thought of as something like ChatGPT or Bing’s AI chatbot, where you need large, specialized infrastructure (e.g., GPUs, specialized AI hardware accelerators). Mainstream and baseline inference sit somewhere in between, where you’re using much smaller or specialized models. For example, you could have a Mistral 7B model that you have fine-tuned for an enterprise use case such as document summarization or finding insights in a sales pipeline; these use cases can employ much smaller models, and hence the requirements can vary. In terms of cost, deploying these models for edge inference would be low compared to peak inference like ChatGPT, which would be much higher. In terms of infrastructure requirements, some baseline and mainstream inference models can be served with a CPU alone, a CPU plus a GPU, a CPU plus a few GPUs, or a CPU plus a few AI accelerators. CPUs available today do have built-in AI accelerators, which can provide a cost-optimized solution for baseline and mainstream inference; this will be the typical scenario in many enterprise environments.

Q: You said utilization of network and hardware is changing significantly but compared to what? Traditional enterprise workloads or HPC workloads?

A: AI workloads will drive network utilization unlike anything the enterprise has experienced before. Each GPU (of which there are currently up to 8 in a server) can generate up to 900GB/s (7200 Gbps) of GPU-to-GPU traffic. To be fair, this GPU-to-GPU traffic can and should be isolated to a dedicated “AI fabric” that has been specifically designed for this use. Along these lines, new types of network topologies are being used; Rob mentioned one of them during his portion of the presentation (i.e., the rail topology). End users already familiar with HPC will find that many of the same constraints and scalability issues that must be dealt with in HPC environments also impact AI infrastructure.

Q: What are the key networking considerations for AI deployed at Edge (i.e. stores, branch offices)?

A: AI at the edge is a talk all on its own. Much as we see large differences between training, fine tuning, and inference in the data center, inference at the edge has many flavors, and performance requirements differ from use case to use case. Compare, for example, a centralized set of servers ingesting the camera feeds for a large retail store, aggregating them, and making inferences, versus a single camera watching an intersection and using an on-chip AI accelerator to make streaming inferences. Devices ranging from medical test equipment to your car to your phone are all edge devices with wildly different capabilities.
