GPU Clusters Storage Scaling Issues: SSD & Network Limits
Large-capacity SSDs were everywhere at this year’s Future of Memory and Storage 2024 (FMS 2024) conference in Santa Clara, California. A year ago, many customers felt that 64TB was more capacity than they could use. At FMS 2024, presentations and exhibit booths showed SSD roadmaps reaching 128TB and even 256TB for products arriving in the next few years. Why the abrupt change? Has the entire flash industry gone mad after a financially challenging year? What is happening?
An explanation is required
The rapid rise of generative AI explains this shift, just as it explains many other current developments in the IT industry. Within the storage market, there have long been predictions that fast, low-cost SSDs would eventually replace HDDs once HDDs became too slow. The problem has been that HDDs are cheap, and resourceful storage software developers keep coming up with innovative ways to squeeze just enough performance out of them.
That held true until the advent of massive GPU clusters, which devour enormous amounts of training data at great speed. Large language models (LLMs) are growing exponentially and require ever more data for training, and GPU clusters consume that data far faster than conventional CPUs can. HDDs cannot keep up with this growth, even when users stripe data across thousands of drives; doing so would simply demand too much power and space.
So why not use HDDs for large-scale data storage and SSDs close to the GPUs for speed? Because generative AI is not just an app but a workflow. Data must be ingested, curated, formatted for training, fed repeatedly to GPUs, and checkpointed periodically to guard against restarts. Public LLMs require user data for optimization and fine-tuning, and application-specific data must be quickly accessible for retrieval-augmented generation (RAG) during inferencing. Moving data across multiple storage systems is complicated, costly, and power-inefficient, and it takes attention away from improving models and putting existing ones to work.
That’s where inexpensive, high-capacity SSDs come into play. SSD performance in compute systems is commonly measured in IOPS (input/output operations per second). For storage systems, the more telling measure of device performance is throughput per capacity (MB/s per TB). For large GPU training clusters, the system requirement can reach 100 MB/s of bandwidth per terabyte of storage capacity. These massive storage systems need capacities ranging from petabytes to exabytes, which means hundreds to tens of thousands of individual drives are required to store the text, photos, and videos for multimodal models.
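To make that sizing math concrete, here is a minimal sketch of the drive-count calculation. The 100 MB/s-per-TB target comes from the figures above; the 100 PB system size and the per-drive capacity and bandwidth numbers are illustrative assumptions, not specs of any particular product:

```python
import math

# Sizing sketch: how many drives does a GPU training store need?
# The drive count must satisfy BOTH the capacity target and the
# bandwidth target (system_tb * mbps_per_tb); take the larger.

def drives_needed(system_tb, mbps_per_tb, drive_tb, drive_mbps):
    """Smallest drive count meeting both capacity and bandwidth targets."""
    for_capacity = system_tb / drive_tb
    for_bandwidth = (system_tb * mbps_per_tb) / drive_mbps
    return math.ceil(max(for_capacity, for_bandwidth))

SYSTEM_TB = 100_000          # a 100 PB training store (assumed)
TARGET_MBPS_PER_TB = 100     # bandwidth-per-capacity target from the text

# Assumed per-drive specs: a large nearline HDD vs. a high-capacity SSD.
hdds = drives_needed(SYSTEM_TB, TARGET_MBPS_PER_TB, drive_tb=24, drive_mbps=280)
ssds = drives_needed(SYSTEM_TB, TARGET_MBPS_PER_TB, drive_tb=64, drive_mbps=14_000)

print(f"HDDs needed: {hdds:,}")  # bandwidth-bound: ~35,715 drives
print(f"SSDs needed: {ssds:,}")  # capacity-bound:  ~1,563 drives
```

With these assumed numbers, the HDD build is bandwidth-bound and balloons to tens of thousands of drives, while the SSD build is capacity-bound and needs only a couple of thousand.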
With bandwidths up to 50 times higher than HDDs, SSDs can deliver the same system throughput with far fewer drives. And because there are fewer of them, each SSD must hold more than an HDD would in order to meet the system’s capacity needs. How much more?
That depends on network bandwidth and performance needs. Although ultrafast networks typically connect these storage systems to GPU clusters, the aggregate bandwidth of those networks is still far lower than that of the SSDs behind them. For the largest GPU clusters, capacities up to 64TB are often the practical limit. For systems or smaller clusters with lower performance requirements, some users want to push capacities up with 128TB or even 256TB SSDs.
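As a rough illustration of how that network cap turns into a per-drive capacity ceiling, here is a minimal sketch. The 2 x 400 GbE uplink, the 24-bay server, and the throughput targets are all assumptions chosen for illustration:

```python
# Sketch: the network, not the SSDs, sets the useful capacity ceiling.
# All figures below are illustrative assumptions.

def max_drive_tb(network_gbps, mbps_per_tb_target, n_drives):
    """Largest per-drive capacity that still meets the per-TB
    throughput target when the network is the bottleneck."""
    network_mbps = network_gbps * 1000 / 8      # line rate in MB/s (approx.)
    max_system_tb = network_mbps / mbps_per_tb_target
    return max_system_tb / n_drives

# A 24-bay storage server behind 2 x 400 GbE (assumed).
print(max_drive_tb(network_gbps=800, mbps_per_tb_target=100, n_drives=24))
# -> ~41.7 TB per drive: roughly the 64TB class for top-tier clusters

# Relax the per-TB target for a lower-performance tier...
print(max_drive_tb(network_gbps=800, mbps_per_tb_target=25, n_drives=24))
# -> ~166.7 TB per drive: 128TB and 256TB class drives start to make sense
```

Under these assumptions, the same server hardware justifies much larger drives as soon as the per-TB throughput requirement relaxes, which is the trade-off behind the 128TB and 256TB roadmaps.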
Because they don’t run at full speed in these networked systems, SSDs consume far less power there than in traditional compute applications. Furthermore, since top speed and high write-cycle endurance are not critical, design trade-offs can be made relative to conventional mainstream compute SSDs to lower costs.
What is a GPU cluster?
A GPU cluster is an assembly of several Graphics Processing Units (GPUs) connected so that they function as a single computing unit. These clusters are designed to tackle demanding computational tasks such as deep learning, artificial intelligence (AI), large-scale data analysis, and scientific simulations, all of which call for massive parallel processing power. GPU clusters are frequently used in data centers, research facilities, and other organizations that need more computing power than a single GPU can provide.
SSDs and HDDs
Here’s the main explanation
Using all SSDs rather than a mix of SSDs and HDDs makes these storage systems easier to manage. All-SSD systems also have fewer drives and storage servers, lower energy consumption, fewer racks, higher reliability, longer useful lifetimes, better latency characteristics, and less time with GPU clusters sitting idle waiting for data.
Where does it go from here?
Large-capacity, reasonably priced SSDs are becoming the preferred storage for large GPU clusters and GPU-as-a-service cloud providers. These early deployments show how the SSD’s advantages over the HDD in performance, power, and capacity justify its higher price. Over the next few years, other high-performance use cases where HDDs fall short in MB/s per TB are expected to switch to SSDs as well. Being less expensive is fantastic, but if customers cannot meet their performance requirements, CPUs, GPUs, and other accelerators sit idle, which is costly in both power and system cost.
We’ve long been familiar with the storage and memory hierarchy; we keep adding new technologies to it and rearranging the blocks of the pyramid. In response to the growing need for a class of SSDs that balances cost, power, and performance for storage applications requiring huge capacities, Micron has now added capacity SSDs as a new brick in the pyramid.
Read more on govindhtech.com