New MLPerf Storage v1.0 Benchmark Results Show Storage Systems Play a Critical Role in AI Model Training Performance

Storage system providers showcase innovative solutions to keep pace with faster accelerators.

Kelly Berschauer, kelly@mlcommons.org

Today, MLCommons® announced results for its industry-standard MLPerf® Storage v1.0 benchmark suite, which is designed to measure the performance of storage systems for machine learning (ML) workloads in an architecture-neutral, representative, and reproducible manner. The results show that as accelerator technology has advanced and datasets continue to increase in size, ML system providers must ensure that their storage solutions keep up with the compute needs. This is a time of rapid change in ML systems, where progress in one technology area drives new demands in other areas. High-performance AI training now requires storage systems that are both large-scale and high-speed, lest access to stored data become the bottleneck in the entire system. With the v1.0 release of MLPerf Storage benchmark results, it is clear that storage system providers are innovating to meet that challenge.

Version 1.0 storage benchmark breaks new ground

The MLPerf Storage benchmark is the first and only open, transparent benchmark to measure storage performance in a diverse set of ML training scenarios. It emulates the storage demands across several scenarios and system configurations covering a range of accelerators, models, and workloads. By simulating the accelerators’ “think time,” the benchmark can generate accurate storage patterns without the need to run the actual training, making it more accessible to all. The benchmark focuses the test on a given storage system’s ability to keep pace, as it requires the simulated accelerators to maintain a required level of utilization.
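As a rough illustration of the "keep pace" idea described above, the sketch below treats utilization as the share of wall time a simulated accelerator spends in its "think time" rather than waiting on storage. This is a simplified reading, not the benchmark's own code: the function names, the sample wait times, and the 0.90 threshold are all hypothetical values chosen for illustration.

```python
# Minimal sketch (assumption: utilization = think time / (think time + I/O wait)).
# Function names and numbers are hypothetical, not taken from the benchmark.

def accelerator_utilization(think_time_s, io_wait_s):
    """Fraction of wall time the simulated accelerator spends 'computing'.

    think_time_s: emulated compute time per batch, in seconds.
    io_wait_s:    per-batch time spent stalled waiting for data, in seconds.
    """
    compute_s = think_time_s * len(io_wait_s)
    total_s = compute_s + sum(io_wait_s)
    return compute_s / total_s if total_s > 0 else 0.0


def keeps_pace(think_time_s, io_wait_s, required_au):
    """True if storage kept the simulated accelerator at or above the target."""
    return accelerator_utilization(think_time_s, io_wait_s) >= required_au


if __name__ == "__main__":
    waits = [0.002, 0.001, 0.004, 0.003]  # hypothetical stalls per batch
    au = accelerator_utilization(0.050, waits)
    print(f"utilization: {au:.3f}")
    print("keeps pace:", keeps_pace(0.050, waits, required_au=0.90))
```

The shorter the think time, the less storage latency a batch can absorb before utilization drops below the target, which is why faster accelerators make a workload more latency-sensitive.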

Three models are included in the benchmark to ensure diverse patterns of AI training are tested: 3D-UNet, Resnet50, and CosmoFlow. These workloads offer a variety of sample sizes, ranging from hundreds of megabytes to hundreds of kilobytes, as well as wide-ranging simulated “think times” from a few milliseconds to a few hundred milliseconds.

The benchmark emulates NVIDIA A100 and H100 models as representatives of the currently available accelerator technologies. The H100 accelerator reduces the per-batch computation time for the 3D-UNet workload by 76% compared to the earlier V100 accelerator in the v0.5 round, turning what was typically a bandwidth-sensitive workload into much more of a latency-sensitive workload.

In addition, MLPerf Storage v1.0 includes support for distributed training. Distributed training is an important scenario for the benchmark because it represents a common real-world practice for faster training of models with large datasets, and it presents specific challenges for a storage system not only in delivering higher throughput but also in serving multiple training nodes simultaneously.

V1.0 benchmark results show performance improvement in storage technology for ML systems

The broad scope of workloads submitted to the benchmark reflects the wide range and diversity of storage systems and architectures. This is a testament to how important ML workloads are to all types of storage solutions, and it demonstrates the active innovation happening in this space.

“The MLPerf Storage v1.0 results demonstrate a renewal in storage technology design,” said Oana Balmau, MLPerf Storage working group co-chair. “At the moment, there doesn’t appear to be a consensus ‘best of breed’ technical architecture for storage in ML systems: the submissions we received for the v1.0 benchmark took a wide range of unique and creative approaches to providing high-speed, high-scale storage.”

The results in the distributed training scenario show the delicate balance needed between the number of hosts, the number of simulated accelerators per host, and the storage system in order to serve all accelerators at the required utilization. Adding more nodes and accelerators to serve ever-larger training datasets increases the throughput demands. Distributed training adds another twist because, historically, different technologies – with different throughputs and latencies – have been used for moving data within a node and between nodes. The maximum number of accelerators a single node can support may not be limited by the node’s own hardware but instead by the ability to move enough data quickly to that node in a distributed environment (up to 2.7 GiB/s per emulated accelerator). Storage system architects now have few design tradeoffs available to them: the systems must be both high-throughput and low-latency to keep a large-scale AI training system running at peak load.
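The balance described above can be put into back-of-the-envelope terms. The sketch below uses the 2.7 GiB/s per-accelerator figure quoted in the text as an upper bound; the host count, accelerators per host, and the 25 GiB/s per-host ingest rate are hypothetical inputs, and the function names are illustrative only.

```python
# Back-of-the-envelope sizing sketch. Only the 2.7 GiB/s figure comes from the
# text; every other number here is a hypothetical input.

PER_ACCELERATOR_GIB_S = 2.7  # upper bound per emulated accelerator, per the text


def aggregate_demand_gib_s(hosts, accelerators_per_host, per_accelerator_gib_s):
    """Total read throughput the storage system must sustain."""
    return hosts * accelerators_per_host * per_accelerator_gib_s


def max_accelerators_per_host(host_ingest_gib_s, per_accelerator_gib_s):
    """Accelerators one host can feed if its data ingest rate is the limit."""
    return int(host_ingest_gib_s // per_accelerator_gib_s)


if __name__ == "__main__":
    demand = aggregate_demand_gib_s(4, 8, PER_ACCELERATOR_GIB_S)
    print(f"aggregate demand: {demand:.1f} GiB/s")
    print("accelerators per host at 25 GiB/s ingest:",
          max_accelerators_per_host(25.0, PER_ACCELERATOR_GIB_S))
```

In this simplified view, a host's accelerator count is capped by how fast data can reach it, not by how many accelerators it can physically hold.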

“As we anticipated, the new, faster accelerator hardware significantly raised the bar for storage, making it clear that storage access performance has become a gating factor for overall training speed,” said Curtis Anderson, MLPerf Storage working group co-chair. “To prevent expensive accelerators from sitting idle, system architects are moving to the fastest storage they can procure – and storage providers are innovating in response.”

MLPerf Storage v1.0

The MLPerf Storage benchmark was created through a collaborative engineering process across more than a dozen leading storage solution providers and academic research groups. The open-source and peer-reviewed benchmark suite offers a level playing field for competition that drives innovation, performance, and energy efficiency for the entire industry. It also provides critical technical information for customers who are procuring and tuning AI training systems.

The v1.0 benchmark results, from a broad set of technology providers, demonstrate the industry’s recognition of the importance of high-performance storage solutions. MLPerf Storage v1.0 includes over 100 performance results from 13 submitting organizations: DDN, Hammerspace, Hewlett Packard Enterprise, Huawei, IEIT SYSTEMS, Juicedata, Lightbits Labs, MangoBoost, Nutanix, Simplyblock, Volumez, WEKA, and YanRong Tech.

“We’re excited to see so many storage providers, both large and small, participate in the first-of-its-kind v1.0 Storage benchmark,” said David Kanter, Head of MLPerf at MLCommons. “It shows both that the industry is recognizing the need to keep innovating in storage technologies to keep pace with the rest of the AI technology stack, and also that the ability to measure the performance of those technologies is critical to the successful deployment of ML training systems. As a trusted provider of open, fair, and transparent benchmarks, MLCommons ensures that technology providers know the performance target they need to meet, and consumers can procure and tune ML systems to maximize their utilization – and ultimately their return on investment.”

We invite stakeholders to join the MLPerf Storage working group and help us continue to evolve the benchmark. Future work includes improving and increasing accelerator emulations and AI training scenarios.

View the Results

To view the results for MLPerf Storage v1.0, please visit the Storage benchmark results page.

About MLCommons

MLCommons is the world leader in building benchmarks for AI. It is an open engineering consortium with a mission to make AI better for everyone through benchmarks and data. The foundation for MLCommons began with the MLPerf benchmarks in 2018, which rapidly scaled as a set of industry metrics to measure machine learning performance and promote transparency of machine learning techniques. In collaboration with its 125+ members, global technology providers, academics, and researchers, MLCommons is focused on collaborative engineering work that builds tools for the entire AI industry through benchmarks and metrics, public datasets, and measurements for AI Safety.

For additional information on MLCommons and details on becoming a member, please visit MLCommons.org or contact participation@mlcommons.org.