NVIDIA H100 GPU Performance Shatters Machine Learning Benchmarks For Model Training

2022-11-23

NVIDIA’s Hopper H100 Tensor Core GPU made its first benchmarking appearance earlier this year in MLPerf Inference 2.1. No one was surprised that the H100 and its predecessor, the A100, dominated every inference workload. The H100 set world records in all of them, and NVIDIA remains the only company to have submitted results for every workload in every MLPerf round.

A few weeks ago, a new set of MLCommons training results was released, this time for MLPerf Training 2.1, which the NVIDIA H100 and A100 also dominated.

Unfortunately, NVIDIA’s dominance of the MLPerf inference and training benchmark suites appears to have discouraged submissions from many important AI companies.

The industry would benefit from the participation of more organizations; as we have seen in other sectors such as CPUs, broader competition drives innovation. Wide involvement in benchmarking suites matters because machine learning is growing exponentially: almost every industry segment uses it for a wide range of applications, and as usage increases, so does model size. Since 2018, MLCommons has held rounds that alternate between MLPerf Training and MLPerf Inference testing.

In the four years between the first MLPerf test in 2018 and this year’s results, machine learning model size has increased by five orders of magnitude. With the increased model size and larger data sets, standardized tools like MLPerf Training and MLPerf Inference are more crucial than ever. Machine learning model performance must be measured before it can be improved.

Summary of benchmarks used in MLPerf Training v2.1

MLPerf Training and MLPerf Inference use the same eight workloads shown in the above graphic, with one exception: Mini Go appears only in training, where it evaluates reinforcement learning. Each benchmark is defined by its own dataset and quality target, and the key metric is how long it takes to train the model on that dataset to the specified quality target, as the sketch below illustrates.
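
In other words, time to train can be pictured as a simple measured loop: keep training until the model reaches the benchmark’s quality target, then report the elapsed wall-clock time. Here is a minimal Python sketch of that idea; the train_one_epoch and evaluate callables are hypothetical placeholders, not MLPerf reference code.

```python
import time

def time_to_train(model, train_one_epoch, evaluate, quality_target, max_epochs=100):
    """Train until the quality target is reached and report elapsed wall-clock time.

    train_one_epoch and evaluate stand in for a real training loop and validation
    pass; quality_target mirrors the per-benchmark target (e.g., a top-1 accuracy
    for ResNet-50 or an accuracy score for BERT).
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        quality = evaluate(model)
        if quality >= quality_target:
            return epoch, time.perf_counter() - start   # converged
    return None, time.perf_counter() - start            # did not converge in time
```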

MLPerf is vital to AI and machine learning because it is an industry-standard benchmark with peer-reviewed results that provide valid comparisons for model training and inference. It is supported by Amazon, Arm, Baidu, Google, Harvard University, Intel, Meta, Microsoft, Stanford University, and the University of Toronto.

Real-world AI applications use multiple models

It is common for multiple AI models to be chained together to satisfy a single input. The verbal request in the above graphic is an example: answering that one question requires ten machine learning models. Not only must those models run sequentially, they must also deliver results in real time; a rough sketch of the pattern follows.
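
The pattern itself is straightforward: each model’s output feeds the next, and the end-to-end latency budget has to cover every stage. The sketch below uses hypothetical placeholder functions for the stages; a real assistant would run trained models for speech recognition, language understanding, retrieval, and speech synthesis.

```python
from typing import Any, Callable, List

# Hypothetical stages standing in for trained models (ASR, NLU, retrieval, TTS).
def speech_to_text(audio: bytes) -> str:
    return "what is the weather tomorrow"              # placeholder ASR output

def understand_query(text: str) -> dict:
    return {"intent": "weather", "when": "tomorrow"}   # placeholder NLU output

def fetch_answer(intent: dict) -> str:
    return "Sunny with a high of 20 C."                # placeholder lookup result

def text_to_speech(answer: str) -> bytes:
    return answer.encode()                             # placeholder TTS output

PIPELINE: List[Callable[[Any], Any]] = [
    speech_to_text, understand_query, fetch_answer, text_to_speech,
]

def answer_request(audio: bytes) -> bytes:
    """Run the chained models in order; total latency is the sum of every stage."""
    result: Any = audio
    for stage in PIPELINE:
        result = stage(result)
    return result

print(answer_request(b"<raw audio>"))
```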

Some cloud services also use multiple networks to deliver services accelerated by NVIDIA GPUs. All of NVIDIA's networks and application frameworks are available on its MLPerf repo, on NGC (NVIDIA’s online container repository), and its GitHub repo.

A100 and H100 benchmark training performance

As shown in the MLPerf Training 2.1 performance chart, the H100 delivered up to 6.7x more performance on the BERT benchmark than the A100 did in its first MLPerf submission in 2019.

The A100 is still producing record results, with performance gains of up to 2.5x since its first submission. Those gains are the result of software optimization, and the A100 will likely remain an NVIDIA offering for quite some time.

The H100’s superior performance on the BERT NLP model is attributed to its Transformer Engine, which the A100 lacks. The new engine, combined with NVIDIA Hopper FP8 Tensor Cores, delivers up to 9x faster AI training and 30x faster AI inference on large language models than the A100. The H100 is based on the Hopper architecture and uses fourth-generation Tensor Cores.

Training speed is crucial because of AI model size. NVIDIA’s Transformer Engine achieves additional speed by combining 16-bit floating-point precision with a new 8-bit floating-point data format. The combination doubles Tensor Core throughput and halves memory requirements compared to 16-bit floating point alone. A sketch of how this looks in code follows.
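
FP8 training is exposed through NVIDIA’s open-source Transformer Engine library. The sketch below shows roughly how a single FP8-enabled layer might be exercised, assuming the transformer_engine Python package and an H100-class GPU; the API names reflect my reading of the library and may differ between versions, so treat this as illustrative rather than as the method used in NVIDIA’s MLPerf submissions.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 recipe: hybrid format (E4M3 forward, E5M2 backward) with delayed scaling.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

layer = te.Linear(1024, 1024, bias=True).cuda()   # drop-in replacement for nn.Linear
x = torch.randn(16, 1024, device="cuda", requires_grad=True)

# Inside fp8_autocast, supported layers run their matrix math on FP8 Tensor Cores.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.float().sum().backward()   # gradients flow back in higher precision
```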

Those improvements, plus advanced Hopper software algorithms, speed up AI performance and capability, allowing the H100 to train models in days or hours instead of months. The faster a model can move into operation, the earlier its ROI can begin contributing to the bottom line.

The Hopper architecture can dynamically determine whether FP8 or 16-bit calculations are needed for accuracy. As the Transformer Engine trains layer by layer, it analyzes the data to decide whether reduced precision can be used. Applied too aggressively, reduced precision can introduce rounding errors that hurt model accuracy.

MLPerf training tests measure time to solution, so a model not only has to run fast, it also has to converge. It is therefore essential to remember that accumulated numerical errors can prevent a model from converging.

NVIDIA’s Transformer Engine technology was designed for large transformer-based networks like BERT. However, it is not restricted to NLP; it can be applied to other areas, such as Stable Diffusion.

Stable Diffusion is a compute-intensive, deep learning text-to-image model released this year. It generates detailed images conditioned on text descriptions and can also be applied to tasks such as inpainting, outpainting, and text-guided image-to-image translation. A minimal usage sketch follows.
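
For a rough sense of the workload, here is how Stable Diffusion is commonly invoked through Hugging Face’s diffusers library; the package, pipeline class, and model ID are assumptions about one popular public distribution rather than anything drawn from the MLPerf suite or this article.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a publicly released Stable Diffusion checkpoint in half precision on the GPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# A single text prompt conditions the iterative denoising that produces the image.
prompt = "a watercolor painting of a lighthouse at sunrise"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("lighthouse.png")
```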

Time to train at scale

The NVIDIA A100 was the only platform to run all workloads in the time-to-train-at-scale measurements. NVIDIA trained every workload at scale in under five minutes except Mini Go, which took about 17 minutes.

Mini Go uses reinforcement learning, which is very compute-intensive. The network takes longer to train because it must play Mini Go turn by turn, then roll the results back through the network after each turn.

Training at scale demonstrates that the A100 remains a solid platform for training. The H100 is the solution for the most advanced models, such as language models with massive datasets and billions of parameters.

While Intel and Habana didn't turn in record-setting performances, their participation was nevertheless important for the ecosystem and the future of MLPerf.

H100 Sets New Per-Accelerator Records for AI Training

This graphic shows relative per-accelerator speedup normalized to the A100. The H100 (submitted in preview) was entered for every benchmark and scored superior performance on each, running up to 2.6x faster than the A100 even though the A100 has made significant software gains. The normalization itself is a simple ratio, as the toy calculation below shows.
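
Normalized per-accelerator speedup is just the reference chip’s time to train divided by the new chip’s time to train on the same benchmark. The numbers below are made up for illustration and are not actual MLPerf results.

```python
# Hypothetical time-to-train results in minutes; not actual MLPerf submissions.
a100_minutes = {"BERT": 17.2, "ResNet-50": 28.0}
h100_minutes = {"BERT": 6.4, "ResNet-50": 11.5}

for benchmark, baseline in a100_minutes.items():
    speedup = baseline / h100_minutes[benchmark]   # >1.0 means faster than the A100
    print(f"{benchmark}: {speedup:.2f}x relative to the A100")
```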

Habana submitted Gaudi2 results for ResNet-50 and BERT, and Intel submitted Sapphire Rapids results for DLRM, ResNet-50, and BERT.

Habana Gaudi2 performed marginally better than the A100 on BERT and delivered about 0.75x the A100’s performance on ResNet-50. Intel acquired Habana in late 2019 for $2 billion. Gaudi2 is Habana’s second-generation deep learning processor, with 24 Tensor Processor Cores and 96 GB of memory.

Dave Salvator, Director of AI, Benchmarking and Cloud for NVIDIA, is expecting higher performance from the H100 in the future.

“The H100 turned in a very compelling performance,” he said. “But in the future, we will make software gains with the H100 as we did with the A100. This is the first round we’re submitting H100 for training, and it won’t be the last.”

Benchmarking information for MLPerf HPC 2.0

MLPerf HPC 2.0 measures the time to train supercomputer models for scientific applications. Additionally, there is an optional throughput measurement for multi-user supercomputing systems. This round was the third iteration of MLPerf HPC. Like MLPerf for training and inference, MLPerf HPC is considered an industry-standard system performance measure for workloads performed on supercomputers.

For this round, five organizations running some of the world's largest supercomputers submitted 20 results: Dell (a first-time submitter), Fujitsu/RIKEN, Helmholtz AI, NVIDIA, and the Texas Advanced Computing Center (TACC).

This is version 2.0 of the benchmarks; however, there have been no major changes since the same three workloads were run in version 1.0. The MLPerf HPC benchmarks measure training time and throughput for three high-performance simulations that have adopted machine learning techniques: CosmoFlow, DeepCAM, and OpenCatalyst.

Because of climate change, a great deal of concentrated work is being done on weather and climate modeling. NVIDIA is also working on a digital twin of the planet called Earth Two. This giant climate model simulates the entire world.

NVIDIA HPC Platform Performance Leadership

MLPerf HPC 2.0 has two performance metrics: the time to train a single model, and an optional throughput measure for training multiple models concurrently on a multi-user supercomputing system.

Although the NVIDIA A100 Tensor Core GPU and the NVIDIA DGX A100 SuperPOD are almost three years old, MLPerf HPC 2.0 performance shows that the A100 is still the highest-performing system for training HPC use cases.

The HPC results are for NVIDIA Selene, an implementation of the DGX SuperPOD, and demonstrate the A100’s potential. Other supercomputing sites using NVIDIA technology are also delivering good performance.

It is important to mention that NVIDIA was the only organization to run all AI training workloads for this and all previous MLPerf Training and inference rounds. It has delivered consistent leadership results from the first MLPerf Training 0.5 in December 2018 to the latest MLPerf Training 2.1 that was released a few weeks ago.

For training, inference, and HPC, MLPerf has shown that NVIDIA has the broadest ecosystem support across deep learning frameworks. It is advantageous for customers that NVIDIA GPUs are available from all major cloud providers and all major system vendors for on-prem deployments, and NVIDIA’s application frameworks allow customers to deploy solutions rapidly.

NVIDIA has an end-to-end open platform with software that helps unlock the full potential of its hardware. NVIDIA’s full-stack solution includes application frameworks such as Merlin and NeMo. With the NeMo Megatron service, customers can adapt large language models to their own custom datasets.

Moor Insights & Strategy, like all research and tech industry analyst firms, provides or has provided paid services to technology companies. These services include research, analysis, advising, consulting, benchmarking, acquisition matchmaking, and speaking sponsorships. The company has had or currently has paid business relationships with 8×8, Accenture, A10 Networks, Advanced Micro Devices, Amazon, Amazon Web Services, Ambient Scientific, Anuta Networks, Applied Brain Research, Applied Micro, Apstra, Arm, Aruba Networks (now HPE), Atom Computing, AT&T, Aura, Automation Anywhere, AWS, A-10 Strategies, Bitfusion, Blaize, Box, Broadcom, C3.AI, Calix, Campfire, Cisco Systems, Clear Software, Cloudera, Clumio, Cognitive Systems, CompuCom, Cradlepoint, CyberArk, Dell, Dell EMC, Dell Technologies, Diablo Technologies, Dialogue Group, Digital Optics, Dreamium Labs, D-Wave, Echelon, Ericsson, Extreme Networks, Five9, Flex, Foundries.io, Foxconn, Frame (now VMware), Fujitsu, Gen Z Consortium, Glue Networks, GlobalFoundries, Revolve (now Google), Google Cloud, Graphcore, Groq, Hiregenics, Hotwire Global, HP Inc., Hewlett Packard Enterprise, Honeywell, Huawei Technologies, IBM, Infinidat, Infosys, Inseego, IonQ, IonVR, Inseego, Infosys, Infiot, Intel, Interdigital, Jabil Circuit, Keysight, Konica Minolta, Lattice Semiconductor, Lenovo, Linux Foundation, Lightbits Labs, LogicMonitor, Luminar, MapBox, Marvell Technology, Mavenir, Marseille Inc, Mayfair Equity, Meraki (Cisco), Merck KGaA, Mesophere, Micron Technology, Microsoft, MiTEL, Mojo Networks, MongoDB, MulteFire Alliance, National Instruments, Neat, NetApp, Nightwatch, NOKIA (Alcatel-Lucent), Nortek, Novumind, NVIDIA, Nutanix, Nuvia (now Qualcomm), onsemi, ONUG, OpenStack Foundation, Oracle, Palo Alto Networks, Panasas, Peraso, Pexip, Pixelworks, Plume Design, PlusAI, Poly (formerly Plantronics), Portworx, Pure Storage, Qualcomm, Quantinuum, Rackspace, Rambus, Rayvolt E-Bikes, Red Hat, Renesas, Residio, Samsung Electronics, Samsung Semi, SAP, SAS, Scale Computing, Schneider Electric, SiFive, Silver Peak (now Aruba-HPE), SkyWorks, SONY Optical Storage, Splunk, Springpath (now Cisco), Spirent, Splunk, Sprint (now T-Mobile), Stratus Technologies, Symantec, Synaptics, Syniverse, Synopsys, Tanium, Telesign,TE Connectivity, TensTorrent, Tobii Technology, Teradata,T-Mobile, Treasure Data, Twitter, Unity Technologies, UiPath, Verizon Communications, VAST Data, Ventana Micro Systems, Vidyo, VMware, Wave Computing, Wellsmith, Xilinx, Zayo, Zebra, Zededa, Zendesk, Zoho, Zoom, and Zscaler. Moor Insights & Strategy founder, CEO, and Chief Analyst Patrick Moorhead is an investor in dMY Technology Group Inc. VI, Dreamium Labs, Groq, Luminar Technologies, MemryX, and Movandi.

Note: Moor Insights & Strategy writers and editors may have contributed to this article.