Nvidia and Intel show machine learning performance improvements in latest MLPerf Training 2.1 results

MLCommons is out today with its latest set of MLPerf machine learning (ML) benchmarks, once again showing how hardware and software for artificial intelligence (AI) are getting faster.

MLCommons is a vendor-independent organization that aims to provide standardized tests and benchmarks to help assess the state of ML software and hardware. Under the test name of MLPerf, MLCommons collects different ML benchmarks several times during the year. In September, MLPerf Inference results were published, showing gains in how different technologies have improved inference performance.

New MLPerf benchmarks being reported today include the Training 2.1 benchmark, which is for ML training; HPC 2.0 for large systems, including supercomputers; and Tiny 1.0 for small, embedded deployments.

“The key reason we’re benchmarking is to drive transparency and measure performance,” said David Kanter, CEO of MLCommons, during a news conference. “All of this is based on the key notion that once you can measure something, you can start thinking about how to improve it.”

How the MLPerf training benchmark works

Looking at the training benchmark in particular, Kanter said that MLPerf isn’t just about hardware, it’s also about software.

In ML systems, models must first be trained on data in order to operate. The training process benefits from accelerating hardware as well as optimized software.

Kanter explained that the MLPerf Training benchmark starts with a default dataset and model. Organizations then train the model to meet a target quality threshold. Among the top metrics captured by the MLPerf Training benchmark is time to train.
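The time-to-train metric described above can be illustrated with a minimal sketch: run training steps until a quality threshold is met, timing the whole loop. This is not MLCommons code; the function names, the toy one-parameter "model," and the 0.999 threshold are all illustrative assumptions (the real benchmark fixes the dataset, model, and quality target per task).

```python
import time

def time_to_train(train_step, evaluate, target_quality, max_steps=10_000):
    """Measure an MLPerf-style "time to train": wall-clock seconds until
    the model reaches a target quality threshold. Illustrative sketch only."""
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        train_step()
        if evaluate() >= target_quality:
            return time.perf_counter() - start, step
    raise RuntimeError("target quality not reached within max_steps")

# Toy "model": a single weight w pulled toward 1.0 by gradient descent,
# with quality defined as 1 minus the squared error. Hypothetical example.
w = 0.0

def train_step():
    global w
    w -= 0.1 * 2 * (w - 1.0)  # gradient step on the loss (w - 1)^2

def evaluate():
    return 1.0 - (w - 1.0) ** 2

elapsed, steps = time_to_train(train_step, evaluate, target_quality=0.999)
print(f"reached target quality in {steps} steps, {elapsed:.4f}s")
```

Because the clock only stops when the quality bar is cleared, any speedup counts, whether it comes from faster silicon or better software, which is exactly the point Kanter makes.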

“When you look at the results, and this applies to any submission, whether it’s training, tiny, HPC, or inference, all the results are submitted to say something,” Kanter said. “Part of this exercise is figuring out what that something is that they say.”

Metrics can identify relative levels of performance and also serve to highlight improvement over time for both hardware and software.

John Tran, senior director of deep learning libraries and hardware architecture at Nvidia and chair of the MLPerf Training working group at MLCommons, highlighted the fact that there were a number of software-only submissions for the latest benchmark.

“I find it continually interesting how we have so many software-only submissions and they don’t necessarily need help from hardware vendors,” Tran said. “I think it’s great and shows the benchmark’s maturity and usefulness to people.”

Intel and Habana Labs promote training with Gaudi2

The importance of software was also highlighted by Jordan Plawner, senior director of AI products at Intel. During the MLCommons press call, Plawner explained what he sees as the difference between ML inference and training workloads in terms of hardware and software.

“Training is a distributed workload issue,” Plawner said. “Training is more than just hardware, more than just silicon; it is the software, it is also the network and the execution of distributed-class workloads.”

In contrast, Plawner said that ML inference can be a single-node problem that doesn’t have the same distributed aspects, providing a lower barrier to entry for vendor technologies than ML training.

In terms of results, Intel is well represented in the latest MLPerf Training benchmarks with its Gaudi2 technology. Intel acquired Habana Labs and its Gaudi technology for $2 billion in 2019 and has since invested in advancing its capabilities.

Habana Labs’ most advanced silicon is now the Gaudi2 system, which was announced in May. The latest Gaudi2 results show gains over the first set of benchmarks that Habana Labs reported with the MLPerf Training update in June. According to Intel, Gaudi2 improved TensorFlow training time by 10% for the BERT and ResNet-50 models.

Nvidia H100 outperforms its predecessor

Nvidia also reports strong gains for its technologies in the latest MLPerf training benchmarks.

Test results for Nvidia’s Hopper-based H100 with MLPerf Training show significant gains over the previous-generation A100-based hardware. In an Nvidia briefing call on the MLCommons results, Dave Salvator, Nvidia’s director of AI, benchmarking and cloud, said the H100 delivers 6.7 times the performance that the A100 posted in its first submission of the same benchmarks several years ago. Salvator said that a key part of what makes the H100 perform so well is the Transformer Engine integrated into the Nvidia Hopper chip architecture.

While the H100 is now Nvidia’s leading hardware for ML training, the A100 has also continued to improve its MLPerf Training results.

“The A100 continues to be a really compelling product for training and over the last few years we have been able to scale its performance by more than two times just with software optimizations,” said Salvator.

Overall, whether it’s with new hardware or ongoing software optimizations, Salvator expects there to be a steady stream of performance improvements for ML training in the months and years to come.

“AI’s appetite for performance is limitless and we continue to need more and more performance to be able to work with growing data sets in a reasonable amount of time,” Salvator said.

The need to be able to train a model faster is critical for several reasons, including the fact that training is an iterative process. Data scientists often need to train and then retrain models to get the desired results.

“That ability to train faster makes a difference in not only being able to work with larger networks, but also being able to employ them faster and have them work for you in creating value,” Salvator said.

The VentureBeat Mission is to be a digital public square for technical decision makers to learn about transformative business technology and transact.