Microsoft and Nvidia partner to build an AI supercomputer in the cloud


A supercomputer, providing massive amounts of computing power to tackle complex challenges, is typically out of reach for the average enterprise data scientist. However, what if you could use cloud resources instead? That’s the logic Microsoft Azure and Nvidia are taking with this week’s announcement designed to coincide with the SC22 supercomputing conference.

Nvidia and Microsoft announced that they are building a “massive AI cloud computer.” The supercomputer in question, however, is not an individually named system like Frontier at Oak Ridge National Laboratory, or Perlmutter, which has been billed as the world’s fastest artificial intelligence (AI) supercomputer. Rather, the new AI supercomputer is a set of capabilities and services within Azure, powered by Nvidia, for high-performance computing (HPC) uses.

“There is widespread adoption of AI in enterprises across a wide range of use cases, so addressing this demand requires really powerful AI cloud computing instances,” said Paresh Kharya, senior director of accelerated computing at Nvidia, to VentureBeat. “Our collaboration with Microsoft allows us to provide a very compelling solution for companies looking to create and deploy AI at scale to transform their businesses.”

The hardware that goes into the Microsoft Azure AI supercomputer

Microsoft is no stranger to Nvidia’s AI acceleration technology, which is already used by large organizations.

In fact, Kharya noted that Microsoft’s Bing uses Nvidia-powered instances to help speed up search, while Microsoft Teams uses Nvidia GPUs to help convert speech to text.

Nidhi Chappell, partner/general manager of specialized compute at Microsoft, explained to VentureBeat that Azure’s AI-optimized virtual machine (VM) offerings, like the current-generation NDm A100 v4 series, start with a single VM and eight Nvidia Ampere A100 Tensor Core GPUs.

“But just like the human brain is made up of interconnected neurons, our NDm A100 v4-based clusters can scale up to thousands of GPUs with an unprecedented 1.6 Tb/s interconnect bandwidth per virtual machine,” Chappell said. “Tens, hundreds or thousands of GPUs can work together as part of an InfiniBand cluster to achieve any level of AI ambition.”

What’s new is that Nvidia and Microsoft are doubling down on their partnership, with even more powerful AI capabilities.


Kharya said that as part of the renewed collaboration, Microsoft will add the new Nvidia H100 GPUs to Azure. In addition, Azure will upgrade to Nvidia’s next-generation Quantum-2 InfiniBand, which doubles available bandwidth to 400 gigabits per second (Gb/s). (The current generation of Azure instances is based on 200 Gb/s Quantum InfiniBand technology.)
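The stated figures are consistent with simple arithmetic: the 1.6 Tb/s per-VM number corresponds to eight 200 Gb/s links per virtual machine, one per GPU. The sketch below works through that math under that assumed topology (the per-link-per-GPU layout is an illustration, not a confirmed detail of the announcement):

```python
# Back-of-the-envelope check of the quoted bandwidth figures.
# Assumption (for illustration only): one InfiniBand link per GPU,
# eight GPUs per VM, as in the NDm A100 v4 series described above.
GPUS_PER_VM = 8
QUANTUM_1_GBPS = 200   # current Quantum InfiniBand, per link
QUANTUM_2_GBPS = 400   # Quantum-2 doubles per-link bandwidth

def per_vm_tbps(link_gbps: int, links: int = GPUS_PER_VM) -> float:
    """Aggregate interconnect bandwidth per VM, in Tb/s."""
    return link_gbps * links / 1000

print(per_vm_tbps(QUANTUM_1_GBPS))  # 1.6 — matches the NDm A100 v4 figure
print(per_vm_tbps(QUANTUM_2_GBPS))  # 3.2 — the same topology on Quantum-2
```

On those assumptions, a Quantum-2-based instance would double per-VM aggregate bandwidth to 3.2 Tb/s.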

Microsoft DeepSpeed gets a Hopper boost

The Microsoft-Nvidia partnership isn’t just about hardware. It also has a very strong software component.

The two vendors have already worked together using Microsoft’s DeepSpeed deep learning optimization software to help train the Nvidia Megatron-Turing Natural Language Generation (MT-NLG) large language model.

Chappell said that as part of the renewed collaboration, the companies will optimize Microsoft’s DeepSpeed with the Nvidia H100 to accelerate transformer-based models used for large language models, generative AI and writing computer code, among other applications.

“This technology applies 8-bit floating-point precision capabilities to DeepSpeed to dramatically speed up AI computations for transformers, to twice the performance of 16-bit operations,” Chappell said.
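The intuition behind that 2x figure is that an 8-bit float occupies half the space of a 16-bit float, so at a fixed memory or compute budget twice as many values can be moved and processed per unit time. The sketch below illustrates that storage-side arithmetic; it is a simplified illustration, not Nvidia's or DeepSpeed's implementation, which also depends on the H100's dedicated FP8 tensor-core hardware:

```python
# Illustrative only: why halving bytes per element can double throughput
# at a fixed memory budget. Real FP8 gains also trade off numeric range
# and accuracy, which dedicated Hopper hardware is designed to manage.
FP16_BYTES = 2
FP8_BYTES = 1

def elements_per_gb(bytes_per_element: int) -> int:
    """How many tensor elements fit in 1 GB at the given precision."""
    return (1024**3) // bytes_per_element

speedup = elements_per_gb(FP8_BYTES) / elements_per_gb(FP16_BYTES)
print(speedup)  # 2.0 — FP8 packs twice the elements of FP16
```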

AI cloud supercomputer for generative AI research

Nvidia will now also use Azure to help with its own research on generative AI capabilities.

Kharya noted that a number of generative AI models for creating interesting content have emerged recently, such as Stable Diffusion. He said that Nvidia is working on its own approach, called eDiff-I, for generating images from text prompts.

“AI research requires large-scale computing: you need to be able to use thousands of GPUs that are connected by the highest bandwidth, low-latency networks, and have a very high-performance software stack that makes all of this infrastructure work,” Kharya said. “So this partnership expands our ability to train and provide computing resources to our research [and] software development teams to create generative AI models, as well as offer services to our clients.”
