The main benchmark for comparing computer systems on machine-learning training has now moved into the era of generative AI. MLPerf, which runs the benchmark, recently added a challenge for training large language models such as GPT-3.
Now it has added a test built around the text-to-image generator Stable Diffusion. Computers powered by Intel and Nvidia chips took on the new benchmark, while the two companies continued their rivalry on GPT-3 training, this time joined by Google.
Each of the three participants dedicated massive computing systems to the task, with Nvidia's 10,000-GPU supercomputer the largest ever tested. Such scale is crucial in generative AI: even Nvidia's biggest system would have needed eight days to complete a full large-language-model training job.
In total, 19 companies and institutions submitted over 200 results, indicating a 2.8-fold performance improvement over the last five months and a remarkable 49-fold improvement since MLPerf started five years ago.
Nvidia, Microsoft Test 10,752-GPU Monsters
Nvidia continued to excel in the MLPerf benchmarks, showcasing systems featuring its powerful H100 GPUs. The highlight came from Eos, Nvidia’s new AI supercomputer boasting an impressive 10,752 GPUs.
When tasked with the GPT-3 training benchmark, Eos completed the job in just under 4 minutes. Microsoft’s Azure, a cloud computing service, tested a system of the same size and trailed Eos by only a few seconds. (Azure powers GitHub’s coding assistant Copilot and OpenAI’s ChatGPT.)
Eos’s GPUs collectively deliver a staggering 42.6 billion billion floating-point operations per second (42.6 exaflops). They are interconnected by Nvidia’s Quantum-2 InfiniBand, which moves 1.1 million billion bytes (1.1 petabytes) per second.
Dave Salvatore, Nvidia’s director of AI benchmarking and cloud computing, expressed awe, stating, “Some of these speeds and feeds are mind-blowing. This is an incredibly capable machine.”
Eos tripled the number of H100 GPUs bound into a single machine. That threefold increase yielded a 2.8-fold performance improvement, a scaling efficiency of 93 percent. Efficient scaling is crucial for the continued growth of generative AI, which has been expanding 10-fold every year.
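The 93 percent figure follows directly from those two ratios. A quick sketch of the arithmetic (the 3x and 2.8x factors come from the article; the helper function is just illustrative):

```python
def scaling_efficiency(speedup, scale_factor):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / scale_factor

# Eos grew roughly 3x in GPU count and ran the benchmark 2.8x faster.
eff = scaling_efficiency(2.8, 3.0)
print(f"{eff:.0%}")  # -> 93%
```

Perfect (linear) scaling would be 100 percent; anything lost goes to communication and synchronization overhead between GPUs.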
Eos tackled the GPT-3 benchmark, but it’s important to note that it’s not a complete training of GPT-3. MLPerf designed it to be accessible to many companies. Instead of full training, it involves reaching a specific checkpoint to demonstrate that the training would achieve the required accuracy given enough time.
Training, however, takes time. Extrapolating from Eos’s 4-minute benchmark run suggests that full training would take 8 days, even on what may be the most powerful AI supercomputer yet built. A more reasonably sized computer, equipped with 512 H100s, would take about 4 months.
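Those figures imply the checkpointed benchmark covers only a tiny slice of a full training run. A back-of-envelope version (numbers from the article; the extrapolation to 512 GPUs assumes naive linear scaling, which the real systems do not follow exactly):

```python
benchmark_minutes = 4                 # Eos's GPT-3 benchmark time
full_run_days = 8                     # extrapolated full-training time
full_run_minutes = full_run_days * 24 * 60

# The checkpointed benchmark is roughly this fraction of a full run.
fraction = benchmark_minutes / full_run_minutes
print(f"{fraction:.4%}")              # well under 0.1% of the full job

# Naive linear scaling down to a 512-GPU machine. Real efficiency
# shifts at smaller scale, which is why the article's 4-month
# figure differs from this estimate.
days_on_512 = full_run_days * (10_752 / 512)
print(round(days_on_512))             # -> 168 days under linear scaling
```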
Intel Continues to Close In
Intel provided results for systems utilizing the Gaudi 2 accelerator chip and those without any accelerator, relying solely on its 4th generation Xeon CPU. A significant change from the previous training benchmarks was Intel activating Gaudi 2’s 8-bit floating-point (FP8) capabilities.
The adoption of lower precision numbers, like FP8, has been a primary driver of the improvement in GPU performance over the last decade. The utilization of FP8 in parts of GPT-3 and other transformer neural networks, where low precision doesn’t compromise accuracy, has already proven valuable in Nvidia’s H100 results. Now, Gaudi 2 is experiencing a similar performance boost.
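Why fewer bits trade accuracy for speed is easiest to see through rounding error. The toy quantizer below keeps only `mbits` of mantissa, mimicking the precision loss of a format like FP8 (the e4m3 variant keeps 3 mantissa bits). This is an illustrative sketch only, not Gaudi 2’s or the H100’s actual FP8 pipeline, and it ignores exponent-range limits:

```python
import math

def quantize(x, mbits):
    """Round x to mbits of mantissa (exponent range ignored)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= |m| < 1
    steps = 2 ** (mbits + 1)      # grid resolution over [0.5, 1)
    return round(m * steps) / steps * 2 ** e

fp8_like = quantize(math.pi, 3)     # ~FP8 e4m3: 3 mantissa bits
fp16_like = quantize(math.pi, 10)   # FP16: 10 mantissa bits
print(fp8_like, fp16_like)          # 3.25 vs 3.140625
```

Matrix multiplies on such low-precision operands run faster and move half the bytes of FP16, which is why enabling FP8 boosted both the H100 and Gaudi 2 results wherever the coarser rounding doesn’t hurt model accuracy.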
“We anticipated a 90 percent improvement from activating FP8,” says Eitan Medina, chief operating officer at Intel’s Habana Labs. “We not only met but exceeded expectations, achieving a 103 percent reduction in time-to-train for a 384-accelerator cluster.”
Intel’s CPU Systems and Future Prospects with Gaudi 3
This latest outcome positions the Gaudi 2 system at just under one-third the speed of an Nvidia system on a per-chip basis, and three times faster than Google’s TPUv5e. In the new image generation benchmark, Gaudi 2 also demonstrated approximately half the speed of the H100. While FP8 was only enabled for the GPT-3 benchmark in this round, Medina mentions that his team is actively working on implementing it for other benchmarks.
Medina again argued that Gaudi 2 offers a notably lower cost than the H100, giving it an advantage in the combined metric of price and performance. He expects that advantage to widen with Intel’s next-generation accelerator chip, Gaudi 3. Slated for volume production in 2024, Gaudi 3 will be built with the same semiconductor manufacturing process as the Nvidia H100.
In a separate submission, Intel provided results for systems relying solely on CPUs. Again, the results showed training times ranging from minutes to hours across several benchmarks.
Beyond the MLPerf benchmarks, Intel also presented data indicating that a 4-node Xeon system, incorporating the AMX matrix engine in its chips, can fine-tune the image generator Stable Diffusion in less than five minutes. Fine-tuning adapts an already-trained neural network to specialize it for a particular task.
Nvidia’s chip-design AI, for instance, is a fine-tuning of an existing large language model called NeMo.
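Fine-tuning in miniature: the hypothetical sketch below (not Intel’s or Nvidia’s actual recipe) freezes a “pretrained” weight and updates only a small task-specific head on new data, which is why it takes minutes rather than months:

```python
# Toy fine-tuning: freeze the pretrained weight, train only the head.
W_PRETRAINED = 2.0      # frozen "backbone" weight (hypothetical)
head = 0.0              # small trainable task-specific weight

# New-task data follows y = 3 * x, so the ideal head value is 1.5,
# since the frozen backbone already multiplies inputs by 2.
data = [(x, 3.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]

lr = 0.05
for _ in range(200):
    for x, y in data:
        feature = W_PRETRAINED * x        # frozen feature extraction
        error = head * feature - y
        head -= lr * error * feature      # gradient step on head only

print(round(head, 3))   # converges to 1.5
```

Only one parameter ever changes; the expensive pretrained portion is reused as-is, which is the core economy of fine-tuning versus training from scratch.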
The competition in AI benchmarking remains intense: Nvidia’s H100 and Eos face tough rivals in Intel’s Gaudi 2 and the upcoming Gaudi 3. Innovations like FP8 precision and CPU-only submissions signal a push for efficiency, and MLPerf’s steady round-over-round gains promise more capable, more cost-effective AI training in the future.