At the end of August, MLCommons® released new results for its MLPerf® Inference v4.1 benchmark suite, which measures machine learning and AI performance. This round continues to provide open and transparent results across domains including computer vision, language, speech, commerce, and image generation. It stands out, however, for a significant increase in the number of accelerator technologies represented: results were submitted for six new processors, either available now or soon to ship, showing how rapidly AI systems are evolving.
Increasing the Pace of AI Innovation: MLPerf Inference v4.1
A key highlight of MLPerf Inference v4.1 is the introduction of a mixture of experts (MoE) model. Previous rounds included large language models such as GPT-J and Llama 2 70B; an MoE model instead combines multiple smaller “expert” models and routes each query to only a few of them. Because only a fraction of the total parameters is activated per query, the model provides accurate answers with better performance. This new approach is accelerating innovation in AI.
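The routing idea described above can be sketched in a few lines. This is a minimal, illustrative top-k gating example (not the actual benchmark model); the expert count, dimensions, and function names are assumptions for demonstration only.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over expert scores.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def moe_forward(x, expert_weights, router_weights, top_k=2):
    """Route input x to the top_k highest-scoring experts and mix
    their outputs, weighted by renormalized router scores.
    Only the selected experts run, so only a fraction of all
    parameters is touched for this query."""
    scores = softmax(router_weights @ x)       # one score per expert
    chosen = np.argsort(scores)[-top_k:]       # indices of the chosen experts
    gate = scores[chosen] / scores[chosen].sum()
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gate, chosen))

# Hypothetical sizes: 8 experts, 16-dimensional activations.
rng = np.random.default_rng(0)
n_experts, d = 8, 16
experts = rng.normal(size=(n_experts, d, d))   # one weight matrix per expert
router = rng.normal(size=(n_experts, d))       # router scoring matrix
y = moe_forward(rng.normal(size=d), experts, router, top_k=2)
```

With `top_k=2` out of 8 experts, each query exercises only a quarter of the expert parameters, which is the source of the performance benefit the benchmark is designed to measure.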
QCT Data Center Infrastructures Provide Choice to Meet Diverse Customer Needs
QCT submitted results of four setups to MLPerf Inference v4.1, including three server models powered by four processor/accelerator combinations, to meet different levels of customers’ AI needs. To be specific, the four setups are: QuantaGrid S74G-2U powered by the NVIDIA GH200 Grace Hopper Superchip, QuantaGrid D54U-3U with two 5th Gen Intel® Xeon® Scalable processors and four NVIDIA H100 PCIe accelerator cards, QuantaGrid D54U-3U with two 5th Gen Intel® Xeon® Scalable processors and four NVIDIA L40S PCIe cards, and QuantaGrid D54X-1U powered by two 5th Gen Intel® Xeon® Scalable processors.
- QCT QuantaGrid S74G-2U with the NVIDIA GH200 Grace Hopper Superchip was a highlight among the systems listed. Using NVIDIA® NVLink®-C2C, the superchip delivers a coherent CPU+GPU memory model for accelerated AI and HPC applications, with 900 gigabytes per second (GB/s) of coherent bandwidth, 7X faster than PCIe 5.0. Along with HBM3 and HBM3e GPU memory, the QuantaGrid S74G-2U supercharges a wide range of AI tasks in the data center category.
- The QuantaGrid D54U-3U is an acceleration server designed for AI. Supporting dual 5th Gen Intel® Xeon® Scalable processors, this 3U system accommodates four dual-width or up to eight single-width accelerator cards, providing a comprehensive and flexible architecture optimized for various AI applications and large language models. QCT validated the results of one system with four NVIDIA H100 PCIe GPUs and another with four NVIDIA L40S PCIe cards.

Figure 2. QCT QuantaGrid D54U-3U configured to support dual Intel® Xeon® Scalable processors and four dual-width GPUs
- Another Intel-based server listed was the QCT QuantaGrid D54X-1U with dual 5th Gen Intel® Xeon® Scalable processors, submitted in CPU-only inference scenarios. The CPU-only configuration was validated for its ability to deliver strong performance on general-purpose AI workloads using Intel® AMX instruction sets.
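For readers who want to confirm that a Xeon host exposes the AMX capability mentioned above: on Linux, AMX-capable CPUs advertise the `amx_tile`, `amx_int8`, and `amx_bf16` flags in `/proc/cpuinfo`. The small sketch below parses a cpuinfo dump; the sample string is a hypothetical, abbreviated flags line for illustration.

```python
def amx_features(cpuinfo_text):
    """Return the AMX feature flags present in a /proc/cpuinfo dump.
    AMX-capable Xeons expose amx_tile, amx_int8, and amx_bf16
    in the 'flags' line."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            found.update(f for f in line.split() if f.startswith("amx"))
    return sorted(found)

# Hypothetical abbreviated flags line; on a real system you would read
# the text from /proc/cpuinfo instead.
sample = "flags : fpu sse2 avx512f amx_bf16 amx_tile amx_int8"
print(amx_features(sample))
```

On a real host, replace `sample` with `open("/proc/cpuinfo").read()`; an empty result means the CPU (or kernel) does not expose AMX.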
QCT believes that open and transparent performance benchmarking not only accelerates innovation in AI and HPC, but also gives customers a trustworthy basis for making informed purchase decisions based on their needs. As a leading data center solutions provider, QCT remains dedicated to providing a wide spectrum of peer-reviewed, industry-validated hardware systems and solutions to academic, industrial, and enterprise users, to enable AI transformations everywhere.
Click to view MLPerf Inference v4.1 results: https://mlcommons.org/benchmarks/inference-datacenter/