QCT QuantaGrid D75H-10U Unlocks Next-Gen AI Inference Efficiency in MLPerf Inference v6.0

As AI adoption accelerates, Quanta Cloud Technology (QCT) delivers infrastructure designed to adapt to real-world workload behavior. Our MLPerf Inference v6.0 results highlight not only technical excellence but also our ability to offer optimized system configurations tailored to diverse AI use cases, enabling meaningful business outcomes.

The latest MLPerf Inference v6.0 benchmarks reflect this shift toward real-world workloads, with an emphasis on generative AI, vision-language models, and text-to-video applications.

Recognizing that no single architecture fits all workloads, QCT provides an expansive portfolio of hardware designs optimized for different performance and deployment requirements. The QuantaGrid D75H-10U is one such example—purpose-built for large-scale AI inference and training—while the broader QCT lineup enables customers to select the right architecture to match specific workload characteristics and operational goals.

*Figure 1. QuantaGrid D75H-10U powered by 2x Intel Xeon 6 processors and 8x NVIDIA B300-SXM GPUs*

The QCT QuantaGrid D75H-10U is a cutting-edge AI server platform designed to deliver exceptional performance and scalability for modern data centers. Powered by dual Intel® Xeon® 6 processors, it supports eight NVIDIA HGX B300 SXM GPUs and integrates PCIe Gen6 to enable ultra-fast 800G east-west data transfer, making it ideal for building massive-scale GPU clusters. The system features eight OSFP ports serving up to 800G with NVIDIA Spectrum-X Ethernet or NVIDIA Quantum-X800 InfiniBand, ensuring high-bandwidth connectivity for demanding workloads. With an impressive 72 PFLOPS FP8 performance for AI training and 144 PFLOPS FP4 for inference, QuantaGrid D75H-10U is tailored for hyperscalers to accelerate large language model (LLM) training and inference, while delivering breakthrough performance on complex workloads such as agentic AI, AI reasoning, and real-time video generation.
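The system-level figures above can be broken down with some quick arithmetic. The sketch below is illustrative only: it assumes the quoted PFLOPS totals are split evenly across the eight GPUs and that "800G" means 800 Gb/s per OSFP port at line rate. The function and constant names are hypothetical and not part of any QCT or NVIDIA tooling.

```python
# Figures quoted in the article for the whole 8-GPU system.
GPUS_PER_SYSTEM = 8
SYSTEM_FP8_PFLOPS = 72.0
SYSTEM_FP4_PFLOPS = 144.0
OSFP_PORTS = 8
PORT_SPEED_GBPS = 800  # assumption: "800G" means 800 Gb/s per port


def per_gpu(system_total: float, gpus: int = GPUS_PER_SYSTEM) -> float:
    """Divide a system-level figure evenly across the GPUs (an assumption)."""
    return system_total / gpus


def aggregate_network_tbps(ports: int = OSFP_PORTS,
                           gbps: int = PORT_SPEED_GBPS) -> float:
    """Total network bandwidth in Tb/s if every port runs at line rate."""
    return ports * gbps / 1000


if __name__ == "__main__":
    print(f"FP8 per GPU: {per_gpu(SYSTEM_FP8_PFLOPS):.1f} PFLOPS")
    print(f"FP4 per GPU: {per_gpu(SYSTEM_FP4_PFLOPS):.1f} PFLOPS")
    print(f"Aggregate network bandwidth: {aggregate_network_tbps():.1f} Tb/s")
```

Under these assumptions, each GPU contributes 9 PFLOPS of FP8 and 18 PFLOPS of FP4 compute, and the eight ports together provide 6.4 Tb/s of network bandwidth.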

*Figure 2. QuantaGrid D75H-10U modularized 8-GPU sled design for optimal flexibility and easy serviceability*

From a design perspective, its value lies in optimized scalability, high-density accelerator integration, and next-generation networking capabilities, enabling efficient scale-out deployments.

Benchmarks submitted:

DeepSeek-R1
Llama 3.1 405B

In the large language model (LLM) landscape, the QuantaGrid D75H-10U, based on NVIDIA HGX B300, achieved the highest single-server inference throughput for the Llama 3.1 405B model^[1] in the Server scenario of the Closed division: 1,484.40 tokens/s, up to 11% higher than other vendors' systems using the same NVIDIA HGX B300 configuration.
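For readers who want to compare this throughput against other entries in the published results, the percentage lead can be computed as sketched below. This is a minimal illustration, assuming the "up to 11%" figure is measured relative to the other system's throughput and is rounded. The helper names are hypothetical, and the implied baseline is a back-of-the-envelope estimate, not a published result.

```python
QCT_TOKENS_PER_S = 1484.40  # Llama 3.1 405B, Server scenario, from the article
QUOTED_LEAD_PCT = 11.0      # "up to 11%" in the article, assumed to be rounded


def relative_lead_pct(ours: float, other: float) -> float:
    """Percentage by which `ours` exceeds `other`, relative to `other`."""
    if other <= 0:
        raise ValueError("other throughput must be positive")
    return (ours / other - 1.0) * 100.0


def implied_baseline(ours: float, lead_pct: float) -> float:
    """Throughput a system would have if `ours` leads it by `lead_pct` percent."""
    return ours / (1.0 + lead_pct / 100.0)


if __name__ == "__main__":
    # Rough estimate only; consult the MLCommons results table for real figures.
    estimate = implied_baseline(QCT_TOKENS_PER_S, QUOTED_LEAD_PCT)
    print(f"Implied comparison throughput: about {estimate:.0f} tokens/s")
    print(f"Round-trip lead: {relative_lead_pct(QCT_TOKENS_PER_S, estimate):.1f}%")
```

The same `relative_lead_pct` helper can be applied to any pair of tokens/s values taken from the official results table.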

With platforms like the QuantaGrid D75H-10U, QCT demonstrates strong hardware design expertise and proven best practices for evaluating the next generation of AI applications. By integrating advanced processor architectures, high-density accelerator support, and next-generation interconnect technologies, QCT enables enterprises to efficiently deploy scalable, high-performance AI clusters optimized for demanding workloads such as LLM training and inference.

Looking ahead, QCT remains committed to delivering optimized, workload-driven system designs that align with the evolving requirements of next-generation AI. By continuing to innovate in areas such as high-speed networking, system efficiency, and accelerator integration, QCT is well-positioned to support the rapid advancement of AI technologies and empower customers to achieve greater performance, scalability, and operational efficiency in future AI deployments.

To view the results for MLPerf Inference v6.0, please visit: https://mlcommons.org/benchmarks/inference-datacenter/

^[1] MLPerf® Inference: Datacenter v6.0 Closed. Llama3.1 405b Server benchmark. Submission ID 6.0-0089, MLCommons.
