
White Paper on the InfiniBand Leaf Spine Network Architecture for AI Computing Power


Release time: 2026-03-11


Abstract
As the computing demands of large-model training and high-performance computing (HPC) grow exponentially, the bandwidth, latency, and scalability bottlenecks of traditional network architectures have become increasingly prominent. This paper details a two-tier leaf-spine InfiniBand network architecture built on the NVIDIA Quantum-2 platform and analyzes its core advantages for AI computing centers: high performance, linear scalability, high reliability with easy maintenance, and future-oriented evolution, providing a reference for building modern, high-performance AI infrastructure.

I. Introduction: Network Challenges in the AI Era
When training models such as GPT-4 and LLaMA, thousands of GPUs must synchronize and exchange terabytes of data within milliseconds. A traditional three-tier network architecture not only introduces additional forwarding delay but also tends to form bandwidth bottlenecks at the core layer, leading to low GPU utilization and greatly prolonged training times. Building a high-speed network optimized specifically for AI computing has therefore become a core task of data-center upgrades.

II. Architecture Design: Two-Tier Leaf-Spine InfiniBand H200 Network

2.1 Core Components and Hierarchical Division

The topology adopts a clear two-tier switching design, dividing the network into a spine layer, a leaf layer, and a server access layer. Each layer has well-defined responsibilities and the layers work together.

| Layer | Core Device | Quantity | Optical Module Specification | Core Responsibilities |
|---|---|---|---|---|
| Spine layer | NVIDIA Quantum-2 MQM9790 | 32 | 800Gbps OSFP 2xFR4/DR4/SR4 | Network-wide core forwarding; non-blocking full interconnection between leaf switches |
| Leaf layer | NVIDIA Quantum-2 MQM9790 | 64 | Uplink: 800Gbps OSFP 2xFR4/DR4/SR4; downlink: 800Gbps OSFP 2xSR4 | Uplink to the spine layer; server access and traffic aggregation |
| Server layer | GPU server + ConnectX-7 | 256 | 400Gbps OSFP SR4 | Provide computing and storage; connect to the leaf layer |

To support modular expansion, the network is divided into 8 standard PODs (Points of Delivery). Each POD contains 8 leaf switches and 32 GPU servers, forming an independent computing and network unit.

Single POD size: 8 leaf switches + 32 GPU servers
Total network size: 8 PODs × 32 servers/POD = 256 GPU servers
Connection relationship: each server connects to the leaf switches in its POD through 8 × 400Gbps links, and each leaf switch is fully connected to all 32 spine switches in the network.
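The sizing above can be sanity-checked with a short script. This is a sketch, not part of the original design document: it assumes the MQM9790's 64 × 400Gbps NDR port radix (the 800Gbps OSFP modules each carry two 400Gbps ports) and one 400Gbps link from every leaf to every spine.

```python
# Sanity check of the fabric sizing described in the text.
# Assumptions: 64 x 400Gbps ports per Quantum-2 MQM9790 switch,
# one 400Gbps link between every leaf-spine pair.

SPINES = 32
PODS = 8
LEAVES_PER_POD = 8
SERVERS_PER_POD = 32
LINKS_PER_SERVER = 8          # 8 x 400Gbps per GPU server
LINK_GBPS = 400

servers = PODS * SERVERS_PER_POD                      # total GPU servers
leaves = PODS * LEAVES_PER_POD                        # total leaf switches

# Downlink ports per leaf: a POD's server links spread across its leaves.
down_ports_per_leaf = SERVERS_PER_POD * LINKS_PER_SERVER // LEAVES_PER_POD
# Uplink ports per leaf: one 400Gbps link to each spine.
up_ports_per_leaf = SPINES

oversubscription = down_ports_per_leaf / up_ports_per_leaf  # 1.0 => non-blocking
server_bw_tbps = LINKS_PER_SERVER * LINK_GBPS / 1000        # access bandwidth/server

print(servers, leaves, down_ports_per_leaf, oversubscription, server_bw_tbps)
```

With these numbers each leaf uses exactly 32 downlink and 32 uplink ports, a 1:1 ratio, which is what makes the fabric non-blocking.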

III. Core Advantages: Four Pillars Supporting AI Computing Power

3.1 Ultimate Performance: Breaking the Data Transmission Bottleneck
Ultra-high bandwidth: each server reaches a total access bandwidth of 3.2 Tbps through 8 × 400Gbps links, while the spine-leaf core is built from 800Gbps links, so data is ingested and forwarded at full rate.
Microsecond latency: InfiniBand keeps end-to-end communication latency in the microsecond range, sharply reducing GPU idle time and improving training efficiency.
Non-blocking forwarding: the fully interconnected design means communication between any two servers takes at most four link hops (server → leaf → spine → leaf → server), avoiding the detours and bottlenecks of traditional networks.
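The hop-count property can be made concrete with a tiny sketch. The leaf identifiers below are hypothetical labels: in a two-tier Clos fabric the path length depends only on whether two servers share a leaf, since any two distinct leaves reach each other through a single spine.

```python
# Link traversals between two servers in a two-tier leaf-spine fabric,
# counting each cable crossed. Leaf IDs are illustrative labels only.

def link_hops(leaf_a: int, leaf_b: int) -> int:
    """Hops between servers attached to leaf_a and leaf_b."""
    if leaf_a == leaf_b:
        return 2   # server -> leaf -> server
    return 4       # server -> leaf -> spine -> leaf -> server

print(link_hops(0, 0))   # same leaf: best case
print(link_hops(0, 63))  # opposite ends of the fabric: worst case
```

The worst case is constant regardless of which PODs the two servers sit in, which is why latency stays predictable as traffic patterns shift during training.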

3.2 Linear Expansion: Computing Power and Network Grow Together
Modular POD Design: adding computing power only requires deploying a new POD, with no changes to the existing architecture, so computing and network capacity grow linearly together.
Elastic Expansion Capability: by increasing the number of spine/leaf switches, the network can grow from hundreds of servers to thousands, meeting the needs of future ultra-large-scale AI clusters.
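As a rough illustration of the scaling claim, the capacity ceiling of a non-blocking two-tier fabric follows directly from the switch radix. The function below is a sketch under stated assumptions (equal-speed ports on every switch, leaf ports split evenly between servers and spines, one link per leaf-spine pair); it is not taken from the paper itself.

```python
# Capacity ceiling of a 1:1 (non-blocking) two-tier leaf-spine fabric.
# Assumes `radix` equal-speed ports per switch and one link per leaf-spine pair.

def max_two_tier_servers(radix: int, links_per_server: int) -> int:
    """Maximum servers supported by a non-blocking two-tier Clos fabric."""
    down = radix // 2            # leaf ports facing servers
    spines = radix - down        # one uplink per spine => radix/2 spines
    leaves = radix               # each spine (radix ports) reaches `radix` leaves
    server_ports = leaves * down # total server-facing ports in the fabric
    return server_ports // links_per_server

print(max_two_tier_servers(64, 8))  # 8 links/server, as in this design
print(max_two_tier_servers(64, 1))  # 1 link/server
```

With a 64-port radix and 8 links per server this yields 256 servers, matching the design above; with a single link per server the same two tiers reach 2048 servers, which is where the "thousands of servers" figure comes from before a third tier becomes necessary.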

3.3 High Reliability and Easy Maintenance: Ensuring Business Continuity
Multiple Link Redundancy: servers, leaf switches, and spine switches are all connected over multiple links, so a single point of failure does not affect overall service.
Simplified Operations and Maintenance: the two-tier architecture is clear and straightforward, making fault location efficient; standardized PODs and a unified hardware platform significantly reduce deployment and maintenance costs.

3.4 Future-oriented: Protecting Long-Term Investments
Forward-Looking Technology: the Quantum-2 platform and ConnectX-7 network cards support the InfiniBand NDR standard and can evolve smoothly to 1.6 Tbps and higher speeds.
Compatible with Next-Generation Hardware: the open architecture design accommodates future GPUs, DPUs, and other new computing hardware, ensuring the network infrastructure keeps pace with the rapid iteration of AI technology.
 

IV. Application Scenarios: Empowering AI and Supercomputing Fields
Large Model Training: Supporting the high-speed collaboration of thousands of GPUs, reducing the training period from months to weeks.
Scientific Computing: In fields such as weather forecasting and gene sequencing, real-time processing and analysis of TB-level data can be achieved.
Autonomous Driving Simulation: Providing low-latency and high-bandwidth network support for massive scene simulations, accelerating algorithm iterations.

V. Conclusion
The two-tier leaf-spine InfiniBand network architecture based on the NVIDIA Quantum-2 platform addresses the network challenges of the AI era through high performance, linear scalability, high reliability with easy maintenance, and future-oriented design. It is an ideal choice for building high-performance AI computing centers today, and a key piece of infrastructure for protecting long-term investment and supporting the next generation of AI technologies.
