
Huawei's scale-up world size refers to the number of AI chips that can be integrated into a single compute domain. The Atlas 950 SuperPod packs up to 8,192 Ascend 950DT NPUs, while the Atlas 960 SuperPod will scale to 15,488 NPUs. Within a SuperPod, the NPUs are connected either via Huawei's proprietary UnifiedBus (UB) interconnect, which promises 2.1 µs latency and up to 2 TB/s of chip-to-chip bandwidth, or via RoCE, which relies on industry-standard components but delivers lower performance.
These SuperPods are meant to function as one logical system, optimized for large-model training and inference, with synchronized compute, unified memory access, and token throughput scaling beyond 80 million tokens per second. At this point, we can only wonder whether this will indeed work as planned.
By contrast, Nvidia currently limits its scale-up world size to 72 GPU packages per NVL72 GB200/GB300 rack, all connected with NVLink 5.0 and NVSwitch within a single rack. For future systems like NVL144 (Rubin) and NVL576 (Rubin Ultra), Nvidia also maintains a modular pod-based structure, with no extension of NVLink domains beyond one rack. Interestingly, the number of GPU packages in an NVL144 rack remains unchanged at 72; the new naming counts compute dies rather than packages.
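For a rough sense of how different these scale-up domains are, here is a quick back-of-the-envelope comparison using only the package counts quoted above:

```python
# Scale-up domain sizes quoted in this article (packages, not chiplets).
atlas_950_superpod = 8_192    # Ascend 950DT NPUs per Atlas 950 SuperPod
atlas_960_superpod = 15_488   # NPUs per Atlas 960 SuperPod
nvl72_pod = 72                # GPU packages per Nvidia NVL72 rack

print(f"Atlas 950 vs NVL72: {atlas_950_superpod / nvl72_pod:.0f}x larger domain")  # ~114x
print(f"Atlas 960 vs NVL72: {atlas_960_superpod / nvl72_pod:.0f}x larger domain")  # ~215x
```

Huawei's coherent domains are roughly two orders of magnitude larger than Nvidia's, which is where much of the software complexity discussed below comes from.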
In terms of scale-out world size, Huawei connects dozens of SuperPods into a SuperCluster using UnifiedBus over Ethernet (UBoE) or RoCE, enabling deployments like the Atlas 950 SuperCluster with 524,288 NPUs, and the upcoming Atlas 960 SuperCluster with over 1 million NPUs. These clusters aim to operate cohesively, with improved fault tolerance, low inter-pod latency, and the ability to train multi-trillion-parameter models when interconnected using UBoE, according to Huawei.
Nvidia's design offers flexibility, modularity, and ease of integration, but lacks Huawei's end-to-end coherence and latency control at extreme scale, potentially limiting performance scaling for systems with hundreds of thousands or millions of GPUs. Then again, Nvidia and its clients may not need clusters with over a million compute GPUs (GPU packages, not GPU chiplets) for AI, since the Nvidia GPUs that Huawei's NPUs will compete against in 2027 and 2028 will be inherently more powerful.
Without any doubt, orchestrating hundreds of thousands of AI accelerators is an incredible engineering achievement. But scaling out to hundreds of thousands of NPUs is not just a hardware challenge; it also complicates software development.
Nvidia's clusters are generally easier to program because they require fewer accelerators to reach a target performance level, thanks to the high compute density of each GPU and each pod: an NVL72 pod integrates 72 Blackwell GPUs connected via NVLink 5.0 and NVSwitch.
These pods operate as single, tightly coupled domains with shared memory coherence, reducing the need for complex distributed parallelism. Many large-scale AI workloads, including multi-trillion-parameter model training, can run effectively on just a few NVL72 pods, enabling developers to work within stable, local system boundaries.
Nvidia's modular scale-out model, in which NVL72, NVL144 (Rubin), or NVL576 (Rubin Ultra) pods are combined into a larger cluster, makes distribution more manageable.
Software stacks like NCCL, Megatron-LM, TensorRT-LLM, and DeepSpeed can assume consistent interconnect topologies and latency domains, with limited cross-pod communication. Taking into account Nvidia's vertically integrated and mature CUDA ecosystem, developers benefit from unified tooling, extensive documentation, and robust abstractions, making it possible to scale AI workloads with minimal custom engineering.
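To make the "limited cross-pod communication" point concrete, below is a minimal sketch of how an NCCL-based framework can carve a job into per-pod process groups so that heavy collectives stay inside one NVLink domain. It uses PyTorch's distributed API; the grouping scheme and pod-size constant are illustrative, not a description of any vendor's production configuration.

```python
# Minimal sketch: keep heavy collectives inside one pod by building per-pod
# process groups with PyTorch's NCCL backend. Assumes the job was launched
# with torchrun, which sets rank/world-size environment variables.
import torch.distributed as dist

POD_SIZE = 72  # GPU packages per NVL72 rack (figure from this article)

def init_pod_group():
    dist.init_process_group(backend="nccl")  # one global job
    world_size = dist.get_world_size()
    rank = dist.get_rank()

    # Every rank must take part in every new_group() call, even for groups
    # it does not belong to, so all ranks run the full loop.
    pod_groups = []
    for start in range(0, world_size, POD_SIZE):
        ranks = list(range(start, min(start + POD_SIZE, world_size)))
        pod_groups.append(dist.new_group(ranks=ranks))

    return pod_groups[rank // POD_SIZE]

# Collectives issued on the returned group, e.g.
#   dist.all_reduce(tensor, group=my_pod_group)
# stay within a single NVLink domain; only a much smaller volume of traffic
# has to cross pods over the scale-out network.
```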
Huawei, by contrast, aims for scale through very large monolithic systems, such as the Atlas 950 SuperPod (8,192 NPUs) and Atlas 960 SuperPod (15,488 NPUs), which function as single logical compute domains. These SuperPods use Huawei's UnifiedBus (UB) interconnect, with 2.1 µs latency and up to 2 TB/s of chip-to-chip bandwidth, to tightly couple several thousand NPUs.
Token throughput is projected to exceed 80 million tokens/s (for Atlas 960 SuperPods), and memory access is synchronized across the entire system. This architecture supports tightly coupled training and inference at a massive scale, but it also introduces far greater complexity in synchronization, memory partitioning, and job orchestration within each SuperPod.
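Dividing that projection by the NPU count gives a sense of the sustained per-chip load it implies; this is only arithmetic on the figures above, not a measured number.

```python
# Per-chip inference load implied by Huawei's projections.
tokens_per_second = 80_000_000   # projected Atlas 960 SuperPod throughput
npus_per_superpod = 15_488       # Atlas 960 SuperPod

print(f"~{tokens_per_second / npus_per_superpod:,.0f} tokens/s per NPU")  # ~5,165
```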
In the scale-out model, Huawei connects multiple SuperPods via UBoE (UnifiedBus over Ethernet) or RoCE to build SuperClusters with 524,288 NPUs (Atlas 950) or over 1 million NPUs (Atlas 960). This large-scale interconnection requires developers to write software that performs well across tens or hundreds of thousands of accelerators, even for workloads that Nvidia can handle within a few pods.
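To illustrate why that matters for software, here is a back-of-the-envelope decomposition of an Atlas 950 SuperCluster into the usual tensor/pipeline/data parallelism axes. The cluster size comes from this article; the tensor and pipeline degrees are hypothetical and only show how many independent model replicas a scheduler would still have to keep in lockstep.

```python
# Hypothetical 3D-parallel decomposition of a 524,288-NPU SuperCluster.
total_npus = 524_288        # Atlas 950 SuperCluster (from this article)
tensor_parallel = 8         # hypothetical: NPUs sharing each layer's matmuls
pipeline_parallel = 16      # hypothetical: pipeline stages per model replica

npus_per_replica = tensor_parallel * pipeline_parallel
data_parallel_replicas = total_npus // npus_per_replica
print(f"{data_parallel_replicas} data-parallel replicas")  # 4,096

# Every optimizer step must all-reduce gradients across all 4,096 replicas,
# and a single slow or failed NPU can stall the whole step; this is exactly
# the kind of cross-pod traffic and failure surface that smaller Nvidia
# clusters reduce or avoid.
```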
While Huawei's vertical integration and proprietary toolchain (e.g., MindSpore) offer optimization opportunities, its software stack is less mature (according to Chinese companies, which still prefer Nvidia hardware despite availability issues), and the sheer scale involved makes distributed scheduling, failure handling, and workload decomposition significantly harder, especially given the tight synchrony required by multi-trillion-parameter models.
At Huawei Connect 2025, the company revealed its shift to system-level scaling via massive AI clusters in a bid to stay competitive in the rapidly developing AI industry. Although Huawei cannot access advanced foundry nodes or HBM4 memory, it introduced an impressive Ascend NPU roadmap that includes the 950PR, 950DT, 960, and 970, all based on a new SIMD+SIMT architecture, with support for modern low-precision formats (FP8, MXFP4, HiF8, HiF4) and proprietary memory like HiBL 1.0 and HiZQ 2.0.
Starting with the Ascend 950 series (1 PFLOPS of FP8 performance per chip), Huawei's SuperPods will scale up to 15,488 NPUs per SuperPod and over half a million NPUs per SuperCluster. Such massive clusters enable Huawei to achieve multi-ZettaFLOPS performance levels comparable to those of clusters used by market leaders like Google, Meta, OpenAI, and xAI.
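As a sanity check on that claim, multiplying the quoted per-chip figure by the cluster sizes gives a rough peak estimate; per-chip numbers for the later Ascend 960 and 970 are not quoted here, so the Ascend 950 figure serves as a floor.

```python
# Rough peak FP8 estimate using only figures quoted in this article.
pflops_per_npu_fp8 = 1.0        # Ascend 950 series
atlas_950_cluster = 524_288     # NPUs per Atlas 950 SuperCluster
atlas_960_cluster = 1_000_000   # "over 1 million NPUs", used as a lower bound

# 1 ZettaFLOPS = 1,000,000 PFLOPS
print(f"Atlas 950 SuperCluster: ~{pflops_per_npu_fp8 * atlas_950_cluster / 1e6:.2f} ZettaFLOPS FP8 peak")  # ~0.52
print(f"Atlas 960 SuperCluster: >{pflops_per_npu_fp8 * atlas_960_cluster / 1e6:.2f} ZettaFLOPS FP8 peak")  # >1.00
```

Multi-ZettaFLOPS peaks then follow as later Ascend generations and lower-precision formats such as FP4 raise the per-chip figure.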
However, Huawei's large, monolithic clusters present major software scaling challenges, unlike Nvidia’s modular NVL72/NVL144/NVL576 systems, which are easier to program due to consistent pod sizes, mature tooling, and fewer nodes needed to reach the same performance targets.
Anton Shilov is a contributing writer at Tom's Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.