China bypasses US GPU bans with 1.54-exaflops ‘LineShine’ supercomputer — CPU-only monster packs 2.4 million Huawei-designed Armv9 cores

The machine delivers 1.54 ExaFLOPS of BF16 training performance and peaks at 2.16 ExaFLOPS while training a 6.3-billion-parameter Earth observation generative compression model. Because companies like xAI do not publish the peak performance of their AI clusters, which use hundreds of thousands of Nvidia AI GPUs, we cannot directly compare LineShine with Colossus or other advanced AI clusters. Still, the theoretical peak performance of xAI's Colossus is believed to be 497.9 ExaFLOPS, so even at a model FLOPS utilization of around 15% (about what LineShine achieves), it could deliver around 75 ExaFLOPS.
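The Colossus estimate is simply theoretical peak multiplied by model FLOPS utilization (MFU). A minimal Python sketch of that arithmetic follows; it uses only the figures quoted here, and the function and constant names are illustrative rather than taken from any published tool. The 497.9 ExaFLOPS input is itself an estimate, so the output is no more reliable than that.

```python
def delivered_exaflops(peak_exaflops: float, mfu: float) -> float:
    """Sustained throughput = theoretical peak x model FLOPS utilization."""
    if not 0.0 <= mfu <= 1.0:
        raise ValueError("mfu must be a fraction between 0 and 1")
    return peak_exaflops * mfu


# Figures quoted in the article; the Colossus peak is a believed value,
# not a number published by xAI.
COLOSSUS_PEAK_EXAFLOPS = 497.9
ASSUMED_MFU = 0.15  # assumption: same utilization as cited for LineShine

colossus_estimate = delivered_exaflops(COLOSSUS_PEAK_EXAFLOPS, ASSUMED_MFU)
print(f"Estimated delivered throughput: {colossus_estimate:.1f} ExaFLOPS")
```

The result, roughly 74.7 ExaFLOPS, rounds to the "around 75 ExaFLOPS" cited above.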

When it comes to theoretical peak FP64 performance, the machine's 40,960 LX2 processors can deliver 2.47 ExaFLOPS, though its actual FP64 throughput is unknown, as it depends heavily on multiple factors.
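The aggregate peak is the per-processor peak multiplied by the processor count. The sketch below reproduces the 2.47 ExaFLOPS figure from the per-LX2 numbers quoted from the article in the comments (60.3 TFLOPS FP64, 240 TFLOPS BF16), and derives the BF16 utilization implied by the 1.54 ExaFLOPS training result. It assumes the training run used all 40,960 processors, which the text here does not confirm; all names are illustrative.

```python
LX2_COUNT = 40_960
LX2_FP64_TFLOPS = 60.3          # per-processor peak quoted from the article
LX2_BF16_TFLOPS = 240.0         # per-processor peak quoted from the article
SUSTAINED_BF16_EXAFLOPS = 1.54  # reported BF16 training performance

TFLOPS_PER_EXAFLOPS = 1_000_000  # 1 ExaFLOPS = 10^6 TFLOPS


def aggregate_exaflops(per_chip_tflops: float, chips: int) -> float:
    """Theoretical system peak, assuming perfect scaling across chips."""
    return per_chip_tflops * chips / TFLOPS_PER_EXAFLOPS


fp64_peak = aggregate_exaflops(LX2_FP64_TFLOPS, LX2_COUNT)
bf16_peak = aggregate_exaflops(LX2_BF16_TFLOPS, LX2_COUNT)

# Assumption: the sustained figure was measured on the full machine.
implied_bf16_utilization = SUSTAINED_BF16_EXAFLOPS / bf16_peak

print(f"FP64 peak: {fp64_peak:.2f} ExaFLOPS")
print(f"BF16 peak: {bf16_peak:.2f} ExaFLOPS")
print(f"Implied BF16 utilization: {implied_bf16_utilization:.1%}")
```

Under these assumptions the BF16 peak comes to about 9.83 ExaFLOPS, and the implied utilization of roughly 15.7% is consistent with the "around 15%" figure cited earlier.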

CPU-only AI and HPC supercomputers offer several advantages over conventional heterogeneous CPU+GPU systems, specifically for complex scientific tasks that combine AI training with massive data ingestion, preprocessing, storage interaction, simulation, and orchestration.

Since everything runs on the same processor and memory space, they avoid many of the complications associated with heterogeneous computing, such as costly and bandwidth-hungry CPU-to-GPU data transfers, complex programming models, GPU memory limitations, and accelerator-specific software stacks.

Furthermore, homogeneous CPU-based systems can expose much larger coherent memory pools by combining HBM with large DDR capacities, which is useful for handling massive scientific datasets, retrieval-augmented generation, and long-context windows.

In addition, they are attractive for AI-for-science applications that involve irregular control flow, distributed I/O, communication-heavy pipelines, and execution patterns that do not map efficiently to GPUs.

Also, CPU-only systems can integrate more naturally with traditional HPC environments and perform regular supercomputer tasks (e.g., simulations), which is particularly useful for those who need both AI training/inference and HPC.

Last but not least, such systems reduce dependence on foreign accelerators and platforms like Nvidia's GPUs and the CUDA software ecosystem, which is important for China.

There is a big tradeoff, though: CPU-only systems are usually less power-efficient and deliver lower dense AI throughput than GPU-based supercomputers, which is why the industry bets on heterogeneous CPU+GPU architectures.

Anton Shilov is a contributing writer at Tom’s Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers and from modern process technologies and latest fab tools to high-tech industry trends.

Findecanor In simplified terms: SME works not unlike a "Tensor core", by doing matrix multiplication. SVE can work not unlike a traditional massively parallel GPU "warp" by having predicated instructions. This approach is not unusual: both Apple and Intel have had their own matrix extensions (both called "AMX" for some reason). Apple's latest CPUs support SVE, and Intel's AVX-512 has predication but a fixed vector length. Apple has had "AMX" support in mobile processors since 2019 in addition to its "Neural Engine". It was Apple's own design and hidden behind a framework, so what anyone outside of Apple knows about it has been reverse-engineered. Apple M4 and M5 have SME support, and I'm a bit confused whether that is separate from AMX or if Apple is reusing the AMX moniker to mean SME. Intel's AMX was announced in 2020 but first released in Xeon processors in 2023. Many RISC-V CPUs developed for AI are built on the same principles. RISC-V allows each vendor to have its own proprietary extension, and there exist at least half a dozen matrix extensions. Some RISC-V AI processors are heterogeneous, with some cores for general-purpose code and others being "AI cores" that are RISC-V with wider vector units and a matrix extension of some sort. RISC-V's working groups are working on distilling these into two future official extensions: a bigger one that can be shared by multiple cores and one more tightly coupled to a single core.

PEnns Interesting. But I thought our country's leaders and all the tech bros went to China last week... to sell them more GPUs (among other stuff)! And now we find out why China doesn't care for their wares anymore!

bit_user The article said: "When it comes to performance, a single LX2 processor delivers 60.3 TFLOPS FP64 performance, 240 TFLOPS BF16/FP16 throughput, and 960 TOPS INT8 performance."

Matrix or vector, though? If it's vector, then that's about 200 fp64 GFLOPS per ARM core. If each ARM core has 2048 bits of aggregate vector FMA width, then you get 64 floating point ops per cycle and a clock speed of about 3.1 GHz, which I think is plausible for a 5 nm-class node. So, it could be vector. That would put it at 33% higher fp64 floating point ops per cycle than Zen 5, which is also made on a late-generation, 5 nm-family node.

Assuming it's vector FLOPS, that would work out to about 15.1 floating point ops per byte (HBM-only), which puts it at the high end of the range (lower is better). Source: https://chipsandcheese.com/p/sc25-estimating-amds-upcoming-mi430xs

That means it could be fairly bandwidth starved, before we even consider the extra memory traffic that might be involved in migrating data between HBM and DDR5.

The article said: "Since everything runs on the same processor and memory space, they avoid many of the complications associated with heterogeneous computing, such as costly and bandwidth-hungry CPU-to-GPU data transfers, complex programming models, GPU memory limitations, …"

Wow, it seems like you had already forgotten what you wrote above:

The article said: "Each chiplet contains four HBM domains and four DDR domains; there are 16 NUMA domains per processor. HBM access is highly sensitive to locality, whereas DDR memory access is more uniform within a die and is shared between clusters. Such behavior forced developers to design topology-aware memory placement and scheduling techniques … which are executed by a dedicated SDMA engine to move data between DDR and HBM. … The paper notes that sustaining high utilization of the SME matrix engines required extensive co-design of kernels, runtime scheduling, cache residency management, and tensor placement across the HBM and DDR hierarchy."

Arranging and migrating that data, queuing and handling DMA operations, and customized thread scheduling are a lot like the kind of headaches you have to deal with in a hybrid CPU + GPU architecture!

The article said: "Furthermore, homogeneous CPU-based systems can expose much larger coherent memory pools by combining HBM with large DDR capacities, which is useful for handling massive scientific datasets, retrieval-augmented generation, and long-context windows."

Except that you already said each CPU can host only 256 GB of DDR5. If you compare that to a Grace-Blackwell, that CPU can host 480 GB. Vera will be able to host up to 1.5 TB per CPU. And however many of these CPUs you can directly link, I'm sure it won't be nearly as many as an NVLink mesh can handle. NVLink is cache-coherent. So, LX2 can't solve that deficit by scaling – it's always going to be at ~50% the capacity of a Grace-Blackwell superchip and 15% the capacity of a Vera-Rubin superchip.

The article said: "Last but not least, such systems reduce dependence on foreign accelerators and platforms like Nvidia's GPUs and the CUDA software ecosystems, which is important for China."

That's obviously the reason, though. Even Fujitsu, who did the last chart-topping CPU-only supercomputer (i.e. Fugaku), has pivoted towards a CPU + GPU approach. They can do the math. GPUs are just better at HPC and AI. Today's GPUs and programming frameworks are also more streamlined and flexible than in older generations, which means higher utilization and fewer of the downsides attributed to hybrid setups.
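The per-core arithmetic in bit_user's comment can be reproduced with a few lines of Python. This is a sketch of the commenter's reasoning, not a statement about the actual LX2 microarchitecture: the 2048-bit aggregate vector FMA width and the 200 GFLOPS per core are the commenter's hypotheses, the HBM bandwidth is back-derived from the quoted 15.1 ops per byte, and all names are illustrative.

```python
FP64_BITS = 64
OPS_PER_FMA = 2  # a fused multiply-add counts as two floating-point ops


def flops_per_cycle(vector_fma_bits: int) -> int:
    """FP64 ops per cycle for a given aggregate vector FMA width."""
    lanes = vector_fma_bits // FP64_BITS
    return lanes * OPS_PER_FMA


def implied_clock_ghz(gflops_per_core: float, vector_fma_bits: int) -> float:
    """Clock speed needed to reach a per-core GFLOPS figure."""
    return gflops_per_core / flops_per_cycle(vector_fma_bits)


# Commenter's hypotheses, not confirmed hardware specifications.
per_cycle = flops_per_cycle(2048)
clock_ghz = implied_clock_ghz(200.0, 2048)

# "33% higher than Zen 5" implies this baseline in ops per cycle.
implied_zen5_per_cycle = per_cycle * 3 / 4

# 60.3 TFLOPS at 15.1 ops per byte implies this HBM bandwidth in TB/s.
implied_hbm_tb_per_s = 60.3 / 15.1

# DDR capacity relative to the 480 GB cited for Grace-Blackwell.
capacity_vs_grace = 256 / 480

print(f"{per_cycle} FP64 ops/cycle at {clock_ghz:.3f} GHz")
print(f"Implied Zen 5 baseline: {implied_zen5_per_cycle:.0f} ops/cycle")
print(f"Implied HBM bandwidth: {implied_hbm_tb_per_s:.2f} TB/s")
print(f"DDR capacity vs. Grace-Blackwell: {capacity_vs_grace:.0%}")
```

Under these assumptions the numbers line up with the comment: 64 ops per cycle at about 3.1 GHz, and a DDR capacity of roughly half that of a Grace-Blackwell CPU.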

frankens USA is stupid. Block a country, and they will bypass us and make their own. They are great copiers of tech... and optimizers... they make something else... faster and more efficient... and cheaper. USA in the end will be stuck with last year's quality of product while they advance. This is only my opinion. (Just look at... over there, you can buy an electric car... fancy ones... for under $10k.)

bit_user frankens said: "USA is stupid. Block a country, and they will bypass us and make their own."

Over the past 2+ decades, China was already trying to build its own indigenous supply chain. Even in supercomputing, they already built the Sunway TaihuLight a decade ago! https://en.wikipedia.org/wiki/Sunway_TaihuLight

Sure, the trade policy and HPC export embargoes might've sped things up a little bit, but it's nothing China wouldn't have done anyway.

frankens said: "USA in the end will be stuck with last year's quality of product while they advance."

Depends on whether we walk away from the computing industry, the same way we did with other vital industries, like rare earth metals (where the US was once the global leader). However, the political climate in the US has changed since those days, with a much greater will to protect domestic industries.

frankens said: "(Just look at... over there, you can buy an electric car... fancy ones... for under $10k.)"

With subsidies at multiple levels, but yes.

10basetom Admin said: "China's National Supercomputing Center in Shenzhen takes a page from Japan's Fugaku supercomputer and Fujitsu's A64FX processor, and builds the LineShine supercomputer based entirely on Armv9-based LineShine LX2 CPUs."

The irony is that the GPU ban and tariffs have resulted in a more innovative China.

bit_user 10basetom said: "The irony is that the GPU ban and tariffs have resulted in a more innovative China."

Well, the embargoes on ASML and TSMC certainly threw a wrench in things. It'd be really interesting to know where these chips are made. They're not supposed to be able to make them at TSMC, but they've gotten around that before. I'm a little skeptical they're being made at SMIC.

In terms of innovation, they basically copied Fujitsu's A64FX and Japan's Fugaku, right down to the use of the ARM ISA. However, that's not to say it doesn't represent further indigenous capability than we've seen thus far. But it wouldn't even be Huawei's first ARMv9-A server core.

twin_savage I don't want to knock the Chinese efforts too much because progress is progress, but to put things in perspective, this entire installation is on par with about 5 of Nvidia's newest racks on BF16 workloads. It seems like the gap in compute between the USA and China has been widening instead of shrinking over the past several years.

micheal_15 Remember, this is China, where they claim to pretty much have a CPU working at a trillion exaflops that costs 1 yen, and all games run at 100 billion FPS at 32k resolution... Basically, China just makes stuff up.

qxp bit_user said: "Arranging and migrating that data, queuing & handling DMA operations, and customized thread scheduling is a lot like the kind of headaches you have to deal with in a hybrid CPU + GPU architecture!"

There is one big difference – you can simply write regular C code and let the C compiler and operating system do their thing with a mix of HBM and DMA and vector units. Then you run what you have, profile your code, and only optimize the places that matter (say, where 90-99% of CPU time is spent) and leave the rest as is. With a GPU you have to decide how you split the tasks beforehand, so you are forced into early optimization decisions.

bit_user said: "Except that you already said each CPU can host only 256 GB of DDR5. If you compare that to a Grace-Blackwell, that CPU can host 480 GB. Verra will be able to host up to 1.5 TB per CPU."

I would be perfectly happy to just have the CPUs with 32GB HBM, never mind the rest of DDR5.
