
Amazon launches Trainium3 AI accelerator, competing directly against Blackwell Ultra in FP8 performance: new Trn3 Gen2 UltraServer takes vertical scaling notes from Nvidia's playbook

This week, Amazon Web Services released its 3rd-generation Trainium accelerator for AI training and inference, along with the accompanying Trn3 UltraServer rack-scale solutions. For the first time, the Trn3 Gen2 UltraServer rack-scale machines rely solely on AWS in-house hardware, including CPUs, AI accelerators, switching hardware, and connectivity fabrics, signaling that the company has adopted Nvidia's vertically integrated hardware strategy.
AWS claims that its Trainium3 processor offers roughly 2X higher performance and 4X better energy efficiency than Trainium2: each accelerator delivers up to 2.517 PFLOPS (MXFP8), beating Nvidia's H100 but trailing B200, and is accompanied by 144 GB of HBM3E with 4.9 TB/s of bandwidth. Meanwhile, Trn3 Gen2 UltraServers scale to 144 accelerators for about 0.36 ExaFLOPS of FP8 performance, putting them on par with Nvidia's GB300 NVL72 rack-scale solution. Nonetheless, Nvidia's hardware still looks more universal than AWS's.
To catch up with Nvidia, Amazon also announced major updates to its Neuron software stack to make Trainium-based platforms easier to use, allow standard machine-learning frameworks to run natively on the hardware, give developers greater control over performance, and open access to low-level tuning for experts.
From a performance standpoint, the larger configuration with 144 Trainium3 accelerators delivers 362.5 PFLOPS of dense MXFP8/MXFP4 performance (on par with GB300 NVL72), 96.624 PFLOPS of BF16/FP16/TF32 throughput, and 26.352 PFLOPS in FP32. The system is also equipped with 21 TB of HBM3E memory with an aggregate bandwidth of 705.6 TB/s, leaving Nvidia's GB300 NVL72 behind on this metric.
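As a quick sanity check, these rack-level figures line up with the per-accelerator Trainium3 specs AWS quotes above; a few lines of Python reproduce the aggregates (per-chip numbers as reported, rounding mine):

```python
# Rack-level aggregates are the per-chip Trainium3 specs times 144.
per_chip_pflops_mxfp8 = 2.517  # dense MXFP8 PFLOPS per accelerator
per_chip_hbm_gb = 144          # HBM3E capacity per accelerator, GB
per_chip_bw_tbps = 4.9         # HBM bandwidth per accelerator, TB/s
n = 144                        # accelerators per Trn3 Gen2 UltraServer

print(f"{per_chip_pflops_mxfp8 * n:.1f} PFLOPS MXFP8")  # ~362.4 (quoted as 362.5)
print(f"{per_chip_hbm_gb * n / 1000:.1f} TB HBM3E")     # ~20.7 TB (quoted as 21 TB)
print(f"{per_chip_bw_tbps * n:.1f} TB/s aggregate BW")  # 705.6 TB/s
```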
In general, the Trn3 Gen2 UltraServer appears very competitive against Nvidia's GB300 NVL72 in terms of FP8 performance. FP8 is becoming increasingly popular for training, so betting on this format makes a lot of sense. Of course, Nvidia has an ace up its sleeve in the form of NVFP4, which is positioned for both inference and training, and armed with this format, the company's Blackwell-based machines are unbeatable. The same applies to BF16, which got faster compared to Trainium2, but not enough to beat Nvidia's Blackwell.
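For context on why these block-scaled formats matter: MXFP8 and NVFP4 both attach a shared power-of-two scale to small blocks of values, which is what lets 8-bit and 4-bit elements cover a usable dynamic range. Below is a minimal numpy sketch of MX-style scaling for one block; the block size of 32 and the E4M3 parameters follow the OCP Microscaling spec, while the rounding is simplified and not a model of either vendor's actual hardware.

```python
import numpy as np

FP8_E4M3_MAX = 448.0   # largest finite FP8 E4M3 value (OCP spec)
E4M3_EMAX = 8          # exponent of that max value
BLOCK = 32             # MX formats share one scale per 32 elements

def mx_scale_block(block):
    """Shared power-of-two scale for one MX block, plus the scaled
    elements (real hardware would then round each one to FP8)."""
    amax = float(np.abs(block).max())
    if amax == 0.0:
        return 1.0, block
    # Shared scale = 2^(floor(log2(amax)) - emax), per the MX spec;
    # elements that still overflow are clamped, as the spec allows.
    scale = 2.0 ** (np.floor(np.log2(amax)) - E4M3_EMAX)
    return scale, np.clip(block / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)

x = np.random.randn(BLOCK).astype(np.float32) * 100.0
scale, q = mx_scale_block(x)
print(f"shared scale: 2**{int(np.log2(scale))}, max |element|: {np.abs(q).max():.1f}")
```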
Overall, while the AWS Trn3 Gen2 UltraServer with 144 Trainium3 accelerators looks quite competitive in FP8 against Nvidia's Blackwell-based NVL72 machines, Nvidia's solution remains the more universal of the two.
In addition to rolling out new AI hardware, AWS announced a broad expansion of its AWS Neuron software stack at its annual re:Invent conference this week. AWS positions this release as a shift toward openness and developer accessibility, so the update promises to make Trainium platforms easier to adopt, let standard machine learning frameworks run directly on Trainium hardware, give users deeper control over performance, and even expose low-level optimization paths for experts.
A major addition is native PyTorch integration through an open-source backend named TorchNeuron. Using PyTorch's PrivateUse1 mechanism, Trainium now appears as a native device type, which enables existing PyTorch code to execute without modification. TorchNeuron also supports interactive eager execution, torch.compile, and distributed features such as FSDP and DTensor, and it works with popular ecosystems including TorchTitan and Hugging Face Transformers. Access to this feature is currently restricted to select users as part of the private preview program.
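AWS has not published code samples in the materials cited here, but if TorchNeuron behaves like other PrivateUse1 backends, adoption should amount to little more than a device-string change. The sketch below assumes the backend registers Trainium under the device name "neuron" (an assumed name) and falls back to CPU so it runs without the private-preview package:

```python
import torch
import torch.nn as nn

# Sketch of typical PrivateUse1-backend usage; assumes the private-preview
# TorchNeuron package is installed and exposes Trainium as device "neuron"
# (the exact name is an assumption). Falls back to CPU so the sketch runs anywhere.
device = "neuron" if hasattr(torch, "neuron") else "cpu"

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 8)).to(device)
compiled = torch.compile(model)   # torch.compile support is called out in AWS's announcement

x = torch.randn(32, 1024, device=device)
print(compiled(x).shape)          # unchanged PyTorch code, new device string
```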
AWS also introduced an updated Neuron Kernel Interface (NKI) that gives developers direct control over hardware behavior, including instruction-level programming, explicit memory management, and fine-grained scheduling, exposing Trainium's instruction set to kernel developers. In addition, the company has released the NKI Compiler as open source under Apache 2.0. The programming interface is available publicly, while the compiler itself remains in limited preview.
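NKI's programming model is Python-embedded, and its public documentation builds kernels from explicit loads, on-chip compute, and stores. A minimal kernel in that style looks roughly like the following (API names per the public NKI docs; the new instruction-level scheduling controls announced this week are not shown, and running it requires the Neuron SDK plus Trainium hardware):

```python
# Minimal NKI-style kernel sketch, following the publicly documented
# Neuron Kernel Interface; requires the Neuron SDK (neuronxcc) and a
# Trainium device to actually execute.
from neuronxcc import nki
import neuronxcc.nki.language as nl

@nki.jit
def tensor_add_kernel(a_input, b_input):
    # Allocate the kernel output in device HBM.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)
    # Explicit data movement: load operands from HBM into on-chip memory.
    a = nl.load(a_input)
    b = nl.load(b_input)
    # Compute on-chip, then store the result back to HBM.
    nl.store(c_output, nl.add(a, b))
    return c_output
```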
AWS also released Neuron Explorer, a debugging and tuning toolkit that lets software developers and performance engineers improve how their models run on Trainium. It does this by tracing execution from high-level framework calls all the way down to individual accelerator instructions, offering layered profiling, source-level visibility, integration with development environments, and AI-guided suggestions for performance tuning.
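AWS has not detailed Neuron Explorer's interface in this announcement, so as a stand-in, PyTorch's built-in profiler illustrates the framework-level end of the workflow that Neuron Explorer reportedly extends down to individual accelerator instructions:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Stand-in illustration using PyTorch's own profiler, NOT Neuron Explorer:
# it produces the framework-call-to-operator breakdown that AWS says
# Neuron Explorer carries further down, to Trainium instructions.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
x = torch.randn(64, 512)

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    model(x)

# Per-operator cost table, sorted by total CPU time.
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```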
Finally, AWS introduced Neuron Dynamic Resource Allocation (DRA) to integrate Trainium directly into Kubernetes without the need for custom schedulers. Neuron DRA relies on the native Kubernetes scheduler and adds hardware-topology awareness, enabling complete UltraServers to be allocated as a single resource and hardware to be flexibly assigned to each workload. Neuron DRA supports Amazon EKS, SageMaker HyperPod, and UltraServer deployments, and is provided as open-source software with container images published in the AWS ECR public registry.
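DRA itself is an upstream Kubernetes mechanism built around resource claims, which is presumably what lets Neuron DRA avoid a custom scheduler. The sketch below shows the general shape of a pod consuming hardware through a claim; the resourceClaims and resources.claims fields come from upstream Kubernetes DRA, while the claim-template and image names are hypothetical placeholders, since AWS has not published exact resource names here:

```python
import json

# Shape of a Kubernetes pod consuming devices via Dynamic Resource
# Allocation (DRA). The resourceClaims / resources.claims fields are
# upstream Kubernetes DRA (recent releases); the claim-template name
# for a Trainium UltraServer is a hypothetical placeholder.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "trn3-training-job"},
    "spec": {
        "containers": [{
            "name": "trainer",
            "image": "my-training-image:latest",  # placeholder image
            # The container consumes the claim declared below by name.
            "resources": {"claims": [{"name": "accelerators"}]},
        }],
        # Binds the pod to a ResourceClaimTemplate that, under Neuron DRA,
        # could resolve to a topology-aware slice (or all) of an UltraServer.
        "resourceClaims": [{
            "name": "accelerators",
            "resourceClaimTemplateName": "trainium-ultraserver-claim",  # hypothetical
        }],
    },
}
print(json.dumps(pod, indent=2))
```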
Both Neuron Explorer and Neuron DRA are designed to simplify cluster management and give users fine-grained control over how Trainium resources are assigned and used. In a nutshell, AWS is working to make its Trainium-based platforms far more widely used than they are today, in an effort to make them more competitive against CUDA-based offerings from Nvidia.
Anton Shilov is a contributing writer at Tom's Hardware. Over the past couple of decades, he has covered everything from CPUs and GPUs to supercomputers, and from modern process technologies and the latest fab tools to high-tech industry trends.