
This week, Amazon Web Services released its 3rd-generation Trainium accelerator for AI training and inference, along with the accompanying Trn3 UltraServer rack-scale solutions. For the first time, Trn3 Gen2 UltraServer rack-scale machines rely solely on AWS in-house hardware, including CPUs, AI accelerators, switching silicon, and connectivity fabrics, signaling that the company has adopted Nvidia's vertically integrated hardware strategy.
AWS claims that its Trainium3 processor offers roughly 2X higher performance and 4X better energy efficiency than Trainium2: each accelerator delivers up to 2.517 PFLOPS (MXFP8), beating Nvidia's H100 but trailing the B200, and is accompanied by 144 GB of HBM3E with 4.9 TB/s of bandwidth. Meanwhile, Trn3 Gen2 UltraServers scale to 144 accelerators for about 0.36 ExaFLOPS of FP8 performance, which puts them on par with Nvidia's GB300 NVL72 rack-scale solution. Nonetheless, Nvidia's hardware still looks more universal than AWS's.
To catch up with Nvidia, Amazon also announced major updates to its Neuron software stack to make Trainium-based platforms easier to use, allow standard machine-learning frameworks to run natively on the hardware, give developers greater control over performance, and open access to low-level tuning for experts.
From a performance standpoint, the larger configuration with 144 Trainium3 accelerators delivers 362.5 PFLOPS of dense MXFP8/MXFP4 performance (on par with the GB300 NVL72), 96.624 PFLOPS of BF16/FP16/TF32 throughput, and 26.352 PFLOPS in FP32. The system is also equipped with 21 TB of HBM3E memory with an aggregate bandwidth of 705.6 TB/s, leaving Nvidia's GB300 NVL72 behind in this metric.
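As a sanity check, the rack-level numbers follow directly from the per-accelerator figures AWS quotes for Trainium3; a minimal Python sketch (values taken from the published per-chip specifications) reproduces them:

```python
# Derive Trn3 Gen2 UltraServer rack-level figures from AWS's
# per-accelerator Trainium3 specifications.
ACCELERATORS = 144

pflops_mxfp8 = 2.517   # dense MXFP8 PFLOPS per Trainium3
hbm_capacity_gb = 144  # GB of HBM3E per Trainium3
hbm_bw_tbps = 4.9      # TB/s of HBM3E bandwidth per Trainium3

print(f"Compute:   {ACCELERATORS * pflops_mxfp8:.1f} PFLOPS")        # 362.4 (~362.5 with the unrounded per-chip figure)
print(f"Memory:    {ACCELERATORS * hbm_capacity_gb / 1000:.1f} TB")  # 20.7 (~21 TB)
print(f"Bandwidth: {ACCELERATORS * hbm_bw_tbps:.1f} TB/s")           # 705.6
```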
In general, the Trn3 Gen2 UltraServer appears very competitive against Nvidia's GB300 NVL72 in terms of FP8 performance. FP8 is gaining traction for training, so betting on this format makes a lot of sense. Of course, Nvidia has an ace up its sleeve in the form of NVFP4, which is positioned for both inference and training, and armed with this format, the company's Blackwell-based machines are unbeatable. The same applies to BF16: Trainium3's throughput improved over Trainium2's, but not enough to beat Nvidia's Blackwell.
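For context beyond the article: both MXFP4 and Nvidia's NVFP4 store individual elements in the 4-bit E2M1 format and differ mainly in block size and scale-factor encoding. A quick enumeration shows how coarse the raw 4-bit element grid is, and why a shared per-block scale factor is essential to recover usable dynamic range:

```python
# Enumerate the non-negative values representable by FP4 E2M1
# (2 exponent bits, 1 mantissa bit, exponent bias 1). Both MXFP4 and
# NVFP4 attach a shared scale factor to each small block of such
# elements, which is what makes 4-bit inference and training viable.
def e2m1_values():
    vals = set()
    for exp in range(4):
        for man in range(2):
            if exp == 0:
                vals.add(man * 0.5)                     # subnormals: 0.0, 0.5
            else:
                vals.add((1 + man * 0.5) * 2 ** (exp - 1))
    return sorted(vals)

print(e2m1_values())  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```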
Overall, while the AWS Trn3 Gen2 UltraServer with 144 Trainium3 accelerators looks quite competitive in FP8 against Nvidia's Blackwell-based NVL72 machines, Nvidia's solution remains the more universal one.
In addition to rolling out new AI hardware, AWS announced a broad expansion of its AWS Neuron software stack at its annual re:Invent conference this week. AWS positions this release as a shift toward openness and developer accessibility: the update promises to make Trainium platforms easier to adopt, let standard machine-learning frameworks run directly on Trainium hardware, give users deeper control over performance, and even expose low-level optimization paths for experts.
A major addition is native PyTorch integration through an open-source backend named TorchNeuron. Using PyTorch's PrivateUse1 mechanism, Trainium now appears as a native device type, which enables existing PyTorch code to execute without modification. TorchNeuron also supports interactive eager execution, torch.compile, and distributed features such as FSDP and DTensor, and it works with popular ecosystems including TorchTitan and Hugging Face Transformers. Access to this feature is currently restricted to select users as part of the private preview program.
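TorchNeuron itself is in private preview, so its exact module and device names are not public. Purely as an illustration, and assuming the backend registers Trainium under a hypothetical "neuron" device string via PrivateUse1, device-transparent PyTorch code would look something like this:

```python
import torch
import torch.nn as nn

# Assumption: the TorchNeuron backend (private preview) registers
# Trainium as a "neuron" device type via PyTorch's PrivateUse1 hook;
# the real device string may differ. With such a backend installed,
# ordinary PyTorch code needs no Trainium-specific changes.
device = "neuron"

model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 10)).to(device)
model = torch.compile(model)  # AWS says torch.compile works alongside eager execution

x = torch.randn(8, 1024, device=device)
model(x).sum().backward()     # a plain eager-style step; FSDP/DTensor layer on top of this
```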
AWS also introduced an updated Neuron Kernel Interface (NKI) that gives developers direct control over hardware behavior, including instruction-level programming, explicit memory management, and fine-grained scheduling, effectively exposing Trainium's instruction set to kernel authors. In addition, the company has released the NKI Compiler as open source under Apache 2.0. The programming interface is available publicly, while the compiler itself remains in limited preview.
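NKI has been publicly documented for some time; the tiny element-wise kernel below follows the pattern of AWS's earlier NKI getting-started examples (module paths and decorator names are taken from that documentation and may shift in the updated release):

```python
import neuronxcc.nki as nki
import neuronxcc.nki.language as nl

@nki.jit
def add_kernel(a_input, b_input):
    # Allocate the output in device HBM. Shapes are assumed small enough
    # to fit a single tile; real kernels iterate over tiles of at most
    # 128 partitions.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    a_tile = nl.load(a_input)                  # HBM -> on-chip SBUF
    b_tile = nl.load(b_input)
    nl.store(c_output, value=a_tile + b_tile)  # compute, then SBUF -> HBM
    return c_output
```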
AWS also released Neuron Explorer, a debugging and tuning toolkit that lets software developers and performance engineers improve how their models run on Trainium. The toolkit traces execution from high-level framework calls all the way down to individual accelerator instructions, and offers layered profiling, source-level visibility, integration with development environments, and AI-guided suggestions for performance tuning.
Finally, AWS introduced Neuron Dynamic Resource Allocation (DRA) to integrate Trainium directly into Kubernetes without the need for custom schedulers. Neuron DRA relies on the native Kubernetes scheduler and adds hardware-topology awareness, enabling complete UltraServers to be allocated as a single resource and hardware to be flexibly assigned to each workload. Neuron DRA supports Amazon EKS, SageMaker HyperPod, and UltraServer deployments, and is provided as open-source software with container images published in the AWS ECR public registry.
Both Neuron Explorer and Neuron DRA are designed to simplify cluster management and give users fine-grained control over how Trainium resources are assigned and used. In a nutshell, AWS is moving to make its Trainium-based platforms far more widely used than they are today, in an effort to make them more competitive against CUDA-based offerings from Nvidia.