Nvidia unveils details of new 88-core Vera CPUs positioned to compete with AMD and Intel – new Vera CPU rack features 256 liquid-cooled chips that deliver up to



The chip fully supports Confidential Computing, a notable advance over Grace that allows for fully protected CPU+GPU domains. The CPU also features an NVLink-C2C die-to-die interface with up to 1.8 TB/s of throughput, a doubling of Grace’s 900 GB/s interconnect and seven times faster than PCIe 6.0. It also supports two-processor (2P) configurations.
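The interconnect ratios quoted above can be sanity-checked with a little arithmetic. The sketch below is illustrative only: it assumes the PCIe 6.0 baseline is an x16 link at 64 GT/s per lane, counted in both directions with protocol overhead ignored, which the article does not state. The constant and function names are invented for this example.

```python
# Figures quoted in the article, in GB/s (both directions combined).
VERA_C2C_GBPS = 1800.0   # "up to 1.8 TB/s"
GRACE_C2C_GBPS = 900.0   # "Grace's 900 GB/s"

# Assumption (not stated in the article): the PCIe 6.0 baseline is an x16 link
# at 64 GT/s per lane, with encoding and protocol overhead ignored.
PCIE6_GT_PER_LANE = 64
PCIE6_LANES = 16


def pcie_bidirectional_gbps(gt_per_lane: float, lanes: int) -> float:
    """Raw bandwidth of a PCIe link in GB/s, summed over both directions."""
    per_direction = gt_per_lane * lanes / 8  # gigabits/s -> gigabytes/s
    return 2 * per_direction


pcie6_gbps = pcie_bidirectional_gbps(PCIE6_GT_PER_LANE, PCIE6_LANES)
ratio_vs_grace = VERA_C2C_GBPS / GRACE_C2C_GBPS
ratio_vs_pcie6 = VERA_C2C_GBPS / pcie6_gbps

print(f"PCIe 6.0 x16, bidirectional:  {pcie6_gbps:.0f} GB/s")
print(f"Vera vs. Grace NVLink-C2C:    {ratio_vs_grace:.1f}x")
print(f"Vera NVLink-C2C vs. PCIe 6.0: {ratio_vs_pcie6:.2f}x")
```

Under those assumptions the PCIe 6.0 link works out to 256 GB/s, so 1.8 TB/s is roughly seven times that figure, which is consistent with the claim as long as both numbers are counted bidirectionally.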

Overall, Vera supports the full suite of technologies expected from a modern data center processor, including PCIe 6.0 and CXL 3.1, but with a bandwidth- and latency-focused compute design that positions it uniquely well for use in AI workflows.

Grace has already served as a fundamental building block in many Nvidia GPU+CPU systems, including some of the fastest AI supercomputers on the planet, but Nvidia’s expanded goal is to leverage Vera in pure-play CPU racks that can be more widely deployed.

The Vera CPU rack meets that goal with 256 liquid-cooled Vera CPUs paired with 74 BlueField-4 DPUs and ConnectX SuperNIC networking. The rack weighs in with up to 400 TB of LPDDR5 and 300 TB/s of aggregate memory throughput. That feeds the rack's 45,056 threads, which Nvidia says can support 22,500 concurrent CPU environments running independently.
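The rack-level figures can be broken down per CPU with simple division. This is a rough sketch, not Nvidia's own accounting: it assumes two threads per core (45,056 threads across 256 CPUs implies 176 per CPU), decimal units, and resources spread evenly across the rack. All variable names are illustrative.

```python
# Rack-level figures quoted in the article.
CPUS_PER_RACK = 256
CORES_PER_CPU = 88
RACK_MEMORY_TB = 400.0        # "up to 400 TB"
RACK_BANDWIDTH_TBPS = 300.0   # "300 TB/s of aggregate memory throughput"
CPU_ENVIRONMENTS = 22_500     # Nvidia's quoted figure, presumably rounded

# Assumption: two hardware threads per core, i.e. 176 threads per CPU.
THREADS_PER_CORE = 2

threads_per_cpu = CORES_PER_CPU * THREADS_PER_CORE
total_threads = CPUS_PER_RACK * threads_per_cpu

# Assumption: memory and bandwidth are divided evenly between the CPUs.
memory_per_cpu_tb = RACK_MEMORY_TB / CPUS_PER_RACK
bandwidth_per_cpu_tbps = RACK_BANDWIDTH_TBPS / CPUS_PER_RACK

threads_per_environment = total_threads / CPU_ENVIRONMENTS

print(f"Threads per rack:        {total_threads:,}")
print(f"Memory per CPU:          {memory_per_cpu_tb:.2f} TB")
print(f"Bandwidth per CPU:       {bandwidth_per_cpu_tbps:.2f} TB/s")
print(f"Threads per environment: {threads_per_environment:.2f}")
```

The thread count reproduces the article's 45,056, and the environment count works out to almost exactly two threads (one full core) per environment.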

Nvidia shared benchmarks across a wide range of workloads, touting performance improvements of 1.8x to 2.2x over Grace in scripting, compilation, data analytics, graph analytics, and HPC workloads, among others.

Naturally, one would expect this system to be deployed at Meta, which recently announced its partnership with Nvidia for CPU-only systems, but Nvidia says it will also offer the Vera CPU rack system to hyperscalers, including Oracle, CoreWeave, Nebius, Alibaba, and others.

A broad range of OEMs and ODMs, including industry heavyweights like Dell, HPE, Lenovo, Supermicro, Foxconn, and many others, will also provide single- and dual-socket servers for the broader market, covering a wide range of use cases. The Vera CPUs will also be used for Nvidia HGX NVL8 systems.

Perhaps most importantly, these racks will also serve as an integral part of Nvidia’s broader Vera Rubin platform, which features seven chips in total, including the Rubin GPU, NVLink6 Switch for rack-scale interconnect, ConnectX-9 SuperNIC for networking, BlueField-4 DPU, Spectrum-X 102.4T Co-packaged Optics switch, and Nvidia’s Groq 3 LPUs.

The Vera CPUs are in full production now and are slated for deliveries beginning in the second half of this year.


Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.

thestryker

It will be interesting to see what sort of impact these make on the market as they seem quite good on paper unless you need super high core counts. I didn't see anything regarding the amount of PCIe lanes available and it'll be interesting to see what the core latencies end up looking like.

Just to put the memory bandwidth per core into some context: The highest x86 available today in that general core count is Intel's 72 and 96 core Xeon 6 using 12 channel MRDIMMs. This equates to ~11.7 GB/s and 8.8 GB/s per core respectively.
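The per-core figures in this comment can be reproduced from peak theoretical channel bandwidth. The sketch below assumes MRDIMMs running at 8800 MT/s and a 64-bit (8-byte) data path per channel. Neither value is given in the comment, but together they match its numbers. The function name is illustrative.

```python
# Assumptions (not stated in the comment): MRDIMM-8800 and a 64-bit data path
# per memory channel. These give peak theoretical, not measured, bandwidth.
MRDIMM_MT_PER_S = 8800
BYTES_PER_TRANSFER = 8
CHANNELS = 12


def peak_bandwidth_gbps(channels: int, mt_per_s: int) -> float:
    """Peak theoretical memory bandwidth in GB/s (decimal units)."""
    return channels * mt_per_s * BYTES_PER_TRANSFER / 1000


socket_gbps = peak_bandwidth_gbps(CHANNELS, MRDIMM_MT_PER_S)
per_core_gbps = {cores: socket_gbps / cores for cores in (72, 96)}

print(f"Per-socket peak: {socket_gbps:.1f} GB/s")
for cores, gbps in per_core_gbps.items():
    print(f"{cores} cores: {gbps:.1f} GB/s per core")
```

With those inputs the socket peaks at 844.8 GB/s, which gives about 11.7 GB/s per core at 72 cores and 8.8 GB/s per core at 96 cores.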

Notton

"To meet those goals, the company designed an 88-core CPU with 144 threads, an increase over the first-gen Grace’s 72 cores."

The slide says 88 cores and 176 threads. Also the die shot on the slide, if it's the actual product, looks like it's configured as 13×7 = 91. I'm guessing they disable the cores that don't work?

bit_user

The article said: the firm does stipulate that the new Olympus cores found on Vera are ‘Nvidia designed,’ signaling that the company has made custom modifications to the reference design.

What??? No, it does not mean that! There's been every indication these are fully custom cores, and not Nvidia's first!
https://en.wikipedia.org/wiki/Project_Denver
https://www.uio.no/studier/emner/matnat/ifi/IN5050/v25/slides/denver_carmel_uarch.pdf

The article said: The Arm v9.2-A Olympus cores feature spatial multi-threading, which physically isolates the various components of the pipeline by not time-slicing the key elements, like the execution units, caches and register files, with the other thread running on the same core.

This almost sounds like they're trying to spin a weakness into a strength. The reason other CPUs have things like low watermarks to constrain how much one thread can use of a competitively-shared resource is so that the other thread doesn't get starved out. On the whole, that's better for system throughput, even if it means shaving off a little bit of one thread's performance, every now and then.

The article said: Spatial Multi-Threading increases Instruction Level Parallelism (ILP), throughput, and performance predictability by pulling instructions from other threads when execution elements are idle, thus ensuring full utilization.

Zen 5 (not sure about earlier ones) does have a specific operational mode for providing exclusive access to a single thread. This mode gets entered and exited dynamically, at runtime, based on how many threads are scheduled on a given core, at a given point in time. This gives the OS thread scheduler the ability to boost select threads by prioritizing them for exclusive use of a core.

The article said: in a standard SMT implementation the threads essentially take turns running on a single core.

They do not. The best info on how Zen 5 partitions resources between SMT threads is probably here:
https://chipsandcheese.com/p/a-video-interview-with-mike-clark-chief-architect-of-zen-at-amd
Only very simple SMT implementations, like some GPUs, will do a true round-robin between unblocked threads.

The article said: Nvidia arranges all 88 cores in a single domain, so there are no latency-inducing NUMA eccentricities to be found, in stark contrast to current high core-count x86 competitors. This has dramatic implications for latency, predictability, bandwidth, and ease-of-programmability.

That's an exaggeration. Modern x86 server CPUs show very few NUMA effects, until you get to multi-socket. At that point, Nvidia's Rubin will be suffering as well.
https://chipsandcheese.com/p/evaluating-uniform-memory-access

The article said: NVLink-C2C die-to-die interface with up to 1.8 TB/s of throughput, a doubling of Grace’s 900 GB/s interconnect and seven times faster than PCIe 6.0. It also supports two-processor (2P) configurations.

Worth noting that those figures include both directions. At least the PCIe 6.0 figure was computed the same way. The NVLink-C2C interface supports 2P configurations, but you can scale up to way more CPUs over regular NVLink.

The article said: That feeds the 45,056 threads, which Nvidia says supports 22,500 concurrent CPU environments running independently.

Oh, that's interesting. Makes me wonder if both SMT threads per CPU must be in the same Confidential Computing domain.

bit_user

thestryker said: Just to put the memory bandwidth per core into some context: The highest x86 available today in that general core count is Intel's 72 and 96 core Xeon 6 using 12 channel MRDIMMs. This equates to ~11.7 GB/s and 8.8 GB/s per core respectively.

Well, Intel's current server platform seems more focused on scaling up to 128 P-cores and feeding x96 PCIe 5.0 lanes.
https://www.intel.com/content/www/us/en/products/sku/240777/intel-xeon-6980p-processor-504m-cache-2-00-ghz/specifications.html

When Nvidia compared them, they probably used the 128-core 6980P to get their figure of "3x memory bandwidth per core". Nvidia will always find the most flattering point of comparison.

If the per-CPU cost of deploying Rubin is much lower, then you don't need as much performance per-CPU, because you can just scale up to more of them. NVLink supports that, via better scalability.

thestryker

bit_user said: When Nvidia compared them, they probably used the 128-core 6980P to get their figure of "3x memory bandwidth per core". Nvidia will always find the most flattering point of comparison.

That still doesn't math out because even comparing RDIMMs instead of MRDIMMs it's 4.8 GB/s, but I wouldn't put it past them to round up for marketing materials.
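The "3x" discussion can be checked roughly as follows. Every input is an assumption layered on figures from the thread: Vera's per-CPU bandwidth is taken as the rack's 300 TB/s divided evenly across 256 CPUs, and the 128-core Xeon is modeled with 12 channels of DDR5-6400 RDIMMs or MRDIMM-8800 at peak theoretical rates. Nvidia's actual comparison basis is not given in the source.

```python
# Assumption: Vera per-CPU bandwidth is the rack aggregate split evenly.
VERA_RACK_GBPS = 300_000.0   # 300 TB/s, from the article
VERA_CPUS = 256
VERA_CORES = 88

# Assumptions for the x86 side: 12 channels, 8 bytes per transfer,
# DDR5-6400 RDIMMs or MRDIMM-8800, on a 128-core part.
XEON_CORES = 128
XEON_CHANNELS = 12
BYTES_PER_TRANSFER = 8


def per_core_gbps(mt_per_s: int) -> float:
    """Peak theoretical GB/s per core for the assumed 128-core Xeon."""
    socket_gbps = XEON_CHANNELS * mt_per_s * BYTES_PER_TRANSFER / 1000
    return socket_gbps / XEON_CORES


vera_per_core = VERA_RACK_GBPS / VERA_CPUS / VERA_CORES
xeon_rdimm_per_core = per_core_gbps(6400)
xeon_mrdimm_per_core = per_core_gbps(8800)

ratio_rdimm = vera_per_core / xeon_rdimm_per_core
ratio_mrdimm = vera_per_core / xeon_mrdimm_per_core

print(f"Vera:          {vera_per_core:.1f} GB/s per core")
print(f"Xeon (RDIMM):  {xeon_rdimm_per_core:.1f} GB/s per core, ratio {ratio_rdimm:.2f}x")
print(f"Xeon (MRDIMM): {xeon_mrdimm_per_core:.1f} GB/s per core, ratio {ratio_mrdimm:.2f}x")
```

Under these assumptions the RDIMM case lands a little under 3x and the MRDIMM case near 2x, which fits the observation that the figure only works with rounding up.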
