
The chip fully supports Confidential Computing, a notable advance over Grace that allows for fully protected CPU+GPU domains. The CPU also features an NVLink-C2C die-to-die interface with up to 1.8 TB/s of throughput, a doubling of Grace’s 900 GB/s interconnect and seven times faster than PCIe 6.0. It also supports two-processor (2P) configurations.
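The "seven times" figure is consistent with a quick back-of-envelope check, under two assumptions not stated explicitly in the piece: that both numbers count traffic in both directions, and that the PCIe 6.0 comparison is against a full x16 link.

```python
# Back-of-envelope check of the quoted interconnect figures.
# Assumptions: both numbers are bidirectional aggregates, and the
# PCIe 6.0 point of comparison is an x16 link.
nvlink_c2c_gbs = 1800  # 1.8 TB/s aggregate, per the article

# PCIe 6.0: 64 GT/s per lane ~= 8 GB/s per lane per direction.
pcie6_x16_gbs = 8 * 16 * 2  # 256 GB/s bidirectional for x16

print(f"NVLink-C2C / PCIe 6.0 x16: {nvlink_c2c_gbs / pcie6_x16_gbs:.1f}x")
# ~7.0x, consistent with the "seven times faster" claim
```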
Overall, Vera supports the full suite of technologies expected from a modern data center processor, including PCIe 6.0 and CXL 3.1 support, but with a bandwidth- and latency-focused compute design that positions it uniquely well for use in AI workflows.
Grace has already served as a fundamental building block in many Nvidia GPU+CPU systems, including some of the fastest AI supercomputers on the planet, but Nvidia's expanded goal is to leverage Vera in pure-play CPU racks that can be more widely deployed.
The Vera CPU rack meets that goal with 256 liquid-cooled Vera CPUs paired with 74 Bluefield-4 DPUs and ConnectX SuperNIC networking. The rack weighs in with up to 400 TB of LPDDR5 and 300 TB/s of aggregate memory throughput. That feeds the 45,056 threads, which Nvidia says supports 22,500 concurrent CPU environments running independently.
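The rack-level figures hang together arithmetically; a minimal sketch, assuming two hardware threads per core and two threads per "environment" (the mapping Nvidia's numbers imply but don't spell out):

```python
# Cross-checking the Vera rack specs quoted above.
cpus = 256
cores_per_cpu = 88
threads_per_core = 2  # 176 threads per CPU with spatial multi-threading

total_threads = cpus * cores_per_cpu * threads_per_core
print(total_threads)  # 45056, matching the stated thread count

# ~22,500 concurrent environments implies two threads per environment
# (an assumption; Nvidia doesn't spell out the mapping).
print(total_threads // 2)  # 22528, rounded to "22,500" in marketing

# Per-CPU share of the 300 TB/s aggregate memory throughput
print(f"{300 / cpus * 1000:.0f} GB/s per CPU")  # ~1172 GB/s
```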
Nvidia shared benchmarks in a wide range of workloads, touting a 1.8x to 2.2x performance improvement over Grace in scripting, compilation, data analytics, graph analytics, and HPC workloads, among others.
Naturally, one would expect this system to be deployed at Meta, which recently announced its partnership with Nvidia for CPU-only systems, but Nvidia says it will also offer the Vera CPU rack system to hyperscalers, including Oracle, CoreWeave, Nebius, Alibaba, and others.
A broad range of OEMs and ODMs will also provide single- and dual-socket servers for a wide range of use cases in the broader market, including industry heavyweights like Dell, HPE, Lenovo, Supermicro, Foxconn, and many others. The Vera CPUs will also be used for Nvidia HGX NVL8 systems.
Perhaps most importantly, these racks will also serve as an integral part of Nvidia's broader Vera Rubin platform, which features seven chips in total, including the Vera CPU, the Rubin GPU, the NVLink 6 Switch for rack-scale interconnect, the ConnectX-9 SuperNIC for networking, the BlueField-4 DPU, and the Spectrum-X 102.4T co-packaged optics switch.
The Vera CPUs are in full production now and are slated for deliveries beginning in the second half of this year.
Paul Alcorn is the Editor-in-Chief for Tom's Hardware US. He also writes news and reviews on CPUs, storage, and enterprise hardware.
thestryker
It will be interesting to see what sort of impact these make on the market, as they seem quite good on paper unless you need super-high core counts. I didn't see anything regarding the number of PCIe lanes available, and it'll be interesting to see what the core latencies end up looking like.

Just to put the memory bandwidth per core into some context: the highest-bandwidth x86 option available today in that general core-count range is Intel's 72- and 96-core Xeon 6 using 12-channel MRDIMMs. This equates to ~11.7 GB/s and 8.8 GB/s per core, respectively.
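Those per-core figures reproduce from public DDR5 numbers; a minimal sketch, assuming 8,800 MT/s MRDIMMs and an 8-byte data path per channel (both assumptions, not stated in the post):

```python
# Peak memory bandwidth per core for 12-channel Xeon 6 with MRDIMMs.
channels = 12
mts = 8800          # assumed MRDIMM transfer rate
bytes_per_xfer = 8  # 64-bit data path per channel

total_gbs = channels * mts * bytes_per_xfer / 1000  # 844.8 GB/s
for cores in (72, 96):
    print(f"{cores} cores: {total_gbs / cores:.1f} GB/s per core")
# 72 cores: 11.7 GB/s, 96 cores: 8.8 GB/s -- matching the comment
```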
Notton
"To meet those goals, the company designed an 88-core CPU with 144 threads, an increase over the first-gen Grace's 72 cores."

The slide says 88 cores and 176 threads. Also, the die shot on the slide, if it's the actual product, looks like it's configured as 13×7 = 91. I'm guessing they disable the cores that don't work?
bit_user
The article said: "the firm does stipulate that the new Olympus cores found on Vera are 'Nvidia designed,' signaling that the company has made custom modifications to the reference design."

What??? No, it does not mean that! There's been every indication these are fully custom cores, and not Nvidia's first!
https://en.wikipedia.org/wiki/Project_Denver
https://www.uio.no/studier/emner/matnat/ifi/IN5050/v25/slides/denver_carmel_uarch.pdf

The article said: "The Arm v9.2-A Olympus cores feature spatial multi-threading, which physically isolates the various components of the pipeline by not time-slicing the key elements, like the execution units, caches and register files, with the other thread running on the same core."

This almost sounds like they're trying to spin a weakness into a strength. The reason other CPUs have things like low watermarks to constrain how much one thread can use of a competitively shared resource is so that the other thread doesn't get starved out. On the whole, that's better for system throughput, even if it means shaving off a little bit of one thread's performance every now and then.

The article said: "Spatial Multi-Threading increases Instruction Level Parallelism (ILP), throughput, and performance predictability by pulling instructions from other threads when execution elements are idle, thus ensuring full utilization."

Zen 5 (not sure about earlier ones) does have a specific operational mode for providing exclusive access to a single thread. This mode gets entered and exited dynamically, at runtime, based on how many threads are scheduled on a given core at a given point in time. This gives the OS thread scheduler the ability to boost select threads by prioritizing them for exclusive use of a core.

The article said: "in a standard SMT implementation the threads essentially take turns running on a single core."

They do not. The best info on how Zen 5 partitions resources between SMT threads is probably here:
https://chipsandcheese.com/p/a-video-interview-with-mike-clark-chief-architect-of-zen-at-amd
Only very simple SMT implementations, like some GPUs, will do a true round-robin between unblocked threads.

The article said: "Nvidia arranges all 88 cores in a single domain, so there are no latency-inducing NUMA eccentricities to be found, in stark contrast to current high core-count x86 competitors. This has dramatic implications for latency, predictability, bandwidth, and ease-of-programmability."

That's an exaggeration. Modern x86 server CPUs show very few NUMA effects until you get to multi-socket. At that point, Nvidia's Rubin will be suffering as well.
https://chipsandcheese.com/p/evaluating-uniform-memory-access

The article said: "NVLink-C2C die-to-die interface with up to 1.8 TB/s of throughput, a doubling of Grace's 900 GB/s interconnect and seven times faster than PCIe 6.0. It also supports two-processor (2P) configurations."

Worth noting that those figures include both directions. At least the PCIe 6.0 figure was computed the same way. The NVLink-C2C interface supports 2P configurations, but you can scale up to way more CPUs over regular NVLink.

The article said: "That feeds the 45,056 threads, which Nvidia says supports 22,500 concurrent CPU environments running independently."

Oh, that's interesting. Makes me wonder if both SMT threads per CPU must be in the same Confidential Computing domain.
bit_user
thestryker said: "Just to put the memory bandwidth per core into some context: The highest x86 available today in that general core count is Intel's 72 and 96 core Xeon 6 using 12 channel MRDIMMs. This equates to ~11.7 GB/s and 8.8 GB/s per core respectively."

Well, Intel's current server platform seems more focused on scaling up to 128 P-cores and feeding 96 PCIe 5.0 lanes.
https://www.intel.com/content/www/us/en/products/sku/240777/intel-xeon-6980p-processor-504m-cache-2-00-ghz/specifications.html

When Nvidia compared them, they probably used the 128-core 6980P to get their figure of "3x memory bandwidth per core". Nvidia will always find the most flattering point of comparison. If the per-CPU cost of deploying Rubin is much lower, then you don't need as much performance per CPU, because you can just scale up to more of them. NVLink supports that, via better scalability.
thestryker
bit_user said: "When Nvidia compared them, they probably used the 128-core 6980P to get their figure of '3x memory bandwidth per core'. Nvidia will always find the most flattering point of comparison."

That still doesn't math out, because even comparing RDIMMs instead of MRDIMMs it's 4.8 GB/s, but I wouldn't put it past them to round up for marketing materials.
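The 4.8 GB/s figure checks out if one assumes 6,400 MT/s DDR5 RDIMMs on the 128-core 6980P (an assumption; the post doesn't give the DIMM speed):

```python
# Per-core bandwidth for a 128-core Xeon 6980P with 12-channel RDIMMs.
channels, mts, bytes_per_xfer = 12, 6400, 8  # assumed 6400 MT/s RDIMMs
total_gbs = channels * mts * bytes_per_xfer / 1000  # 614.4 GB/s
print(f"{total_gbs / 128:.1f} GB/s per core")  # 4.8 GB/s
```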
abufrejoval
bit_user said: "This almost sounds like they're trying to spin a weakness into a strength. The reason other CPUs have things like low watermarks to constrain how much one thread can use of a competitively-shared resource is so that the other thread doesn't get starved out. On the whole, that's better for system throughput, even if it means shaving off a little bit of one thread's performance, every now and then."

These are CUDA CPUs! CUDA is all about marshalling large numbers of cores to work in concert without stepping on each other or getting out of line. My mental image is that of a crowd of people in a rice paddy, seeding or weeding or harvesting, where if someone needs to grab a bite or take a leak, that requires stepping out of line and brings down the efficiency of the entire crowd, so they need to ensure that doesn't happen.

"Classic" round-robin SMT takes advantage of the fact that threads aren't in unison, but do very different things on very different parts of memory, mixing FP and logic, because they are there to take advantage of pipeline and memory stalls to keep otherwise unused ALUs busy, especially on bigger designs like IBM's Power. But that's not the type of workload that's running on Vera: Vera is running the meta layers of CUDA (was it called Dynamo?), which is still about marshalling, except that it's marshalling GPUs (or racks of them), not GPU cores. Heterogeneity (apart from mixing experts) is wholly undesired within scale-out workloads; optimizing the scale-up part, on the other hand, is what they wanted to support better.

IMHO Vera isn't hubris or thinking they could do the better generic CPU core. It's really mostly about general-purpose CPUs not really fitting their needs, mostly in terms of I/O, but down to how CPU cores themselves make optimal use of their transistor budgets to support the workers on the rice paddy.
bit_user
abufrejoval said: "These are CUDA CPUs!"

No. CUDA has a very specific meaning, in that it refers to an API and hardware which is capable of being utilized via that API.

abufrejoval said: "CUDA is all about marshalling large number of cores to work in concert without stepping on each other or getting out of line"

Not everything with a lot of cores supports CUDA. Compared to a GPU, they would not be very efficiently utilized via CUDA, if it did have a CPU backend which ran atop them.

abufrejoval said: "my mental image is that of a crowd of people in a rice paddy, seeding or weeding or harvesting, where if someone needs to grab a bite or take a leak and that requires stepping out of line, that brings down the efficiency of the entire crowd, so they need to ensure that doesn't happen."

For several generations, Nvidia has been working to make SIMD more efficient. You might be interested in reading about some of those efforts.
https://chipsandcheese.com/p/shader-execution-reordering-nvidia-tackles-divergence

abufrejoval said: "that's not the type of workload that's running on Vera: Vera is running the meta layers of CUDA, (was it called Dynamo?), which is still about marshalling, except that it's marshalling GPUs (or racks of them) not GPU cores."

What do you mean by "the meta layers of CUDA"? In GPU contexts, these CPUs exist for implementing processing that's inefficient to run on GPUs (i.e., doesn't map well to CUDA). They're complementary to GPUs! They also do control- and management-plane operations plus I/O, and host memory pools that can be distributed throughout the NVLink fabric. In CPU contexts, like what Meta is using them for, they're for running generic cloud workloads.

abufrejoval said: "Heterogeneosity (apart from mixing experts) is wholly undesired within scale-out workloads, optimizing the scale-up part on the other hand is what they wanted to support better."

If all they wanted was more GPUs, they'd have stuck to making GPUs and would have homogeneous compute fabrics.

abufrejoval said: "IMHO Vera isn't hubris or thinking they could do the better generic CPU core."

It is, because it's the ARM ISA. ARM does not let customers implement their own nonstandard extensions. The only sort-of exceptions to that were where Apple made extensions which ARM later added to the ISA (or maybe it was a concurrent thing). The only reasons to implement it themselves are because they think they can better tailor it to their workloads and/or save money on die & licensing costs.
JarredWaltonGPU
I know it's not going to happen any time soon, but I can't help but wonder what the potential performance of a 32-core/64-thread variant tailored to the consumer market might be — or even a 16-core/32-thread version. Obviously, getting enough memory and bandwidth is a consideration, but get something like that up and running Windows and Linux, and maybe we could have a serious x86 competitor for Steam Decks and the like.

On a related note, I can't imagine a world where the MediaTek partnership with Nvidia results in CPUs that are anywhere near as competitive as Vera is supposed to be in the data center space. Hopefully, MediaTek and Nvidia prove me wrong and deliver something interesting and fast. LOL