
Ian Buck: I'm never going to say no to how fast we can innovate. But I think we can do a lot better [with LPU decode first].
Journalist 2: Jensen was asked about the target market, target use case. And he said he was very careful about how he answered that. He didn't want to position it as a direct drop-in replacement for x86.
Ian Buck: We only are going to build one Vera SKU […] other people are going to build x86 SKUs […] the world is not going to be served by one SKU of CPU. And that's not our intention. The intention is that we'd like to solve a workload problem. It's not designed to be a dollar-per-vCPU chip. The amount of technology and, frankly, just the cost to build something that solves that critical workload makes it not for that market; it's a bad gaming chip.
But it is inspired by single-threaded performance. You may not need 88 cores, but it's actually a unique workload because in agentic AI, it's in the critical path for both training and running these models. When you're training a model to code, for example, you start from a model that needs to learn how to code better. And as you're training on Vera Rubin, you're halfway through training, and you say: go write a program that computes the Fibonacci series, or solves a New York Times crossword puzzle.
The AI model will then try to write that program. It then needs to score how well it did. We're not going to run that Python on the GPU. It's a CPU job. The GPU tells the CPU to go run it. So it opens up a sandbox, boots a Linux instance, starts the Python interpreter, and compiles and runs that code. And it's got to score it — how well did it do? Lines of code, accuracy, did it crash? — very quickly. So that all those results from the training run can get back to the GPU in order to do the next iteration of training.
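The CPU-side scoring loop Buck describes (open a sandbox, run the model-generated program, score it, hand the result back to the trainer) can be sketched in a few lines. This is a minimal illustration, not Nvidia's actual pipeline: the reward terms (crash, output match, brevity) and the Fibonacci example are assumptions chosen to mirror the criteria he lists.

```python
# Minimal sketch of the CPU-side scoring step: execute candidate code in a
# subprocess with a timeout, then compute a scalar reward for the trainer.
# Reward terms (crash, correctness, brevity) are illustrative assumptions.
import subprocess
import sys

def score_generated_code(code: str, expected_stdout: str, timeout_s: float = 5.0) -> float:
    """Run model-generated Python and return a reward for the training loop."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # ran too long and had to be cut off: no reward
    if result.returncode != 0:
        return 0.0  # crashed
    correct = result.stdout.strip() == expected_stdout.strip()
    brevity = 1.0 / (1 + code.count("\n"))  # mild preference for shorter programs
    return (1.0 if correct else 0.0) + 0.1 * brevity

# Example task: print the first five Fibonacci numbers.
candidate = (
    "a, b = 0, 1\n"
    "out = []\n"
    "for _ in range(5):\n"
    "    out.append(a)\n"
    "    a, b = b, a + b\n"
    "print(out)"
)
reward = score_generated_code(candidate, "[0, 1, 1, 2, 3]")
```

In a real training run this sandbox would be far more isolated (its own Linux instance, as Buck says), and many such evaluations would run in parallel so the GPUs never go idle.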
It is in the critical path. There are ways of overlapping. You'll hear about off-policy, where maybe you're training on the N-minus-one data, doing pipelining. But you can't do too much of that because you get model drift.
So what the world is asking for, what it needs, is a really fast CPU that can generate a lot of training data while you're training in order to make the model faster, and never let [the] GPU go idle. This might be a $30 billion, gigawatt data center of GPUs. I'm not going to skimp on the CPU side and have it sit idle, or have the potential of that model come up short because I couldn't let the compilation run long enough and had to cut it off.
And then finally, when you actually deploy AI after you're done training, it's not just the AI model. The GPUs are telling the CPUs what to do. They run a SQL query, or they render an image, or they go to a website — all that's happening on the CPUs. The more tool calling that can happen within a fixed power budget, while still maintaining these interactive use cases, the more efficient it is and the more valuable those tokens are.
And lastly, as we get to the agent world, where it's not just us doing chatbot with humans in the loop, we're going to have agents talking to agents at machine speed. You just took humans out of the loop again. That can happen as fast as the computers can compute.
Journalist 2: So, just to be clear, your customers — your ODMs, your Dells, your HPs — if they want to build a system, that's what they'll get?
Ian Buck: They can build it like this [referring to the reference board brought to the Q&A], or we will ship the chip itself.
Journalist 2: So, in theory then, your partners could go off the reservation and build a gaming PC, or whatever they wanted to do with it?
Ian Buck: They could. I think they’re all highly motivated to build what Nvidia recommends [and take advantage of] the opportunities with agentic use cases.
Journalist 3: Ian, what's become of the partnership with Intel? Last year, you guys announced a partnership.
Ian Buck: We didn't talk about it in this keynote, but it's progressing. Fusion is a key part of that strategy. It's an IP block plus a chiplet that allows CPUs like x86 to talk across NVLink to our GPUs, or even other accelerators. We've announced multiple partnerships including Intel, and that is definitely progressing. It takes a little while to integrate at the silicon level. Obviously, it's pretty intimate integration. But I think we'll see some more announcements about that shortly.
Journalist 4: Is that partnership going to involve implementing Nvidia IP on Intel process technology? And if so, who's going to be doing the lifting there?
Ian Buck: There's a separation between the manufacturing of who builds the chip or chiplets from the IP integration. The integration I talk about is the IP hooking into the fabric of the processor. This [the Vera module] is actually multiple chiplets. You've got multiple I/O dies, memory interface tiles, as well as the core. If you look at the right angle, you can see one, two, three, four, five, six pieces of silicon come together. So who builds which piece, in which factory, and who does the integration, that's up to the partners. It'll be different for each integration.
Journalist 2: We asked Jensen about that, and he said, look, our bits will be coming from TSMC, the Intel stuff will be coming from wherever they choose to get it.
Journalist 4: I think we're trying to determine, is this a toe in the water to develop Nvidia IP on Intel process technology? I asked Jensen yesterday, and he said he's not excited.
Ian Buck: Obviously those questions are his [Jensen’s] domain. He's a good person to be asking about those questions.
Journalist 4: Looking at the disaggregated architecture that you've implemented with LPX, it does strike me as a situation where it almost makes more sense to pair CPX with LPX racks rather than relying on an HBM-based product like Rubin.
Ian Buck: CPX is still a good idea. It is the opportunity to improve token throughput, to get to that next tier of agents talking to agents that need to run a 1 trillion, 2 trillion parameter model with a 400,000- to 500,000-token KV input context at token rates of about 1,000 tokens per second, because there's no human in the loop. Input tokens do impact decode speed […] so 400,000 tokens of context significantly changes the token rate.
When we talk about pivoting from CPX to LPU, that's where the focus was. Right now there's a limit to how many chips […] we want to do this this year. We want to do this with Vera Rubin. Just because of that effort, this will help those agentic AI frontier labs be able to take that level of intelligence to market.
The volume AI market is offline inference, non-reasoning chatbots, recommendation systems, reasoning chatbots, multimodal, deep research. This [LPX] will not add value to all of those. Everything can be served on [Vera Rubin NVL72]. But that next tier is super important as we turn the corner, and it was important to make sure we had that brought to market this year.
CPX is an optimization, it's still a good idea [and] it would help break down the cost of the pre-fill stage, but using these GPUs for the pre-fill portion of the workload is sufficient right now.
Journalist 4: If CPX is not coming until 2027, but Vera as a standalone is available sooner than that, is the Vera CPU going to be available sooner [unintelligible] will there be something that doesn’t use AI to compare it with before 2027?
Ian Buck: The LPU racks, Groq, they would run the whole model on the LPU racks alone. That capability exists. But the challenge with doing that is you had to feed not only the entire model, but all of the KV cache and all of the multiple queries on an SRAM chip that only has 500 megabytes. This [Vera Rubin GPU] has 280 gigabytes.
So as models got bigger and contexts got larger, and you just had to keep all of that state around, as well as do all that attention math, it gets costly to have that many LPUs run a trillion-parameter model with the weights plus KV cache. It didn't need to be paired with anything. But it was very expensive. And Jensen showed that in the chart as well. You could get to 1,000 tokens per second, but the economics of doing that with that many chips just don't work.
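Buck's cost argument, that a trillion-parameter model plus long-context KV cache dwarfs 500 MB of per-chip SRAM, can be made concrete with a back-of-envelope calculation. The architecture parameters below (layer count, KV heads, head dimension, FP8 weights) are illustrative assumptions for a generic ~1T-parameter transformer, not a published spec; only the 500 MB SRAM and 280 GB HBM figures come from the transcript.

```python
# Back-of-envelope sizing: how many chips does it take just to HOLD the
# state (weights + KV cache)? All model parameters here are assumptions.
import math

def kv_cache_gb(layers, kv_heads, head_dim, context_tokens, bytes_per_elem=2):
    """K and V tensors stored per layer for every token of context, in GB."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 1e9

weights_gb = 1000.0  # assumed: 1T parameters at one byte each (FP8)
kv_gb = kv_cache_gb(layers=96, kv_heads=8, head_dim=128, context_tokens=400_000)

sram_per_chip_gb = 0.5  # ~500 MB of on-die SRAM per LPU, per the transcript
hbm_per_gpu_gb = 280    # HBM capacity quoted for the Vera Rubin GPU

lpus_needed = math.ceil((weights_gb + kv_gb) / sram_per_chip_gb)
gpus_needed = math.ceil((weights_gb + kv_gb) / hbm_per_gpu_gb)
```

Under these assumptions the KV cache alone runs to roughly 150 GB, and holding weights plus cache entirely in SRAM takes thousands of LPUs versus a handful of HBM GPUs, which is the economics problem Buck is describing.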
It has nothing to do with pre-fill. Pre-fill is just step one: how quickly can you get to your first token. After you've done that, whatever you used for pre-fill, GPUs or anything else, doesn't matter. Your token rate is all about the processors you're using to generate every token after that. So it's not as simple as that. If you just did [Groq 3] LPX, you would need a lot of chips because of all that context.
By combining the LPX with Vera Rubin, we don't need all that. We just do on the LPU what it's good at, which is basically the memory bandwidth, seven times faster than HBM. That lets the mixture-of-experts layers that are inside each expert group run here. The whole rest of the model, all the attention math, can run on the GPUs.
So instead of dozens of racks of LPX, we can deliver that level of performance with just two racks of LPX and one rack of Vera Rubin. And as a result, the token rate gets to 1,000 tokens per second, but the economics go back to the sweet spot. Tokens will be higher value for sure, tens or hundreds of tokens per second rather than thousands. And you can also deploy at data center scale to serve a market. Building it once, serving a few customers in a highly constrained environment is nice, but that doesn't create a market. You have to build an architecture that, by combining LPX with Vera Rubin at one-to-one, or one-to-two, or maybe one-to-four rack ratios, can activate a market to deploy a 100-megawatt data center, a 500-megawatt, a gigawatt data center and serve those models economically.
Journalist 4: And just to be clear, all those benefits you're talking about are on the decode side, after we're over the pre-fill hump?
Ian Buck: Pre-fill […] it’s just the first token. How long does it take to get the first token? That's all CPX was trying to optimize. It's an important problem, but you can solve it with existing hardware. You can solve it with NVL72, with the older architectures. We can reduce the time to first token, and we can also solve it today by just adding a few more […] it parallelizes very easily. But it's just the first token. [LPU decode] will increase the speed of every token after that.
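The decode-speed argument above fits a simple bandwidth-bound model: each output token requires streaming the active weights (and KV cache) through the memory system, so tokens per second is roughly bandwidth divided by active bytes per token. The bandwidth and active-weight figures below are aggregate illustrative assumptions, not measured numbers; only the seven-times-HBM SRAM bandwidth ratio comes from the transcript.

```python
# Bandwidth-bound roofline for decode: one full read of the active weights
# (KV cache traffic folded in) per generated token. Figures are assumed.
def decode_tokens_per_sec(bandwidth_gb_s, active_gb_per_token):
    return bandwidth_gb_s / active_gb_per_token

hbm_bw_gb_s = 8_000              # assumed aggregate HBM bandwidth, GB/s
sram_bw_gb_s = 7 * hbm_bw_gb_s   # "seven times faster than HBM"

active_gb = 50.0  # assumed active expert weights per token in a large MoE model

hbm_rate = decode_tokens_per_sec(hbm_bw_gb_s, active_gb)
sram_rate = decode_tokens_per_sec(sram_bw_gb_s, active_gb)
```

With these assumed numbers, HBM-class decode lands in the low hundreds of tokens per second while the SRAM-backed path clears 1,000, which is why the LPU helps every token after the first while pre-fill stays on the GPUs.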
Journalist 3: I was wondering if you could talk a little bit more about how the LPX is going to connect to other chips, both in the Nvidia ecosystem and outside the Nvidia ecosystem? What about working with CPUs that other companies might make or that customers might procure from elsewhere?
Ian Buck: When we licensed the IP, obviously there was limited stuff that we could change. But there were some last-minute changes that we were able to make to bring it to market. So this is the version [Groq 3/LP30], which is almost largely what it was. We're still using the chip-to-chip signaling that was already there.
Journalist 4: So there's no Nvidia NVLink chip-to-chip on it yet.