
- Is FP4 precision supported? Can the inference stack make use of FP4 while maintaining high accuracy?
- Does the inference runtime support speculative decoding or multi-token prediction to increase user interactivity? (A minimal draft-and-verify sketch follows this list.)
- Does the serving layer support disaggregated serving, KV-aware routing, KV-cache offloading and other optimizations?
- Does the platform support the unique workload requirements of agentic AI, including ultralow latency, high throughput and large input sequence lengths?
- Does the platform support the full lifecycle, from training and post-training to high-scale inference, across all model architectures, to ensure infrastructure fungibility and high utilization?
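To make the speculative decoding question concrete, here is a minimal, framework-agnostic sketch of the greedy draft-and-verify loop. It is illustrative only: `speculative_step`, `draft_model` and `target_model` are hypothetical names standing in for real model calls, not the API of any particular runtime.

```python
from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_model: Callable[[List[int]], int],   # cheap model: sequence -> next token
    target_model: Callable[[List[int]], int],  # expensive model: sequence -> next token
    k: int = 4,
) -> List[int]:
    """One draft-and-verify step; returns the tokens emitted this step."""
    # 1) Draft k tokens autoregressively with the cheap model.
    seq = list(prefix)
    drafted: List[int] = []
    for _ in range(k):
        tok = draft_model(seq)
        drafted.append(tok)
        seq.append(tok)

    # 2) Verify with the expensive model, accepting the longest matching
    #    prefix. A real runtime scores all k positions in a single
    #    forward pass; the loop here is for clarity only.
    seq = list(prefix)
    accepted: List[int] = []
    for tok in drafted:
        if target_model(seq) != tok:
            break
        accepted.append(tok)
        seq.append(tok)

    # 3) Emit one target-model token: the correction at the first
    #    mismatch, or a bonus token when every draft was accepted.
    accepted.append(target_model(seq))
    return accepted

# Demo: when draft and target agree, each step emits k + 1 tokens for
# roughly the latency of one target-model pass.
toy = lambda seq: (len(seq) * 7) % 13  # deterministic toy "model"
print(speculative_step([1, 2, 3], toy, toy, k=4))  # 5 tokens emitted
```

The interactivity win comes from step 2: the expensive model validates several tokens per invocation instead of producing one, so tokens reach the user faster whenever the draft model guesses well.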
Every one of these algorithmic, hardware and software optimizations must be active and integrated, or the denominator in cost per token (delivered token output) collapses. A “cheaper” GPU that delivers significantly fewer tokens per second results in a much higher cost per token. AI infrastructure that gets it right across the full stack ensures that every optimization enhances the others.
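A minimal sketch makes that arithmetic concrete. All prices and throughput figures below are hypothetical placeholders, not benchmark data; the point is only that cost per token divides system cost by delivered tokens:

```python
# Illustrative only: how cost per million tokens falls out of system
# cost and sustained throughput. All numbers are hypothetical.

def cost_per_million_tokens(hourly_cost_usd: float, tokens_per_second: float) -> float:
    """Cost to produce one million tokens at a sustained rate."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_cost_usd / tokens_per_hour * 1_000_000

# A "cheaper" GPU with low delivered throughput...
cheap = cost_per_million_tokens(hourly_cost_usd=20.0, tokens_per_second=5_000)
# ...versus a pricier system with far higher delivered throughput.
fast = cost_per_million_tokens(hourly_cost_usd=40.0, tokens_per_second=175_000)

print(f"cheap: ${cheap:.3f} per million tokens")  # ~$1.111
print(f"fast:  ${fast:.3f} per million tokens")   # ~$0.063
print(f"fast is {cheap / fast:.0f}x cheaper per token")  # roughly 18x
```

Under these made-up numbers, the system with twice the hourly cost is still an order of magnitude cheaper per token, because the denominator grew far faster than the numerator.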
The following data for the DeepSeek-R1 AI model demonstrates the gap between theoretical metrics and actual business outcomes.
Looking at compute cost alone, the NVIDIA Blackwell platform appears to cost roughly 2x more than NVIDIA Hopper, but compute cost says nothing about the output that investment buys. An analysis of raw FLOPS per dollar likewise suggests only a 2x NVIDIA Blackwell advantage over the NVIDIA Hopper architecture. The actual outcome, however, is orders of magnitude different: Blackwell delivers more than 50x greater token output per watt than Hopper, resulting in nearly 35x lower cost per million tokens.
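As a back-of-envelope check on how those ratios fit together (the 70x token-output multiple below is an assumption inferred from the quoted 2x cost and ~35x cost-per-token figures, not a published number): paying about 2x more for about 70x the token output cuts cost per token by 70 / 2 = 35x.

```python
# Back-of-envelope check (illustrative; the 70x multiple is assumed,
# inferred from the quoted 2x cost and ~35x cost-per-token ratios).
relative_cost = 2.0      # Blackwell vs. Hopper, compute cost alone
relative_tokens = 70.0   # assumed token-output multiple
print(f"cost per token: {relative_tokens / relative_cost:.0f}x lower")  # -> 35x
```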
Note: Data is sourced from NVIDIA analysis and the SemiAnalysis InferenceX v2 benchmark.
This divergence demonstrates that NVIDIA Blackwell delivers a leap in business value over the earlier Hopper generation that far outpaces any increase in system cost.
Comparing AI infrastructure on compute cost or theoretical FLOPS per dollar isn’t just insufficient; it misrepresents inference economics. As the data demonstrates, an accurate evaluation of AI infrastructure’s revenue potential and profitability requires a shift from input metrics to cost per token and delivered token output.
NVIDIA delivers the industry’s lowest token cost and highest token throughput through extreme codesign across compute, networking, memory, storage, software and partner technologies. Moreover, constant optimization of open-source inference software built on the NVIDIA platform, such as vLLM, SGLang, NVIDIA TensorRT-LLM and NVIDIA Dynamo, means that token output on existing NVIDIA infrastructure continues to increase and cost per token continues to decline long after the hardware is acquired.
Leading cloud providers and NVIDIA cloud partners are already delivering this advantage at scale. Partners such as CoreWeave, Nebius, Nscale and Together AI have deployed NVIDIA Blackwell infrastructure and optimized their stacks to bring enterprises the lowest token cost available today, with the full benefit of NVIDIA’s hardware, software and ecosystem codesign behind every interaction served.