
KV caches store previously computed attention keys and values so that LLMs don't have to recompute them at every token generation step. These caches are becoming a major memory bottleneck as context windows grow, and while traditional vector quantization can shrink them, it introduces a small overhead of a few extra bits per value from the quantization constants that must be stored alongside the compressed data. That sounds negligible, but the overhead compounds across the millions of cached values that long context windows require.
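To see why a "few extra bits per value" matters, here is a rough back-of-envelope sketch. The block size and constant widths below are illustrative assumptions about a conventional block quantizer, not figures from the paper:

```python
# Illustrative arithmetic for a conventional block quantizer that stores
# an fp16 scale and an fp16 zero-point for every block of values.
bits_per_value = 4        # payload bits per quantized value (assumed)
block_size = 32           # values per quantization block (assumed)
constant_bits = 2 * 16    # fp16 scale + fp16 zero-point per block

overhead_per_value = constant_bits / block_size       # extra bits per value
relative_overhead = overhead_per_value / bits_per_value

print(overhead_per_value)           # 1.0 extra bit per value
print(f"{relative_overhead:.0%}")   # 25% on top of the 4-bit payload
```

Under these assumptions, the stored constants add a full bit per value, a 25% tax on a 4-bit code, which is exactly the overhead TurboQuant is designed to eliminate.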
TurboQuant eliminates that overhead via a two-stage process. The first stage uses a technique called PolarQuant, which converts data vectors from standard Cartesian coordinates into polar coordinates. This separates each vector into a radius (representing magnitude) and a set of angles (representing direction). Because the angular distributions are predictable and concentrated, PolarQuant skips the expensive per-block normalization step that conventional quantizers require, yielding high-quality compression with zero overhead from stored quantization constants.
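The Cartesian-to-polar split can be sketched with standard hyperspherical coordinates: one radius plus n-1 angles per n-dimensional vector. The exact parameterization PolarQuant uses may differ; this is only a minimal round-trip illustration:

```python
import numpy as np

def to_polar(v):
    # Radius plus n-1 hyperspherical angles for a vector with n >= 2 dims.
    v = np.asarray(v, dtype=float)
    r = np.linalg.norm(v)
    n = len(v)
    angles = np.empty(n - 1)
    for i in range(n - 2):
        # Angle between axis i and the remaining tail of the vector.
        angles[i] = np.arctan2(np.linalg.norm(v[i + 1:]), v[i])
    angles[-1] = np.arctan2(v[-1], v[-2])   # last angle keeps full sign range
    return r, angles

def to_cartesian(r, angles):
    # Inverse transform: rebuild the vector from radius and angles.
    n = len(angles) + 1
    x = np.empty(n)
    sin_prod = 1.0
    for i in range(n - 1):
        x[i] = r * sin_prod * np.cos(angles[i])
        sin_prod *= np.sin(angles[i])
    x[-1] = r * sin_prod
    return x
```

A quantizer operating in this representation can code the single radius and the angles separately; since every angle lies in a known, bounded range, no per-block scale constants need to be stored alongside the codes.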
The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss (QJL). QJL projects the residual quantization error into a lower-dimensional space and reduces each value to a single sign bit, eliminating systematic bias in attention score calculations at negligible additional cost.
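The sign-bit idea behind QJL can be illustrated with a random Gaussian projection: store only the signs of the projected vector (plus its norm), and inner products can still be estimated without systematic bias. This is a generic sketch of the technique, not the paper's exact estimator, and the dimensions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 2048                       # original and projected dims (assumed)
S = rng.standard_normal((m, d))       # shared random projection matrix

def encode(k):
    # Keep one sign bit per projected coordinate, plus the vector's norm.
    return np.sign(S @ k), np.linalg.norm(k)

def inner_product_estimate(q, code):
    signs, k_norm = code
    # For a Gaussian row s: E[sign(s.k) * (s.q)] = sqrt(2/pi) * (q.k)/||k||,
    # so rescaling by ||k|| * sqrt(pi/2) gives an unbiased estimate of q.k.
    return k_norm * np.sqrt(np.pi / 2) * np.mean(signs * (S @ q))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
est = inner_product_estimate(q, encode(k))
```

The key property is that the estimator's expectation equals the true inner product, so attention scores computed from the compressed cache carry no systematic bias, only zero-mean noise that shrinks as the projection dimension grows.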
Google tested the algorithms across long-context benchmarks, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open-source models Gemma and Mistral. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks.
The algorithm also showed strong results in vector search. Evaluated against Product Quantization and RaBitQ on the GloVe dataset, TurboQuant achieved the highest 1@k recall, despite those baselines relying on larger codebooks and dataset-specific tuning. Google noted that TurboQuant requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems.
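For readers unfamiliar with the metric, 1@k recall measures how often the true nearest neighbor of a query survives compression: the fraction of queries whose exact nearest neighbor appears among the top-k candidates returned from the compressed index. A minimal sketch of the computation, with hypothetical inputs:

```python
def recall_1_at_k(true_nn, approx_rankings, k):
    """Fraction of queries whose exact nearest neighbor (true_nn[i]) appears
    in the top-k candidates retrieved from the compressed index."""
    hits = sum(true_nn[i] in approx_rankings[i][:k]
               for i in range(len(true_nn)))
    return hits / len(true_nn)

# Two queries: exact neighbors are items 0 and 1; rankings are illustrative.
score = recall_1_at_k([0, 1], [[2, 0, 1], [1, 0, 2]], k=2)
print(score)  # 1.0 -- both true neighbors appear within the top 2
```

Higher is better, so TurboQuant topping this metric means its compressed codes preserve nearest-neighbor structure better than the tuned baselines.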
The paper, co-authored by research scientist Amir Zandieh and VP Vahab Mirrokni, will be presented at ICLR 2026 next month.