
KV caches store previously computed attention data so that LLMs don't have to recompute it at each token generation step. As context windows grow, these caches become a major memory bottleneck. Traditional vector quantization can shrink them, but it carries a small overhead: a few extra bits per value for the quantization constants that must be stored alongside the compressed data. That sounds negligible, but the cost compounds as context windows grow longer.
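To see why a few extra bits matter, here is some back-of-the-envelope arithmetic. The model shape and block-size numbers below are illustrative assumptions, not figures from the paper:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, context_len, bytes_per_value=2):
    # 2 tensors (keys and values) per layer; fp16 = 2 bytes per value
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_value

def quantized_bytes(layers, kv_heads, head_dim, context_len,
                    bits_per_value=3, block=64, const_bytes=4):
    values = 2 * layers * kv_heads * head_dim * context_len
    payload = values * bits_per_value / 8
    # conventional quantizers store a scale (and often a zero point) per
    # block of values -- this is the overhead TurboQuant avoids storing
    overhead = (values / block) * const_bytes
    return payload, overhead

# illustrative 128K-context model: 32 layers, 8 KV heads, head dim 128
fp16 = kv_cache_bytes(32, 8, 128, 128_000)
payload, overhead = quantized_bytes(32, 8, 128, 128_000)
ratio = fp16 / (payload + overhead)
```

With these assumed parameters, the stored constants add roughly half a bit per value on top of the 3-bit payload, dragging the effective compression ratio below the ideal 16/3.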
TurboQuant eliminates that overhead with a two-stage process. The first stage uses a technique called PolarQuant, which converts data vectors from standard Cartesian coordinates into polar coordinates, separating each vector into a radius (its magnitude) and a set of angles (its direction). Because the angular distributions are predictable and concentrated, PolarQuant can skip the expensive per-block normalization step that conventional quantizers require, delivering high-quality compression with zero overhead from stored quantization constants.
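The coordinate change itself is standard geometry: an n-dimensional vector becomes one radius plus n−1 angles. A minimal sketch of the conversion and its inverse (the paper's exact parameterization and block structure may differ):

```python
import math

def to_polar(x):
    """Convert a vector to hyperspherical coordinates:
    one radius plus len(x) - 1 angles. Illustrative sketch only."""
    r = math.sqrt(sum(v * v for v in x))
    angles = []
    for i in range(len(x) - 2):
        tail = math.sqrt(sum(v * v for v in x[i:]))
        angles.append(math.acos(x[i] / tail) if tail else 0.0)
    angles.append(math.atan2(x[-1], x[-2]))  # last angle covers full circle
    return r, angles

def from_polar(r, angles):
    """Inverse transform: rebuild the Cartesian vector."""
    x, s = [], r
    for a in angles[:-1]:
        x.append(s * math.cos(a))
        s *= math.sin(a)
    x.append(s * math.cos(angles[-1]))
    x.append(s * math.sin(angles[-1]))
    return x

r, angles = to_polar([1.0, 2.0, 2.0])  # r == 3.0; round trip recovers the vector
```

A quantizer working in this space can code the angles directly: since they live in a known, bounded range, no per-block scale constant needs to be stored to decode them.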
The second stage applies a 1-bit error correction layer using an algorithm called Quantized Johnson-Lindenstrauss (QJL). QJL projects the residual quantization error into a lower-dimensional space and reduces each value to a single sign bit, eliminating systematic bias in attention score calculations at negligible additional cost.
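The sign-bit idea can be illustrated with a classic sign-of-random-projection sketch: project a vector onto random Gaussian directions, keep one bit per projection, and estimate similarity from how often two sketches agree. QJL's exact estimator differs, so treat this as a sketch of the general principle, not the paper's method:

```python
import math
import random

def sign_sketch(x, projections):
    """Keep only the sign bit of each random projection of x."""
    return [1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else -1
            for p in projections]

def estimate_cosine(bits_a, bits_b):
    # fraction of agreeing sign bits estimates the angle between vectors:
    # P[agree] = 1 - theta / pi for Gaussian projections
    agree = sum(a == b for a, b in zip(bits_a, bits_b)) / len(bits_a)
    return math.cos(math.pi * (1 - agree))

random.seed(0)
d, m = 16, 2048  # illustrative dimensions and sketch size
proj = [[random.gauss(0, 1) for _ in range(d)] for _ in range(m)]
x = [random.gauss(0, 1) for _ in range(d)]
y = [xi + 0.1 * random.gauss(0, 1) for xi in x]  # y is a noisy copy of x
est = estimate_cosine(sign_sketch(x, proj), sign_sketch(y, proj))
```

Because each projection contributes exactly one bit, storage per value stays tiny, and the randomness makes the similarity estimate unbiased in expectation, which is the property the article credits QJL with preserving in attention scores.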
Google tested all three algorithms across long-context benchmarks, including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval, using the open-source models Gemma and Mistral. TurboQuant achieved perfect downstream scores on needle-in-a-haystack retrieval tasks while compressing KV memory by at least six times. On the LongBench suite, which covers question answering, code generation, and summarization, TurboQuant matched or outperformed the KIVI baseline across all tasks.
The algorithm also showed strong results in vector search. Evaluated against Product Quantization and RaBitQ on the GloVe dataset, TurboQuant achieved the highest 1@k recall ratios despite those baselines relying on larger codebooks and dataset-specific tuning. Google noted that TurboQuant requires no training or fine-tuning and incurs negligible runtime overhead, making it suitable for deployment in production inference and large-scale vector search systems.
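For readers unfamiliar with the metric, 1@k recall measures whether the exact nearest neighbor of a query survives compression: it is the fraction of queries for which the true nearest neighbor appears among the top k candidates returned by the compressed index. A minimal sketch of the computation:

```python
def recall_1_at_k(true_nn, retrieved_ids, k):
    """1@k recall for a single query: is the exact nearest neighbor
    among the top-k candidates from the compressed index?"""
    return 1.0 if true_nn in retrieved_ids[:k] else 0.0

def mean_recall(true_nns, retrieved_lists, k):
    """Average 1@k recall over a batch of queries."""
    return sum(recall_1_at_k(t, r, k)
               for t, r in zip(true_nns, retrieved_lists)) / len(true_nns)

# two toy queries: the first index hits at k=1, the second only at k=2
score = mean_recall([0, 1], [[0, 2], [2, 1]], k=1)  # 0.5
```

Higher bit rates generally raise this score at the cost of memory, which is why beating baselines with larger codebooks is the notable result here.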
The paper, co-authored by research scientist Amir Zandieh and VP Vahab Mirrokni, will be presented at ICLR 2026 next month.