TurboQuant achieves up to 8x speed improvements on modern GPUs without sacrificing model accuracy
Google Research has announced TurboQuant, a compression algorithm that could meaningfully change the economics of running large AI models. According to Google’s benchmarks, it shrinks memory usage by at least 6x and delivers up to 8x speed improvements on modern GPUs — with no accuracy loss. If those numbers pan out in production workloads, the implications for everything from chatbots to search infrastructure are pretty significant.
The announcement comes as AI labs are buying up memory allotments years in advance and memory supply constraints ripple through the rest of the tech industry. That said, it's worth noting that Google's new algorithm isn't shipping just yet; a formal presentation of it is scheduled for ICLR 2026 next month.
What TurboQuant solves
Large language models need to remember context as they work through a conversation or a long document. They do this through what's called a key-value cache: a high-speed data store that tracks previous inputs so the model isn't recomputing everything from scratch with every new token. The catch is that this cache balloons quickly as inputs get longer, consuming GPU memory at a rate that becomes a real bottleneck. That bottleneck constrains how many users you can serve at once, how long a document a model can handle, and ultimately, how much it costs to operate.
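To make the scale of the problem concrete, here is a back-of-the-envelope sizing of a key-value cache. The model dimensions below are illustrative assumptions (roughly a Llama-2-7B-class configuration), not figures from Google's work:

```python
# Back-of-the-envelope KV cache sizing for a hypothetical transformer.
# All model dimensions here are illustrative assumptions.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value):
    # Each layer stores one key and one value vector per head per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed configuration: 32 layers, 32 KV heads, head dimension 128.
full = kv_cache_bytes(32, 32, 128, seq_len=32_000, bytes_per_value=2)  # fp16
print(f"fp16 cache at 32k tokens: {full / 2**30:.1f} GiB")

# The same cache at 3 bits per value (ignoring any quantization overhead):
compressed = full * 3 / (8 * 2)
print(f"3-bit cache at 32k tokens: {compressed / 2**30:.1f} GiB")
```

At these assumed dimensions, the fp16 cache alone runs to roughly 15 GiB at a 32,000-token context, which is why compressing it matters so much for serving capacity.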
TurboQuant goes after this problem directly. It compresses the data living in that cache, shrinking it dramatically while preserving the information the model actually needs to function. The key practical advantage is that it works immediately, with no fine-tuning, no training on a specific dataset, and no customization for a particular architecture. It's essentially plug-and-play, a meaningful improvement over older compression techniques that demand extensive setup before they're useful. TurboQuant will be presented at ICLR 2026 alongside two companion methods, PolarQuant and QJL, that together form the broader compression approach.
Performance improvements
TurboQuant compresses the memory cache from the standard 32 bits per value down to as little as 3 bits; Google puts the resulting reduction in memory footprint at 6x or more. At 4-bit precision, it achieves up to 8x speed gains on Nvidia H100 GPUs relative to an uncompressed baseline.
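To illustrate the general mechanics of trading bits for memory, here is a minimal uniform quantizer. This is a generic textbook technique, not Google's published algorithm, and the sample values are made up:

```python
# A minimal sketch of uniform low-bit quantization, the general family of
# techniques TurboQuant belongs to. This is NOT Google's algorithm, just
# an illustration of trading precision (bits) for memory.

def quantize(values, n_bits):
    """Map floats onto 2**n_bits evenly spaced levels between min and max."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / (2**n_bits - 1) or 1.0
    codes = [round((v - lo) / scale) for v in values]  # small ints, n_bits each
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return [lo + c * scale for c in codes]

vals = [0.13, -0.42, 0.87, 0.05, -0.91, 0.33]
codes, lo, scale = quantize(vals, n_bits=3)   # 8 representable levels
approx = dequantize(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(vals, approx))
# With rounding, the worst-case error is half a quantization step.
assert max_err <= scale / 2 + 1e-12
```

The hard research problem, which schemes like TurboQuant address, is keeping that rounding error from degrading model outputs once it compounds across thousands of cached values.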
What makes these results more notable is Google's claim of zero accuracy loss across every model tested. Across a variety of models, TurboQuant matched full-precision, uncompressed performance on tests running up to 104,000 tokens under 4x compression. On the Needle-in-a-Haystack benchmark, which evaluates whether a model can locate specific information buried deep in a long context window, TurboQuant hit 100% retrieval accuracy. Google also reports that at 3-bit precision it outperformed KIVI, the current standard baseline published at ICML 2024, across multiple established benchmarks.
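The Needle-in-a-Haystack setup itself is simple to sketch: hide one fact deep in a long context and test whether it can be recovered. Real evaluations prompt an LLM and grade its answer; in this toy version, a string search stands in for the model, and all names and values are invented for illustration:

```python
# A toy version of the Needle-in-a-Haystack evaluation setup. Real
# benchmarks prompt an LLM to answer from the long context; here a
# simple substring check stands in for the model.

import random

def build_haystack(needle, n_filler=10_000, seed=0):
    rng = random.Random(seed)
    filler = [f"Fact {i}: the sky is blue." for i in range(n_filler)]
    filler.insert(rng.randrange(n_filler), needle)  # bury the needle
    return " ".join(filler)

needle = "The secret code is 4912."
context = build_haystack(needle)
# The evaluation question: can the needle be retrieved from the context?
found = needle in context
```

The benchmark's value comes from varying both the context length and the needle's position, since compressed caches tend to fail first on facts buried far from the end of the context.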
There are applications beyond language models, too. In vector search, the gains along one axis are even more dramatic: indexing time for high-dimensional vectors dropped to 0.0013 seconds, down from hundreds of seconds with existing methods.
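For readers unfamiliar with vector search, the connection to quantization is that the index stores compressed codes instead of full floats. The toy index below is in the spirit of quantized vector search but far simpler than any production system; the vectors and dimensions are invented:

```python
# Illustrative only: a toy vector index storing 3-bit codes instead of
# floats, in the spirit of (but far simpler than) quantized vector search.

def quantize_vec(vec, n_bits=3):
    lo, hi = min(vec), max(vec)
    scale = (hi - lo) / (2**n_bits - 1) or 1.0
    return [round((x - lo) / scale) for x in vec], lo, scale

def approx_dot(query, codes, lo, scale):
    # Decode on the fly; real systems fold lo/scale into the dot product.
    return sum(q * (lo + c * scale) for q, c in zip(query, codes))

db = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.0, 0.95]]
index = [quantize_vec(v) for v in db]        # each vector now 3 bits/dim
query = [1.0, 0.0, 0.0]
scores = [approx_dot(query, *entry) for entry in index]
best = max(range(len(db)), key=scores.__getitem__)  # nearest neighbor
```

Even with heavy compression, the approximate scores preserve the ranking of neighbors here, which is the property quantized search methods are designed to maintain at scale.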
These are Google’s own benchmarks, though. Independent verification and production deployment tend to paint a more nuanced picture, and whether these results translate across the full diversity of real-world workloads remains an open question.
Real-world impacts
The commercial stakes here are potentially huge. Google’s core revenue engines, like Search, YouTube recommendations, and ad targeting, all lean heavily on exactly the kinds of vector search and language model inference that TurboQuant optimizes. Even a fraction of these benchmark gains translating to production could meaningfully cut infrastructure costs or let Google ship more capable AI features without proportionally scaling up hardware.
Some industry figures have drawn a direct connection between TurboQuant and the competitive pressure Google faces from efficiency-focused efforts elsewhere. Cloudflare CEO Matthew Prince, among others, framed this as Google’s answer to developments like DeepSeek’s approach of delivering strong AI performance on lower-cost infrastructure. Whether or not Google explicitly set out to respond to DeepSeek, the fact is that companies are racing to make AI cheaper, not just more powerful.
TurboQuant remains a research breakthrough, with its academic papers still pending formal publication. It hasn't been widely deployed in Google's products or made available for broad industry use. For now, TurboQuant is an exciting proof of concept, albeit one that could have a significant impact on how AI infrastructure works. But it still needs to prove itself outside the lab.