
I didn't understand TurboQuant, so I wrote this explainer

Every team that tried to compress an LLM’s working memory below 4 bits hit the same two walls. The bookkeeping data you need to reconstruct the values eats a third of your bit budget. And even when you cram the numbers down, the compression warps the dot products that drive attention. The model starts paying attention to the wrong things.

At 8 bits these problems are minor. At 3 bits they’re fatal. Multiple research groups over 2024-2025 (KIVI, KVQuant, QuaRot) all hit the same floor.

Google’s TurboQuant breaks through it. 6x compression, zero accuracy loss, no retraining. Posted to arXiv in April 2025, accepted to ICLR 2026, mostly unnoticed for a year. The technique is two stages, both grounded in decades-old math, and both worth understanding if you care about where inference is headed.

The memory that costs more than the model

When a transformer generates text, it doesn’t re-read the entire conversation at each token. That would mean recomputing every attention score for every prior token, which is O(n²) work.

Instead, the model caches two vectors at every layer for every token: the Key (what the token represents for attention matching) and the Value (what the token contributes when attended to). This is the KV cache, the model’s working memory.

Deep Dive: What attention actually computes

You’re at a crowded party and someone says “Python.” You don’t replay every conversation you’ve ever had. Your brain has an index of topics (keys) and associated memories (values). “Python” matches against “programming” in your index, and you pull the relevant context: that project at work, the tutorial you read last week.

Attention works the same way. Each new token generates a query: “what’s relevant to me?” That query gets compared against every previous token’s key via a dot product (high score = relevant). The scores determine how much of each previous token’s value gets mixed into the output. Cache the keys and values, and you only compute each once.
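The mechanism above fits in a few lines of NumPy. This is a toy single-head sketch (the sizes and array names are illustrative, not any model's real dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                  # toy head dimension
K_cache = rng.standard_normal((5, d))  # keys of 5 already-processed tokens
V_cache = rng.standard_normal((5, d))  # their values

def attend(q, K, V):
    scores = K @ q / np.sqrt(len(q))   # dot-product relevance per cached token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()           # softmax: scores become mixing proportions
    return weights @ V                 # blend cached values by relevance

q = rng.standard_normal(d)             # query for the newest token
out = attend(q, K_cache, V_cache)      # past keys/values are reused, not recomputed
```

Each generation step only runs `attend` against the cache; that is exactly the memory the rest of this post is about compressing.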

The size follows a formula:

2 × layers × kv_heads × head_dim × bytes_per_element × batch_size × seq_length

(For models with grouped-query attention, kv_heads is the number of key-value heads, which is smaller than the attention head count.) For Llama 3 70B (80 layers, 8 KV heads, head_dim 128) at FP16 precision, 128k tokens, single user: ~43 GB. Just the cache. Scale to 10 concurrent users and you need over 400 GB of cache alone. This is why inference providers charge per token and long-context queries cost a premium.
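The formula is easy to sanity-check in code. The Llama 3 70B figures plugged in here (80 layers, 8 KV heads via grouped-query attention, head_dim 128, FP16) are the published architecture config, used for illustration:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, bytes_per_elem, batch, seq_len):
    # the leading 2 covers the Key and Value tensors
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * batch * seq_len

# Llama 3 70B-style config: 80 layers, 8 KV heads (GQA), head_dim 128, FP16
size = kv_cache_bytes(80, 8, 128, 2, batch=1, seq_len=131_072)
print(f"{size / 1e9:.1f} GB")  # 42.9 GB for a single 128k-token user
```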

The deeper constraint: LLM generation is memory-bandwidth bound. Each token involves small matrix-vector multiplications that finish fast. The bottleneck is loading the KV cache from memory to compute. An H100 SXM moves data at ~3.35 TB/s. A smartphone: 50-90 GB/s. That 40-60x gap defines who can run what.
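The bandwidth arithmetic, as a back-of-envelope sketch. This counts only KV cache traffic (a real decode step also streams model weights), so it is a floor on latency, not a prediction:

```python
def per_token_ms(bytes_read, bandwidth_bytes_per_s):
    # if every decode step must stream these bytes, this bounds the latency
    return bytes_read / bandwidth_bytes_per_s * 1e3

cache = 40e9  # ~40 GB KV cache from the running example
print(f"H100 SXM: {per_token_ms(cache, 3.35e12):.1f} ms/token")  # ~11.9
print(f"phone:    {per_token_ms(cache, 70e9):.0f} ms/token")     # ~571
```

Same cache, same math, two orders of magnitude apart. Shrink the bytes and both numbers drop proportionally.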

The two walls, in detail

Compressing 32-bit floats to fewer bits is obvious. Several teams tried. Each made real progress, and each stopped at the same place.

KIVI (ICML 2024, github) found that keys have outlier channels but values don’t. It quantizes keys per-channel, values per-token. Tuning-free, works at 2 bits.

KVQuant (NeurIPS 2024, github) combined per-channel quantization, non-uniform datatypes, and dense-and-sparse quantization for outliers. One A100 could serve a 1M-token context on Llama 7B.

QuaRot (NeurIPS 2024, github) applies Hadamard rotations to smooth vectors before quantization. It uses random orthogonal transforms to eliminate outliers so standard uniform quantization works at lower bit-widths.

What stopped them:

The metadata tax. To quantize a block of numbers, you store a scale factor and zero point for that block. At 8-bit quantization, this overhead is negligible. At 3 bits, it consumes a third to half of your bit budget. You’re spending bits on bookkeeping, not data.

Deep Dive: What scale and zero point mean

Say you have a list of temperatures: 18.7, 21.3, 19.1, 22.8, 20.5. They range from about 18 to 23. You want to store each in just 3 bits, which gives you 8 possible values (0 through 7).

You record two numbers: the minimum (18, the zero point) and the step size (roughly 0.7, the scale). Then each temperature becomes a bin number: 18.7 → 1, 21.3 → 5, 19.1 → 2, and so on. To reconstruct, multiply the bin by the scale and add the zero point: 5 × 0.7 + 18 = 21.5. Close to the original 21.3, not exact.

Every block of numbers has a different range, so every block needs its own scale and zero point stored at full precision. That’s the metadata tax.
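The temperature example above, in code. The scale of 0.7 is the rounded step size from the walkthrough, kept for illustration:

```python
temps = [18.7, 21.3, 19.1, 22.8, 20.5]
zero_point, scale = 18.0, 0.7    # the per-block metadata (the "tax")

bins  = [round((t - zero_point) / scale) for t in temps]   # 3-bit codes, 0..7
recon = [round(b * scale + zero_point, 1) for b in bins]   # dequantize

print(bins)   # [1, 5, 2, 7, 4]  (each fits in 3 bits)
print(recon)  # [18.7, 21.5, 19.4, 22.9, 20.8]  (close, not exact)
```

The five temperatures cost 15 bits, but `zero_point` and `scale` ride along at full precision. For small blocks at low bit-widths, that metadata dominates.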

Inner product bias. Quantizers optimized for reconstruction quality (mean squared error) introduce systematic bias in dot products. Attention scores are dot products between query and key vectors. A quantizer that reconstructs individual vectors well can still corrupt attention. The model starts attending to the wrong things.

Deep Dive: Why dot products matter for attention

A dot product measures how much two vectors point in the same direction. Same direction: large positive number. Perpendicular: zero. Opposite: large negative.

Attention uses this as a relevance score. The query vector for the current token says “I’m looking for X.” Each key vector says “I contain Y.” The dot product between them answers “how relevant is Y to X?” High score means this token gets more influence on the output. Low score means it gets ignored.

Bias these scores and the model pays attention to the wrong tokens.
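The three geometric cases, concretely:

```python
import numpy as np

a    = np.array([3.0, 4.0])
same = np.array([6.0, 8.0])     # points the same way as a
perp = np.array([-4.0, 3.0])    # perpendicular to a
opp  = np.array([-3.0, -4.0])   # points the opposite way

print(np.dot(a, same))  # 50.0  -> highly relevant
print(np.dot(a, perp))  # 0.0   -> ignored
print(np.dot(a, opp))   # -25.0 -> actively suppressed
```

A quantizer that nudges these scores in a consistent direction shifts which tokens win the relevance contest, even if each reconstructed vector looks fine on its own.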

PolarQuant: kill the metadata

TurboQuant is two stages. The first, PolarQuant (a separate paper, AISTATS 2026, overlapping authors), changes the coordinate system.

Standard quantization works in Cartesian coordinates. Each dimension is an independent axis. To quantize, you need the value range per block (scale) and where zero is (zero point). That metadata costs bits.

PolarQuant converts vectors to polar coordinates: a magnitude and a direction on the unit sphere. Instead of “go 3 blocks East and 4 blocks North,” polar says “go 5 blocks at an angle of 53 degrees.”

Deep Dive: The coordinate shift, step by step

Start with the 2D case from the analogy. The vector (3, 4) in Cartesian. To quantize, you find the range of values in each block, divide into bins, and store the range as metadata (scale and zero point). Different vectors produce different ranges, so every block needs its own metadata.

Polar coordinates represent the same point as (r=5, θ=53°). In 2D, this doesn’t help much. You’d still need the range of angles.

High dimensions change the picture. A typical attention head operates in 128 dimensions. Here’s what PolarQuant does:

  1. Decompose. Split each vector into its magnitude (one scalar, r) and its direction (a unit vector on the 127-dimensional sphere).

  2. Rotate. Multiply the unit vector by a random orthogonal matrix. This preserves length and all dot products with other vectors. It shuffles which direction each coordinate axis points.

  3. Observe. After rotation, each coordinate of the unit vector follows a distribution that depends only on the dimension (128), not on the data. In 128 dimensions, coordinates cluster around zero with standard deviation of about 1/√128 ≈ 0.088. This holds for every input vector, regardless of what it originally looked like.

Since the post-rotation distribution is known before you see any data, you design one quantization grid and reuse it for everything. No per-block metadata needed.

Why does this eliminate the metadata? Roll one die and you get anything from 1 to 6. Roll 128 dice and average them, and you’ll get something very close to 3.5 every time. The rotation does the same thing to each coordinate of the vector. This is concentration of measure: in high dimensions, randomly rotated coordinates cluster around a predictable value. After rotation, the distribution is known in advance.

The polar transform in action: Cartesian quantization pays a metadata tax (scale and zero point) on every block; switching to polar coordinates eliminates it entirely.

Since the distribution is predictable, you can map to a fixed quantization grid without computing per-block statistics. No scale factors. No zero points. And because coordinates are nearly independent post-rotation, vector quantization reduces to scalar quantization. Quantize each coordinate separately. The rotation is invertible, so you reconstruct the original geometry by applying the inverse.
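Steps 2 and 3 can be verified in a few lines of NumPy. A QR-factorized Gaussian matrix is one standard way to draw a random orthogonal matrix; it stands in for whatever rotation the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal matrix

x = rng.standard_normal(d) * 7.3   # arbitrary input vector
r = np.linalg.norm(x)              # magnitude: one scalar to store
u = x / r                          # direction: a point on the unit sphere

v = Q @ u                          # the rotation
print(round(float(np.linalg.norm(v)), 6))  # 1.0: length preserved
print(round(float(v.std()), 3))            # ~0.088 = 1/sqrt(128), regardless of x

y = rng.standard_normal(d)
assert np.isclose((Q @ x) @ (Q @ y), x @ y)  # dot products preserved too
```

Rerun this with any input vector you like: the post-rotation coordinate spread stays pinned near 1/√128, which is exactly why one fixed grid can serve every block.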

QJL: the one-bit correction

PolarQuant eliminates metadata but introduces bias in the magnitude. The radius component carries systematic error that distorts dot-product estimates. This is exactly the inner product bias problem.

The fix uses a 1984 result. A shadow of a 3D object onto a wall loses a dimension but preserves the relative positions of its features. The Johnson-Lindenstrauss Lemma says you can do the same thing from 128 dimensions down to a handful, and the distances between points barely change. The target dimension is O(log n / ε²), independent of the original dimension.

TurboQuant’s QJL (Quantized Johnson-Lindenstrauss) applies this to the residual. Subtract PolarQuant’s approximation from the original vector. Project the residual using a random JL transform. Then quantize each element to a single sign bit: +1 or -1.

Why does one bit work? After a random projection, the sign of each coordinate is an unbiased estimator of the inner product direction. With enough coordinates, the law of large numbers kicks in, and the aggregate sign-bit estimate converges to the true dot product. Mathematically unbiased for inner products, zero metadata overhead.

Deep Dive: Why a single bit per coordinate is enough

Imagine you want to know if two people have similar taste in movies. You could ask them to rate 100 movies on a 10-point scale and compare. But that’s expensive: 100 × 10 points of data per person.

Instead, ask each person 100 yes/no questions: “Did you like this movie?” Each answer is one bit. Any single yes/no is a crude signal. But across 100 questions, the fraction of matching answers reliably tells you how similar their tastes are. More questions = more accuracy, and each question costs almost nothing to store.

That’s what QJL does. After the random projection, each coordinate of the residual gets compressed to its sign: positive → 1, negative → 0. Each sign bit is a noisy yes/no vote on the direction of the error. With enough votes, the aggregate converges to the true correction. The math guarantees this: the expected value of the sign-bit estimate equals the true dot product.
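The voting intuition can be demonstrated with the classic sign-random-projection estimator (SimHash), a simpler cousin of QJL rather than the paper's exact construction. Sign agreement after a random projection encodes the angle between two vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 128, 20_000        # original dimension, number of one-bit "votes"

x = rng.standard_normal(d)
y = rng.standard_normal(d)

P = rng.standard_normal((m, d))   # random JL-style projection
bx = np.sign(P @ x)               # one bit per projected coordinate
by = np.sign(P @ y)

agree = float((bx == by).mean())         # fraction of matching votes
est_cos = np.cos(np.pi * (1.0 - agree))  # agreement rate encodes the angle
true_cos = (x @ y) / (np.linalg.norm(x) * np.linalg.norm(y))
# est_cos tracks true_cos, and the error shrinks as m grows
```

One bit per coordinate, no scales, no zero points, and the estimate sharpens with every extra vote. QJL applies the same principle to the residual error instead of the raw vectors.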

At 3 bits, standard quantizers split their budget between data and bookkeeping. TurboQuant allocates every bit to actual information.

The bit budget splits: (target bits - 1) for PolarQuant, 1 bit for QJL. At 3 total bits: 2 bits of polar compression + 1 bit of residual correction. There’s a theoretical floor on how few bits can represent this data without losing information, a speed-of-light for compression. TurboQuant lands within a factor of about 2.7 of that limit, roughly 37% of optimal efficiency.

The numbers

The team (Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni at Google Research) tested on Llama-3.1-8B-Instruct, Gemma, and Mistral across LongBench, Needle-in-Haystack, ZeroSCROLLS, RULER, and L-Eval.

  • KV cache bit-width: 3 bits (from 32)
  • Memory reduction: 6x vs. FP32 baseline
  • Attention speedup (H100): 8x vs. 32-bit (attention only)
  • Accuracy at 3.5 bits: statistically indistinguishable from baseline
  • Accuracy at 2.5 bits: marginal degradation
  • Retraining: none
  • Runtime overhead: negligible

Beyond LLMs, TurboQuant outperforms Product Quantization and existing vector quantization methods on billion-scale vector search (GloVe, d=200) with better recall at fewer bits.

What hasn’t been proven

Three gaps.

Small models only. All benchmarks use 8B-parameter models. Whether the accuracy claims hold on 70B+, mixture-of-experts architectures, or million-token contexts is open. Quantization errors compound differently at scale.

The 8x is for attention only. That speedup measures attention logit computation specifically: 4-bit keys vs 32-bit keys on H100. Feed-forward layers, embedding lookups, and sampling still take their usual time. End-to-end wall-clock gains will be lower.

No production implementation. The paper is an algorithm, not a system. No official CUDA kernel, no integration with vLLM, TensorRT-LLM, or SGLang. Mitko Vasilev posted on LinkedIn that he implemented it for vLLM and reports 4M+ tokens of KV cache on an NVIDIA GB10. Community implementations are early.

The rest of the field

KV cache compression is one angle of attack on inference cost. Several others are converging:

Architecture-level: DeepSeek’s Multi-head Latent Attention (MLA, github) bypasses the KV cache problem entirely. Instead of caching full key-value pairs, it compresses them into low-rank latent vectors using learned projections. DeepSeek-V2 reported a 93% KV cache reduction. The tradeoff: MLA requires architectural changes during pre-training. You can’t bolt it onto an existing model. TurboQuant works on any transformer with no retraining.

Allocation: vLLM’s PagedAttention (github) doesn’t compress the cache but eliminates fragmentation, cutting memory waste from 60-80% to under 4%. SGLang’s RadixAttention (github) deduplicates shared prefixes across requests using a radix tree, reusing cached K/V for common context (system prompts, few-shot examples). Up to 6.4x throughput improvement.

Eviction: Methods like H2O (Heavy-Hitter Oracle, github) selectively discard low-impact KV entries rather than compressing them. H2O retains ~20% of tokens, achieving 29x throughput gains on OPT-30B. StreamingLLM (github) enables infinite-length generation by keeping initial and recent tokens. Different philosophy: don’t compress everything, throw away what doesn’t matter.

Hardware: NVIDIA’s NVFP4 bakes FP4 support into Blackwell silicon, cutting KV cache memory by 50% vs FP8. Custom inference chips (Groq’s LPU, Cerebras WSE) sidestep the memory bandwidth problem with different memory hierarchies entirely, using massive on-die SRAM instead of HBM.

Weight quantization (GPTQ, AWQ, GGUF) compresses model parameters, not the KV cache. Complementary: a 4-bit quantized model with a 3-bit KV cache is a valid and useful configuration.

Speculative decoding uses a small draft model to propose multiple tokens, verified in parallel by the target model. 2-3x throughput gains, orthogonal to everything above.

KV compression + allocation optimization + weight quantization + speculative decoding. The inference stack of 2026 has almost nothing in common with 2024.

Where the frontier is open

Explored (production-grade or mature):

  • KV cache quantization: KIVI, KVQuant, QuaRot, GEAR
  • Weight quantization: GPTQ, AWQ, GGUF
  • Attention optimization: FlashAttention (github), PagedAttention
  • KV eviction: H2O, StreamingLLM
  • Rotation-based smoothing: QuaRot, SpinQuant (github), PolarQuant

Active research:

  • Adaptive bit allocation. Different layers and attention heads tolerate different precision. 2 bits for less important heads, 4 for critical ones.
  • Compound quantization. Co-optimizing KV cache + weight + activation quantization, rather than applying each independently. The interaction effects are unexplored.
  • Architecture-compression co-design. MLA-like architectural changes combined with post-training quantization. What’s the Pareto frontier when you can change both the architecture and the compression?

Green-field:

  • Multi-agent KV sharing. When 50 agents work on the same codebase, their caches contain massive redundant context. Quantized shared caches could collapse the memory cost of multi-agent systems. Nobody has published on this.
  • Quantization-native attention. Attention mechanisms designed for quantized representations from scratch, not full-precision attention with compression bolted on.
  • TurboQuant-specific ARM/NPU kernels. The H100 numbers are real, but mobile memory bandwidth is 40-60x lower. TurboQuant on Apple, Qualcomm, and MediaTek silicon would put 3-bit KV caches on phones.
  • Theoretical tightening. TurboQuant sits at ~2.7x the information-theoretic lower bound. Closing that gap is open math.
  • Hardware co-design. Custom silicon for polar-coordinate operations. Current GPUs are optimized for Cartesian arithmetic.
  • Streaming quantization. Agent KV caches grow continuously. On-the-fly quantization with zero latency overhead at the extreme end is unsolved.

What this changes

Training is a one-time cost. Inference is the bill that shows up forever. Every message, every agent call, every time a coding assistant loads a repo into context. The cost of inference has dropped 92% since early 2023, and demand exploded in response. Make a resource cheaper and people use more of it. Jevons paradox. TurboQuant won’t lead to people running the same models on less hardware. They’ll run bigger models, longer contexts, more concurrent sessions, on cheaper hardware.

I’ve written before that AI will follow the same decentralization path as computing: mainframes to minis to PCs to phones. Each transition happened when the cost of running compute locally dropped below the cost of renting it. Compression is one of the forces that tips that balance.

Whether TurboQuant specifically becomes the standard is secondary. The proof that 3-bit KV caches match full precision, backed by information-theoretic bounds and not just benchmarks, sets a new floor. Future methods start here.

AI that runs on your hardware and answers to you is an engineering problem. Work like TurboQuant is how it gets solved.
