LLM Inference Optimization: Quantization, Speculative Decoding, and KV Cache Engineering
As ChatGPT, Claude, and similar services reach hundreds of millions of monthly active users, inference cost has become one of AI companies’ largest operational expenses. GPT-4’s one-time training cost is estimated at approximately $100M; the cumulative inference cost of serving billions of daily queries may exceed that over time. Inference efficiency determines AI service sustainability and is among the most actively optimized areas in current AI engineering.
Quantization: Fewer Bits, Similar Performance
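Neural networks are typically trained in FP32 or in 16-bit formats (FP16/BF16). Quantization converts the trained weights to lower precision (INT8, INT4, even INT2), reducing memory footprint and memory bandwidth, and on hardware with low-precision kernels compute as well, in exchange for a small and usually acceptable loss in accuracy.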
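To make the idea concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization in plain Python. It assumes one scale for the whole tensor; production libraries typically use per-channel or per-group scales and packed storage, so treat the function names and the sample weights as illustrative only.

```python
def quantize_int8(weights):
    """Symmetric (absmax) quantization of a list of floats to INT8 codes."""
    max_abs = max(abs(w) for w in weights)
    # One scale for the whole tensor (an assumption made for brevity).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map INT8 codes back to approximate float values."""
    return [c * scale for c in codes]


weights = [0.12, -0.5, 0.33, 0.9, -1.27]  # hypothetical weight values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# Rounding to the nearest code bounds the error by half a quantization step.
max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

The largest-magnitude weight maps to ±127, and every other weight is rounded to the nearest multiple of the scale, which is why a single outlier can degrade precision for the rest of the tensor and why finer-grained scales are common in practice.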
INT8 quantization: halves weight memory relative to 16-bit. On hardware with fast INT8 kernels it can deliver a substantial inference speedup (40–60% is commonly cited, though the gain depends on the kernels, the batch size, and whether the workload is memory-bound), with accuracy degradation typically below 1% on general benchmarks. The `bitsandbytes` library and the GGUF file format (used by llama.cpp) are among the most widely used open-source options.
INT4 quantization: compresses a 70B-parameter model from ~140GB at 16-bit to ~35–40GB, small enough to run on a single high-end workstation or on a Mac with an M2 Ultra and 192GB of unified memory. GPTQ and AWQ (Activation-aware Weight Quantization) are among the most accurate widely used INT4 methods, with performance loss typically under 3% on most tasks.
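The ~140GB and ~35–40GB figures follow from simple arithmetic on bits per weight. The sketch below reproduces them under stated assumptions: 1 GB is taken as 10^9 bytes, and the INT4 case assumes one FP16 scale per group of 128 weights (a common but not universal choice). Real formats may also store zero-points or keep some layers in higher precision, which is why the practical range extends toward 40GB.

```python
def weight_memory_gb(num_params, bits_per_weight, group_size=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    If group_size is given, each group of weights is assumed to share one
    scale of scale_bits bits, which adds a small per-weight overhead.
    """
    effective_bits = bits_per_weight
    if group_size is not None:
        effective_bits += scale_bits / group_size
    return num_params * effective_bits / 8 / 1e9


params = 70e9  # 70B parameters

fp16_gb = weight_memory_gb(params, 16)
int8_gb = weight_memory_gb(params, 8)
int4_gb = weight_memory_gb(params, 4, group_size=128)  # assumed group size

print(f"FP16: {fp16_gb:.1f} GB, INT8: {int8_gb:.1f} GB, INT4: {int4_gb:.1f} GB")
```

These numbers cover weights only; activations and the KV cache add to the total at inference time.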
Ollama and llama.cpp have made running quantized models locally highly accessible and have become the standard toolchain for local open-source LLM deployment. See our local LLM deployment guide for configuration specifics.
Speculative Decoding
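Standard LLM inference is autoregressive: each forward pass generates one token. Speculative decoding adds a small “draft model” that cheaply proposes several future tokens; the large model then scores the whole draft in a single forward pass, accepts the longest prefix consistent with its own predictions, and replaces the first rejected token with its own. Because every emitted token is verified by the large model, throughput increases without changing output quality.
The two main implementation approaches are draft-model speculative decoding (proposed independently by researchers at Google and DeepMind) and Medusa, which attaches extra decoding heads to the large model itself so that no separate draft model is needed. Both are reported to deliver roughly 2–3× faster generation, which is particularly effective for conversational latency: first-token latency is unchanged, while subsequent tokens are generated substantially faster.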
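The draft-then-verify loop can be sketched with toy stand-ins for the two models. Everything here is hypothetical: `target_next` and `draft_next` are simple deterministic functions rather than neural networks, and the sketch uses greedy matching (accept a draft token only if it equals the large model's greedy choice). Sampling-based variants use a probabilistic acceptance rule instead. In a real system the verification loop is one batched forward pass of the large model.

```python
def target_next(context):
    """Toy stand-in for the large model's greedy next-token choice."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy stand-in for the draft model: wrong at every 4th position."""
    token = target_next(context)
    return (token + 1) % 50 if len(context) % 4 == 0 else token


def greedy_generate(prompt, num_tokens):
    """Baseline: one large-model pass per generated token."""
    tokens = list(prompt)
    for _ in range(num_tokens):
        tokens.append(target_next(tokens))
    return tokens[len(prompt):]


def speculative_generate(prompt, num_tokens, draft_len=4):
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < num_tokens:
        # 1. The draft model proposes draft_len tokens, one at a time.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))
        # 2. The large model checks every draft position. We count this as
        #    a single pass because real systems score all positions at once.
        target_passes += 1
        accepted = []
        for i in range(draft_len):
            expected = target_next(tokens + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                accepted.append(expected)  # large model's correction
                break
        else:
            # Whole draft accepted: the same pass yields one extra token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    generated = tokens[len(prompt):len(prompt) + num_tokens]
    return generated, target_passes


prompt = [1, 2, 3]
spec_tokens, passes = speculative_generate(prompt, 20)
baseline_tokens = greedy_generate(prompt, 20)
print(spec_tokens == baseline_tokens, passes)
```

The output matches plain greedy decoding token for token, while the number of large-model passes is well below the number of generated tokens. The achievable speedup depends on how often the draft agrees with the large model and on how cheap the draft is to run.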
KV Cache Optimization
Transformer self-attention stores the key and value tensors of all previous tokens (the KV cache) during inference. For a 70B model with a 128K context window, the KV cache can reach tens of GB per sequence. Key optimizations: KV cache quantization (storing the cache itself in INT8/INT4); PagedAttention (vLLM’s core innovation, which allocates the cache in fixed-size blocks to reduce memory fragmentation); FlashAttention (Dao et al. 2022, which reorganizes IO operations to reduce GPU memory access during attention and is now supported by the major LLM frameworks); and continuous batching (introduced by the Orca system and popularized by vLLM, which schedules requests at the iteration level to maximize GPU utilization).
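The "tens of GB" figure can be checked with a short calculation. The model shape below (80 layers, 8 key-value heads, head dimension 128) is an assumption, chosen to resemble published 70B models that use grouped-query attention; a model with full multi-head attention would need several times more.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Size of the KV cache in bytes for a decoder-only transformer."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed 70B-class shape with grouped-query attention.
layers, kv_heads, head_dim = 80, 8, 128
context_len = 128 * 1024  # 128K tokens

fp16_gib = kv_cache_bytes(layers, kv_heads, head_dim, context_len) / 2**30
int8_gib = kv_cache_bytes(layers, kv_heads, head_dim, context_len,
                          bytes_per_value=1) / 2**30

print(f"FP16 cache: {fp16_gib:.0f} GiB, INT8 cache: {int8_gib:.0f} GiB")
```

Under these assumptions a single full-length sequence needs about 40 GiB of cache at 16-bit precision, and quantizing the cache to INT8 halves that. Because the size scales linearly with batch size, cache memory rather than weights often limits how many long requests can be served concurrently.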
vLLM and Hugging Face’s Text Generation Inference (TGI) are among the most widely used high-performance LLM inference serving frameworks, and both integrate most of the techniques above.




