Shrinking Giants: Understanding LLM Quantization Models (Q2, Q4, Q6 and Friends)
Why Quantization Matters
Large Language Models (LLMs) are huge. Even a “small” 7B parameter model can chew up 14+ GB in FP16 (16-bit floating point). If you’ve tried running one locally without a beefy GPU, you’ve probably noticed your machine crying in pain—or worse, swapping memory like it’s 2005.
That’s where quantization steps in. Quantization compresses model weights into fewer bits (from 16 or 32 down to 8, 6, 4, or even 2) while trying to preserve accuracy. The result? Models small enough to run on CPUs, laptops, and even a Raspberry Pi (yes, I’ve done it… slowly).
Floating Point Models (FP32, FP16, BF16)
Before quantization, LLMs live in floating point land:
- FP32 – The full precision, 32-bit float. Big, accurate, but memory-hungry. Baseline for training, rarely used for inference at home.
- FP16 / BF16 – Half precision. Cuts memory usage in half, faster on GPUs with Tensor Cores. Sweet spot for most inference on modern NVIDIA cards.
Think of FP models as the “master copy” of your LLM—clean and accurate, but expensive to run.
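A quick back-of-the-envelope check (weights only, ignoring the KV cache, activations, and runtime overhead) shows where those numbers come from; this is plain arithmetic, not a benchmark:

```python
# Weight-only memory for a 7B-parameter model at different float precisions.
# Ignores the KV cache, activations, and runtime overhead.
params = 7_000_000_000

print(f"FP32: {params * 4 / 1e9:.0f} GB")  # 4 bytes per weight -> ~28 GB
print(f"FP16: {params * 2 / 1e9:.0f} GB")  # 2 bytes per weight -> ~14 GB
```

That ~14 GB figure for FP16 is exactly the number from the intro, and it’s before you add the KV cache and activations on top.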
Quantized Models (Q2, Q4, Q6, Q8…)
Now the fun stuff. Quantized models represent weights with fewer bits, which slashes size and boosts inference speed (especially on CPUs). Common schemes:
- Q8 – 8-bit quantization. Near FP16 accuracy, but still heavy for CPUs. Best if you’ve got a solid GPU with plenty of VRAM.
- Q6 – 6-bit quantization. Balanced: noticeably smaller than Q8, accuracy drop is usually tolerable. Popular on mid-tier GPUs or strong CPUs.
- Q4 – 4-bit quantization. The “sweet spot” for CPU inference. Huge compression (up to ~75%), still usable for chat and most reasoning tasks. Many Ollama/GGUF releases ship Q4 variants for laptops.
- Q2 – 2-bit quantization. Ultra-tiny. Runs on nearly anything (even ARM SBCs), but you pay with accuracy. Fine for experiments, not for production.
How they store info: instead of keeping each weight as a 16- or 32-bit float, quantized formats group weights into small blocks, store one scale factor (and sometimes an offset) per block, and encode each weight as a tiny integer (2, 4, 6, or 8 bits). At runtime those integers get dequantized back into approximate floats, roughly integer × scale (+ offset).
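To make that concrete, here’s a minimal NumPy sketch of block-wise “absmax” quantization. It’s illustrative only: real GGUF k-quants such as q4_K_M use more elaborate per-block scales, minimums, and super-blocks, and the quantize_block / dequantize_block helpers are names made up for this example.

```python
import numpy as np

def quantize_block(weights, bits=4):
    """Symmetric absmax quantization of one block of weights.
    Stores one float scale plus one small signed integer per weight."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 7 for 4-bit signed
    scale = np.abs(weights).max() / qmax            # per-block scale factor
    q = np.clip(np.round(weights / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize_block(q, scale):
    """What happens at runtime: integer * scale gives back an approximate float."""
    return q.astype(np.float32) * scale

# One block of 32 fake "weights" (GGUF formats use block sizes in this ballpark).
rng = np.random.default_rng(0)
block = rng.normal(0.0, 0.02, size=32).astype(np.float32)

q, scale = quantize_block(block, bits=4)
approx = dequantize_block(q, scale)
print("worst-case error in this block:", np.abs(block - approx).max())
```

The per-weight error is bounded by about half the scale, i.e. a small fraction of the largest weight in the block, which is why 4-bit models stay surprisingly usable.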
Which One Should You Use?
It depends on your hardware (a rough rule-of-thumb sketch in code follows this list):
- High-end GPU (24GB+ VRAM, e.g., RTX 4090) → FP16 or Q8. You’ve got the horsepower; enjoy near-perfect accuracy.
- Mid-range GPU (8–12GB VRAM, e.g., 3060/4060 Ti) → Q6. Fits comfortably, trades little accuracy for speed.
- CPU-only laptops/desktops → Q4. Practical baseline. Most modern CPUs (AVX2/AVX-512) chew through Q4 models decently.
- Raspberry Pi / low-end CPU → Q2. Yes, it works. Yes, it’s slow. Yes, it’s hilarious to watch a Pi argue philosophy.
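If you prefer that rule of thumb in code, here’s a small, hypothetical helper. The bits-per-weight figures are approximate GGUF-style averages (k-quants store per-block scales, so “4-bit” really costs closer to 5 bits per weight), and the 1.2× headroom factor for KV cache and runtime overhead is an assumption, not a measurement.

```python
# Rough weight-memory estimates per quantization level, and a helper that picks
# the least-lossy level fitting a RAM/VRAM budget. All numbers are approximate.
APPROX_BITS_PER_WEIGHT = {
    "FP16": 16.0,
    "Q8": 8.5,   # Q8_0-style
    "Q6": 6.6,   # Q6_K-style
    "Q4": 4.8,   # Q4_K_M-style
    "Q2": 2.6,   # Q2_K-style
}

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only size in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

def pick_quant(params_billion: float, budget_gb: float, headroom: float = 1.2):
    """Return the least-lossy level whose weights (plus headroom for KV cache
    and runtime overhead) fit the budget, or None if nothing fits."""
    for level, bpw in APPROX_BITS_PER_WEIGHT.items():
        if weight_gb(params_billion, bpw) * headroom <= budget_gb:
            return level
    return None

for budget in (24, 12, 8, 4):
    print(f"{budget:>2} GB budget -> {pick_quant(8, budget)}")   # an 8B model
```

For an 8B model this picks FP16 at a 24 GB budget, Q8 at 12 GB, Q6 at 8 GB, and Q2 once you’re down to 4 GB, which roughly matches the list above.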
Model Showdown: Q4_K_M in the Wild
Quantization isn’t just about bits — it’s also about which base model you’re shrinking. Let’s look at four popular models, all in q4_K_M format (a common 4-bit quantization that balances speed and quality).
🦙 Llama 3.1:8B Instruct (q4_K_M)
- Strengths: General-purpose, strong reasoning, widely supported in tools like Ollama.
- Weaknesses: At 8B parameters, even quantized it needs ~4.5GB RAM.
- Best for: Balanced workloads — coding help, Q&A, reasoning tasks on mid-tier GPUs or beefy CPUs.
💎 Gemma 2:2B Instruct (q4_K_M)
- Strengths: Tiny and fast. Fits in ~2GB RAM when quantized. Great for edge devices or laptops.
- Weaknesses: Accuracy takes a hit — can feel simplistic on reasoning-heavy prompts.
- Best for: Lightweight assistants, translation tasks, mobile/embedded AI.
🐉 Qwen 2.5:14B Instruct (q4_K_M)
- Strengths: Huge knowledge base, excellent at multilingual tasks.
- Weaknesses: Big boy. Even quantized, needs ~7–8GB RAM — not for weak CPUs.
- Best for: Servers, high-VRAM GPUs, or workstation-grade CPUs. If you want richer outputs in multiple languages, this is your pick.
⚡ DeepSeek-R1:8B Llama-Distill (q4_K_M)
- Strengths: Distilled from the full DeepSeek-R1 onto a Llama 3.1 8B base, so it’s far smaller and faster than the original R1 while keeping much of its reasoning quality.
- Weaknesses: Knowledge coverage is narrower than Llama/Qwen.
- Best for: Real-time apps, chatbots, or pipelines where latency matters more than encyclopedic knowledge.
When to Use Which
| Model | Size (Quantized Q4_K_M) | Best Use Case | Hardware Target |
|---|---|---|---|
| Llama3.1:8B | ~4.5 GB | General assistant, coding, reasoning | Mid-tier GPU / 16GB RAM CPU |
| Gemma2:2B | ~2 GB | Lightweight tasks, mobile/edge AI | Laptop CPU / SBC (Raspberry Pi) |
| Qwen2.5:14B | ~7–8 GB | Knowledge-rich, multilingual | High-VRAM GPU / server CPU |
| DeepSeek-R1:8B | ~4.5 GB | Fast response, low-latency apps | Mid-tier GPU / CPU with AVX2 |
Quick Example: Running LLaMA3 on CPU with Ollama
Let’s say you want to run llama3.1:8b locally without a GPU. You’d grab the Q4_K_M quantized variant:
ollama run llama3.1:8b-instruct-q4_K_M
This trims memory requirements from ~16GB down to ~4.5GB, making it feasible on a modern CPU with 8–16GB RAM.
If you tried Q2, the size drops under 2GB, but responses might start sounding like Yoda after a bad day.
Conclusion
Quantization is the secret sauce that makes local AI possible.
- Floating point models → pristine accuracy, but resource hungry.
- Quantized models (Q2/Q4/Q6/Q8) → trade a few IQ points for massive savings in memory and speed.
- Model choices (Llama, Gemma, Qwen, DeepSeek) → pick based on your hardware and whether you want knowledge depth, speed, or portability.
So the next time you’re deciding between llama3.1:8b-instruct-q4_K_M and gemma2:2b-instruct-q4_K_M, remember: it’s not just about size, it’s about matching the model to the machine you have.
⚡ Next post preview: we’ll dive into vLLM, an optimized inference engine that makes quantized and full-precision models fly using continuous batching and PagedAttention (paged KV-cache memory). If quantization makes LLMs fit, vLLM makes them fast. Stay tuned!
Happy Coding!!!