GGUF Quantization: Q4_K_M, Q8_0, IQ4

If you are running models via Ollama or other local runtimes, you've likely encountered names like Q4_K_M, Q8_0, or IQ4_XS. What do they mean? Which one should you choose? Why is Q4 often better than Q8, and when is that not the case?

In this article, I break down quantization without unnecessary theory – with tables, real numbers, and examples from my own projects on a MacBook M1 with 16 GB RAM.

What is Language Model Quantization — Explained Simply

A language model is a set of numerical weights. Each weight determines how a neuron reacts to input data. In full precision (FP16), each weight takes up 16 bits — two bytes. A 7-billion parameter model weighs around 14 GB at this precision.

Quantization is reducing the number of bits per weight. Q8 uses 8 bits, Q4 uses 4 bits, and Q2 uses only 2. The fewer bits, the smaller the file and the less memory is needed to run it.
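The size arithmetic above can be sketched in a few lines of Python. The function name and the decimal-gigabyte convention are assumptions made for illustration. Real GGUF files come out slightly larger because of metadata and mixed-precision tensors.

```python
def weights_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at the bit widths mentioned above
for bits in (16, 8, 4, 2):
    print(f"7B at {bits:>2} bits/weight: ~{weights_size_gb(7, bits):.1f} GB")
```

The 16-bit row reproduces the ~14 GB figure for FP16. The lower rows are idealized, since real formats such as Q4_K_M use a bit more than their nominal bit count.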

A good analogy is JPEG vs. RAW photos. RAW offers maximum detail but takes up 25 MB. A JPEG at 85% quality is 3 MB, and most people won't notice the difference. However, if you scale the image or crop it aggressively, artifacts become noticeable. It's similar with models: in regular chat, Q4 and Q8 are practically identical, but in mathematics or complex code, the difference is already palpable.

Key takeaway: Quantization is compression, not degradation. Properly chosen quantization preserves 92–95% of quality while reducing file size by 3–4 times.
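To make the compression idea concrete, here is a minimal sketch that rounds a handful of weights to 8-bit and 4-bit integers with one shared scale and then restores them. This is a simplified illustration, not the actual llama.cpp algorithm. The function names and the sample weights are made up.

```python
def quantize(weights, bits):
    """Symmetric quantization: map floats to signed integers sharing one scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8 bits, 7 for 4 bits
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Map the stored integers back to approximate float weights."""
    return [q * scale for q in quantized]

def max_error(weights, bits):
    """Largest absolute difference between original and restored weights."""
    quantized, scale = quantize(weights, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(weights, restored))

weights = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.45, 0.02]  # made-up values
print("8-bit max error:", max_error(weights, 8))
print("4-bit max error:", max_error(weights, 4))
```

The 4-bit round trip loses more detail than the 8-bit one, which is the trade-off the rest of the article is about.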

Practical example: In my project AskYourDocs, a RAG system processes corporate documents using local Ollama on an M1 chip. The Qwen3-14B model in Q4_K_M takes up ~8.5 GB and fits within 16 GB of unified memory, leaving space for context and the embedding model. In FP16, it would have taken ~28 GB and simply wouldn't have launched.

Why Quantization is Necessary for Local AI

Without quantization, local AI on consumer hardware is simply not feasible. Here are the numbers:

Llama 3.3 70B in FP16 — ~140 GB. Such hardware costs tens of thousands of dollars.
The same Llama 3.3 70B in Q4_K_M — ~40 GB. Runs on a Mac Studio or two RTX 4090s.
A 7B model in FP16 is 14 GB. In Q4_K_M it is 4.1 GB, which even runs on a graphics card with 6 GB of VRAM.

Besides saving memory, quantization also leads to faster startup: a smaller file loads into memory more quickly. And due to the smaller volume of data read, token generation also speeds up (more on this in the Q4 vs Q8 section).

Key takeaway: Quantization is not a compromise, but the only way to run modern LLMs on a home PC or laptop.

Where it's used: local RAG systems, personal assistants, autonomous agents without the cloud, corporate tools on isolated servers — anywhere the model needs to run without API keys and without sending data externally.

What is the GGUF Format and Why It Won

GGUF (GPT-Generated Unified Format) is a file container for quantized models. It was created by Georgi Gerganov, the author of llama.cpp, in 2023 as a replacement for the older GGML format.

The main advantage of GGUF is its self-sufficiency: a single file contains everything needed to run:

quantized model weights;
the tokenizer and its configuration;
architecture metadata;
hyperparameters.

With the old GGML, you needed separate files for weights and configuration, which complicated distribution. GGUF solved this problem and quickly became the standard: today, it's supported by Ollama, LM Studio, Open WebUI, Jan, KoboldCpp, and most other local runtimes.

Personally, for my daily work with GGUF models, I use LM Studio — on my MacBook M1, generation is noticeably faster than in Ollama, plus it has a convenient interface for quickly switching between models and quantizations. If you haven't tried it yet, I recommend it as a starting point.

In short: GGUF is not a type of quantization, but a container. Quantization (Q4, Q8, etc.) is what's inside the GGUF file.
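As a rough illustration of the "container" idea, the sketch below reads the fixed-size header at the start of a GGUF file. The layout assumed here (4-byte magic, 32-bit version, 64-bit tensor and metadata counts, little-endian) is my understanding of GGUF version 2 and later. The function name and the demo values are hypothetical.

```python
import io
import struct

def read_gguf_header(stream):
    """Parse the fixed-size GGUF header (layout assumed for version 2 and later)."""
    magic = stream.read(4)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata key/value count
    version, tensor_count, metadata_kv_count = struct.unpack("<IQQ", stream.read(20))
    return {"version": version, "tensors": tensor_count, "metadata_keys": metadata_kv_count}

# Demo on a synthetic header; for a real file, pass open("model.gguf", "rb") instead.
fake_file = io.BytesIO(b"GGUF" + struct.pack("<IQQ", 3, 291, 25))
print(read_gguf_header(fake_file))
```

The metadata entries that follow the header are where the tokenizer, architecture, and hyperparameters live. Parsing them is left out to keep the sketch short.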

Example: when you run ollama pull qwen3:14b, Ollama downloads a GGUF file with Q4_K_M quantization by default. But you can explicitly specify another: ollama run hf.co/bartowski/Qwen_Qwen3-14B-GGUF:Q8_0.

From FP16 to Q2: Quantization Levels in a Table

Each quantization level is a trade-off between quality and size. Here's how a 7B model looks at different levels:

Format	Bits/Weight	7B Model Size	Quality	For Whom
FP16	16	~14 GB	Benchmark (100%)	Researchers, fine-tuning
Q8_0	8	~7.7 GB	~99%	12–16 GB VRAM GPUs
Q6_K	6	~5.9 GB	~98.5%	10–12 GB GPUs, code and logic
Q5_K_M	5.5	~4.8 GB	~97%	8–10 GB GPUs, quality balance
Q4_K_M	4.5	~4.1 GB	~95%	Recommendation for most users
Q3_K_M	3.5	~3.1 GB	~88%	Limited hardware, testing
Q2_K	2.5	~2.7 GB	~75%	Experiments only

File sizes are taken from real GGUF measurements from llama.cpp. Quality percentages are rough estimates derived from perplexity relative to FP16, not exact benchmark scores.

The main point here: the quality difference between Q4_K_M and Q8_0 is about 4%. The size difference is almost double. This is why Q4_K_M wins in most scenarios.

Example: research from early 2026 confirms that logical reasoning is very resistant to quantization, while arithmetic starts to degrade below Q4. For regular chat and translation, Q4 is perfectly sufficient.

What Q4_K_M, Q5_K_M, Q8_0 Mean — Suffixes Explained

This is one of the most common questions among beginners, and almost no article explains it properly. Let's break it down in detail.

Name Structure: Q[bits]_[method]_[variant]

Q4 — the approximate number of bits per weight. It's not always an integer: Q4_K_M actually uses ~4.5 bits due to mixed precision.

_K — K-Quants. This is a smarter quantization scheme that distributes bits unevenly: layers that have the most impact on quality (attention, output projection) receive more bits, while others receive fewer. This allows for higher quality at the same average size compared to the old uniform approach.

_M / _S / _L — the size variant within a single level:

_S (Small) — the smallest size in its class, slightly lower quality.
_M (Medium) — a balance between size and quality. This is the standard recommendation.
_L (Large) — the highest quality in its class, slightly larger file.

_0 — the old uniform scheme (e.g., Q4_0, Q8_0). All weights are quantized equally, without priorities. Q8_0 is widely used due to its simplicity, but Q4_0 has been practically replaced by Q4_K_M at the same size.
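The naming rules above can be turned into a small parser. This is a sketch covering only the name shapes discussed in this article (Q4_K_M, Q6_K, Q8_0, IQ4_XS). The function name and the returned labels are my own choices.

```python
import re

def parse_quant_name(name):
    """Split a quantization name into bits, scheme, and size variant."""
    match = re.fullmatch(r"(I?Q)(\d+)_(\w+)", name.upper())
    if not match:
        raise ValueError(f"unrecognized quantization name: {name}")
    family, bits, rest = match.groups()
    parts = rest.split("_")
    if family == "IQ":
        scheme, variant = "i-quant", parts[0]          # e.g. IQ4_XS
    elif parts[0] == "K":
        scheme = "k-quant"                             # e.g. Q4_K_M, Q6_K
        variant = parts[1] if len(parts) > 1 else None
    else:
        scheme, variant = "legacy", None               # e.g. Q4_0, Q8_0
    return {"bits": int(bits), "scheme": scheme, "variant": variant}

for name in ("Q4_K_M", "Q6_K", "Q8_0", "IQ4_XS"):
    print(name, "->", parse_quant_name(name))
```

Note that the "bits" value is the nominal number from the name, not the real average, which for Q4_K_M is about 4.5.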

Comparison Table of Popular Quantizations

Quantization	7B Size	Quality	Speed	Recommendation
Q4_K_S	~3.8 GB	⭐⭐⭐⭐	⭐⭐⭐⭐⭐	CPU+GPU offloading
Q4_K_M	~4.1 GB	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	Standard for most
Q5_K_M	~4.8 GB	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	Code, agents, RAG
Q6_K	~5.9 GB	⭐⭐⭐⭐⭐	⭐⭐⭐	Math, structured output
Q8_0	~7.7 GB	⭐⭐⭐⭐⭐	⭐⭐	Maximum quality, 16+ GB VRAM

Remember: among K-quants, _M is the variant to look for. It signifies that the file uses mixed precision and keeps the most critical layers at higher detail.

Example: if you see two files on Hugging Face — model-Q4_0.gguf and model-Q4_K_M.gguf of approximately the same size — choose Q4_K_M. The old Q4_0 scheme provides noticeably worse quality at the same file size.

IQ-Quantizations: What is IQ4_XS and Why It Matters in 2026

Almost no article covers this topic. Yet IQ-quants already appear in standard bartowski downloads and are actively discussed in the llama.cpp community.

What is an imatrix (importance matrix)

Standard quantization decides how to round the weights without looking at how the model behaves on real data. An importance matrix (imatrix) adds a preliminary analysis: calibration data is passed through the model to measure which weights have the greatest impact on output quality. These weights are quantized more carefully, while others are quantized more aggressively.

The result: a smaller file with similar or even better quality than K-quants of the same level.
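The idea of "quantize important weights more carefully" can be shown with a toy example: pick the rounding scale that minimizes an importance-weighted error instead of a plain one. This is only an illustration of the principle, not how llama.cpp implements imatrix quantization. The candidate-scale search, the sample weights, and the importance values are all invented.

```python
def quantize_with_scale(weights, scale, qmax=7):
    """Round to signed 4-bit integers (clamped) and map back to floats."""
    return [max(-qmax, min(qmax, round(w / scale))) * scale for w in weights]

def weighted_error(weights, restored, importance):
    """Squared error where each weight counts in proportion to its importance."""
    return sum(i * (w - r) ** 2 for w, r, i in zip(weights, restored, importance))

def best_scale(weights, importance, qmax=7, steps=50):
    """Try scales around the naive one and keep the lowest weighted error."""
    naive = max(abs(w) for w in weights) / qmax
    candidates = [naive * (0.7 + 0.6 * k / steps) for k in range(steps + 1)]
    candidates.append(naive)  # the naive scale is always among the candidates
    return min(
        candidates,
        key=lambda s: weighted_error(weights, quantize_with_scale(weights, s, qmax), importance),
    )

weights = [0.11, -0.42, 0.90, -0.05, 0.31, -0.27]     # made-up values
importance = [5.0, 0.2, 0.1, 4.0, 0.3, 0.2]           # made-up "imatrix" entries

naive_scale = max(abs(w) for w in weights) / 7
tuned_scale = best_scale(weights, importance)
naive_err = weighted_error(weights, quantize_with_scale(weights, naive_scale), importance)
tuned_err = weighted_error(weights, quantize_with_scale(weights, tuned_scale), importance)
print("naive weighted error:", naive_err)
print("tuned weighted error:", tuned_err)
```

The tuned scale can never do worse than the naive one on the weighted error, because the naive scale is one of the candidates. How much it helps depends entirely on the calibration data behind the importance values.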

IQ4_XS vs Q4_K_M: A Practical Comparison

Parameter	Q4_K_M	IQ4_XS
Average 7B Size	~4.1 GB	~3.9 GB
Quality (Perplexity)	Baseline	Similar or better*
Generation Speed	Standard	Slightly faster
Prompt Processing Speed	Standard	Slightly slower
Dependency on imatrix	None	High*
CPU Speed	Better	Slower

*IQ4_XS only provides an advantage if a high-quality imatrix file is used. A poorly calibrated IQ-quant can be inferior to Q4_K_M.

For a 70B model, the difference between IQ4_XS and Q4_K_M is 3–4 GB, which can be critical for whether the model fits into memory at all.

Where to Get Verified IQ-Quants

Not all GGUF files on Hugging Face are created equal. Three reliable sources:

bartowski — the most active and reliable uploader, includes imatrix.
mradermacher — broad model coverage.
TheBloke — a classic archive, less active in 2026, but still reliable.

Key takeaway: IQ4_XS is not always better than Q4_K_M. But if you need to fit a larger model into the same memory, IQ4_XS offers a 3–4 GB advantage on a 70B model with similar quality.

Practical scenario: I have a MacBook M1 with 16 GB RAM. Qwen3-14B in Q4_K_M takes ~8.5 GB — it fits. But if you add nomic-embed-text (~280 MB) and an 8K context, swapping begins. IQ4_XS of the same Qwen3-14B takes ~7.9 GB, and the system breathes more freely.

In my work, I use a two-stage strategy. For testing, writing articles, and general tasks I use a lighter model in Q4_K_M, which launches quickly and doesn't overload the system. But when an agent moves on to tool calling or complex multi-step logic, I switch to a stronger model in Q5_K_M or Q6_K, where JSON schema accuracy and step consistency are critical. Q4 sometimes "loses" the logic in the middle of the chain here.

How Much RAM and VRAM is Needed: Table and Formula

Here's a practical table with the sizes of popular models in different quantizations. Data is based on measurements from willitrunai.com and official GGUF files:

Model	FP16	Q8_0	Q5_K_M	Q4_K_M	Q3_K_M
7B / 8B	~14 GB	~7.7 GB	~4.8 GB	~4.1 GB	~3.1 GB
14B	~28 GB	~14.9 GB	~9.6 GB	~8.5 GB	~6.4 GB
32B	~64 GB	~32 GB	~21 GB	~18 GB	~13 GB
70B	~140 GB	~75 GB	~49.9 GB	~42.5 GB	~32 GB

Formula for Your Own Calculations

RAM (GB) = Parameters (billion) × Bytes_per_weight × 1.2

Bytes per weight:
  FP16   → 2.0
  Q8_0   → 1.0
  Q5_K_M → 0.69
  Q4_K_M → 0.55
  Q3_K_M → 0.41

Additionally:
  + 1–2 GB for KV cache (with 4K context)
  + 2 GB for OS and other processes

Formula source: localaimaster.com.

Calculation example: Qwen3-14B in Q4_K_M → 14 × 0.55 × 1.2 = 9.24 GB for weights + ~1.5 GB KV cache with 4K context ≈ 10.7 GB, so plan for at least 11–12 GB of VRAM.

In short: the file size is just the weights. Add the KV cache (depends on context) and 2 GB for the system — and you'll get the actual memory requirement.
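The formula from this section translates directly into a small helper. The bytes-per-weight values and the 1.2 factor are the ones listed above. The function name and the default KV cache and OS figures (1.5 GB and 2 GB) are illustrative assumptions within the ranges given.

```python
BYTES_PER_WEIGHT = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q5_K_M": 0.69,
    "Q4_K_M": 0.55,
    "Q3_K_M": 0.41,
}

def estimate_memory_gb(params_billion, quant, kv_cache_gb=1.5, os_gb=2.0):
    """Apply: parameters (billion) x bytes per weight x 1.2, then add KV cache and OS."""
    weights = params_billion * BYTES_PER_WEIGHT[quant] * 1.2
    return {
        "weights": round(weights, 2),
        "with_kv_cache": round(weights + kv_cache_gb, 2),
        "with_os": round(weights + kv_cache_gb + os_gb, 2),
    }

print(estimate_memory_gb(14, "Q4_K_M"))
print(estimate_memory_gb(14, "Q8_0"))
```

For a dedicated GPU, the "with_kv_cache" figure is the one to compare against VRAM. On a machine where the model shares memory with the operating system, "with_os" is the more honest number.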

Important for MacBook Apple Silicon: unified memory is shared between CPU and GPU, so approximately 75–80% of the total amount is actually available. On an M1 with 16 GB, this is about 12–13 GB effectively for the model.

Choosing Quantization for Your Hardware

A quick practical cheat sheet:

Hardware	Recommendation	Example
4–6 GB VRAM (GTX 1650, RX 6500)	Q4_K_M on 3B models	Phi-4-mini, Gemma2 2B
6–8 GB VRAM (RTX 3060, RX 6600)	Q4_K_M on 7B models	Qwen3-8B, Llama3.1-8B
12 GB VRAM (RTX 3080, RTX 4070)	Q5_K_M on 14B or Q4_K_M on 14B	Qwen3-14B, Phi-4
16–24 GB VRAM (RTX 4080/4090)	Q6_K or Q8_0 on 14B; Q4_K_M on 32B	DeepSeek-R1-32B, Qwen3-32B
MacBook M1/M2 16 GB	Q4_K_M or IQ4_XS on 14B	Qwen3-14B Q4_K_M (~8.5 GB)
MacBook M2/M3 Pro 36 GB	Q4_K_M on 32B or Q8_0 on 14B	Qwen3-32B Q4_K_M (~18 GB)
Mac Studio / M4 Max 64 GB	Q4_K_M on 70B or Q5_K_M on 32B	Llama 3.3 70B Q4_K_M (~40 GB)
32–64 GB RAM (CPU-only)	Q4_K_M maximum, avoid Q8	Qwen3-32B Q4_K_M, slow

What to do if the model doesn't fit entirely into VRAM

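If the model is larger than VRAM, llama.cpp and Ollama can keep some of the layers on the CPU and run the rest on the GPU. This is called partial offloading (layer offloading). Ollama chooses the split automatically; in llama.cpp the number of GPU layers can also be set by hand with the --n-gpu-layers (-ngl) option.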

When to choose Q4_K_S over Q4_K_M

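With partial offloading, the layers left on the CPU are read from system RAM, which is much slower than VRAM, so every extra layer that fits on the GPU counts. A smaller file means more layers fit in VRAM. Therefore, Q4_K_S is recommended for partial offloading: it's about 7–8% smaller than Q4_K_M (~3.8 GB vs ~4.1 GB for a 7B model) with a negligible difference in quality.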
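One way to see why a smaller file helps with partial offloading is to estimate how many layers stay on the GPU. The sketch below assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and buffers. Both are simplifications, and the layer count of 40 and the 1.5 GB reserve are assumed values.

```python
import math

def gpu_layers_that_fit(file_size_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many equally sized layers fit in VRAM after a fixed reserve."""
    per_layer_gb = file_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))

# Illustrative: a 14B model at ~8.5 GB (Q4_K_M), 40 layers assumed, 8 GB of VRAM
print(gpu_layers_that_fit(8.5, 40, 8.0))
# The same model if the file were ~7% smaller
print(gpu_layers_that_fit(8.5 * 0.93, 40, 8.0))
```

The result is the kind of number you would experiment with as the GPU layer count in llama.cpp. Treat it as a starting point and adjust based on actual memory use.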

KV Cache Quantization: A Hidden Memory Gain

Few people know that in addition to quantizing model weights, you can also quantize the KV cache. This is the memory the model uses to store context during generation.

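In Ollama, this is controlled by an environment variable. According to the Ollama documentation, the quantized KV cache only takes effect when Flash Attention is enabled, so set both:

OLLAMA_FLASH_ATTENTION=1 OLLAMA_KV_CACHE_TYPE=q8_0 ollama serve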

Results of actual measurements based on llama.cpp Discussion #20969:

KV Cache Type	KV Buffer Memory	Savings	Speed with 110K Context
f16 (standard)	768 MB	—	38 tokens/s
q8_0	408 MB	−47%	25 tokens/s
q4_0	216 MB	−72%	24 tokens/s ⚠️

Conclusion from the table: q8_0 KV cache is an almost free win in terms of quality: roughly half the memory with no noticeable loss in quality, although in this measurement generation at 110K context was also about 34% slower (25 vs 38 tokens/s). q4_0 KV cache is risky with long contexts: at 110K tokens, generation speed drops by 37%, and the quality of structured output and code degrades.
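The memory column in the table follows from how many bits each cache type stores per value. In the q8_0 and q4_0 block formats, 32 values share one 16-bit scale, which gives 8.5 and 4.5 bits per value. The sketch below applies that ratio to the 768 MB f16 buffer from the table. The function name is my own.

```python
# Bits stored per cached value. q8_0 and q4_0 pack 32 values into a block
# together with one 16-bit scale: (32*8 + 16)/32 = 8.5 and (32*4 + 16)/32 = 4.5.
BITS_PER_VALUE = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def kv_buffer_mb(f16_buffer_mb, cache_type):
    """Scale an f16 KV buffer size to another cache type."""
    return f16_buffer_mb * BITS_PER_VALUE[cache_type] / 16.0

for cache_type in BITS_PER_VALUE:
    size_mb = kv_buffer_mb(768, cache_type)
    saving = 1 - size_mb / 768
    print(f"{cache_type:>5}: {size_mb:.0f} MB, saves {saving:.0%}")
```

This reproduces the 408 MB and 216 MB figures, so the same ratios can be applied to any other f16 KV buffer size.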

Key takeaway: If you're tight on memory, try OLLAMA_KV_CACHE_TYPE=q8_0 first. It provides ~47% KV memory savings with virtually no quality loss. Only then consider lower weight quantization.

Where this is especially important: in RAG systems like AskYourDocs, where large document chunks are loaded into context — the KV cache can occupy as much memory as the model itself.
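How large the KV cache gets can be estimated from the model's shape: keys and values are stored for every layer, every KV head, and every token in the context. The sketch below uses this standard relationship. The shape values (40 layers, 8 KV heads, head dimension 128) are an assumption for a 14B-class model with grouped-query attention, so check your own model's metadata.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2.0):
    """Keys and values (factor 2) for every layer, KV head, and token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 2**30

# Assumed shape: 40 layers, 8 KV heads, head dimension 128, f16 cache (2 bytes per value)
for context in (8192, 32768):
    print(f"{context} tokens: {kv_cache_gib(40, 8, 128, context):.2f} GiB")
```

Because the size grows linearly with context length, a long-context RAG setup can end up with a cache measured in gigabytes even when the weights themselves fit comfortably.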

Why Q4_K_M is Recommended More Often Than Q8

At first glance, the logic is simple: Q8 means better quality. But there are three arguments in favor of Q4_K_M that are usually not mentioned.

1. Q8_0 is Slower Than Q4_K_M

LLM generation is limited by memory bandwidth, not computation. The GPU spends more time reading weights from VRAM than performing matrix multiplications. A larger file means more bytes to read per token.

According to the official llama.cpp README for Llama-3.1-8B: Q8_0 generates tokens 29% slower than Q4_K_M.
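The bandwidth argument can be turned into a back-of-the-envelope upper bound: if every generated token requires reading all the weights once, tokens per second cannot exceed memory bandwidth divided by model size. This is a simplified model for dense models that ignores compute, caches, and the KV cache. The bandwidth value below is a placeholder, not a measurement.

```python
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    """Upper bound: each generated token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

bandwidth = 100.0  # GB/s, placeholder; look up the real figure for your hardware
for name, size_gb in (("Q4_K_M", 4.1), ("Q8_0", 7.7)):
    bound = max_tokens_per_second(bandwidth, size_gb)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Real measurements show a smaller gap than this bound suggests, since generation is not purely bandwidth-limited, but the direction is the same: fewer bytes per token means faster generation.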

2. A Larger Model in Q4 > A Smaller Model in Q8

This is the most important rule for choosing quantization, which most people ignore:

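A larger model in Q4 usually outperforms a smaller model at higher precision when both fit into the same memory. For example, compare a 14B model in Q4_K_M (~8.5 GB) with a 7B model in Q8_0 (~7.7 GB).

The right strategy: first choose the largest model that fits into memory at Q4, and only then spend any leftover memory on higher precision. Not the other way around.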
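The "largest model first, precision second" rule can be written as a simple search. The sizes below come from the memory table earlier in this article. The function name and the flat 2 GB overhead for KV cache and buffers are assumptions for illustration.

```python
# File sizes in GB, taken from the memory table in this article
SIZES_GB = {
    "7B":  {"Q8_0": 7.7,  "Q5_K_M": 4.8,  "Q4_K_M": 4.1},
    "14B": {"Q8_0": 14.9, "Q5_K_M": 9.6,  "Q4_K_M": 8.5},
    "32B": {"Q8_0": 32.0, "Q5_K_M": 21.0, "Q4_K_M": 18.0},
    "70B": {"Q8_0": 75.0, "Q5_K_M": 49.9, "Q4_K_M": 42.5},
}
MODELS_LARGEST_FIRST = ["70B", "32B", "14B", "7B"]
QUANTS_BEST_FIRST = ["Q8_0", "Q5_K_M", "Q4_K_M"]

def pick_model(memory_budget_gb, overhead_gb=2.0):
    """Largest model first; within it, the highest precision that still fits."""
    for model in MODELS_LARGEST_FIRST:
        for quant in QUANTS_BEST_FIRST:
            if SIZES_GB[model][quant] + overhead_gb <= memory_budget_gb:
                return model, quant
    return None  # nothing fits in this budget

for budget in (8, 12, 24, 48):
    print(f"{budget} GB -> {pick_model(budget)}")
```

The outer loop is over model size and the inner loop over precision, which encodes the priority order directly: a bigger model at Q4_K_M is always preferred over a smaller one at Q8_0.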

3. Q4_K_M Retains 92–95% of FP16 Quality

For everyday chat, translation, summarization, and most text-based tasks, a 5–8% difference in quality is practically imperceptible. This is why Ollama uses Q4_K_M as its default quantization. You don't need to think about it: just run the model.

The main point here: Q4_K_M wins not because Q8 is bad — but because the freed-up memory allows you to run a model twice the size, which provides a greater quality increase.

Example with numbers: on an RTX 4070 12GB, the choice is between Qwen3-8B Q8_0 (~7.7 GB) and Qwen3-14B Q4_K_M (~8.5 GB). In real tests, the latter provides noticeably better results on complex tasks despite lower quantization.

When It's Worth Paying for Q5, Q6, or Q8

There are scenarios where a higher quantization level is truly justified:

Code and Structured Output → Q5_K_M or Q6_K

Code generation requires strict adherence to syntax. A missing parenthesis or incorrect indentation means the code won't compile. Q5_K_M noticeably reduces the frequency of syntax errors compared to Q4.

Mathematics and Reasoning → Q5_K_M Minimum

Studies from 2026 confirm: arithmetic reasoning degrades sharply below Q4. For tasks involving step-by-step calculations, especially with long chains of logic — Q5_K_M or Q6_K.

RAG Systems with Long Context → Q5_K_M

When loading large documents into context, the model needs to "hold" facts from the beginning of the text to the end of the answer. Q4 handles this, but Q5_K_M is more reliable with contexts of 16K+.

Agents with Tool Calling → Q5_K_M+

Agents must strictly adhere to the JSON schema for tool calls. Q4 sometimes generates invalid JSON under load. In my agent system, I've already switched to Q5_K_M specifically for this reason.
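Whatever quantization you use, it helps to validate tool calls before executing them, so an invalid JSON response can be retried instead of crashing the agent. Below is a minimal sketch using only the standard library. The call shape ("name" plus "arguments") and the tool name search_docs are hypothetical, not a specific framework's format.

```python
import json

def parse_tool_call(raw, required_args):
    """Return the parsed call, or None if the output is not a valid tool call."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict) or not isinstance(call.get("name"), str):
        return None
    if not isinstance(call.get("arguments"), dict):
        return None
    if not all(key in call["arguments"] for key in required_args):
        return None
    return call

good = '{"name": "search_docs", "arguments": {"query": "vacation policy"}}'
bad = '{"name": "search_docs", "arguments": {"query": "vacation policy"}'  # missing closing brace

print(parse_tool_call(good, ["query"]))
print(parse_tool_call(bad, ["query"]))
```

Counting how often this returns None for the same prompts at Q4_K_M and Q5_K_M is a simple way to check whether the higher quantization pays off for your own agent.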

Small Models (3B–7B) with Large VRAM → Q8_0

If you have 16+ GB of VRAM and are running a small 7B model — Q8_0 is justified: the model fits entirely, and the quality gain is noticeable.

Key takeaway: Move to a higher-bit quantization not "for quality in general," but for a specific task. For code and agents, Q5_K_M is the minimum. For chat and translation, Q4_K_M is sufficient.

Does Quantization Degrade Response Quality — Real Examples

I compared Q4_K_M and Q8_0 on Qwen3-14B via Ollama on an M1. Here's what I got for different types of tasks:

Task	Q4_K_M	Q8_0	Difference
General chat, explanations	Excellent	Excellent	Not noticeable
Translation (UK/EN/DE)	Excellent	Excellent	Practically none
Article writing (SEO)	Excellent	Excellent	Minimal
Code generation (Java/Spring Boot)	Good	Better	Noticeable on complex functions
Mathematics, calculations	Satisfactory	Good	Noticeable on multi-step tasks
JSON-structured output (agents)	Occasional errors	Stable	Noticeable under load

Remember: 80% of tasks do not require Q8. But if your task involves code, agents, or mathematics — Q5_K_M will be a noticeable improvement even without switching to Q8.

Important nuance: the difference between Q4_K_M and Q8_0 on a 14B model is smaller than the difference between a 7B Q8_0 and a 14B Q4_K_M. Model size is more important than quantization level — always choose a larger model at a lower quantization if you have a choice.

GGUF vs GPTQ vs AWQ vs EXL2: What to Choose in 2026

GGUF is not the only quantization format. Here's the full picture:

Format	For Whom	Runtime	Pros	Cons
GGUF	Most local users	Ollama, LM Studio, llama.cpp	CPU+GPU, any hardware, single file	Slower than AWQ on GPU servers
GPTQ	NVIDIA GPUs, servers	vLLM, AutoGPTQ, text-generation-webui	First practical 4-bit scheme, wide support	Inferior to AWQ in quality/speed
AWQ	NVIDIA GPU production	vLLM, TGI	Best throughput on GPU, Marlin kernel	GPU only, more complex setup
EXL2	Advanced users, single GPU	ExLlamaV2	Best interactive quality on a single GPU	Smaller community, more complex installation

More on GPTQ and AWQ: LLM Quantization Guide 2026 — tensorrigs.com.

If you are a home user or a developer on a MacBook/PC — GGUF Q4_K_M via Ollama is the only format you need to think about. GPTQ and AWQ are for GPU servers and production infrastructure.

Where each is applied: GGUF — local dev, RAG prototypes, personal assistant. AWQ — production API, vLLM server with dozens of parallel requests. EXL2 — enthusiasts squeezing the maximum out of a single RTX 4090.

How to download the correct GGUF file from Hugging Face

I'll show this with my own example, since I regularly download models from Hugging Face. The page bartowski/Qwen_Qwen3-14B-GGUF lists all the quantizations with descriptions of size and recommendations:

Step 1. Open the model page on Hugging Face. Look for repositories from bartowski, mradermacher, or TheBloke — they have imatrix and verified quality.

Step 2. In the "Files and versions" tab, filter files by the .gguf extension.

Step 3. Determine the required quantization from the table above. For most, it's Qwen3-14B-Q4_K_M.gguf.

Step 4. Download directly through Ollama:

ollama run hf.co/bartowski/Qwen_Qwen3-14B-GGUF:Q4_K_M

Or via Hugging Face CLI:

pip install huggingface_hub
huggingface-cli download bartowski/Qwen_Qwen3-14B-GGUF --include "*Q4_K_M.gguf"

To check the quantization of an already downloaded model in Ollama:

ollama show qwen3:14b

In the output, look for the line quantization — it will indicate the type (e.g., Q4_K_M).
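The same check can be done programmatically through Ollama's local REST API. The sketch below assumes the /api/show endpoint on the default port 11434 and a "details.quantization_level" field in the response, which is how I recall the API documentation. Verify both against your Ollama version. The sample response is hand-written and trimmed.

```python
import json
import urllib.request

def fetch_model_info(model, host="http://localhost:11434"):
    """POST /api/show to a locally running Ollama server (needs the server up)."""
    request = urllib.request.Request(
        f"{host}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def quantization_of(show_response):
    """Pull the quantization label out of the response, if present."""
    return show_response.get("details", {}).get("quantization_level")

# Offline demo with a trimmed, hand-written response
sample = {"details": {"format": "gguf", "quantization_level": "Q4_K_M"}}
print(quantization_of(sample))

# With a server running:
# print(quantization_of(fetch_model_info("qwen3:14b")))
```

Keeping the parsing separate from the network call makes it easy to test the logic without a running server.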

Remember: do not download GGUF files from unknown authors without checking the model card. A bad imatrix makes IQ-quants worse than regular K-quants. Three reliable sources are bartowski, mradermacher, TheBloke.

Common mistakes when choosing quantization

Choosing Q2 or Q3 for the sake of small size. Below Q4, quality degrades unevenly: the model may speak coherently but "lose logic" in the middle of a response. It's better to take a smaller model in Q4 than a larger one in Q2.
Downloading Q8 on hardware without VRAM surplus. Q8 not only takes up more memory but is also slower. If the model barely fits, Q8 will cause swapping and effectively fewer tokens/sec than Q4_K_M.
Confusing model parameters and quantization. "14B Q4" and "7B Q8" are very different things. Parameters (7B/14B/70B) affect quality much more than the quantization level.
Ignoring KV cache when calculating memory. An 8.5 GB model on 12 GB VRAM seems normal. But with a context of 32K, the KV cache can add another 4–6 GB, and the system will freeze.
Choosing IQ-quants without checking the imatrix source. IQ4_XS from an unknown author without a calibration dataset may be inferior to Q4_K_M. Check the model card for the presence of imatrix.
Re-quantizing an already quantized model. If you take a Q8_0 file and quantize it to Q4 — errors accumulate. Always quantize from the original FP16.

Frequently Asked Questions

What does Q4 mean in the model name?

Q4 means that each model weight is compressed to approximately 4 bits (instead of 16 in FP16). This reduces the file size by about 3–4 times while preserving ~92–95% of the quality.

What is better: Q4 or Q8?

It depends on the task. For chat, translation, and writing texts — Q4_K_M is sufficient. For code, math, and agents — Q5_K_M or Q8_0 are noticeably better. But remember: a larger model in Q4 is usually better than a smaller model in Q8.

Is it worth using Q2 or Q3?

No, if there is an alternative. Q2 and Q3 cause significant degradation of logic. It's better to take a smaller model (3B or 7B) in Q4_K_M — it will be both higher quality and faster.

What quantization is best for Ollama?

Q4_K_M is the default standard for Ollama and the right choice for most users. If you have spare memory and your tasks involve code — Q5_K_M.

Does quantization affect generation speed?

Yes. Q8_0 generates tokens ~29% slower than Q4_K_M due to the larger amount of data read. LLM generation is limited by memory bandwidth, not computation.

Why is 70B in Q4 better than 7B in Q8?

Because the number of model parameters affects quality much more than the quantization level. Compare a 70B model at ~40 GB in Q4 with a 7B model at ~7.7 GB in Q8: the former will win on the vast majority of tasks.

What is IQ4_XS and how does it differ from Q4_K_M?

IQ4_XS uses an importance matrix — a preliminary analysis of which weights are most important. Result: a file ~5% smaller with similar quality. But it requires a high-quality imatrix file from a trusted author. Without it, IQ4_XS may be inferior to Q4_K_M.

How does quantization affect long context?

Not directly. But with long context, the KV cache takes up a lot of memory. Set OLLAMA_KV_CACHE_TYPE=q8_0: this reduces the KV buffer by about 47% without noticeable quality loss.

Where to get verified GGUF files?

Personally, I always start with two sources: bartowski — the most active uploader in 2026, publishes imatrix-quants with detailed documentation immediately after new models are released, and mradermacher — a team with broad model coverage, also includes imatrix. TheBloke — a classic archive, but practically not updated anymore: for new models, it's better to look for bartowski or mradermacher immediately.

Can the quantization of an already downloaded model be changed?

Not in place. A GGUF file cannot be switched to a different quantization; you can only download a different GGUF file with the desired quantization or quantize again from the original weights. Important: do not quantize from an already quantized file (e.g., Q8 → Q4), because errors accumulate. Always quantize from FP16.

