If you are already running Ollama on 8 GB RAM and are wondering if it's worth upgrading to 16 GB,
this article provides a concrete answer: not a generic "more RAM is better," but exactly what becomes possible, which models open up, and where an upgrade makes no sense.
If you haven't read about the 8 GB tier yet,
start with the previous article. This one is a direct continuation.
🎯 Honest Arithmetic: How Much RAM is Actually Available for the Model
16 GB on paper ≠ 16 GB for the model. After the OS, browser, and background applications,
8–11 GB is actually available. This window opens up 12B–14B class models in Q4_K_M —
but without much headroom. Understanding this arithmetic is important before choosing a model.
The most common mistake is thinking that 16 GB RAM and 16 GB VRAM are the same thing.
These are fundamentally different scenarios with different models and different performance.
RAM vs VRAM — An Important Distinction
This article is about system RAM, i.e. machines without a discrete GPU or with integrated graphics (including Apple Silicon, where unified memory serves as both RAM and VRAM).
If you have a discrete graphics card with 16 GB of VRAM, that's a separate scenario with much higher performance.
Benchmarks on RTX 4080 16 GB VRAM (Ollama 0.17.7)
show 139 tokens/sec for GPT-OSS 20B — which is a completely different game compared to CPU inference.
The Real Window for a Model on 16 GB RAM
Typical memory distribution on a 16 GB system during operation:
- OS + system processes: 2–3 GB
- Browser (Chrome/Firefox with several tabs): 1–2 GB
- IDE or code editor: 0.5–1 GB
- Background applications: 0.5–1 GB
- Remaining for Ollama: 8–11 GB
This means a 14B model in Q4_K_M (8–9 GB) fits, but with little headroom: expanding the context or running parallel tasks can push part of the model into CPU offloading.
If you close the browser and unnecessary applications, the window expands to 12–13 GB.
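The arithmetic above can be turned into a quick sanity check. A minimal sketch; the overhead figures are the article's typical estimates (taken at their midpoints), not measurements from any specific machine:

```python
# Rough RAM arithmetic for a 16 GB machine. Overhead defaults are the
# article's typical estimates (OS 2-3 GB, browser 1-2 GB, etc.) taken
# at their midpoints; adjust them for your own system.

def ollama_window_gb(total_gb: float,
                     os_gb: float = 2.5,
                     browser_gb: float = 1.5,
                     ide_gb: float = 0.75,
                     background_gb: float = 0.75) -> float:
    """RAM (GB) left for Ollama after typical overhead."""
    return total_gb - (os_gb + browser_gb + ide_gb + background_gb)

def fits(model_gb: float, total_gb: float = 16.0, headroom_gb: float = 0.5) -> bool:
    """True if a quantized model plus a little headroom fits in the window."""
    return model_gb + headroom_gb <= ollama_window_gb(total_gb)

print(ollama_window_gb(16.0))  # 10.5 GB with mid-range overhead
print(fits(9.3))               # Qwen 3 14B Q4_K_M: True (barely)
print(fits(20.0))              # DeepSeek R1 32B: False
```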
What Fits in the 8–11 GB Window
| Model | Size (Q4_K_M / Ollama) | Fits in 16 GB RAM? |
|---|---|---|
| Qwen 3 14B | ~9.3 GB | ✔️ Yes, barely |
| Qwen 2.5 Coder 14B | ~9 GB | ✔️ Yes, barely |
| Phi-4 14B | ~8.5 GB | ✔️ Yes |
| DeepSeek R1 14B | ~9 GB | ✔️ Yes, barely |
| Llama 3.2 Vision 11B | ~7.9 GB | ✔️ Yes, comfortably |
| Gemma 3 12B | ~8.1 GB | ✔️ Yes |
| Qwen 3.5 9B | ~6.6 GB | ✔️ Yes, with room to spare |
| Mistral Small 3 7B | ~4.1 GB | ✔️ Yes, with plenty of room to spare |
| DeepSeek R1 32B | ~20 GB | ❌ No (CPU offloading) |
| Llama 3.3 70B | ~43 GB | ❌ No |
Conclusion: 16 GB RAM opens up the stable 12B–14B tier —
but you need to understand the real 8–11 GB window and not try to run 20B+ models
without being prepared for a significant speed drop.
🎯 7 Models Impossible on 8 GB — and Possible on 16 GB
Short answer:
On 8 GB, you are limited to the 7B–8B class. On 16 GB, the 11B–14B tier opens up —
with significantly better quality for code, math, reasoning, and image analysis.
Here are 7 specific models and what each offers compared to its 8 GB counterpart.
The transition from 8B to 14B is not just "more parameters."
It's a qualitative leap in specific tasks where 7B hits a ceiling.
1. Qwen 2.5 Coder 14B — for code
On 8 GB — Qwen 2.5 Coder 7B (HumanEval 88.4% — already an impressive result for 7B).
On 16 GB — Qwen 2.5 Coder 14B, which excels not so much in simple benchmarks,
but in real-world tasks: complex refactoring, multi-step debugging,
SWE-bench tasks requiring understanding of large codebases.
The 14B version maintains context more stably in long code review sessions.
- ✔️ Size: ~9 GB (Q4_K_M)
- ✔️ Command: `ollama pull qwen2.5-coder:14b`
- ✔️ Advantage over 7B: more complex refactoring, more stable code review on large files, better SWE-bench
- ✔️ License: Apache 2.0
- ✔️ Context: 32K (expandable via YaRN)
2. Qwen 3 14B — reasoning with thinking mode
New model for 2025. On 8 GB — Qwen 3 8B (5.2 GB).
On 16 GB — Qwen 3 14B (9.3 GB) with a hybrid thinking/non-thinking mode:
for complex tasks, the model generates a chain of reasoning within <think> tags;
for simple ones, it responds directly. Qwen 3 4B already competes with Qwen 2.5 72B Instruct
in quality — the 14B version is accordingly even stronger.
- ✔️ Size: ~9.3 GB (Q4_K_M)
- ✔️ Command: `ollama pull qwen3:14b`
- ✔️ Advantage over 8B: deeper reasoning, better instruction following, agent capabilities
- ✔️ License: Apache 2.0
- ✔️ Context: 40K tokens
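When consuming thinking-mode output programmatically, the reasoning block often needs to be stripped. An illustrative helper (not part of Ollama; the `<think>` tag format follows the article's description):

```python
import re

# Illustrative helper (not part of Ollama): split a thinking-mode response
# into its <think>...</think> reasoning and the final answer.
def split_thinking(response: str) -> tuple[str, str]:
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    if not match:
        return "", response.strip()  # non-thinking reply: no reasoning block
    return match.group(1).strip(), response[match.end():].strip()

reasoning, answer = split_thinking("<think>14 * 3 = 42.</think>The answer is 42.")
print(answer)  # The answer is 42.
```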
3. Phi-4 14B — for math and logic
On 8 GB — Phi-4 Mini 3.8B. On 16 GB — the full Phi-4 14B.
MATH benchmark: 80.4%, GPQA Diamond (graduate-level questions): 56.1% —
both metrics exceed GPT-4o, the model that was Phi-4's teacher during training.
HumanEval 82.6% — the best among open-weight models of its size.
Also available is Phi-4-reasoning — a version with reasoning mode
that competes with DeepSeek R1 and o1/o3-mini on math tasks.
- ✔️ Size: ~8.5 GB (Q4_K_M)
- ✔️ Command: `ollama pull phi4`
- ✔️ Advantage over Mini: significantly more complex math and STEM problems
- ⚠️ Limitation: 16K context — not for long documents
- ⚠️ Weakness: IFEval 63.0 — not the best at strict instruction following
- ✔️ License: MIT
4. DeepSeek R1 14B — reasoning without compromise
On 8 GB — DeepSeek R1 8B (slow, reasoning mode).
On 16 GB — the 14B version offers more comfortable reasoning without noticeable pauses.
The <think> tags are normal behavior: the model "thinks aloud"
before the final answer, improving quality on complex tasks.
- ✔️ Size: ~9 GB (Q4_K_M)
- ✔️ Command: `ollama pull deepseek-r1:14b`
- ✔️ Advantage over 8B: significantly faster reasoning, better on complex tasks
5. Llama 3.2 Vision 11B — image analysis
On 8 GB — Gemma 3 4B with basic vision support.
On 16 GB — Llama 3.2 Vision 11B: OCR, screenshot analysis, chart reading,
UI description, technical image analysis.
128K context allows analyzing images with long textual context.
- ✔️ Size: ~7.9 GB (Q4_K_M)
- ✔️ Command: `ollama pull llama3.2-vision:11b`
- ✔️ Advantage over 4B: higher quality OCR, more accurate analysis of complex images
- ✔️ Context: 128K tokens
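For scripted OCR or screenshot analysis, Ollama's REST API accepts base64-encoded images in an `images` array on `/api/generate`. A minimal payload builder (a sketch; the model name is from the article, the prompt is illustrative):

```python
import base64
import json

# Minimal request body for Ollama's /api/generate endpoint, which accepts
# base64-encoded images in an "images" array for vision models.
def vision_payload(model: str, prompt: str, image_bytes: bytes) -> str:
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    })

# In practice: image_bytes = open("screenshot.png", "rb").read(), then
# POST the payload to http://localhost:11434/api/generate
payload = vision_payload("llama3.2-vision:11b",
                         "Extract all text from this screenshot.",
                         b"<png bytes>")
```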
6. Gemma 3 12B — a balanced multimodal option
On 8 GB — Gemma 3 4B. On 16 GB — Gemma 3 12B with multimodality
(text + images), support for 140+ languages, and 128K context.
Gemma 3 is available in sizes 1B, 4B, 12B, and 27B (there is no 9B version).
Google has optimized Gemma 3 for single-accelerator deployment —
efficient memory usage.
- ✔️ Size: ~8.1 GB (Q4_K_M)
- ✔️ Command: `ollama pull gemma3:12b`
- ✔️ Advantage over 4B: significantly better text quality, image analysis, and reasoning
- ✔️ Context: 128K tokens
7. Qwen 3.5 9B — the new sweet spot (March 2026)
The newest model on the list. Qwen 3.5 9B was released in March 2026 —
native multimodality (text + images), 262K context, thinking mode.
It occupies only 6.6 GB in Ollama — fitting comfortably within 16 GB.
It's not a coding-specific model, but works excellently for code review,
debugging, and analyzing error screenshots.
- ✔️ Size: ~6.6 GB (Q4_K_M)
- ✔️ Command: `ollama pull qwen3.5:9b`
- ✔️ Advantage: native vision, thinking mode, massive context
- ✔️ License: Apache 2.0
- ✔️ Context: 262K tokens
Conclusion: 16 GB of RAM unlocks a specific tier of models
where there's a qualitative leap: code (14B Coder for complex refactoring),
math (Phi-4 — 80.4% MATH), reasoning (Qwen 3 14B with thinking mode),
full-fledged vision (11B–12B instead of 4B), and a new sweet spot — Qwen 3.5 9B.
🎯 What Improves for Models Already on 8 GB
On 16 GB, the same 7B–8B models get three bonuses:
higher quantization (Q5 instead of Q4), larger context without degradation,
and the ability to run two models simultaneously for comparison.
Higher Quantization — More Reasoning Fidelity
On 8 GB, Llama 3.3 8B runs in Q4_K_M (~4.7 GB).
On a 16 GB system, Q5_K_M (~5.4 GB) is the better choice: slightly more reasoning fidelity with a minimal difference in speed.
Smaller models are more sensitive to quantization, so for the 8B class, Q5 makes a noticeable difference.
```shell
# Q5_K_M instead of the default Q4_K_M
ollama pull llama3.3:8b-instruct-q5_K_M
```
Larger Context Without Degradation
On 8 GB, extending context to 32K+ noticeably eats into RAM and can trigger
CPU offloading. On 16 GB — the headroom allows comfortable work with 32K–64K context
without significant speed degradation.
For RAG on long documents or analyzing large codebases — a substantial difference.
```
# Extended context via Modelfile
FROM llama3.3:8b
PARAMETER num_ctx 32768
```

Apply it with `ollama create llama3.3-32k -f Modelfile` (the new model name is up to you), then launch via `ollama run llama3.3-32k`.
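A Modelfile is not the only option: the context size can also be set per request via the `options` object of Ollama's API. A sketch of such a request body:

```python
import json

# Per-request alternative to a Modelfile: Ollama's /api/generate accepts
# an "options" object, including num_ctx, on each call.
def request_with_context(model: str, prompt: str, num_ctx: int) -> str:
    return json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # KV-cache size grows with this value
    })

body = request_with_context("llama3.3:8b", "Summarize the document below...", 32768)
```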
Two Models Simultaneously
On 8 GB, keeping two models in memory is practically impossible.
On 16 GB — Mistral 7B (4.1 GB) + embedding model nomic-embed-text (2 GB) = 6.1 GB.
This means RAG search and response generation simultaneously without reloading models.
Or: Qwen 2.5 Coder 7B for autocompletion + Qwen 3.5 9B for chat — a combination
recommended for local development in 2026.
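A minimal sketch of such a two-model workflow: a router that sends completion-style requests to the resident coder model and everything else to the chat model. Model names are from the article; the task categories are hypothetical:

```python
# Illustrative router for a two-model setup (model names from the article;
# the task categories are hypothetical, adapt them to your tooling).
CODER_MODEL = "qwen2.5-coder:7b"  # resident for autocompletion
CHAT_MODEL = "qwen3.5:9b"         # resident for chat and review

def pick_model(task: str) -> str:
    """Send completion-style tasks to the coder model, the rest to chat."""
    completion_tasks = {"autocomplete", "fix", "refactor"}
    return CODER_MODEL if task in completion_tasks else CHAT_MODEL

print(pick_model("autocomplete"))  # qwen2.5-coder:7b
print(pick_model("review"))        # qwen3.5:9b
```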
Section Conclusion: Even if you don't upgrade to the 14B class —
16 GB of RAM improves the experience with 7B–8B models through higher quantization,
larger context, and the ability for parallel execution.
📊 Comparison Table: 8 GB vs 16 GB by Task
Benchmark sources: SitePoint (Best Local LLM Models 2026), LocalLLM.in (16GB benchmark), InsiderLLM (Best Local Coding Models 2026), and the official Phi-4 and Qwen 2.5 Coder technical reports.
| Task | 8 GB RAM | 16 GB RAM | Difference |
|---|---|---|---|
| Code (autocompletion) | Qwen 2.5 Coder 7B (HumanEval 88.4%) | Qwen 2.5 Coder 14B (more complex refactoring, SWE-bench) | Quality on complex tasks |
| Code (chat/review) | Qwen 3.5 9B (6.6 GB, tight fit) | Qwen 3.5 9B + Coder 14B (two models in parallel) | Combined workflow |
| Math / Logic | Phi-4 Mini 3.8B | Phi-4 14B (MATH 80.4%, GPQA 56.1%) | Qualitative leap |
| Reasoning | DeepSeek R1 8B (slow) | Qwen 3 14B with thinking mode or DeepSeek R1 14B | Comfortable speed + hybrid mode |
| Image Analysis | Gemma 3 4B (basic vision) | Llama 3.2 Vision 11B or Gemma 3 12B | Qualitative leap (OCR, charts, UI) |
| RAG on Documents | Llama 3.3 8B (32K context) | Qwen 3.5 9B (262K context) or Qwen 3 14B (40K) | Up to 8x larger context |
| General Chat | Llama 3.3 8B Q4 | Llama 3.3 8B Q5 or Qwen 3 14B | Minimal difference |
| Maximum Speed | Mistral Small 3 7B (~40 t/s) | Mistral Small 3 7B (~50 t/s, with RAM headroom) | +25% speed |
🎯 CPU Offloading — A Trap to Avoid
If a model doesn't fit into RAM — Ollama automatically offloads layers to the CPU.
Speed drops by 5–11 times. On a 16 GB system, this is relevant for 20B+ models.
How to diagnose and avoid it.
Real benchmarks (RTX 4080 16 GB VRAM, Ollama 0.17.7):
GPT-OSS 20B fully in memory — 139 tokens/sec.
GPT-OSS 120B with 78% on CPU — 12.64 tokens/sec.
An 11x difference on the same hardware.
How to Detect CPU Offloading
After launching a model, check with `ollama ps`:

```shell
ollama ps
# Good case: 100% in memory
# NAME          SIZE     PROCESSOR   CONTEXT
# llama3.3:8b   4.7 GB   100% GPU    4096

# Bad case: CPU offloading
# NAME              SIZE    PROCESSOR          CONTEXT
# deepseek-r1:32b   19 GB   43%/57% CPU/GPU    4096
```
If you see a split like 43%/57% CPU/GPU —
a significant portion of computations is going to the CPU. Expect 5–10x slower generation.
Each token requires transfer between CPU and GPU memory via PCIe — this is a bottleneck
that grows with each offloaded layer.
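If you script this check, the PROCESSOR field can be inspected programmatically. A hypothetical helper; the field format follows the `ollama ps` output shown above:

```python
# Hypothetical helper: given the PROCESSOR field from `ollama ps`,
# report whether any layers are running on the CPU.
def is_offloaded(processor_field: str) -> bool:
    # "100% GPU" (or "100% RAM" on Apple Silicon) means fully resident;
    # any mention of CPU ("43%/57% CPU/GPU", "100% CPU") means offloading.
    return "CPU" in processor_field

print(is_offloaded("100% GPU"))          # False
print(is_offloaded("43%/57% CPU/GPU"))   # True
```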
Degradation Numbers on Real Hardware
Testing degradation with CPU offloading (data from LocalLLM.in):
- Qwen 3 8B fully in memory (36/36 layers): 40 tokens/sec
- Qwen 3 8B with 25/36 layers in memory: 8 tokens/sec — 5 times slower
- CPU-only mode (num_gpu 0): even slower — only acceptable for batch tasks
Context Also Consumes Memory
An important nuance: not only the model, but also the context length affects RAM consumption.
According to the KV cache formula, extending context from 4K to 32K can add hundreds of megabytes.
GPT-OSS 20B at 60K context (13.7 GB) yielded 42 t/s, but at 120K (14.1 GB) — only 7 t/s,
because offloading began. On 16 GB RAM, the effect is even more pronounced.
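The KV-cache growth can be approximated with the standard formula: 2 tensors (K and V) x layers x context length x KV heads x head dimension x bytes per element. A sketch with hypothetical 8B-class parameters (grouped-query attention already shrinks this versus full multi-head attention, and runtimes can quantize the cache further; check the model card for real values):

```python
# Approximate FP16 KV-cache size: 2 tensors (K and V) x layers x context
# x KV heads x head_dim x 2 bytes. Parameters below are hypothetical
# 8B-class values (GQA with 8 KV heads); check the model card for real ones.
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx: int, bytes_per_elem: int = 2) -> float:
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elem
    return total_bytes / 1024**3

print(kv_cache_gb(32, 8, 128, 4096))   # 0.5 GB at 4K context
print(kv_cache_gb(32, 8, 128, 32768))  # 4.0 GB at 32K context
```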
How to Avoid CPU Offloading on 16 GB
- ✔️ Choose models from the table above — all fit within the 8–11 GB window
- ✔️ Close your browser and unnecessary programs before running a heavy model
- ✔️ Do not attempt to run 20B+ models without adequate GPU/RAM
- ✔️ Monitor via `ollama ps` after each new launch
- ✔️ Control `num_ctx`: smaller context means less RAM for the KV cache
Conclusion: CPU offloading is not an error but Ollama's automatic behavior. Knowing about it saves you from wondering why a 20B model is "slow." On 16 GB of RAM it is avoidable: pick models that fit the window and keep the context size under control.
🎯 Is it worth upgrading from 8 GB to 16 GB
My advice from experience: the upgrade is worth it if your main tasks are coding, math, or RAG on long documents.
The difference in these tasks is significant and measurable. For basic chat and simple text work,
the upgrade changes almost nothing.
I personally use Ollama mainly for local API testing: I run a model locally instead of relying on OpenRouter's free tier.
The reason is simple: the free tier is often overloaded, responses are delayed, and 429 or 503 errors appear during peak hours.
With a local model, there's zero dependency on an external service. The response is always
available, regardless of the load on other servers. For testing and development,
this is more important than the quality difference between an 8B and a 14B model.
For whom an upgrade to 16 GB is definitely justified
- ✔️ Developers who code daily — Qwen 2.5 Coder 14B
provides noticeably better code review and refactoring than the 7B version on complex tasks.
And the combination of Coder 14B + Qwen 3.5 9B provides a full-fledged local dev workflow
- ✔️ Mathematics and algorithms — Phi-4 14B (80.4% MATH, 56.1% GPQA)
vs Phi-4 Mini (significantly lower). If you solve complex STEM problems, the difference is fundamental
- ✔️ RAG on large documents — Qwen 3.5 9B with 262K context
or Qwen 3 14B. For analyzing long PDFs or codebases, large context is critical
- ✔️ Image analysis — Llama 3.2 Vision 11B or Gemma 3 12B
qualitatively surpass 4B variants. For OCR, chart analysis, or UI screenshots,
an 11B–12B tier is precisely what's needed
- ✔️ Local API testing — if you depend on stability
and want to avoid external provider errors during peak hours
For whom the difference is minimal
- ⚠️ Basic chat and simple questions — Llama 3.3 8B on 8 GB
covers 80% of daily chat tasks. The upgrade won't provide a noticeable difference
- ⚠️ Simple text tasks — paraphrasing, summarization,
translation — the 7B–8B class handles it. The quality difference is minimal
- ⚠️ If you already have OpenRouter as a fallback — for occasional
heavy tasks, you can use a cloud model. But if the service is often
overloaded in your time zones, a local 14B provides stability
Alternative to upgrading: a hybrid strategy
If a RAM upgrade is not planned yet, there's an intermediate solution:
- ✔️ 8 GB locally for daily tasks and API testing
- ✔️ OpenRouter free tier for heavy one-off tasks —
Qwen 2.5 72B or DeepSeek R1 70B when maximum quality is needed
- ✔️ Fallback logic in code — try the local model,
on error or timeout, switch to OpenRouter
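The fallback logic can be sketched like this, assuming Ollama's default local endpoint; the OpenRouter call is left as a stub to fill in:

```python
import json
import urllib.error
import urllib.request

# Sketch of the fallback logic: try the local Ollama server first and fall
# back to a cloud provider on error or timeout. The endpoint is Ollama's
# default; the cloud call is a stub to replace with a real OpenRouter request.
OLLAMA_URL = "http://localhost:11434/api/generate"

def ask_local(prompt: str, model: str = "qwen3.5:9b", timeout: float = 30.0) -> str:
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]

def ask_cloud(prompt: str) -> str:
    raise NotImplementedError("send the request to OpenRouter here")

def ask(prompt: str, local=ask_local, cloud=ask_cloud) -> str:
    try:
        return local(prompt)
    except (urllib.error.URLError, TimeoutError, OSError):
        # Local model unavailable or too slow: switch to the cloud fallback.
        return cloud(prompt)
```

The `local` and `cloud` parameters are injectable, which keeps the routing logic testable without a running server.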
Conclusion: An upgrade to 16 GB is justified if
coding, math, or RAG are your primary tasks. For basic use,
the difference doesn't justify the upgrade cost. A hybrid strategy is
a good compromise until an upgrade is planned.
❓ Frequently Asked Questions (FAQ)
Can a 20B model be run on 16 GB RAM?
Technically yes — but with CPU offloading. Expect a 5–11x speed drop.
Benchmarks show: a 20B model with CPU offloading gives ~12 tokens/sec
compared to 139 tokens/sec fully in GPU memory.
For batch tasks without interactivity, it's acceptable. For live chat, no.
What's the difference between Q4_K_M and Q5_K_M in practice?
Q5_K_M uses 15–20% more RAM and has 5–10% slower generation.
The quality of responses is slightly better, especially on tasks where reasoning accuracy is important.
On a 16 GB system, Q5_K_M is justified for 7B–8B models — there's headroom.
For 14B models, Q4_K_M is usually optimal, as the window is already tight.
The golden rule: a larger model with Q4 is almost always better than a smaller one with Q8.
How to check if a model is using CPU offloading?
`ollama ps` shows the current status:

```shell
ollama ps
# NAME   SIZE   PROCESSOR   CONTEXT
# PROCESSOR = 100% GPU means fully in memory (on Apple Silicon: 100% RAM)
# PROCESSOR = 43%/57% CPU/GPU means offloading is active
```
Is 16 GB on Apple Silicon the same as 16 GB RAM on a PC?
Better. On Apple Silicon, unified memory is both RAM and VRAM simultaneously.
16 GB of unified memory on M2/M3/M4 provides better performance than 16 GB of RAM on a PC
without a discrete GPU, as the model is loaded directly into unified memory
without transfer overhead between CPU and GPU.
What about Qwen 3.5 — isn't that a new model?
Yes, Qwen 3.5 was released in March 2026 and is already available in Ollama.
The 9B version (6.6 GB) has native multimodality, 262K context, and thinking mode.
On a 16 GB system, it's one of the best choices for general use —
more compact than 14B models, but with larger context and vision capabilities.
Is it worth waiting for new models before upgrading?
New models are released constantly — there will always be a "better model in a month."
If a task is already limiting you now (slow code review, inaccurate math,
unstable external API), an upgrade makes sense now.
If everything is satisfactory, wait.
✅ Conclusions
Moving from 8 GB to 16 GB RAM for Ollama is not a linear improvement,
but the unlocking of a specific new tier of models: the 11B–14B class.
What is unlocked:
- ✔️ Code: Qwen 2.5 Coder 14B — more complex refactoring and SWE-bench tasks
- ✔️ Reasoning: Qwen 3 14B with thinking mode — hybrid reasoning without compromises
- ✔️ Math: Phi-4 14B — MATH 80.4%, surpasses GPT-4o
- ✔️ Images: Llama 3.2 Vision 11B or Gemma 3 12B — full OCR and analysis
- ✔️ Context: Qwen 3.5 9B — 262K tokens with native vision
- ✔️ Parallel execution: two models simultaneously (Coder + chat)
What does not change significantly: basic chat, simple text tasks.
Llama 3.3 8B on 8 GB covers 80% of daily tasks without an upgrade.
If you're still on 8 GB, read the previous article about the 8 GB tier.
If you're choosing a model for a specific task, see the full comparison of Ollama models 2026.
If you want to understand what Ollama is and why everyone is switching to local models, see the Ollama overview in 2026.
If you plan to work with documents, see the RAG with Ollama guide.
📎 Sources
- SitePoint: Best Local LLM Models 2026 — HumanEval, MMLU, and MT-Bench benchmarks
- LocalLLM.in: Best Local LLMs for 16GB VRAM — real-world tests, VRAM scaling
- Rost Glukhov: LLMs on Ollama 16GB VRAM (Ollama 0.17.7) — RTX 4080 benchmarks, CPU offloading, Qwen 3.5
- LocalLLM.in: Ollama VRAM Requirements 2026 — degradation with offloading
- InsiderLLM: Best Local Coding Models 2026 — Qwen 2.5 Coder, Qwen 3.5 tier comparison
- Phi-4 Technical Report (Microsoft Research) — MATH, GPQA, HumanEval benchmarks
- Qwen 2.5 Coder Technical Report (Alibaba) — official benchmarks by size
- Ollama Library — official model registry and sizes