Have a laptop with 8GB of RAM and want to run AI locally?
This article is a breakdown: what works, what barely runs,
and what's not even worth downloading. No illusions, with specific models
and commands for each task. If you're not yet familiar with Ollama,
start with our introductory article on what Ollama is and why developers are switching to local AI en masse.
🎯 Honest Arithmetic: How Much RAM is Actually Left for the Model
Short Answer:
With 8GB of RAM, about 4–5GB is realistically available for an AI model.
The rest is taken by the operating system, browser, and background processes.
This dictates the main rule: on 8GB, 1–3B parameter models in 4-bit quantization run comfortably, and 7–8B models run only at the limit.
8GB RAM is not 8GB for the model.
It's 8GB minus the OS, minus Chrome, minus everything you forgot to close.
Before choosing a model, you need to understand your real memory budget.
Here's a typical breakdown on a system with 8GB RAM:
- ✔️ Operating System: 1.5–2.5GB (macOS is closer to 2.5, Windows — 2, Linux — 1.5)
- ✔️ Browser (5–10 tabs): 1–2GB
- ✔️ IDE (VS Code / IntelliJ): 0.5–1.5GB
- ✔️ Background Processes: 0.3–0.5GB
Remaining for the model: 3–5GB.
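The arithmetic above can be sketched as a small shell helper. This is only an illustration, not a measurement tool: the function name model_budget_gb is hypothetical, and the sample figures are picked from the ranges in the breakdown above.

```shell
# model_budget_gb TOTAL USED...
# Subtracts each "used" figure (in GB) from the total RAM and prints what is left.
# Hypothetical helper for illustration; plug in your own numbers.
model_budget_gb() {
  total="$1"
  shift
  # Feed the remaining arguments to awk, one per line, and subtract each.
  printf '%s\n' "$@" | awk -v total="$total" '
    { total -= $1 }
    END { printf "%.1f\n", total }
  '
}

# 8GB total minus OS (2), browser (1.5), IDE (1), background processes (0.4)
model_budget_gb 8 2 1.5 1 0.4   # prints 3.1
```

Closing the IDE or the browser moves the result by a gigabyte or more, which is exactly the difference between a 3B and a 7B model on this class of hardware.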
According to
LocalLLM.in,
a 7B parameter model in Q4_K_M quantization takes approximately 4–5GB,
plus 1–2GB for KV cache and system overhead.
This means: a 7B model on 8GB is possible, but on the edge, and it's best to close
everything unnecessary.
Practical rule for 8GB:
- ✔️ Comfort Zone: 1–3B parameter models (Q4_K_M) — leaves space for IDE and browser
- ✔️ Working Zone: 7–8B parameter models (Q4_K_M) — requires closing everything else
- ❌ Red Zone: 13B+ models — guaranteed freezes or disk swapping
Conclusion: Before choosing a model, close your browser, run
ollama ps to confirm no other model is still loaded, and check the actual free memory in your system monitor.
On 8GB, every gigabyte is worth its weight in gold.
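On Linux, the memory that is actually free can be read from /proc/meminfo. A minimal sketch, assuming a Linux system; the function name mem_available_mb is hypothetical. On macOS, Activity Monitor or vm_stat gives the equivalent picture.

```shell
# mem_available_mb [FILE]
# Prints the kernel's MemAvailable estimate in MiB.
# FILE defaults to /proc/meminfo (Linux only); the parameter exists so the
# parser can also be tried on a saved copy of that file.
mem_available_mb() {
  awk '/^MemAvailable:/ { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}

# Report the value only where /proc/meminfo exists (i.e. on Linux).
if [ -r /proc/meminfo ]; then
  echo "Available for a model: $(mem_available_mb) MiB"
fi
```

If the reported value is below the size of the model you plan to load plus 1–2GB of overhead, close more programs or pick a smaller model.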
🎯 For Code: Which Model Will Replace Copilot on 8GB
Code Autocompletion
For code autocompletion on 8GB, the best choice is Qwen 2.5 Coder 3B or
Phi-4 Mini (3.8B) in Q4_K_M quantization. Both models leave enough
memory for VS Code and provide acceptable generation quality.
GitHub Copilot costs $10/month. A local model for code is
$0/month and works offline. The only question is which model
will run on your hardware.
Coding is a task where even small models can be useful.
Autocompletion, function generation, code explanation, writing tests —
you don't need GPT-4 for this, you need a fast and accurate model
that understands syntax.
Top Models for Code on 8GB
1. Qwen 2.5 Coder 3B (Q4_K_M) — ~2.2GB RAM
According to
SitePoint,
Qwen leads the HumanEval benchmark among 7–8B class models.
The 3B version is lightweight but retains strong specialization in code.
Trained on a large volume of programming code and technical documentation.
ollama pull qwen2.5-coder:3b
ollama run qwen2.5-coder:3b "Write a Python array sorting function"
2. Phi-4 Mini (3.8B) — ~2.3GB RAM
According to
SitePoint,
Phi-4 Mini is the only model that runs comfortably on systems with 8GB,
delivering 15–20 tokens/sec on an M1 MacBook Air or a budget Linux laptop.
It handles autocompletion, simple explanations, and light chat tasks well.
ollama pull phi4-mini
ollama run phi4-mini "Explain the difference between HashMap and TreeMap in Java"
3. DeepSeek Coder 1.3B (Q4_K_M) — ~1GB RAM
The lightest model for code. Ideal for IDE autocompletion —
fast, doesn't overload the system, can be kept running in the background
along with VS Code, browser, and terminal.
ollama pull deepseek-coder:1.3b
ollama run deepseek-coder:1.3b
What to Choose?
- ✔️ Need background autocompletion + open browser → DeepSeek Coder 1.3B
- ✔️ Need function generation and code explanation → Qwen 2.5 Coder 3B
- ✔️ Need a universal model for code and text → Phi-4 Mini
More on setting up autocompletion — in the article Ollama + VS Code: A Free Alternative to GitHub Copilot.
Conclusion: On 8GB, you can code with local AI.
Don't expect GPT-4 quality — but for daily autocompletion, boilerplate generation,
and code explanations, it's sufficient.
🎯 For Text and Communication: Chat, Summaries, Translation
For Text Tasks
For text tasks on 8GB, the optimal choice is Llama 3.2 3B for general
chat, Gemma 4 E4B for a balance of quality and multimodality, or Phi-3 Mini
if minimal size is required. All three leave room for other software.
Not every task requires GPT-4. Summarizing text,
answering questions, retelling an article — a model that weighs less
than a 4K movie can handle this.
Text tasks are the broadest category: from simple chat to document analysis
and translation. On 8GB, there's a good selection here.
Top Models for Text on 8GB
1. Llama 3.2 3B (Q4_K_M) — ~2GB RAM
According to
StudyHUB,
Llama 3.1/3.2 is the most popular model on Ollama with over 111 million
downloads. The 3B version is lightweight but retains quality in general
conversations, summarization, and question answering. Supports 8 languages.
ollama pull llama3.2:3b
ollama run llama3.2:3b "Retell the main idea of this text: ..."
2. Gemma 4 E4B (Q4_K_M) — ~3GB RAM
A model from Google DeepMind, released in April 2026. Unlike the older
Gemma 2B, it's a full multimodal model: accepts text and images,
has a thinking mode for more complex tasks, and a 128K context window. At the same time,
it comfortably fits within 8GB, leaving space for IDE and browser. If
you previously used gemma:2b — E4B is a direct replacement with
significantly better quality. More on model architecture and sizes —
in the article Gemma 4: Full Overview — Sizes, License, Comparison with Gemma 3.
ollama pull gemma4:e4b
ollama run gemma4:e4b "Create a short description for this product: ..."
⚠️ Note: if you need the absolute minimum RAM and the old
gemma:2b (~1.6GB) worked for you — it's still available.
But for new installations, I recommend E4B right away. Gemma 4's thinking mode
can be turned on and off — read about how it works and when to disable it
in the article Reasoning Mode in Gemma 4: How to Enable, When Needed, and What It Costs.
3. Phi-3 Mini (3.8B) — ~2.3GB RAM
According to
StudyHUB,
Phi-3 Mini, at 2.3GB, covers 90% of daily tasks.
It runs fast even on CPU and is suitable for Raspberry Pi 4/5.
ollama pull phi3:mini
ollama run phi3:mini "Translate to Ukrainian: The quick brown fox jumps over the lazy dog"
What to Choose?
- ✔️ General chat and Q&A → Llama 3.2 3B
- ✔️ Multimodality (text + images) and better quality → Gemma 4 E4B
- ✔️ Minimum RAM, CPU-only, or Raspberry Pi → Phi-3 Mini
Conclusion: For text tasks, 8GB is comfortable territory.
2–4B models run fast, leave room for other programs, and provide quality
sufficient for most daily needs.
Gemma 4 E4B is the biggest leap in quality without increasing hardware requirements.
🎯 For Reasoning: Math, Logic, Code Debugging
For tasks requiring step-by-step thinking — mathematics, logic problems,
debugging complex code — DeepSeek R1 8B in Q4 quantization works on 8GB.
This is a "thinking" model: it's slower but more accurate on complex questions.
A regular model answers immediately. A reasoning model
thinks first — step by step — and then answers.
Like the difference between "guessing an answer" and "calculating on paper."
Reasoning models are a relatively new category. They work on the principle of
chain-of-thought: breaking down a task into steps, verifying intermediate
results, and only then forming the final answer.
What Works on 8GB
1. DeepSeek R1 8B (Q4_K_M) — ~5GB RAM
According to
StudyHUB,
DeepSeek R1 is a "thinking" model, an analog of OpenAI o1. On tasks involving math,
logic puzzles, and technical reasoning, it yields better results than Llama 3.1
of the same size. The trade-off: it answers slower because it "thinks" before responding.
ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b "Find the error in this SQL query: SELECT * FROM users WHERE id = '5' AND active = true GROUP HAVING count > 1"
⚠️ Important: DeepSeek R1 8B takes ~5GB RAM.
On an 8GB system, this is on the edge — you need to close your browser, IDE, and
everything else. On macOS with unified memory, it works more stably than on Windows
with integrated graphics.
2. Qwen 3 8B (Q4_K_M) — ~5GB RAM
According to
LocalLLM.in,
Qwen 3 8B is a strong alternative for reasoning tasks, especially in math
and multilingual scenarios. It supports Ollama's thinking mode by default.
ollama pull qwen3:8b
ollama run qwen3:8b "Solve: if 3x + 7 = 22, what is x?"
What to Choose?
- ✔️ Code debugging and logic problems → DeepSeek R1 8B
- ✔️ Math and multilingual reasoning → Qwen 3 8B
- ✔️ If 8B doesn't fit — Phi-4 Mini as a compromise (smaller, but without chain-of-thought)
Conclusion: Reasoning on 8GB is possible, but it's the edge
of comfort. 8B models require almost all available memory. For regular
work with such tasks, consider upgrading to 16GB — the difference in capabilities
will be substantial.
🎯 CPU vs GPU vs Apple Silicon — where 8GB is not 8GB
8GB on a Mac M1 and 8GB on a Windows laptop with Intel offer two different experiences.
Apple Silicon uses unified memory, where all memory is accessible to both the CPU
and GPU simultaneously. On a standard PC, RAM and VRAM are separate pools,
and this is critical for AI models.
An 8GB Mac M1 is a fully functional workstation for local AI.
An 8GB Windows laptop with Intel HD Graphics is a struggle for every megabyte.
Apple Silicon (M1/M2/M3) — the best scenario for 8GB
On Apple Silicon, all RAM is unified memory.
This means the GPU part of the chip has access to the same 8GB
as the CPU. Ollama automatically uses Metal for acceleration —
without any additional settings.
Result: a 3–4B model in Q4_K_M on an M1 with 8GB delivers 15–20 tokens/sec or more,
enough for comfortable interactive use, while a 7–8B model is closer to 10–15 tokens/sec.
According to
SitePoint,
Phi-4 Mini on an M1 MacBook Air is approximately 15–20 tok/s, which is sufficient
for daily work.
Windows / Linux with a discrete GPU (RTX 3060, RTX 4060) — a good scenario
If you have a discrete graphics card with 6–8GB of VRAM — the model is fully
loaded into GPU memory, and system RAM remains for the OS and software.
According to
LocalLLM.in,
on an RTX 4060 (8GB VRAM), a 7B model delivers 40+ tokens/sec —
the fastest option of all.
Windows / Linux without a GPU (Intel HD / AMD Radeon iGPU) — a difficult scenario
Without a discrete GPU, the model runs entirely on the CPU. Ollama will still
launch — but the speed drops to 3–6 tokens/sec.
According to
LocalLLM.in,
CPU-only inference is acceptable for batch tasks but frustrating
for interactive use.
Plus, system RAM is shared between the OS, software, and the model — on 8GB
it's very tight.
Summary table
| Platform | 7B model (Q4) | 3B model (Q4) | Speed | Comfort |
| --- | --- | --- | --- | --- |
| Mac M1/M2 8GB | ✔️ Works | ✔️ Comfortable | 15–20 tok/s | ⭐⭐⭐⭐ |
| Windows + RTX 4060 8GB VRAM | ✔️ Works fast | ✔️ Comfortable | 40+ tok/s | ⭐⭐⭐⭐⭐ |
| Windows/Linux CPU only 8GB | ⚠️ On the edge | ✔️ Works | 3–6 tok/s | ⭐⭐ |
Conclusion: If you have a Mac M1+ with 8GB — you are in
the best position for local AI on budget hardware. If you have Windows
without a GPU — focus on 3B models and close everything else.
More details on installation on different OS —
in the article How to Install Ollama on Mac, Windows, and Linux: A Complete Guide 2026.
🎯 Quantization in simple terms: Q4 vs Q8 and what to choose for weak hardware
Short answer:
Quantization is model compression that reduces its size by 2–4 times
with minimal quality loss. For 8GB, the optimal choice is Q4_K_M:
the best balance between size, speed, and response quality.
Quantization is like JPEG for photos. The file is smaller,
the difference is almost imperceptible. But if you compress too much —
the quality will noticeably drop.
When you see tags like :7b-q4_0,
:8b-instruct-q8_0, or :3b-q4_k_m in an Ollama model name —
this indicates the quantization level. The number after "q" is the approximate number of bits per parameter.
Quantization levels: what the tags mean
- ✔️ Q8 (8-bit): maximum quality, largest size. For a 7B model — ~8GB. Won't fit in 8GB RAM.
- ✔️ Q4_K_M (4-bit, K-quant medium): optimal balance. For 7B — ~4–5GB. Recommended for 8GB systems.
- ✔️ Q4_K_S (4-bit, K-quant small): slightly smaller than Q4_K_M, slightly lower quality.
- ⚠️ Q2_K (2-bit): minimum size (~2.5GB for 7B), but noticeable quality degradation. An extreme option.
The "K" suffix signifies newer quantization methods (K-quant), which more intelligently
distribute precision across model layers. K-quant tags generally give better quality than
legacy variants (q4_0, q4_1) at a similar size.
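The link between bit width and size is simple arithmetic: parameters times bits per parameter, divided by 8 bits per byte. A rough sketch (the function name nominal_size_gb is hypothetical). Real Q4_K_M files come out larger than this nominal figure because K-quants keep some tensors at higher precision, which is consistent with the ~4–5GB quoted for 7B models.

```shell
# nominal_size_gb PARAMS_BILLIONS BITS
# Nominal weight size in GB: parameters x bits / 8.
# A lower bound for illustration only, not the exact size of any Ollama tag.
nominal_size_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

nominal_size_gb 7 8   # 7B at 8-bit: prints 7.0
nominal_size_gb 7 4   # 7B at 4-bit: prints 3.5
nominal_size_gb 3 4   # 3B at 4-bit: prints 1.5
```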
How much do models of different quantizations weigh
| Model | Q8 | Q4_K_M | Q2_K |
| --- | --- | --- | --- |
| Phi-3 Mini (3.8B) | 4.0 GB | 2.3 GB | 1.2 GB |
| Llama 3.1 (8B) | ~8 GB | ~4.5 GB | ~2.6 GB |
| Mistral 7B | ~8 GB | ~4.1 GB | ~2.8 GB |
Data from
LocalAIMaster.
Rule for 8GB: always choose Q4_K_M. If it doesn't fit, reduce
the model size (3B instead of 7B), not the quantization level (Q2 instead of Q4).
A smaller model with Q4 will provide better quality than a larger one with Q2.
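The rule can be sketched as a filter over the Q4_K_M sizes quoted in this article. This is an illustration under stated assumptions: the function name models_that_fit is hypothetical, the sizes are the approximate RAM figures given for each model in the sections above, and the 1.5GB default reserve is an assumed midpoint of the 1–2GB KV cache and overhead range.

```shell
# models_that_fit FREE_GB [RESERVE_GB]
# Prints the Ollama tags whose approximate Q4 size plus a reserve for KV cache
# and overhead fits into FREE_GB. Sizes are the rough figures from this article.
models_that_fit() {
  awk -v free="$1" -v reserve="${2:-1.5}" '
    $2 + reserve <= free { print $1 }
  ' <<'EOF'
deepseek-coder:1.3b 1.0
llama3.2:3b 2.0
qwen2.5-coder:3b 2.2
phi3:mini 2.3
gemma4:e4b 3.0
deepseek-r1:8b 5.0
qwen3:8b 5.0
EOF
}

# With 4GB free, only the 1–4B models remain on the list.
models_that_fit 4
```

The filter deliberately has no Q2 entries: when nothing larger fits, the answer is the next smaller model at Q4, not the same model squeezed harder.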
More on compression techniques and their impact on quality —
in the article Model Quantization: INT4, INT8 — What It Is and How It Affects Quality.
Conclusion: Q4_K_M is the gold standard for 8GB. Don't give in
to the temptation to load Q8 "for quality" — the model won't fit into memory,
and you'll get disk swapping instead of fast responses.
🎯 Ollama Settings for Maximum Performance on Weak Hardware
Three environment variables and one habit (closing unnecessary things) — that's all
you need to squeeze the most out of 8GB. The setup takes
a minute, and the difference in stability is noticeable.
On powerful hardware, Ollama "just works."
On weak hardware, you need to help it not waste memory on things
you don't need.
By default, Ollama can keep multiple models in memory simultaneously and handle parallel requests. On 8GB, this is an unnecessary luxury. Here's the minimal set of optimizations:
Environment Variables
# Keep only one model in memory (default can be more)
export OLLAMA_MAX_LOADED_MODELS=1
# One parallel request (no memory competition)
export OLLAMA_NUM_PARALLEL=1
# Reduce the default context window (saves 200–800MB RAM)
export OLLAMA_CONTEXT_LENGTH=2048
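On macOS / Linux, add these lines to ~/.zshrc or
~/.bashrc if you start the server yourself with ollama serve. The variables are read by the Ollama server process, not by the ollama run client: if Ollama runs as the macOS app, set them with launchctl setenv and restart the app; if it runs as a systemd service on Linux, add them as Environment= lines via systemctl edit ollama.service and restart the service. On Windows, set them via system environment variables
or your PowerShell profile, then restart Ollama.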
Before running a model
It sounds trivial, but on 8GB it's critical:
- ✔️ Close your browser or leave a maximum of 2–3 tabs
- ✔️ Close Slack, Discord, Spotify — each program consumes 200–500MB
- ✔️ Check current usage:
ollama ps will show loaded models
- ✔️ If an old model is still in memory —
ollama stop model_name
Modelfile for fine-tuning
If you want more control — create a Modelfile with optimized parameters:
FROM phi3:mini
PARAMETER num_ctx 2048
PARAMETER num_thread 4
PARAMETER temperature 0.7
num_ctx 2048 — reduces the context window (less RAM for KV cache).
num_thread 4 — limits the number of CPU threads, keeping the system
responsive.
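To use the Modelfile, it has to be built into a named model with ollama create. A minimal sketch: the model name phi3-8gb and both helper function names are hypothetical, and the build step assumes Ollama is installed and its server is running.

```shell
# write_modelfile [PATH]
# Writes the Modelfile from this section to PATH (default: ./Modelfile).
write_modelfile() {
  cat > "${1:-Modelfile}" <<'EOF'
FROM phi3:mini
PARAMETER num_ctx 2048
PARAMETER num_thread 4
PARAMETER temperature 0.7
EOF
}

# build_and_run
# Builds a custom model from the Modelfile and starts a chat with it.
# "phi3-8gb" is a made-up name; choose any name you like.
build_and_run() {
  write_modelfile Modelfile
  ollama create phi3-8gb -f Modelfile
  ollama run phi3-8gb
}
```

Calling build_and_run in an empty directory creates the file, registers the model, and opens an interactive session; afterwards the custom model appears in ollama list next to the base phi3:mini.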
A step-by-step guide to installation and first launch —
in the article How to Install Ollama on Mac, Windows, and Linux: A Complete Guide 2026.
And on creating custom models via Modelfile —
in the article Modelfile in Ollama: Create Your Custom AI.
Conclusion: Three environment variables + closed unnecessary programs =
stable operation on 8GB. Without these settings, even a lightweight model
can cause disk swapping.
🎯 What NOT to try on 8GB — my experience
Short answer:
Models 13B+, any models in Q8 quantization, and attempts to run
two models simultaneously — are guaranteed disappointment on 8GB.
I tested this on my Mac M1 — so you don't have to.
Everyone who has worked with Ollama on 8GB has gone through the same
stage: "But maybe 13B will fit after all?" No, it won't.
I checked.
Working with Ollama on a Mac M1 with 8GB of unified memory, I tested
dozens of models of various sizes. Here's an honest list of what doesn't work —
or works so poorly it's better if it didn't.
❌ Models 13B and larger
Llama 2 13B, Qwen 14B, CodeLlama 13B: even in Q4 quantization
they require 8–9GB just for the model weights. Add KV cache and the OS,
and you'll get a system that constantly swaps to disk.
I tried to run a 13B Llama model in Q4: it took about 5 minutes to load,
then delivered 1–2 tokens per second with constant pauses.
This is unworkable for interactive use.
❌ Any 7B model in Q8 quantization
The Q8 version of a 7B model weighs around 8GB — that's all your RAM.
The OS doesn't magically disappear. I tried Mistral 7B Q8 —
the system froze a minute after starting. Always use Q4_K_M
for 7B models on 8GB.
❌ Two models simultaneously
Ollama can keep multiple models in memory. On 16GB, this is convenient —
you switch between models instantly. On 8GB, it's a recipe for a swap storm.
Keep OLLAMA_MAX_LOADED_MODELS=1 and don't forget
ollama stop before loading another model.
❌ Large context windows (8K+ tokens)
Every doubling of the context window means additional hundreds of megabytes
for KV cache. On 8GB, keep the context at 2048–4096 tokens maximum.
You won't be able to feed the model a 10-page document whole;
you'll need to break it into parts.
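The growth of the KV cache with context length can be estimated directly. A rough sketch, assuming an fp16 cache and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); the function name kv_cache_mib is hypothetical, and other models or a quantized cache will give different numbers.

```shell
# kv_cache_mib LAYERS KV_HEADS HEAD_DIM CONTEXT [BYTES_PER_VALUE]
# KV cache size in MiB: 2 (keys + values) x layers x kv_heads x head_dim
# x context tokens x bytes per value (2 for fp16, the assumed default).
kv_cache_mib() {
  echo $(( 2 * $1 * $2 * $3 * $4 * ${5:-2} / 1048576 ))
}

kv_cache_mib 32 8 128 2048   # prints 256
kv_cache_mib 32 8 128 4096   # prints 512
kv_cache_mib 32 8 128 8192   # prints 1024
```

Under these assumptions, going from 2048 to 8192 tokens costs roughly 768 MiB extra, which on an 8GB machine is often the whole remaining headroom.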
❌ Mixtral 8x7B (MoE architecture)
Mixtral activates only 2 out of 8 "experts" per token,
so theoretically it uses fewer computations. But all 8 experts
must be in memory — and that's 26+ GB even in Q4.
The name "8x7B" is misleading: it's not a 7B-sized model.
General rule: if ollama run
takes longer than 30 seconds to load and the first response comes after a minute —
the model is too large for your system. Don't expect it to "warm up" —
close it and choose a smaller model.
Comparison of models by size, quality, and tasks —
in the article Top 10 Ollama Models in 2026: Which to Choose.
Conclusion: I went through this myself — I thought a larger
model would give better results, downloaded 13B, waited a minute for the first
response, and deleted it. I installed 3B — and performance immediately improved.
On 8GB, the better strategy is to choose a model that works quickly and stably,
rather than struggling with one that "almost fits."
🎯 Benchmarks: What to Expect in Practice
Short answer:
On a Mac M1 with 8GB RAM, the 3B model delivers 20–30 tokens/sec, and the 7B model delivers 10–15 tokens/sec.
On a CPU-only Windows machine, it's two to three times slower.
Below is a summary table for guidance.
Benchmarks found online are often conducted on a clean system without other software. In reality, with VS Code open and 5 Chrome tabs running, the numbers will be lower, so the figures below are meant as realistic guidance rather than best-case results.
Performance Summary Table
| Model | RAM | Mac M1 8GB | CPU-only 8GB | RTX 4060 8GB VRAM |
| --- | --- | --- | --- | --- |
| Gemma 4 E4B (Q4) | ~3GB | ~22 tok/s | ~7 tok/s | ~42 tok/s |
| Phi-3 Mini 3.8B (Q4) | ~2.3GB | ~25 tok/s | ~8 tok/s | ~45 tok/s |
| Llama 3.2 3B (Q4) | ~2GB | ~28 tok/s | ~9 tok/s | ~48 tok/s |
| Qwen 2.5 Coder 3B (Q4) | ~2.2GB | ~25 tok/s | ~8 tok/s | ~45 tok/s |
| Llama 3.1 8B (Q4) | ~4.5GB | ~12 tok/s | ~4 tok/s | ~40 tok/s |
| DeepSeek R1 8B (Q4) | ~5GB | ~10 tok/s | ~3 tok/s | ~35 tok/s |
Data is approximate, based on results from
LocalLLM.in,
SitePoint,
and LocalAIMaster.
Actual speed depends on system load, context window size,
and background processes.
What do these numbers mean in practice?
- ✔️ 15+ tok/s: comfortable interactive chat — the response appears faster than you can read it
- ✔️ 8–15 tok/s: usable, but noticeable delays on longer responses
- ⚠️ 3–8 tok/s: acceptable for one-off tasks (debugging, analysis), frustrating for active chat
- ❌ <3 tok/s: model is too large for this system
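To see where your own machine lands, ollama run accepts a --verbose flag that prints timing statistics, and Ollama's HTTP API reports eval_count (generated tokens) and eval_duration (nanoseconds) in its responses. A small sketch that turns those two numbers into tok/s and maps the result onto the bands above; the function names and band labels are hypothetical, and the 8 and 15 boundaries are one reading of the list.

```shell
# tokens_per_sec EVAL_COUNT EVAL_DURATION_NS
# Converts a token count and a duration in nanoseconds into tokens per second.
tokens_per_sec() {
  awk -v n="$1" -v ns="$2" 'BEGIN { printf "%.1f\n", n / (ns / 1e9) }'
}

# rate_band TOK_PER_SEC
# Maps a speed onto the comfort bands described in this section.
rate_band() {
  awk -v r="$1" 'BEGIN {
    if (r >= 15)     print "comfortable"
    else if (r >= 8) print "usable"
    else if (r >= 3) print "batch only"
    else             print "too large"
  }'
}

rate="$(tokens_per_sec 250 10000000000)"   # 250 tokens generated in 10 seconds
echo "$rate tok/s: $(rate_band "$rate")"   # prints "25.0 tok/s: comfortable"
```

Measure with your usual programs open; a number taken on a freshly rebooted machine will flatter the model.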
Conclusion: For daily work on 8GB RAM, aim for 3B models — they provide 20+ tok/s on Apple Silicon and keep the system responsive. 7–8B models are for specific tasks when you're willing to close everything else and wait.
❓ Frequently Asked Questions (FAQ)
Can I run Ollama on a laptop with 8GB RAM?
Yes. Models with 1–4B parameters (Phi-3 Mini, Llama 3.2 3B, Gemma 4 E4B)
run comfortably on any system with 8GB. 7–8B models
run at the limit — you'll need to close unnecessary programs.
More details in the
Ollama installation guide.
What is the best model for 8GB RAM?
It depends on the task. For code — Qwen 2.5 Coder 3B.
For text and chat — Llama 3.2 3B or Gemma 4 E4B.
For reasoning and debugging — DeepSeek R1 8B (at the 8GB limit).
A full model comparison is in the article Top 10 Ollama Models in 2026.
Do I need a GPU for Ollama?
No, Ollama also works on CPU. However, with a GPU (discrete or Apple Silicon),
the speed is 3–10 times higher. On a CPU-only system with 8GB RAM,
stick to 3B models and smaller for comfortable operation.
What's better: a 7B model in Q2 or a 3B model in Q4?
Almost always the 3B model in Q4. Aggressive quantization (Q2) significantly
reduces the quality of responses, especially on complex tasks.
A smaller model with normal quantization will yield better results.
Can Ollama on 8GB replace ChatGPT?
For daily tasks — summarization, simple questions, code generation —
yes. For complex analysis, multimodal tasks, and working with large
contexts — cloud models are still stronger.
The optimal approach is hybrid:
Ollama for regular tasks, ChatGPT/Claude for complex ones.
More details in the article Ollama vs ChatGPT vs Claude: When Local AI is Better.
How much disk space is needed?
One 3B model in Q4 is approximately 2GB on disk. Three models for different
tasks — 6–8GB. Ollama stores models in ~/.ollama.
Downloaded models can be removed with the command ollama rm model_name.
Is it worth upgrading to 16GB?
If you plan to regularly work with local AI — definitely yes.
16GB opens access to 13–14B models, full 7B models in Q8 quality,
and comfortable work with large context windows.
The jump from 8GB to 16GB unlocks more new capabilities than any other memory upgrade step.
✅ Conclusions
8GB of RAM is not a death sentence for local AI,
but it's a limit that requires conscious decisions. Here's the main takeaway:
- ✔️ 3–4B models — the comfort zone: Phi-3 Mini, Llama 3.2 3B, Gemma 4 E4B work quickly and stably, leaving room for your IDE and browser
- ✔️ 7–8B models — the working zone: DeepSeek R1 8B, Qwen 3 8B work at the limit but provide noticeably better quality for specific tasks
- ✔️ Q4_K_M — the only sensible quantization choice for 8GB: a smaller model with Q4 is almost always better than a larger one with Q2
- ✔️ Apple Silicon with 8GB — the best budget option: unified memory provides an advantage over CPU-only systems
- ✔️ 13B+ models, Q8, two models simultaneously — not worth it: tested, doesn't work
I personally use this exact approach: I keep several models
for different tasks — one for code, another for text, a separate one for debugging.
Each model has its strength, and instead of one large model
that might not fit in memory, it's better to have 2–3 specialized lightweight ones.
Switching between them via ollama run takes seconds.
If you're just starting —
install Ollama using
our guide,
download phi3:mini, and give it a try.
In five minutes, you'll have a working local AI —
no subscriptions, no internet, no data transmitted externally.
And if you need a website, blog, or web application with integrated
AI functionality —
contact us at WebsCraft,
we'll help you implement it.
📖 Sources