The official Ollama registry already has over 200 models, and their number is growing weekly.
The problem isn't finding a model, but choosing the right one for a specific task and specific hardware. Make the wrong choice and you'll either wait 30 seconds for a response or get a weak result where quality matters.
This article covers ten models worth considering in 2026, with benchmarks, download commands, and clear recommendations: for whom, for what purpose, and on what hardware.
🎯 How to Read Model Specs: Parameters, Quantization, RAM
Quick Answer:
Two parameters define everything: the number of parameters (B = billions) and the quantization level (Q4, Q8).
More parameters mean better quality, but more RAM. Lower quantization means less RAM,
with a slight loss in quality. A practical rule: the model file size on disk ≈ RAM needed for launch.
The right selection strategy: first determine how much RAM you have available, then choose the best model that fits, not the other way around.
Parameters (B — Billions)
7B, 8B, 13B, 14B, 70B — the number of billions of parameters. More means better response quality,
but slower generation and more RAM. For daily tasks, 7–14B models
cover most scenarios without noticeable quality compromises.
Quantization (Q4_K_M, Q5_K_M, Q8)
Quantization is the compression of model weights to lower precision.
CodeGPT explains:
Q4_K_M takes up half the space of Q8 but loses minimal quality.
K-quantization methods (K_M, K_S) are more modern and more accurate than the older Q4_0.
Ollama downloads Q4_K_M by default — an optimal balance for most users.
| Quantization | Relative Size | Quality | When to Use |
|---|---|---|---|
| Q4_K_M | ~50% of Q8 | Very good | Default choice, limited RAM |
| Q5_K_M | ~60% of Q8 | Excellent | If you have a little extra RAM |
| Q8 | 100% | Maximum | Sufficient RAM, accuracy needed |
RAM: A Quick Rule
Model file size ≈ minimum RAM for launch plus ~2 GB for the system and Ollama.
For example: Llama 3.3 8B in Q4_K_M weighs ~4.7 GB — requires about 7 GB of RAM.
Onyx AI clarifies:
actual consumption is 10–20% higher due to KV cache and framework overhead.
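To make the rule concrete, here is a minimal Python sketch of the estimate described above; the 15% overhead factor and the 2 GB system reserve are illustrative assumptions taken from the figures in this section, not exact measurements.
def estimate_ram_gb(model_file_gb: float,
                    overhead_factor: float = 1.15,   # assumed mid-point of the 10-20% KV cache / framework overhead
                    system_reserve_gb: float = 2.0) -> float:
    """Rough estimate: model file size + runtime overhead + system/Ollama reserve."""
    return model_file_gb * overhead_factor + system_reserve_gb

# Llama 3.3 8B in Q4_K_M weighs ~4.7 GB on disk
print(f"~{estimate_ram_gb(4.7):.1f} GB RAM needed")   # ~7.4 GB, fits comfortably in 8 GB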
Conclusion: Model selection starts with your hardware. Know your RAM budget — know your selection space.
🎯 Models for Code: Qwen 2.5 Coder, DeepSeek Coder, Phi-4
Qwen 2.5 Coder 14B is the best local model for code in 2026.
HumanEval score 72.5% — higher than Llama 3.3 8B (68.1%) and significantly higher than Mistral 7B (43.6%).
For 8GB RAM — Qwen 2.5 Coder 7B. For math and structured tasks — Phi-4.
Qwen 2.5 Coder 32B is competitive with GPT-4o on the Aider code repair benchmark: for a local model, that makes it an equal, not merely a fallback.
1. Qwen 2.5 Coder — Best for Code
According to SitePoint,
Qwen 2.5 Coder 14B shows a HumanEval score of 72.5% — the highest result among local models
in this size class. Supports over 92 programming languages.
CodeGPT notes:
developers highlight its ability to maintain logic through long, multi-turn editing and debugging sessions.
- ✔️ RAM: 7B — 8 GB / 14B — 16 GB / 32B — 24+ GB
- ✔️ Command:
ollama pull qwen2.5-coder:14b
- ✔️ Best for: code generation, debugging, code review, refactoring
- ✔️ License: Apache 2.0
- ✔️ Context: 128K tokens
2. DeepSeek Coder V2 — Debugging Specialist
DeepSeek Coder V2 supports over 300 programming languages.
Developers describe
it as a "debugging partner": responses are often ready to use without further editing.
For tasks requiring detailed error analysis — a strong practical alternative to Qwen.
- ✔️ RAM: from 16 GB
- ✔️ Command:
ollama pull deepseek-coder-v2
- ✔️ Best for: debugging, complex code analysis, 300+ languages
3. Phi-4 — Compact Model for Structured Tasks
SitePoint tested:
Phi-4 14B scored 80.4% on the MATH benchmark — higher than Llama 3.3 8B (68.0%) and Qwen 2.5 14B (75.6%).
For logical and mathematical tasks — the best quality on 16 GB RAM.
Important limitation: a 16K context window — not suitable for long documents.
- ✔️ RAM: 16 GB
- ✔️ Command:
ollama pull phi4
- ✔️ Best for: math, logical tasks, structured code
- ⚠️ Limitation: 16K context — not for long documents
Conclusion: For code — Qwen 2.5 Coder as a base, DeepSeek Coder for heavy debugging, Phi-4 for math and algorithmic tasks.
🎯 Models for Text: Llama 3.3, Mistral, Gemma 4
Llama 3.3 8B is the best general-purpose choice for 8 GB RAM: good text quality,
128K context, the largest ecosystem. Mistral 7B — if maximum
speed or local API testing is needed. Gemma 4 E4B — a balance of size and quality
with native multimodality and thinking mode on 8 GB RAM.
Mistral 7B is the "workhorse" of local AI: small, fast, stable. For text rewriting and API testing it is the optimal choice.
4. Llama 3.3 — Standard for General Use
Blue Headline notes:
Llama 3.3 is the default recommendation for most scenarios:
RAG systems, chatbots, code assistance, fine-tuning.
The largest ecosystem among open models — more integrations,
more tutorials, more ready-made solutions.
A 128K token context window allows processing long documents in a single request.
- ✔️ RAM: 8B — 6–8 GB / 70B — 40+ GB
- ✔️ Command:
ollama pull llama3.3
- ✔️ Best for: general chat, RAG, text writing, code
- ✔️ Context: 128K tokens
- ✔️ License: Llama 3 Community License
5. Mistral 7B — Fastest Model and Ideal for API Testing
Mistral 7B takes up only 4.1 GB on disk and owes its speed to two architectural features: Grouped-Query Attention (GQA) for faster inference and Sliding Window Attention (SWA) for handling longer sequences with less overhead.
DataCamp confirms:
both mechanisms allow Mistral 7B to achieve significantly higher speeds
than models with a comparable number of parameters.
According to Elephas comparison:
Mistral stands out for its fastest response time — an advantage particularly
noticeable with streaming requests and in latency-sensitive tasks.
Why Mistral is the Optimal Choice for API Testing
Mistral 7B via Ollama is practically an ideal platform for developing and testing APIs.
The reasons are simple:
- ✔️ Fast Start: 4.1 GB — the model downloads in minutes,
instead of waiting for 15–20 GB to transfer
- ✔️ OpenAI-Compatible API: Ollama exposes an OpenAI-format endpoint at localhost:11434/v1, so code written for the ChatGPT API works without changes
- ✔️ Zero Testing Costs: unlimited requests
without paying for tokens — convenient for automated testing
- ✔️ Stable Behavior: responses are predictable,
without "surprises" from cloud model updates
- ✔️ Apache 2.0 License: can be used
in commercial projects without restrictions
Example: Testing the API with Mistral via Ollama
Ollama provides both a native REST API (/api/chat) and an OpenAI-compatible endpoint (/v1/chat/completions). A basic request to the native API for testing text rewriting ("stream": false returns the answer as a single JSON object instead of a token stream):
# Basic request via curl
# Basic request via curl
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistral",
    "stream": false,
    "messages": [
      {
        "role": "system",
        "content": "You are a text editor. Paraphrase the text while preserving the meaning."
      },
      {
        "role": "user",
        "content": "Paraphrase: The company achieved high results in the reporting quarter."
      }
    ]
  }'
The same request via Python — fully compatible with the openai SDK,
just change the base_url:
from openai import OpenAI

# Connect to local Ollama instead of OpenAI
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"  # arbitrary string, Ollama doesn't check it
)

response = client.chat.completions.create(
    model="mistral",
    messages=[
        {
            "role": "system",
            "content": "You are a text editor. Paraphrase while preserving the meaning."
        },
        {
            "role": "user",
            "content": "The company achieved high results in the reporting quarter."
        }
    ]
)

print(response.choices[0].message.content)
This means: if you already have code that calls the ChatGPT API, switching to local Mistral only requires changing the base_url (and the model name). The rest of the code remains unchanged.
Parameters for Text Rewriting
Two parameters that most affect paraphrasing quality:
- ✔️ temperature: 0.3–0.5 — more precise paraphrasing,
close to the original. 0.7–0.9 — more creative, with variations
- ✔️ top_p: 0.9 — standard balance between diversity
and response accuracy
response = client.chat.completions.create(
    model="mistral",
    temperature=0.4,  # low for precise paraphrasing
    top_p=0.9,
    messages=[...]
)
Mistral 7B Limitations
- ⚠️ 32K Context — not suitable for very long documents
(Llama 3.3 offers 128K)
- ⚠️ Inferior to Llama 3.3 on complex analytical tasks
— HumanEval 43.6% vs 68.1%
- ⚠️ No Multimodality — text only
- ✔️ RAM: 6 GB
- ✔️ Command:
ollama pull mistral
- ✔️ Best for: text rewriting, API testing, automation, fast responses
- ✔️ License: Apache 2.0
6. Gemma 4 E4B — Google's Next-Generation Model with Multimodality
Gemma 4 was released in April 2026 and is fundamentally different from Gemma 3.
According to
the official Ollama registry, all Gemma 4 family models are natively multimodal:
they accept text and images, have a configurable thinking mode, and an extended
128K token context window for smaller variants.
The E4B variant (~4B parameters, ~3 GB in Q4) runs comfortably on 8 GB RAM,
leaving space for IDE and browser.
Compared to Gemma 3, the model has received significant improvements:
reasoning with thinking mode, native image processing in all sizes,
improved coding benchmarks, and native function calling support for agent tasks.
The license has changed to Apache 2.0 — completely free for commercial use.
- ✔️ RAM: E2B — ~2 GB / E4B — ~3 GB / 26B — 18+ GB
- ✔️ Command:
ollama pull gemma4:e4b
- ✔️ Best for: general chat, image and screenshot analysis,
thinking mode for more complex tasks, 8 GB RAM
- ✔️ Context: 128K tokens
- ✔️ License: Apache 2.0
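As a quick way to try the multimodal side locally, here is a minimal sketch that sends a base64-encoded image to Ollama's native /api/chat endpoint, the mechanism vision models in Ollama use for image input. The model tag gemma4:e4b is the one from this section, and photo.png is a placeholder path; adjust both to your setup.
import base64
import json
import urllib.request

# Encode a local image (placeholder path) for Ollama's native chat API
with open("photo.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "gemma4:e4b",          # model from this section; adjust to what you pulled
    "stream": False,
    "messages": [
        {
            "role": "user",
            "content": "Describe what is on this screenshot.",
            "images": [image_b64],  # base64-encoded image, per Ollama's /api/chat format
        }
    ],
}

req = urllib.request.Request(
    "http://localhost:11434/api/chat",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["message"]["content"])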
⚠️ Important: if you previously used gemma3:9b —
E4B is a direct replacement with better quality at a smaller size.
More details about Gemma 4 in Ollama —
in the article Gemma 4: Full Overview — Sizes, License, Comparison with Gemma 3.
Conclusion: Llama 3.3 is the standard choice for text and RAG.
Mistral 7B — if speed, API testing, or limited RAM is important.
Gemma 4 E4B — when multimodality and thinking mode are needed on 8 GB RAM.
🎯 Reasoning Models for Complex Tasks: DeepSeek R1, QwQ
Reasoning models are a separate class of LLMs that think step-by-step before
responding. DeepSeek R1 and QwQ are significantly stronger than standard models
in mathematics, logical problems, and complex debugging. They are slower on
simple requests – not worth using for daily chat.
For daily use – Llama 3.3. For tasks where reasoning accuracy is important – DeepSeek R1.
If you need an even more powerful reasoning model via API – DeepSeek V4 Pro.
Hugging Face confirms:
DeepSeek R1 achieves results comparable to OpenAI o1 on math, code, and reasoning tasks – with fully open source code and an MIT license.
What is a reasoning model – and how does it differ from a regular one
A regular language model – Llama, Mistral, Gemma – receives a query and immediately
generates a response. It doesn't "check" itself in the process – it simply predicts
the next token based on the previous ones.
A reasoning model works differently. Chris McCormick explains:
the core idea is "thinking before responding" (Chain-of-Thought).
The model first generates a chain of reasoning between the tags
<think>...</think>, checks itself,
can go back and correct errors – and only then outputs the
final answer.
Sean Goedecke describes
the key difference in training: standard models are trained
on examples of correct answers. DeepSeek R1 is trained through
reinforcement learning – the model generates reasoning chains itself,
and only receives a reward if the final answer is correct.
This means the model can find reasoning methods that were not present in the training data.
How DeepSeek R1's response looks in practice
You send a query – and see two blocks in the response:
<think>
Need to find all prime numbers up to 50.
Starting with 2 – divisible only by 1 and itself, prime.
3 – prime. 4 – divisible by 2, not prime...
...checking each number...
So the list is: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47
</think>
Prime numbers up to 50: 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47.
The <think> block is the reasoning process. It's not an error
or service text – it's what makes the model more accurate.
Trend Micro notes:
when used in production applications, the <think> tags
should be filtered in post-processing – showing only the final answer to the end-user.
7. DeepSeek R1 – the best reasoning model for local deployment
IBM describes
DeepSeek R1 as a model that combines chain-of-thought reasoning with
reinforcement learning – where an autonomous agent learns to solve
tasks through trial and error, without human instructions.
The result: on mathematical and coding benchmarks – OpenAI o1 level,
but with open source code and an MIT license.
Official DeepSeek recommendations
for configuration for the best result:
- ✔️ Temperature: 0.5–0.7 (recommended 0.6) –
too low leads to repetitions, too high – irrelevant answers
- ✔️ System prompt: do not add – all instructions
should be in the user prompt
- ✔️ For mathematics: add to the prompt
"Please reason step by step, and put your final answer within \boxed{}"
- ✔️ Testing: run multiple times and average
the result – the model has some variability
Example: how to correctly ask DeepSeek R1
ollama run deepseek-r1:8b
# For mathematics – with directive
"Find all prime numbers from 1 to 100. Please reason step by step."
# For debugging – with full error context
"Here is a Python function and the error traceback. Find the cause and fix it:
[code]
[traceback]"
# For logical analysis
"Analyze the advantages and disadvantages of this architectural solution
step by step, considering scalability and maintainability:
[architecture description]"
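The same recommendations can be applied from code. A minimal sketch using Ollama's OpenAI-compatible endpoint with the settings listed above (temperature 0.6, no system prompt, the \boxed{} directive inside the user message); treat it as a starting point, not official DeepSeek tooling.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="deepseek-r1:8b",
    temperature=0.6,  # recommended value; 0.5-0.7 range
    messages=[
        # No system prompt: all instructions go into the user message
        {
            "role": "user",
            "content": "Find all prime numbers from 1 to 100. "
                       "Please reason step by step, and put your final answer within \\boxed{}."
        }
    ],
)

print(response.choices[0].message.content)  # includes the <think>...</think> block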
When to use DeepSeek R1, and when not to
| Task | DeepSeek R1 | Llama 3.3 |
|---|---|---|
| Mathematical problems | ✔️ Better | Acceptable |
| Complex debugging | ✔️ Better | Acceptable |
| Logical analysis | ✔️ Better | Acceptable |
| Daily chat | ⚠️ Slow | ✔️ Better |
| Text rewriting | ⚠️ Overkill | ✔️ Better |
| Fast responses | ⚠️ Slow | ✔️ Better |
| Production API without think-tag filtering | ⚠️ Requires post-processing | ✔️ Ready immediately |
- ✔️ RAM: 8B – 8 GB / 14B – 16 GB / 70B – 40+ GB
- ✔️ Command:
ollama pull deepseek-r1:8b
- ✔️ License: MIT – commercial use allowed
- ✔️ Context: 128K tokens
- ⚠️ Limitations: slow on simple tasks,
<think> tags require filtering in production
If R1 8B is not enough: for tasks requiring frontier-level power –
DeepSeek V4 Pro (1.6T parameters, MIT license) is available via API.
It does not run locally on consumer hardware, but it is significantly cheaper
than GPT-5 and Claude Opus at comparable reasoning quality.
More details –
in the article DeepSeek V4 Pro in 2026: a complete breakdown.
8. QwQ – reasoning from Alibaba
QwQ is a reasoning variant of Alibaba's Qwen series, built on the same
chain-of-thought idea as DeepSeek R1. Comparable results
on mathematical benchmarks.
Till Freitag notes:
the Qwen3 series in general is one of the strongest open-source model families
in 2026.
Practical advantage of QwQ: if you are already using Qwen 2.5 Coder for code
and Llama 3.3 for text – QwQ allows you to add reasoning to the same
ecosystem without additional configuration. Behavior with <think> tags
is analogous to DeepSeek R1.
- ✔️ RAM: from 16 GB
- ✔️ Command:
ollama pull qwq
- ✔️ Best for: mathematics, structured analysis,
if already in the Qwen ecosystem
- ⚠️ Limitations: smaller community and fewer tutorials
compared to DeepSeek R1
How to filter <think> tags in Python
If you are using DeepSeek R1 or QwQ via API and want to show
users only the final answer:
import re

def extract_answer(response: str) -> str:
    """Removes the <think>...</think> block from the model's response."""
    clean = re.sub(r'<think>.*?</think>', '', response, flags=re.DOTALL)
    return clean.strip()

raw_response = """
<think>
Need to find the error in the code...
I see the variable is not initialized...
</think>
Error on line 15: variable `counter` used before initialization.
Add `counter = 0` before the loop.
"""

print(extract_answer(raw_response))
# Will output: Error on line 15: variable `counter` used before initialization.
# Add `counter = 0` before the loop.
Conclusion: Reasoning models are a separate tool
for specific tasks. DeepSeek R1 is justified where reasoning accuracy is needed:
mathematics, complex debugging, structured analysis.
For daily use – Llama 3.3 or Mistral remain the better choice.
For frontier-level tasks via API – DeepSeek V4 Pro.
🎯 Models for RAG and Document Work
RAG requires two models: one generates answers, the second creates embeddings for search.
For embeddings in Ollama – nomic-embed-text or mxbai-embed-large.
For document generation – Llama 3.3 or Qwen 2.5 with a 128K context.
RAG is not a single model, but a pipeline. The correct choice of embedding model is as important as the choice of generative model.
What is RAG and why two models are needed
Retrieval-Augmented Generation (RAG) is an approach where the model answers not from memory,
but from your documents. Pipeline: document → chunking → embeddings →
vector database → search for relevant chunks → answer generation.
Embeddings are numerical vectors that encode the semantic meaning of text. A separate, lightweight, and fast model is needed to create them.
Embedding Models for Ollama
- ✔️ nomic-embed-text – the most popular embedding model in Ollama.
High quality, supports large context, 2 GB RAM.
ollama pull nomic-embed-text
- ✔️ mxbai-embed-large – strong results on MTEB benchmark.
ollama pull mxbai-embed-large
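To see what an embedding model returns in practice, here is a minimal sketch against Ollama's native embeddings endpoint. It assumes nomic-embed-text is already pulled; the exact endpoint and field names may vary slightly between Ollama versions, so check the API docs if it fails.
import json
import urllib.request

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Get an embedding vector from Ollama's /api/embeddings endpoint."""
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["embedding"]

vec = embed("Ollama runs language models locally.")
print(len(vec))  # vector dimensionality, e.g. 768 for nomic-embed-text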
Generative Models for RAG
- ✔️ Llama 3.3 8B – 128K context, holds long document context well
- ✔️ Qwen 2.5 14B – 128K context, better quality on analytical tasks with documents
- ⚠️ Mistral 7B – faster, but 32K context is limiting for large documents
More details on building a RAG pipeline –
in the article RAG with Ollama: How to Teach AI to Answer Based on Your Documents.
Section Conclusion: For RAG – nomic-embed-text for embeddings + Llama 3.3 or Qwen 2.5 for generation. 128K context is a mandatory requirement for working with long documents.
🎯 Models for Low-End Hardware: What to Run on 8 GB RAM
Short answer:
On 8 GB RAM, it's realistic to run quality models for most tasks.
Llama 3.3 8B is the best all-around choice. Qwen 2.5 Coder 7B – for code.
Mistral 7B – if speed is needed. Phi-4 Mini and Gemma 4 E2B – if RAM is even less.
On 8 GB RAM in 2026, there's no longer a reason to sacrifice quality – the right model solves most real-world problems.
Recommendations for Tasks on 8 GB RAM
- ✔️ General chat and text: Llama 3.3 8B –
ollama pull llama3.3:8b
- ✔️ Code and programming: Qwen 2.5 Coder 7B –
ollama pull qwen2.5-coder:7b
- ✔️ Fast responses: Mistral 7B –
ollama pull mistral
- ✔️ Mathematics and logic: Phi-4 Mini –
ollama pull phi4-mini
- ✔️ Multimodality and text on 8 GB: Gemma 4 E4B –
ollama pull gemma4:e4b
- ✔️ Less than 4 GB RAM: Gemma 4 E2B –
ollama pull gemma4:e2b
- ✔️ Reasoning on 8 GB: DeepSeek R1 8B –
ollama pull deepseek-r1:8b
What Not to Run on 8 GB RAM
- ⚠️ Models 13B+ in Q4 – will be slow or won't run
- ⚠️ Qwen 2.5 Coder 14B – requires 16 GB
- ⚠️ Phi-4 14B – requires 16 GB
- ⚠️ Llama 3.3 70B – requires 40+ GB
More details – in the article Ollama on Low-End Hardware: A Complete Guide for 8 GB RAM.
Conclusion: 8 GB RAM is a sufficient minimum for quality work with Ollama. Llama 3.3 8B and Qwen 2.5 Coder 7B cover most practical tasks.
📊 Comparison Table: Quality / Speed / RAM / Task
A consolidated table of all models with benchmarks and recommendations.
Sources: SitePoint,
Onyx AI Leaderboard,
CodeGPT.
| Model | RAM | HumanEval | Speed | Context | Best for | Command |
|---|---|---|---|---|---|---|
| Llama 3.3 8B | 8 GB | 68.1% | High | 128K | General chat, RAG, text | ollama pull llama3.3:8b |
| Qwen 2.5 Coder 14B | 16 GB | 72.5% | Medium | 128K | Code, debugging, review | ollama pull qwen2.5-coder:14b |
| Qwen 2.5 Coder 7B | 8 GB | ~65% | High | 128K | Code on 8 GB RAM | ollama pull qwen2.5-coder:7b |
| Mistral 7B | 6 GB | 43.6% | Highest | 32K | Fast responses, automation | ollama pull mistral |
| Phi-4 14B | 16 GB | — | Medium | 16K | Mathematics, logic, structured code | ollama pull phi4 |
| DeepSeek R1 8B | 8 GB | — | Low | 128K | Reasoning, complex analysis | ollama pull deepseek-r1:8b |
| Gemma 4 E4B | ~3 GB | — | High | 128K | Chat, image analysis, thinking mode | ollama pull gemma4:e4b |
| nomic-embed-text | 2 GB | — | Very high | 8K | Embeddings for RAG | ollama pull nomic-embed-text |
| Llama 3.2 Vision | 8 GB | — | Medium | 128K | Image analysis locally | ollama pull llama3.2-vision |
| QwQ | 16 GB | — | Low | 128K | Mathematics, reasoning | ollama pull qwq |
🎯 How to Test a Model in 5 Minutes – Checklist
When I was choosing a model for text rewriting and API testing,
I ran Mistral 7B and Llama 3.3 8B in parallel with the same prompt.
Mistral responded faster – and for my task, this turned out to be more important than
the difference in HumanEval score. Three real prompts from your workflow
will provide more information than any synthetic benchmark.
If you are just starting with Ollama and haven't yet grasped the basic concepts –
we recommend reading the overview before testing models:
What is Ollama and Why Developers Are Massively Switching to Local AI in 2026 –
it explains how the platform works, what tasks it solves,
and who it's suitable for.
The best way to choose a model is to download two candidates and give them the same prompt. The result will be obvious in 10 minutes.
Step 1. Download and Run
ollama pull llama3.3:8b
ollama run llama3.3:8b
Step 2. Check Quality on Your Task
- ✔️ For code: "Write a Python function that [your task]" – check if the code runs without errors
- ✔️ For text: "Rephrase this paragraph in a business style" – compare the result with the original
- ✔️ For analysis: "Summarize this document in 5 points" – paste real work text
- ✔️ For reasoning: "Solve the problem step by step: [mathematical or logical problem]"
Step 3. Check Speed
Run the model with the --verbose flag (for example, ollama run llama3.3:8b --verbose) and Ollama prints generation statistics, including the eval rate in tokens/sec, after the response. For comfortable work, aim for at least 10–15 tokens/sec. If it's less, consider a smaller model or Q4_K_M instead of Q8.
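To measure this programmatically rather than reading the console output, here is a minimal sketch; it assumes the native /api/generate endpoint reports eval_count and eval_duration (in nanoseconds) in its non-streaming response, which may change between Ollama versions.
import json
import urllib.request

payload = {
    "model": "llama3.3:8b",
    "prompt": "Write a Python function for parsing JSON",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    data = json.loads(resp.read())

# eval_duration is reported in nanoseconds
tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tokens_per_sec:.1f} tokens/sec")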
Step 4. Compare Two Candidates on the Same Prompt
# Terminal 1
ollama run llama3.3:8b "Write a Python function for parsing JSON"
# Terminal 2
ollama run qwen2.5-coder:7b "Write a Python function for parsing JSON"
Step 5. Choose and Remove Unnecessary
The model that gives a better result on your task is your primary one.
The rest can be removed to free up disk space:
ollama rm model-name
Conclusion: Testing takes 10–15 minutes and provides a more accurate answer than any review. Start with Llama 3.3 8B as a baseline for comparison.
❓ Frequently Asked Questions (FAQ)
Which model should I download first?
Start with Llama 3.3 8B — if you have 8 GB of RAM. The most balanced option:
good quality, large context, active community support.
Command: ollama pull llama3.3:8b
Can I run multiple models simultaneously?
Technically yes, but each model consumes RAM. Two 8B models simultaneously require 12–16 GB.
Ollama automatically unloads inactive models after 5 minutes — this helps save memory.
Why does Ollama download Q4_K_M by default?
Q4_K_M is the optimal balance between size and quality. For most tasks, the difference between
Q4_K_M and Q8 is insignificant, but Q4_K_M is half the size. If you need maximum quality:
ollama pull llama3.3:8b-instruct-q8_0
How to check which models are installed?
ollama list — shows all downloaded models, their size, and download date.
ollama rm model-name — removes a model and frees up disk space.
Where can I find all available models?
The full catalog is at ollama.com/search.
It can be filtered by task, size, and programming language.
What are <think> tags in DeepSeek R1 responses?
This is a chain of thought — the model's step-by-step "thinking" process before the final answer.
This is expected behavior for reasoning models, not an error. If you're using the API,
you can filter out the <think>...</think> tags in post-processing.
Which Gemma 4 version should I choose for 8 GB RAM?
For 8 GB RAM — Gemma 4 E4B (~3 GB in Q4). Supports text and images,
has a thinking mode and 128K context. Command: ollama pull gemma4:e4b.
If you have less than 4 GB RAM — E2B (~2 GB): ollama pull gemma4:e2b.
The large 26B MoE variant requires 18+ GB and has its own specifics —
more details in the article
Why Gemma 4 26B is slow and when it wins.
How to enable and disable thinking mode in Gemma 4?
Thinking mode in Gemma 4 is controlled via the system prompt: add the token
<|think|> at the beginning of the system prompt to enable it,
or remove it to disable. For simple tasks, thinking mode slows down the response
without improving quality — it's best to enable it only for complex reasoning.
Details on configuration are in the article
Reasoning mode in Gemma 4: how to enable, when needed, and cost.
✅ Conclusions
The choice of an Ollama model depends on three things: hardware, task, and speed requirements. Concise recommendations:
- ✔️ General start, 8 GB RAM → Llama 3.3 8B
- ✔️ Code, 16 GB RAM → Qwen 2.5 Coder 14B
- ✔️ Code, 8 GB RAM → Qwen 2.5 Coder 7B
- ✔️ Maximum speed → Mistral 7B
- ✔️ Math and logic → Phi-4 or DeepSeek R1
- ✔️ Complex analysis → DeepSeek R1 or QwQ
- ✔️ RAG and documents → Llama 3.3 + nomic-embed-text
- ✔️ Images and multimodality → Gemma 4 E4B or Llama 3.2 Vision
- ✔️ Less than 4 GB RAM → Gemma 4 E2B or Phi-4 Mini
The best way to choose is to download two candidates and test them on real tasks within 15 minutes.
📎 Sources
- Ollama Library — official model registry
- AI Tool Discovery: Best Local LLM Models 2026 — HumanEval and MATH benchmarks
- Onyx AI: Self-Hosted LLM Leaderboard 2026 — MMLU-Pro, GPQA Diamond, SWE-bench
- CodeGPT: Choosing the Best Ollama Model — quantization and models for code
- Blue Headline: Llama vs Mistral vs DeepSeek vs Qwen 2026
- O-Mega AI: Top 10 Open Source LLMs 2026 — Gemma 3, Mistral Small, Phi-3
- Till Freitag: Open-Source LLMs Compared 2026 — 20+ models, hardware requirements
- Sebastian Raschka: The Big LLM Architecture Comparison — Qwen3, DeepSeek, Mistral
- WebsCraft — DeepSeek V4 Pro in 2026: Full Review — architecture, benchmarks, and when it's profitable to switch
- WebsCraft — Why Gemma 4 26B is slow and when it wins
- WebsCraft — Reasoning mode in Gemma 4: how to enable, when needed, and cost