Reasoning mode in Gemma 4: how to enable, when needed, and how much it costs — 2026

In short: Reasoning mode is Gemma 4's built-in ability to "think" before responding. It's enabled by default. On a 16 GB M1, thinking takes 15 to 73 seconds depending on the task. It cannot be fully disabled via Ollama, but its duration can be reduced using /no_think. Read on to see when the wait is worth it and when it's not.

🧠 What is reasoning mode and how did it appear in Gemma 4

Gemma 4 is the first model in Google's Gemma lineup that can think before answering. This isn't marketing – it's a separate technical mechanism that noticeably changes the quality of responses.

Reasoning mode (or thinking mode) is the model's ability to generate an internal monologue of reasoning before the final answer. The model builds a plan, checks logic, corrects itself – and only then provides the result to the user. What you see in the final answer is already the result after internal "error correction."

Where it came from

The idea of "thinking before answering" is not new – but it only became widespread in open models around 2025-2026. The first to popularize this approach were DeepSeek-R1 (a Chinese open model) and OpenAI o1. Both showed that a model that spends time on internal reasoning solves complex tasks significantly better than a model that answers immediately.

Google followed the same path. Gemma 3 did not have reasoning – the model answered immediately, with no chance to check itself first. Gemma 4 received a built-in thinking mode as one of its key new features. This goes a long way toward explaining the most dramatic jump in benchmarks: AIME (competitive mathematics) from 20.8% to 89.2%, Codeforces ELO from 110 to 2150. Results like these are hard to reach without thinking – mathematical problems require step-by-step reasoning, not an instant answer.

How it works technically

Technically, reasoning in Gemma 4 is implemented through a special token <|think|> in the system prompt. When the model sees this token, it activates the reasoning mode and generates an internal monologue before the final answer. On complex tasks that monologue can run to 4,000 tokens or more.

These thinking tokens are not just "extra text." They are a separate pass through the task: the model formulates the problem in its own words, breaks it down into sub-tasks, builds a plan, checks if the plan is logical, and only then starts generating the final answer. If it detects contradictions during thinking, it corrects itself before you even see a word of the answer.

Important detail: via Ollama, the <|think|> token is inserted automatically – you don't need to configure anything. This differs from some other models where thinking needs to be explicitly activated via a system prompt or API parameters.

How it differs from regular generation

Regular text generation in LLMs is a sequential prediction of the next token. The model doesn't "plan" the answer – it simply continues the text token by token based on the context. This works well for simple requests but poorly for tasks requiring logic, mathematics, or a multi-step plan.

Reasoning mode changes this: before generating the final answer, the model gets a "thinking space" where it can freely reason, make mistakes, and correct them. This is a fundamentally different approach – and that's why models with reasoning show such better results on complex tasks.

A simple analogy: a regular model is like a student who immediately writes the answer on an exam. A model with reasoning is like a student who first makes a draft, checks the logic, and only then writes the clean copy.

What this means for you in practice

If you've run Gemma 4 for the first time and were surprised that it "thinks" for a long time before answering – now you know why. It's not a bug or hardware slowdown. It's a deliberate behavior that improves the quality of the response.

On a MacBook Pro M1 16 GB, thinking takes 15 to 73 seconds depending on the task's complexity. Detailed figures are in the section on thinking cost below. In the meantime, let's look at what exactly happens inside the Thinking block.

🔍 What the Thinking block looks like – what's actually happening there

The Thinking block is not a hidden technical log. It's the model's actual reasoning process that can be read and learned from.

When you send a request to Gemma 4 via the Ollama terminal or UI, a block appears before the answer:

Thinking...
Thinking Process:

1. Analyze the user's input...
2. Identify the core question...
3. Recall personal identity/nature...
...done thinking.

What happens inside this block depends on the task. I've observed three patterns:

For simple questions (e.g., "how many parameters do you have?") – the model builds a short plan of 4-7 steps: determine the language of the request, understand the question, recall relevant facts, formulate the answer. Takes 20-37 seconds.

For complex code (Spring Boot endpoint) – the model analyzes what is needed, lists the components to include (Entity, Repository, Service, Controller), plans the structure, performs self-correction if something is forgotten. Takes 60-73 seconds.

For text (explaining RAG for business) – the model determines the audience, formulates analogies, plans the paragraph structure, checks if prompt constraints are met. Takes 37 seconds – and thanks to this, it independently added a comparison table that I didn't ask for, but which genuinely improved the answer.

Key point: the Thinking block is shown to you separately – it's not part of the final answer. It's the model's internal process.
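If you capture the terminal output and want only the final answer, the block can be cut out by its markers. This is a minimal sketch, assuming the transcript contains the literal "Thinking..." and "...done thinking." lines shown above. The function name and the sample answer text are made up for illustration.

```python
THINK_START = "Thinking..."
THINK_END = "...done thinking."


def split_thinking(transcript: str) -> tuple[str, str]:
    """Split a captured CLI transcript into (thinking, answer).

    Assumes the markers appear exactly as in the terminal output above.
    If either marker is missing, the whole text is treated as the answer.
    """
    start = transcript.find(THINK_START)
    end = transcript.find(THINK_END)
    if start == -1 or end == -1 or end < start:
        return "", transcript.strip()
    thinking = transcript[start + len(THINK_START):end].strip()
    answer = transcript[end + len(THINK_END):].strip()
    return thinking, answer


# Hypothetical transcript: the answer line is invented for the example.
sample = """Thinking...
Thinking Process:

1. Analyze the user's input...
2. Identify the core question...
...done thinking.

Docker packages an app with everything it needs to run."""

thinking, answer = split_thinking(sample)
print(answer)
```

The thinking part stays available in the first element of the tuple if you want to read how the model planned the answer.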


⚙️ How to control reasoning via Ollama terminal and API

It's impossible to completely disable thinking via Ollama — but it can be significantly reduced. And it's important to know this before getting disappointed in the model's speed.

According to Ollama's official documentation, thinking is controlled via a token in the system prompt:

# Thinking is enabled (by default)
# The <|think|> token is inserted automatically

# To disable — remove the token from the system prompt
# But this is not so simple via the standard Ollama CLI

Method 1 — /no_think at the beginning of the prompt:

The easiest way to reduce thinking directly in the request:

ollama run gemma4
>>> /no_think Explain Docker in simple terms

According to my tests, this reduces thinking from ~37 seconds to ~20 seconds. It doesn't disable it completely — the model still thinks, but for a shorter time.

Method 2 — create a separate model without thinking via Modelfile:

# Create Modelfile
echo 'FROM gemma4
SYSTEM ""' > Modelfile

# Build a new model
ollama create gemma4-fast -f Modelfile

# Run
ollama run gemma4-fast

Theoretically, this should remove the system prompt with the <|think|> token. In practice — thinking still appears, but in a shortened form. This is a known behavior of Gemma 4 via Ollama, discussed in GitHub issues.

Method 3 — via Ollama API with the think: false parameter:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "think": false,
  "messages": [
    {
      "role": "user",
      "content": "Explain what Docker is"
    }
  ]
}'

This is the most reliable way for programmatic thinking control. The think: false parameter is supported in Ollama 0.20+.
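The same request can be sent from a script. Below is a minimal sketch using only the Python standard library, assuming Ollama listens on the default localhost:11434 address from the curl example. The function names are my own, and "stream": false is added so the reply arrives as a single JSON object instead of a stream of chunks.

```python
import json
import urllib.request

# Same endpoint as in the curl example above.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt: str, think: bool, model: str = "gemma4") -> dict:
    """Build the JSON body for /api/chat with thinking switched on or off."""
    return {
        "model": model,
        "think": think,
        "stream": False,  # one JSON object instead of streamed chunks
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str, think: bool = False, timeout: float = 300.0) -> dict:
    """Send the request to a locally running Ollama and return the parsed reply."""
    body = json.dumps(build_chat_payload(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,  # passing data makes this a POST request
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Usage (requires a running Ollama with the model pulled):
# reply = chat("Explain what Docker is", think=False)
# print(reply["message"]["content"])
```

The generous timeout is deliberate: with thinking enabled, a single reply on an M1 can take several minutes.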

🖥️ How to control reasoning in Open WebUI

Controlling thinking in a graphical interface is easier — but the capabilities depend on your UI version.

If you are using Open WebUI or another Ollama-compatible interface — the thinking block is displayed as an expandable section before the response. It is usually collapsed and marked as "Thought for X seconds".

To reduce thinking in the UI — there are two approaches:

1. Via the System Prompt field (if available in model settings): leave it empty or add your own system prompt without the <|think|> token. But as my test showed — this doesn't guarantee complete disabling.

2. Via /no_think at the beginning of the message: works in the UI just like in the terminal — just add it at the beginning of the request. Thinking will be reduced but won't disappear completely.

For most UI users — the most practical solution is to simply accept that thinking exists and evaluate the model by the quality of the final answer, not by its speed.

🧪 My test: with and without thinking — quality comparison

I tested on a MacBook Pro M1 16 GB. The same prompt — two modes. Here's what I got.

Prompt for both tests:

Explain RAG (Retrieval-Augmented Generation) in simple terms for business. No technical jargon. 3-4 paragraphs.

Test 1 — normal run (thinking enabled):

Thinking took ~37 seconds. The model planned the structure, identified the audience, and chose analogies. Result: a structured answer with subheadings, a strong analogy ("a student with all the books in the world vs. an assistant with your company's reference book"), and a comparison table "LLM without RAG vs. with RAG" — which I didn't ask for, but which genuinely improved the answer. Total time: ~1.5 minutes.

Test 2 — with /no_think (thinking reduced):

Thinking took 20.3 seconds. The model responded faster. Result: 4 paragraphs, there's an analogy ("an intern with an internal knowledge base"), clear and concise. But — no table, no subheadings, less structured. Total time: ~50 seconds.

| Parameter | With thinking (normal) | With /no_think |
| --- | --- | --- |
| Thinking time | ~37 sec | ~20 sec |
| Total time | ~1.5 min | ~50 sec |
| Response structure | Subheadings + table | 4 paragraphs without structure |
| Analogies | ✅ Strong | ✅ Present, but simpler |
| Adherence to instructions | Violated (added more) | ✅ Exactly 4 paragraphs |
| Overall quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |

Interesting nuance: with full thinking, the model *violated* the "3-4 paragraphs" constraint — it added a table and subheadings that I didn't ask for. But it did it correctly — the answer became better. With /no_think — it strictly followed the instructions, but the answer is simpler.

⏱️ How much time does thinking mode take on M1 16 GB

Thinking is not free. Here are the numbers from my tests on MacBook Pro M1 16 GB.

| Task | Thinking time | Generation time | Total |
| --- | --- | --- | --- |
| Simple question (model parameters) | ~15 sec | ~20 sec | ~35 sec |
| Text (RAG for business) | ~37 sec | ~1 min | ~1.5 min |
| Text with /no_think | ~20 sec | ~30 sec | ~50 sec |
| Complex code (Spring Boot) | ~73 sec | ~3 min | ~4 min |

What affects the duration of thinking:

  • Task complexity — the more steps need to be planned, the longer it thinks
  • Number of components in the response — code with four classes takes longer to think about than one paragraph of text
  • Presence of /no_think — reduces by approximately half
  • Current load on M1 — if many browser tabs are open, thinking is slower
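To measure this on your own machine, the timing fields in a non-streaming Ollama API reply are a convenient source. The sketch below assumes the reply carries total_duration, eval_duration (both in nanoseconds) and eval_count, as described in the Ollama API docs. The helper name is mine, and the sample values are invented placeholders, not measurements. As far as I understand, eval_duration covers all generated tokens, so it does not separate thinking from the final answer.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def summarize_timing(response: dict) -> dict:
    """Convert Ollama's nanosecond timing fields into readable numbers."""
    total_s = response["total_duration"] / NANOSECONDS_PER_SECOND
    eval_s = response["eval_duration"] / NANOSECONDS_PER_SECOND
    tokens = response["eval_count"]
    return {
        "total_seconds": round(total_s, 1),
        "generation_seconds": round(eval_s, 1),
        # Guard against division by zero on an empty reply.
        "tokens_per_second": round(tokens / eval_s, 1) if eval_s > 0 else 0.0,
    }


# Placeholder values for illustration only, not real measurements.
example_response = {
    "total_duration": 50_000_000_000,
    "eval_duration": 40_000_000_000,
    "eval_count": 600,
}

print(summarize_timing(example_response))
```

Running the same prompt several times with a different number of apps open is a simple way to check how much background load matters on your own hardware.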

For comparison: Qwen3:8b thinks for 18-32 seconds and generates text in 20-35 seconds for the same tasks. This means the full cycle for Qwen3 is 38-67 seconds, versus roughly 35 seconds to 4 minutes for Gemma 4 (see the table above). The difference is significant for daily work.
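The arithmetic behind these totals is easy to reproduce. The sketch below takes the rounded figures from the table and the Qwen3:8b ranges quoted above, converts minutes to seconds, and adds thinking and generation time. All inputs are approximate, so the results are only as precise as the "~" values they come from.

```python
# (thinking seconds, generation seconds), rounded values from the table above
GEMMA4_TASKS = {
    "simple question": (15, 20),
    "text (RAG for business)": (37, 60),
    "text with /no_think": (20, 30),
    "complex code (Spring Boot)": (73, 180),
}

# (low, high) bounds in seconds, quoted for Qwen3:8b
QWEN3_THINKING = (18, 32)
QWEN3_GENERATION = (20, 35)

# Full cycle = thinking + generation, per task.
gemma_totals = {task: t + g for task, (t, g) in GEMMA4_TASKS.items()}

# Share of the full cycle spent on thinking, in whole percent.
thinking_share = {
    task: round(100 * t / (t + g)) for task, (t, g) in GEMMA4_TASKS.items()
}

# Add the low bounds together and the high bounds together.
qwen3_cycle = (
    QWEN3_THINKING[0] + QWEN3_GENERATION[0],
    QWEN3_THINKING[1] + QWEN3_GENERATION[1],
)

for task, total in gemma_totals.items():
    print(f"{task}: {total} s total, {thinking_share[task]}% thinking")
print(f"Qwen3:8b full cycle: {qwen3_cycle[0]}-{qwen3_cycle[1]} s")
```

In these numbers thinking accounts for roughly 30-45% of the total wait, and the share is smallest for the long code task, where generation dominates.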

✅ When thinking is needed and when it only slows things down

Thinking is a tool, not an obligation. Turn it on when quality is needed, turn it off (as much as possible) when speed is needed.

Thinking is definitely worth the time:

  • Complex mathematics or logical problems — without thinking, quality drops dramatically
  • Generating structured text — articles, documentation, explanations for business
  • Agent scenarios with multiple steps — planning before execution is critical
  • Code with non-trivial architecture — the model itself catches errors in the plan before writing them
  • Any task where quality is more important than time

Thinking can be shortened using /no_think:

  • Simple questions with unambiguous answers
  • Text translation
  • Short answers where structure is not needed
  • Chat where reactivity is important
  • Template code that you already know

My advice from experience: I leave thinking enabled by default and only add /no_think when I explicitly want a quick answer to a simple question. For complex tasks, thinking justifies itself even on M1, where it is slower.
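The two lists above can be turned into a small routing rule for scripts. This is a sketch of my own habit, not a feature of Ollama: the task labels and the function name are hypothetical, and anything not explicitly marked as "quick" keeps full thinking.

```python
# Hypothetical task labels, mirroring the "can be shortened" list above.
QUICK_TASKS = {
    "simple_question",
    "translation",
    "short_answer",
    "chat",
    "template_code",
}


def prepare_prompt(prompt: str, task_kind: str) -> str:
    """Prefix /no_think for quick tasks; leave every other prompt untouched.

    Unknown task kinds keep full thinking, matching the default of
    leaving thinking enabled unless speed is explicitly wanted.
    """
    if task_kind in QUICK_TASKS:
        return f"/no_think {prompt}"
    return prompt


print(prepare_prompt("Explain Docker in simple terms", "simple_question"))
print(prepare_prompt("Design a Spring Boot endpoint", "architecture_code"))
```

Defaulting to full thinking for unknown labels is the safer choice: a slow good answer costs less than a fast wrong one.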

If you need a model where thinking is faster or where it can be reliably turned off, consider Qwen3:8b. Detailed comparison: Gemma 4 on M1 16 GB — real tests: code, text, speed.


Vadym Kharovyk — developer, founder of WebsCraft and AskYourDocs. I test local AI models on my own Mac M1 and write about what really works.
