Reasoning mode in Gemma 4: how to enable, when needed, and how much it costs — 2026

In short: Reasoning mode is Gemma 4's built-in ability to "think" before responding. It's enabled by default. On a 16 GB M1, it takes 15 to 73 seconds depending on the task. It cannot be fully disabled via Ollama, but its duration can be shortened with /no_think. Read on to see when it's worth the wait and when it isn't.

🧠 What is reasoning mode and how did it appear in Gemma 4

Gemma 4 is the first model in Google's lineup that can think before answering. This isn't marketing – it's a separate technical mechanism that truly changes the quality of responses.

Reasoning mode (or thinking mode) is the model's ability to generate an internal monologue of reasoning before the final answer. The model builds a plan, checks logic, corrects itself – and only then provides the result to the user. What you see in the final answer is already the result after internal "error correction."

Where it came from

The idea of "thinking before answering" is not new – but it only became widespread in open models around 2025-2026. The first to popularize this approach were DeepSeek-R1 (a Chinese open model) and OpenAI o1. Both showed that a model that spends time on internal reasoning solves complex tasks significantly better than a model that answers immediately.

Google followed the same path. Gemma 3 had no reasoning – the model answered immediately, without an internal planning step. Gemma 4 received a built-in thinking mode as one of its key new features. This explains the most dramatic jump in benchmarks: AIME (competitive mathematics) from 20.8% to 89.2%, Codeforces ELO from 110 to 2150. Such results are impossible without thinking – mathematical problems require step-by-step reasoning, not an instant answer.

How it works technically

Technically, reasoning in Gemma 4 is implemented through a special token <|think|> in the system prompt. When the model sees this token, it activates the reasoning mode and generates an internal monologue of up to 4000+ tokens before the final answer.

These 4000 tokens are not just "extra text." It's a separate pass through the task: the model formulates the problem in its own words, breaks it down into sub-tasks, builds a plan, checks if the plan is logical, and only then starts generating the final answer. If it detects contradictions during thinking, it corrects itself before you even see a word of the answer.
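The latency cost of that budget is easy to estimate: thinking tokens are generated at the same decode speed as the visible answer. A minimal sketch of the arithmetic (the 4000-token budget is from this article; the ~55 tokens/sec figure is my assumption for an M1-class machine, so measure your own with `ollama run --verbose`):

```python
# Rough latency estimate for a reasoning pass.
# Assumption: ~55 tokens/sec decode speed on an M1 16 GB (illustrative only).

def thinking_overhead_sec(thinking_tokens: int, tokens_per_sec: float) -> float:
    """Extra seconds spent generating the internal monologue."""
    return thinking_tokens / tokens_per_sec

# A full 4000-token thinking pass at 55 tok/s:
print(round(thinking_overhead_sec(4000, 55.0)))  # -> 73 seconds

# A short ~1100-token plan at the same speed:
print(thinking_overhead_sec(1100, 55.0))  # -> 20.0 seconds
```

Notice how closely these figures bracket the 20-73 second range reported below: the thinking budget, not the hardware, is what dominates the wait.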

Important detail: via Ollama, the <|think|> token is inserted automatically – you don't need to configure anything. This differs from some other models where thinking needs to be explicitly activated via a system prompt or API parameters.

How it differs from regular generation

Regular text generation in LLMs is a sequential prediction of the next token. The model doesn't "plan" the answer – it simply continues the text token by token based on the context. This works well for simple requests but poorly for tasks requiring logic, mathematics, or a multi-step plan.

Reasoning mode changes this: before generating the final answer, the model gets a "thinking space" where it can freely reason, make mistakes, and correct them. This is a fundamentally different approach – and that's why models with reasoning show such better results on complex tasks.

A simple analogy: a regular model is like a student who immediately writes the answer on an exam. A model with reasoning is like a student who first makes a draft, checks the logic, and only then writes the clean copy.
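The two modes can be sketched as a pair of functions. This is a toy illustration of the control flow only – `model` below is a hypothetical stub, not a real LLM call:

```python
def model(prompt: str) -> str:
    # Hypothetical stand-in: a real setup would call an LLM here.
    return f"<answer to: {prompt}>"

def direct_answer(question: str) -> str:
    """Regular mode: one pass, straight to the answer."""
    return model(question)

def reasoning_answer(question: str) -> str:
    """Reasoning mode: a hidden planning pass first, then the final answer."""
    plan = model(f"Think step by step about: {question}")  # internal monologue
    # Only this second result is shown to the user:
    return model(f"Using this plan: {plan}\nAnswer: {question}")
```

The key structural point is that `reasoning_answer` spends a whole extra generation pass before producing anything the user sees – which is exactly where the time cost discussed below comes from.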

What this means for you in practice

If you've run Gemma 4 for the first time and were surprised that it "thinks" for a long time before answering – now you know why. It's not a bug or hardware slowdown. It's a deliberate behavior that improves the quality of the response.

On a MacBook Pro M1 16 GB, thinking takes 15 to 73 seconds depending on the task's complexity. Detailed figures are in the section on thinking cost below. In the meantime, let's look at what exactly happens inside the Thinking block.

🔍 What the Thinking block looks like – what's actually happening there

The Thinking block is not a hidden technical log. It's the model's actual reasoning process that can be read and learned from.

When you send a request to Gemma 4 via the Ollama terminal or UI, a block appears before the answer:

Thinking...
Thinking Process:

1. Analyze the user's input...
2. Identify the core question...
3. Recall personal identity/nature...
...done thinking.

What happens inside this block depends on the task. I've observed three patterns:

For simple questions (e.g., "how many parameters do you have?") – the model builds a short plan of 4-7 steps: determine the language of the request, understand the question, recall relevant facts, formulate the answer. Takes 20-37 seconds.

For complex code (Spring Boot endpoint) – the model analyzes what is needed, lists the components to include (Entity, Repository, Service, Controller), plans the structure, performs self-correction if something is forgotten. Takes 60-73 seconds.

For text (explaining RAG for business) – the model determines the audience, formulates analogies, plans the paragraph structure, checks if prompt constraints are met. Takes 37 seconds – and thanks to this, it independently added a comparison table that I didn't ask for, but which genuinely improved the answer.

Key point: the Thinking block is only visible to you – it's not in the final answer. It's the model's internal process.
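If you capture the raw CLI output programmatically (for logging or piping into another tool), the block can be stripped before storing the answer. A minimal sketch, assuming the `Thinking...` / `...done thinking.` markers shown above delimit the block:

```python
import re

def strip_thinking(raw: str) -> str:
    """Remove the Thinking... ... ...done thinking. block, keep only the final answer."""
    return re.sub(r"Thinking\.\.\..*?\.\.\.done thinking\.\s*", "", raw, flags=re.DOTALL)

raw = """Thinking...
1. Analyze the user's input...
...done thinking.
Docker is a tool for packaging applications."""

print(strip_thinking(raw))  # -> Docker is a tool for packaging applications.
```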

Reasoning mode in Gemma 4: how to enable, when needed, and how much it costs — 2026

⚙️ How to control reasoning via Ollama terminal and API

It's impossible to completely disable thinking via Ollama — but it can be significantly reduced, and it's worth knowing this before you write the model off as slow.

According to Ollama's official documentation, thinking is controlled via a token in the system prompt:

# Thinking is enabled (by default)
# The <|think|> token is inserted automatically

# To disable — remove the token from the system prompt
# But this is not so simple via the standard Ollama CLI

Method 1 — /no_think at the beginning of the prompt:

The easiest way to reduce thinking directly in the request:

ollama run gemma4
>>> /no_think Explain Docker in simple terms

According to my tests, this reduces thinking from ~37 seconds to ~20 seconds. It doesn't disable it completely — the model still thinks, but for a shorter time.

Method 2 — create a separate model without thinking via Modelfile:

# Create Modelfile
echo 'FROM gemma4
SYSTEM ""' > Modelfile

# Build a new model
ollama create gemma4-fast -f Modelfile

# Run
ollama run gemma4-fast

Theoretically, this should remove the system prompt with the <|think|> token. In practice — thinking still appears, but in a shortened form. This is a known behavior of Gemma 4 via Ollama, discussed in GitHub issues.

Method 3 — via Ollama API with the think: false parameter:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "think": false,
  "messages": [
    {
      "role": "user",
      "content": "Explain what Docker is"
    }
  ]
}'

This is the most reliable way for programmatic thinking control. The think: false parameter is supported in Ollama 0.20+.
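The same request can be built from Python instead of curl. A minimal sketch using only the standard library; the payload mirrors the curl call above ("stream": false is added so the reply arrives as one JSON object), and actually sending it of course requires a running Ollama server:

```python
import json
from urllib import request

OLLAMA_URL = "http://localhost:11434/api/chat"

def chat_payload(model: str, prompt: str, think: bool) -> dict:
    """Build the /api/chat request body; think=False shortens the reasoning pass."""
    return {
        "model": model,
        "think": think,
        "stream": False,  # return a single JSON object instead of a token stream
        "messages": [{"role": "user", "content": prompt}],
    }

payload = chat_payload("gemma4", "Explain what Docker is", think=False)

def send(payload: dict) -> dict:
    """POST the payload to a locally running Ollama server."""
    req = request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)
```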

🖥️ How to control reasoning in Open WebUI

Controlling thinking in a graphical interface is easier — but the capabilities depend on your UI version.

If you are using Open WebUI or another Ollama-compatible interface — the thinking block is displayed as an expandable section before the response. It is usually collapsed and marked as "Thought for X seconds".

To reduce thinking in the UI — there are two approaches:

1. Via the System Prompt field (if available in model settings): leave it empty or add your own system prompt without the <|think|> token. But as my test showed — this doesn't guarantee complete disabling.

2. Via /no_think at the beginning of the message: works in the UI just like in the terminal — just add it at the beginning of the request. Thinking will be reduced but won't disappear completely.

For most UI users — the most practical solution is to simply accept that thinking exists and evaluate the model by the quality of the final answer, not by its speed.

🧪 My test: with and without thinking — quality comparison

I tested on a MacBook Pro M1 16 GB. The same prompt — two modes. Here's what I got.

Prompt for both tests:

Explain RAG (Retrieval-Augmented Generation) in simple terms for business. No technical jargon. 3-4 paragraphs.

Test 1 — normal run (thinking enabled):

Thinking took ~37 seconds. The model planned the structure, identified the audience, and chose analogies. Result: a structured answer with subheadings, a strong analogy ("a student with all the books in the world vs. an assistant with your company's reference book"), and a comparison table "LLM without RAG vs. with RAG" — which I didn't ask for, but which genuinely improved the answer. Total time: ~1.5 minutes.

Test 2 — with /no_think (thinking reduced):

Thinking took 20.3 seconds. The model responded faster. Result: 4 paragraphs, there's an analogy ("an intern with an internal knowledge base"), clear and concise. But — no table, no subheadings, less structured. Total time: ~50 seconds.

| Parameter | With thinking (normal) | With /no_think |
|---|---|---|
| Thinking time | ~37 sec | ~20 sec |
| Total time | ~1.5 min | ~50 sec |
| Response structure | Subheadings + table | 4 paragraphs without structure |
| Analogies | ✅ Strong | ✅ Present, but simpler |
| Adherence to instructions | Violated (added more) | ✅ Exactly 4 paragraphs |
| Overall quality | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |

Interesting nuance: with full thinking, the model *violated* the "3-4 paragraphs" constraint — it added a table and subheadings that I didn't ask for. But it did it correctly — the answer became better. With /no_think — it strictly followed the instructions, but the answer is simpler.

⏱️ How much time does thinking mode take on M1 16 GB

Thinking is not free. Here are the numbers from my tests on MacBook Pro M1 16 GB.

| Task | Thinking time | Generation time | Total |
|---|---|---|---|
| Simple question (model parameters) | ~15 sec | ~20 sec | ~35 sec |
| Text (RAG for business) | ~37 sec | ~1 min | ~1.5 min |
| Text with /no_think | ~20 sec | ~30 sec | ~50 sec |
| Complex code (Spring Boot) | ~73 sec | ~3 min | ~4 min |

What affects the duration of thinking:

  • Task complexity — the more steps need to be planned, the longer it thinks
  • Number of components in the response — code with four classes takes longer to think about than one paragraph of text
  • Presence of /no_think — reduces by approximately half
  • Current load on M1 — if many browser tabs are open, thinking is slower

For comparison: Qwen3:8b thinks for 18-32 seconds and generates text in 20-35 seconds on the same tasks, so its full cycle is 38-67 seconds versus 35-240 seconds for Gemma 4. The difference is noticeable in daily work.

✅ When thinking is needed and when it only slows things down

Thinking is a tool, not an obligation. Turn it on when quality is needed, turn it off (as much as possible) when speed is needed.

Thinking is definitely worth the time:

  • Complex mathematics or logical problems — without thinking, quality drops dramatically
  • Generating structured text — articles, documentation, explanations for business
  • Agent scenarios with multiple steps — planning before execution is critical
  • Code with non-trivial architecture — the model itself catches errors in the plan before writing them
  • Any task where quality is more important than time

Thinking can be shortened using /no_think:

  • Simple questions with unambiguous answers
  • Text translation
  • Short answers where structure is not needed
  • Chat where reactivity is important
  • Template code that you already know
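This split can be automated if you drive Ollama from a script: prepend /no_think for requests that fall into the "speed" bucket. A minimal sketch – the task categories come from the lists above, but the keyword check itself is naive and purely illustrative:

```python
# Assumption: a crude keyword heuristic for "simple" requests (illustrative only).
FAST_HINTS = ("translate", "what is", "how many", "quick answer")

def prepare_prompt(prompt: str) -> str:
    """Prepend /no_think when speed matters more than depth."""
    if any(hint in prompt.lower() for hint in FAST_HINTS):
        return "/no_think " + prompt
    return prompt

print(prepare_prompt("Translate this sentence to French"))
# -> /no_think Translate this sentence to French

print(prepare_prompt("Design a Spring Boot service with four layers"))
# -> Design a Spring Boot service with four layers  (full thinking kept)
```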

My advice from experience: I leave thinking enabled by default and only add /no_think when I explicitly want a quick answer to a simple question. For complex tasks, thinking pays for itself even on an M1, where it is slower.

If you need a model where thinking is faster or where it can be reliably turned off, consider Qwen3:8b. Detailed comparison: Gemma 4 on M1 16 GB — real tests: code, text, speed.

Vadym Kharovyk — developer, founder of WebsCraft and AskYourDocs. I test local AI models on my own Mac M1 and write about what really works.
