When is it worth enabling Reasoning Mode in Gemma 4?

Reasoning Mode is worth enabling for complex math, logic problems, structured text generation, coding, agentic scenarios, and any task where quality matters more than speed. Without thinking, quality on such tasks drops sharply (for example, AIME falls from 89.2% to 20.8%).

Which Gemma 4 models support Reasoning Mode?

Reasoning Mode is supported on all Gemma 4 variants: E2B, E4B, 26B MoE, and 31B Dense. The effect is present at every size, but reasoning quality is higher on the larger models.

Should Reasoning Mode always stay enabled?

For most users, yes, because quality is significantly higher. Use /no_think or "think": false only when you need maximum speed (simple tasks, chat). Many users leave thinking enabled by default.

What are the alternatives to Gemma 4 with faster reasoning?

If thinking in Gemma 4 feels too slow, consider Qwen3 8B: it thinks for 18–32 seconds and generates answers faster at comparable quality on many tasks.

What is Reasoning Mode (thinking) in Gemma 4?

Reasoning Mode is a built-in Gemma 4 feature that makes the model generate an internal monologue of reasoning before the final answer. The model builds a plan, checks its logic, corrects mistakes, and only then produces the result. It relies on the special token <|think|>. This significantly improves quality on complex tasks.

Can thinking be completely disabled in Gemma 4?

Thinking cannot be completely disabled through standard Ollama because it is built into the model. It can, however, be shortened substantially with /no_think, the "think": false API parameter, or a custom Modelfile. In practice, some thinking may still appear.

When can thinking be disabled or shortened in Gemma 4?

Thinking can be shortened for simple questions, translations, short answers, template code, and chats that need a fast reaction. In these cases /no_think makes the response roughly twice as fast.

How long does Reasoning Mode take in Gemma 4?

On a Mac M1 with 16 GB: a simple question takes 15–20 seconds of thinking, ordinary text (explaining RAG) about 37 seconds, and complex code up to 73 seconds. With /no_think, the time is roughly halved. Total response time can reach 1–4 minutes on heavy tasks.

How much does Reasoning Mode affect the quality of Gemma 4's answers?

Very much. With reasoning: AIME 2026 at 89.2%, Codeforces ELO at 2150. Without reasoning: AIME at 20.8%, Codeforces ELO at 110. The model structures answers better, catches errors, and gives more complete results.

How do I enable or disable Reasoning Mode in Gemma 4 through Ollama?

In Ollama, reasoning mode is enabled by default. To shorten thinking, add /no_think at the beginning of the prompt. To disable it through the API (Ollama 0.20+), add "think": false to the request. You can also create a separate model through a Modelfile with an empty SYSTEM prompt for faster operation.

AI_TOOLS 11 April 2026 12 min read 21,802 view

Reasoning mode in Gemma 4: how to enable, when needed, and how much it costs — 2026

Updated: 24 June 2026

Vadim Kharovyuk

CEO & Founder of WebsCraft. 8 years in web development, focused on bringing AI into real products.


In short: Reasoning mode is Gemma 4's built-in ability to "think" before responding. It's enabled by default. On a 16 GB M1, thinking takes 15 to 73 seconds depending on the task. It cannot be completely disabled via Ollama, but it can be shortened using /no_think, the think: false parameter, or a smaller thinking budget. Read on to learn when shortening it is worth it and when it's not.

🧠 What is reasoning mode and how did it appear in Gemma 4

Gemma 4 is the first model in Google's Gemma lineup that can think before answering. This isn't marketing – it's a distinct technical mechanism that genuinely improves response quality.

Reasoning mode (or thinking mode) is the model's ability to generate an internal monologue of reasoning before the final answer. The model constructs a plan, checks logic, corrects itself, and only then presents the result to the user. What you see in the final answer is the outcome after internal "error correction."

Where it came from

The idea of "thinking before answering" is not new, but it only became widespread in open models around 2025-2026. The first to popularize this approach were DeepSeek-R1 (a Chinese open model) and OpenAI o1. Both demonstrated that a model spending time on internal reasoning solves complex tasks significantly better than a model that answers immediately.

Google followed the same path. Gemma 3 lacked reasoning – the model answered immediately, with no intermediate reasoning step. Gemma 4 received a built-in thinking mode as one of its key new features. This explains the most dramatic jump in benchmarks: AIME (competitive mathematics) from 20.8% to 89.2%, Codeforces ELO from 110 to 2150. Such results are out of reach without thinking, because mathematical problems require step-by-step reasoning, not an instant answer.

Since then, reasoning mode has appeared in other popular open models, notably Qwen3.5:4b, which can think with significantly lower resource consumption. This has made the choice between models more interesting: a detailed comparison is in the Gemma 4 vs Qwen3.5:4b section below.

How it works technically

Technically, reasoning in Gemma 4 is implemented through a special token <|think|> in the system prompt. When the model sees this token, it activates the reasoning mode and generates an internal monologue that can run to 4,000 tokens or more before the final answer.

These tokens are not just "extra text." They are a separate pass through the task: the model formulates the problem in its own words, breaks it down into sub-tasks, builds a plan, checks if the plan is logical, and only then starts generating the final answer. If it detects contradictions during thinking, it corrects itself before you even see a single word of the response.

Important detail: via Ollama, the <|think|> token is inserted automatically – you don't need to configure anything. This differs from some other models where thinking needs to be explicitly activated via a system prompt or API parameters.

How it differs from regular generation

Regular text generation in LLMs is a sequential prediction of the next token. The model doesn't "plan" the answer – it simply continues the text token by token based on the context. This works well for simple requests but poorly for tasks requiring logic, mathematics, or a multi-step plan.

Reasoning mode changes this: before generating the final answer, the model gets a "thinking space" where it can freely reason, make mistakes, and correct them. This is a fundamentally different approach – and precisely why models with reasoning show such superior results on complex tasks.

A simple analogy: a regular model is like a student who immediately writes their exam answer. A model with reasoning is like a student who first creates a draft, checks the logic, and only then writes the final version.

What this means for you in practice

If you've just launched Gemma 4 and are surprised that it "thinks" for a long time before responding, now you know why. It's not a bug or a hardware slowdown. It's a deliberate behavior that improves response quality.

On a MacBook Pro M1 16 GB, thinking takes 15 to 73 seconds depending on the task's complexity. Detailed figures are in the section on thinking costs below. For now, let's look at what exactly happens inside the Thinking block.

🔍 What the Thinking block looks like - what's actually happening there

The Thinking block is not a hidden technical log. It's the model's actual reasoning process that can be read and learned from.

When you submit a request to Gemma 4 via the Ollama terminal or UI, a block appears before the answer:

Thinking...
Thinking Process:

1. Analyze the user's input...
2. Identify the core question...
3. Recall personal identity/nature...
...done thinking.

What happens inside this block depends on the task. I've observed three patterns:

For simple questions (e.g., "how many parameters do you have?") – the model builds a short plan of 4-7 steps: determine the language of the query, understand the question, recall relevant facts, formulate the answer. Takes 15-20 seconds.

For complex code (Spring Boot endpoint) – the model analyzes what's needed, lists the components to include (Entity, Repository, Service, Controller), plans the structure, and performs self-correction if something is missed. Takes 60-73 seconds.

For text (explaining RAG for business) – the model identifies the audience, formulates analogies, plans the paragraph structure, and checks if prompt constraints are met. Takes 37 seconds – and thanks to this, it even added a comparison table I didn't ask for, but which genuinely improved the answer.

Key point: the Thinking block is visible only to you – it's not in the final answer. It's the model's internal process.
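
If you capture the terminal output in a script, the reasoning and the answer arrive as one piece of text. Below is a minimal Python sketch for separating them. It assumes the block is delimited exactly by the "Thinking..." and "...done thinking." lines shown above; the function name split_thinking is illustrative and not part of Ollama.

```python
def split_thinking(raw: str) -> tuple[str, str]:
    """Split captured CLI output into (thinking, answer).

    Assumption: the reasoning block opens with "Thinking..." and closes
    with "...done thinking.", as in the sample output above.
    """
    start_marker = "Thinking..."
    end_marker = "...done thinking."

    start = raw.find(start_marker)
    if start == -1:
        # No thinking block at all: everything is the answer.
        return "", raw.strip()

    body_start = start + len(start_marker)
    end = raw.find(end_marker, body_start)
    if end == -1:
        # Block was opened but never closed: treat the text as the answer.
        return "", raw.strip()

    thinking = raw[body_start:end].strip()
    answer = raw[end + len(end_marker):].strip()
    return thinking, answer
```

Keeping the reasoning as a separate string makes it easy to log it for debugging while showing only the answer to the end user.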

⚙️ How to Control Reasoning via Ollama Terminal and API

It's impossible to completely disable thinking in Ollama, but it can be significantly reduced. This is important to know before getting disappointed by the model's speed.

According to Ollama's official documentation, thinking is controlled via a token in the system prompt:

# Thinking is enabled (by default)
# The <|think|> token is inserted automatically

# To disable it, remove the token from the system prompt
# However, this is not straightforward via the standard Ollama CLI

Method 1 — /no_think at the beginning of the prompt:

The simplest way to reduce thinking directly in your query:

ollama run gemma4
>>> /no_think Explain Docker in simple terms

In my tests, this reduces thinking from ~37 seconds to ~20 seconds. It doesn't disable it completely; the model still thinks, but for a shorter duration.
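
To check the effect of /no_think on your own hardware, you can time the same prompt with and without the prefix. The sketch below is a rough harness, not a benchmark. Here ask is a hypothetical placeholder for any function that sends a prompt to the model and returns the answer, and the harness measures total response time rather than thinking time alone.

```python
import time
from typing import Callable


def timed(ask: Callable[[str], str], prompt: str) -> tuple[str, float]:
    """Call ask(prompt) and return (answer, elapsed seconds)."""
    start = time.perf_counter()
    answer = ask(prompt)
    return answer, time.perf_counter() - start


def compare_no_think(ask: Callable[[str], str], prompt: str) -> dict[str, float]:
    """Time one prompt in normal mode and with the /no_think prefix."""
    _, normal_s = timed(ask, prompt)
    _, short_s = timed(ask, f"/no_think {prompt}")
    # Fraction of time saved by /no_think; guarded against a zero baseline.
    saved = 1 - short_s / normal_s if normal_s > 0 else 0.0
    return {"normal_s": normal_s, "no_think_s": short_s, "saved": saved}
```

A single run is noisy on a laptop, so repeating the comparison a few times and averaging gives a more trustworthy number.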

Method 2 — Create a separate model without thinking using Modelfile:

# Create a Modelfile
echo 'FROM gemma4
SYSTEM ""' > Modelfile

# Build the new model
ollama create gemma4-fast -f Modelfile

# Run it
ollama run gemma4-fast

Theoretically, this should remove the system prompt with the <|think|> token. In practice, thinking still appears, but in a shortened form. This is a known behavior of Gemma 4 via Ollama, discussed in GitHub issues.

Method 3 — Via Ollama API with the think: false parameter:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "think": false,
  "messages": [
    {
      "role": "user",
      "content": "Explain what Docker is"
    }
  ]
}'

This is the most reliable method for programmatic control of thinking. The think: false parameter is supported in Ollama 0.24+. Of the methods described here, it comes closest to disabling thinking completely.
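
The same request can be sent from Python using only the standard library. This is a sketch under assumptions: it adds "stream": false so that Ollama returns a single JSON object instead of a stream, and it reads the answer from the message.content field of the response. The function names are illustrative.

```python
import json
import urllib.request

# Same local endpoint as in the curl example above.
OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(prompt: str, think: bool = False, model: str = "gemma4") -> dict:
    """Build the request body that mirrors the curl example."""
    return {
        "model": model,
        "think": think,
        "stream": False,  # assumption: request one JSON object, not a stream
        "messages": [{"role": "user", "content": prompt}],
    }


def chat(prompt: str, think: bool = False, timeout: float = 300.0) -> str:
    """Send the prompt to a locally running Ollama and return the answer text."""
    body = json.dumps(build_chat_payload(prompt, think)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"]
```

With Ollama running locally, chat("Explain what Docker is") would send the same request as the curl command. The generous timeout reflects the multi-minute response times measured on the M1.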

Method 4 — num_predict as indirect control of thinking budget:

In Ollama 0.24+, the num_predict parameter can limit the total number of tokens the model generates, including the thinking block. This is not direct control over reasoning depth, but it provides a more predictable response time:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {
      "role": "user",
      "content": "Explain what Docker is"
    }
  ],
  "options": {
    "num_predict": 1024
  }
}'

For simple tasks, a value of 512–1024 is sufficient and significantly speeds up the response. For complex code or architectural decisions, set it to 2048–4096, otherwise, thinking might be cut off mid-way.
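
The ranges above can be kept in one place so that every request picks a consistent limit. In this minimal sketch the task labels "simple" and "complex" are hypothetical, and the values are the upper ends of the ranges suggested above. Unknown tasks fall back to the larger limit, since a truncated thinking block costs more than a slow answer.

```python
# Upper ends of the ranges suggested above; the task labels are hypothetical.
NUM_PREDICT_BY_TASK = {
    "simple": 1024,   # simple tasks: 512-1024
    "complex": 4096,  # complex code or architectural decisions: 2048-4096
}

# Fallback for unknown task types: prefer the larger budget.
DEFAULT_NUM_PREDICT = 4096


def chat_options(task: str) -> dict:
    """Return the "options" object for an Ollama request."""
    limit = NUM_PREDICT_BY_TASK.get(task, DEFAULT_NUM_PREDICT)
    return {"num_predict": limit}
```

The returned dictionary goes into the "options" field of the request body, exactly where num_predict sits in the curl example.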

More details on how to adjust the thinking depth for a specific task can be found in the Thinking budget section below.

🖥️ How to Control Reasoning in Open WebUI

Controlling thinking is easier in a graphical interface, but the capabilities depend on your UI version.

If you are using Open WebUI or another Ollama-compatible interface, the thinking block is displayed as an expandable section before the response. It is usually collapsed and labeled as "Thought for X seconds".

To reduce thinking in the UI, there are three approaches:

1. Via the System Prompt field (if available in model settings): leave it empty or add your own system prompt without the <|think|> token. However, as my test showed, this does not guarantee complete disabling.

2. Via /no_think at the beginning of the message: this works in the UI just as it does in the terminal – simply add it at the beginning of your query. Thinking will be reduced but not entirely eliminated.

3. Via model parameters in Advanced Settings (Open WebUI 0.6+): in the advanced settings, a num_predict parameter has been added that can be set directly in the interface without making API calls. This is the most convenient option for those who don't want to write curl requests.

For most UI users, the most practical solution is to simply accept that thinking exists and evaluate the model based on the quality of the final response, not its speed.

🧪 My Test: With and Without Thinking — Quality Comparison

I tested on a MacBook Pro M1 16 GB. The same prompt was used for both modes. Here are the results.

Prompt for both tests:

Explain RAG (Retrieval-Augmented Generation) in simple terms for business. No technical jargon. 3-4 paragraphs.

Test 1 — Normal run (thinking enabled):

Thinking took ~37 seconds. The model planned the structure, identified the audience, and chose analogies. Result: a structured response with subheadings, a strong analogy ("a student with all the books in the world vs. an assistant with your company's handbook"), and a comparison table "LLM without RAG vs. with RAG" – which I didn't ask for, but which genuinely improved the answer. Total time: ~1.5 minutes.

Test 2 — With /no_think (thinking reduced):

Thinking took 20.3 seconds. The model responded faster. Result: 4 paragraphs, an analogy ("an intern with an internal knowledge base"), clear and concise. However, without a table, without subheadings, less structured. Total time: ~50 seconds.

Parameter	With thinking (normal)	With /no_think
Thinking time	~37 sec	~20 sec
Total time	~1.5 min	~50 sec
Response structure	Subheadings + table	4 paragraphs without structure
Analogies	✅ Strong	✅ Present, but simpler
Adherence to instructions	Violated (added more)	✅ Exactly 4 paragraphs
Overall quality	⭐⭐⭐⭐⭐	⭐⭐⭐⭐

An interesting nuance: with full thinking, the model violated the "3-4 paragraphs" constraint by adding a table and subheadings I didn't ask for. The deviation was justified, though, because the answer became better. With /no_think, it followed the instructions strictly, but the answer was simpler.

⏱️ How much time does thinking mode take on M1 16 GB

Thinking isn't free. Here are the numbers from my tests on a MacBook Pro M1 16 GB.

Task	Thinking Time	Generation Time	Total
Simple question (model parameters)	~15 sec	~20 sec	~35 sec
Text (RAG for business)	~37 sec	~1 min	~1.5 min
Text with /no_think	~20 sec	~30 sec	~50 sec
Complex code (Spring Boot)	~73 sec	~3 min	~4 min
Qwen3.5:4b — text (for comparison)	~12 sec	~25 sec	~37 sec
Qwen3.5:4b — code (for comparison)	~18 sec	~45 sec	~63 sec
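
To see how much of each Gemma 4 response is spent on thinking, divide the thinking time by the sum of thinking and generation time. The short Python calculation below uses the approximate values from the table, with minutes converted to seconds.

```python
# (thinking seconds, generation seconds), approximate values from the table above.
TIMINGS = {
    "simple question": (15, 20),
    "text (RAG for business)": (37, 60),
    "text with /no_think": (20, 30),
    "complex code (Spring Boot)": (73, 180),
}


def thinking_share(thinking_s: float, generation_s: float) -> float:
    """Fraction of the response time spent in the thinking block."""
    return thinking_s / (thinking_s + generation_s)


for task, (thinking_s, generation_s) in TIMINGS.items():
    share = thinking_share(thinking_s, generation_s)
    print(f"{task}: {share:.0%} of the time is thinking")
```

On these numbers thinking accounts for roughly 29 to 43 percent of the response time, so generation of the answer itself remains the larger part of the wait.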

What affects the duration of thinking:

Task complexity — the more steps need to be planned, the longer it thinks
Number of components in the response — code with four classes takes longer to think about than a paragraph of text
Presence of /no_think — reduces time by approximately half
Current M1 load — if many browser tabs are open, thinking is slower
Model size — Gemma 4 E4B thinks longer than Qwen3.5:4b on similar tasks due to architectural differences

🎛️ Thinking budget: fine-tuning reasoning depth

You don't always need to enable thinking at full power or disable it completely. There's a middle ground — controlling the token budget for reasoning.

The official Google documentation for Gemma 4 describes the "LOW thinking" approach — instead of completely disabling it, you instruct the model to think shorter via a system prompt:

# Full thinking (default)
system: "<|think|>"

# Shortened thinking via instruction
system: "<|think|> Think briefly and efficiently. Focus only on key steps."

# Minimal thinking
system: "<|think|> Use minimal reasoning. Answer directly after a short check."

This works because Gemma 4 follows instructions closely: the model actually reduces its reasoning depth as instructed. In practice, "full" versus "short" thinking means roughly 73 versus 25 seconds for a coding task.
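
The three system prompts above can be generated from a single setting. In this sketch the level names "full", "short", and "minimal" are hypothetical labels, and the instruction texts are copied from the examples. It also assumes the system message reaches the model unchanged, which may depend on how your Ollama version builds the final prompt.

```python
THINK_TOKEN = "<|think|>"

# Instruction texts are taken from the examples above.
# The level names ("full", "short", "minimal") are hypothetical labels.
BUDGET_INSTRUCTIONS = {
    "full": "",
    "short": "Think briefly and efficiently. Focus only on key steps.",
    "minimal": "Use minimal reasoning. Answer directly after a short check.",
}


def system_prompt(level: str = "full") -> str:
    """Return the system prompt for the chosen thinking budget."""
    instruction = BUDGET_INSTRUCTIONS[level]
    return f"{THINK_TOKEN} {instruction}".strip()


def build_messages(prompt: str, level: str = "full") -> list[dict]:
    """Build a chat message list with the budget instruction as the system message."""
    return [
        {"role": "system", "content": system_prompt(level)},
        {"role": "user", "content": prompt},
    ]
```

The resulting list can be used as the "messages" field of an /api/chat request in place of the single user message.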

Practical budget guidelines for different tasks:

Task Type	Recommended Approach	Estimated Time on M1
Factual question, simple Q&A	/no_think or minimal think	15–20 sec
Code generation, structured output	Short thinking	20–35 sec
Math, logic problems	Full thinking	37–73 sec
Architectural decisions, complex plan	Full thinking, num_predict 2048+	60–90 sec

Important: if you set num_predict too low for a complex task, thinking might be cut off mid-way and the quality of the response will suffer. It's better to allow more tokens than to get incomplete reasoning.

⚖️ Gemma 4 vs Qwen3.5:4b — thinking comparison

Both models can think — but in different ways. Here's where each wins and loses.

After the release of Qwen3.5:4b, the question "which model to choose for local work" became more interesting. Both support thinking mode, and both fit within 8–16 GB RAM — but they behave differently.

Characteristic	Gemma 4 E4B	Qwen3.5:4b
Thinking time (text)	~37 sec	~12 sec
Thinking time (code)	~73 sec	~18 sec
Code quality	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Text and structure quality	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Reliability of disabling thinking	Partial (/no_think)	Full (think: false)
Multimodality (photo, audio)	✅ Yes	✅ Yes (Qwen3.5)
Context	128K	256K
File size (4-bit)	~3.3 GB	~2.6 GB
Suitable for daily chat	A bit slow	✅ Excellent
Suitable for complex tasks	✅ Excellent	Good

My conclusion after several months of parallel use: they are not competitors, they are different tools. Qwen3.5:4b suits a fast daily flow of tasks where responsiveness matters. Gemma 4 is for when the task is complex and time is not critical. On an M1 with 16 GB, both can stay loaded at the same time (~6 GB together) without conflict.

✅ When thinking is needed, and when it only slows things down

Thinking is a tool, not an obligation. Enable it when quality is needed, disable it (as much as possible) when speed is needed.

Thinking is definitely worth the time for:

Complex math or logic problems — quality drops dramatically without thinking
Generating structured text — articles, documentation, explanations for business
Agentic scenarios with multiple steps — planning before execution is critical
Code with non-trivial architecture — the model catches errors in the plan before writing them
Any task where quality is more important than time

Thinking can be shortened using /no_think for:

Simple questions with unambiguous answers
Text translation
Short answers where structure is not needed
Chats where responsiveness is important
Template code that you already know

My advice from experience: I leave thinking enabled by default and only add /no_think when I explicitly want a quick answer to a simple question. For complex tasks, thinking justifies itself even on an M1 where it's slower.
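
This default can be written as a small rule: thinking stays on unless the task belongs to one of the quick categories listed above. The category names in this sketch are hypothetical labels chosen for the example.

```python
# Task categories where shortening thinking is suggested above.
# The category names are hypothetical labels for this sketch.
SHORTEN_FOR = {
    "simple_question",
    "translation",
    "short_answer",
    "chat",
    "template_code",
}


def prepare_prompt(prompt: str, task: str) -> str:
    """Prefix /no_think for quick tasks; leave thinking on for everything else."""
    if task in SHORTEN_FOR:
        return f"/no_think {prompt}"
    return prompt
```

Anything not on the list, including math, agentic work, and non-trivial code, passes through unchanged and gets full thinking.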

If you need a model where thinking is faster or where it can be reliably disabled, consider Qwen3.5:4b. A detailed comparison of speed and quality is in the section above.
