In short: Installed Gemma 4 on a MacBook Pro M1 16 GB and tested it on two real tasks — generating Spring Boot code and text about RAG. Compared it with Qwen3:8b and Mistral Nemo. Result: Gemma 4 produces the best quality, but is the slowest. Qwen3:8b — almost the same code quality in 1/4 of the time. Read if you want to know if it's worth switching.
⚠️ How I installed Gemma 4 on M1: a real error with the Ollama version
The first thing I saw was not the model, but an error. And this is the first useful fact for those who want to repeat it.
I have been using Ollama for local AI for a long time, so the moment Gemma 4 was released I simply typed in the terminal:
ollama run gemma4
And immediately got:
Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at: https://ollama.com/download
The reason is simple: I had version 0.17.0 installed, while Gemma 4 requires at least 0.20. To check your version: ollama --version. You can update either through the official download page or via Homebrew, which is what I did (see the official Ollama documentation):
brew upgrade ollama
brew services restart ollama
After that, version 0.20.5 was installed, and the model downloaded without problems. If you installed Ollama a long time ago — check the version before trying Gemma 4. You'll save 10 minutes of searching for the error cause.
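The version gate above boils down to a simple semver comparison. A minimal sketch of that check — the 0.20 minimum comes from the error story above, and both functions here are hypothetical helpers, not part of Ollama:

```python
import re

# Sketch: decide whether an Ollama install is new enough for a model
# that requires a minimum server version (here 0.20, per the error above).

def parse_version(text: str) -> tuple[int, ...]:
    """Extract a dotted version like '0.17.0' from `ollama --version` output."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if not match:
        raise ValueError(f"no version found in: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))

def needs_upgrade(version_output: str, minimum: tuple[int, ...] = (0, 20)) -> bool:
    """True if the installed version is older than the required minimum."""
    # Tuple comparison handles mixed lengths: (0, 17, 0) < (0, 20)
    return parse_version(version_output) < minimum

print(needs_upgrade("ollama version is 0.17.0"))  # old install -> True
print(needs_upgrade("ollama version is 0.20.5"))  # after brew upgrade -> False
```

The same comparison works for any future minimum-version bump: just pass a different `minimum` tuple.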
Downloading the model:
ollama run gemma4
Size: 9.6 GB. On my connection, the download took about 2 hours. After downloading, the model launched straight into the terminal: the ⠇ spinner means it is loading into memory, and after a few seconds the >>> prompt appears.
💾 Which Gemma 4 variant is suitable for M1 16 GB and why not 26B
Gemma 4 is not one model, but four. And only one of them is suitable for M1 16 GB.
About gemma4:26b separately — it's actively advertised online as "MoE magic: 26B quality at 8B price". This is not entirely true. The actual file size is 18 GB, and on an M1 with 16 GB of unified memory, it simply won't fit without aggressive swapping. Even on a Mac mini with 24 GB, people report freezes under load and returning to e4b. More details on this — in a separate article about the pitfalls of Gemma 4 26B MoE.
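The memory math behind that warning can be sketched as a rule of thumb. The 70% usable-RAM fraction below is my own assumption, not an Apple or Ollama figure — macOS, the GPU reservation, and other apps all need headroom in unified memory:

```python
# Rough check: will a model file fit in unified memory without swapping?
# Assumption: only ~70% of total RAM is realistically available to the model.

def fits_in_memory(model_gb: float, ram_gb: float, usable_fraction: float = 0.7) -> bool:
    return model_gb <= ram_gb * usable_fraction

print(fits_in_memory(9.6, 16))   # gemma4 e4b on M1 16 GB -> True
print(fits_in_memory(18.0, 16))  # gemma4:26b on M1 16 GB -> False
print(fits_in_memory(18.0, 24))  # 26b on a 24 GB Mac -> False (matches the reported freezes)
```

By this estimate, the 18 GB 26B file only becomes comfortable at 32 GB and up, which matches the reports above.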
My choice: gemma4 (e4b) — the default option, no need to specify anything extra.
💻 Test 1 — code generation: Spring Boot endpoint with pagination
The same prompt — three models. Let's see what came out.
The prompt I used:
Write a Spring Boot REST endpoint to get a list of users with pagination. Use JPA Repository.
I deliberately chose this task — I know Spring Boot well, so I can evaluate the quality without Googling.
Gemma 4 — result:
Full structure: Entity → Repository → Service → Controller + dependencies in pom.xml + examples of URL requests. Correct DI via constructor, ResponseEntity<Page<User>>, comments for each step. This is production-ready code that can be taken and used. The only downside is the time. First, it "thought" for 73 seconds (Thinking block), then it generated text for ~3 minutes. Total almost 4 minutes.
Qwen3:8b — result:
The same full structure: Entity + Repository + Service + Controller. Additionally — dependencies for both Maven and Gradle (which Gemma didn't do). Code quality is practically identical. Time: ~32 seconds thinking + ~35 seconds generation = 67 seconds total. 3.5 times faster.
Mistral Nemo — result:
Minimal code — only a Controller, with no separate Service layer. The same block of code appeared twice (it looks like a generation bug). Time: ~30 seconds — the fastest, but also the weakest response.
📝 Test 2 — Text Generation: RAG Explanation for Business
The picture changed here — Gemma 4 showed itself significantly better than its competitors.
Prompt:
Explain what RAG (Retrieval-Augmented Generation) is in simple terms for business. No technical terms. 3-4 paragraphs.
The constraints "3-4 paragraphs" and "no technical terms" were specifically to check if the model follows instructions.
Gemma 4 — Result:
It broke the paragraph limit, but in a way that improved the answer. Instead of 3-4 paragraphs, it created a structured article with subheadings, an analogy ("a student with all the books in the world vs. an assistant with your company's handbook"), and a comparison table "LLM without RAG vs. with RAG". This is exactly what businesses need — I know this from my own experience with AskYourDocs. Time: ~37 seconds thinking + ~1 minute text.
Qwen3:8b — Result:
It adhered to the constraint — exactly 3 paragraphs. Clean, concise, understandable. There's an analogy ("an additional source of knowledge"). But compared to Gemma 4 — significantly simpler, without structure and without a table. Time: ~18 seconds thinking + ~20 seconds text = 38 seconds total.
Mistral Nemo — Result:
6 paragraphs instead of 3-4 — it did not follow the constraint. The content is padded, repeating the same ideas in different words. Time: ~30 seconds, but the quality is the lowest of the three.
📊 Comparison with Qwen3:8b and Mistral Nemo: Results Table
Figures collected on a MacBook Pro M1 16 GB. Not laboratory benchmarks — my own tests.
| Model | Size | Code: Time | Code: Quality | Text: Time | Text: Quality |
|---|---|---|---|---|---|
| gemma4 | 9.6 GB | ~4 min | ⭐⭐⭐⭐⭐ | ~1.5 min | ⭐⭐⭐⭐⭐ |
| qwen3:8b | 5.2 GB | ~67 sec | ⭐⭐⭐⭐⭐ | ~38 sec | ⭐⭐⭐⭐ |
| mistral-nemo | 7.1 GB | ~30 sec | ⭐⭐ | ~30 sec | ⭐⭐⭐ |
Conclusion from the table: for code, Qwen3:8b and Gemma 4 are equal in quality, but Qwen3 is 3.5 times faster. For text, Gemma 4 is noticeably better — structure, analogies, tables. Mistral Nemo loses in both tests except for speed.
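The speedup figures are easy to check from the table; a quick calculation over my measured times (the "3.5x" in the text comes from rounding the ~4-minute figure):

```python
# Verify the speedup claims using the timings from the table above.
timings_sec = {
    "gemma4":       {"code": 4 * 60, "text": 90},   # ~4 min, ~1.5 min
    "qwen3:8b":     {"code": 67,     "text": 38},
    "mistral-nemo": {"code": 30,     "text": 30},
}

code_speedup = timings_sec["gemma4"]["code"] / timings_sec["qwen3:8b"]["code"]
text_speedup = timings_sec["gemma4"]["text"] / timings_sec["qwen3:8b"]["text"]

print(f"Qwen3:8b vs Gemma 4, code: {code_speedup:.1f}x faster")  # ~3.6x
print(f"Qwen3:8b vs Gemma 4, text: {text_speedup:.1f}x faster")  # ~2.4x
```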
🧠 Reasoning mode in practice: how much time does it take and is it worth it
Gemma 4 "thinks" before each response by default. This is its main advantage - and the main reason for its slowness.
Immediately after the first request, I noticed something unusual:
Thinking...
Thinking Process:
1. Analyze the user's input...
2. Identify the core question...
...done thinking.
This is reasoning mode — the model builds a plan for the response before generating text. In Gemma 4, it is enabled by default through the <|think|> token in the system prompt. More details on how to enable and disable it manually can be found in a separate article about reasoning mode in Gemma 4.
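If you consume the raw output programmatically rather than reading it in the terminal, you usually want to separate the reasoning trace from the final answer. A minimal sketch, assuming the "...done thinking." delimiter shown above — this is plain string handling over the terminal-style output, not an Ollama API:

```python
# Sketch: split terminal-style model output into the thinking trace and
# the final answer, assuming the "...done thinking." delimiter shown above.
DELIMITER = "...done thinking."

def split_thinking(raw: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty if no trace is present."""
    if DELIMITER in raw:
        thinking, _, answer = raw.partition(DELIMITER)
        return thinking.strip(), answer.strip()
    return "", raw.strip()

raw_output = """Thinking...
1. Analyze the user's input...
2. Identify the core question...
...done thinking.
Here is the Spring Boot endpoint you asked for."""

thinking, answer = split_thinking(raw_output)
print(answer)  # -> Here is the Spring Boot endpoint you asked for.
```

This lets you log or discard the thinking block while passing only the answer downstream.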
What this gives in practice is evident from the tests:
Code: 73 seconds of thinking → response with full structure and explanations
Text: 37 seconds of thinking → response with a structure that wasn't requested, but which actually improved the result
Is it worth it? It depends on the task. For one-off complex requests - yes, the quality is noticeably higher. For routine tasks where speed is required (autocompletion, short answers, chat) - reasoning only slows things down. In such cases, Qwen3:8b is better.
✅ Conclusion: when to choose Gemma 4 on M1, and when to stick with Qwen3
Gemma 4 does not replace all models. It occupies its niche - and in this niche, it is truly the best.
Choose Gemma 4 if:
You are writing complex text - articles, documentation, business explanations
You need maximum code quality and time is not critical
You want a model that structures the response itself without detailed instructions
You plan to use it in a RAG product - 128K context and native function calling
Stick with Qwen3:8b if:
You generate code daily and need speed
You use it for autocompletion in an IDE
Responsiveness in chat is important
On my M1 16 GB, both models are currently installed simultaneously - they take up ~15 GB together and do not conflict. I switch depending on the task.
If you want to delve deeper - read more on the topic:
In short: Gemma 4 26B MoE is advertised as "26B quality at the price of 4B". That is true for inference speed — but not for memory. You still have to load all 18 GB. On a Mac with 24 GB — swapping and 2 tokens/sec. It runs comfortably on 32+ GB. Read before downloading.
What MoE is and why 26B...
In short: Reasoning mode is Gemma 4's built-in ability to "think" before answering. Enabled by default. On an M1 16 GB it eats 20 to 73 seconds depending on the task. You cannot fully disable it through Ollama — but you can shorten it with /no_think. Read when it is worth doing and when...
In short: Gemma 4 is the new generation of open models from Google DeepMind, released April 2, 2026. Four sizes: E2B, E4B, 26B MoE, and 31B Dense. Apache 2.0 license — commercial use without restrictions. Supports images, audio, reasoning mode, and 256K context. Runs...
A developer sets up tool use and checks it on test queries — everything works.
In production, the model suddenly answers without calling the tool, confidently and coherently,
but with year-old data. No errors in the logs. Just a wrong answer.
Spoiler: the model did not "break"...
When a developer first sees an LLM "call a function", an intuitive mistake arises:
it seems the model itself executed the query against the database or API.
That is not so, and this very mistake spawns a whole class of architectural bugs.
Spoiler: the LLM only returns structured JSON with the name...