Gemma 4 on M1 16 GB — real tests: code, text, speed

In short: I installed Gemma 4 on a MacBook Pro M1 with 16 GB and tested it on two real tasks: generating Spring Boot code and writing a text about RAG. I compared it with Qwen3:8b and Mistral Nemo. Result: Gemma 4 produced the best quality overall but was the slowest. Qwen3:8b delivered almost the same code quality in roughly a quarter of the time. Read on if you want to know whether it's worth switching.

⚠️ How I installed Gemma 4 on M1: a real error with the Ollama version

The first thing I saw was not the model, but an error. And this is the first useful fact for those who want to repeat it.

I have been using Ollama for local AI for a long time, so right after Gemma 4 was released I simply typed in the terminal:

ollama run gemma4

And immediately got:

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at: https://ollama.com/download

The reason is simple: I had version 0.17.0 installed, and Gemma 4 requires 0.20 or newer. To check your version, run ollama --version. You can update either through the official download page or via Homebrew, which is what I did (see the official Ollama documentation):

brew upgrade ollama
brew services restart ollama

After that, version 0.20.5 was installed, and the model downloaded without problems. If you installed Ollama a long time ago, check the version before trying Gemma 4. You'll save 10 minutes of hunting for the cause of the error.
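The version check can also be scripted. A minimal sketch, assuming `ollama --version` prints a plain x.y.z number somewhere in its output and taking 0.20.0 as the minimum (the error message itself does not state the exact threshold). `MIN_VERSION` and `version_ge` are illustrative names, not part of Ollama:

```shell
MIN_VERSION="0.20.0"  # assumption: the article only says "0.20 or newer"

# Succeeds when version $1 >= version $2 (plain numeric x.y.z only).
version_ge() {
  awk -v have="$1" -v need="$2" 'BEGIN {
    split(have, h, "."); split(need, n, ".")
    for (i = 1; i <= 3; i++) {
      if (h[i] + 0 > n[i] + 0) exit 0
      if (h[i] + 0 < n[i] + 0) exit 1
    }
    exit 0
  }'
}

# Only runs where the ollama binary is actually installed.
if command -v ollama >/dev/null 2>&1; then
  installed=$(ollama --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
  if version_ge "${installed:-0.0.0}" "$MIN_VERSION"; then
    echo "Ollama $installed is new enough for gemma4"
  else
    echo "Ollama ${installed:-unknown} is too old: upgrade before pulling gemma4"
  fi
fi
```

With the versions from this article, `version_ge 0.17.0 0.20.0` fails and `version_ge 0.20.5 0.20.0` succeeds.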

Downloading the model:

ollama run gemma4

Size: 9.6 GB. On my connection, the download took about 2 hours. Once it finished, the model launched right in the terminal: a spinner shows while it loads into memory, and after a few seconds the >>> prompt appears.

💾 Which Gemma 4 variant is suitable for M1 16 GB and why not 26B

Gemma 4 is not one model, but four. And only one of them is suitable for M1 16 GB.

A detailed overview of all variants is in the article about models for 8 GB RAM. Briefly about Gemma 4:

| Model | File size | RAM (4-bit) | Suitable for M1 16 GB |
|---|---|---|---|
| gemma4:e2b | ~5 GB | 5 GB | ✅ Yes, but low quality |
| gemma4 (e4b) | 9.6 GB | ~6 GB | ✅ Yes, optimal choice |
| gemma4:26b | ~18 GB | ~18 GB | ❌ No: swapping, freezing |
| gemma4:31b | ~20 GB | ~20 GB | ❌ No: won't fit |

A separate note on gemma4:26b: it's actively advertised online as "MoE magic: 26B quality at 8B price". This is not entirely true. The actual file size is 18 GB, so on an M1 with 16 GB of unified memory it simply won't fit without aggressive swapping. Even on a Mac mini with 24 GB, people report freezes under load and going back to e4b. More on this in a separate article about the pitfalls of Gemma 4 26B MoE.
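The fit check behind the table comes down to simple arithmetic. A rough sketch using the RAM figures listed for each variant. The 4 GB reserved for macOS and other apps is an assumption for illustration, not a measured value, and `fits_in_ram` is a hypothetical helper:

```shell
TOTAL_RAM_GB=16
HEADROOM_GB=4   # assumption: memory left for macOS and other apps

# Succeeds when model RAM ($1) plus headroom ($3) fits into total RAM ($2).
fits_in_ram() {
  awk -v m="$1" -v r="$2" -v h="$3" 'BEGIN { exit !((m + h) <= (r + 0)) }'
}

# Model tag and approximate RAM need in GB, as listed in the table.
while read -r tag ram_gb; do
  if fits_in_ram "$ram_gb" "$TOTAL_RAM_GB" "$HEADROOM_GB"; then
    echo "$tag: fits (${ram_gb} GB + ${HEADROOM_GB} GB headroom)"
  else
    echo "$tag: does not fit in ${TOTAL_RAM_GB} GB"
  fi
done <<'EOF'
gemma4:e2b 5
gemma4 6
gemma4:26b 18
gemma4:31b 20
EOF
```

Under these assumptions only the e2b and default (e4b) variants pass, which matches the table.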

My choice: gemma4 (e4b) — the default option, no need to specify anything extra.

💻 Test 1 — code generation: Spring Boot endpoint with pagination

The same prompt — three models. Let's see what came out.

The prompt I used:

Write a Spring Boot REST endpoint to get a list of users with pagination. Use JPA Repository.

I deliberately chose this task — I know Spring Boot well, so I can evaluate the quality without Googling.

Gemma 4 — result:

Full structure: Entity → Repository → Service → Controller + dependencies in pom.xml + examples of URL requests. Correct DI via constructor, ResponseEntity<Page<User>>, comments for each step. This is production-ready code that you can take and use as is. The only downside is the time. First, it "thought" for 73 seconds (Thinking block), then it generated text for ~3 minutes. About 4 minutes in total.

Qwen3:8b — result:

The same full structure: Entity + Repository + Service + Controller. Additionally, it listed dependencies for both Maven and Gradle (which Gemma didn't do). Code quality is practically identical. Time: ~32 seconds thinking + ~35 seconds generation = 67 seconds total, roughly 3.5 times faster than Gemma 4.

Mistral Nemo — result:

Minimal code — only Controller, without a separate Service layer. The same block of code was duplicated twice (looks like a generation bug). Time ~30 seconds — the fastest, but the weakest response.
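Timings like these can be cross-checked against the counters Ollama returns from its local HTTP API. A sketch, assuming the default endpoint http://localhost:11434/api/generate and its eval_count and eval_duration (nanoseconds) response fields. `tokens_per_sec`, `json_number`, and the `RUN_BENCH` switch are illustrative names, and this is not necessarily how the timings in this article were measured:

```shell
# Generation speed from Ollama's eval_count and eval_duration (ns).
tokens_per_sec() {
  awk -v c="$1" -v d="$2" 'BEGIN {
    if (d + 0 <= 0) { print "n/a"; exit }
    printf "%.1f\n", c / (d / 1e9)
  }'
}

# Pull a numeric field ($1) out of a flat JSON response ($2) without jq.
json_number() {
  printf '%s' "$2" | grep -oE "\"$1\": *[0-9]+" | grep -oE '[0-9]+$' | head -n 1
}

# Set RUN_BENCH=1 to send the request; requires a running Ollama server.
if [ "${RUN_BENCH:-0}" = "1" ]; then
  payload='{"model": "gemma4", "prompt": "Write a Spring Boot REST endpoint to get a list of users with pagination. Use JPA Repository.", "stream": false}'
  response=$(curl -s http://localhost:11434/api/generate -d "$payload")
  count=$(json_number eval_count "$response")
  duration=$(json_number eval_duration "$response")
  echo "gemma4: $(tokens_per_sec "${count:-0}" "${duration:-0}") tokens/s"
fi
```

Swapping the model tag for qwen3:8b or mistral-nemo would give comparable figures for the other two models.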

📝 Test 2 — Text Generation: RAG Explanation for Business

Here the picture changed: Gemma 4 performed significantly better than its competitors.

Prompt:

Explain what RAG (Retrieval-Augmented Generation) is in simple terms for business. No technical terms. 3-4 paragraphs.

The constraints "3-4 paragraphs" and "no technical terms" were there specifically to check whether the model follows instructions.

Gemma 4 — Result:

It broke the paragraph limit, but in a useful way. Instead of 3-4 paragraphs, it created a structured article with subheadings, an analogy ("a student with all the books in the world vs. an assistant with your company's handbook"), and a comparison table "LLM without RAG vs. with RAG". This is exactly what businesses need; I know this from my own experience with AskYourDocs. Time: ~37 seconds thinking + ~1 minute text.

Qwen3:8b — Result:

It adhered to the constraint: exactly 3 paragraphs. Clean, concise, understandable. There's an analogy ("an additional source of knowledge"). But compared to Gemma 4 it is significantly simpler, without structure and without a table. Time: ~18 seconds thinking + ~20 seconds text = 38 seconds total.

Mistral Nemo — Result:

6 paragraphs instead of 3-4, so it did not adhere to the constraint. The content is padded, repeating the same ideas in different words. Time ~30 seconds, but the quality is the lowest of the three.

📊 Comparison with Qwen3:8b and Mistral Nemo: Results Table

Figures collected on a MacBook Pro M1 16 GB. These are my own tests, not laboratory benchmarks.

| Model | Size | Code: time | Code: quality | Text: time | Text: quality |
|---|---|---|---|---|---|
| gemma4 | 9.6 GB | ~4 min | ⭐⭐⭐⭐⭐ | ~1.5 min | ⭐⭐⭐⭐⭐ |
| qwen3:8b | 5.2 GB | ~67 sec | ⭐⭐⭐⭐⭐ | ~38 sec | ⭐⭐⭐⭐ |
| mistral-nemo | 7.1 GB | ~30 sec | ⭐⭐ | ~30 sec | ⭐⭐⭐ |

Conclusion from the table: for code, Qwen3:8b and Gemma 4 are equal in quality, but Qwen3 is 3.5 times faster. For text, Gemma 4 is noticeably better — structure, analogies, tables. Mistral Nemo loses in both tests except for speed.
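The speed ratios follow directly from the table. A small sketch of the arithmetic, using the rounded times from the table converted to seconds (~4 min = 240 s, ~1.5 min = 90 s); `speedup` is an illustrative helper:

```shell
# Ratio of two durations in seconds, rounded to one decimal.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

speedup 240 67   # code: gemma4 ~4 min vs qwen3:8b ~67 s  -> 3.6
speedup 90 38    # text: gemma4 ~1.5 min vs qwen3:8b ~38 s -> 2.4
```

Given that every input is approximate, 3.6 is consistent with the "3.5 times faster" figure for code.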

🧠 Reasoning mode in practice: how much time does it take and is it worth it

Gemma 4 "thinks" before each response by default. This is its main advantage - and the main reason for its slowness.

Immediately after the first request, I noticed something unusual:

Thinking...
Thinking Process:
1. Analyze the user's input...
2. Identify the core question...
...done thinking.

This is reasoning mode — the model builds a plan for the response before generating text. In Gemma 4, it is enabled by default through the <|think|> token in the system prompt. More details on how to enable and disable it manually can be found in a separate article about reasoning mode in Gemma 4.

What this gives in practice is evident from the tests:

  • Code: 73 seconds of thinking → response with full structure and explanations
  • Text: 37 seconds of thinking → response with a structure that wasn't requested, but which actually improved the result

Is it worth it? It depends on the task. For one-off complex requests - yes, the quality is noticeably higher. For routine tasks where speed is required (autocompletion, short answers, chat) - reasoning only slows things down. In such cases, Qwen3:8b is better.
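For such routine requests, one option is to switch the thinking step off per request. A sketch, assuming Ollama's `think` request field on /api/generate (and the matching `--think=false` flag of `ollama run`) is honored by the gemma4 tag, which is not confirmed in this article. `json_escape`, `build_payload`, `RUN_REQUEST`, and the sample prompt are illustrative:

```shell
# Escape backslashes and double quotes so a one-line prompt is safe inside JSON.
json_escape() {
  printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Build a non-streaming /api/generate request: $1 model, $2 prompt, $3 true|false.
build_payload() {
  printf '{"model": "%s", "prompt": "%s", "think": %s, "stream": false}' \
    "$1" "$(json_escape "$2")" "$3"
}

# Set RUN_REQUEST=1 to send it; requires a running Ollama server.
if [ "${RUN_REQUEST:-0}" = "1" ]; then
  curl -s http://localhost:11434/api/generate \
    -d "$(build_payload gemma4 'Explain RAG in one sentence.' false)"
fi
```

Passing `true` as the third argument keeps the thinking step, so the same helper can be used to compare both modes on one prompt. Multi-line prompts would need additional escaping that this sketch leaves out.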

✅ Conclusion: when to choose Gemma 4 on M1, and when to stick with Qwen3

Gemma 4 does not replace all models. It occupies its niche - and in this niche, it is truly the best.

Choose Gemma 4 if:

  • You are writing complex text - articles, documentation, business explanations
  • You need maximum code quality and time is not critical
  • You want a model that structures the response itself without detailed instructions
  • You plan to use it in a RAG product - 128K context and native function calling

Stick with Qwen3:8b if:

  • You generate code daily and need speed
  • You use it for autocompletion in an IDE
  • Responsiveness in chat is important

On my M1 16 GB, both models are currently installed simultaneously - they take up ~15 GB together and do not conflict. I switch depending on the task.

Vadym Kharovuyuk - developer, founder of WebsCraft and AskYourDocs.
