Gemma 4 on M1 16 GB — real tests: code, text, speed

In short: I installed Gemma 4 on a MacBook Pro M1 16 GB and tested it on two real tasks — generating Spring Boot code and writing text about RAG. I compared it with Qwen3:8b and Mistral Nemo. Result: Gemma 4 produces the best quality but is the slowest; Qwen3:8b delivers almost the same code quality in a quarter of the time. Read on if you want to know whether it's worth switching.

⚠️ How I installed Gemma 4 on M1: a real error with the Ollama version

The first thing I saw was not the model, but an error. And this is the first useful fact for those who want to repeat it.

I have been using Ollama for local AI for a long time, so the first thing I did after Gemma 4 was released was type in the terminal:

ollama run gemma4

And immediately got:

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at: https://ollama.com/download

The reason is simple: I had version 0.17.0 installed, and Gemma 4 requires at least 0.20+. Check your version with ollama --version. You can update either through the official download page or via Homebrew — which is what I did (see the official Ollama documentation):

brew upgrade ollama
brew services restart ollama

After that, version 0.20.5 was installed, and the model downloaded without problems. If you installed Ollama a long time ago — check the version before trying Gemma 4. You'll save 10 minutes of searching for the error cause.
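That pre-flight check can be scripted. A minimal sketch of the same logic, assuming `sort -V` is available (it is on Linux and modern macOS); the version numbers are the ones from my setup:

```shell
# Check whether the installed Ollama is new enough for Gemma 4.
# 0.17.0 is what I had; 0.20.0 is the minimum the error message implies.
# In real use, get the installed version from:  ollama --version
installed="0.17.0"
required="0.20.0"

# sort -V orders version strings numerically; if the installed version
# sorts first and differs from the required one, it is too old.
oldest=$(printf '%s\n%s\n' "$installed" "$required" | sort -V | head -n1)
if [ "$oldest" = "$installed" ] && [ "$installed" != "$required" ]; then
  status="upgrade needed"   # then: brew upgrade ollama && brew services restart ollama
else
  status="version ok"
fi
echo "$status"
```

With the versions above this prints "upgrade needed", which is exactly the situation the 412 error was signalling.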

Downloading the model:

ollama run gemma4

Size: 9.6 GB. On my connection, the download took about 2 hours. After downloading, the model launched straight in the terminal — a spinner indicates it is loading into memory, and after a few seconds the >>> prompt appears.
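Two hours for 9.6 GB works out to roughly 1.4 MB/s. If you want to estimate the pull time for your own connection before committing, a quick sketch (the size and speed here are just my numbers — substitute your own):

```shell
# Estimate how long an 'ollama pull' will take.
# size_gb is the model size; speed_mbs is your measured download speed
# in MB/s (mine averaged ~1.4 MB/s, hence the ~2 hour pull).
size_gb=9.6
speed_mbs=1.4
est=$(awk -v gb="$size_gb" -v s="$speed_mbs" \
  'BEGIN { printf "~%.1f hours", gb * 1024 / s / 3600 }')
echo "$est"
```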

💾 Which Gemma 4 variant is suitable for M1 16 GB and why not 26B

Gemma 4 is not one model, but four. And only one of them is suitable for M1 16 GB.

A detailed overview of all variants is in the article about models for 8 GB RAM. Briefly about Gemma 4:

| Model | File size | RAM (4-bit) | Suitable for M1 16 GB |
|---|---|---|---|
| gemma4:e2b | ~5 GB | 5 GB | ✅ Yes, but low quality |
| gemma4 (e4b) | 9.6 GB | ~6 GB | ✅ Yes — optimal choice |
| gemma4:26b | ~18 GB | ~18 GB | ❌ No — swapping, freezing |
| gemma4:31b | ~20 GB | ~20 GB | ❌ No — won't fit |

About gemma4:26b separately — it's actively advertised online as "MoE magic: 26B quality at 8B price". This is not entirely true. The actual file size is 18 GB, and on an M1 with 16 GB of unified memory, it simply won't fit without aggressive swapping. Even on a Mac mini with 24 GB, people report freezes under load and returning to e4b. More details on this — in a separate article about the pitfalls of Gemma 4 26B MoE.

My choice: gemma4 (e4b) — the default option, no need to specify anything extra.
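The rule of thumb behind the table is simple: the whole model file has to fit in unified memory alongside macOS and your running apps. A hedged sketch of that check — the 11 GB "usable" figure is my own rough allowance (16 GB total minus ~5 GB for the OS and apps), not an official number:

```shell
# Rough "will it fit" check for a model on a 16 GB unified-memory Mac.
# usable_gb is an assumption: 16 GB total minus ~5 GB for macOS and apps.
usable_gb=11
for model_gb in 5 9.6 18 20; do
  verdict=$(awk -v m="$model_gb" -v u="$usable_gb" \
    'BEGIN { if (m <= u) print "fits"; else print "too big" }')
  echo "$model_gb GB model -> $verdict"
done
```

Run against the sizes from the table, this reproduces its verdicts: the 5 GB and 9.6 GB variants fit, the 18 GB and 20 GB ones don't.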

💻 Test 1 — code generation: Spring Boot endpoint with pagination

The same prompt — three models. Let's see what came out.

The prompt I used:

Write a Spring Boot REST endpoint to get a list of users with pagination. Use JPA Repository.

I deliberately chose this task — I know Spring Boot well, so I can evaluate the quality without Googling.

Gemma 4 — result:

Full structure: Entity → Repository → Service → Controller + dependencies in pom.xml + examples of URL requests. Correct DI via constructor, ResponseEntity<Page<User>>, comments for each step. This is production-ready code that can be taken and used. The only downside is the time. First, it "thought" for 73 seconds (Thinking block), then it generated text for ~3 minutes. Total almost 4 minutes.

Qwen3:8b — result:

The same full structure: Entity + Repository + Service + Controller. Additionally — dependencies for both Maven and Gradle (which Gemma didn't do). Code quality is practically identical. Time: ~32 seconds thinking + ~35 seconds generation = 67 seconds total. 3.5 times faster.

Mistral Nemo — result:

Minimal code — only Controller, without a separate Service layer. The same block of code was duplicated twice (looks like a generation bug). Time ~30 seconds — the fastest, but the weakest response.


📝 Test 2 — Text Generation: RAG Explanation for Business

Here the picture changed — Gemma 4 came out significantly ahead of its competitors.

Prompt:

Explain what RAG (Retrieval-Augmented Generation) is in simple terms for business. No technical terms. 3-4 paragraphs.

The constraints "3-4 paragraphs" and "no technical terms" were specifically to check if the model follows instructions.

Gemma 4 — Result:

It broke the paragraph limit — but in a good way. Instead of 3-4 paragraphs, it produced a structured article with subheadings, an analogy ("a student with all the books in the world vs. an assistant with your company's handbook"), and a comparison table "LLM without RAG vs. with RAG". This is exactly what businesses need — I know this from my own experience with AskYourDocs. Time: ~37 seconds thinking + ~1 minute of text.

Qwen3:8b — Result:

It adhered to the constraint — exactly 3 paragraphs. Clean, concise, understandable. There's an analogy ("an additional source of knowledge"). But compared to Gemma 4 — significantly simpler, without structure and without a table. Time: ~18 seconds thinking + ~20 seconds text = 38 seconds total.

Mistral Nemo — Result:

6 paragraphs instead of 3-4 — it did not follow the constraint. The content is padded, repeating the same ideas in different words. Time: ~30 seconds, but the lowest quality of the three.

📊 Comparison with Qwen3:8b and Mistral Nemo: Results Table

Figures collected on a MacBook Pro M1 16 GB. Not laboratory benchmarks — my own tests.
| Model | Size | Code: Time | Code: Quality | Text: Time | Text: Quality |
|---|---|---|---|---|---|
| gemma4 | 9.6 GB | ~4 min | ⭐⭐⭐⭐⭐ | ~1.5 min | ⭐⭐⭐⭐⭐ |
| qwen3:8b | 5.2 GB | ~67 sec | ⭐⭐⭐⭐⭐ | ~38 sec | ⭐⭐⭐⭐ |
| mistral-nemo | 7.1 GB | ~30 sec | ⭐⭐ | ~30 sec | ⭐⭐⭐ |

Conclusion from the table: for code, Qwen3:8b and Gemma 4 are equal in quality, but Qwen3 is 3.5 times faster. For text, Gemma 4 is noticeably better — structure, analogies, tables. Mistral Nemo loses in both tests except for speed.

🧠 Reasoning mode in practice: how much time does it take and is it worth it

Gemma 4 "thinks" before each response by default. This is its main advantage - and the main reason for its slowness.

Immediately after the first request, I noticed something unusual:

Thinking...
Thinking Process:
1. Analyze the user's input...
2. Identify the core question...
...done thinking.

This is reasoning mode — the model builds a plan for the response before generating text. In Gemma 4, it is enabled by default through the <|think|> token in the system prompt. More details on how to enable and disable it manually can be found in a separate article about reasoning mode in Gemma 4.

What this gives in practice is evident from the tests:

  • Code: 73 seconds of thinking → response with full structure and explanations
  • Text: 37 seconds of thinking → response with a structure that wasn't requested, but which actually improved the result

Is it worth it? It depends on the task. For one-off complex requests - yes, the quality is noticeably higher. For routine tasks where speed is required (autocompletion, short answers, chat) - reasoning only slows things down. In such cases, Qwen3:8b is better.
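For those routine tasks you don't have to switch models: newer Ollama builds expose a "think" field on the HTTP API that skips the thinking phase. A hedged sketch — the field requires a recent Ollama version, and whether gemma4 honors it is an assumption on my part, not something I benchmarked:

```shell
# Sketch: build a request asking Ollama to skip the thinking phase.
# Assumes an Ollama version whose /api/generate supports the "think"
# field; gemma4 honoring it is an untested assumption.
payload='{"model":"gemma4","prompt":"Write a haiku about RAG","stream":false,"think":false}'
echo "$payload"
# To actually send it (needs a running Ollama server):
# curl -s http://localhost:11434/api/generate -d "$payload"
```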

✅ Conclusion: when to choose Gemma 4 on M1, and when to stick with Qwen3

Gemma 4 does not replace all models. It occupies its niche - and in this niche, it is truly the best.

Choose Gemma 4 if:

  • You are writing complex text - articles, documentation, business explanations
  • You need maximum code quality and time is not critical
  • You want a model that structures the response itself without detailed instructions
  • You plan to use it in an RAG product - 128K context and native function calling

Stick with Qwen3:8b if:

  • You generate code daily and need speed
  • You use it for autocompletion in an IDE
  • Responsiveness in chat is important

On my M1 16 GB, both models are currently installed simultaneously - they take up ~15 GB together and do not conflict. I switch depending on the task.

If you want to dig deeper, related reading:

  • Gemma 4 26B MoE: pitfalls and when it actually pays off
  • Reasoning mode in Gemma 4: how to enable it, when it's needed, and what it costs
  • Gemma 4: full overview — sizes, license, comparison with Gemma 3

Vadym Kharovuyuk - developer, founder of WebsCraft and AskYourDocs.
