Does Gemma 4 run on a MacBook M1 with 16 GB of RAM?

Yes, Gemma 4 E4B (4.5B effective parameters) runs well on an M1 16 GB through Ollama. The model fits in memory and delivers comfortable speed without heavy swapping.

How fast is Gemma 4 E4B on an M1 16 GB in 2026?

On a MacBook with M1 16 GB, Gemma 4 E4B (Q4_K_M) produces roughly 35 to 55 tokens per second, depending on quantization and prompt length. That is faster than larger models but slightly slower than Qwen3 8B.

How does Gemma 4 compare with other models in text quality and creativity?

Gemma 4 E4B produces high-quality, natural, and coherent text. Its writing often surpasses Qwen3 and Mistral Nemo of the same size, especially in Ukrainian and English.

How does Gemma 4 perform at code generation on an M1 16 GB?

Gemma 4 E4B delivers some of the best code quality among models that run comfortably on 16 GB. It handles complete tasks, refactoring, and generating full applications well.

Is Gemma 4 worth using on 16 GB of RAM, or is another model a better choice?

If code and text quality matter most to you, Gemma 4 E4B is one of the best options on 16 GB. If you need maximum speed, look at Qwen3 8B or Llama 3.2 3B instead.

Which quantization works best for Gemma 4 on an M1 16 GB?

Q4_K_M gives the best balance of quality and speed. Q5_K_M also runs stably but is slightly slower. Q3_K_M is an option if memory is critically short.

Does Gemma 4 support multimodality (images, audio) on a local PC?

Yes, the E4B version supports text, images, and audio. On an M1 16 GB the multimodal features work, but more slowly than text-only mode.

Gemma 4 on M1 16 GB — real tests: code, text, speed

Updated: 11 April 2026

Vadim Kharovyuk

CEO & Founder of WebsCraft. 8 years in web development, focused on bringing AI into real products.

In short: I installed Gemma 4 on a MacBook Pro M1 16 GB and tested it on two real tasks: generating Spring Boot code and writing a text about RAG. I compared it with Qwen3:8b and Mistral Nemo. Result: Gemma 4 produces the best quality but is the slowest. Qwen3:8b delivers almost the same code quality in roughly a quarter of the time. Read on if you want to know whether switching is worth it.

⚠️ How I installed Gemma 4 on M1: a real error with the Ollama version

The first thing I saw was not the model, but an error. And this is the first useful fact for those who want to repeat it.

I have been using Ollama for local AI for a long time, so as soon as Gemma 4 was released I simply typed in the terminal:

ollama run gemma4

And immediately got:

Error: pull model manifest: 412:
The model you are attempting to pull requires a newer version of Ollama.
Please download the latest version at: https://ollama.com/download

The reason is simple: I had version 0.17.0 installed, and Gemma 4 requires at least 0.20. To check your version, run ollama --version. You can update either through the official download page or via Homebrew, which is what I did (official Ollama documentation):

brew upgrade ollama
brew services restart ollama

After that, version 0.20.5 was installed and the model downloaded without problems. If you installed Ollama a long time ago, check the version before trying Gemma 4. It will save you 10 minutes of hunting for the cause of the error.
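The version check can also be scripted. Below is a minimal sketch, assuming the version string contains a dotted number such as 0.20.5 somewhere in the text. The minimum of 0.20 is the figure reported in this article, and the names parse_version and meets_minimum are illustrative, not part of Ollama.

```python
import re

# Minimum version as reported in this article (an assumption, not an official constant)
MIN_OLLAMA = (0, 20, 0)


def parse_version(text):
    """Extract the first dotted version number from text such as `ollama --version` output."""
    match = re.search(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    major, minor, patch = match.groups()
    return (int(major), int(minor), int(patch or 0))


def meets_minimum(text, minimum=MIN_OLLAMA):
    """Compare versions as integer tuples, so 0.9.0 correctly sorts below 0.20.0."""
    return parse_version(text) >= minimum


print(meets_minimum("0.17.0"))  # False
print(meets_minimum("0.20.5"))  # True
```

Comparing integer tuples rather than raw strings matters here: as plain strings, "0.9.0" would sort above "0.20.0".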

Downloading the model:

ollama run gemma4

Size: 9.6 GB. On my connection the download took about 2 hours. Once downloaded, the model launches straight in the terminal: the ⠇ spinner means it is loading into memory, and after a few seconds the >>> prompt appears.

💾 Which Gemma 4 variant is suitable for M1 16 GB and why not 26B

Gemma 4 is not one model, but four. And only one of them is suitable for M1 16 GB.

A detailed overview of all variants is in the article about models for 8 GB RAM. Briefly about Gemma 4:

Model	File size	RAM (4-bit)	Suitable for M1 16 GB
gemma4:e2b	~5 GB	5 GB	✅ Yes, but low quality
gemma4 (e4b)	9.6 GB	~6 GB	✅ Yes — optimal choice
gemma4:26b	~18 GB	~18 GB	❌ No — swapping, freezing
gemma4:31b	~20 GB	~20 GB	❌ No — won't fit

A separate note on gemma4:26b: it is actively advertised online as "MoE magic: 26B quality at 8B price". This is not entirely true. The actual file size is 18 GB, and on an M1 with 16 GB of unified memory it simply won't fit without aggressive swapping. Even on a Mac mini with 24 GB, people report freezes under load and going back to e4b. More on this in a separate article about the pitfalls of Gemma 4 26B MoE.

My choice: gemma4 (e4b). It is the default tag, so ollama run gemma4 pulls it without any extra suffix.
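The reasoning behind the table can be written as a quick check. This is a rough sketch, not a measurement: the RAM figures are copied from the table above, and the 4 GB reserve for macOS and other apps is an assumed value chosen for illustration.

```python
TOTAL_RAM_GB = 16
RESERVE_GB = 4  # assumption: headroom for macOS and other apps, not a measured value

# Approximate 4-bit RAM needs in GB, copied from the table in this article
VARIANTS = {
    "gemma4:e2b": 5,
    "gemma4 (e4b)": 6,
    "gemma4:26b": 18,
    "gemma4:31b": 20,
}


def fits(ram_needed_gb, total_gb=TOTAL_RAM_GB, reserve_gb=RESERVE_GB):
    """Return True if the model leaves the assumed reserve free."""
    return ram_needed_gb <= total_gb - reserve_gb


for name, ram in VARIANTS.items():
    verdict = "fits" if fits(ram) else "does not fit"
    print(f"{name}: ~{ram} GB, {verdict} in {TOTAL_RAM_GB} GB")
```

Changing TOTAL_RAM_GB to 24 shows why the 26B variant is still tight on a 24 GB Mac mini: 18 GB leaves little room once the reserve is subtracted.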

💻 Test 1 — code generation: Spring Boot endpoint with pagination

The same prompt — three models. Let's see what came out.

The prompt I used:

Write a Spring Boot REST endpoint to get a list of users with pagination. Use JPA Repository.

I deliberately chose this task — I know Spring Boot well, so I can evaluate the quality without Googling.
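For anyone who wants to repeat the comparison, here is one possible way to send the same prompt to all three models and record timings through Ollama's local HTTP API. It is a sketch under assumptions: the article does not say how the times were measured, the server is assumed to run on the default port 11434, and the helper names (build_payload, generate, summarize, run_all) are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint
MODELS = ["gemma4", "qwen3:8b", "mistral-nemo"]  # tags as used in this article
PROMPT = (
    "Write a Spring Boot REST endpoint to get a list of users "
    "with pagination. Use JPA Repository."
)


def build_payload(model, prompt):
    """Request body for a single non-streaming generation."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model, prompt, url=OLLAMA_URL):
    """POST the prompt to a running Ollama server and return the parsed JSON reply."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def summarize(result):
    """Convert Ollama's nanosecond timings to (total seconds, tokens per second)."""
    total_s = result["total_duration"] / 1e9
    tokens_per_s = result["eval_count"] / (result["eval_duration"] / 1e9)
    return round(total_s, 1), round(tokens_per_s, 1)


def run_all():
    """Run the prompt against each model in turn; requires the models to be pulled."""
    for model in MODELS:
        total_s, tokens_per_s = summarize(generate(model, PROMPT))
        print(f"{model}: {total_s} s total, {tokens_per_s} tokens/s")
```

Calling run_all() with the Ollama server running prints one line per model. Figures obtained this way will not necessarily match times read off the terminal by hand.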

Gemma 4 — result:

Full structure: Entity → Repository → Service → Controller, plus dependencies in pom.xml and example request URLs. Correct constructor-based DI, ResponseEntity<Page<User>>, and comments for each step. This is production-ready code you can take and use. The only downside is time: it first "thought" for 73 seconds (the Thinking block), then generated the answer for about 3 minutes. Total: about 4 minutes.

Qwen3:8b — result:

The same full structure: Entity + Repository + Service + Controller. In addition, it lists dependencies for both Maven and Gradle, which Gemma 4 did not. Code quality is practically identical. Time: ~32 seconds thinking + ~35 seconds generation = 67 seconds total, about 3.5 times faster than Gemma 4.

Mistral Nemo — result:

Minimal code: only a Controller, with no separate Service layer. The same block of code appeared twice in the output (it looks like a generation bug). Time: ~30 seconds. The fastest response, but also the weakest.

📝 Test 2 — Text Generation: RAG Explanation for Business

The picture changed here — Gemma 4 showed itself significantly better than its competitors.

Prompt:

Explain what RAG (Retrieval-Augmented Generation) is in simple terms for business. No technical terms. 3-4 paragraphs.

The constraints "3-4 paragraphs" and "no technical terms" were there specifically to check whether the model follows instructions.

Gemma 4 — Result:

It broke the paragraph limit, but to good effect. Instead of 3-4 paragraphs, it produced a structured article with subheadings, an analogy ("a student with all the books in the world vs. an assistant with your company's handbook"), and a comparison table "LLM without RAG vs. with RAG". This is exactly what businesses need; I know this from my own experience with AskYourDocs. Time: ~37 seconds thinking + ~1 minute of text, about 1.5 minutes total.

Qwen3:8b — Result:

It adhered to the constraint — exactly 3 paragraphs. Clean, concise, understandable. There's an analogy ("an additional source of knowledge"). But compared to Gemma 4 — significantly simpler, without structure and without a table. Time: ~18 seconds thinking + ~20 seconds text = 38 seconds total.

Mistral Nemo — Result:

6 paragraphs instead of 3-4, so it did not adhere to the constraint. The content is padded, repeating the same ideas in different words. Time: ~30 seconds, but the quality is the lowest of the three.

📊 Comparison with Qwen3:8b and Mistral Nemo: results table

Figures collected on a MacBook Pro M1 16 GB. Not laboratory benchmarks — my own tests.

Model	Size	Code: Time	Code: Quality	Text: Time	Text: Quality
gemma4	9.6 GB	~4 min	⭐⭐⭐⭐⭐	~1.5 min	⭐⭐⭐⭐⭐
qwen3:8b	5.2 GB	~67 sec	⭐⭐⭐⭐⭐	~38 sec	⭐⭐⭐⭐
mistral-nemo	7.1 GB	~30 sec	⭐⭐	~30 sec	⭐⭐⭐

Conclusion from the table: for code, Qwen3:8b and Gemma 4 are equal in quality, but Qwen3 is about 3.5 times faster. For text, Gemma 4 is noticeably better: structure, analogies, tables. Mistral Nemo loses on quality in both tests; its only advantage is speed.
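The speed ratios follow directly from the table. As a quick worked check, using the approximate totals from the table converted to seconds (so the results are only as precise as those rounded figures):

```python
# Approximate totals in seconds, taken from the results table in this article
TIMES = {
    "gemma4": {"code": 4 * 60, "text": 90},
    "qwen3:8b": {"code": 67, "text": 38},
    "mistral-nemo": {"code": 30, "text": 30},
}


def slowdown(task, model, baseline):
    """How many times longer `model` took than `baseline` on the given task."""
    return round(TIMES[model][task] / TIMES[baseline][task], 1)


print(slowdown("code", "gemma4", "qwen3:8b"))  # 3.6
print(slowdown("text", "gemma4", "qwen3:8b"))  # 2.4
```

So the gap is larger on the code task than on the text task, which is where Gemma 4 spent the most time thinking.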

🧠 Reasoning mode in practice: how much time does it take and is it worth it

Gemma 4 "thinks" before each response by default. This is both its main advantage and the main reason for its slowness.

Immediately after the first request, I noticed something unusual:

Thinking...
Thinking Process:
1. Analyze the user's input...
2. Identify the core question...
...done thinking.

This is reasoning mode — the model builds a plan for the response before generating text. In Gemma 4, it is enabled by default through the <|think|> token in the system prompt. More details on how to enable and disable it manually can be found in a separate article about reasoning mode in Gemma 4.

What this delivers in practice is clear from the tests:

Code: 73 seconds of thinking → response with full structure and explanations
Text: 37 seconds of thinking → response with a structure that wasn't requested, but which actually improved the result

Is it worth it? It depends on the task. For one-off complex requests, yes: the quality is noticeably higher. For routine tasks where speed matters (autocompletion, short answers, chat), reasoning only slows things down. In such cases, Qwen3:8b is the better choice.
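If you capture the terminal output and only want the final answer, the thinking block can be separated by its markers. This is a minimal sketch that assumes the output uses exactly the "Thinking..." and "...done thinking." lines shown in the transcript above. The function name split_thinking is illustrative.

```python
START = "Thinking..."
END = "...done thinking."


def split_thinking(output):
    """Return (thinking, answer); thinking is empty if the markers are not found."""
    start = output.find(START)
    end = output.find(END)
    if start == -1 or end == -1 or end < start:
        return "", output.strip()
    thinking = output[start + len(START):end].strip()
    answer = output[end + len(END):].strip()
    return thinking, answer


sample = (
    "Thinking...\n"
    "Thinking Process:\n"
    "1. Analyze the user's input...\n"
    "...done thinking.\n"
    "\n"
    "RAG lets a model look things up before it answers."
)
thinking, answer = split_thinking(sample)
print(answer)  # RAG lets a model look things up before it answers.
```

Output without the markers is returned unchanged as the answer, so the same helper works for models that do not print a thinking block.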

✅ Conclusion: when to choose Gemma 4 on M1, and when to stick with Qwen3

Gemma 4 does not replace all models. It occupies its own niche, and within that niche it is truly the best.

Choose Gemma 4 if:

You are writing complex text: articles, documentation, business explanations
You need maximum code quality and time is not critical
You want a model that structures the response itself without detailed instructions
You plan to use it in a RAG product: 128K context and native function calling

Stick with Qwen3:8b if:

You generate code daily and need speed
You use it for autocompletion in an IDE
Responsiveness in chat is important

On my M1 16 GB, both models are currently installed side by side. Together they take up ~15 GB and do not conflict. I switch depending on the task.

Vadim Kharovyuk, developer, founder of WebsCraft and AskYourDocs.