Best Ollama Models for 8GB RAM in 2026: Honest Tests & Recommendations

Have a laptop with 8GB of RAM and want to run AI locally? This article is a breakdown: what works, what barely runs, and what's not even worth downloading. No illusions, with specific models and commands for each task. If you're not yet familiar with Ollama — start with our introductory article on what Ollama is and why developers are massively switching to local AI.

🎯 Honest Arithmetic: How Much RAM is Actually Left for the Model

Short Answer: With 8GB of RAM, about 4–5GB is realistically available for an AI model. The rest is taken by the operating system, browser, and background processes. This dictates the main rule: on 8GB, 1–3B models in 4-bit quantization run comfortably, and 7–8B models fit only when everything else is closed.

8GB RAM is not 8GB for the model. It's 8GB minus the OS, minus Chrome, minus everything you forgot to close.

Before choosing a model, you need to understand your real memory budget. Here's a typical breakdown on a system with 8GB RAM:

  • ✔️ Operating System: 1.5–2.5GB (macOS is closer to 2.5, Windows — 2, Linux — 1.5)
  • ✔️ Browser (5–10 tabs): 1–2GB
  • ✔️ IDE (VS Code / IntelliJ): 0.5–1.5GB
  • ✔️ Background Processes: 0.3–0.5GB

Remaining for the model: 3–5GB in a typical case, and as little as 1.5GB if every item above sits at its upper bound.

According to LocalLLM.in, a 7B parameter model in Q4_K_M quantization takes approximately 4–5GB, plus 1–2GB for KV cache and system overhead. This means: a 7B model on 8GB is possible, but on the edge, and it's best to close everything unnecessary.
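The arithmetic above can be checked with a few lines of Python. This is a rough sketch: the per-component ranges are the ones listed in this section, while the function names, the assumed model sizes (about 2GB for a 3B model, about 4.5GB for a 7B model), and the single 1GB figure for KV cache and overhead are illustrative assumptions.

```python
# Rough memory budget for an 8GB machine, using the ranges listed above.
TOTAL_GB = 8.0

# (low, high) usage in GB for everything that is not the model
USAGE_GB = {
    "operating system": (1.5, 2.5),
    "browser (5-10 tabs)": (1.0, 2.0),
    "IDE": (0.5, 1.5),
    "background processes": (0.3, 0.5),
}

def remaining_for_model(total_gb=TOTAL_GB, usage=USAGE_GB):
    """Return (worst_case, best_case) GB left for the model."""
    low_use = sum(low for low, _ in usage.values())
    high_use = sum(high for _, high in usage.values())
    return round(total_gb - high_use, 1), round(total_gb - low_use, 1)

def fits(weights_gb, overhead_gb, budget_gb):
    """True if model weights plus KV cache/overhead fit in the budget."""
    return weights_gb + overhead_gb <= budget_gb

worst, best = remaining_for_model()
print(f"Left for the model: {worst}-{best} GB")

# Assumed sizes: ~2 GB for a 3B Q4_K_M model, ~4.5 GB for a 7B Q4_K_M model,
# each with ~1 GB for KV cache and overhead (low end of the 1-2 GB quoted above).
for name, weights in [("3B Q4_K_M", 2.0), ("7B Q4_K_M", 4.5)]:
    print(name, "fits in best case:", fits(weights, 1.0, best))
```

With the browser and IDE running, the 7B model does not fit even in the best case. Closing them leaves only the OS and background processes, which frees roughly 5–6GB and is what makes a 7B model workable.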

Practical rule for 8GB:

  • ✔️ Comfort Zone: 1–3B parameter models (Q4_K_M) — leaves space for IDE and browser
  • ✔️ Working Zone: 7–8B parameter models (Q4_K_M) — requires closing everything else
  • ❌ Red Zone: 13B+ models (guaranteed freezes or disk swapping)

Conclusion: Before choosing a model, close your browser, run ollama ps to check that no other model is still loaded, and look at the actual free memory in your system monitor. On 8GB, every gigabyte is worth its weight in gold.

🎯 For Code: Which Model Will Replace Copilot on 8GB

Short answer: For code autocompletion on 8GB, the best choice is Qwen 2.5 Coder 3B or Phi-4 Mini (3.8B) in Q4_K_M quantization. Both models leave enough memory for VS Code and provide acceptable generation quality.

GitHub Copilot costs $10/month. A local model for code is $0/month and works offline. The only question is which model will run on your hardware.

Coding is a task where even small models can be useful. Autocompletion, function generation, code explanation, writing tests — you don't need GPT-4 for this, you need a fast and accurate model that understands syntax.

Top Models for Code on 8GB

1. Qwen 2.5 Coder 3B (Q4_K_M) — ~2.2GB RAM

According to SitePoint, Qwen leads the HumanEval benchmark among 7–8B class models. The 3B version is lightweight but retains strong specialization in code. It was trained on a large volume of programming code and technical documentation.

ollama pull qwen2.5-coder:3b
ollama run qwen2.5-coder:3b "Write a Python array sorting function"

2. Phi-4 Mini (3.8B) — ~2.3GB RAM

According to SitePoint, Phi-4 Mini is the only model that runs comfortably on systems with 8GB, delivering 15–20 tokens/sec on an M1 MacBook Air or a budget Linux laptop. It handles autocompletion, simple explanations, and light chat tasks well.

ollama pull phi4-mini
ollama run phi4-mini "Explain the difference between HashMap and TreeMap in Java"

3. DeepSeek Coder 1.3B (Q4_K_M) — ~1GB RAM

The lightest model for code. Ideal for IDE autocompletion — fast, doesn't overload the system, can be kept running in the background along with VS Code, browser, and terminal.

ollama pull deepseek-coder:1.3b
ollama run deepseek-coder:1.3b

What to Choose?

  • ✔️ Need background autocompletion + open browser → DeepSeek Coder 1.3B
  • ✔️ Need function generation and code explanation → Qwen 2.5 Coder 3B
  • ✔️ Need a universal model for code and text → Phi-4 Mini

More on setting up autocompletion — in the article Ollama + VS Code: A Free Alternative to GitHub Copilot.

Conclusion: On 8GB, you can code with local AI. Don't expect GPT-4 quality — but for daily autocompletion, boilerplate generation, and code explanations, it's sufficient.

🎯 For Text and Communication: Chat, Summaries, Translation

Short answer: For text tasks on 8GB, the optimal choice is Llama 3.2 3B for general chat, Gemma 4 E4B for a balance of quality and multimodality, or Phi-3 Mini if minimal size is required. All three leave room for other software.

Not every task requires GPT-4. Summarizing text, answering questions, retelling an article — a model that weighs less than a 4K movie can handle this.

Text tasks are the broadest category: from simple chat to document analysis and translation. On 8GB, there's a good selection here.

Top Models for Text on 8GB

1. Llama 3.2 3B (Q4_K_M) — ~2GB RAM

According to StudyHUB, Llama 3.1/3.2 is the most popular model on Ollama with over 111 million downloads. The 3B version is lightweight but retains quality in general conversations, summarization, and question answering. Supports 8 languages.

ollama pull llama3.2:3b
ollama run llama3.2:3b "Retell the main idea of this text: ..."

2. Gemma 4 E4B (Q4_K_M) — ~3GB RAM

A model from Google DeepMind, released in April 2026. Unlike the older Gemma 2B, it's a full multimodal model: accepts text and images, has a thinking mode for more complex tasks, and a 128K context window. At the same time, it comfortably fits within 8GB, leaving space for IDE and browser. If you previously used gemma:2b — E4B is a direct replacement with significantly better quality. More on model architecture and sizes — in the article Gemma 4: Full Overview — Sizes, License, Comparison with Gemma 3.

ollama pull gemma4:e4b
ollama run gemma4:e4b "Create a short description for this product: ..."

⚠️ Note: if you need the absolute minimum RAM and the old gemma:2b (~1.6GB) worked for you — it's still available. But for new installations, I recommend E4B right away. Gemma 4's thinking mode can be turned on and off — read about how it works and when to disable it in the article Reasoning Mode in Gemma 4: How to Enable, When Needed, and What It Costs.

3. Phi-3 Mini (3.8B) — ~2.3GB RAM

According to StudyHUB, Phi-3 Mini, at 2.3GB, covers 90% of daily tasks. It runs fast even on CPU and is suitable for Raspberry Pi 4/5.

ollama pull phi3:mini
ollama run phi3:mini "Translate to Ukrainian: The quick brown fox jumps over the lazy dog"

What to Choose?

  • ✔️ General chat and Q&A → Llama 3.2 3B
  • ✔️ Multimodality (text + images) and better quality → Gemma 4 E4B
  • ✔️ Minimum RAM, CPU-only, or Raspberry Pi → Phi-3 Mini

Conclusion: For text tasks, 8GB is comfortable territory. 2–4B models run fast, leave room for other programs, and provide quality sufficient for most daily needs. Gemma 4 E4B is the biggest leap in quality without increasing hardware requirements.

🎯 For Reasoning: Math, Logic, Code Debugging

For tasks requiring step-by-step thinking — mathematics, logic problems, debugging complex code — DeepSeek R1 8B in Q4 quantization works on 8GB. This is a "thinking" model: it's slower but more accurate on complex questions.

A regular model answers immediately. A reasoning model thinks first — step by step — and then answers. Like the difference between "guessing an answer" and "calculating on paper."

Reasoning models are a relatively new category. They work on the principle of chain-of-thought: breaking down a task into steps, verifying intermediate results, and only then forming the final answer.

What Works on 8GB

1. DeepSeek R1 8B (Q4_K_M) — ~5GB RAM

According to StudyHUB, DeepSeek R1 is a "thinking" model, an analog of OpenAI o1. On tasks involving math, logic puzzles, and technical reasoning, it yields better results than Llama 3.1 of the same size. The trade-off: it answers slower because it "thinks" before responding.

ollama pull deepseek-r1:8b
ollama run deepseek-r1:8b "Find the error in this SQL query: SELECT * FROM users WHERE id = '5' AND active = true GROUP HAVING count > 1"

⚠️ Important: DeepSeek R1 8B takes ~5GB RAM. On an 8GB system, this is on the edge — you need to close your browser, IDE, and everything else. On macOS with unified memory, it works more stably than on Windows with integrated graphics.

2. Qwen 3 8B (Q4_K_M) — ~5GB RAM

According to LocalLLM.in, Qwen 3 8B is a strong alternative for reasoning tasks, especially in math and multilingual scenarios. It supports Ollama's thinking mode by default.

ollama pull qwen3:8b
ollama run qwen3:8b "Solve: if 3x + 7 = 22, what is x?"

What to Choose?

  • ✔️ Code debugging and logic problems → DeepSeek R1 8B
  • ✔️ Math and multilingual reasoning → Qwen 3 8B
  • ✔️ If 8B doesn't fit — Phi-4 Mini as a compromise (smaller, but without chain-of-thought)

Conclusion: Reasoning on 8GB is possible, but it's the edge of comfort. 8B models require almost all available memory. For regular work with such tasks, consider upgrading to 16GB — the difference in capabilities will be substantial.

🎯 CPU vs GPU vs Apple Silicon — where 8GB is not 8GB

8GB on a Mac M1 and 8GB on a Windows laptop with Intel offer two different experiences. Apple Silicon uses unified memory, where all memory is accessible to both the CPU and GPU simultaneously. On a PC with a discrete graphics card, RAM and VRAM are separate pools; with integrated graphics, the GPU borrows from the same system RAM. This is critical for AI models.

An 8GB Mac M1 is a fully functional workstation for local AI. An 8GB Windows laptop with Intel HD Graphics is a struggle for every megabyte.

Apple Silicon (M1/M2/M3) — the best scenario for 8GB

On Apple Silicon, all RAM is unified memory. This means the GPU part of the chip has access to the same 8GB as the CPU. Ollama automatically uses Metal for acceleration — without any additional settings.

Result: a small model in Q4_K_M on an M1 with 8GB is fast enough for comfortable interactive use. According to SitePoint, Phi-4 Mini on an M1 MacBook Air delivers approximately 15–20 tok/s, which is sufficient for daily work. 7–8B models land lower, at around 10–15 tok/s.

Windows / Linux with a discrete GPU (RTX 3060, RTX 4060) — a good scenario

If you have a discrete graphics card with 6–8GB of VRAM — the model is fully loaded into GPU memory, and system RAM remains for the OS and software. According to LocalLLM.in, on an RTX 4060 (8GB VRAM), a 7B model delivers 40+ tokens/sec — the fastest option of all.

Windows / Linux without a GPU (Intel HD / AMD Radeon iGPU) — a difficult scenario

Without a discrete GPU, the model runs entirely on the CPU. Ollama will still launch — but the speed drops to 3–6 tokens/sec. According to LocalLLM.in, CPU-only inference is acceptable for batch tasks but frustrating for interactive use.

Plus, system RAM is shared between the OS, software, and the model — on 8GB it's very tight.

Summary table

| Platform | 7B model (Q4) | 3B model (Q4) | Speed | Comfort |
| --- | --- | --- | --- | --- |
| Mac M1/M2 8GB | ✔️ Works | ✔️ Comfortable | 15–20 tok/s | ⭐⭐⭐⭐ |
| Windows + RTX 4060 8GB VRAM | ✔️ Works fast | ✔️ Comfortable | 40+ tok/s | ⭐⭐⭐⭐⭐ |
| Windows/Linux CPU only 8GB | ⚠️ On the edge | ✔️ Works | 3–6 tok/s | ⭐⭐ |

Conclusion: If you have a Mac M1+ with 8GB — you are in the best position for local AI on budget hardware. If you have Windows without a GPU — focus on 3B models and close everything else. More details on installation on different OS — in the article How to Install Ollama on Mac, Windows, and Linux: A Complete Guide 2026.

🎯 Quantization in simple terms: Q4 vs Q8 and what to choose for weak hardware

Short answer: Quantization is model compression that reduces its size by 2–4 times with minimal quality loss. For 8GB, the optimal choice is Q4_K_M: the best balance between size, speed, and response quality.

Quantization is like JPEG for photos. The file is smaller, the difference is almost imperceptible. But if you compress too much — the quality will noticeably drop.

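When you see tags like :7b-q4_0, :8b-instruct-q8_0, or :3b-q4_k_m in an Ollama model name, this indicates the quantization level. The number after "q" is the nominal number of bits per parameter; the real average is slightly higher because scaling factors are stored alongside the weights.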

Quantization levels: what the tags mean

  • ✔️ Q8 (8-bit): maximum quality, largest size. For a 7B model — ~8GB. Won't fit in 8GB RAM.
  • ✔️ Q4_K_M (4-bit, K-quant medium): optimal balance. For 7B — ~4–5GB. Recommended for 8GB systems.
  • ✔️ Q4_K_S (4-bit, K-quant small): slightly smaller than Q4_K_M, slightly lower quality.
  • ⚠️ Q2_K (2-bit): minimum size (~2.5GB for 7B), but noticeable quality degradation. An extreme option.

The "K" in the name marks the newer quantization methods (K-quant), which distribute precision more intelligently across model layers. K-quant tags generally give better quality than legacy variants (q4_0, q4_1) at a similar size.
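The sizes quoted for each level follow from simple arithmetic: parameters times bits per weight, divided by 8. The sketch below is an estimate, not a measurement. The bits-per-weight values are rounded averages assumed for illustration (Q2_K is left out because its real average varies more), and actual files differ by model and layer mix.

```python
# Approximate bits stored per weight for common GGUF quantization levels.
# These are rounded, assumed averages; real files vary by model and layer mix.
BITS_PER_WEIGHT = {
    "Q8_0": 8.5,
    "Q4_K_M": 4.85,
    "Q4_0": 4.5,
}

def estimate_size_gb(params_billions, quant):
    """Estimate weight size in GB: parameters x bits per weight / 8."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billions * bits / 8  # billions of bytes == GB

for params in (3.8, 7.0):
    for quant in ("Q8_0", "Q4_K_M"):
        print(f"{params}B {quant}: ~{estimate_size_gb(params, quant):.1f} GB")
```

Under these assumptions a 3.8B model in Q4_K_M comes out at about 2.3GB and a 7B model in Q8 at about 7.4GB, which is why Q8 leaves no room for the OS on an 8GB machine.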

How much do models of different quantizations weigh

| Model | Q8 | Q4_K_M | Q2_K |
| --- | --- | --- | --- |
| Phi-3 Mini (3.8B) | 4.0 GB | 2.3 GB | 1.2 GB |
| Llama 3.1 (8B) | ~8 GB | ~4.5 GB | ~2.6 GB |
| Mistral 7B | ~8 GB | ~4.1 GB | ~2.8 GB |

Data from LocalAIMaster.

Rule for 8GB: always choose Q4_K_M. If it doesn't fit, reduce the model size (3B instead of 7B), not the quantization level (Q2 instead of Q4). A smaller model with Q4 will provide better quality than a larger one with Q2.

More on compression techniques and their impact on quality — in the article Model Quantization: INT4, INT8 — What It Is and How It Affects Quality.

Conclusion: Q4_K_M is the gold standard for 8GB. Don't give in to the temptation to load Q8 "for quality": a 7B model won't fit into memory, and you'll get disk swapping instead of fast responses.

🎯 Ollama Settings for Maximum Performance on Weak Hardware

Three environment variables and one habit (closing unnecessary things) — that's all you need to squeeze the most out of 8GB. The setup takes a minute, and the difference in stability is noticeable.

On powerful hardware, Ollama "just works." On weak hardware, you need to help it not waste memory on things you don't need.

By default, Ollama can keep multiple models in memory simultaneously and handle parallel requests. On 8GB, this is an unnecessary luxury. Here's the minimal set of optimizations:

Environment Variables

# Keep only one model in memory (default can be more)
export OLLAMA_MAX_LOADED_MODELS=1

# One parallel request (no memory competition)
export OLLAMA_NUM_PARALLEL=1

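# Reduce the context window (saves 200–800MB RAM)
export OLLAMA_CONTEXT_LENGTH=2048

These variables are read by the Ollama server process, so they must be set where the server starts. If you launch ollama serve from a terminal on macOS / Linux, add these lines to ~/.zshrc or ~/.bashrc. For the macOS desktop app, set them with launchctl setenv and restart the app; for a Linux systemd service, add them with systemctl edit ollama.service. On Windows, set them via system environment variables (or your PowerShell profile if you start the server from PowerShell) and restart Ollama.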
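Because the settings only take effect in the server process, one way to be sure they are applied is to start the server yourself with an explicit environment. The following is a minimal sketch, assuming Ollama is installed and on your PATH, and assuming the context-size variable is named OLLAMA_CONTEXT_LENGTH as in recent Ollama releases. The helper names build_server_env and start_server are illustrative.

```python
import os
import subprocess

# Settings from this section. OLLAMA_CONTEXT_LENGTH is assumed to be the
# context-size variable name used by recent Ollama releases.
LOW_MEMORY_ENV = {
    "OLLAMA_MAX_LOADED_MODELS": "1",
    "OLLAMA_NUM_PARALLEL": "1",
    "OLLAMA_CONTEXT_LENGTH": "2048",
}

def build_server_env(base=None):
    """Return a copy of the environment with the low-memory settings applied."""
    env = dict(os.environ if base is None else base)
    env.update(LOW_MEMORY_ENV)
    return env

def start_server():
    """Start `ollama serve` with the settings above (requires Ollama installed)."""
    return subprocess.Popen(["ollama", "serve"], env=build_server_env())
```

Calling start_server() returns a process handle; stop any already running Ollama instance first, otherwise the new server cannot bind to the default port.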

Before running a model

It sounds trivial, but on 8GB it's critical:

  • ✔️ Close your browser or leave a maximum of 2–3 tabs
  • ✔️ Close Slack, Discord, Spotify — each program consumes 200–500MB
  • ✔️ Check current usage: ollama ps will show loaded models
  • ✔️ If an old model is still in memory — ollama stop model_name

Modelfile for fine-tuning

If you want more control — create a Modelfile with optimized parameters:

FROM phi3:mini
PARAMETER num_ctx 2048
PARAMETER num_thread 4
PARAMETER temperature 0.7

num_ctx 2048 — reduces the context window (less RAM for KV cache). num_thread 4 — limits the number of CPU threads, keeping the system responsive.
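The same parameters can also be passed per request instead of being baked into a Modelfile. The sketch below assumes a local Ollama server on its default port (11434) and uses the /api/generate endpoint with an options object. The helper names build_request and generate are illustrative.

```python
import json
import urllib.request

# Default local endpoint of the Ollama server.
OLLAMA_URL = "http://localhost:11434/api/generate"

def build_request(prompt, model="phi3:mini"):
    """Build a non-streaming generate request with low-memory options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": 2048,     # smaller context window, smaller KV cache
            "num_thread": 4,     # cap CPU threads to keep the system responsive
            "temperature": 0.7,
        },
    }

def generate(prompt, model="phi3:mini", url=OLLAMA_URL):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_request(prompt, model)).encode("utf-8")
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Per-request options are handy for experiments, while a Modelfile is the better fit once you have settled on values you want every time.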

A step-by-step guide to installation and first launch — in the article How to Install Ollama on Mac, Windows, and Linux: A Complete Guide 2026. And on creating custom models via Modelfile — in the article Modelfile in Ollama: Create Your Custom AI.

Conclusion: Three environment variables + closed unnecessary programs = stable operation on 8GB. Without these settings, even a lightweight model can cause disk swapping.

🎯 What NOT to try on 8GB — my experience

Short answer: Models 13B+, 7B models in Q8 quantization, and attempts to run two models simultaneously are guaranteed disappointment on 8GB. I tested this on my Mac M1 so you don't have to.

Everyone who has worked with Ollama on 8GB has gone through the same stage: "But maybe 13B will fit after all?" No, it won't. I checked.

Working with Ollama on a Mac M1 with 8GB of unified memory, I tested dozens of models of various sizes. Here's an honest list of what doesn't work — or works so poorly it's better if it didn't.

❌ Models 13B and larger

Llama 2 13B, Qwen 14B, CodeLlama 13B: even in Q4 quantization they require roughly 8–9GB just for the model weights. Add the KV cache and the OS, and you get a system that constantly swaps to disk. I tried to run a 13B Llama model in Q4: it took 5 minutes to load, then delivered 1–2 tokens per second with constant pauses. This is unworkable for interactive use.

❌ Any 7B model in Q8 quantization

The Q8 version of a 7B model weighs around 8GB — that's all your RAM. The OS doesn't magically disappear. I tried Mistral 7B Q8 — the system froze a minute after starting. Always use Q4_K_M for 7B models on 8GB.

❌ Two models simultaneously

Ollama can keep multiple models in memory. On 16GB, this is convenient — you switch between models instantly. On 8GB, it's a recipe for a swap storm. Keep OLLAMA_MAX_LOADED_MODELS=1 and don't forget ollama stop before loading another model.

❌ Large context windows (8K+ tokens)

Every doubling of the context window means additional hundreds of megabytes for KV cache. On 8GB, keep the context at 2048–4096 tokens maximum. You won't be able to feed the model a 10-page document whole; you'll need to break it into parts.
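The growth is linear in the number of tokens, which a short estimate makes concrete. This sketch assumes the Llama 3.1 8B architecture (32 layers, 8 key/value heads, head size 128) and a 16-bit cache. Other models and cache formats give different numbers.

```python
def kv_cache_bytes(context_tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Size of the KV cache: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return context_tokens * per_token

# Assumed architecture: Llama 3.1 8B (32 layers, 8 KV heads, head size 128),
# with a 16-bit cache (2 bytes per value).
for ctx in (2048, 4096, 8192, 16384):
    mib = kv_cache_bytes(ctx, layers=32, kv_heads=8, head_dim=128) / 2**20
    print(f"{ctx:>6} tokens -> {mib:.0f} MiB")
```

Under these assumptions 2048 tokens cost 256 MiB and 8192 tokens cost 1 GiB. Older models without grouped-query attention keep more key/value heads, so their cache grows several times faster.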

❌ Mixtral 8x7B (MoE architecture)

Mixtral activates only 2 out of 8 "experts" per token, so theoretically it uses fewer computations. But all 8 experts must be in memory — and that's 26+ GB even in Q4. The name "8x7B" is misleading: it's not a 7B-sized model.
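The gap between what is computed and what must sit in memory is easy to estimate. This is a rough sketch with assumed figures: about 46.7B total parameters and about 12.9B used per token for Mixtral 8x7B, and 4.5 bits per weight as an approximation of a plain Q4 format.

```python
def weights_gb(params_billions, bits_per_weight):
    """Weight size in GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8

# Assumed figures: Mixtral 8x7B has about 46.7B parameters in total and uses
# about 12.9B per token; 4.5 bits per weight approximates a plain Q4 format.
TOTAL_B, ACTIVE_B, Q4_BITS = 46.7, 12.9, 4.5

print(f"Must be resident in memory: ~{weights_gb(TOTAL_B, Q4_BITS):.1f} GB")
print(f"Read per generated token:   ~{weights_gb(ACTIVE_B, Q4_BITS):.1f} GB")
```

The second number explains the speed advantage of MoE models, but only the first one decides whether the model fits.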

General rule: if ollama run takes longer than 30 seconds to load and the first response comes after a minute — the model is too large for your system. Don't expect it to "warm up" — close it and choose a smaller model.

Comparison of models by size, quality, and tasks — in the article Top 10 Ollama Models in 2026: Which to Choose.

Conclusion: I went through this myself — I thought a larger model would give better results, downloaded 13B, waited a minute for the first response, and deleted it. I installed 3B — and performance immediately improved. On 8GB, the better strategy is to choose a model that works quickly and stably, rather than struggling with one that "almost fits."

🎯 Benchmarks: What to Expect in Practice

Short answer: On a Mac M1 with 8GB RAM, a 3B model delivers 20–30 tokens/sec, and a 7–8B model delivers 10–15 tokens/sec. On a CPU-only Windows machine, it's about three times slower. Below is a summary table for guidance.

Benchmarks found online are often conducted on a clean system without other software. In reality, with VS Code open and 5 Chrome tabs running, the numbers will be lower, so treat the figures below as approximate guidance rather than guaranteed results.

Performance Summary Table

| Model | RAM | Mac M1 8GB | CPU-only 8GB | RTX 4060 8GB VRAM |
| --- | --- | --- | --- | --- |
| Gemma 4 E4B (Q4) | ~3GB | ~22 tok/s | ~7 tok/s | ~42 tok/s |
| Phi-3 Mini 3.8B (Q4) | ~2.3GB | ~25 tok/s | ~8 tok/s | ~45 tok/s |
| Llama 3.2 3B (Q4) | ~2GB | ~28 tok/s | ~9 tok/s | ~48 tok/s |
| Qwen 2.5 Coder 3B (Q4) | ~2.2GB | ~25 tok/s | ~8 tok/s | ~45 tok/s |
| Llama 3.1 8B (Q4) | ~4.5GB | ~12 tok/s | ~4 tok/s | ~40 tok/s |
| DeepSeek R1 8B (Q4) | ~5GB | ~10 tok/s | ~3 tok/s | ~35 tok/s |

Data is approximate, based on results from LocalLLM.in, SitePoint, and LocalAIMaster. Actual speed depends on system load, context window size, and background processes.

What do these numbers mean in practice?

  • ✔️ 15+ tok/s: comfortable interactive chat — the response appears faster than you can read it
  • ✔️ 8–15 tok/s: usable, but noticeable delays on longer responses
  • ⚠️ 3–6 tok/s: acceptable for one-off tasks (debugging, analysis), frustrating for active chat
  • ❌ <3 tok/s: model is too large for this system
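To see where your own machine lands, you can compute tokens per second from the timing fields that Ollama returns with a non-streaming /api/generate response. The sketch below works on a response dictionary and does no network access. The sample values are hypothetical, and the 6–8 tok/s gap between the bands above is folded into the lower band as an assumption.

```python
def tokens_per_second(response):
    """Compute generation speed from an Ollama /api/generate response.

    eval_count is the number of generated tokens and eval_duration is the
    generation time in nanoseconds.
    """
    seconds = response["eval_duration"] / 1e9
    return response["eval_count"] / seconds

def comfort_band(tok_s):
    """Map a speed to the bands described above."""
    if tok_s >= 15:
        return "comfortable"
    if tok_s >= 8:
        return "usable"
    if tok_s >= 3:
        return "one-off tasks only"
    return "model too large"

# Hypothetical response fields: 240 tokens generated in 20 seconds.
sample = {"eval_count": 240, "eval_duration": 20_000_000_000}
speed = tokens_per_second(sample)
print(f"{speed:.1f} tok/s -> {comfort_band(speed)}")
```

Measure with your usual applications open, since that is the situation the bands are meant to describe.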

Conclusion: For daily work on 8GB RAM, aim for 3B models — they provide 20+ tok/s on Apple Silicon and keep the system responsive. 7–8B models are for specific tasks when you're willing to close everything else and wait.

❓ Frequently Asked Questions (FAQ)

Can I run Ollama on a laptop with 8GB RAM?

Yes. Models with 1–4B parameters (Phi-3 Mini, Llama 3.2 3B, Gemma 4 E4B) run comfortably on any system with 8GB. 7–8B models run at the limit — you'll need to close unnecessary programs. More details in the Ollama installation guide.

What is the best model for 8GB RAM?

It depends on the task. For code — Qwen 2.5 Coder 3B. For text and chat — Llama 3.2 3B or Gemma 4 E4B. For reasoning and debugging — DeepSeek R1 8B (at the 8GB limit). A full model comparison is in the article Top 10 Ollama Models in 2026.

Do I need a GPU for Ollama?

No, Ollama also works on CPU. However, with a GPU (discrete or Apple Silicon), the speed is 3–10 times higher. On a CPU-only system with 8GB RAM, stick to 3B models and smaller for comfortable operation.

What's better: a 7B model in Q2 or a 3B model in Q4?

Almost always the 3B model in Q4. Aggressive quantization (Q2) significantly reduces the quality of responses, especially on complex tasks. A smaller model with normal quantization will yield better results.

Can Ollama on 8GB replace ChatGPT?

For daily tasks — summarization, simple questions, code generation — yes. For complex analysis, multimodal tasks, and working with large contexts — cloud models are still stronger. The optimal approach is hybrid: Ollama for regular tasks, ChatGPT/Claude for complex ones. More details in the article Ollama vs ChatGPT vs Claude: When Local AI is Better.

How much disk space is needed?

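One 3B model in Q4 is approximately 2GB on disk. Three models for different tasks take 6–8GB. By default, Ollama stores models under ~/.ollama/models on macOS and Linux (the location can differ for service installs and on Windows). Downloaded models can be removed with the command ollama rm model_name.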
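To check how much space your downloaded models take, you can sum the files in the models directory. This is a small sketch that assumes the default ~/.ollama/models location for a user install on macOS or Linux; adjust the path if your installation stores models elsewhere.

```python
import os

def directory_size_gb(path):
    """Sum the sizes of all regular files under path, in GB."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total / 1e9

# Assumed default location on macOS and Linux user installs.
models_dir = os.path.expanduser("~/.ollama/models")
if os.path.isdir(models_dir):
    print(f"Models on disk: {directory_size_gb(models_dir):.1f} GB")
else:
    print("No models directory found at", models_dir)
```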

Is it worth upgrading to 16GB?

If you plan to regularly work with local AI — definitely yes. 16GB opens access to 13–14B models, full 7B models in Q8 quality, and comfortable work with large context windows. The difference in capabilities between 8GB and 16GB is the largest across the entire spectrum.

✅ Conclusions

8GB of RAM is not a death sentence for local AI, but it's a limit that requires conscious decisions. Here's the main takeaway:

  • ✔️ 3–4B models — the comfort zone: Phi-3 Mini, Llama 3.2 3B, Gemma 4 E4B work quickly and stably, leaving room for your IDE and browser
  • ✔️ 7–8B models — the working zone: DeepSeek R1 8B, Qwen 3 8B work at the limit but provide noticeably better quality for specific tasks
  • ✔️ Q4_K_M, the only sensible quantization choice for 8GB: a smaller model with Q4 is almost always better than a larger one with Q2
  • ✔️ Apple Silicon with 8GB — the best budget option: unified memory provides an advantage over CPU-only systems
  • ✔️ 13B+ models, Q8, two models simultaneously — not worth it: tested, doesn't work

I personally use this exact approach: I keep several models for different tasks — one for code, another for text, a separate one for debugging. Each model has its strength, and instead of one large model that might not fit in memory, it's better to have 2–3 specialized lightweight ones. Switching between them via ollama run takes seconds.

If you're just starting, install Ollama using our guide, download phi3:mini, and give it a try. In five minutes, you'll have a working local AI: no subscriptions, no internet needed once the model is downloaded, no data transmitted externally.

And if you need a website, blog, or web application with integrated AI functionality — contact us at WebsCraft, we'll help you implement it.
