Ollama 0.30 has been released with support for GGUF models from Hugging Face, faster inference on NVIDIA, and Vulkan enabled by default. The update is interesting less for any individual number than for how closely Ollama is converging with llama.cpp, because that determines which models you'll be able to run tomorrow.
Below is a breakdown without the marketing: what has really changed, who it matters to, and which pitfalls the press release leaves out. If you're not yet familiar with Ollama, start with an introductory article on what Ollama is and why it's needed.
In short: 0.30 brings three things. The main one is deeper integration with llama.cpp, which opens up the entire GGUF ecosystem; the other two are Vulkan by default and noticeable acceleration on NVIDIA. For most users, the most useful part is the ability to run any GGUF model from Hugging Face with one or two commands.
Contents
- What's new in Ollama 0.30 — in brief
- Deeper integration with llama.cpp: why it's the main thing
- GGUF support from Hugging Face
- NVIDIA acceleration: no marketing hype
- Vulkan by default — with nuances
- Tool calling and coding agents: ollama launch
- Which models are now easier to run
- Should you update to 0.30
- FAQ
- Conclusions
What's new in Ollama 0.30 — in brief
Ollama 0.30 is not a single new feature but a package of changes built around one decision: tighter integration with llama.cpp, alongside the MLX engine on Apple Silicon. According to Ollama's official blog, the release improves performance and GGUF model compatibility through llama.cpp, complements the MLX engine on Apple Silicon, and expands hardware support.
Key changes in a list:
- GGUF support from Hugging Face: you can run any GGUF model from Hugging Face or your own fine-tuned models through a simple Modelfile. Extended compatibility means more model families work "out of the box".
- NVIDIA acceleration: up to 20% faster thanks to optimizations from the NVIDIA and llama.cpp teams.
- Vulkan by default: wider support for AMD and Intel GPUs when the corresponding backend is installed.
- Tool calling carries over to coding agents: if a model supports tool calling, it can be connected to Claude Code, Codex, or OpenCode via ollama launch.
Next — each point in detail, with an emphasis on what it means in practice, not in the press release.
Deeper integration with llama.cpp: why it's the main thing
llama.cpp is a low-level engine for LLM inference, written in C/C++, which underlies a huge part of the local AI ecosystem. Most new open-weight models appear first in GGUF format for llama.cpp — and only then make their way into other tools.
GGUF (GPT-Generated Unified Format) is a way to package a model into a single file: weights, tokenizer, and metadata together, usually in already quantized form. llama.cpp stores models in this format, and Ollama now reads it directly. In simple terms: GGUF is the model "container", and llama.cpp is the engine that runs it.
Hugging Face ← where GGUF model files are located
↓ you download .gguf
GGUF ← format: weights + tokenizer + metadata in one file
↓ reads
llama.cpp ← inference engine (C/C++)
↓ wraps, adds API / CLI / model management
Ollama ← convenient layer on top of llama.cpp
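Since GGUF is a single self-describing file, a cheap sanity check before handing a download to Ollama is to confirm that it starts with the four ASCII magic bytes GGUF. This is a minimal sketch, assuming a shell where head -c and tr are available; the helper name is_gguf and the file name in the usage comment are illustrative and not part of Ollama or llama.cpp.

```shell
#!/bin/sh
# is_gguf FILE: succeed if FILE begins with the 4-byte ASCII magic "GGUF".
# Hypothetical helper for illustration; it checks only the magic bytes,
# not the version field, metadata, or tensor data that follow.
is_gguf() {
    [ -f "$1" ] || return 1
    # Read the first 4 bytes; drop NUL bytes so the shell can hold the value.
    magic=$(head -c 4 "$1" 2>/dev/null | tr -d '\0')
    [ "$magic" = "GGUF" ]
}

# Usage (the file name is an assumption):
#   is_gguf ./my-model.Q4_K_M.gguf && echo "looks like GGUF" || echo "not a GGUF file"
```

A failed check usually means the download was interrupted or that you saved an HTML error page instead of the model.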
Ollama has always used llama.cpp as a backend, but in 0.30, this integration has become tighter. The practical consequence is simple: the closer Ollama is to llama.cpp, the faster new models become available in Ollama — without waiting for the team to write separate architecture support.
The logic is as follows: models are released for llama.cpp first. Tighter integration means the gap between "model appeared" and "model works in Ollama" is reduced.
There's also a downside worth being honest about: Ollama uses a vendored (built-in) version of llama.cpp, which doesn't always keep up with the latest commits. Historically, this has created a performance gap, for example on AMD via Vulkan, where some llama.cpp optimizations reached Ollama with a delay. So "integration with llama.cpp" doesn't mean "all the latest optimizations instantly"; it means "a significantly smaller gap than before".
GGUF support from Hugging Face
This is perhaps the most useful change for daily work. Previously, to run a model not in the official Ollama registry, you had to find workarounds. Now, you can take any GGUF file from Hugging Face and run it directly.
The process involves three steps. First, you download the GGUF file from Hugging Face. Then, you create a Modelfile — a text file with a single FROM directive that points to the path of the downloaded file:
FROM ./my-model.Q4_K_M.gguf
And finally, you create and run the model:
ollama create my-model -f Modelfile
ollama run my-model
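Pay attention to the order of arguments: the documented form is ollama create my-model -f Modelfile, with the model name first and the -f flag after it. Many update summaries print the arguments in a different order, such as ollama create -f Modelfile my-model; sticking to the documented form avoids surprises.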
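If you test many files, the Modelfile step is easy to script. Here is a minimal sketch, assuming the one-line FROM form shown above is all you need; write_modelfile is a hypothetical helper, and the file and model names in the usage comment are placeholders.

```shell
#!/bin/sh
# write_modelfile GGUF_PATH [OUT]: write a one-line Modelfile whose FROM
# directive points at GGUF_PATH. OUT defaults to ./Modelfile.
# Hypothetical helper for illustration, not an Ollama command.
write_modelfile() {
    gguf_path=$1
    out=${2:-Modelfile}
    # Refuse to write a Modelfile that points at a file that does not exist.
    [ -f "$gguf_path" ] || { echo "no such file: $gguf_path" >&2; return 1; }
    printf 'FROM %s\n' "$gguf_path" > "$out"
}

# Usage (file and model names are assumptions):
#   write_modelfile ./my-model.Q4_K_M.gguf
#   ollama create my-model -f Modelfile
#   ollama run my-model
```

Checking that the file exists first turns a typo in the path into an immediate, readable error instead of a failure later in ollama create.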
This is the basic approach. In practice, nuances arise: which quantization to choose, how to check if a file supports tool calling, what to do when a model doesn't load. All of this, with step-by-step examples, is in a separate guide: How to run GGUF models from Hugging Face in Ollama.
What this offers in practice: access to thousands of fine-tuned community models, the ability to test experimental quantizations, and running your own fine-tuned models without converting them to Ollama format.
NVIDIA Acceleration: No Marketing Hype
The official claim is up to a 20% performance increase on NVIDIA due to optimizations from NVIDIA and llama.cpp teams. The figure is realistic, but it's important to understand the context before expecting your inference to become five times faster.
A few honest clarifications:
- "Up to 20%" is the upper limit on a specific configuration, not a guaranteed increase everywhere. The official benchmark was done on Gemma 4 26B with Q4_K_M quantization on an NVIDIA RTX 5090 — a top-tier card. Your figures will depend on the model, context size, and current driver version.
- The increase is most noticeable on newer cards, where the optimizations can take advantage of modern CUDA features. On older GPUs, the difference might be smaller.
- In everyday work, 20% means, for example, 60 tok/s instead of 50 — nice, but not revolutionary. If your bottleneck isn't the GPU but the model size or swapping, you won't feel this acceleration.
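To make the "60 instead of 50" figure concrete, the arithmetic can be worked through in a few lines. This is only an illustration using the example numbers from the list above and integer math; the 600-token answer length is an assumption picked so the division comes out even.

```shell
#!/bin/sh
# boosted_rate BASE PERCENT: tokens/s after a PERCENT speedup (integer math).
boosted_rate() {
    echo $(( $1 * (100 + $2) / 100 ))
}

# seconds_for TOKENS RATE: whole seconds needed to generate TOKENS at RATE tokens/s.
seconds_for() {
    echo $(( $1 / $2 ))
}

base=50                            # tok/s before the update (example figure from the text)
fast=$(boosted_rate "$base" 20)    # best case: the full 20% speedup
echo "rate: $base -> $fast tok/s"
echo "600-token answer: $(seconds_for 600 "$base") s -> $(seconds_for 600 "$fast") s"
```

With these inputs the script prints 50 -> 60 tok/s and 12 s -> 10 s: two seconds saved on a long answer, which is pleasant but does not change what the hardware can run.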
Personally, I believe Ollama 0.30 is worth installing at least for better GGUF model compatibility and general platform improvements. If you work on NVIDIA, the additional performance boost will be a pleasant bonus. At the same time, don't expect the update itself to solve the problem of slow inference on weak hardware — your equipment's specifications still play a key role.
Vulkan by Default — With Nuances
This is where the most confusion lies, so I'll break it down in detail. The history of Vulkan in Ollama has changed from version to version, and many online guides describe an outdated state.
How it was: Vulkan appeared in version 0.12.11 (November 2025) as opt-in — it had to be enabled manually via the OLLAMA_VULKAN=1 variable. This provided an alternative to CUDA (NVIDIA) and ROCm (AMD), especially useful for older AMD cards without ROCm support and for Intel GPUs.
How it is now: according to Ollama's official hardware documentation, Vulkan is now enabled by default when the corresponding backend is installed. On Windows, most vendors' drivers come with Vulkan support and don't require additional configuration.
Therefore, both outdated guides ("Vulkan needs to be enabled manually") and overly optimistic summaries ("Vulkan works out of the box everywhere") are inaccurate. The truth is in the middle: Vulkan is on by default when a backend is present. On Windows that usually means no extra steps; on Linux with AMD there can still be nuances.
What the press release doesn't mention: the path was bumpy. There was a bug where Vulkan remained enabled even when trying to disable it via OLLAMA_VULKAN=0 — and on weak integrated GPUs, this made Ollama *slower* than CPU-only mode. The team later added separate iGPU control (OLLAMA_IGPU_ENABLE) and disabled integrated graphics by default precisely because of these issues.
From personal experience: after updating, don't stop at checking that the model runs. If you have a weak iGPU or an AMD card on Linux, make sure inference is actually running on the GPU. A model that runs is not necessarily configured optimally, and the performance difference between GPU and CPU can be very significant.
ollama ps
# Look at the PROCESSOR column:
# 100% GPU: inference runs entirely on the graphics card
# 100% CPU: inference runs entirely on the processor
# a CPU/GPU split (for example 48%/52% CPU/GPU): the model does not fit in VRAM and is partly offloaded to system memory
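If you check this often, the reading of the PROCESSOR column can be automated. Below is a rough sketch that classifies each loaded model; it assumes the column contains values such as "100% GPU", "100% CPU", or a split like "48%/52% CPU/GPU", and that the first line is a header. The function name classify_ps is hypothetical, and the exact column layout may differ between Ollama versions.

```shell
#!/bin/sh
# classify_ps: read `ollama ps` output on stdin and print, for each loaded
# model, its name and one of: gpu, cpu, split, unknown.
# Assumption: the first line is the column header and the PROCESSOR column
# holds "100% GPU", "100% CPU", or a split such as "48%/52% CPU/GPU".
classify_ps() {
    awk 'NR > 1 && NF {
        if ($0 ~ /100% GPU/)      verdict = "gpu"
        else if ($0 ~ /100% CPU/) verdict = "cpu"
        else if ($0 ~ /CPU\/GPU/) verdict = "split"
        else                      verdict = "unknown"
        print $1, verdict
    }'
}

# Usage:
#   ollama ps | classify_ps
```

Anything other than gpu for a model you expected to fit in video memory is a reason to look at the quantization, the context size, or the backend in use.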
If the default Vulkan backend on your iGPU turns out slower than the CPU, integrated graphics can be disabled via an environment variable before starting the server:
OLLAMA_IGPU_ENABLE=0 ollama serve
After that, check ollama ps again to make sure the inference is proceeding as you expect. More details on choosing models for weak hardware can be found in the article Ollama on 8 GB RAM: which models work in 2026.
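To compare the two configurations by numbers rather than by feel, you can measure generation speed before and after changing the variable. The sketch below extracts the tokens-per-second figure from the statistics that ollama run prints with the --verbose flag; it assumes those statistics include a line beginning with "eval rate:", and the helper name eval_rate, the model name, and the prompt are placeholders.

```shell
#!/bin/sh
# eval_rate: read the statistics printed by `ollama run --verbose` on stdin
# and print the generation speed in tokens/s.
# Assumption: the statistics contain a line like
#   eval rate:            57.92 tokens/s
# The "prompt eval rate:" line is deliberately not matched.
eval_rate() {
    awk '/^eval rate:/ { print $3 }'
}

# Usage (model name and prompt are assumptions; 2>&1 is there in case the
# statistics are written to stderr):
#   ollama run my-model "Explain GGUF in one paragraph" --verbose 2>&1 | eval_rate
```

Run the same prompt a few times in each configuration, since the first run also includes model loading and single measurements are noisy.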
Tool Calling and Coding Agents: ollama launch
If a GGUF model supports tool calling, this capability carries over to Ollama, and such a model can be connected to your favorite coding agent with a single command via ollama launch.
ollama launch is a command that appeared earlier (January 2026); it configures and launches coding tools without manual editing of configs and environment variables. Four integrations are officially supported: Claude Code, OpenCode, Codex, and Droid. Which local GGUF to connect to an agent depends on how reliably the model calls tools; a comparison of reliability is available in a separate article.
Example for Claude Code:
ollama launch claude
The command will interactively guide you through model selection and launch the integration. Note: there is no separate --model flag in the documentation; the model is chosen during the process. If you see syntax like ollama launch claude --model my-model or made-up integrations like "hermes" in update summaries, treat it as inaccurate and refer to the official page for the command.
To check if a specific GGUF file supports tool calling, look for the presence of the tools capability in the output of ollama show:
ollama show my-model
Capabilities
completion
tools ← present — the model supports tool calling
If tools is not in the Capabilities section — the model will not call tools natively, and it's not suitable for an agent. How tool calling is implemented at the API level and how it differs from simple function calling is explained in the article Tool use vs function calling: mechanics, JSON Schema, and connection to RAG.
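When you are sorting through several candidate models, the same check can be scripted. This is a minimal sketch that looks for tools inside the Capabilities section of ollama show output; it assumes the section starts with a line containing only the word Capabilities and ends at the next blank line, which may not hold for every Ollama version. The helper name has_tools and the model name are illustrative.

```shell
#!/bin/sh
# has_tools: read `ollama show MODEL` output on stdin and succeed only if
# "tools" is listed inside the Capabilities section.
# Assumption: the section begins with a line holding only "Capabilities"
# and ends at the next blank line.
has_tools() {
    awk '
        NF == 1 && $1 == "Capabilities" { in_caps = 1; next }
        in_caps && NF == 0              { in_caps = 0 }
        in_caps && $1 == "tools"        { found = 1 }
        END { exit !found }
    '
}

# Usage (the model name is an assumption):
#   ollama show my-model | has_tools && echo "tool calling supported"
```

Restricting the match to the Capabilities section avoids false positives when the word tools appears elsewhere, for example in a system prompt or template.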
It's worth mentioning separately: along with 0.30, the Codex App for Ollama was released — a desktop application where you can use any Ollama model (local or cloud) for coding, with a built-in browser and code review mode.
Which models are now easier to run
Extended GGUF compatibility means more model families work "out of the box". The list includes both large well-known families and fine-tuned community models:
- Qwen, Gemma, DeepSeek — core workhorses of local AI, now with broader quantization compatibility.
- gpt-oss — open models from OpenAI.
- Fine-tuned community models — any GGUF from Hugging Face, including custom fine-tunes.
But "easier to run" doesn't mean "runs equally reliably in an agent": extended compatibility concerns *running* the model, not the quality of its tool calling. The fact that a model loads and responds doesn't guarantee it will reliably call tools under load — that's a separate issue, solved by choosing the right model (see the section above on tool calling and coding agents).
Should you update to 0.30
The short answer is yes: for most users, the update is safe and beneficial. But whether it's "worth it" depends on what exactly you're doing.
How to update with one command (Linux): curl -fsSL https://ollama.com/install.sh | sh — this will overwrite the existing version with the latest. On macOS/Windows, updates come automatically via the menu ("Restart to update"). Models are saved, no reinstallation needed. Full breakdown in the FAQ below.
Definitely update if you:
- run GGUF models from Hugging Face or your own fine-tuned ones — this is the main reason;
- work with coding agents (Claude Code, Codex, OpenCode) via local models;
- have an NVIDIA card and are hitting generation speed limits;
- have an AMD or Intel GPU and want GPU acceleration without manually installing vendor libraries.
You can take your time if you:
- only work with official models from the Ollama registry and are happy with them;
- have a weak iGPU: first check whether the default Vulkan backend slows down your work;
- have a production pipeline on an older version — test on dev first, as tool calling and model behavior can change between versions.
From personal experience
On my MacBook Pro M1 16GB, the main scenario is local development of agent pipelines for AskYourDocs with qwen3:8b and nomic-embed-text in parallel. For this scenario on Apple Silicon, the main value of 0.30 is not Vulkan (it's for Windows/Linux GPUs) nor NVIDIA acceleration, but rather simplified access to GGUF from Hugging Face: testing new quantizations and fine-tuned models has become noticeably more convenient. If your work, like mine, revolves around testing different models for specific tasks — this is the change worth updating for.
FAQ
How to update Ollama to 0.30?
The method depends on the operating system:
- macOS and Windows: Ollama updates automatically. When an update is available, click the icon in the menu (tray) and select "Restart to update". Or download the latest version manually from the official website.
- Linux: there's no auto-update, so update via the terminal by re-running the official install script, which overwrites the existing version with the latest:
curl -fsSL https://ollama.com/install.sh | sh
- Homebrew (macOS): if you installed via Homebrew, run:
brew upgrade ollama
Existing models don't need to be reinstalled — they are stored in ~/.ollama/models and the binary update doesn't delete them. To check the version after updating: ollama --version.
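In a setup script you may want to verify the version automatically rather than by eye. Here is a small sketch of a dotted-version comparison in plain shell; version_at_least is a hypothetical helper, and the usage comment assumes the version number is the last word that ollama --version prints.

```shell
#!/bin/sh
# version_at_least HAVE WANT: succeed if dotted version HAVE >= WANT.
# Compares major, minor, and patch numerically; missing parts count as 0
# and any suffix such as "-rc1" is ignored. Hypothetical helper.
version_at_least() {
    have="$1.0.0"    # pad so that fields 1..3 always exist
    want="$2.0.0"
    for i in 1 2 3; do
        h=$(printf '%s\n' "$have" | cut -d. -f"$i")
        w=$(printf '%s\n' "$want" | cut -d. -f"$i")
        h=${h%%[!0-9]*}; w=${w%%[!0-9]*}   # strip non-numeric suffixes
        h=${h:-0}; w=${w:-0}
        if [ "$h" -gt "$w" ]; then return 0; fi
        if [ "$h" -lt "$w" ]; then return 1; fi
    done
    return 0
}

# Usage (assumes the version is the last word of the output):
#   installed=$(ollama --version | awk '{ print $NF }')
#   version_at_least "$installed" 0.30 && echo "0.30 or newer"
```

A numeric comparison matters here because a plain string comparison would rank 0.9 above 0.30.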
Will 0.30 break my existing models?
No, downloaded models will continue to work. However, if you have a production pipeline with tool calling, test it on dev before updating, as model behavior and tool serialization may differ between versions.
Does Vulkan in 0.30 really work out of the box?
On Windows with most vendor drivers — yes, without additional steps. On Linux/AMD there might be nuances (requires a ROCm v7-compatible driver). On weak iGPUs, check ollama ps after launch — default Vulkan can sometimes slow down performance compared to CPU.
Can I run any GGUF model from Hugging Face?
Yes — this is a key feature of 0.30. Download the GGUF file, create a Modelfile with FROM pointing to the file path, and run it via ollama create. A step-by-step guide is in a separate article.
How realistic is the 20% acceleration on NVIDIA?
This is the upper limit on a specific configuration, not a guaranteed boost everywhere. Most noticeable on newer cards. In everyday work, it's a pleasant but not revolutionary increase; if the bottleneck isn't the GPU, you won't feel it.
Conclusions
My verdict: on Apple Silicon, the only reason to update is GGUF from Hugging Face. Vulkan and NVIDIA acceleration don't apply to Macs, so don't expect anything from them there.
- The main thing in 0.30 is tighter integration with llama.cpp, which opens up the entire GGUF ecosystem of Hugging Face.
- The most practical benefit is running any GGUF model with one or two commands.
- Vulkan by default is real, but with nuances: it works out of the box on Windows, and on weak iGPUs you should check that it doesn't slow things down.
- NVIDIA's "up to 20%" is an upper limit, not a guarantee; it is most noticeable on newer cards.
- Update if you work with GGUF, coding agents, or are hitting speed limits on NVIDIA. For production, test on dev first.
If you want to try the main feature right away — go to the practical guide How to run GGUF models from Hugging Face in Ollama.
Sources