How to run GGUF models from Hugging Face in Ollama


In the Ollama 0.30 review, I showed the basic mechanics of running GGUF in three steps and promised a separate breakdown with all the nuances. Here it is. This is a complete practical guide: where to get a GGUF file, how to correctly write a Modelfile, which commands to execute, how to check for tool calling support, and what to do when the model doesn't start.

The guide is intended for those who already have Ollama installed. If not yet, start with the guide on installing Ollama on Mac, Windows, and Linux.

In short: download a .gguf file from Hugging Face, create a text Modelfile with one line FROM ./file.gguf, execute ollama create my-model -f Modelfile — and the model is ready. Everything else in this article is about nuances that distinguish "it started" from "it works as it should."

Contents

  1. What is GGUF and why models are stored in this format
  2. What you'll need before starting
  3. Step 1: Download a GGUF file from Hugging Face
  4. Step 2: Create a Modelfile
  5. Step 3: ollama create and ollama run
  6. Verification: Does the model support tool calling
  7. Connecting to a coding agent via ollama launch
  8. Common errors and how to diagnose them
  9. From personal experience
  10. Conclusions

What is GGUF and why models are stored in this format

GGUF is a model container format: weights, tokenizer, and metadata are packed into a single file, and the weights are usually already quantized. The format was created by Georgi Gerganov, the author of llama.cpp, which is precisely why Ollama, which runs on top of llama.cpp, reads GGUF as a native format.

The practical consequence is simple: most new open-weight models first appear in GGUF format on Hugging Face, and only then (if at all) make it into the official Ollama registry. If you want to run a model that isn't available via ollama pull, or a specific quantization — a GGUF file is your way to go. How Ollama relates to llama.cpp and where GGUF fits into this stack, I discussed in the 0.30 review — I won't duplicate it here.

It's important to understand the limitation: Ollama does not run a .gguf file directly like ollama run ./file.gguf. The file must first be "registered" via a Modelfile with the ollama create command. This takes less than two minutes and is done once per model.

What you'll need before starting

Three things:

  • Ollama 0.30 or newer — this version has expanded GGUF compatibility. Check your version: ollama --version. How to update — in the 0.30 review.
  • A Hugging Face account: formally not required for public models, but convenient if you want to download via CLI, and necessary for gated models (which require accepting a license first).
  • Sufficient disk space and RAM — the size depends on the quantization (more on this below).

How to read quantization names in files

On Hugging Face, one model usually has several GGUF files with different quantizations in their names: model.Q4_K_M.gguf, model.Q5_K_M.gguf, model.Q8_0.gguf, etc. The number after Q is the approximate number of bits per weight: a smaller number means a smaller file and less RAM, but lower quality.

A short practical rule for choosing: Q4_K_M is the golden balance of size and quality for most cases. Q5_K_M or Q8_0 — if higher accuracy is needed and you have spare memory. Q2/Q3 — an extreme option for very limited hardware with a noticeable loss of quality.

A detailed breakdown of quantization levels, how much RAM each takes, and why Q4_K_M is almost always better than a larger model's Q2 is in the article Ollama on 8 GB RAM: which models work in 2026. For now, just remember the rule of thumb: RAM needed ≈ the size of the .gguf file plus 1–2 GB for the KV cache.
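
To make the naming convention and the memory rule of thumb concrete, here is a minimal Python sketch. The helper names (quant_bits, estimate_ram_gb) and the regular expression are my own assumptions for illustration, not part of Ollama or Hugging Face tooling, and the 1–2 GB margin is only the rough estimate used in this article.

```python
import re
from typing import Optional, Tuple

# Assumed naming convention: "<name>.Q<bits>_<variant>.gguf", e.g. model.Q4_K_M.gguf.
QUANT_RE = re.compile(r"[.\-_](Q(\d+)_[0-9A-Za-z_]+)\.gguf$")


def quant_bits(filename: str) -> Optional[int]:
    """Return the nominal bit count from the quantization label, or None if absent."""
    match = QUANT_RE.search(filename)
    return int(match.group(2)) if match else None


def estimate_ram_gb(
    file_size_gb: float,
    kv_cache_gb: Tuple[float, float] = (1.0, 2.0),
) -> Tuple[float, float]:
    """Rule of thumb: RAM is roughly the file size plus 1-2 GB for the KV cache."""
    low, high = kv_cache_gb
    return (file_size_gb + low, file_size_gb + high)
```

For a 4.5 GB file, estimate_ram_gb(4.5) gives a range of 5.5 to 6.5 GB. Treat it as a sanity check before downloading, not as a measurement: a larger num_ctx will push the real number up.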


Step 1: download a GGUF file from Hugging Face

On the model page on Hugging Face, open the Files and versions tab. There you will see a list of .gguf files with different quantizations. Choose the one you need (usually Q4_K_M) and download it — either with the download button next to the file, or via CLI.

Via browser: simply click the download icon next to the desired file.

Via CLI (more convenient for large files and reproducibility):

# install Hugging Face CLI (once)
pip install -U "huggingface_hub[cli]"

# download a specific GGUF file to the current directory
hf download <user>/<repo> model.Q4_K_M.gguf --local-dir .

Place the file in a fixed directory from which you will refer to it. Practical tip: if the filename contains spaces or special characters, rename it to something simple like model.gguf, otherwise you will have to escape the path in Modelfile.

Shard trap. Large models on Hugging Face are often split into several files — model-00001-of-00005.gguf, model-00002-of-00005.gguf, and so on. These are sharded GGUF files. Ollama currently does not run sharded GGUF directly — the command will fail with an error. Such files must first be merged into one using llama-gguf-split --merge from llama.cpp. More details in the error section below.
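
If you download models with a script, it can be handy to detect shards before calling ollama create. The sketch below only looks at filenames and assumes the suffix pattern shown above (-00001-of-00005.gguf); the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Assumed shard suffix, as in model-00001-of-00005.gguf.
SHARD_RE = re.compile(r"-(\d{5})-of-(\d{5})\.gguf$")


def shard_info(filename: str) -> Optional[Tuple[int, int]]:
    """Return (shard index, shard count) for a sharded file, or None for a single file."""
    match = SHARD_RE.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def missing_shards(filenames: Iterable[str]) -> List[int]:
    """List shard indices that are absent, so a merge is not started on a partial set."""
    found = set()
    total = 0
    for name in filenames:
        info = shard_info(name)
        if info is not None:
            index, count = info
            found.add(index)
            total = max(total, count)
    return [i for i in range(1, total + 1) if i not in found]
```

If shard_info returns something other than None, the file needs merging first; if missing_shards returns a non-empty list, finish the download before merging.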

Step 2: create a Modelfile

Modelfile is a text file (without an extension) that describes how to assemble the model. To run a GGUF, the minimal Modelfile consists of a single line — the FROM directive with the path to the file:

FROM ./model.Q4_K_M.gguf

If your goal is simply to run the model, this line is usually enough. Save the file as Modelfile in the same directory where the .gguf file is located.

What else can be put in Modelfile

In addition to FROM, Modelfile accepts several optional directives that significantly improve the model's behavior for a specific task. The most useful ones:

FROM ./model.Q4_K_M.gguf

# Inference parameters
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER stop "</s>"

# Default system prompt
SYSTEM """You are a technical assistant. Respond concisely and to the point."""

# Prompt template (needed for some chat models for correct responses)
TEMPLATE "[INST] {{ .Prompt }} [/INST]"

Briefly, what the directives do: PARAMETER temperature controls the "creativity" of responses, PARAMETER num_ctx sets the context window size (more means more RAM for KV cache), PARAMETER stop defines stop tokens, SYSTEM sets the system prompt, and TEMPLATE defines the format in which the prompt is presented to the model.

About TEMPLATE separately. Many chat models require a correct prompt template, otherwise they respond incorrectly or "brokenly." If the model behaves strangely after launching, the first suspicion is the template. The correct TEMPLATE is usually specified on the model's page on Hugging Face or in the model card. As a guide, you can look at the template of an official model with a similar architecture: ollama show llama3.2 --modelfile.

If a single FROM is enough for you to start, begin with it, and add parameters as needed. Don't overcomplicate the Modelfile in advance.
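
When I test several models in a row, I prefer to generate the Modelfile rather than edit it by hand. A minimal sketch, assuming only the directives discussed above (FROM, PARAMETER, SYSTEM, TEMPLATE); render_modelfile is a hypothetical helper, not an Ollama API.

```python
import json
from typing import Dict, Optional, Union

ParamValue = Union[int, float, str]


def render_modelfile(
    gguf_path: str,
    parameters: Optional[Dict[str, ParamValue]] = None,
    system: Optional[str] = None,
    template: Optional[str] = None,
) -> str:
    """Build Modelfile text; only FROM is mandatory, everything else is optional."""
    lines = [f"FROM {gguf_path}"]
    for key, value in (parameters or {}).items():
        # String values (such as stop tokens) are quoted; numbers are written as-is.
        rendered = json.dumps(value) if isinstance(value, str) else value
        lines.append(f"PARAMETER {key} {rendered}")
    if system:
        lines.append(f'SYSTEM """{system}"""')
    if template:
        lines.append(f'TEMPLATE """{template}"""')
    return "\n".join(lines) + "\n"
```

Calling render_modelfile("./model.Q4_K_M.gguf") returns just the single FROM line, which matches the advice above: start minimal and add parameters only when you need them. Write the result to a file named Modelfile next to the .gguf.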

Step 3: ollama create and ollama run

From the directory where Modelfile is located, execute the model creation command:

ollama create my-model -f Modelfile

Ollama will read the Modelfile, register the GGUF as a local model named my-model, and add it to its registry. This usually takes 10-60 seconds depending on the file size. To check if the model has appeared:

ollama list

Now the model can be run like any other:

ollama run my-model "Hello! What can you do?"

The most common stumbling block is argument order. The documented form is ollama create my-model -f Modelfile: model name first, then the -f flag with the path to the Modelfile. Many summaries write ollama create -f Modelfile my-model instead; whether that variant is accepted depends on how your Ollama version parses flags, so stick to the documented order. If the Modelfile is in the current directory and is named exactly Modelfile, you can even omit the -f flag: ollama create my-model will pick it up automatically.
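
If you call ollama create from a script, building the argument list in one place keeps the order consistent. A small sketch, assuming the ollama binary is on PATH; build_create_command and create_model are hypothetical names.

```python
import subprocess
from typing import List


def build_create_command(model_name: str, modelfile: str = "Modelfile") -> List[str]:
    """Return the create command in the documented order: name first, then -f."""
    return ["ollama", "create", model_name, "-f", modelfile]


def create_model(model_name: str, modelfile: str = "Modelfile") -> None:
    """Run ollama create; raises CalledProcessError if the command exits with an error."""
    subprocess.run(build_create_command(model_name, modelfile), check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the Modelfile path contains spaces.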

After create, the model is available both via CLI and via the local REST API on port 11434 — immediately, without additional steps. How to integrate it into your application in Java, Python, or JavaScript is in the article Ollama REST API: Integration into your application.
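
As a quick smoke test of the REST API, something like the following should be enough. It is a minimal sketch using only the Python standard library and assumes a default local install listening on port 11434 and a model named my-model, as in the examples above.

```python
import json
import urllib.request

# Assumes a default local Ollama install; adjust the host or port if yours differs.
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Prepare a non-streaming POST request for the generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text from the JSON reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read().decode("utf-8"))["response"]
```

Calling generate("my-model", "Hello! What can you do?") should return the same kind of answer as ollama run. With "stream": False the server sends one JSON object instead of a stream of chunks, which keeps the example short.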

Check: does the model support tool calling

If you plan to use the model in an agent, it's not enough for it to just run. It needs to support tool calling natively. Check this using ollama show, looking at the Capabilities section:

ollama show my-model

Capabilities
  completion
  tools           ← present — the model supports tool calling

If tools is not in the Capabilities section, the model will not call tools natively. Ollama will try a fallback via prompt, but it's unreliable: the model might return JSON in text instead of structured tool_calls, or ignore the tool altogether. Such a model is not suitable for an agent pipeline.

An important nuance: tool calling support depends not on the GGUF format, but on what the model was trained on. The same .gguf file either has this capability "baked in" to its weights, or it doesn't — Modelfile won't add anything here.
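
For an automated check, the Capabilities section can be read from the text that ollama show prints. The sketch below assumes the layout shown above (a Capabilities heading followed by indented entries); that layout is an assumption about a human-readable output that may change between versions, and the function names are hypothetical.

```python
import subprocess
from typing import List


def parse_capabilities(show_output: str) -> List[str]:
    """Collect the indented entries that follow the Capabilities heading."""
    capabilities: List[str] = []
    heading_indent = None
    for line in show_output.splitlines():
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        if heading_indent is None:
            if stripped == "Capabilities":
                heading_indent = indent
            continue
        # The section ends at a blank line or at the next heading.
        if not stripped or indent <= heading_indent:
            break
        capabilities.append(stripped.split()[0])
    return capabilities


def supports_tools(model_name: str) -> bool:
    """Run ollama show and report whether tools is listed under Capabilities."""
    result = subprocess.run(
        ["ollama", "show", model_name], capture_output=True, text=True, check=True
    )
    return "tools" in parse_capabilities(result.stdout)
```

I would use supports_tools("my-model") as a gate before wiring a freshly created model into an agent, so that a model without tools fails fast instead of silently answering in plain text.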

How to choose a model that reliably calls tools (with a comparison of reliability, speed, and multi-tool) is in the article Which Ollama model to choose for an agent with tool calling. And how the call itself is structured at the API level and how it differs from simple function calling is in the article Tool use vs function calling: mechanics, JSON Schema, and connection to RAG.

Connecting to a coding agent via ollama launch

When a GGUF is running and supports tool calling, it can be connected to a coding agent with a single command. For example, for Claude Code:

ollama launch claude

The command will interactively guide you through model selection and launch the integration without manual configuration editing. Officially supported are Claude Code, OpenCode, Codex, and Droid. More details about ollama launch, its syntax, and common inaccuracies in summaries are in the 0.30 review. Here, I'm just recording the chain: launched GGUF → checked tools → connected to agent.

Common errors and how to diagnose them

Most problems with running GGUF boil down to a few scenarios. Below is a diagnosis for each.

Error 1. Sharded GGUF: model split into multiple files

How it looks: Instead of a single file in the repository, you see model-00001-of-00005.gguf, model-00002-of-00005.gguf, etc. Trying to run it results in an error about unsupported sharded GGUF.

Reason: Ollama currently cannot load sharded GGUF files directly.

Solution: Merge the shards into a single file using the llama-gguf-split tool from llama.cpp:

llama-gguf-split --merge model-00001-of-00005.gguf model-merged.gguf

Then, in your Modelfile, specify the merged file: FROM ./model-merged.gguf. Alternatively, look for an un-split version with a lower quantization in the same repository (they are often located nearby).

Error 2. Out of memory: model doesn't fit in RAM

How it looks: create completes, but run either crashes or the model takes tens of seconds to load and outputs 1–2 tokens per second with constant pauses.

Reason: The chosen quantization is too large for your hardware. A successful create does not guarantee that the model will run comfortably.

Solution: Use a smaller quantization (Q4_K_M instead of Q8_0) or a smaller model. Check the actual resource usage after run:

ollama ps
# Look at the PROCESSOR column:
# 100% GPU: good, the whole model fits in GPU memory
# a CPU/GPU split: part of the model is offloaded to the CPU, so you need a smaller quantization

For guidance on memory and choosing models for weaker hardware, refer to the article Ollama on 8 GB RAM.
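
If you want to check this from a script, the PROCESSOR cell is easy to interpret. The sketch assumes the cell looks like "100% GPU" or like a split such as "48%/52% CPU/GPU"; the helper names are hypothetical, and extracting the cell from the ollama ps table is left out.

```python
from typing import Dict


def processor_split(cell: str) -> Dict[str, int]:
    """Turn a PROCESSOR cell into a mapping such as {"CPU": 48, "GPU": 52}."""
    percents, labels = cell.split()
    values = [int(part.rstrip("%")) for part in percents.split("/")]
    names = labels.split("/")
    return dict(zip(names, values))


def fully_on_gpu(cell: str) -> bool:
    """True only when the whole model is reported as loaded on the GPU."""
    return processor_split(cell).get("GPU", 0) == 100
```

Any result other than a full GPU load is a hint to try a smaller quantization or a smaller num_ctx before blaming the model itself.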

Error 3. Incorrect path in FROM

How it looks: ollama create fails with a "file not found" message.

Reason: The path in FROM does not match the actual location of the .gguf file, or the filename contains spaces/special characters.

Solution: Run create from the same directory where the file is located and use a relative path like FROM ./model.gguf. If the filename contains spaces, rename it to something simple.
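
Both causes can be checked before running create. A minimal sketch, assuming FROM points to a local file (it would wrongly flag a FROM that references a model name) and that relative paths are resolved against the directory you pass in; from_target and check_from_path are hypothetical helpers.

```python
from pathlib import Path
from typing import List, Optional


def from_target(modelfile_text: str) -> Optional[str]:
    """Return the argument of the first FROM directive, or None if there is none."""
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1].strip()
    return None


def check_from_path(modelfile_text: str, base_dir: Path) -> List[str]:
    """List likely problems with the FROM path, relative to base_dir."""
    target = from_target(modelfile_text)
    if target is None:
        return ["no FROM directive found"]
    problems = []
    if not (base_dir / target).is_file():
        problems.append(f"file not found: {base_dir / target}")
    if " " in target:
        problems.append("path contains spaces; consider renaming the file")
    return problems
```

An empty list means the path looks fine; anything else tells you which of the two causes to fix first.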

Error 4. Model responds "broken" or ignores format

How it looks: The model starts, but responses are cut off, contain strange tokens, or don't maintain the chat format.

Reason: Incorrect or missing TEMPLATE in the Modelfile. Chat models are sensitive to the prompt template.

Solution: Find the correct template on the model's card on Hugging Face and add the TEMPLATE directive. As a reference, look at the Modelfile of an official model with a similar architecture: ollama show llama3.2 --modelfile.

Error 5. Model works, but tool_calls is empty

How it looks: In an agent, the model responds with text instead of calling a tool; tool_calls in the response is empty.

Reason: The model does not natively support tool calling (this is not a GGUF or Modelfile issue).

Solution: Check ollama show my-model — does it list tools under Capabilities? If not, replace the model with one that supports it. More details can be found in the article on choosing a model for an agent with tool calling.

Quick diagnosis table

| Symptom | First check | Most likely cause |
|---|---|---|
| Error about sharded/split GGUF | How many files in the repository | Model is split into shards; merge using llama-gguf-split |
| run crashes or 1–2 tok/s | ollama ps → PROCESSOR | Quantization too large; use a smaller one |
| File not found | Path in FROM and filename | Incorrect path or spaces in the name |
| "Broken" responses | Is TEMPLATE present in Modelfile | Incorrect prompt template |
| tool_calls is empty | ollama show → Capabilities | Model does not natively support tools |

From personal experience

On my MacBook Pro M1 with 16 GB RAM, I regularly test new models from Hugging Face before using them for AskYourDocs. The scenario is almost always the same: I find a new model or an interesting quantization not yet in the official Ollama registry, download the Q4_K_M file, write a Modelfile with a single FROM, and check via ollama show if it has tools — because without them, the model is useless for my agent pipeline.

Two things I've stumbled upon and now check first. The first is shards: large models are often split, and before I got used to it, I wasted time wondering "why isn't it running," until I realized Ollama simply wouldn't take them without merging. The second is the template: once, the model started but responded in fragments, and I thought the file was corrupted — but the problem was a missing TEMPLATE. So, advice from experience: if a model behaves strangely, don't rush to re-download the file — first check the template and ollama ps.

Conclusions

  • Basic workflow — download the .gguf, write a Modelfile with a single FROM, execute ollama create my-model -f Modelfile.
  • Use the documented argument order in create: model name first, then -f Modelfile.
  • Ollama does not directly accept sharded GGUF files; merge them first using llama-gguf-split --merge.
  • Choose quantization based on your hardware: Q4_K_M is a safe starting point for most.
  • For agents, always check tools in ollama show — running doesn't guarantee tool calling support.
  • Strange model behavior — first suspect TEMPLATE, not a corrupted file.

What's new in version 0.30 itself and why the integration with llama.cpp opened up the entire GGUF ecosystem — in the Ollama 0.30 review. And if your next step is to choose a model specifically for an agent with tool calling, move on to the comparison of models by reliability.
