Ollama REST API: Integrate into Your Application — Java, Python, JavaScript


Ollama is not just a CLI tool for running models in the terminal. It's a full-fledged local server with a REST API that listens on port 11434 and accepts requests from any application — Spring Boot, Node.js, Python, or any language with HTTP support. This article provides a comprehensive practical breakdown: what endpoints are available, how to call them, and how to integrate Ollama into a real application.

If you haven't installed Ollama yet — start with the guide to installing on Mac, Windows, and Linux. If you want to understand which models are suitable for different tasks — read the article on choosing Ollama models in 2026.


🎯 Two API Surfaces: Native /api/* vs OpenAI-Compatible /v1/*

Short Answer: Ollama has two independent APIs. The native /api/* offers full control: streaming with metadata, model management, embeddings, process inspection. The OpenAI-compatible /v1/* is a drop-in replacement for code already working with the ChatGPT API. For new projects, choose the native API. For migrating existing code, use /v1/.

If you already have code that calls the OpenAI API — to switch to local Ollama, you only need to change one line: base_url = "http://localhost:11434/v1". The rest of the code remains unchanged.
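
A minimal sketch with the official OpenAI Python SDK (the full Python and Java examples are further down; llama3.2:3b is assumed to be pulled locally):

# Python: the only change compared to regular OpenAI code is the client constructor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # the key is ignored by Ollama
# ...the rest of the OpenAI-based code stays exactly the same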

Native API (/api/*)

According to the official Ollama documentation, after installation, the API is available at http://localhost:11434/api. The native API supports:

  • ✔️ POST /api/generate — text generation from a prompt
  • ✔️ POST /api/chat — chat with history and tool calling
  • ✔️ POST /api/embed — embedding generation
  • ✔️ GET /api/tags — list of installed models
  • ✔️ GET /api/ps — running models and VRAM usage
  • ✔️ POST /api/pull — model download
  • ✔️ DELETE /api/delete — model deletion

OpenAI-Compatible API (/v1/*)

ML Journey explains: Ollama supports an OpenAI-compatible endpoint, meaning any tool, library, or application that works with the OpenAI API can be connected to local Ollama with a single line change. This includes official OpenAI Python and JS SDKs, LangChain, LlamaIndex, Continue, and hundreds of other tools.

| Endpoint | Native (/api/*) | OpenAI-Compatible (/v1/*) |
|---|---|---|
| Chat | /api/chat | /v1/chat/completions |
| Generation | /api/generate | /v1/completions |
| Embeddings | /api/embed | /v1/embeddings |
| Model list | /api/tags | /v1/models |
| Model management | ✔️ Yes | ❌ No |
| Streaming metadata | ✔️ Full | ⚠️ Partial |
| API key | Not required | Any string (ignored) |

⚠️ What to Watch Out For — My Experience

When I integrated Ollama into WebsCraft, I hit the same three pitfalls more than once. Here's what is worth knowing upfront to save yourself debugging time.

1. Model Name in /v1/ Must Match Exactly

With the real OpenAI API, the model name is globally stable: gpt-4 always exists. In Ollama, the model must be downloaded locally, and the name must match what ollama list shows.

I received a mysterious 404 model not found several times simply because I passed "llama3" instead of "llama3.2:3b". The first rule when migrating code from OpenAI to Ollama:

# Check the exact name before writing code
ollama list

# NAME                    ID              SIZE    MODIFIED
# llama3.2:3b             ...             2.0 GB  2 days ago
# nomic-embed-text:latest ...             274 MB  5 days ago

If your tool is hardcoded to gpt-3.5-turbo or another OpenAI name — you can copy the model under the desired name:

# Creates an alias: now gpt-3.5-turbo points to llama3.2:3b
ollama cp llama3.2:3b gpt-3.5-turbo

2. Changing Context Window via /v1/ is Non-Obvious

The OpenAI API doesn't have a parameter to change the context size — it's fixed for each model. Therefore, you cannot pass num_ctx via /v1/chat/completions: the parameter is simply ignored.

I discovered this after long documents were unexpectedly truncated — the model silently dropped part of the context instead of returning an error. The solution: create a Modelfile with the desired context and use the new name:

# Create a Modelfile
FROM llama3.2:3b
PARAMETER num_ctx 16384

# Build the new model
ollama create llama3-16k -f Modelfile

# Now call via /v1/ with the new name
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3-16k",
    "messages": [{"role": "user", "content": "...long text..."}]
  }'

There's no such issue with the native /api/chat — there, num_ctx is passed directly in options:

{
  "model": "llama3.2:3b",
  "messages": [...],
  "options": {
    "num_ctx": 16384
  }
}

3. /v1/ and /api/ Responses Have Different Formats

If you switch between the native and OpenAI-compatible APIs, remember that the response formats differ. I've hit KeyError in Python several times simply because I mixed up where the text lives: response["message"]["content"] in the native API versus response.choices[0].message.content in the OpenAI-compatible one.

| Field | Native /api/chat | OpenAI-Compatible /v1/ |
|---|---|---|
| Response text | response["message"]["content"] | response.choices[0].message.content |
| Generation end | response["done"] == true | response.choices[0].finish_reason == "stop" |
| Token statistics | eval_count, eval_duration | usage.completion_tokens |
| Tool calls | message.tool_calls | choices[0].message.tool_calls |

My rule: in one project, use only one API surface. If migrating from OpenAI — stick to /v1/ and OpenAI SDK everywhere. If it's a new project — use the native API everywhere. Mixing the two approaches in one codebase guarantees confusion during debugging.

Conclusion: For a new project, the native API offers more control. For migrating existing OpenAI code — use /v1/ without code changes.

🎯 POST /api/generate: Basic Text Generation

Short Answer: /api/generate is the simplest endpoint: it takes a model and a prompt, and returns text. It does not preserve context between requests. It's suitable for one-off tasks: summarization, translation, classification.

Difference between /api/generate and /api/chat: generate takes a string prompt, chat takes an array of messages with roles. For a chatbot — always use /api/chat. For batch processing — /api/generate is more convenient.

Basic Request via curl

# stream: false — returns the entire response at once
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "prompt": "Explain what a REST API is in three sentences.",
    "stream": false
  }'

Response Format — All Fields

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-01T10:00:00Z",
  "response": "REST API is...",
  "done": true,
  "prompt_eval_count": 15,
  "prompt_eval_duration": 123456789,
  "eval_count": 42,
  "eval_duration": 987654321,
  "total_duration": 1234567890,
  "load_duration": 56789012
}

What each field means:

  • ✔️ prompt_eval_count — number of tokens in the prompt (input)
  • ✔️ eval_count — number of generated tokens (output)
  • ✔️ eval_duration — generation time in nanoseconds
  • ✔️ load_duration — model loading time (0 if already in memory)
  • ✔️ total_duration — total time from request to response

How to Calculate Tokens/Sec from Metadata

The response metadata allows logging the actual model performance. I use this in WebsCraft to monitor generation speed depending on the load:

# Python: calculating tokens/sec
import requests

r = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3.2:3b",
    "prompt": "What are microservices?",
    "stream": False
})
data = r.json()

tokens_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
total_sec = data["total_duration"] / 1e9
prompt_tokens = data["prompt_eval_count"]
output_tokens = data["eval_count"]

print(f"Speed: {tokens_per_sec:.1f} tok/s")
print(f"Tokens: {prompt_tokens} input → {output_tokens} output")
print(f"Total time: {total_sec:.2f}s")

// Java: calculating tokens/sec via WebClient
@Service
@RequiredArgsConstructor
public class OllamaGenerateService {

    private final WebClient ollamaWebClient;

    public record GenerateResult(String text, double tokensPerSec, int outputTokens) {}

    public Mono<GenerateResult> generate(String prompt) {
        var body = Map.of(
                "model", "llama3.2:3b",
                "prompt", prompt,
                "stream", false
        );

        return ollamaWebClient.post()
                .uri("/api/generate")
                .bodyValue(body)
                .retrieve()
                .bodyToMono(Map.class)
                .map(r -> {
                    var text = (String) r.get("response");
                    var evalCount = ((Number) r.get("eval_count")).intValue();
                    var evalDuration = ((Number) r.get("eval_duration")).longValue();
                    var tokPerSec = evalCount / (evalDuration / 1_000_000_000.0);
                    return new GenerateResult(text, tokPerSec, evalCount);
                });
    }
}

Main Parameters in options

{
  "model": "llama3.2:3b",
  "prompt": "Your text here",
  "stream": false,
  "system": "You are a technical editor. Respond concisely and to the point.",
  "options": {
    "temperature": 0.7,
    "num_ctx": 4096,
    "top_p": 0.9,
    "num_predict": 256
  }
}
  • ✔️ temperature — response creativity: around 0.1 is precise and deterministic, around 0.9 is more varied
  • ✔️ num_ctx — context window size (tokens)
  • ✔️ num_predict — maximum number of tokens in the response
  • ✔️ top_p — nucleus sampling, usually 0.9
  • ✔️ system — system prompt (outside options, a separate field)

🎯 POST /api/chat: Chat Format and Context Preservation

Short Answer: /api/chat is the main endpoint for chat applications. It accepts an array of messages with roles (system, user, assistant), supports tool calling and streaming. To preserve context between requests — pass the full message history.

LLMs have no memory between requests. A chatbot's "memory" is simply the array of messages you pass with each request. The longer the history, the more RAM and time for the response.

Basic Request

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [
      {
        "role": "system",
        "content": "You are a developer assistant. Respond in Ukrainian, concisely."
      },
      {
        "role": "user",
        "content": "What is dependency injection?"
      }
    ],
    "stream": false,
    "keep_alive": "10m"
  }'

keep_alive: How Long to Keep the Model in Memory

By default, Ollama unloads the model from memory 5 minutes after the last request. For a chatbot whose sessions start after a pause, this means a cold-start delay on the first request of every new session.

I encountered this in WebsCraft: the first request of a new session took 8–12 seconds instead of 1–2 — the model reloaded each time. The keep_alive parameter solves this:

# Keep the model in memory for 30 minutes
{"keep_alive": "30m"}

# Keep it permanently (until Ollama is restarted)
{"keep_alive": -1}

# Unload immediately after response (for batch tasks where RAM is critical)
{"keep_alive": 0}

# You can also set it via an environment variable (globally for all models):
# OLLAMA_KEEP_ALIVE=30m ollama serve

For a production chatbot, I use "keep_alive": "30m" — the model stays hot between sessions but unloads if there are no requests for a long time.

Preserving Context (Multi-Turn)

# Python: full multi-turn chat cycle
import requests

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "llama3.2:3b"

messages = [
    {"role": "system", "content": "You are a technical assistant. Respond concisely."}
]

def chat(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})

    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "stream": False,
        "keep_alive": "30m"
    })

    reply = r.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

# First request
print(chat("What is Spring Boot?"))
# Second request — the model "remembers" the first
print(chat("What are its main advantages?"))
# Third — continuing the context
print(chat("Show a minimal pom.xml for it"))

Trimming History: What to Do When Context Overflows

If the chat is long — the history grows and starts to occupy the entire model context. When messages exceed num_ctx, Ollama silently drops the oldest messages. To control this explicitly — implement trimming manually.

I use a simple approach: always save the system prompt, and trim user/assistant messages to the last N pairs:

def trim_history(messages: list, max_pairs: int = 10) -> list:
    """
    Saves the system prompt and the last max_pairs of user/assistant messages.
    max_pairs=10 → maximum 21 messages (1 system + 20 user/assistant)
    """
    system = [m for m in messages if m["role"] == "system"]
    dialog = [m for m in messages if m["role"] != "system"]

    # Take the last max_pairs * 2 messages (a pair = user + assistant)
    trimmed = dialog[-(max_pairs * 2):]

    return system + trimmed

# Usage in the chat loop:
def chat_with_trim(user_input: str) -> str:
    messages.append({"role": "user", "content": user_input})

    # Trim before each request
    trimmed = trim_history(messages, max_pairs=10)

    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": trimmed,
        "stream": False
    })

    reply = r.json()["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply

An alternative approach is to trim by tokens, not by the number of messages. But for most applications, limiting by the number of pairs is simpler and sufficiently predictable.
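
Below is a rough sketch of that alternative, using a ~4 characters per token heuristic as an approximation (exact counts would require the model's tokenizer):

# Python: trim history by an approximate token budget instead of message pairs
def trim_by_tokens(messages: list, max_tokens: int = 8000) -> list:
    def approx_tokens(m: dict) -> int:
        return len(m["content"]) // 4 + 4  # rough char/4 estimate plus role overhead

    system = [m for m in messages if m["role"] == "system"]
    dialog = [m for m in messages if m["role"] != "system"]

    budget = max_tokens - sum(approx_tokens(m) for m in system)
    kept = []
    # Walk from the newest message backwards and keep messages while the budget allows
    for m in reversed(dialog):
        cost = approx_tokens(m)
        if budget < cost:
            break
        budget -= cost
        kept.append(m)

    return system + list(reversed(kept))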

/api/chat Response Format

{
  "model": "llama3.2:3b",
  "created_at": "2026-05-01T10:00:00Z",
  "message": {
    "role": "assistant",
    "content": "Spring Boot is..."
  },
  "done": true,
  "eval_count": 38,
  "eval_duration": 876543210,
  "total_duration": 987654321
}

The eval_count and eval_duration fields are the same as in /api/generate, allowing you to calculate tokens/sec for monitoring.

🎯 Streaming: Why and How to Implement

Short Answer: Streaming is receiving the response token by token, not in one block. By default, Ollama streams. For UI — always enable streaming: the first token arrives in 1–3 seconds, while without streaming, the user waits for the entire response silently.

With stream: true, the first token appears on screen in 1–3 seconds. With stream: false — the entire text appears after the model finishes generation, i.e., in 5–30 seconds depending on the response length. For interactive applications — stream: true.

Real-World Case: How it Works on AskYourDocs

I implemented streaming in my service AskYourDocs — an application where users ask questions about their documents and get answers from a local RAG system based on Ollama.

Without streaming, the first versions of the service looked like this: the user clicked "Send", saw a spinner for 8–15 seconds, then the entire text appeared at once. The feeling was like the application froze. With streaming, the first words appear in 1–2 seconds, and the response "types" before your eyes. The UX difference is striking, even if the total generation time is the same.

Architecture: Ollama streams tokens → Spring Boot reads the NDJSON stream via WebFlux → passes it to the client via SSE (Server-Sent Events) → JavaScript on the frontend appends tokens to the DOM one by one.

Streaming via curl

curl http://localhost:11434/api/chat \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Tell me about microservices"}]
  }'
# stream: true — by default, can be omitted

The response comes as a stream of JSON objects, each on a separate line (NDJSON):

{"model":"llama3.2:3b","message":{"role":"assistant","content":"Micro"},"done":false}
{"model":"llama3.2:3b","message":{"role":"assistant","content":"services"},"done":false}
{"model":"llama3.2:3b","message":{"role":"assistant","content":" are"},"done":false}
...
{"model":"llama3.2:3b","message":{"role":"assistant","content":""},"done":true,"eval_count":87}

Streaming in Python

import requests, json

def stream_chat(model: str, messages: list):
    r = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": messages},
        stream=True
    )
    full_response = ""
    for line in r.iter_lines():
        if line:
            chunk = json.loads(line)
            token = chunk["message"]["content"]
            print(token, end="", flush=True)
            full_response += token
            if chunk.get("done"):
                break
    return full_response

stream_chat("llama3.2:3b", [
    {"role": "user", "content": "Explain what Docker is"}
])

Streaming in Spring Boot via SSE

This is exactly the approach I use in AskYourDocs: Spring Boot reads the NDJSON from Ollama and immediately passes tokens to the client via Server-Sent Events. The frontend receives tokens and appends them to the DOM without reloading the page.

// OllamaStreamService.java
@Service
@RequiredArgsConstructor
public class OllamaStreamService {

    private final WebClient ollamaWebClient;

    /**
     * Streams tokens from Ollama as a Flux<String>.
     * Each element is one token of the model's response.
     */
    public Flux<String> streamChat(String userMessage) {
        var body = Map.of(
                "model", "llama3.2:3b",
                "messages", List.of(
                        Map.of("role", "system",
                               "content", "Respond in Ukrainian, concisely."),
                        Map.of("role", "user", "content", userMessage)
                ),
                "stream", true,
                "keep_alive", "30m"
        );

        return ollamaWebClient.post()
                .uri("/api/chat")
                .bodyValue(body)
                .retrieve()
                .bodyToFlux(String.class)     // each NDJSON line
                .filter(line -> !line.isBlank())
                .<String>handle((line, sink) -> {
                    try {
                        var node = new ObjectMapper().readTree(line);
                        // emit only content tokens; the final "done" chunk carries no text
                        if (!node.path("done").asBoolean(false)) {
                            sink.next(node.path("message").path("content").asText(""));
                        }
                    } catch (Exception e) {
                        // ignore lines that are not valid NDJSON
                    }
                });
    }
}

// OllamaController.java — SSE endpoint for the frontend
@RestController
@RequestMapping("/api/ai")
@RequiredArgsConstructor
public class OllamaController {

    private final OllamaStreamService streamService;

    /**
     * SSE endpoint: tokens arrive one by one in the browser.
     * The frontend connects via EventSource or fetch with ReadableStream.
     */
    @GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> stream(@RequestParam String message) {
        return streamService.streamChat(message);
    }
}

Frontend: Reading SSE via fetch

// Connecting to the SSE endpoint and displaying tokens in real-time
async function streamAnswer(question, outputElement) {
  const controller = new AbortController(); // for cancelling streaming
  const url = `/api/ai/stream?message=${encodeURIComponent(question)}`;

  const res = await fetch(url, { signal: controller.signal });
  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  outputElement.textContent = ""; // clear before response

  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      // SSE lines have the format "data: token\n\n"
      const lines = decoder.decode(value).split("\n");
      for (const line of lines) {
        if (line.startsWith("data: ")) {
          const token = line.slice(6); // remove "data: "
          outputElement.textContent += token;
        }
      }
    }
  } catch (err) {
    if (err.name !== "AbortError") console.error("Stream error:", err);
  }

  return controller; // return for cancellation possibility
}

// Usage:
const stopBtn = document.getElementById("stop");
const output = document.getElementById("answer");

const controller = await streamAnswer("What is Spring Boot?", output);

// "Stop Generation" button
stopBtn.onclick = () => controller.abort();

Aborting Streaming: The "Stop" Button

In AskYourDocs, I added a "Stop" button — if the response is too long or the model goes off track. This is implemented using AbortController on the frontend (shown above) and cancelling the Flux on the backend:

// Spring Boot: automatically cancels the request to Ollama
// when the client disconnects (browser closed the SSE connection)
// WebFlux does this automatically via Flux.takeUntilOther or
// via the onCancel operator:

public Flux<String> streamChat(String userMessage) {
    return ollamaWebClient.post()
            .uri("/api/chat")
            .bodyValue(body)
            .retrieve()
            .bodyToFlux(String.class)
            .doOnCancel(() ->
                log.info("Client disconnected, streaming cancelled"))
            // ... rest of the operators
}

WebFlux automatically cancels the upstream request to Ollama when the client closes the SSE connection — the model stops generation and frees up RAM. This is important: without proper cancellation, the model continues to generate even after the user closes the tab.
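
On the client side in Python, the same early-stop effect can be achieved by simply closing the streaming connection; Ollama stops generating once the HTTP connection is gone (a minimal sketch):

# Python: abort a streaming response early by closing the connection
import requests, json

r = requests.post(
    "http://localhost:11434/api/chat",
    json={"model": "llama3.2:3b",
          "messages": [{"role": "user", "content": "Tell me about microservices"}]},
    stream=True
)

collected = ""
for line in r.iter_lines():
    if not line:
        continue
    chunk = json.loads(line)
    collected += chunk["message"]["content"]
    if len(collected) > 500:   # e.g. the user pressed "Stop"
        r.close()              # closing the connection tells Ollama to stop generating
        break
    if chunk.get("done"):
        break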


🎯 POST /api/embed: embeddings for RAG

Short answer: /api/embed generates numerical vectors (embeddings) for text. These vectors are needed for semantic search — the foundation of RAG architecture. The best local model for embeddings is nomic-embed-text.

If you don't yet understand what embeddings are — start with the article "What are Embeddings: How AI Understands Text Meaning" before moving on.

How I use /api/embed in WebsCraft

In my RAG pipeline on WebsCraft, I use nomic-embed-text via /api/embed for two tasks: indexing blog articles upon publication and searching for relevant articles when a user queries the chatbot.

Why nomic-embed-text: dimensionality of 768 is sufficient for semantic search, fast generation (~50ms per chunk), minimal RAM usage (~274 MB). During local development, I can run both the embedding model and the generative model simultaneously on a Mac M1 with 16 GB — they don't compete for memory. In production, via OpenRouter, I use openai/text-embedding-3-small, but locally for testing — always nomic-embed-text.

Installing the embedding model

ollama pull nomic-embed-text

Request via curl

curl http://localhost:11434/api/embed \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed-text",
    "input": "Spring Boot is a framework for Java applications"
  }'

Response format

{
  "model": "nomic-embed-text",
  "embeddings": [
    [0.1234, -0.5678, 0.9012, ...]
  ],
  "total_duration": 12345678,
  "load_duration": 1234567,
  "prompt_eval_count": 9
}

The embeddings field is an array of arrays (you can pass multiple texts at once). nomic-embed-text returns a vector of dimension 768.

Batch embeddings (multiple texts at once)

curl http://localhost:11434/api/embed \
  -d '{
    "model": "nomic-embed-text",
    "input": [
      "First sentence for embedding",
      "Second sentence for embedding",
      "Third sentence for embedding"
    ]
  }'

RAG function in Python

import requests
import numpy as np

def embed(texts: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    """Generates embeddings for a list of texts."""
    r = requests.post(
        "http://localhost:11434/api/embed",
        json={"model": model, "input": texts}
    )
    return r.json()["embeddings"]

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example: finding the closest document
query = "How to configure Spring Boot?"
docs = [
    "Spring Boot auto-configuration simplifies setup",
    "Python Flask is a lightweight web framework",
    "Maven is a build automation tool for Java projects"
]

q_emb = embed([query])[0]
d_embs = embed(docs)

scores = [(doc, cosine_similarity(q_emb, d_emb))
          for doc, d_emb in zip(docs, d_embs)]
scores.sort(key=lambda x: x[1], reverse=True)
print(f"Most relevant: {scores[0][0]} ({scores[0][1]:.3f})")

Embeddings in Java via WebClient

// EmbeddingService.java
@Service
@RequiredArgsConstructor
public class EmbeddingService {

    private final WebClient ollamaWebClient;
    private static final String EMBED_MODEL = "nomic-embed-text";

    /**
     * Generates an embedding for a single text.
     * Returns a vector of dimension 768 for nomic-embed-text.
     */
    public Mono<List<Double>> embed(String text) {
        return embedBatch(List.of(text))
                .map(embeddings -> embeddings.get(0));
    }

    /**
     * Batch embeddings: multiple texts in one request.
     * More efficient than multiple separate requests.
     */
    public Mono<List<List<Double>>> embedBatch(List<String> texts) {
        var body = Map.of("model", EMBED_MODEL, "input", texts);

        return ollamaWebClient.post()
                .uri("/api/embed")
                .bodyValue(body)
                .retrieve()
                .bodyToMono(EmbedResponse.class)
                .map(EmbedResponse::embeddings);
    }

    /**
     * Cosine similarity between two vectors.
     */
    public double cosineSimilarity(List<Double> a, List<Double> b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.size(); i++) {
            dot   += a.get(i) * b.get(i);
            normA += a.get(i) * a.get(i);
            normB += b.get(i) * b.get(i);
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // DTO for Ollama response
    record EmbedResponse(List<List<Double>> embeddings) {}
}

// Usage in RAG service:
@Service
@RequiredArgsConstructor
public class RagService {

    private final EmbeddingService embeddingService;

    public Mono<String> findMostRelevant(String query, List<String> docs) {
        return embeddingService.embed(query).flatMap(queryVec ->
            embeddingService.embedBatch(docs).map(docVecs -> {
                double bestScore = -1;
                String bestDoc = "";
                for (int i = 0; i < docs.size(); i++) {
                    double score = embeddingService
                            .cosineSimilarity(queryVec, docVecs.get(i));
                    if (score > bestScore) {
                        bestScore = score;
                        bestDoc = docs.get(i);
                    }
                }
                return bestDoc;
            })
        );
    }
}

In practice, instead of manual cosine similarity, it's better to use a vector database (pgvector, Chroma, Qdrant) — they index vectors and search through millions of records in milliseconds. Manual calculation is suitable for prototypes and small collections up to ~1000 documents.
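
For example, here is a rough sketch of the same search with Chroma's in-memory client, reusing the embed() helper, query, and docs from the example above (the chromadb API may differ slightly between versions):

# Python: the same nearest-document search via Chroma instead of manual cosine similarity
import chromadb

chroma = chromadb.Client()                        # in-memory instance, fine for prototyping
collection = chroma.get_or_create_collection("docs")

collection.add(
    ids=[str(i) for i in range(len(docs))],
    documents=docs,
    embeddings=embed(docs)                        # embed() defined in the RAG example above
)

result = collection.query(query_embeddings=embed([query]), n_results=1)
print(f"Most relevant: {result['documents'][0][0]}")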

More on choosing embedding models for RAG — in the article Embedding Models for RAG in 2026: How to Choose and a Comparison.

🎯 Tool Calling: connecting external functions

Short answer: Tool calling is the model's ability to "call" external functions. The model doesn't execute the function itself — it returns a JSON with the function name and arguments, and your code performs the actual call and passes the result back. Supported via /api/chat with the tools parameter.

Before reading further — I recommend reading the article "Tool Use vs Function Calling: How It Works and Its Relation to RAG" — it explains why LLMs describe functions in JSON rather than executing them, and the full call cycle with examples.

Which model supports tool calling

Not all models support tool calling. Supported models include: Llama 3.1/3.2/3.3, Qwen 2.5, Mistral 7B (v0.3+), DeepSeek R1.

ollama pull llama3.2:3b

Basic request with tools

curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [
      {"role": "user", "content": "What is the weather in Kharkiv right now?"}
    ],
    "tools": [
      {
        "type": "function",
        "function": {
          "name": "get_weather",
          "description": "Get the current weather for a city",
          "parameters": {
            "type": "object",
            "properties": {
              "city": {
                "type": "string",
                "description": "The name of the city"
              },
              "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units"
              }
            },
            "required": ["city"]
          }
        }
      }
    ],
    "stream": false
  }'

Response with tool_calls

{
  "message": {
    "role": "assistant",
    "content": "",
    "tool_calls": [
      {
        "function": {
          "name": "get_weather",
          "arguments": {
            "city": "Kharkiv",
            "units": "celsius"
          }
        }
      }
    ]
  },
  "done": true
}

Full tool calling cycle in Python

import requests, json

OLLAMA_URL = "http://localhost:11434/api/chat"
MODEL = "llama3.2:3b"

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "The name of the city"}
            },
            "required": ["city"]
        }
    }
}]

def get_weather(city: str) -> str:
    return f"In {city}: +18°C, cloudy"

def chat_with_tools(user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]

    r = requests.post(OLLAMA_URL, json={
        "model": MODEL,
        "messages": messages,
        "tools": tools,
        "stream": False
    })
    assistant_msg = r.json()["message"]
    messages.append(assistant_msg)

    if assistant_msg.get("tool_calls"):
        for tool_call in assistant_msg["tool_calls"]:
            fn_name = tool_call["function"]["name"]
            fn_args = tool_call["function"]["arguments"]
            # In the native API the arguments usually arrive as an object (dict);
            # parse only if a model returns them as a JSON string
            args_dict = json.loads(fn_args) if isinstance(fn_args, str) else fn_args

            result = get_weather(**args_dict) if fn_name == "get_weather" else "unknown tool"
            messages.append({"role": "tool", "content": result})

        r2 = requests.post(OLLAMA_URL, json={
            "model": MODEL, "messages": messages, "stream": False
        })
        return r2.json()["message"]["content"]

    # Model responded with text — tool not called
    return assistant_msg["content"]

print(chat_with_tools("What is the weather in Kharkiv right now?"))

Tool calling in Java via WebClient

// ToolCallingService.java
@Service
@RequiredArgsConstructor
public class ToolCallingService {

    private final WebClient ollamaWebClient;
    private static final String MODEL = "llama3.2:3b";
    private static final String OLLAMA_URL = "http://localhost:11434";

    // Tool description in JSON Schema format
    private static final Map<String, Object> WEATHER_TOOL = Map.of(
        "type", "function",
        "function", Map.of(
            "name", "get_weather",
            "description", "Get the current weather for a city",
            "parameters", Map.of(
                "type", "object",
                "properties", Map.of(
                    "city", Map.of(
                        "type", "string",
                        "description", "The name of the city"
                    )
                ),
                "required", List.of("city")
            )
        )
    );

    public Mono<String> chatWithTools(String userMessage) {
        // List<Object> so the raw response message map can be appended later
        var messages = new ArrayList<Object>();
        messages.add(Map.of("role", "user", "content", userMessage));

        // Step 1: initial request with tools
        return callOllama(messages, true)
            .flatMap(response -> {
                var msg = (Map<?, ?>) response.get("message");
                var toolCalls = (List<?>) msg.get("tool_calls");

                // If the model didn't call a tool — return the text
                if (toolCalls == null || toolCalls.isEmpty()) {
                    return Mono.just((String) msg.get("content"));
                }

                // Step 2: execute actual calls
                messages.add(msg);
                for (var tc : toolCalls) {
                    var fn = (Map<?, ?>) ((Map<?, ?>) tc).get("function");
                    var fnName = (String) fn.get("name");
                    // Safely cast arguments, assuming they are in a Map
                    var args = (Map<String, Object>) fn.get("arguments");
                    var result = executeFunction(fnName, args);
                    messages.add(Map.of("role", "tool", "content", result));
                }

                // Step 3: final request with the result
                return callOllama(messages, false)
                    .map(r -> (String) ((Map<?, ?>) r.get("message")).get("content"));
            });
    }

    private Mono<Map> callOllama(List<?> messages, boolean withTools) {
        var body = new HashMap<>();
        body.put("model", MODEL);
        body.put("messages", messages);
        body.put("stream", false);
        if (withTools) {
            body.put("tools", List.of(WEATHER_TOOL));
        }

        return ollamaWebClient.post()
                .uri("/api/chat")
                .bodyValue(body)
                .retrieve()
                .bodyToMono(Map.class)
                .timeout(Duration.ofSeconds(60));
    }

    // Function registry — add new tools here
    private String executeFunction(String name, Map<String, Object> args) {
        return switch (name) {
            case "get_weather" -> getWeather((String) args.get("city"));
            default -> "Unknown tool: " + name;
        };
    }

    private String getWeather(String city) {
        // Here is the actual weather API call
        return "In " + city + ": +18°C, cloudy";
    }
}

⚠️ Common mistake: model didn't call a tool

The model is not obligated to call a tool — it can respond with text even if tools are provided. This happens if:

  • ✔️ The question does not require external data according to the model
  • ✔️ The function description (description) is unclear or does not match the question
  • ✔️ The model does not support tool calling (check the list above)

Therefore, always check if tool_calls are present in the response, and handle both cases — with and without a call:

# Python: correct check
assistant = response["message"]

if assistant.get("tool_calls"):
    # Model wants to call a tool — execute it
    ...
else:
    # Model responded with text — return as is
    return assistant["content"]

If you want the model to *always* call a specific tool — use the tool_choice parameter (supported via /v1/):

curl http://localhost:11434/v1/chat/completions \
  -d '{
    "model": "llama3.2:3b",
    "messages": [...],
    "tools": [...],
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}}
  }'

More on how a model decides when to call a tool — in the article How LLMs Decide When to Call a Tool: Decision-Making Mechanics.

🎯 Java Example: WebClient + Spring Boot

Short answer: For Spring Boot, there are two approaches: direct calls via WebClient (flexible, no dependencies) or via Spring AI (more convenient, but an additional library). Below are both options with working code.

RestTemplate is in maintenance mode in Spring 6+ (RestClient and WebClient are the recommended replacements). Use WebClient for non-blocking HTTP requests to Ollama; non-blocking I/O is especially important for streaming responses in real time.

⚠️ Important: the code below is for demonstration purposes. Its goal is to show the basic mechanics of interacting with the Ollama API, not a ready-made template for production. Each project has its own architecture: different package structure, different error handling, different configuration storage methods. Adapt it to your needs.

Option 1: WebClient — no additional dependencies

Dependencies in pom.xml:

<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-webflux</artifactId>
</dependency>

Configuration in application.properties (URL is extracted from the code — don't hardcode in beans):

ollama.base-url=http://localhost:11434
ollama.model=llama3.2:3b
ollama.timeout-seconds=60

WebClient configuration:

// WebClientConfig.java
@Configuration
public class WebClientConfig {

    @Value("${ollama.base-url}")
    private String ollamaBaseUrl;

    @Bean
    public WebClient ollamaWebClient() {
        return WebClient.builder()
                .baseUrl(ollamaBaseUrl)
                .defaultHeader(HttpHeaders.CONTENT_TYPE,
                               MediaType.APPLICATION_JSON_VALUE)
                .codecs(c -> c.defaultCodecs()
                              .maxInMemorySize(10 * 1024 * 1024)) // 10MB
                .build();
    }
}

DTOs for request and response:

// OllamaChatRequest.java
public record OllamaChatRequest(
        String model,
        List<Message> messages,
        boolean stream
) {
    public record Message(String role, String content) {}
}

// OllamaChatResponse.java
public record OllamaChatResponse(
        String model,
        Message message,
        boolean done
) {
    public record Message(String role, String content) {}
}

Service with streaming support:

// OllamaService.java
@Service
@RequiredArgsConstructor
public class OllamaService {

    private final WebClient ollamaWebClient;

    @Value("${ollama.model}")
    private String defaultModel;

    @Value("${ollama.timeout-seconds:60}")
    private int timeoutSeconds;

    // Regular request (no streaming)
    public Mono<String> chat(String userMessage) {
        var request = new OllamaChatRequest(
                defaultModel,
                List.of(new OllamaChatRequest.Message("user", userMessage)),
                false
        );

        return ollamaWebClient.post()
                .uri("/api/chat")
                .bodyValue(request)
                .retrieve()
                .bodyToMono(OllamaChatResponse.class)
                .timeout(Duration.ofSeconds(timeoutSeconds))
                .map(r -> r.message().content())
                .onErrorResume(e -> Mono.just("Error: " + e.getMessage()));
    }

    // Streaming (SSE for frontend)
    public Flux<String> chatStream(String userMessage) {
        var body = Map.of(
                "model", defaultModel,
                "messages", List.of(Map.of("role", "user", "content", userMessage)),
                "stream", true
        );

        return ollamaWebClient.post()
                .uri("/api/chat")
                .bodyValue(body)
                .retrieve()
                .bodyToFlux(String.class)
                .filter(line -> !line.isBlank())
                .map(line -> {
                    try {
                        var obj = new ObjectMapper().readTree(line);
                        return obj.path("message").path("content").asText("");
                    } catch (Exception e) {
                        return "";
                    }
                })
                .filter(token -> !token.isEmpty());
    }
}

REST controller:

// OllamaController.java
@RestController
@RequestMapping("/api/ai")
@RequiredArgsConstructor
public class OllamaController {

    private final OllamaService ollamaService;

    @PostMapping("/chat")
    public Mono<Map<String, String>> chat(@RequestBody Map<String, String> req) {
        return ollamaService.chat(req.get("message"))
                .map(r -> Map.of("response", r));
    }

    @GetMapping(value = "/chat/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> chatStream(@RequestParam String message) {
        return ollamaService.chatStream(message);
    }
}

// Test:
// curl -X POST http://localhost:8080/api/ai/chat \
//   -H "Content-Type: application/json" \
//   -d '{"message": "What is Spring WebFlux?"}'

Option 2: Spring AI — minimum code

Dependencies:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-ollama</artifactId>
</dependency>

Configuration in application.properties:

spring.ai.ollama.base-url=http://localhost:11434
spring.ai.ollama.chat.options.model=llama3.2:3b
spring.ai.ollama.chat.options.temperature=0.7
spring.ai.ollama.init.pull-model-strategy=never

Service via Spring AI — regular request and streaming:

@Service
public class SpringAiOllamaService {

    private final ChatClient chatClient;

    // Spring AI auto-configures a ChatClient.Builder bean, not a ChatClient itself
    public SpringAiOllamaService(ChatClient.Builder builder) {
        this.chatClient = builder.build();
    }

    // Regular request
    public String ask(String question) {
        return chatClient.prompt()
                .user(question)
                .call()
                .content();
    }

    // Streaming via Spring AI
    public Flux<String> stream(String question) {
        return chatClient.prompt()
                .user(question)
                .stream()
                .content();
    }
}

When to choose what:

  • ✔️ WebClient — full control over the request, streaming, timeout configuration, and error handling. No additional dependencies.
  • ✔️ Spring AI — quick start and easy switching between providers (Ollama → OpenAI → Anthropic) without changing code.

🎯 Python Example

Short answer: Two approaches: the native ollama library (simpler, more features) or the openai SDK with base_url (if you already have OpenAI code).

⚠️ Important: the examples below are for demonstration purposes. They show the API interaction mechanics, not a ready-made application architecture. In a real project, add error handling, logging, configuration via environment variables, and an appropriate module structure.

Option 1: native ollama library

pip install ollama

import ollama

# Simple request
response = ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "What is a REST API?"}]
)
print(response["message"]["content"])

# Streaming
for chunk in ollama.chat(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Tell me about microservices"}],
    stream=True
):
    print(chunk["message"]["content"], end="", flush=True)

# Embeddings
emb = ollama.embed(model="nomic-embed-text", input="Hello world")
print(f"Dimension: {len(emb['embeddings'][0])}")

Option 2: OpenAI SDK (drop-in replacement)

pip install openai

from openai import OpenAI

# The only changes compared to OpenAI: base_url and api_key (ignored)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

# Then — standard OpenAI code without changes
response = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "What is Docker?"}
    ]
)
print(response.choices[0].message.content)

# Streaming via OpenAI SDK
stream = client.chat.completions.create(
    model="llama3.2:3b",
    messages=[{"role": "user", "content": "Tell me about CI/CD"}],
    stream=True
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

🎯 Example in JavaScript / Node.js

⚠️ Important: the examples below are for demonstration purposes. The goal is to show the basic mechanics of calling the Ollama API from JavaScript. In a real application, the structure will be different: separate modules, error handling, environment variables for URL and model name.

Option 1: native ollama library

npm install ollama

import ollama from "ollama";

// Simple request
const response = await ollama.chat({
  model: "llama3.2:3b",
  messages: [{ role: "user", content: "What is a REST API?" }],
});
console.log(response.message.content);

// Streaming
const stream = await ollama.chat({
  model: "llama3.2:3b",
  messages: [{ role: "user", content: "Tell me about microservices" }],
  stream: true,
});
for await (const chunk of stream) {
  process.stdout.write(chunk.message.content);
}

// Embeddings
const emb = await ollama.embed({
  model: "nomic-embed-text",
  input: "Hello world",
});
console.log(`Dimension: ${emb.embeddings[0].length}`);

Option 2: fetch API (no dependencies)

async function chatWithOllama(message) {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2:3b",
      messages: [{ role: "user", content: message }],
      stream: false,
    }),
  });
  const data = await res.json();
  return data.message.content;
}

// Streaming via ReadableStream
async function streamChat(message, onToken) {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2:3b",
      messages: [{ role: "user", content: message }],
    }),
  });

  const reader = res.body.getReader();
  const decoder = new TextDecoder();

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    const lines = decoder.decode(value).split("\n").filter(Boolean);
    for (const line of lines) {
      const chunk = JSON.parse(line);
      onToken(chunk.message.content);
      if (chunk.done) return;
    }
  }
}

streamChat("What is GraphQL?", (token) => process.stdout.write(token));

🎯 Model Management and Health Check via API

When Ollama is used not as a local CLI tool, but as a server in a real application, questions arise that go beyond simple chat: how to check if Ollama is running before sending a request, how to avoid cold-start delays on the first session request, how to automatically download the required model at application startup, how to monitor how much memory a model occupies in production. There are separate endpoints for all of this.

I use these endpoints in WebsCraft for two tasks: health check on Spring Boot startup — I check if Ollama is available before registering AI routes, and /api/ps in logs — to see when the model is unloaded and how much VRAM it occupies between requests.

GET /api/tags — list of installed models

curl http://localhost:11434/api/tags

# Response:
{
  "models": [
    {
      "name": "llama3.2:3b",
      "size": 2019393423,
      "details": {
        "parameter_size": "3B",
        "quantization_level": "Q4_K_M"
      }
    }
  ]
}

Useful at application startup: check if the required model is installed, and if not — download it via /api/pull (or return an error).
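
A small startup sketch of that idea; the ensure_model helper name is mine, not part of the Ollama API:

# Python: make sure the required model is installed before serving traffic
import requests

def ensure_model(name: str, base_url: str = "http://localhost:11434") -> None:
    tags = requests.get(f"{base_url}/api/tags", timeout=10).json()
    # note: names in /api/tags include the tag, e.g. "llama3.2:3b" or "nomic-embed-text:latest"
    installed = {m["name"] for m in tags.get("models", [])}
    if name not in installed:
        print(f"{name} is missing, pulling...")
        # stream=false: the request blocks until the download completes
        requests.post(f"{base_url}/api/pull",
                      json={"name": name, "stream": False}, timeout=3600)

ensure_model("llama3.2:3b")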

GET /api/ps — running models and VRAM

curl http://localhost:11434/api/ps

# Response:
{
  "models": [
    {
      "name": "llama3.2:3b",
      "size_vram": 2145386496,
      "expires_at": "2026-05-01T10:05:00Z"
    }
  ]
}

Useful before a request: if the model is already loaded (/api/ps is not empty) — the first request will be without cold-start delay. The expires_at field shows when the model will be unloaded from memory (by default, 5 minutes after the last request).
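
This also makes it easy to warm the model up before the first user request. Per the Ollama docs, a generate request without a prompt only loads the model into memory; a minimal sketch:

# Python: pre-load the model if nothing is running yet
import requests

BASE = "http://localhost:11434"

ps = requests.get(f"{BASE}/api/ps", timeout=10).json()
if not ps.get("models"):
    # an empty /api/generate request loads the model without generating anything
    requests.post(f"{BASE}/api/generate",
                  json={"model": "llama3.2:3b", "keep_alive": "30m"},
                  timeout=300)
    print("Model pre-loaded, the first user request skips the cold start")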

GET / — health check

curl http://localhost:11434/
# Returns: "Ollama is running"

# Useful in startup scripts:
if curl -s http://localhost:11434/ | grep -q "running"; then
  echo "Ollama is ready"
else
  echo "Ollama is not running — starting..."
  ollama serve &
fi

In Spring Boot, you can do a health check via @EventListener(ApplicationReadyEvent.class) — after the application starts, check Ollama's availability and log the result:

@Slf4j
@Component
@RequiredArgsConstructor
public class OllamaHealthChecker {

    private final WebClient ollamaWebClient;

    @EventListener(ApplicationReadyEvent.class)
    public void checkOllamaOnStartup() {
        ollamaWebClient.get()
                .uri("/api/tags")
                .retrieve()
                .bodyToMono(Map.class)
                .subscribe(
                    resp -> log.info("Ollama is available, models: {}",
                                     ((List) resp.get("models")).size()),
                    err  -> log.warn("Ollama is unavailable: {}", err.getMessage())
                );
    }
}

POST /api/pull — download model via API

curl http://localhost:11434/api/pull \
  -d '{"name": "llama3.2:3b"}'

# Python with progress:
import requests, json

def pull_model(name: str):
    r = requests.post("http://localhost:11434/api/pull",
                      json={"name": name}, stream=True)
    for line in r.iter_lines():
        if line:
            status = json.loads(line)
            if "total" in status and "completed" in status:
                pct = 100 * status["completed"] / status["total"]
                print(f"\r{status['status']} {pct:.1f}%", end="")
            else:
                print(status.get("status", ""))

pull_model("nomic-embed-text")

Useful in Docker entrypoint or CI/CD pipeline — automatically download required models on the first deployment, without manual ollama pull on the server.

🎯 Error Handling, Timeouts, OLLAMA_HOST

Common Errors and How to Handle Them

| Error | Cause | Solution |
|---|---|---|
| Connection refused :11434 | Ollama is not running | Start ollama serve or the Ollama application |
| 404 model not found | Model not downloaded | ollama pull model-name |
| Timeout without a response | Model too large / cold start | Increase the timeout to 120s, or pre-load the model |
| 500 out of memory | Not enough RAM | Choose a smaller model or Q4 instead of Q8 |
| 404 on /v1/chat/completions | Mixed up /api/ and /v1/ | OpenAI SDK → base_url = localhost:11434/v1 |
| Response truncated mid-sentence | num_predict too small (default 128) | Increase num_predict or set it to -1 (no limit) |

The last error in the table is the most non-obvious. I myself encountered it when responses suddenly got cut off in the middle of an explanation. The reason: by default, some Ollama builds limit generation to 128 tokens. The solution is to explicitly specify num_predict:

# In a request via /api/chat or /api/generate:
{
  "model": "llama3.2:3b",
  "messages": [...],
  "options": {
    "num_predict": -1   // -1 = no limit
    // or a specific number:
    // "num_predict": 2048
  }
}

Recommended Timeouts

Python (requests):

# Tuple (connect_timeout, read_timeout)
requests.post(url, json=body, timeout=(10, 120))
# 10s for connection, 120s for reading the response
# For large models or long responses — increase read to 300s

Java (WebClient):

// WebClient has two levels of timeouts — both are needed

// 1. HTTP client level timeout (TCP connection and read)
@Bean
public WebClient ollamaWebClient() {
    HttpClient httpClient = HttpClient.create()
            .option(ChannelOption.CONNECT_TIMEOUT_MILLIS, 10_000) // 10s for connection
            .responseTimeout(Duration.ofSeconds(120));            // 120s for response

    return WebClient.builder()
            .baseUrl(ollamaBaseUrl)
            .clientConnector(new ReactorClientHttpConnector(httpClient))
            .codecs(c -> c.defaultCodecs().maxInMemorySize(10 * 1024 * 1024))
            .build();
}

// 2. Reactive timeout at Mono/Flux level (for specific requests)
ollamaWebClient.post()
        .uri("/api/chat")
        .bodyValue(body)
        .retrieve()
        .bodyToMono(OllamaChatResponse.class)
        .timeout(Duration.ofSeconds(120))  // additional protection
        .onErrorMap(TimeoutException.class,
                e -> new RuntimeException("Ollama did not respond within 120s"));

⚠️ For streaming (bodyToFlux), the timeout at the Flux level triggers if more than the specified time passes between tokens — this is not always what you want. For streaming, it's better to rely only on responseTimeout at the HttpClient level.

OLLAMA_HOST — running on another host

# Run Ollama accessible to the network (not just localhost)
OLLAMA_HOST=0.0.0.0:11434 ollama serve

# Or in Docker:
docker run -e OLLAMA_HOST=0.0.0.0:11434 ollama/ollama

# In Python client — replace localhost with server IP:
client = OpenAI(base_url="http://192.168.1.100:11434/v1", api_key="ollama")

# In Java application.properties:
ollama.base-url=http://192.168.1.100:11434

⚠️ Attention: if you expose Ollama externally — add authorization or restrict access via firewall. By default, Ollama does not require authorization — anyone on the network can send requests.

❓ Frequently Asked Questions (FAQ)

How does /api/generate differ from /api/chat?

/api/generate accepts a prompt string and returns text. /api/chat accepts an array of messages with roles (system, user, assistant) and supports tool calling. For chatbots and applications with context — always use /api/chat. For batch generation without context — /api/generate is more convenient.

How to save context between requests?

Ollama does not save context automatically. For multi-turn chat, pass the full message history in each request: after each response, add it to the messages array and pass the entire array in the next request.

What timeout to set for requests?

Depends on model size and response length. For 3B models — 30–60 seconds. For 8B — 60–120 seconds. For the first request after startup (cold start) — add another 10–30 seconds for loading the model into memory.

Do I need an API key for Ollama?

For the native API (/api/*) — no, authorization is not required. For the OpenAI-compatible API (/v1/*) — some SDKs require passing an api_key, but Ollama ignores it. Pass any string: "ollama".

How to run Ollama API in Docker?

docker run -d -p 11434:11434 ollama/ollama — and the API will be available at http://localhost:11434. For GPU acceleration: docker run --gpus all -p 11434:11434 ollama/ollama.

Can I use Ollama in Spring Boot without Spring AI?

Yes. WebClient or RestClient are sufficient for direct HTTP requests to the Ollama API. Spring AI is more convenient if you plan to switch between providers (Ollama → OpenAI → Anthropic) without changing code. For simple integration — WebClient is perfectly adequate.

How to find out how many tokens/sec a model outputs?

Ollama returns metadata in each response — fields eval_count (number of generated tokens) and eval_duration (time in nanoseconds). Divide one by the other:

# Python
data = requests.post(...).json()
tok_per_sec = data["eval_count"] / (data["eval_duration"] / 1e9)
print(f"{tok_per_sec:.1f} tok/s")

For a 3B model on Mac M1 — expect 20–30 tok/s. For 8B — 10–15 tok/s. If you get less than 5 tok/s — the model is too large for the hardware or is partially swapping to disk.

Why doesn't the model call a tool even if tools are passed?

The model is not obligated to call a tool — it can respond with text if it decides that external data is not needed. The three most common reasons are: unclear function description (model doesn't understand when to call it), the question doesn't require external data in the model's opinion, or the model does not support tool calling (check the list of supported models: Llama 3.1+, Qwen 2.5, Mistral v0.3+). Always check for the presence of tool_calls in the response and handle both cases — with and without the call.

✅ Conclusions

Ollama REST API is a simple and powerful tool for integrating local AI into any application. Here's the main takeaway:

  • ✔️ Two API surfaces: native /api/* for full control, /v1/* as a drop-in replacement for OpenAI code
  • ✔️ /api/chat — the main endpoint: supports history, tool calling, and streaming
  • ✔️ Streaming — by default: enable for UI, disable for batch tasks
  • ✔️ /api/embed — for RAG: nomic-embed-text + /api/chat = a complete local RAG pipeline
  • ✔️ Java + WebClient: non-blocking requests, streaming support via Flux
  • ✔️ Error handling: always set timeouts and handle Connection refused

In my projects — WebsCraft and AskYourDocs — I use these exact endpoints: /api/embed for content indexing, /api/chat with streaming for user responses, /api/ps and health check for monitoring. The main thing I've learned after several months of working with the Ollama API is: it doesn't require complex infrastructure — curl, WebClient, or fetch are enough to build a full-fledged AI application without any external API keys.

Next step: if you want to build a complete RAG pipeline with Ollama — article RAG with Ollama: from pipeline to production. If you need a comparison of when Ollama wins over cloud APIs — Ollama vs ChatGPT vs Claude: which task requires the cloud.
