Ollama 0.24 + Codex App: How to Run OpenAI Codex Locally Without Subscription

Updated: May 15, 2026

On May 14, 2026, Ollama 0.24 was released — and it's not just another bug fix patch.
This release adds official support for OpenAI's Codex App: now the desktop AI coding agent can be run on any local or cloud model via Ollama. One command — and Codex works with your models, without a mandatory OpenAI subscription.

→ Read the official Ollama Codex App documentation

If you're not yet familiar with Ollama — start with the guide to installing on Mac, Windows, and Linux. If you're interested in comparing models for coding tasks — read the top Ollama models in 2026.

🎯 What has changed: why Ollama 0.24 is not just a patch

Short answer: Ollama 0.24 is the first release that transforms Ollama from a tool for running models into a platform for AI coding agents. Codex App now works on top of Ollama just like it does with the OpenAI API — only the models are local or cloud-based, at your choice.

Before Ollama 0.24, Codex App worked exclusively through the OpenAI API and required a Plus or Pro subscription. Now, all you need is Ollama installed and one command — and Codex gets access to any local model.

What's new in Ollama 0.24 according to the official release on GitHub:

  • ✔️ Codex App integration — official support for the desktop Codex App via ollama launch codex-app
  • ✔️ MLX memory trace logging — memory usage logging for models on Apple Silicon
  • ✔️ Improved MLX sampler — higher generation quality on Mac M-series
  • ✔️ More reliable updates — fixed issues with Ollama App auto-updates
  • ✔️ Response caching for the ollama show command — faster startup

But the main thing is not the list of features. The main thing is the change in concept. Previously, Ollama was the answer to the question "how to run a model locally." Now it's becoming the answer to the question "how to run an AI coding agent locally." Codex App, Claude Code, OpenCode, Copilot CLI — they all now run via ollama launch.

This is a fundamentally different level: not just executing prompts, but a full-fledged agent with access to the repository, terminal, browser, and task execution loop.

🎯 How Codex App works: not an IDE, but an AI agent with an interface

Short answer: Codex App is a desktop application from OpenAI for macOS and Windows. Not an IDE plugin, not code autocompletion. It's a standalone agent that receives a task, writes a plan, executes steps, runs code, and returns the result.

The difference between Copilot and Codex App: Copilot completes a line of code as you type. Codex App receives a task like "add authentication via OAuth" and writes the code itself, runs tests, fixes errors — without your involvement at every step.

⚠️ Important from personal experience: the fact that the agent "writes code itself" doesn't mean it writes it correctly. In practice, AI coding agents often ignore SOLID principles, create God Objects, mix logic in one class, or generate working but ugly code without understanding your architecture.

My rule: treat the Codex App result as a draft, not final code. The agent is good at mechanical work — writing boilerplate, covering with tests, refactoring according to a clear task. But architectural decisions — Single Responsibility, proper layer separation, dependency injection — require your control.

Practical approach: before starting a task, describe architectural constraints to the agent in the prompt. For example: "use the Repository pattern, a separate service layer from the controller, do not put business logic in entities." Without this, the agent will choose the simplest path — and it's not necessarily the right one.
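One way to make such constraints stick is to pin them as a system message that travels with every task. A minimal Python sketch of the idea, using the OpenAI-style messages format; the constraint text and the `build_messages` helper are illustrative, not part of Codex App or Ollama:

```python
# Architectural constraints pinned as a system message, so the agent sees
# them before every task (the prompt text is an example, adapt to your stack).
CONSTRAINTS = (
    "Use the Repository pattern. "
    "Keep a separate service layer; controllers only delegate. "
    "Do not put business logic in entities."
)

def build_messages(task: str) -> list[dict]:
    """Prepend the architecture rules to the user's task."""
    return [
        {"role": "system", "content": CONSTRAINTS},
        {"role": "user", "content": task},
    ]

msgs = build_messages("add authentication via OAuth")
print(msgs[0]["role"])      # system
print(msgs[1]["content"])   # add authentication via OAuth
```

The point is that the rules live in one place instead of being retyped into each prompt, so the agent cannot "forget" them between steps.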

The interaction architecture after connecting Ollama looks like this:

  1. Codex App sends requests to the Ollama OpenAI-compatible endpoint (http://localhost:11434/v1)
  2. Ollama forwards the request to the selected model — local or cloud
  3. The model returns a response in the tool calling / function calling format
  4. Codex App interprets the response, performs actions (writes files, runs commands)
  5. The execution result is returned to the model as context for the next step
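The loop above can be sketched in Python. This is an illustration of the request/response shapes, not the actual Codex App implementation: it builds a chat request for Ollama's OpenAI-compatible endpoint (step 1) and dispatches a tool call from a model response (step 4). The `write_file` tool and the sample response are invented for the example.

```python
import json

# One hypothetical tool the agent exposes to the model.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "write_file",
        "description": "Write text to a file in the repository",
        "parameters": {
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["path", "content"],
        },
    },
}]

def build_request(task: str, model: str = "qwen3:14b") -> dict:
    """Payload for POST http://localhost:11434/v1/chat/completions (step 1)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": task}],
        "tools": TOOLS,
    }

def dispatch_tool_calls(message: dict) -> list[str]:
    """Step 4: interpret the model's tool calls. The returned strings would
    be fed back to the model as context for the next step (step 5)."""
    results = []
    for call in message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # the model returns arguments as a JSON string
        if fn["name"] == "write_file":
            # A real agent would write the file; here we only record the action.
            results.append(f"wrote {args['path']} ({len(args['content'])} bytes)")
    return results

# A response message shaped like the OpenAI chat completions format:
sample = {
    "tool_calls": [{
        "function": {
            "name": "write_file",
            "arguments": '{"path": "auth.py", "content": "# oauth stub"}',
        }
    }]
}
print(dispatch_tool_calls(sample))  # ['wrote auth.py (12 bytes)']
```

Note the weak link: `json.loads(fn["arguments"])` is exactly where an unreliable model breaks the loop, which is why tool-calling quality matters so much for the model choice below.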

How this differs from Cursor or Copilot Chat:

  • Cursor — integrated into the editor, helps during the coding process. Codex App — a separate application, executes tasks asynchronously.
  • Copilot — suggests autocompletions and chat. Codex App — a full execution loop: plan → code → run → verify → fix.
  • Claude Code — CLI agent in the terminal. Codex App — a desktop application with a visual interface, browser, and review mode.

Codex App requires a model with reliable tool calling to work. This is why model selection is critical — we'll discuss it in detail below. If you want to understand the mechanics of tool calling more deeply — read which Ollama models support Tool Calling: tests and benchmarks 2026.

🎯 Step-by-step installation: ollama launch codex-app

Short answer: Three steps — update Ollama, install Codex App, run one command. Ollama automatically configures Codex to use the local endpoint.

Official Ollama documentation on Codex App integration — support is available from version v0.24.0 and newer.

Step 1. Update Ollama to version 0.24.0+

Check current version:

ollama --version

If the version is lower than 0.24.0 — update:

# macOS and Linux
curl -fsSL https://ollama.com/install.sh | sh

# Windows — download the new installer from https://ollama.com/download

Step 2. Install Codex App

Download the Codex App desktop application for macOS or Windows from the official OpenAI website: developers.openai.com/codex/quickstart.

After installation — open Codex App at least once manually. This is necessary for the application to initialize its config files. After the first launch — close it.

Step 3. Launch via Ollama

ollama launch codex-app

Ollama automatically configures Codex App to use its OpenAI-compatible endpoint and opens the application. The configuration is saved — the next time Codex will open with your model.
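A quick way to sanity-check that the endpoint Codex will talk to is actually up: Ollama's OpenAI-compatible API exposes `GET /v1/models`, which lists the models the server serves. A small sketch with only the standard library; `parse_models` and the sample payload are illustrative helpers, not part of the Ollama client:

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434/v1"  # Ollama's OpenAI-compatible endpoint

def parse_models(payload: dict) -> list[str]:
    """Extract model ids from a /v1/models response body."""
    return [m["id"] for m in payload.get("data", [])]

def list_local_models(base_url: str = BASE_URL) -> list[str]:
    """Ask the running Ollama server which models it exposes."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return parse_models(json.load(resp))

# Offline example of the response shape the endpoint returns:
sample = {"object": "list", "data": [{"id": "qwen3:14b"}, {"id": "gemma4:4b"}]}
print(parse_models(sample))  # ['qwen3:14b', 'gemma4:4b']
```

If `list_local_models()` raises a connection error, Ollama isn't running, and Codex App won't get responses either.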

Launch with a specific model immediately:

# Cloud model with vision support
ollama launch codex-app --model kimi-k2.6:cloud

# Local model
ollama launch codex-app --model qwen3:14b

# Local with lower RAM consumption
ollama launch codex-app --model gemma4:4b

Restore original Codex settings

If you want to revert Codex App to its previous profile (e.g., back to OpenAI API):

ollama launch codex-app --restore

Before overwriting, Ollama automatically saves a backup in ~/.ollama/backup/codex-app/ (on Windows ~ = user profile folder).

⚠️ Common issues and solutions

| Issue | Cause | Solution |
|---|---|---|
| Codex App does not open after the command | The application has not been initialized yet | Open Codex manually once, then run ollama launch codex-app again |
| Codex does not switch models | The application is already running and has not reloaded | Allow Ollama to restart Codex when prompted, or close it manually and run the command again |
| Model not found | Model is not downloaded locally | First ollama pull model-name, then ollama launch codex-app |
| Slow response or timeout | Model is too large for the hardware or cold start | Choose a smaller model or wait for the initial load |

Important: the Codex App profile (ollama launch codex-app) and the Codex CLI profile (ollama launch codex) are separate. Changing one does not affect the other.

🎯 Which model to choose for Codex: comparison by task

Short answer: Codex App is an agent with tool calling and a multi-step execution loop. It requires a model with *reliable* tool calling, not just "support". Weak tool calling = the agent stops in the middle of a task or returns text instead of JSON.

Full list of Ollama models with tool calling support and reliability comparison — in the article which Ollama models support Tool Calling: tests and benchmarks 2026.

Ollama recommends the following models for Codex in their newsletter (May 2026):

Cloud models (via Ollama Cloud)

| Model | Feature | When to choose |
|---|---|---|
| kimi-k2.6:cloud | Vision support (sees screenshots) | When you need to annotate UI or debug via screenshot |
| glm-5.1:cloud | Strong in code, fast | For general coding tasks with cloud quality |

Local models (without Ollama Cloud subscription)

| Model | RAM | Tool calling | When to choose |
|---|---|---|---|
| qwen3:14b | ~9 GB | Excellent | Optimal balance of quality / RAM for most tasks |
| qwen3:8b | ~5 GB | Good | If RAM is limited, but acceptable quality is needed |
| gemma4:31b | ~20 GB | Excellent | Maximum local quality, requires a powerful Mac |
| gemma4:4b | ~3 GB | Acceptable | Weak hardware, simple tasks |
| nemotron-3-super:cloud | cloud | Excellent | Alternative without a paid Ollama Cloud subscription |

Download the model before launching:

# Recommended for most
ollama pull qwen3:14b
 
# If RAM is less than 10 GB
ollama pull qwen3:8b
 
# Maximum local quality
ollama pull gemma4:31b

More details on choosing models for specific hardware — read Ollama on weak hardware: what to run on 8GB RAM.

Key criterion: tool calling reliability

For agent tasks — don't just look at the model size or "overall quality". The main thing is: does the model return correct JSON for tool calls, does it hallucinate arguments, does it correctly handle multi-step tool loops.

If a model doesn't support tool calling properly — it starts inventing arguments, ignores task conditions, or simply responds with text where a structured call is expected. Result: the agent goes in the wrong direction and you waste time fixing instead of working.
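Both failure modes from the paragraph above (broken JSON and hallucinated arguments) are easy to detect mechanically. A minimal sketch of such a check; `validate_tool_call` is a helper written for this article, not an Ollama or Codex API:

```python
import json

def validate_tool_call(call: dict, schema: dict) -> list[str]:
    """Return a list of problems with one tool call (empty list = usable).
    `schema` is the JSON-schema 'parameters' object declared for the tool."""
    try:
        args = json.loads(call["function"]["arguments"])
    except (KeyError, TypeError, json.JSONDecodeError):
        # The model answered with prose or malformed JSON instead of a call.
        return ["arguments are not valid JSON"]
    problems = []
    allowed = set(schema.get("properties", {}))
    for name in schema.get("required", []):
        if name not in args:
            problems.append(f"missing required argument: {name}")
    for name in args:
        if name not in allowed:
            problems.append(f"hallucinated argument: {name}")
    return problems

schema = {"properties": {"path": {"type": "string"}}, "required": ["path"]}
good = {"function": {"name": "read_file", "arguments": '{"path": "main.py"}'}}
bad = {"function": {"name": "read_file", "arguments": '{"file": "main.py"}'}}
print(validate_tool_call(good, schema))  # []
print(validate_tool_call(bad, schema))   # ['missing required argument: path', 'hallucinated argument: file']
```

Running a check like this over a few dozen sample tasks is a cheap way to compare how "reliable" two candidate models really are before committing to one.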

From personal experience — my two-model workflow: I use two models in parallel. A fast model (llama3.2:3b) covers ~70% of tasks and responds in 1–2 seconds — regular questions, boilerplate generation, short answers. When precise prompt adherence, complex tool calling, or a multi-step agent loop is needed, I switch to qwen3:8b or larger. After the complex task, I switch back to the fast one. Waiting 8–12 seconds for every response in normal operation is too long.

This approach provides a balance between speed and quality — you don't have to wait for a large model to "warm up" for simple requests every time.
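The routing rule behind this workflow can be written down in a few lines. A sketch of my rule of thumb only: the keyword list is an arbitrary heuristic invented for illustration, not anything Ollama or Codex provides:

```python
FAST_MODEL = "llama3.2:3b"   # answers in 1-2 s, covers routine asks
STRONG_MODEL = "qwen3:8b"    # slower, for tool calling and multi-step tasks

# Keywords that signal the task needs the stronger model
# (a rough, hand-picked heuristic -- tune it to your own tasks).
HEAVY_MARKERS = ("tool", "agent", "multi-step", "architecture", "refactor across")

def pick_model(task: str) -> str:
    """Route a task to the fast or the strong model."""
    lowered = task.lower()
    if any(marker in lowered for marker in HEAVY_MARKERS):
        return STRONG_MODEL
    return FAST_MODEL

print(pick_model("generate a boilerplate DTO"))             # llama3.2:3b
print(pick_model("multi-step agent task with tool calls"))  # qwen3:8b
```

In practice I do the switch manually with `ollama launch codex-app --model ...`; the function just makes the decision rule explicit.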

🎯 Built-in browser and Review Mode: what they offer in practice

Short answer: Two features that differentiate Codex App from CLI agents — a built-in browser with annotations and a code review mode. According to the official documentation, they actually exist and work. How well they work depends on the model and the complexity of the task.

Built-in browser

According to official Ollama documentation, Codex App can open local servers and websites in a built-in browser — and allows leaving annotations directly on the page as context for the agent.

On paper, this sounds convenient: open a local dev server, highlight an element, write a comment — and the agent understands what needs to be fixed without additional description.

⚠️ Honestly about limitations: the result strongly depends on how accurately the agent interprets your annotation and the page context. For simple UI edits — it works reasonably well. For more complex scenarios (logic bugs, not layout issues) — an annotation in the browser doesn't replace a clear prompt. Vision capabilities (when the model literally "sees" a screenshot) are available only with models that support vision — for example, kimi-k2.6:cloud. With local text models, the agent reads HTML, not images.

Review Mode

According to the documentation, Review Mode allows you to view code changes within Codex App itself, leave comments on specific lines, and ask the agent to refine them.

In principle, this is the same workflow as in GitHub PR review — but without leaving the application. The agent sees its own diff and your comment in the same context, reducing the amount of explanation needed.

⚠️ Honestly about limitations: Review Mode is useful when the agent has done something close to correct and needs minor adjustments. If the agent has fundamentally gone in the wrong direction with the architecture — comments in review won't replace reformulating the task from scratch. It's a tool for refinements, not for fixing fundamental errors.

🎯 Limitations of the local approach: where cloud Codex wins

Short answer: Local Codex via Ollama means privacy, offline access, and no subscription fees. However, there are tasks where cloud OpenAI Codex (on GPT-4o or GPT-5.5) will be noticeably better. It's important to know these boundaries — so you don't waste time on tasks where a local model won't cope.

More details on scenarios where Ollama wins against cloud APIs, and where it loses — read Ollama vs ChatGPT vs Claude: which task requires the cloud.
| Criterion | Local Codex (Ollama) | Cloud Codex (OpenAI) |
|---|---|---|
| Code privacy | ✅ Code stays on your machine | ⚠️ Code is sent to OpenAI servers |
| Offline operation | ✅ Fully offline (local models) | ❌ Internet required |
| Cost | ✅ Free after hardware purchase | ⚠️ Subscription or per-token payment |
| Quality on complex tasks | ⚠️ Depends on model and hardware | ✅ GPT-5.5 is stronger on architectural tasks |
| Context window | ⚠️ Limited by RAM (usually 8k–32k) | ✅ Up to 128k+ tokens |
| Speed on large repos | ⚠️ Slower on CPU or weak GPU | ✅ Stable speed regardless of hardware |
| Vision (screenshots) | ⚠️ Only with kimi-k2.6:cloud or gemma4 | ✅ Native support in GPT-4o / GPT-5.5 |
| Parallel tasks (task tree) | ✅ Supported | ✅ Supported |

Where local Codex clearly wins

  • Private or commercial code — when code cannot be sent to external servers
  • Repetitive tasks — refactoring, writing tests, generating boilerplate where GPT-4 quality is not critical
  • Offline environments — corporate networks without internet access
  • Cost for high volume — if you generate thousands of tokens daily, local is cheaper

More on the advantages of self-hosted AI → read the article

Where cloud Codex is better

  • Large repositories — when the context doesn't fit into 8–16k tokens
  • Complex architecture — where GPT-5.5 level is needed for a correct solution
  • Vision tasks — analyzing UI screenshots, which local text-only models cannot do
  • Weak hardware — if your Mac or PC can't handle a 14B model
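For the "context doesn't fit" case, a rough back-of-the-envelope check helps before you even start a task. A common rule of thumb is ~4 characters per token for English text and code; the helper below is a sketch built on that assumption, not a real tokenizer:

```python
from pathlib import Path

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 characters per token for English/code."""
    return len(text) // 4

def repo_fits(paths: list[Path], num_ctx: int = 16384, reserve: int = 4096) -> bool:
    """Check whether a set of files plausibly fits the model context,
    keeping `reserve` tokens free for the prompt and the model's reply."""
    total = sum(estimate_tokens(p.read_text(errors="ignore")) for p in paths)
    return total <= num_ctx - reserve

# A 100 KB file is roughly 25k tokens -- already past a 16k window:
print(estimate_tokens("x" * 100_000))  # 25000
```

If the estimate lands anywhere near the window size, either shrink the file selection or use a cloud model with a larger context.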

🎯 What is the optimal setup: hardware, model, settings

Short answer: For comfortable work with local Codex, you need at least 16 GB of RAM. 8 GB is possible, but limited. Below are specific recommendations depending on your hardware.

Mac Apple Silicon (recommended option)

| RAM | Recommended model | Expected speed |
|---|---|---|
| 8 GB | qwen3:8b or gemma4:4b | ~15–20 tok/s, simple tasks |
| 16 GB | qwen3:14b — optimal | ~20–30 tok/s, most tasks |
| 32 GB | gemma4:31b or qwen3:32b | ~15–25 tok/s, complex tasks |
| 64 GB+ | qwen3:72b or larger | ~10–20 tok/s, maximum quality locally |

Windows / Linux with NVIDIA GPU

| VRAM | Recommended model | Note |
|---|---|---|
| 8 GB | qwen3:8b | Fully in VRAM, fast |
| 12 GB | qwen3:14b (Q4) | Fits with Q4_K_M quantization |
| 16 GB+ | qwen3:14b or gemma4:27b | Comfortable work without swap |
| 24 GB+ | gemma4:31b | Maximum quality on GPU |

Optimal command set to start

# 1. Update Ollama
curl -fsSL https://ollama.com/install.sh | sh

# 2. Download model (for 16 GB RAM)
ollama pull qwen3:14b

# 3. Launch Codex App with the chosen model
ollama launch codex-app --model qwen3:14b

# Next time, it's enough to just:
ollama launch codex-app
# Ollama remembers the chosen model

Context settings for large repositories

By default, the context window depends on VRAM/RAM. To work with large files or multiple files simultaneously, you can increase the context via Modelfile:

# Contents of a file named "Modelfile":
FROM qwen3:14b
PARAMETER num_ctx 16384

# Build the new model from that file
ollama create qwen3-codex -f Modelfile

# Launch Codex with this model
ollama launch codex-app --model qwen3-codex

For details on context management and parameter settings, see the article Ollama REST API: Integration into Your Application.

⚙️ Advanced settings: config, environment variables, benchmarks

For most users, ollama launch codex-app is sufficient. But if you want more control over the agent's behavior or to get the most out of a specific model, here's what you can configure manually.

1. Manual editing of ~/.codex/config.toml

The main Codex config is located at:
~/.codex/config.toml — Mac / Linux
%USERPROFILE%\.codex\config.toml — Windows

⚠️ Important: when you run ollama launch codex-app, Ollama itself writes the necessary values to this file and saves a backup of previous settings in ~/.ollama/backup/codex-app/. Manual editing makes sense only if you want to change parameters not available in the standard launch — for example, temperature or system prompt.

Example configuration for a local Ollama provider:

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"

[profiles.local-coder]
model_provider = "ollama"
model = "qwen3:14b"
temperature = 0.3

⚠️ Note: the exact structure of the config file may vary depending on the Codex App version. Before editing, check what's already in your file, don't overwrite blindly. The num_ctx parameter may not be supported via config.toml — it's more reliable to use Modelfile as described above to change the context.

2. Ollama environment variables

Ollama supports a number of official environment variables for fine-tuning. The full list is in the official Ollama FAQ documentation. The most useful for working with Codex are:

| Variable | What it does | Example |
|---|---|---|
| OLLAMA_HOST | Ollama server address and port | 0.0.0.0:11434 |
| OLLAMA_KEEP_ALIVE | How long to keep the model in memory | 30m or -1 |
| OLLAMA_NUM_PARALLEL | Number of parallel requests | 2 |
| OLLAMA_FLASH_ATTENTION | Flash Attention for Apple Silicon | 1 |
| OLLAMA_NUM_GPU | Number of GPU layers to offload | 99 (all layers) |

Set before launching:

# macOS / Linux
OLLAMA_KEEP_ALIVE=30m OLLAMA_FLASH_ATTENTION=1 ollama launch codex-app

# or permanently via ~/.zshrc / ~/.bashrc
export OLLAMA_KEEP_ALIVE=30m
export OLLAMA_FLASH_ATTENTION=1

3. Benchmarks: how much local models actually handle

⚠️ Disclaimer: exact SWE-bench figures are constantly updated and depend heavily on the test configuration. The data below is approximate; check the latest values on swebench.com and in official model releases.

| Model | SWE-bench Verified (approximate) | Where to run |
|---|---|---|
| GPT-5.5 / Claude Sonnet 4.6 (cloud) | ~68–73% | OpenAI / Anthropic API |
| gpt-oss:120b via Ollama | ~62% | Locally, requires 64+ GB RAM |
| glm-5.1:cloud / large Qwen3 | ~58–68% | Ollama Cloud or locally 32B+ |
| qwen3:14b locally | not officially tested | 16 GB RAM, good for routine tasks |

What this implies practically: local models of 14B–32B size handle routine coding tasks well — refactoring, writing tests, generating boilerplate. For complex agentic tasks requiring deep reasoning across multiple files — cloud models have a noticeable advantage. For most real-world tasks, the gap is not as critical as the percentages suggest.

❓ Frequently Asked Questions (FAQ)

Do I need an OpenAI subscription to use Codex App with Ollama?

No. Ollama configures Codex App to use its local endpoint. An OpenAI subscription is only needed if you want to use OpenAI cloud models. For local models, no subscription is required.

Is Codex App only available on macOS?

No. Codex App from OpenAI is available for macOS and Windows. Ollama 0.24 supports integration on both platforms. Linux is not yet supported by Codex App itself.

What is the difference between `ollama launch codex-app` and `ollama launch codex`?

ollama launch codex-app — launches the desktop Codex App with a graphical interface. ollama launch codex — launches Codex CLI in the terminal. These are separate profiles; changing one does not affect the other.

Are my configs saved if I run `ollama launch codex-app`?

Yes. Ollama saves a backup of the original Codex App configs in ~/.ollama/backup/codex-app/ before making any changes. You can restore them with the command ollama launch codex-app --restore.

Which model should I choose if I want to try with minimal requirements?

To start, use qwen3:8b (requires ~5 GB RAM) or gemma4:4b (~3 GB RAM). They support tool calling and provide acceptable quality for simple tasks. For serious work, we recommend qwen3:14b on 16 GB RAM.

Can Codex App via Ollama execute tasks in parallel (task tree)?

Yes, task tree is a feature of Codex App itself, independent of the underlying model. However, parallel task execution puts a load on the model and requires more RAM. On 8 GB, parallel tasks can cause noticeable slowdowns.

Does Codex App see my entire repository?

Codex App gets access to the repository you open in the application. With local models, no code is sent externally. With Ollama Cloud cloud models (kimi-k2.6:cloud, glm-5.1:cloud), requests go through Ollama Cloud.

✅ Conclusions

I tried Ollama 0.24 + Codex App right after its release — and my impression is mixed, but generally positive. It really works: one command and Codex App starts using a local model instead of the OpenAI API. For private code or offline environments — this is enough to try it out.

But it's important to understand that this is not a "replacement" for cloud Codex, but a different tool with different trade-offs. Here's what I learned from practice:

  • ✔️ Installation is simple: update Ollama to 0.24, install Codex App, run ollama launch codex-app — that's it.
  • ✔️ The model decides everything: for most tasks on 16 GB RAM, I use qwen3:14b — reliable tool calling and acceptable speed.
  • ⚠️ Two models are better than one: a fast one (llama3.2:3b) for 70% of tasks, a larger one — when precise tool calling is needed. Waiting 8–12 seconds for every simple response is too long for normal work.
  • ⚠️ Agent code is a draft: Codex writes working code, but often without understanding SOLID principles and your architecture. Always review the result, especially if the task involves multiple layers of the application.
  • ✔️ Built-in browser and Review Mode are convenient for simple UI edits and clarifications after a task is completed. For complex architectural problems, they won't replace proper prompting.
  • ✔️ The local approach wins for private code, offline environments, and large generation volumes where the cloud is expensive.
  • ⚠️ Cloud Codex is better for large repositories where context doesn't fit locally, and for tasks where GPT-5.5 quality is required.

My conclusion: Ollama 0.24 + Codex App is a useful tool if you approach it correctly. Not as an autonomous developer who will do everything themselves, but as a quick way to write a draft version or handle routine tasks — refactoring, tests, boilerplate. Architecture and code review are still up to you.

If you want to understand how tool calling works under the hood of Codex — read Which Ollama Models Support Tool Calling: Tests and Benchmarks 2026. If you are interested in a full RAG pipeline on top of Ollama — RAG with Ollama: From Pipeline to Production.
