AI_TOOLS 19 June 2026 14 min read 61 view

LM Studio 2026: What it is and why to run AI on Mac

Updated: 19 June 2026

Language: 🇺🇦 🇬🇧 🇩🇪 🇪🇸

Dmitro Petrov

A Tech Lead who builds AI/ML systems for production — and writes about how they actually work.

LM Studio 2026: What it is and why to run AI on Mac

In short: LM Studio is a free desktop application for running LLMs locally on Mac with a GUI, MLX acceleration on Apple Silicon, and an OpenAI-compatible API. By mid-2026, MCP transitioned from an experiment to a standard — LM Studio is now not just a chat, but a full-fledged platform for local AI agents. We'll explore how it differs from Ollama and when LM Studio is the right choice.

💻 What is LM Studio in simple terms

LM Studio is a free desktop application from Element Labs that allows you to download and run open-source language models (Llama, DeepSeek, Qwen, Mistral, Gemma, Phi) entirely on your own computer — no cloud, no API keys, no monthly subscription.

Unlike Ollama, which lives in the terminal, LM Studio provides a graphical interface: a built-in model browser from Hugging Face, a chat window similar to ChatGPT, generation parameter settings directly in the UI, and its own local server at localhost:1234 with an OpenAI-compatible API for developers.

I've been using LM Studio in parallel with Ollama for several months now — and in this article, I'll explain why they are not "either/or" but tools for different tasks.

🚀 What has changed in local AI by mid-2026

If you last looked at local AI a year or two ago, the landscape has changed significantly, and not just in terms of model quality. The very purpose of why people turn to local AI has changed: recently, it was mainly about saving on tokens and the curiosity of enthusiasts; now, it's increasingly a conscious choice for privacy and control.

MCP is no longer an experiment — it's a standard

LM Studio gained support for Model Context Protocol (MCP) as an MCP Host back in version 0.3.17 — at the time, it was a novelty shown as a technical demonstration. But the path from "interesting feature" to "standard" turned out to be quick.

By April 2026, version 0.4.10 added OAuth support for MCP servers — now you can connect Linear, Notion, Atlassian with one click via browser authorization, without manually copying tokens and without storing secrets in plain configuration files. LM Studio handles the entire OAuth handshake itself — it opens the service's authorization page in your browser, securely stores the token after confirmation, and the service's tools become immediately available to the model in chat or via API.

In addition to official integrations (there are only four so far — Linear, Notion, Atlassian, and one more service via the official gallery), the community has already assembled a much wider catalog of MCP connectors that work with LM Studio via standard HTTP/SSE transport or local stdio. This means the ecosystem is growing not only thanks to Element Labs but also thanks to the developer community — this is a sign of a mature platform, not a one-off feature.

In practice, this transforms LM Studio from an "advanced chat" into a full-fledged platform for local AI agents that can actually do things — read files on disk, work with your task trackers, search for information via external APIs, and do it in multiple steps, without human intervention at each stage.

Apple M5 provided a tangible leap

Apple officially announced that the M5 chip processes prompts 3.5-4 times faster than the M4, and the time-to-first-token for a dense 14B model now takes less than 10 seconds, and for a 30B MoE architecture — less than 3 seconds. These are no longer marketing promises, but Apple's own figures from their machine learning research blog.

There's a nuance that owners of new hardware should be aware of: if you have an M5 but are using an older version of macOS, you won't get even the memory bandwidth benefits (a 19-27% increase compared to M4). The chip's full potential is unlocked only with the latest macOS — hardware acceleration without corresponding software works only partially.

Tool calling in local models has drastically improved

Just a year ago, local models had poor and unreliable tool calling — this was the main reason why "local AI agent" sounded like an experiment, not a working tool. Now, the situation has changed dramatically: Gemma 4 has jumped from 6.6% to 86.4% tool calling accuracy according to third-party tests — this is not a gradual improvement, but a qualitative leap in a year. Qwen3.5 now shows results that, on many benchmarks, approach flagship cloud models.

This means that a local AI agent via LM Studio with MCP is no longer a toy for demonstrations — it can actually perform multi-step tasks: find information, process it, call the necessary tool, and do it reliably enough for daily use, at least for relatively simple action chains.

Why this actually matters

These three changes are not a random coincidence of technical updates. They form a single picture: local AI in 2026 has ceased to be a compromise. Previously, the choice to "run locally" almost always meant a conscious sacrifice — weaker models, lack of tool calling, slower speeds, inconvenient interface. Now, each of these sacrifices is becoming significantly smaller or disappearing altogether.

And this aligns with a broader trend visible beyond the enthusiast niche: a Cisco survey of 2600 security professionals showed that 92% perceive generative AI as a technology requiring fundamentally new approaches to risk management, and 68% are concerned about data leakage outside the company or to competitors. When your model runs locally on a Mac, these risks simply don't arise, as the data physically never leaves the device.

For a developer, this means a practical thing: it now makes real sense to build workflows around local AI not only for savings or curiosity but because privacy, data control, and already sufficient model quality make it a rational choice — not just an ideological one.

⚖️ How LM Studio Differs from Ollama and ChatGPT

It's common to confuse three entirely different product categories here – even though at first glance they all "just provide access to AI". Let's break it down to the core, because the difference is fundamental.

Criterion	LM Studio	Ollama	ChatGPT
Where it runs	Locally, your Mac	Locally, your Mac	OpenAI Cloud
Interface	GUI application	CLI terminal (desktop app also available)	Web/mobile app
Internet required	Only for model download	Only for model download	Always
Data privacy	Complete – nothing leaves	Complete – nothing leaves	Data is processed on OpenAI servers
Cost	Free	Free	Subscription / tokens
MLX acceleration on Apple Silicon	✅ Yes, from the very start of Apple Silicon support	✅ Yes, from late March 2026 – separate `-mlx` model tags	Not applicable
MCP / Tool calling	✅ MCP Host with OAuth (0.4.10+)	Tool calling is supported, MCP is narrower	✅ Via OpenAI's own plugins/tools

The line about MLX deserves a separate explanation, as the situation changed literally during 2026. For a long time, MLX acceleration was what clearly distinguished LM Studio from Ollama. But at the end of March, Ollama also officially launched its own MLX engine – and as of now, it has even received separate optimizations: operations merged into unified Metal kernels via the MLX just-in-time compiler and support for the NVFP4 format for better quantization quality.

An important nuance: in Ollama, MLX variants of models come as separate tags – for example, gemma4:e4b-mlx instead of the usual gemma4:e4b. And as of mid-2026, these MLX tags in Ollama only support text, without images – if you need vision input, you'll have to use the standard GGUF tag. LM Studio doesn't have this separation – the MLX build is immediately multimodal if the model supports it.

In simple terms: LM Studio and Ollama are two ways to run the same thing locally, with different interfaces and slightly different maturity of individual features at any given moment. ChatGPT is a completely different product category, because your data physically leaves your computer and is processed on someone else's infrastructure.

⚡ MLX vs llama.cpp: why Apple Silicon wins here

LM Studio runs on two engines simultaneously: llama.cpp (GGUF format, works on any platform – Mac, Windows, Linux, with or without GPU) and Apple MLX (only for M-series chips). If you have Apple Silicon – MLX is usually chosen by default when an MLX build exists for the model.

Why there's a speed difference at all

This isn't about marketing, but architecture. MLX is a framework that Apple developed specifically for the unified memory architecture of M-series chips, where the CPU and GPU share the same memory instead of separate pools like in traditional PCs with discrete graphics cards. MLX directly accesses the Metal runtime, bypassing the overhead of GGUF format quantization.

The speed difference is measured, not estimated: the MLX engine is usually 30-50% faster than llama.cpp via Metal on the same hardware – this is confirmed by independent tests, and even by Ollama itself, which was previously purely GGUF-oriented but eventually recognized the advantage and added its own MLX engine. Some narrow tests on specific models (e.g., Gemma 4) show a difference closer to 10-20% – the actual gain depends on the specific model, context size, and how well the MLX build for that model is optimized.

In practice, this means one simple thing: the same model in MLX format will give you noticeably more tokens per second than the GGUF version of the same model on the same Mac. If you're on an M-series and have a choice – MLX is almost always more beneficial, except when you specifically need a feature that is only available in the GGUF variant (e.g., at the time of writing – image processing for some models in Ollama-MLX tags).

What's worth checking in practice

An important nuance that I've personally verified: LM Studio updates its engines independently of the application itself. If a new model suddenly "doesn't load" or gives a strange error – the first thing to check is Settings → Runtime. An outdated engine is the most common cause of such a problem, much more so than the model itself or insufficient RAM. This is especially relevant immediately after a new model is released – there will be a few days to a week lag until the corresponding MLX engine matures and becomes stable for it, so if a model has just been released and behaves strangely – first check if the engine version is outdated, rather than blaming the model itself.

Another practical detail: sometimes a new model initially only gets support in GGUF via llama.cpp, and a full MLX version arrives later – a pattern we've seen with Gemma 4 and other fresh releases. If you see an error like "model architecture not supported" immediately after a new model is released – it's almost always a matter of time, not your configuration.

🎁 What You Get: GUI, MCP Host, API, Offline

In short – here's the full set of what LM Studio provides out of the box, without any additional settings or plugins:

Feature	What it provides in practice
GUI with built-in model browser	Search and download models directly from Hugging Face without leaving the application – no manual file downloads or format parsing
MCP Host	Connect external MCP servers (filesystem, search, Linear, Notion, Atlassian via OAuth) and make them available to the local model – the model gets real "hands" instead of just text
OpenAI-compatible API at `localhost:1234`	Any code written for the OpenAI SDK can be switched to a local model by changing only the base URL. There's also an Anthropic-compatible endpoint `/v1/messages` for those accustomed to the Claude API
Document chat (RAG)	Upload documents and ask questions about their content, without an external pipeline, database, or separate embeddings service
lms CLI and headless daemon (llmster)	For automation without an open application window – for example, on a server, in a Docker container, or in a CI/CD pipeline
Full offline operation	After downloading the model, the internet is no longer needed – even on a plane or in a closed network without internet access

The API compatibility deserves a special mention: the fact that LM Studio supports both OpenAI and Anthropic formats immediately is not a trivial matter. It means you can take an existing project written for the Claude API or GPT, change the base URL to localhost:1234 – and it will work with the local model with practically no code rewriting. This saves real time for prototyping and testing.

🔍 An Honest Nuance: Why Tokens/Sec Numbers Can Be Misleading

Here I want to be as honest as possible, because I've encountered this myself. The speed number that LM Studio shows in the interface during generation doesn't always reflect real performance in long dialogues, and the difference can be dramatic.

The independent benchmark project famstack.dev showed an illustrative example: with a context of ~8500 tokens, LM Studio MLX displayed 57 tokens/second in the UI – this is the number you see during text generation. But the actual *effective* throughput (how much time passed from sending the request to receiving the full response, including processing the entire context) was closer to 3 tokens/second.

The reason is prefill overhead: before the model starts generating new tokens, it first has to "read" and process the entire preceding context. The longer the conversation or document, the longer this phase lasts, and it's this phase, not the generation speed itself, that determines how long you actually wait for a response.

Metric	What it shows	Value at 8500 tokens context
Generation tok/s (in UI)	Speed of generating new tokens – what you see on the screen	~57 tok/s
Effective tok/s (reality)	Output tokens divided by total waiting time (prefill + generation)	~3 tok/s

A practical solution to know: LM Studio MLX by default processes context in chunks of 512 tokens (prefill chunk size). Increasing this value to 4096 or even 8192 can speed up prefill by 1.5-2 times on newer hardware (M3/M4). On older chips like M1, the effect is less pronounced – memory bandwidth is more often the bottleneck there, rather than chunk size.

Practical conclusion: if you plan long agent sessions with a large context (and this is how MCP works – the model constantly keeps tool call results and dialogue history in context) – don't rely on the number from a short test prompt, but check the speed in a scenario that is realistic for you. The "57 tokens per second" figure from the demo on first launch can be misleading about how comfortable it will be to work in a real, long workflow.

🎯 Who is LM Studio for — and when is Ollama still better

This is the question I personally get asked most often — and the honest answer is that it's not an "either/or" contradiction. Both tools ultimately do the same thing: run a model locally and provide an API to it. The difference lies in which path to get there is more convenient for your specific scenario.

Your situation	Recommendation	Why
Want to compare several models visually, switch between GGUF and MLX	LM Studio	Everything is visible immediately in the interface — size, format, downloaded/available models, without memorizing commands
Need an MCP Host with OAuth for Notion, Linear, Atlassian	LM Studio	One-click browser authorization, without manual token management
You are on Apple Silicon and want maximum performance	LM Studio (with a slight advantage)	MLX has been here from the beginning and is more deeply integrated into the UI — although Ollama has also caught up with its own MLX engine
Don't like the terminal, want everything to be visible	LM Studio	The GUI removes the entry barrier — you don't need to remember command syntax
Automating everything via scripts, cron, CI/CD	Ollama	The CLI is more natural for scripts — `ollama run model "prompt"` in one line without launching the GUI
Already have infrastructure built on Ollama	Ollama	No need to duplicate the setup for minor advantages — for example, I already have it integrated into Spring AI projects via `OllamaChatModel`, and there's no point in rewriting the configuration for LM Studio
Need the simplest possible command without extra clicks	Ollama	`ollama run modelname` — and you're already in the chat, without opening windows and navigating menus

In practice, I keep both tools simultaneously — it's not a compromise, but a conscious choice. For quick experiments, comparing several models, or when an MCP with OAuth services is needed — I open LM Studio. For production-like scenarios via Spring AI, where there's already established configuration and automation — I'm still sticking with Ollama. They coexist perfectly on the same Mac simultaneously: LM Studio listens on localhost:1234, Ollama — on localhost:11434, there's no port conflict.

If you're just starting and don't know where to begin — my practical advice: try LM Studio first. The GUI provides a visual understanding of what's happening — what models are available, how much they weigh, how they respond — and this understanding later helps you navigate much better, even if you switch to Ollama for production later.

✅ What you can do with LM Studio right today

Without any code — here are five things you can try immediately after installation to get a working understanding of the tool in one evening, not just "install and forget."

Download your first model through the built-in search — start with something small like Qwen3 7-8B to check that everything works and the model comfortably fits into your memory, before downloading something larger.
Chat — the interface is intuitive, similar to ChatGPT, so there's almost nothing to get used to. Try asking several real work-related questions, not test ones — this way you'll immediately feel the difference between a cloud and a local model in practice.
Connect a document via Document Chat — upload a PDF or notes and ask questions about their content. This is the fastest way to feel that local AI can be truly useful for specific work tasks, not just an interesting experiment.
Connect the first MCP server — for example, the file system, so the model can read files from your disk. This is where the difference between a "chatbot" and an "agent" becomes visible — the model starts doing something real, not just responding with text.
Run a local server with one click and check that localhost:1234 responds to requests — this is the first step to connecting the model to your own code, regardless of whether you're writing in Python, Java, or JavaScript.

None of these five steps require code or a terminal — everything is done with the mouse in the interface. If after this you want to go deeper — connect LM Studio to your own application via API, configure tool calling, or build a local agent — that's where we'll start in the next articles of the series.

In the next article, we'll cover a step-by-step installation guide for Mac — from system requirements (Apple Silicon vs Intel) to the first request via curl and common errors that occur at the start.

Categories