In short: LM Studio is a free desktop application for running LLMs locally on Mac with a GUI, MLX acceleration on Apple Silicon, and an OpenAI-compatible API. By mid-2026, MCP transitioned from an experiment to a standard — LM Studio is now not just a chat, but a full-fledged platform for local AI agents. We'll explore how it differs from Ollama and when LM Studio is the right choice.
💻 What is LM Studio in simple terms
LM Studio is a free desktop application from Element Labs that allows you to download and run open-source language models (Llama, DeepSeek, Qwen, Mistral, Gemma, Phi) entirely on your own computer — no cloud, no API keys, no monthly subscription.
Unlike Ollama, which lives in the terminal, LM Studio provides a graphical interface: a built-in model browser from Hugging Face, a chat window similar to ChatGPT, generation parameter settings directly in the UI, and its own local server at localhost:1234 with an OpenAI-compatible API for developers.
I've been using LM Studio in parallel with Ollama for several months now — and in this article, I'll explain why they are not "either/or" but tools for different tasks.
🚀 What has changed in local AI by mid-2026
If you last looked at local AI a year or two ago, the landscape has changed significantly, and not just in terms of model quality. The very purpose of why people turn to local AI has changed: recently, it was mainly about saving on tokens and the curiosity of enthusiasts; now, it's increasingly a conscious choice for privacy and control.
MCP is no longer an experiment — it's a standard
LM Studio gained support for Model Context Protocol (MCP) as an MCP Host back in version 0.3.17 — at the time, it was a novelty shown as a technical demonstration. But the path from "interesting feature" to "standard" turned out to be quick.
By April 2026, version 0.4.10 added OAuth support for MCP servers — now you can connect Linear, Notion, Atlassian with one click via browser authorization, without manually copying tokens and without storing secrets in plain configuration files. LM Studio handles the entire OAuth handshake itself — it opens the service's authorization page in your browser, securely stores the token after confirmation, and the service's tools become immediately available to the model in chat or via API.
In addition to official integrations (there are only four so far — Linear, Notion, Atlassian, and one more service via the official gallery), the community has already assembled a much wider catalog of MCP connectors that work with LM Studio via standard HTTP/SSE transport or local stdio. This means the ecosystem is growing not only thanks to Element Labs but also thanks to the developer community — this is a sign of a mature platform, not a one-off feature.
In practice, this transforms LM Studio from an "advanced chat" into a full-fledged platform for local AI agents that can actually do things — read files on disk, work with your task trackers, search for information via external APIs, and do it in multiple steps, without human intervention at each stage.
Apple M5 provided a tangible leap
Apple officially announced that the M5 chip processes prompts 3.5-4 times faster than the M4, and the time-to-first-token for a dense 14B model now takes less than 10 seconds, and for a 30B MoE architecture — less than 3 seconds. These are no longer marketing promises, but Apple's own figures from their machine learning research blog.
There's a nuance that owners of new hardware should be aware of: if you have an M5 but are using an older version of macOS, you won't get even the memory bandwidth benefits (a 19-27% increase compared to M4). The chip's full potential is unlocked only with the latest macOS — hardware acceleration without corresponding software works only partially.
Tool calling in local models has drastically improved
Just a year ago, local models had poor and unreliable tool calling — this was the main reason why "local AI agent" sounded like an experiment, not a working tool. Now, the situation has changed dramatically: Gemma 4 has jumped from 6.6% to 86.4% tool calling accuracy according to third-party tests — this is not a gradual improvement, but a qualitative leap in a year. Qwen3.5 now shows results that, on many benchmarks, approach flagship cloud models.
This means that a local AI agent via LM Studio with MCP is no longer a toy for demonstrations — it can actually perform multi-step tasks: find information, process it, call the necessary tool, and do it reliably enough for daily use, at least for relatively simple action chains.
Why this actually matters
These three changes are not a random coincidence of technical updates. They form a single picture: local AI in 2026 has ceased to be a compromise. Previously, the choice to "run locally" almost always meant a conscious sacrifice — weaker models, lack of tool calling, slower speeds, inconvenient interface. Now, each of these sacrifices is becoming significantly smaller or disappearing altogether.
And this aligns with a broader trend visible beyond the enthusiast niche: a Cisco survey of 2600 security professionals showed that 92% perceive generative AI as a technology requiring fundamentally new approaches to risk management, and 68% are concerned about data leakage outside the company or to competitors. When your model runs locally on a Mac, these risks simply don't arise, as the data physically never leaves the device.
For a developer, this means a practical thing: it now makes real sense to build workflows around local AI not only for savings or curiosity but because privacy, data control, and already sufficient model quality make it a rational choice — not just an ideological one.
⚖️ How LM Studio Differs from Ollama and ChatGPT
It's common to confuse three entirely different product categories here – even though at first glance they all "just provide access to AI". Let's break it down to the core, because the difference is fundamental.
Criterion
LM Studio
Ollama
ChatGPT
Where it runs
Locally, your Mac
Locally, your Mac
OpenAI Cloud
Interface
GUI application
CLI terminal (desktop app also available)
Web/mobile app
Internet required
Only for model download
Only for model download
Always
Data privacy
Complete – nothing leaves
Complete – nothing leaves
Data is processed on OpenAI servers
Cost
Free
Free
Subscription / tokens
MLX acceleration on Apple Silicon
✅ Yes, from the very start of Apple Silicon support
✅ Yes, from late March 2026 – separate -mlx model tags
Not applicable
MCP / Tool calling
✅ MCP Host with OAuth (0.4.10+)
Tool calling is supported, MCP is narrower
✅ Via OpenAI's own plugins/tools
The line about MLX deserves a separate explanation, as the situation changed literally during 2026. For a long time, MLX acceleration was what clearly distinguished LM Studio from Ollama. But at the end of March, Ollama also officially launched its own MLX engine – and as of now, it has even received separate optimizations: operations merged into unified Metal kernels via the MLX just-in-time compiler and support for the NVFP4 format for better quantization quality.
An important nuance: in Ollama, MLX variants of models come as separate tags – for example, gemma4:e4b-mlx instead of the usual gemma4:e4b. And as of mid-2026, these MLX tags in Ollama only support text, without images – if you need vision input, you'll have to use the standard GGUF tag. LM Studio doesn't have this separation – the MLX build is immediately multimodal if the model supports it.
In simple terms: LM Studio and Ollama are two ways to run the same thing locally, with different interfaces and slightly different maturity of individual features at any given moment. ChatGPT is a completely different product category, because your data physically leaves your computer and is processed on someone else's infrastructure.
⚡ MLX vs llama.cpp: why Apple Silicon wins here
LM Studio runs on two engines simultaneously: llama.cpp (GGUF format, works on any platform – Mac, Windows, Linux, with or without GPU) and Apple MLX (only for M-series chips). If you have Apple Silicon – MLX is usually chosen by default when an MLX build exists for the model.
Why there's a speed difference at all
This isn't about marketing, but architecture. MLX is a framework that Apple developed specifically for the unified memory architecture of M-series chips, where the CPU and GPU share the same memory instead of separate pools like in traditional PCs with discrete graphics cards. MLX directly accesses the Metal runtime, bypassing the overhead of GGUF format quantization.
The speed difference is measured, not estimated: the MLX engine is usually 30-50% faster than llama.cpp via Metal on the same hardware – this is confirmed by independent tests, and even by Ollama itself, which was previously purely GGUF-oriented but eventually recognized the advantage and added its own MLX engine. Some narrow tests on specific models (e.g., Gemma 4) show a difference closer to 10-20% – the actual gain depends on the specific model, context size, and how well the MLX build for that model is optimized.
In practice, this means one simple thing: the same model in MLX format will give you noticeably more tokens per second than the GGUF version of the same model on the same Mac. If you're on an M-series and have a choice – MLX is almost always more beneficial, except when you specifically need a feature that is only available in the GGUF variant (e.g., at the time of writing – image processing for some models in Ollama-MLX tags).
What's worth checking in practice
An important nuance that I've personally verified: LM Studio updates its engines independently of the application itself. If a new model suddenly "doesn't load" or gives a strange error – the first thing to check is Settings → Runtime. An outdated engine is the most common cause of such a problem, much more so than the model itself or insufficient RAM. This is especially relevant immediately after a new model is released – there will be a few days to a week lag until the corresponding MLX engine matures and becomes stable for it, so if a model has just been released and behaves strangely – first check if the engine version is outdated, rather than blaming the model itself.
Another practical detail: sometimes a new model initially only gets support in GGUF via llama.cpp, and a full MLX version arrives later – a pattern we've seen with Gemma 4 and other fresh releases. If you see an error like "model architecture not supported" immediately after a new model is released – it's almost always a matter of time, not your configuration.
🎁 What You Get: GUI, MCP Host, API, Offline
In short – here's the full set of what LM Studio provides out of the box, without any additional settings or plugins:
Feature
What it provides in practice
GUI with built-in model browser
Search and download models directly from Hugging Face without leaving the application – no manual file downloads or format parsing
MCP Host
Connect external MCP servers (filesystem, search, Linear, Notion, Atlassian via OAuth) and make them available to the local model – the model gets real "hands" instead of just text
OpenAI-compatible API at localhost:1234
Any code written for the OpenAI SDK can be switched to a local model by changing only the base URL. There's also an Anthropic-compatible endpoint /v1/messages for those accustomed to the Claude API
Document chat (RAG)
Upload documents and ask questions about their content, without an external pipeline, database, or separate embeddings service
lms CLI and headless daemon (llmster)
For automation without an open application window – for example, on a server, in a Docker container, or in a CI/CD pipeline
Full offline operation
After downloading the model, the internet is no longer needed – even on a plane or in a closed network without internet access
The API compatibility deserves a special mention: the fact that LM Studio supports both OpenAI and Anthropic formats immediately is not a trivial matter. It means you can take an existing project written for the Claude API or GPT, change the base URL to localhost:1234 – and it will work with the local model with practically no code rewriting. This saves real time for prototyping and testing.
🔍 An Honest Nuance: Why Tokens/Sec Numbers Can Be Misleading
Here I want to be as honest as possible, because I've encountered this myself. The speed number that LM Studio shows in the interface during generation doesn't always reflect real performance in long dialogues, and the difference can be dramatic.
The independent benchmark project famstack.dev showed an illustrative example: with a context of ~8500 tokens, LM Studio MLX displayed 57 tokens/second in the UI – this is the number you see during text generation. But the actual *effective* throughput (how much time passed from sending the request to receiving the full response, including processing the entire context) was closer to 3 tokens/second.
The reason is prefill overhead: before the model starts generating new tokens, it first has to "read" and process the entire preceding context. The longer the conversation or document, the longer this phase lasts, and it's this phase, not the generation speed itself, that determines how long you actually wait for a response.
Metric
What it shows
Value at 8500 tokens context
Generation tok/s (in UI)
Speed of generating new tokens – what you see on the screen
~57 tok/s
Effective tok/s (reality)
Output tokens divided by total waiting time (prefill + generation)
~3 tok/s
A practical solution to know: LM Studio MLX by default processes context in chunks of 512 tokens (prefill chunk size). Increasing this value to 4096 or even 8192 can speed up prefill by 1.5-2 times on newer hardware (M3/M4). On older chips like M1, the effect is less pronounced – memory bandwidth is more often the bottleneck there, rather than chunk size.
Practical conclusion: if you plan long agent sessions with a large context (and this is how MCP works – the model constantly keeps tool call results and dialogue history in context) – don't rely on the number from a short test prompt, but check the speed in a scenario that is realistic for you. The "57 tokens per second" figure from the demo on first launch can be misleading about how comfortable it will be to work in a real, long workflow.
🎯 Who is LM Studio for — and when is Ollama still better
This is the question I personally get asked most often — and the honest answer is that it's not an "either/or" contradiction. Both tools ultimately do the same thing: run a model locally and provide an API to it. The difference lies in which path to get there is more convenient for your specific scenario.
Your situation
Recommendation
Why
Want to compare several models visually, switch between GGUF and MLX
LM Studio
Everything is visible immediately in the interface — size, format, downloaded/available models, without memorizing commands
Need an MCP Host with OAuth for Notion, Linear, Atlassian
LM Studio
One-click browser authorization, without manual token management
You are on Apple Silicon and want maximum performance
LM Studio (with a slight advantage)
MLX has been here from the beginning and is more deeply integrated into the UI — although Ollama has also caught up with its own MLX engine
Don't like the terminal, want everything to be visible
LM Studio
The GUI removes the entry barrier — you don't need to remember command syntax
Automating everything via scripts, cron, CI/CD
Ollama
The CLI is more natural for scripts — ollama run model "prompt" in one line without launching the GUI
Already have infrastructure built on Ollama
Ollama
No need to duplicate the setup for minor advantages — for example, I already have it integrated into Spring AI projects via OllamaChatModel, and there's no point in rewriting the configuration for LM Studio
Need the simplest possible command without extra clicks
Ollama
ollama run modelname — and you're already in the chat, without opening windows and navigating menus
In practice, I keep both tools simultaneously — it's not a compromise, but a conscious choice. For quick experiments, comparing several models, or when an MCP with OAuth services is needed — I open LM Studio. For production-like scenarios via Spring AI, where there's already established configuration and automation — I'm still sticking with Ollama. They coexist perfectly on the same Mac simultaneously: LM Studio listens on localhost:1234, Ollama — on localhost:11434, there's no port conflict.
If you're just starting and don't know where to begin — my practical advice: try LM Studio first. The GUI provides a visual understanding of what's happening — what models are available, how much they weigh, how they respond — and this understanding later helps you navigate much better, even if you switch to Ollama for production later.
✅ What you can do with LM Studio right today
Without any code — here are five things you can try immediately after installation to get a working understanding of the tool in one evening, not just "install and forget."
Download your first model through the built-in search — start with something small like Qwen3 7-8B to check that everything works and the model comfortably fits into your memory, before downloading something larger.
Chat — the interface is intuitive, similar to ChatGPT, so there's almost nothing to get used to. Try asking several real work-related questions, not test ones — this way you'll immediately feel the difference between a cloud and a local model in practice.
Connect a document via Document Chat — upload a PDF or notes and ask questions about their content. This is the fastest way to feel that local AI can be truly useful for specific work tasks, not just an interesting experiment.
Connect the first MCP server — for example, the file system, so the model can read files from your disk. This is where the difference between a "chatbot" and an "agent" becomes visible — the model starts doing something real, not just responding with text.
Run a local server with one click and check that localhost:1234 responds to requests — this is the first step to connecting the model to your own code, regardless of whether you're writing in Python, Java, or JavaScript.
None of these five steps require code or a terminal — everything is done with the mouse in the interface. If after this you want to go deeper — connect LM Studio to your own application via API, configure tool calling, or build a local agent — that's where we'll start in the next articles of the series.
In the next article, we'll cover a step-by-step installation guide for Mac — from system requirements (Apple Silicon vs Intel) to the first request via curl and common errors that occur at the start.