If you still think that LLMs are trained like this: "copy the whole internet → press the Train button" – you are wrong by hundreds of millions of dollars.
ChatGPT, Claude, and Gemini undergo three fundamentally different training stages. And the most important one is not pre-training. Spoiler: in 2025–2026, human evaluation of responses will gradually disappear, replaced by automatic verifiers. RLHF is no longer in trend.
Here is a complete guide that explains how it actually works – with numbers, tables, and links to primary sources.
⚡ TL;DR
- ✅ Pre-training: the model reads 10–15 trillion tokens and learns to predict the next word — this is the foundation of everything
- ✅ Post-training (SFT + alignment): turns a "text predictor" into a useful assistant
- ✅ RLHF is outdated: in 2025–2026, a modular stack SFT → DPO → GRPO/RLVR became the standard
- ✅ Cost: GPT-4 — ~$78M, Gemini Ultra — ~$191M in compute (Stanford AI Index 2025)
- 🎯 You will get: an understanding of the full LLM training cycle, real numbers, and the current 2026 stack
- 👇 Below are detailed explanations, tables, and links to primary sources
📚 Article Content
🎯 How LLMs are Trained — in 60 Seconds
Every modern LLM goes through three sequential stages: pre-training (the model learns language on trillions of tokens), supervised fine-tuning or SFT (the model learns to respond as an assistant), and alignment (human or automatic evaluation of responses shapes the final behavior). In 2025–2026, a fourth stage, mid-training, has emerged between pre-training and SFT for specialized data.
The model does not "know" language after pre-training — it simply predicts the next token very well. The transformation into a useful assistant happens in the subsequent stages.
Imagine you are training a new employee. First, they spend years reading books, articles, and documentation — this is pre-training. Then, they undergo an internship where they observe experienced colleagues responding to requests — this is SFT. Finally, managers evaluate their work and provide feedback — this is alignment through RLHF or DPO. The second stage is impossible without the first. Without the third, the model is technically proficient but unpredictable in its behavior.
Why the order of stages is critical
If you skip pre-training and immediately do fine-tuning, the model will not have a basic understanding of language. If you skip alignment, the model may respond technically correctly but dangerously or not as the user expects. Each stage builds on the previous one, and an error at an early stage cannot be corrected without retraining.
- ✔️ Pre-training: language, facts, logic — learned from data
- ✔️ SFT: assistant response format — learned from examples
- ✔️ Alignment: values and behavior — learned from evaluations and comparisons
Conclusion: LLM training is not a single operation but a sequential pipeline with clearly defined roles for each stage.
📌 Scaling laws: why more is indeed better
What are scaling laws in LLMs
Scaling laws are empirical regularities: model quality predictably increases with the increase in the number of parameters, data volume, and computation. According to Epoch AI, training compute for notable AI models doubles approximately every five months. This explains why training costs $78–191M and why labs don't stop.
Scaling laws are not optimism, but measurable mathematics. If you double the compute, the model quality increases predictably.
In 2020, OpenAI published the first scaling laws for neural networks. The essence: loss (model error) decreases according to a power law with an increase in parameters, data, and compute. That is, if you want a model twice as good, you need not twice, but tens of times more resources.
In 2022, DeepMind refined these laws in the "Chinchilla" paper (Hoffmann et al., 2022). The conclusion: previous models, including GPT-3, were "undertrained" — they had too many parameters relative to the number of tokens. The optimal ratio is approximately 20 tokens per parameter. GPT-3 (175B parameters) should have been trained on ~3.5T tokens, not 300B.
Why this explains training costs
Modern frontier models deliberately violate the Chinchilla optimum in favor of more tokens. Llama 3.3, for example, was trained on ~15 trillion tokens — much more than "needed" for optimal training. The reason is pragmatic: a smaller model trained on a larger number of tokens is cheaper to infer with the same quality.
- ✔️ More parameters → better pattern memorization
- ✔️ More tokens → better generalization
- ✔️ More compute → faster convergence to minimum loss
Conclusion: scaling laws are the mathematical basis of the AI "arms race," explaining both billion-dollar budgets and the constant growth in model sizes.
📌 Pre-training: the model reads the entire internet
What happens during LLM pre-training
Pre-training is learning to predict the next token on massive text corpora: CommonCrawl (web pages), books, code, Wikipedia, scientific articles. Modern models process 10–15 trillion tokens. The goal is not to memorize facts but to learn language structure, logic, and cause-and-effect relationships.
Pre-training is not learning to "respond." It's learning to "understand text" through endless gap-filling.
The pre-training task is technically simple: the model sees a sequence of tokens and tries to predict the next one. If it reads "Kyiv is the capital," the model should predict "of Ukraine." The error is compared to the correct answer, and the neural network weights are adjusted. This process is repeated trillions of times.
Where does the data come from? The main source is CommonCrawl: monthly snapshots of billions of web pages. Books (Books3, Project Gutenberg), GitHub (code), Wikipedia, scientific articles (ArXiv, PubMed), forums (Reddit, Stack Overflow) are added to it. Each source undergoes filtering: duplicates, spam, adult content, and texts with errors are removed. For more details on how AI platforms process web data, see our article How Crawling Works in the AI Era.
Why "clean data" is running out
The problem of 2025–2026: high-quality unique text on the internet is running out. According to researchers' estimates, at current consumption rates, available high-quality data for pre-training may be exhausted by 2026–2028. This is one of the reasons why the industry has moved to synthetic data (more on this in section 9).
- ✔️ CommonCrawl is the foundation but requires aggressive filtering
- ✔️ Code is particularly valuable: structured, logical, verified
- ✔️ Mathematical texts improve reasoning even for non-mathematical tasks
📌 Tokenization and Data Curation: How Text Becomes Numbers
What is tokenization and why it's important
Tokenization is the first step after data collection: text is broken down into small pieces (tokens) that the model can process. Data curation is the filtering and cleaning of data before tokenization. Without quality tokenization, even the largest model will be slow and inaccurate.
A token is what the model actually "sees." A human sees "hello," the model sees [243, 567, 12]. Understanding tokens is key to understanding the cost and limitations of LLMs.
Tokenization is the process of converting text into numbers. Since a neural network cannot process letters or words directly, all text is first broken down into tokens, and then each token is assigned a unique ID. The most common algorithm is Byte Pair Encoding (BPE), used by GPT, Llama, Claude, and Gemini.
Data curation is what happens before tokenization: removing duplicates, spam, adult content, PII (personally identifiable information), and normalizing text. For GPT-4, it's estimated that out of 50+ trillion raw CommonCrawl tokens, about 13 trillion remained after filtering.
Why this is important for cost and context
- 🔹 API costs are calculated per token. Ukrainian text costs 2–3 times more than English because it takes up more tokens.
- 🔹 Context window is limited by tokens. English can fit 2–3 times more meaning than Cyrillic.
- 🔹 Understanding quality depends on how well the text is broken down into meaningful units.
For more details on how tokenization works, why one word can cost 1 or 10 tokens, what glitch tokens are and how they break GPT, as well as complete API price tables in 2026 — read our separate article:
What are Tokens in ChatGPT, Claude, and Gemini: How AI Sees Your Text and How Much It Costs (2026).
Section Conclusion: tokenization is not a technical detail but the foundation of LLM economics. Understanding tokens helps optimize API costs, use the context window more effectively, and avoid unexpected model behavior.
📌 Mid-training: the hidden stage between pre and post
What is mid-training in LLMs
Mid-training is a relatively new stage that emerged in 2024–2025 between pre-training and post-training. The model processes highly specialized data (mathematics, code, synthetic reasoning sequences) using the same algorithm as pre-training, but on smaller and higher-quality corpora. Meta uses a separate mid-training stage with synthetic reasoning data for Llama 4.
Mid-training is "fine-tuning" after coarse pre-training: the model already knows language, now it's shown how to think step by step.
The concept of mid-training emerged as a response to a practical problem: post-training (SFT + RLHF) is effective for model behavior but poorly develops deep reasoning abilities. And adding mathematical problems to pre-training is inefficient due to their small share in the overall corpus.
The solution: after the main pre-training, run another training round — smaller in volume, but higher quality and more thematic. This is how Meta prepares Llama 4 for reasoning tasks: a separate mid-training on synthetic step-by-step reasoning before the final post-training.
How mid-training differs from fine-tuning
Fine-tuning changes behavior and response format. Mid-training changes internal representations — "what" the model knows, not "how" it responds. Technically, it's the same next-token prediction algorithm, but on different data and for fewer steps.
Section Conclusion: mid-training is the new standard for frontier models, allowing "embedding" reasoning abilities without redoing the entire pre-training.
📌 SFT: how a "predictor" becomes an assistant
What is SFT in LLM training
Supervised Fine-Tuning (SFT) is training on "prompt → quality response" pairs prepared by humans or stronger models. After pre-training, the model can generate text but doesn't know the assistant format. SFT teaches it: to answer questions, not continue text; to be helpful, not just plausible.
SFT is the difference between "a model that can write anything" and "a model that answers your request."
After pre-training, if you write "How to make an omelet?", the model might respond by continuing in the style of a cooking blog, a Wikipedia article, or a recipe as a list of ingredients — depending on what was most common in the training data. SFT fixes the format: the response should be direct, helpful, and in a conversational format.
SFT data consists of thousands or tens of thousands of "prompt → response" pairs. They are prepared by human annotators (expensive) or generated using stronger models (cheaper, but with the risk of inheriting errors). OpenAI used ~13K SFT examples for the first InstructGPT. Modern models use hundreds of thousands or more.
Instruction tuning as a type of SFT
Instruction tuning is SFT where prompts are formulated as explicit instructions ("Translate this text," "Write a summary," "Correct the errors"). This is what turns a basic language model into a "useful assistant." Google's FLAN and OpenAI's InstructGPT are the first large-scale examples of this approach.
- ✔️ SFT teaches response format and tone
- ✔️ Instruction tuning teaches following specific commands
- ✔️ Without SFT, the model is technically proficient but "doesn't understand" what is wanted
Section Conclusion: SFT is a relatively inexpensive stage (compared to pre-training) but critically important: it's what makes the model an "assistant" and not just a text generator.
📌 RLHF: human evaluation as a training signal
How RLHF works
RLHF (Reinforcement Learning from Human Feedback) is a method where humans compare several model responses and choose the better one. From these comparisons, a reward model is trained — a separate neural network that has learned to predict human preferences. Then, the main model learns through RL to maximize the reward model's score. It was RLHF that transformed GPT-3 into ChatGPT.
RLHF solved a problem that SFT cannot: teaching the model not just to "respond correctly," but to respond in a way that humans find useful.
The RLHF mechanism consists of three steps. First, annotators see the same prompt with two or more model response options and choose the better one. Second, a reward model is trained on these comparisons — it predicts which response a human would choose. Third, the main model learns through the PPO (Proximal Policy Optimization) algorithm to generate responses that the reward model scores highly.
OpenAI showed an impressive result: a 1.3B parameter model trained via RLHF outperformed a 175B parameter model trained only via SFT. This means that alignment is more important than size for practical usefulness.
Reward model — the invisible judge
The reward model is a separate neural network trained to predict human ratings. It sees a prompt and a response and outputs a number — how "good" that response is. During RLHF, the main model tries to maximize this score without deviating too much from the base SFT version (this is controlled by a KL-divergence penalty).
Why RLHF is expensive and complex
Classic PPO-based RLHF requires holding four large models in memory simultaneously: the main model (policy model), a frozen copy of the SFT model (reference model), the reward model, and a critic/value model. For frontier models with billions of parameters, this requires thousands of GPUs and specialized infrastructure. Human annotators add significant costs: it's estimated that 600 high-quality annotations cost about $60,000.
- ✔️ RLHF teaches the model human preferences, not just correct answers
- ✔️ The reward model replaces humans during training — but is itself trained on human evaluations
- ✔️ PPO requires 4 models in memory — the main reason why alternatives are sought
Brief comparison: RLHF vs RLVR
| Method |
What it optimizes |
Limitations |
RLHF (Reinforcement Learning from Human Feedback) |
Human preferences — subjective quality, tone, style, safety |
Subjectivity Different annotators have different opinions. Expensive and slow. |
RLVR (Reinforcement Learning with Verifiable Rewards) |
Objective reward — correctness of math, code, precise facts |
Limited domains Works only where an automatic verifier exists (math, code, structured tasks). |
Conclusion: RLHF is better for creative and subjective tasks (writing texts, tone of voice, safety). RLVR is for tasks with a clearly correct answer (math, programming, logic). In 2025–2026, the industry is moving towards a combination of both approaches.
📌 DPO, GRPO, and RLVR: The Post-RLHF Era of 2025–2026
What Replaces RLHF in 2026
In 2025–2026, classic RLHF is no longer the dominant method. The modern stack: SFT for base alignment → DPO or SimPO for preference alignment → GRPO/DAPO with verifiable rewards for reasoning. DPO eliminates the need for a separate reward model. RLVR (Reinforcement Learning with Verifiable Rewards) replaces human annotators with automatic verifiers for math and code.
The "pretrain → RLHF with human labels" recipe is no longer the standard. Every major model of 2025 uses a different post-training stack.
DPO: Alignment Without a Reward Model
Direct Preference Optimization (Rafailov et al., 2023) solves the same problem as RLHF but without a separate reward model and without RL optimization. DPO frames the alignment task as classification: the model sees pairs of (chosen response, rejected response) and learns to directly increase the probability of the chosen one. The result is comparable to RLHF but 40–75% cheaper in compute. Meta uses DPO as part of Llama 4's alignment stack.
GRPO: RL Without a Critic Model
Group Relative Policy Optimization (DeepSeek, 2024) is an algorithm that replaces PPO in RLHF. Instead of a separate critic/value model, GRPO samples multiple responses to a single query and compares them. This removes one of the four models from memory while maintaining or improving quality. GRPO is already used in NVIDIA's Nemotron 3 Super and DeepSeek R1.
RLVR: Verifier Instead of Human
Reinforcement Learning with Verifiable Rewards is the most significant change of 2025. The idea is simple: for math, code, and structured tasks, human evaluation is not needed—an automatic verifier is sufficient. A unit test or a math checker provides a binary signal (correct/incorrect)—faster, cheaper, and more stable than human feedback. DeepSeek R1-Zero was trained purely through RLVR without any SFT examples—and the model independently developed self-reflection and chain-of-thought capabilities.
DAPO: RLVR for Long Responses
DAPO from ByteDance/Tsinghua (2025) addresses a specific problem: GRPO's instability when training reasoning models with long chain-of-thought responses. DAPO trained Qwen2.5-32B to 50 points on AIME 2024, surpassing DeepSeek-R1-Zero with 50% fewer training steps. The system is fully open-source.
| Method |
Reward model |
Critic model |
Human labels |
2026 Application |
| PPO-RLHF |
✅ Required |
✅ Required |
✅ Required |
Rarely, only in large labs |
| DPO |
❌ Not required |
❌ Not required |
✅ Required (pairs) |
Standard for alignment |
| GRPO |
✅ Required |
❌ Not required |
Partially |
Reasoning models |
| RLVR |
❌ Verifier |
❌ Not required |
❌ Not required |
Math, code, reasoning |
Conclusion: The modern alignment stack is modular: SFT → DPO → GRPO/RLVR. Each component solves a specific problem and can be replaced depending on budget and goals.
📌 Data Contamination: When the Test Enters Training
What is Data Contamination in LLMs
Data contamination is a situation where test examples from benchmarks end up in the model's training data. The result: the model shows high scores not because it's "smart," but because it "saw the answers." This is a serious problem for evaluating the real capabilities of LLMs in 2025–2026.
MMLU 95% doesn't always mean "smart model." Sometimes it means "the model saw these questions during training."
The problem is systemic: CommonCrawl contains billions of pages, including forums where people discuss benchmark questions, academic sites with test samples, and repositories with datasets. Quality filtering is difficult: formally, an MMLU test might be quoted in an article that passes all quality filters.
By the way, Common Crawl itself (where data for training GPT-5, Gemini, and other models is sourced) actively scans sites through its bot, CCBot. If you want your content to be included in AI knowledge bases, not just contamination tests, you should understand how this crawler works. For more on why CCBot visits even new sites, how Harmonic Centrality influences indexing priorities, and whether to block AI bots—read my article: The Era of AI Crawlers: How CCBot Turns Your Site into a Knowledge Base for GPT-5 and Gemini.
In 2024–2025, several independent studies found signs of contamination in top models. Meta and Google publish "contamination reports" along with Llama and Gemini releases—but the detection methodology remains a subject of discussion.
How to Check for Contamination Yourself
You don't need access to the model's training data to suspect contamination. Here are three practical methods:
- 🔍 n-gram overlap: If the model's answer to a test question contains unique phrases from the training dataset (e.g., the exact wording from arXiv or GitHub), it's a red flag. The longer the match, the higher the probability of contamination.
- 📅 Questions after knowledge cutoff: Ask the model questions about an event that definitely occurred after its stated cutoff. If it answers with exact dates, details, or quotes—something is amiss. A "clean" model should say "I don't know" or "That's after my training date."
- 🎯 Membership inference attacks: A more complex method: compare the model's behavior on questions that were almost certainly in the training data (e.g., the first lines of "Hamlet") with those that definitely weren't (recent scientific preprints). A difference in confidence or accuracy can indicate contamination.
How Developers Are Combating This
Solutions: new benchmarks updated after each major release (LiveBench), "dynamic benchmarks" with new task generation (LiveCodeBench), private test sets that are not published before evaluation. For more on how AI platforms process and filter data—read the article How AI Platforms Choose Sources.
What to Do If You Suspect Contamination
First, don't trust a single benchmark. Compare models across 3-5 different tests, preferably from different domains. Second, test the model on your own, non-public data. Third, pay attention to models that publish their contamination reports (Meta, Google)—this is a sign of integrity, but not a guarantee of purity.
Section Conclusion: Data contamination is a systemic problem that complicates objective model comparison. Relying solely on benchmarks is a flawed strategy. Always test the model on your own data and use multiple independent evaluation sources.
📌 Synthetic Data: New Fuel for LLMs
Why LLMs Need Synthetic Data
Synthetic data consists of training examples generated by other AI models, not written by humans. It addresses the deficit of high-quality real data, allows for the generation of infinite examples for rare tasks, and is the foundation of the new learning era of 2025–2026. However, pure synthetic data doesn't outperform real data—optimal mixtures are Real + Synthetic.
Synthetic data doesn't replace real data—it complements it where real data is scarce.
Microsoft's Phi series (Phi-1, Phi-2, Phi-3) was the first to show that a small model trained on "textbook-quality" synthetic data could compete with much larger models on real data. Phi-4 (14B parameters) outperforms models 3–4 times larger on many reasoning benchmarks.
The study Demystifying Synthetic Data in LLM Pre-training (2025) provides a practical conclusion: a mixture of synthetic and real data (33–67% synthetic) consistently outperforms both pure synthetic and pure real data separately. Complete replacement of real data with synthetic leads to "model collapse"—gradual degradation of quality.
RLVR + Synthetic Data = Closed Loop
The most promising direction for 2026: a model generates its own training tasks, evaluates responses via a verifier (RLVR), and learns from the results. No humans in the loop. This is how DeepSeek R1 achieved results comparable to o1, with significantly lower human labeling costs.
Conclusion: Synthetic data is not a substitute for real data but a necessary supplement. The optimal approach for 2026 is curated real data + targeted synthetic data for the model's weak spots.
📌 How Much Does Training Cost: Real Figures
How Much Does It Cost to Train a Frontier LLM
According to the Stanford AI Index 2025 and Epoch AI: GPT-4 cost ~$78M in compute, Gemini Ultra—~$191M, Meta Llama 3.1 405B—~$170M. This is compute only; considering R&D personnel and infrastructure, the real figures are higher.
Compute cost doubles every five months. But inference cost drops 9–900 times per year—thanks to quantization and MoE.
| Model |
Compute cost (estimate) |
Parameters |
Source |
| Original Transformer (2017) |
~$900 |
65M |
Stanford AI Index |
| GPT-3 (2020) |
~$4.6M |
175B |
OpenAI / Epoch AI |
| GPT-4 (2023) |
~$78M |
Unknown (≈1.8T) |
Stanford AI Index 2025 |
| Gemini Ultra (2023) |
~$191M |
Unknown |
Stanford AI Index 2025 |
| Llama 3.1 405B (2024) |
~$170M |
405B dense |
Epoch AI |
| DeepSeek V3 (2024) |
$5.6M (claimed)* |
671B MoE (37B active) |
DeepSeek |
*This is only the compute cost for the final pre-training run on an H800 GPU cluster. It does not include costs for prior experiments, failed runs, R&D personnel, infrastructure, and data curation. According to independent analysts, the real total cost is 3–5 times higher.
Why Inference is Getting Cheaper While Training Gets More Expensive
An industry paradox: the cost of training frontier models increases 2–3× annually, but the cost of querying a model drops dramatically. According to the Stanford AI Index 2025, a query to a GPT-3.5 level model dropped from $20 per million tokens in November 2022 to $0.07 in October 2024—a 280-fold decrease in 18 months.
How Quantization Changes LLM Economics
Quantization is storing model weights not in 16-bit (FP16) but in 4- or 8-bit precision (GPTQ, AWQ, GGUF methods). This allows:
- 🚀 Running Llama 3 70B on a single consumer GPU (24GB VRAM) instead of a cluster of 8×A100s
- 🚀 Reducing inference cost by 5–10 times with minimal quality loss (1–2% on benchmarks)
- 🚀 Running LLMs on CPUs (via llama.cpp) for tasks without latency requirements
It is thanks to quantization that inference costs have dropped 280 times—the same hardware now runs models that were inaccessible a year ago.
Why DeepSeek V3 is So Cheap: MoE + Optimizations
DeepSeek V3 achieved $5.6M due to three factors:
- ⚡ MoE architecture: 671B parameters, but only ~37B are activated per token—less compute per step
- ⚡ FP8 training: Using 8-bit precision instead of standard FP16/BF16—half the memory and compute
- ⚡ Chinese electricity and hardware prices: H800s are cheaper there than H100s in the US/Europe
For information on the cost of using various models via API in 2026, see our detailed article on AI costs (link will be available after publication).
- ✔️ Training frontier models: $78–191M and more for compute alone
- ✔️ Fine-tuning open models: $50K–$500K (or $10–$100 with LoRA on a single GPU)
- ✔️ Inference: from $0.03 (DeepSeek V3) to $15 (o1) per million tokens depending on the model
Section Conclusion: Training frontier models is becoming more expensive, but access to already trained models is becoming cheaper thanks to quantization and MoE. For most businesses, inference cost is more important than training cost. And if you want to fine-tune, LoRA on a single GPU costs less than a restaurant dinner.
📌 Knowledge Cutoff: Why AI is "Frozen in Time"
What is Knowledge Cutoff in LLMs
Knowledge cutoff is the date after which a model has no knowledge of world events. This is a direct consequence of pre-training: the model learns from a static dataset collected up to a certain point. After training is complete, the weights are frozen. ChatGPT doesn't know about yesterday's news any more than a book printed a year ago.
Knowledge cutoff is not a bug but an architectural feature. The model hasn't "forgotten" new events—it simply never saw them.
After pre-training, the model is "frozen": its weights are fixed, and new information does not automatically enter them. If an important event occurs—an election, a scientific discovery, a new product—the model doesn't know about it if this information appeared after the cutoff.
Current knowledge cutoffs as of 2026: Claude Sonnet 4.5—early 2025, GPT-4o—October 2023, Gemini 2.5 Pro—early 2025. Models are not updated continuously—a new major release comes out every few months or less often.
How the Problem is Solved: RAG and Web Search
Two main solutions. The first is Retrieval-Augmented Generation (RAG): before responding, the model receives relevant documents from an up-to-date knowledge base and uses them in context. The second is web search: the model can search for information in real-time (like ChatGPT with Search enabled or Perplexity). For more on the difference between LLMs and RAG—read the article LLM vs RAG in 2026.
Conclusion: Knowledge cutoff is a fundamental limitation of static training. For tasks requiring up-to-date information, either RAG or web search on top of an LLM is needed.
📌 Open-source vs Closed Models: When to Choose What
Open-source or Closed LLM — What's Better
The choice depends on the task, budget, and privacy requirements. Closed models (GPT, Claude, Gemini) offer better out-of-the-box results but have more expensive inference and provider dependency. Open-source (Llama 4, Mistral, DeepSeek, Qwen) provides full control, the possibility of local deployment, and zero inference cost, but requires a technical team.
By 2026, the quality gap between open and closed models will have significantly narrowed. Llama 4 Scout competes with GPT-4o class on many tasks.
As recently as 2022, open-source models were significantly weaker than closed ones. In 2023–2024, Llama 2, Mistral, and DeepSeek substantially closed the gap. By 2025–2026, Meta's Llama 4 and DeepSeek V3/R1 will compete with top closed models on most practical tasks.
| Criterion |
Open-source (Llama 4, DeepSeek, Mistral) |
Closed (GPT, Claude, Gemini) |
| Inference Cost |
$0 (local) or very cheap (API) |
$1.25–$15 / 1M tokens |
| Data Privacy |
Full (local deployment) |
Data passes through the provider |
| Customization |
Fine-tuning, full control |
Limited (prompt-level or fine-tuning API) |
| Out-of-the-box Quality |
Very good (2026) |
Best (frontier models) |
| Technical Requirements |
DevOps/ML team required |
API key + a few lines of code |
When to Choose Open-source
Local deployment via Ollama is justified if you have privacy requirements (medicine, finance, law), a large volume of requests where inference cost is critical, or a need for fine-tuning for a specific domain task.
For budget configurations: useful models can be run even on modest hardware. For details on which models work on laptops with 8 GB RAM, what tasks they solve (code, text, reasoning), and how to get the most out of limited resources — read the article Ollama on 8 GB RAM: Which Models Work in 2026.
For a general overview of local AI, comparison with cloud solutions, and use cases — read the article Ollama in 2026.
Section Conclusion: in 2026, there is no single winner — there is the right tool for the specific task. Closed models are for quick starts and maximum quality. Open-source is for control, privacy, and scale.
📌 Mixture of Experts (MoE): Why More Parameters Don't Mean More Expensive Inference
What is MoE in LLMs
Mixture of Experts (MoE) is an architecture where a model consists of many "experts" (separate MLP blocks), but only a small subset of them is activated for each token. This allows for a model with hundreds of billions of parameters (DeepSeek V3 — 671B), but inference costs as much as a model 5–10 times smaller. MoE is one of the main reasons why inference costs are falling faster than model sizes are growing.
Without MoE, we would have hit a ceiling long ago: a trillion-parameter model would cost $1000 per million tokens. MoE makes large models economically viable.
Imagine having 100 specialists instead of one giant brain. For a physics question, you would activate only 2-3 physicists, not all 100. MoE works similarly: for each token, a "gating network" decides which 1-2 experts (out of dozens or hundreds) will receive the data. The others remain inactive.
Who Uses MoE in 2026:
- ✔️ DeepSeek V3 / R1 — 671B parameters, ~37B activated per token
- ✔️ Mixtral 8x7B / 8x22B — 8 experts, 2 activated
- ✔️ GPT-4 (according to unconfirmed reports) — 16 experts, 111B parameters, 2 activated
- ✔️ Qwen 2.5-MoE — 64 experts, 14B activated
Why MoE is a Game Changer
Before MoE, if you wanted a better model, you increased parameters (GPT-3: 175B) and got a linear increase in inference cost. MoE breaks this connection: you can have 671B parameters (DeepSeek V3), but inference costs as much as for ~37B parameters. This is 5–18 times cheaper.
A simple example: if DeepSeek V3 were a dense model, its inference would cost ~$15-20 per million tokens. The actual price of DeepSeek API is ~$0.27 (input) / $1.10 (output). This is precisely thanks to MoE.
The Downside: MoE Training is More Complex
MoE is excellent for inference but creates challenges during training:
- ⚡ Uneven Load: some experts may be "more popular" than others, requiring additional loss functions for balancing
- ⚡ Higher Memory: all 671B parameters still need to fit into GPUs (or be distributed across devices)
- ⚡ Fine-tuning: standard fine-tuning works worse; special methods are needed (MoE-specific LoRA, or fine-tuning only the gating network)
Conclusion: MoE is the "secret weapon" of large models in 2025–2026. It explains how DeepSeek competes with GPT-4o at 10× lower cost, and why open-source models can be huge yet accessible.
❓ Frequently Asked Questions (FAQ)
How long does it take to train a GPT-like model?
Pre-training a frontier model takes from several weeks to several months on a cluster of thousands of GPUs. GPT-4, by estimates, was trained for several months on thousands of A100s. The full cycle from the start of pre-training to release is 6–18 months, including post-training, evaluations, and safety tests.
Can I train my own LLM from scratch?
Technically, yes, but it's economically accessible only to large organizations. Pre-training a small model (7B parameters) costs from $50K to $500K. For most businesses, it's more rational to take an open-source base model (Llama 4, Mistral) and fine-tune it for their task — this costs from a few hundred to a few thousand dollars.
What is fine-tuning and how does it differ from training from scratch?
Fine-tuning is the further training of an already trained model on a new, narrow dataset. The model retains knowledge from pre-training but adapts to a new task or style. Unlike training from scratch, fine-tuning requires orders of magnitude less data and compute. LoRA and QLoRA allow fine-tuning even on consumer GPUs.
Why doesn't ChatGPT know current news?
Due to knowledge cutoff: the model is trained on data up to a certain date and does not automatically receive new knowledge. ChatGPT solves this by integrating web search (Search), but the base model remains "frozen." Without search or RAG, an LLM won't know about events after the cutoff.
Are synthetic data safe for training?
Synthetic data is safe when used correctly — as a supplement to real data. Complete replacement of real data with synthetic data leads to "model collapse": each subsequent generation of the model degrades slightly because it learns from the outputs of the previous one. The optimal solution is a mix of real and synthetic data in a 33–67% synthetic proportion.
What is RLVR and how is it better than RLHF?
RLVR (Reinforcement Learning with Verifiable Rewards) uses an automatic verifier instead of human evaluators. For tasks with a clear correct answer (math, code, logic), the verifier is faster, cheaper, and more stable than a human. RLHF remains necessary for tasks without a clear "correct answer" — creative writing, subjective preferences, tone nuances.
Why train a model on code if it's text-based?
Code is particularly valuable data even for general LLMs. It is structured, logical, verifiable (code either works or it doesn't), and contains concentrated cause-and-effect relationships. Models trained on a larger proportion of code show better results on reasoning tasks even outside of programming.
✅ Conclusions
- 🔹 LLM training is a four-stage pipeline: pre-training (language) → mid-training (specialization) → SFT (format) → alignment (behavior). None can be skipped.
- 🔹 Classic RLHF is dead in 2025. It has been replaced by a modular stack: DPO for alignment, GRPO/RLVR for reasoning — cheaper, faster, and more stable.
- 🔹 Cost of training frontier models: $78–191M in compute (Stanford AI Index 2025). However, inference costs are falling 9–900 times per year — thanks to quantization and MoE.
- 🔹 Synthetic data is not a panacea. Optimal mix: 33–67% synthetic + the rest real data. Pure synthetic leads to "model collapse."
- 🔹 Knowledge cutoff is not a bug, but a feature. The model doesn't "forget" news — it never saw it. Solved via RAG or web search.
- 🔹 Open-source has almost caught up to closed models. Llama 4, DeepSeek, and Qwen compete with GPT-4o and Claude on most practical tasks.
Main takeaway: LLM training in 2026 is not a monolithic process, but a modular stack where each component evolves separately. Understanding this stack allows not only to choose the right model for the task but also to critically evaluate marketing claims about the "best model in the world."
🎯 Sharp thesis to remember: If not for safety regulations and legal risks, closed models would have lost their relevance for 80% of business tasks. Open-source is cheaper, controlled, and almost as high-quality. Choosing a closed model today is often a choice of convenience and brand, rather than technical superiority.
Next article in the series: LLM Context Window — why AI forgets and how much it costs.
Also read:
Embeddings in Simple Terms: How AI Understands Meaning, Not Just Words — a fundamental guide on how text is converted into numbers and why this is the basis of RAG and semantic search.
Embedding Models for RAG in 2026 — a comprehensive guide to choosing with a comparison of 10+ models, prices, and a real-world case.