TL;DR in 30 seconds: DeepSeek V4 Flash is a MoE model with 284B parameters (13B active), a 1M token context, and an MIT license. Released on April 24, 2026. Costs $0.14/$0.28 per million tokens — cheaper than Claude Haiku 4.5, Gemini 3.1 Flash, and GPT-5.4 Nano. Available via Ollama Cloud on NVIDIA Blackwell without downloading 160GB weights. Details below.
How I Learned About This Release
On the morning of April 25th, an email arrived from Ollama: "DeepSeek-V4-Flash is now available to run on Ollama's cloud using the latest NVIDIA Blackwell hardware." Just like that — no big announcements, just an email from a service I use daily for local model deployment.
I've been following DeepSeek since R1: back then, the model sent NVIDIA's stock tumbling and upended assumptions about what frontier-class training has to cost. V4 was long-awaited and delayed several times. And here it is.
This article is not a press-release summary. I'll try to break down what actually matters for a developer building LLM-based products, like the RAG system I'm building.
Context: What Came Before V4
If you've only followed DeepSeek superficially, here's a brief timeline:
- December 2024: DeepSeek V3 — the first open-weights model that truly competes with GPT-4o in quality.
- January 2025: R1 — a reasoning model on par with OpenAI's o1, trained for a fraction of what competitors spend. NVIDIA's market cap dropped by hundreds of billions of dollars.
- December 2025: V3.2 — an evolutionary update with 671B parameters.
- April 24, 2026: V4 Flash and V4 Pro — a new architecture, not just "more parameters."
It's important to understand: V4 is not V3.2+. It's a new architecture with a fundamentally different approach to long context. Details below.
Flash vs Pro: Two Different Products
DeepSeek released two models simultaneously, and they are often confused. Here are the main differences:
| Parameter | V4 Flash | V4 Pro |
|---|---|---|
| Parameters (Total) | 284B | 1.6T |
| Active per Token | 13B | 49B |
| Context | 1M tokens | 1M tokens |
| Max Output | 384K tokens | 384K tokens |
| Weights (HuggingFace) | 160 GB | 865 GB |
| Input (cache miss) | $0.14/M | $1.74/M |
| Input (cache hit) | $0.028/M | $0.145/M |
| Output | $0.28/M | $3.48/M |
| License | MIT | MIT |
The key insight hidden in these numbers: at cache hit, the input prices differ by roughly 5x ($0.028 vs $0.145), but Flash's output is over 12 times cheaper ($0.28 vs $3.48). For most production tasks, output constitutes the main part of the cost. This means Flash is not a "cheap version" of Pro, but a separate product for a different class of tasks.
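To make the output-dominance point concrete, here is a back-of-the-envelope sketch. Only the per-million-token prices come from the table above; the traffic mix (100M input tokens at an 80% cache-hit rate, 20M output tokens per month) is a hypothetical workload:

```python
# Rough monthly cost comparison for a hypothetical workload.
# Prices ($ per million tokens) are from the pricing table; the
# token volumes and cache-hit rate below are illustrative assumptions.
def monthly_cost(input_miss, input_hit, output, hit_rate=0.8,
                 input_tokens_m=100, output_tokens_m=20):
    """Volumes are in millions of tokens per month."""
    input_cost = input_tokens_m * (hit_rate * input_hit
                                   + (1 - hit_rate) * input_miss)
    output_cost = output_tokens_m * output
    return input_cost + output_cost

flash = monthly_cost(0.14, 0.028, 0.28)  # V4 Flash pricing
pro = monthly_cost(1.74, 0.145, 3.48)    # V4 Pro pricing
print(f"Flash: ${flash:.2f}/mo, Pro: ${pro:.2f}/mo, ratio: {pro/flash:.1f}x")
```

For this mix, Flash comes out to roughly $10.6 a month against roughly $116 for Pro, an ~11x gap driven almost entirely by output pricing.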
Prices confirmed by official documentation: api-docs.deepseek.com/quick_start/pricing
Also an important note from the official documentation: the old names deepseek-chat and deepseek-reasoner will be deprecated. They now correspond to deepseek-v4-flash in non-thinking and thinking modes. If you have old code, plan your migration by July 24, 2026.
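If your code selects models by name, the deprecation can be handled with a small shim. A hedged sketch: the model names and thinking-mode shapes are taken from this article, so verify them against the current API reference before migrating:

```python
# Hedged sketch: map legacy DeepSeek model names to the new
# model + thinking-mode pair. Names follow the deprecation note above;
# confirm against DeepSeek's own migration docs before relying on this.
LEGACY_MAP = {
    "deepseek-chat": ("deepseek-v4-flash", {"type": "disabled"}),
    "deepseek-reasoner": ("deepseek-v4-flash", {"type": "enabled"}),
}

def migrate(model: str) -> tuple:
    """Return (new_model, thinking_config) for a legacy model name."""
    if model in LEGACY_MAP:
        return LEGACY_MAP[model]
    return model, {}  # already a new-style name; leave thinking unset

print(migrate("deepseek-chat"))
```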
Architecture: What's Really New
Most reviews at this point copy three lines from the tech report and move on. I'll try to explain what these changes mean practically — for a developer who needs to understand not "what is the model's architecture," but "why it behaves this way and what I should do about it."
DeepSeek V4 has three key architectural innovations: Hybrid Attention (CSA + HCA), Manifold-Constrained Hyper-Connections, and the Muon optimizer. Let's break down each.
Hybrid Attention: CSA + HCA
To understand why this is needed, first — the problem it solves.
In a standard transformer, the self-attention mechanism grows quadratically with context length. This means: if you double the context length, computations increase fourfold. At 1M tokens, standard attention becomes practically impossible — both in terms of inference cost and memory for the KV cache.
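The quadratic growth is easy to see with a toy FLOPs estimate (the dimensions here are illustrative, not V4's actual ones):

```python
# Rough FLOPs for one self-attention layer: the QK^T score matrix and
# the scores @ V product are both n x n x d matmuls, so cost grows as n^2.
def attention_flops(n_tokens: int, d_model: int = 4096) -> int:
    return 2 * (n_tokens ** 2) * d_model

# Doubling the context quadruples the attention compute:
print(attention_flops(256_000) / attention_flops(128_000))  # -> 4.0
```

At 1M tokens, this n² term dominates everything else in the layer, which is exactly the regime that CSA and HCA attack.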
DeepSeek V4 solves this through two complementary mechanisms:
CSA (Compressed Sparse Attention) — instead of each token "looking" at all other tokens in the context, CSA selectively focuses on the most relevant parts. It's similar to how an experienced reader scans a long document: they don't read every word but know where to find the important information. For most tokens in a long context, full attention is excessive; CSA cuts out this excess.
HCA (Heavily Compressed Attention) — goes further and aggressively compresses the KV cache, storing a compressed representation instead of the full one. A smaller KV cache means less GPU memory and faster inference with long contexts.
The combined effect: with a 1M token context, DeepSeek V4 Pro uses only 27% of the FLOPs and 10% of the KV cache compared to V3.2. Flash, with its 13B active parameters, is even more efficient than Pro.
What this means practically for you:
- RAG with large chunks: Instead of aggressive chunking into 512–1024 tokens, you can pass larger document segments. Less context loss at chunk boundaries — potentially better response quality.
- Analyzing large codebases: 1M tokens can realistically be an entire repository. Previously, this was a marketing figure; now, at $0.028/M on cache hit, it's a real option.
- Long conversations: The model can retain the entire conversation context without forced history truncation.
An important caveat: CSA and HCA are approximations. In theory, the model might miss something important in a very long context where relevant details are scattered throughout the document. In practice, DeepSeek reports 83.5% on MRCR 1M (needle-in-a-haystack at 1M tokens) — a strong result, but not 100%. For critical tasks where "not missing anything" is crucial — test on your own data.
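That caveat is testable. Below is a minimal needle-in-a-haystack harness you can point at your own data; `ask_model` is a placeholder for your actual API call, and all the strings here are made up:

```python
# Minimal needle-in-a-haystack check for long-context recall.
# `ask_model` is a stand-in for your real model call; swap in an API
# client before using this on actual data.
def make_haystack(filler: str, needle: str, n_copies: int, position: float) -> str:
    """Bury `needle` at a relative `position` (0.0-1.0) inside repeated filler."""
    chunks = [filler] * n_copies
    chunks.insert(int(position * n_copies), needle)
    return "\n".join(chunks)

def needle_recall(ask_model, filler, needle, question, answer, positions):
    """Fraction of needle positions at which the expected answer came back."""
    hits = 0
    for pos in positions:
        prompt = (make_haystack(filler, needle, n_copies=1000, position=pos)
                  + f"\n\nQuestion: {question}")
        if answer.lower() in ask_model(prompt).lower():
            hits += 1
    return hits / len(positions)

# Interface demo with a stub "model" that never finds anything:
score = needle_recall(lambda prompt: "no idea",
                      "Lorem ipsum dolor sit amet.",
                      "The vault code is 7421.",
                      "What is the vault code?",
                      "7421",
                      positions=[0.1, 0.5, 0.9])
print(score)  # -> 0.0 for this always-failing stub
```

Run it with needles at several depths and with your real documents as filler; a score well below 1.0 on your own data is a much stronger signal than any self-reported benchmark.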
Source: huggingface.co/deepseek-ai/DeepSeek-V4-Flash
mHC: Manifold-Constrained Hyper-Connections
In a standard transformer, each layer adds its representation to the previous one via a residual connection — simple addition. This simple operation has been both a strength and a weakness: it allows gradients to flow back during training (solving the vanishing gradient problem) but doesn't let layers "negotiate" how to combine their representations.
mHC replaces simple addition with a more expressive mechanism where each connection between layers can have its own trainable parameters. The "manifold constraint" is a mathematical condition that prevents these weights from diverging during training, maintaining stability.
The practical effect for the end-user is twofold:
- More stable quality on complex tasks: Standard residual connections sometimes lead to "dips" — a query is similar to a previous one, but the response is suddenly worse. mHC reduces this variability through better signal stabilization between layers.
- Improved quality with large reasoning budget: When the model "thinks" for a long time (Think Max mode), it's important that the signal doesn't degrade in deeper layers. mHC directly addresses this problem.
For regular API usage, you won't "see" mHC directly — but this detail explains why Flash-Max in Think Max mode can approach Pro's quality on reasoning tasks, despite its significantly smaller size.
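As a mental model, the difference looks like this. This is a conceptual toy, not DeepSeek's actual mHC parameterization (the tech report describes richer, constrained mixing):

```python
# Toy contrast: standard residual addition vs a hyper-connection-style
# learnable mix. Conceptual sketch only; real mHC uses constrained
# mixing matrices rather than two scalars.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)    # incoming residual stream
fx = rng.standard_normal(8)   # the current layer's output

# Standard transformer: fixed, unweighted addition.
residual = x + fx

# Hyper-connection style: trainable coefficients decide how the two
# streams combine; the "manifold constraint" keeps such parameters
# from drifting into unstable values during training.
alpha, beta = 0.9, 0.4        # stand-ins for learned parameters
hyper = alpha * x + beta * fx

print(residual.shape == hyper.shape)  # same interface, richer mixing
```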
Muon Optimizer
This third innovation relates to the training process, not the model architecture itself. Muon is a next-generation optimizer, an alternative to AdamW used by most modern LLMs.
Technically, Muon orthogonalizes the momentum-smoothed update (it uses Nesterov-style momentum) before applying it, which has two effects: faster convergence during training and less sensitivity to the learning rate. For you as a user, it means one thing: the model is trained better on the same number of tokens. DeepSeek trained both models on 32T tokens — significantly more than V3.2.
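For the curious, the core operation in published Muon variants is orthogonalizing the update matrix with a Newton-Schulz iteration. A toy sketch (the simple cubic polynomial here is illustrative; production implementations use a tuned higher-order variant, and this is not DeepSeek's code):

```python
# Toy Newton-Schulz orthogonalization, the core trick behind Muon-style
# optimizers: replace the update's singular values with ~1 so every
# direction gets a comparable step. Illustrative cubic variant only.
import numpy as np

def newton_schulz_orth(g: np.ndarray, steps: int = 25) -> np.ndarray:
    """Approximately project g onto the nearest semi-orthogonal matrix."""
    x = g / (np.linalg.norm(g) + 1e-7)   # scale so all singular values <= 1
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x  # cubic Newton-Schulz step
    return x

rng = np.random.default_rng(0)
g = rng.standard_normal((64, 32))        # stand-in for a momentum matrix
o = newton_schulz_orth(g)
print(np.allclose(o.T @ o, np.eye(32), atol=1e-2))  # columns ~orthonormal
```

The appeal over a direct SVD is that this is just a few matmuls, cheap and GPU-friendly even for large weight matrices.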
Three Reasoning Modes: A Practical Guide
Both models support three modes, but DeepSeek's documentation names them slightly differently than what's written in reviews. Officially:
- Non-Thinking — inference without internal chain-of-thought. The response is generated immediately, without "thinking" tokens. Fastest and cheapest in terms of output tokens.
- Thinking (High) — the model generates internal reasoning before responding. Thinking tokens are consumed but are not priced the same as completion tokens — technically, they are reasoning tokens and are billed separately. For most complex tasks, this is the optimal balance.
- Think Max — maximum budget for internal reasoning. DeepSeek recommends a minimum of 384K context for this mode — this is an important detail: if your context is shorter, the model will truncate the reasoning, and quality will drop.
How to enable via API (by default, deepseek-v4-flash includes Thinking mode):
Non-Thinking (cheapest):

```json
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "disabled"}
}
```

Thinking (High), the default:

```json
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "enabled", "budget_tokens": 8000}
}
```

Think Max, for complex tasks:

```json
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "enabled", "budget_tokens": 32000}
}
```
My practical guide to the modes:
| Task | Mode | Why |
|---|---|---|
| RAG chat, FAQ answers | Non-Thinking | Context is already provided by the retrieval layer; reasoning is redundant. |
| Code generation, refactoring | Thinking (High) | Needs "thinking" but not excessively. |
| Complex bugs, architectural decisions | Think Max | The task requires deep analysis; tokens are justified. |
| Mathematics, proofs | Think Max | Where Flash-Max approaches Pro in quality. |
| Classification, structured output | Non-Thinking | Simple task — reasoning only adds cost. |
In my RAG system, I use Non-Thinking as the default: the retrieval layer already does the "heavy lifting" of finding relevant context, and additional reasoning from the model doesn't improve response quality but increases latency and cost. I keep Think Max for manual tests and quality comparisons — not for production.
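In practice, the routing table above collapses into a small dispatch helper. A hedged sketch: the task labels are my own, and the `thinking` payload shape follows the API examples earlier in the article:

```python
# Hedged sketch of a per-task thinking-mode router. Task categories are
# my own labels; the payload shape mirrors this article's API examples.
MODE_FOR_TASK = {
    "rag_chat": {"type": "disabled"},                             # Non-Thinking
    "classification": {"type": "disabled"},                       # Non-Thinking
    "codegen": {"type": "enabled", "budget_tokens": 8000},        # Thinking (High)
    "architecture": {"type": "enabled", "budget_tokens": 32000},  # Think Max
    "math": {"type": "enabled", "budget_tokens": 32000},          # Think Max
}

def thinking_config(task: str) -> dict:
    """Unknown tasks default to Non-Thinking: cheapest and fastest."""
    return MODE_FOR_TASK.get(task, {"type": "disabled"})

print(thinking_config("codegen"))
```

Defaulting unknown tasks to Non-Thinking keeps the cost floor low; escalating to a thinking mode then becomes an explicit, reviewable decision.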
Documentation on thinking mode: api-docs.deepseek.com/guides/thinking_mode
Benchmarks: What to Take Seriously and What Not To
I'm accustomed to being skeptical of self-reported benchmarks — especially when a model is just released and no one has had time for independent testing yet. So, let's analyze the data in context: what they measure, where Flash is truly good, where it falls short, and what in these numbers is worth ignoring altogether.
What These Benchmarks Actually Mean
Before the numbers, important context on how to read DeepSeek's tables.
Firstly, almost all numbers from DeepSeek are self-reported. Independent confirmations as of the publication date of this article are not yet available. This doesn't mean they are lies — DeepSeek has a reputation from V3 and R1, where their benchmarks were confirmed. But "trust, but verify."
Secondly, almost all of Flash's strong numbers are in Flash-Max mode, meaning with the maximum reasoning token budget. In regular Thinking mode, the numbers will be lower. In Non-Thinking, even lower. For API tasks where speed and cost are important, you likely won't be using Max mode constantly.
Coding: Where Flash is Strongest
In coding tasks, Flash shows the best results relative to its price. Key numbers from the official model card and tech report:
| Benchmark | Flash-Max | Pro-Max | Claude Opus 4.6 | What it Measures |
|---|---|---|---|---|
| SWE-bench Verified | 79% | 80.6% | 80.8% | Real GitHub issues |
| LiveCodeBench | ~91% | 93.5% | 88.8% | Competitive programming |
| Terminal Bench 2.0 | 56.9% | 67.9% | 65.4% | Agent tasks in the terminal |
| SWE-bench Pro | ~48% | 55.4% | — | More complex real issues |
SWE-bench Verified is the most important of these benchmarks because it uses real tasks from real repositories (django, scikit-learn, matplotlib, etc.). Not synthetic, not olympiad problems. Flash-Max at 79% is only 1.6 points behind Pro-Max and 1.8 behind Claude Opus 4.6. With a 12x difference in output price, this is a very narrow gap.
LiveCodeBench — tasks from Codeforces, LeetCode, AtCoder. Flash is slightly weaker than Pro, but both outperform Claude Opus 4.6. Important: this is competitive programming, and these tasks are rarely encountered in real development. But for assessing "can the model think algorithmically" — it's a relevant benchmark.
Where Flash Noticeably Lags Behind Pro
Here, it's important to be honest — and the numbers speak for themselves.
Terminal Bench 2.0: 56.9% vs 67.9% for Pro — this is the largest gap between Flash and Pro among coding benchmarks. Terminal Bench measures an agent's ability to independently perform long-term tasks in the terminal: installing dependencies, running tests, fixing errors, interacting with the file system. An 11-point difference here is significant. It means a Flash agent gets "stuck" more often on long autonomous tasks where there's no human intervention.
MCPAtlas: Flash-Max is weaker. MCPAtlas evaluates working with a large number of external tools via MCP (Model Context Protocol). Pro-Max scores 73.6%, Flash-Max is noticeably lower. If your agent needs to juggle dozens of tools in one session — Flash is not the best choice.
Knowledge and reasoning: HLE, SimpleQA, MMLU-Pro. Here, model size makes a difference. Flash scores 86.4% on MMLU-Pro, Pro scores 87.5%. The difference is small, but on HLE (Humanity's Last Exam — the most complex cross-domain questions), Flash lags more noticeably. For tasks requiring a broad factual base — Pro is better.
| Benchmark | Flash-Max | Pro-Max | What it Measures |
|---|---|---|---|
| HLE (Humanity's Last Exam) | 34.8 | 37.7 | Most complex expert-level questions |
| MMLU-Pro | 86.4% | 87.5% | Broad academic knowledge base |
| GPQA Diamond | 88.1 | 90.1 | PhD-level science questions |
| Terminal Bench 2.0 | 56.9% | 67.9% | Autonomous agent tasks |
Source of figures: huggingface.co/deepseek-ai/DeepSeek-V4-Flash and felloai.com/deepseek-v4/
One Nuance About Flash Not Found in Reviews
Most materials compare Flash and Pro based on overall numbers. But there's an important technical detail from the tech report: Flash, with a 1M token context, uses only 10% of FLOPs and 7% of KV cache compared to V3.2. For Pro, it's 27% and 10% respectively.
This means Flash is more efficient than Pro even in relative terms with long contexts — and this is why it can compete in quality at a significantly smaller size. A small model that doesn't waste resources on "excessive" attention in long contexts can outperform a larger model with a standard architecture on tasks where context is important, not just parameter count.
Mathematics: Where Flash is Unexpectedly Strong
This is a lesser-known fact, but in formal mathematics, Flash-Max shows results close to Pro. On Putnam-200 Pass@8, Flash-Max scores 81.0 — significantly higher than Seed-2.0-Pro (35.5) and Gemini-3-Pro (26.5). This is a non-standard benchmark, and there are questions about the methodology, but the result is impressive.
On IMOAnswerBench, Flash-Max is also close to Pro. For tasks requiring mathematical reasoning with a large thinking budget — Flash-Max can be more cost-effective even compared to more expensive closed models.
Overall Honest Assessment: What V4 Truly Means for the Market
DeepSeek itself wrote in the tech report that V4 "trails state-of-the-art frontier models by approximately 3 to 6 months." This is rare honesty from an AI lab — most manufacturers don't publish such formulations in official materials.
GPT-5.4 and Gemini 3.1 Pro are ahead in knowledge and the most complex reasoning tasks. Claude Opus 4.6 is ahead on HLE and SWE-bench Verified (minimally, but ahead). These are facts.
But there's another side to this comparison. Here's the real difference in output cost between Flash and leading closed models:
| Model | Output $/M | Times More Expensive Than Flash |
|---|---|---|
| DeepSeek V4 Flash | $0.28 | — |
| GPT-5.4 Nano | ~$1.20 | 4.3× |
| Gemini 3.1 Flash | ~$1.05 | 3.75× |
| Claude Haiku 4.5 | ~$4.00 | 14.3× |
| Claude Opus 4.7 | ~$25.00 | 89× |
| GPT-5.5 | ~$30.00 | 107× |
An open-source model with an MIT license, lagging behind the closed frontier by 3–6 months, while costing 14 times less than Claude Haiku — that's the main argument. Not "DeepSeek is the best," but "DeepSeek changes the de facto cost/quality calculation for most product tasks."
For my RAG, the practical question isn't "which benchmark is higher," but "where is the quality sufficient for my users at an acceptable cost." It's precisely for such choices that these numbers are important — not as a ranking of winners, but as input data for decision-making.