DeepSeek V4 Flash in 2026: what it is, how much it costs, and how to run it without a GPU

TL;DR in 30 seconds: DeepSeek V4 Flash is a MoE model with 284B parameters (13B active), a 1M token context, and an MIT license. Released on April 24, 2026. Costs $0.14/$0.28 per million tokens — cheaper than Claude Haiku 4.5, Gemini 3.1 Flash, and GPT-5.4 Nano. Available via Ollama Cloud on NVIDIA Blackwell without downloading 160GB weights. Details below.

How I Learned About This Release

On the morning of April 25th, an email arrived from Ollama: "DeepSeek-V4-Flash is now available to run on Ollama's cloud using the latest NVIDIA Blackwell hardware." Just like that — no big announcements, just an email from a service I use daily for local model deployment.

I've been following DeepSeek since R1, the model that wiped hundreds of billions off NVIDIA's market cap and rewrote assumptions about what frontier-class training costs. V4 was long awaited and delayed several times. And here it is.

This article is not a press release summary. I'll try to break down what's truly important for a developer building LLM-based products — just as I'm building my RAG system.

Context: What Came Before V4

If you've only followed DeepSeek superficially, here's a brief timeline:

  • December 2024: DeepSeek V3 — the first open-source model that truly competes with GPT-4o in quality with open weights.
  • January 2025: R1 — a reasoning model on par with OpenAI's o1, trained for pennies compared to competitors. NVIDIA stock dropped by hundreds of billions.
  • December 2025: V3.2 — an evolutionary update with 671B parameters.
  • April 24, 2026: V4 Flash and V4 Pro — a new architecture, not just "more parameters."

It's important to understand: V4 is not V3.2+. It's a new architecture with a fundamentally different approach to long context. Details below.

Flash vs Pro: Two Different Products

DeepSeek released two models simultaneously, and they are often confused. Here are the main differences:

| Parameter | V4 Flash | V4 Pro |
|---|---|---|
| Parameters (total) | 284B | 1.6T |
| Active per token | 13B | 49B |
| Context | 1M tokens | 1M tokens |
| Max output | 384K tokens | 384K tokens |
| Weights (HuggingFace) | 160 GB | 865 GB |
| Input (cache miss) | $0.14/M | $1.74/M |
| Input (cache hit) | $0.028/M | $0.145/M |
| Output | $0.28/M | $3.48/M |
| License | MIT | MIT |

The key insight hidden in these numbers: cache-hit input is cheap for both models ($0.028/M for Flash vs $0.145/M for Pro), but output is 12 times cheaper on Flash ($0.28 vs $3.48). For most production tasks, output is the main driver of cost. This means Flash is not a "cheap version" but a separate product for a different class of tasks.
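A back-of-envelope check of that claim. The prices come from the table above; the token mix (a mostly cached system prompt, a short question, a medium-length answer) is an assumption, not a measurement:

```python
# Prices in USD per 1M tokens, from the Flash/Pro table above.
FLASH = {"in_miss": 0.14, "in_hit": 0.028, "out": 0.28}
PRO   = {"in_miss": 1.74, "in_hit": 0.145, "out": 3.48}

def cost_per_request(p, hit_tokens=4000, miss_tokens=1000, out_tokens=1500):
    """USD cost of one request for an assumed token mix."""
    return (hit_tokens * p["in_hit"]
            + miss_tokens * p["in_miss"]
            + out_tokens * p["out"]) / 1e6

flash, pro = cost_per_request(FLASH), cost_per_request(PRO)
print(f"Flash ${flash:.6f} vs Pro ${pro:.6f} ({pro / flash:.1f}x)")
```

With this mix, output makes up over 60% of Flash's per-request cost, and Pro comes out roughly 11x more expensive overall, which is why the output column matters more than the headline input price.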

Prices confirmed by official documentation: api-docs.deepseek.com/quick_start/pricing

Also an important note from the official documentation: the old names deepseek-chat and deepseek-reasoner will be deprecated. They now correspond to deepseek-v4-flash in non-thinking and thinking modes. If you have old code, plan your migration by July 24, 2026.
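If you have such code, the rename can be centralized in one place. A minimal sketch, assuming the OpenAI-compatible request shape; the exact field names should be checked against the official docs before migrating:

```python
# Hypothetical migration helper: route deprecated model names to
# deepseek-v4-flash with the matching thinking setting. The mapping follows
# the deprecation note above; the "thinking" field shape is an assumption
# to verify against api-docs.deepseek.com.
LEGACY_MODELS = {
    "deepseek-chat":     {"model": "deepseek-v4-flash", "thinking": {"type": "disabled"}},
    "deepseek-reasoner": {"model": "deepseek-v4-flash", "thinking": {"type": "enabled"}},
}

def migrate_request(payload: dict) -> dict:
    """Return a copy of the request with legacy model names rewritten."""
    new = dict(payload)
    legacy = LEGACY_MODELS.get(new.get("model", ""))
    if legacy:
        new["model"] = legacy["model"]
        new.setdefault("thinking", legacy["thinking"])
    return new

print(migrate_request({"model": "deepseek-reasoner", "messages": []}))
```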

Architecture: What's Really New

Most reviews at this point copy three lines from the tech report and move on. I'll try to explain what these changes mean practically: for a developer who needs to understand not "what is the model's architecture," but "why it behaves this way and what I should do about it."

DeepSeek V4 has three key architectural innovations: Hybrid Attention (CSA + HCA), Manifold-Constrained Hyper-Connections, and the Muon optimizer. Let's break down each.

Hybrid Attention: CSA + HCA

To understand why this is needed, first — the problem it solves.

In a standard transformer, the self-attention mechanism grows quadratically with context length. This means: if you double the context length, computations increase fourfold. At 1M tokens, standard attention becomes practically impossible — both in terms of inference cost and memory for the KV cache.
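The quadratic growth is easy to put in numbers. A toy illustration, counting pairwise attention scores only:

```python
# Toy illustration: naive self-attention scores every token against every
# other token, so the score matrix has n^2 entries. Doubling the context
# quadruples the work.
def attention_pairs(n_tokens: int) -> int:
    return n_tokens * n_tokens

for n in (1_000, 2_000, 1_000_000):
    print(f"{n:>9} tokens -> {attention_pairs(n):.2e} pairwise scores")
```

At 1M tokens that is 10^12 score computations per attention layer, which is why sparse and compressed attention stops being an optimization and becomes a requirement.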

DeepSeek V4 solves this through two complementary mechanisms:

CSA (Compressed Sparse Attention) — instead of each token "looking" at all other tokens in the context, CSA selectively focuses on the most relevant parts. It's similar to how an experienced reader scans a long document: they don't read every word but know where to find the important information. For most tokens in a long context, full attention is excessive; CSA cuts out this excess.

HCA (Heavily Compressed Attention) — goes further and aggressively compresses the KV cache, storing a compressed representation instead of the full one. A smaller KV cache means less GPU memory and faster inference with long contexts.

Together, the effect is: with a 1M token context, DeepSeek V4 Pro uses only 27% of FLOPs and 10% of KV cache compared to V3.2. Flash, with 13B active parameters, is even more efficient than Pro.

What this means practically for you:

  • RAG with large chunks: Instead of aggressive chunking into 512–1024 tokens, you can pass larger document segments. Less context loss at chunk boundaries — potentially better response quality.
  • Analyzing large codebases: 1M tokens can realistically be an entire repository. Previously, this was a marketing figure; now, at $0.028/M on cache hit, it's a real option.
  • Long conversations: The model can retain the entire conversation context without forced history truncation.

An important caveat: CSA and HCA are approximations. In theory, the model might miss something important in a very long context where relevant details are scattered throughout the document. In practice, DeepSeek reports 83.5% on MRCR 1M (needle-in-a-haystack at 1M tokens) — a strong result, but not 100%. For critical tasks where "not missing anything" is crucial — test on your own data.

Source: huggingface.co/deepseek-ai/DeepSeek-V4-Flash

mHC: Manifold-Constrained Hyper-Connections

In a standard transformer, each layer adds its representation to the previous one via a residual connection — simple addition. This simple operation has been both a strength and a weakness: it allows gradients to flow back during training (solving the vanishing gradient problem) but doesn't let layers "negotiate" how to combine their representations.

mHC replaces simple addition with a more expressive mechanism where each connection between layers can have its own trainable parameters. The "manifold constraint" is a mathematical condition that prevents these weights from diverging during training, maintaining stability.
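A toy sketch of the difference, not the actual mHC math: a standard residual connection adds the two streams with fixed weights of 1.0, while a hyper-connection-style mix makes the weights learnable, with a normalization step standing in for the manifold constraint:

```python
import math

def mix(x, fx, w0, w1):
    """Combine input x and layer output fx with constrained learnable weights."""
    # Normalizing (w0, w1) to the unit circle is a toy stand-in for the
    # manifold constraint that keeps mixing weights from diverging.
    norm = math.hypot(w0, w1)
    w0, w1 = w0 / norm, w1 / norm
    return [w0 * a + w1 * b for a, b in zip(x, fx)]

x = [0.5, -1.2, 0.3]                        # incoming representation
fx = [math.tanh(v) for v in x]              # stand-in for a layer's output

standard = [a + b for a, b in zip(x, fx)]   # plain residual: fixed 1/1 mix
hyper = mix(x, fx, 0.9, 0.4)                # learnable, constrained mix
print(standard, hyper)
```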

The practical effect for the end-user is twofold:

  • More stable quality on complex tasks: Standard residual connections sometimes lead to "dips" — a query is similar to a previous one, but the response is suddenly worse. mHC reduces this variability through better signal stabilization between layers.
  • Improved quality with large reasoning budget: When the model "thinks" for a long time (Think Max mode), it's important that the signal doesn't degrade in deeper layers. mHC directly addresses this problem.

For regular API usage, you won't "see" mHC directly — but this detail explains why Flash-Max in Think Max mode can approach Pro's quality on reasoning tasks, despite its significantly smaller size.

Muon Optimizer

This third innovation relates to the training process, not the model architecture itself. Muon is a next-generation optimizer, an alternative to AdamW used by most modern LLMs.

Technically, Muon orthogonalizes the gradient update built from Nesterov-style momentum, which has two effects: faster convergence during training and less sensitivity to the learning rate. For you as a user, it means one thing: the model is trained better on the same amount of tokens. DeepSeek trained both models on 32T tokens, significantly more than V3.2.

Three Reasoning Modes: A Practical Guide

Both models support three modes, but DeepSeek's documentation names them slightly differently than what's written in reviews. Officially:

  • Non-Thinking — inference without internal chain-of-thought. The response is generated immediately, without "thinking" tokens. Fastest and cheapest in terms of output tokens.
  • Thinking (High) — the model generates internal reasoning before responding. Thinking tokens are consumed but are not priced the same as completion tokens — technically, they are reasoning tokens and are billed separately. For most complex tasks, this is the optimal balance.
  • Think Max — maximum budget for internal reasoning. DeepSeek recommends a minimum of 384K context for this mode — this is an important detail: if your context is shorter, the model will truncate the reasoning, and quality will drop.

How to enable via API (by default, deepseek-v4-flash includes Thinking mode):

# Non-Thinking — cheapest
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "disabled"}
}

# Thinking (High) — default
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "enabled", "budget_tokens": 8000}
}

# Think Max — for complex tasks
{
  "model": "deepseek-v4-flash",
  "messages": [...],
  "thinking": {"type": "enabled", "budget_tokens": 32000}
}

My practical guide to the modes:

| Task | Mode | Why |
|---|---|---|
| RAG chat, FAQ answers | Non-Thinking | Context is already provided by the retrieval layer; reasoning is redundant. |
| Code generation, refactoring | Thinking (High) | Needs "thinking" but not excessively. |
| Complex bugs, architectural decisions | Think Max | The task requires deep analysis; the tokens are justified. |
| Mathematics, proofs | Think Max | Where Flash-Max approaches Pro in quality. |
| Classification, structured output | Non-Thinking | Simple task; reasoning only adds cost. |

In my RAG system, I use Non-Thinking as the default: the retrieval layer already does the "heavy lifting" of finding relevant context, and additional reasoning from the model doesn't improve response quality but increases latency and cost. I keep Think Max for manual tests and quality comparisons — not for production.
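That routing policy fits in a few lines of code. The task names and thinking budgets are my own conventions, mirroring the JSON examples earlier; they are not official values:

```python
# Per-task thinking configuration. Budgets mirror the JSON examples above
# and are assumptions, not official recommendations.
MODE_BY_TASK = {
    "rag_chat":       {"type": "disabled"},                         # Non-Thinking
    "classification": {"type": "disabled"},                         # Non-Thinking
    "codegen":        {"type": "enabled", "budget_tokens": 8000},   # Thinking (High)
    "hard_debug":     {"type": "enabled", "budget_tokens": 32000},  # Think Max
    "math_proof":     {"type": "enabled", "budget_tokens": 32000},  # Think Max
}

def thinking_config(task: str) -> dict:
    # Unknown tasks fall back to the cheapest mode by design.
    return MODE_BY_TASK.get(task, {"type": "disabled"})

print(thinking_config("codegen"))
```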

Documentation on thinking mode: api-docs.deepseek.com/guides/thinking_mode

Benchmarks: What to Take Seriously and What Not To

I'm accustomed to being skeptical of self-reported benchmarks — especially when a model is just released and no one has had time for independent testing yet. So, let's analyze the data in context: what they measure, where Flash is truly good, where it falls short, and what in these numbers is worth ignoring altogether.

What These Benchmarks Actually Mean

Before the numbers, important context on how to read DeepSeek's tables.

Firstly, almost all numbers from DeepSeek are self-reported. Independent confirmations as of the publication date of this article are not yet available. This doesn't mean they are lies — DeepSeek has a reputation from V3 and R1, where their benchmarks were confirmed. But "trust, but verify."

Secondly, almost all of Flash's strong numbers are in Flash-Max mode, meaning with the maximum reasoning token budget. In regular Thinking mode, the numbers will be lower. In Non-Thinking, even lower. For API tasks where speed and cost are important, you likely won't be using Max mode constantly.

Coding: Where Flash is Strongest

In coding tasks, Flash shows the best results relative to its price. Key numbers from the official model card and tech report:

| Benchmark | Flash-Max | Pro-Max | Claude Opus 4.6 | What it measures |
|---|---|---|---|---|
| SWE-bench Verified | 79% | 80.6% | 80.8% | Real GitHub issues |
| LiveCodeBench | ~91% | 93.5% | 88.8% | Competitive programming |
| Terminal Bench 2.0 | 56.9% | 67.9% | 65.4% | Agent tasks in the terminal |
| SWE-bench Pro | ~48% | 55.4% | n/a | More complex real issues |

SWE-bench Verified is the most important of these benchmarks because it uses real tasks from real repositories (django, scikit-learn, matplotlib, etc.). Not synthetic, not olympiad problems. Flash-Max at 79% is only 1.6 points behind Pro-Max and 1.8 behind Claude Opus 4.6. With a 12x difference in output price, this is a very narrow gap.

LiveCodeBench — tasks from Codeforces, LeetCode, AtCoder. Flash is slightly weaker than Pro, but both outperform Claude Opus 4.6. Important: this is competitive programming, and these tasks are rarely encountered in real development. But for assessing "can the model think algorithmically" — it's a relevant benchmark.

Where Flash Noticeably Lags Behind Pro

Here, it's important to be honest — and the numbers speak for themselves.

Terminal Bench 2.0: 56.9% vs 67.9% for Pro — this is the largest gap between Flash and Pro among coding benchmarks. Terminal Bench measures an agent's ability to independently perform long-term tasks in the terminal: installing dependencies, running tests, fixing errors, interacting with the file system. An 11-point difference here is significant. It means a Flash agent gets "stuck" more often on long autonomous tasks where there's no human intervention.

MCPAtlas: Flash-Max is weaker. MCPAtlas evaluates working with a large number of external tools via MCP (Model Context Protocol). Pro-Max scores 73.6%, Flash-Max is noticeably lower. If your agent needs to juggle dozens of tools in one session — Flash is not the best choice.

Knowledge and reasoning: HLE, SimpleQA, MMLU-Pro. Here, model size makes a difference. Flash scores 86.4% on MMLU-Pro, Pro scores 87.5%. The difference is small, but on HLE (Humanity's Last Exam — the most complex cross-domain questions), Flash lags more noticeably. For tasks requiring a broad factual base — Pro is better.

| Benchmark | Flash-Max | Pro-Max | What it measures |
|---|---|---|---|
| HLE (Humanity's Last Exam) | 34.8 | 37.7 | The most complex expert-level questions |
| MMLU-Pro | 86.4% | 87.5% | Broad academic knowledge base |
| GPQA Diamond | 88.1 | 90.1 | PhD-level science questions |
| Terminal Bench 2.0 | 56.9% | 67.9% | Autonomous agent tasks |

Source of figures: huggingface.co/deepseek-ai/DeepSeek-V4-Flash and felloai.com/deepseek-v4/

One Nuance About Flash Not Found in Reviews

Most materials compare Flash and Pro based on overall numbers. But there's an important technical detail from the tech report: Flash, with a 1M token context, uses only 10% of FLOPs and 7% of KV cache compared to V3.2. For Pro, it's 27% and 10% respectively.

This means Flash is more efficient than Pro even in relative terms with long contexts — and this is why it can compete in quality at a significantly smaller size. A small model that doesn't waste resources on "excessive" attention in long contexts can outperform a larger model with a standard architecture on tasks where context is important, not just parameter count.

Mathematics: Where Flash is Unexpectedly Strong

This is a less known fact, but in formal mathematics, Flash-Max shows results close to Pro. On Putnam-200 Pass@8, Flash-Max scores 81.0 — significantly higher than Seed-2.0-Pro (35.5) and Gemini-3-Pro (26.5). This is a non-standard benchmark, and there are questions about the methodology, but the result is impressive.

On IMOAnswerBench, Flash-Max is also close to Pro. For tasks requiring mathematical reasoning with a large thinking budget — Flash-Max can be more cost-effective even compared to more expensive closed models.

Overall Honest Assessment: What V4 Truly Means for the Market

DeepSeek itself wrote in the tech report that V4 "trails state-of-the-art frontier models by approximately 3 to 6 months." This is rare honesty from an AI lab — most manufacturers don't publish such formulations in official materials.

GPT-5.4 and Gemini 3.1 Pro are ahead in knowledge and the most complex reasoning tasks. Claude Opus 4.6 is ahead on HLE and SWE-bench Verified (minimally, but ahead). These are facts.

But there's another side to this comparison. Here's the real difference in output cost between Flash and leading closed models:

| Model | Output $/M | Times more expensive than Flash |
|---|---|---|
| DeepSeek V4 Flash | $0.28 | 1× (baseline) |
| GPT-5.4 Nano | ~$1.20 | 4.3× |
| Gemini 3.1 Flash | ~$1.05 | 3.75× |
| Claude Haiku 4.5 | ~$4.00 | 14.3× |
| Claude Opus 4.7 | ~$25.00 | 89× |
| GPT-5.5 | ~$30.00 | 107× |

An open-source model with an MIT license, lagging behind the closed frontier by 3–6 months, while costing 14 times less than Claude Haiku — that's the main argument. Not "DeepSeek is the best," but "DeepSeek changes the de facto cost/quality calculation for most product tasks."

For my RAG, the practical question isn't "which benchmark is higher," but "where is the quality sufficient for my users at an acceptable cost." It's precisely for such choices that these numbers are important — not as a ranking of winners, but as input data for decision-making.

How to run DeepSeek V4 Flash without a GPU

Flash weighs 160 GB on HuggingFace. Running it locally requires a multi-GPU server with hundreds of gigabytes of VRAM: not a Mac, not a laptop, not even a mid-range workstation. But there are three ways to use the model right now without any specialized hardware.

Option 1: Ollama Cloud — the easiest start

On April 25th, Ollama sent an official email to subscribers: Flash is available on their cloud, hosted on NVIDIA Blackwell. The commands below are verified from the source, not theoretical examples.

Step 1: Install or update Ollama to the latest version. The ollama launch command appeared in January 2026 — if your version is older, it won't work.

# Recommended: official installer — always the latest version
curl -fsSL https://ollama.com/install.sh | sh

# or download .dmg / .exe directly from ollama.com/download
# (Homebrew might lag behind the current release by 1-2 weeks)

Step 2: Sign in — cloud models require an Ollama account:

ollama signin

This will open a browser to ollama.com/connect — your machine will be registered there via a public SSH key. After confirmation, the credentials are saved locally and used automatically for all subsequent cloud requests. Without this step, :cloud models won't launch.

For CI/CD or headless environments where a browser is unavailable — an alternative via an API key from your account settings page:

export OLLAMA_API_KEY=ollama_...  # instead of ollama signin

Step 3: Launch — depending on what you need:

# Just chat with the model in the terminal
ollama run deepseek-v4-flash:cloud

# With Claude Code — agentic coding in your repository
ollama launch claude --model deepseek-v4-flash:cloud

# With OpenClaw — an alternative coding agent
ollama launch openclaw --model deepseek-v4-flash:cloud

# With Hermes Agent — for research and automation tasks
ollama launch hermes --model deepseek-v4-flash:cloud

Important detail: unlike local models, you don't need to ollama pull — the :cloud model launches instantly without downloading to your disk. No env variables, no config files — this is precisely the "killer feature" of ollama launch that appeared in January 2026. Before that, you had to manually specify the API endpoint, select the model, and edit the configs of each agent separately.

What happens under the hood with :cloud

When you run deepseek-v4-flash:cloud, the local Ollama server acts as an authorized proxy: your request goes to Ollama's servers, is processed there on a Blackwell GPU, and the result is returned to you. Nothing is downloaded locally except Ollama itself.

Technically, it looks like this: the local daemon receives a request, detects the :cloud suffix, normalizes the model name for the remote endpoint, attaches auth headers from your SSH key, and proxies the request to Ollama's cloud infrastructure. The response is streamed back in real-time — just like with a local model. From your code or agent's perspective, nothing changes; it still connects to localhost:11434.
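This is easy to see from code: the request below targets the standard Ollama REST endpoint on localhost and would look identical for a local model; only the `:cloud` suffix changes where the compute happens. A sketch, assuming the `/api/chat` request shape from Ollama's documentation:

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST against the local Ollama daemon's /api/chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # ask for one JSON object instead of a token stream
    }
    return urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("deepseek-v4-flash:cloud", "Hello")

RUN_LIVE = False  # flip on with a running, signed-in Ollama daemon
if RUN_LIVE:
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["message"]["content"])
```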

According to Ollama, models are hosted through NVIDIA Cloud Providers (NCPs) with a condition of zero logging and zero data retention. Prompts are not stored or used for training — this is confirmed in the official documentation. Ollama also notes that data may be processed in the USA, Europe, and Singapore depending on load.

Ollama Cloud Limits and Pricing

It's important to understand before you start: Ollama Cloud is not an unlimited service. Here's the current table from ollama.com/pricing:

| Tier | Price | Concurrent models | Volume |
|---|---|---|---|
| Free | $0 | 1 | Light usage, model evaluation |
| Pro | $20/mo | 3 | 50× more than Free |
| Max | $100/mo | 10 | 5× more than Pro |

Limits are measured in GPU time (not tokens) and reset every 5 hours and weekly. Free is sufficient for testing and evaluating the model. For production agents or long coding sessions, Pro or Max is required.

Important warning from the official Ollama email: "Please bear with us as we continue to add GPU capacity." The model was released yesterday, and the infrastructure is still being stabilized. In the first few weeks, expect potential queues and increased latency. For production-critical tasks in the first month, I would recommend DeepSeek API directly — the capacity there is more stable.

ollama launch documentation: ollama.com/blog/launch
Claude Code with Ollama: docs.ollama.com/integrations/claude-code

Option 2: DeepSeek API directly

The most direct path to the model without intermediaries. Suitable if you already have code using the OpenAI SDK — the change is minimal.

Get an API key at platform.deepseek.com — registration is free, and there's a starting credit for testing.

Python (OpenAI-compatible format):

from openai import OpenAI

client = OpenAI(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com"
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)

With thinking mode enabled (High by default, controllable):

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain this algorithm..."}],
    # thinking is enabled by default
    # to disable: add extra_body={"thinking": {"type": "disabled"}}
    max_tokens=8000
)

Anthropic-compatible format — if your code is written for the Anthropic SDK, DeepSeek supports the same API format via a separate endpoint:

import anthropic

client = anthropic.Anthropic(
    api_key="your-deepseek-key",
    base_url="https://api.deepseek.com/anthropic"
)

message = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Hello"}]
)

For my RAG system on WebsCraft (Spring Boot + OpenRouter), this is the most interesting option: I can test Flash with my real queries, compare it with the current llama-3.3-70b, and get concrete numbers on quality and cost. Tests will be in the next article.

API documentation: api-docs.deepseek.com

Option 3: OpenRouter — if you need a single API for multiple models

OpenRouter has already added Flash. This is convenient if you have code where you switch between multiple providers or want to A/B test Flash against other models without changing your code.

from openai import OpenAI

client = OpenAI(
    api_key="your-openrouter-key",
    base_url="https://openrouter.ai/api/v1"
)

response = client.chat.completions.create(
    model="deepseek/deepseek-v4-flash",  # model string on OpenRouter
    messages=[{"role": "user", "content": "Hello"}]
)

The listed price on OpenRouter is the same: $0.14/M input, $0.28/M output. OpenRouter takes a small cut on top, but it is minimal and compensated by the convenience of unified billing and the ability to fall back to another model if one is unavailable.

Model page: openrouter.ai/deepseek/deepseek-v4-flash
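The fallback logic mentioned above doesn't need anything OpenRouter-specific. A minimal sketch where `call_model` stands in for your actual client call (OpenAI SDK pointed at OpenRouter, raw HTTP, etc.), and the second model in the chain is an assumption based on my current setup:

```python
# Minimal fallback chain: try Flash first; on any error, try the next model.
# Model IDs follow OpenRouter's provider/model naming scheme.
FALLBACK_CHAIN = [
    "deepseek/deepseek-v4-flash",
    "meta-llama/llama-3.3-70b-instruct",  # assumption: my current model
]

def complete_with_fallback(call_model, prompt: str):
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return model, call_model(model, prompt)
        except Exception as exc:          # outage, rate limit, timeout...
            last_error = exc
    raise RuntimeError("all models in the chain failed") from last_error

# Toy stand-in that fails for Flash, to demonstrate the fallback path:
def flaky(model, prompt):
    if "deepseek" in model:
        raise TimeoutError("provider busy")
    return f"answer from {model}"

model, answer = complete_with_fallback(flaky, "Hello")
print(model, "->", answer)
```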

Which option to choose: a quick comparison

| Criterion | Ollama Cloud | DeepSeek API | OpenRouter |
|---|---|---|---|
| Ease of start | ⭐⭐⭐ | ⭐⭐ | ⭐⭐ |
| Coding agents | ⭐⭐⭐ (native support) | ⭐ (setup required) | ⭐ (setup required) |
| Current stability | ⭐ (new, capacity is scaling) | ⭐⭐⭐ | ⭐⭐⭐ |
| Multi-model routing | ⭐ (Ollama models only) | ⭐ (DeepSeek only) | ⭐⭐⭐ |
| Price | Free tier available / $20 Pro | Pay-per-use | Pay-per-use + margin |
| Privacy | Zero retention (via NCPs) | DeepSeek policy | OpenRouter policy |

My practical plan: Ollama Cloud for testing agents and quick starts, DeepSeek API directly for production integration into RAG. OpenRouter — as a fallback and for A/B tests alongside other models.

Price in market context

This is where Flash truly shines. Comparison with models of a similar class ("fast/efficient" tier):

| Model | Input $/M | Output $/M |
|---|---|---|
| DeepSeek V4 Flash | $0.14 | $0.28 |
| GPT-5.4 Nano | ~$0.30 | ~$1.20 |
| Gemini 3.1 Flash | ~$0.35 | ~$1.05 |
| Claude Haiku 4.5 | ~$0.80 | ~$4.00 |
| DeepSeek V4 Pro | $1.74 | $3.48 |

Flash is 2x cheaper than its closest competitor (GPT-5.4 Nano) for input and 4x cheaper for output. At the same time, it has an MIT license and open weights for self-hosting.

Geopolitical irony, which few notice

This section is not about politics for politics' sake. It's about how the context surrounding V4 directly affects how much you can rely on this model in the long term — and why the MIT license is more important here than it seems.

V3 and accusations of sanctions violations

To understand V4, you need to know the backstory. DeepSeek V3 (December 2024) and R1 (January 2025) were trained on Nvidia chips — and that's where the problem arose. After the release, Washington accused DeepSeek of acquiring prohibited Nvidia H100/H800 chips bypassing American export restrictions. No direct evidence was publicly presented, DeepSeek confirmed nothing, but the issue remained open.

V4 is a direct response to this situation.

V4 and Huawei Ascend: a strategic pivot

DeepSeek has not publicly disclosed the hardware on which V4 was trained. But on the release day, April 24, Huawei officially announced that its entire Ascend supernode lineup fully supports DeepSeek V4 — and this is no coincidence. According to The Information and Reuters, DeepSeek gave Huawei and Cambricon early access to V4 for optimization, deliberately withholding such access from Nvidia.

Moreover: according to The Information, V4 could have been released earlier, but the team delayed the release for several months — precisely because of working with Huawei and Cambricon to rewrite the model's architectural components for their chips.

This is DeepSeek's first major model designed from the ground up for non-Nvidia hardware.

What is Ascend 950PR and how powerful is it

To be honest about its capabilities: Huawei Ascend 950PR is not an Nvidia H100, let alone Blackwell. According to Counterpoint Research analysts, the Ascend 910C (the predecessor of the 950PR) delivers roughly 60% of the H100's inference performance, and the H100 is already two generations behind Nvidia's current Blackwell. In other words, American chips today are approximately five times more powerful than their Chinese counterparts, and that gap is projected to widen to 17x by 2027.

But there's a nuance, pointed out by analyst Wei Sun from Counterpoint Research: if an AI system can achieve frontier-level results on significantly weaker hardware, it means that hardware sanctions become a less effective tool. DeepSeek effectively demonstrates this thesis.

Timeline a week before release: everything happened simultaneously

The release timing is important. Here's what was happening in parallel:

  • April 23 — White House OSTP director Michael Kratsios officially accused Chinese organizations of "industrial-scale IP theft" from American AI labs. DeepSeek was mentioned separately as a company that distilled models from OpenAI and Anthropic
  • April 23 — Jensen Huang (CEO of Nvidia) stated on the Dwarkesh podcast that if DeepSeek optimizes its models for Huawei instead of Nvidia, it would be "a horrible outcome for America"
  • April 24 — V4 is released, clearly optimized for Huawei Ascend. SMIC (Huawei's chip manufacturer) shares jumped 10% in Hong Kong
  • April 24 — Chinese MFA: US accusations are "baseless" and are "slander against the achievements of the Chinese AI industry"

The release of V4 at this specific moment is not a coincidence. It's a demonstration: "we can do it without your hardware."

The Ollama paradox: trained on Huawei, hosted on Blackwell

And here lies pure geopolitical irony.

Official statement from Ollama on April 25: "DeepSeek-V4-Flash is now available to run on Ollama's cloud using the latest NVIDIA Blackwell hardware."

That is: the model is trained (or at least optimized) for Huawei Ascend — and is hosted by an American company on American Nvidia Blackwell. The same model, two different chip stacks, two different jurisdictions, one open MIT-licensed weights file.

This became possible precisely thanks to the MIT license and open weights. A closed model like GPT-5.x or Gemini 3.1 Pro cannot do this: it is tied to the provider's infrastructure and terms of use. DeepSeek V4 Flash can.

Practical implications for developers

Geopolitics is the background. But it has direct practical consequences for those building products on LLMs:

Availability risk. If US-China tensions worsen, the US government could theoretically pressure hosting providers regarding the servicing of DeepSeek models. The MIT license and open weights are an insurance policy: the model can be migrated to your own infrastructure or another cloud provider. This is not possible with GPT or Claude.

Supply chain for inference. DeepSeek is clearly building an independent Chinese chip stack. This means that in the future, you may have a choice: host Flash through western providers (Ollama, OpenRouter, AWS Bedrock) or through Chinese clouds (Alibaba Cloud, Tencent Cloud). Competition between them benefits the developer — it drives prices down.

Questions about training data and distillation. Anthropic and OpenAI have publicly accused DeepSeek of distilling their models — using the output of GPT/Claude to train DeepSeek. DeepSeek has not officially admitted this. For the developer, the practical question is different: if you are building a product where responsibility for training data is important (regulated industries, enterprise contracts) — this is a risk to consider.

What doesn't change. The MIT license is clear: you can use, modify, and commercialize without additional permissions. The geopolitics surrounding DeepSeek do not invalidate your rights under MIT. The model is yours after downloading.

Sources: The Next Web: Jensen Huang on Huawei and DeepSeek, ResultSense: DeepSeek V4 on Huawei Ascend, TrendForce: Ascend 950PR and CUDA independence

My personal assessment

I test AI models not in a vacuum — I have a specific RAG system: Spring Boot + nomic-embed-text for embedding + PostgreSQL pgvector for storage + OpenRouter as a provider. Currently, I use meta-llama/llama-3.3-70b via OpenRouter in production for chat.

Flash via DeepSeek API or OpenRouter is my next candidate for A/B testing. Reasons:

  • Price: almost twice as cheap as the current solution for output
  • 1M context: my RAG passes large chunks of documents — long context is important
  • Cache hit pricing: if the system prompt is unchanged between requests, $0.028/M is almost free
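A quick sketch of what that cache-hit pricing means in dollars. Prices are from DeepSeek's published table; the request volume and token counts are assumptions about my workload, not measurements:

```python
# Prices (USD per 1M tokens) from DeepSeek's pricing table; the request
# volume and token counts below are assumptions about my RAG workload.
IN_MISS, IN_HIT, OUT = 0.14, 0.028, 0.28

def monthly_cost(requests, system_tokens, question_tokens, answer_tokens):
    # The fixed system prompt hits the cache; user questions never do.
    hit = requests * system_tokens * IN_HIT
    miss = requests * question_tokens * IN_MISS
    out = requests * answer_tokens * OUT
    return (hit + miss + out) / 1e6

with_cache = monthly_cost(100_000, 3_000, 400, 800)
no_cache = monthly_cost(100_000, 0, 3_400, 800)  # same tokens, all misses
print(f"${with_cache:.2f} vs ${no_cache:.2f} without caching")
```

With these assumptions the bill roughly halves ($36.40 vs $70.00 per month), purely from the system prompt hitting the cache.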

What I'm leaving open: quality on Ukrainian language queries. The model is trained primarily on English and Chinese data. My actual tests will be in a separate article comparing Flash vs Gemini Flash vs Claude Haiku 4.5 for RAG.

For now: for API products where price is important, Flash is definitely worth testing. For complex agent tasks where humans are out of the loop — wait for independent benchmarks or take Pro.

Conclusion

DeepSeek V4 Flash is not a revolution, but a very strong argument for reconsidering your AI stack. In short:

  • The cheapest frontier-class model in its price segment
  • MIT license and open weights — a rarity for this level
  • 1M context at an acceptable price — finally realistic for production
  • On SWE-bench, Flash lags behind Pro by 1.6 points — but is 12 times cheaper for output
  • Weaker than closed-source on knowledge and complex agent tasks — and DeepSeek honestly states this
  • Through Ollama Cloud, it can be launched right now without a GPU — but the infrastructure is not yet stabilized

DeepSeek V4 Technical Report: huggingface.co (DeepSeek_V4.pdf)
Official page of the Flash model: huggingface.co/deepseek-ai/DeepSeek-V4-Flash
TechCrunch: DeepSeek closes the gap with frontier models

Frequently Asked Questions (FAQ)

What is DeepSeek V4 Flash?

DeepSeek V4 Flash is an open MoE model from the Chinese lab DeepSeek, released on April 24, 2026. It has 284B parameters (13B active per token), supports a 1M token context, and is available for $0.14/$0.28 per million tokens.

How does Flash differ from V4 Pro?

Flash is smaller and significantly cheaper: output costs $0.28/M versus $3.48/M for Pro. On most benchmarks, Flash trails Pro by 1-2 points. Pro suits complex agent tasks; Flash suits API products, RAG, and price-sensitive workloads.

How to run DeepSeek V4 Flash without a GPU?

Through Ollama Cloud: ollama run deepseek-v4-flash:cloud or ollama launch claude --model deepseek-v4-flash:cloud. The model runs on Ollama's servers, so you don't need to download the 160 GB of weights locally. Alternatives are the DeepSeek API or OpenRouter.

How much does DeepSeek V4 Flash API cost?

$0.14/M tokens for input (cache miss), $0.028/M (cache hit), $0.28/M for output. Official source: api-docs.deepseek.com/quick_start/pricing
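A quick back-of-the-envelope cost function built from those numbers. The prices are copied from the pricing above; the token counts in the example call are made-up illustrative values:

```python
# DeepSeek V4 Flash prices in USD per million tokens
# (from the official pricing page cited above)
PRICE_INPUT_MISS = 0.14   # uncached input
PRICE_INPUT_HIT = 0.028   # cached input
PRICE_OUTPUT = 0.28       # output

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Cost of one API call in USD, splitting input into cached and uncached parts."""
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE_INPUT_MISS / 1_000_000
        + cached_tokens * PRICE_INPUT_HIT / 1_000_000
        + output_tokens * PRICE_OUTPUT / 1_000_000
    )

# Example: an 8K-token RAG prompt with a 2K cached system prompt
# and a 1K-token answer costs about a tenth of a cent
cost = request_cost(input_tokens=8_000, output_tokens=1_000, cached_tokens=2_000)
```

At these rates, even a service handling 100K such requests a month lands at roughly $120 of inference cost, which is the core of the "cheapest in its segment" argument.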

Is DeepSeek V4 Flash suitable for RAG?

Potentially yes — especially thanks to its low output price and large context. Cache hit pricing ($0.028/M input) makes repeated requests with the same system prompt almost free. Practical testing on real tasks will be in the next article in this cluster.
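For a rough sense of what a 1M-token window buys a RAG pipeline, here is a small sketch; the chunk size and reserved headroom are illustrative assumptions, not measured values:

```python
CONTEXT_WINDOW = 1_000_000  # V4 Flash context, per the spec above
CHUNK_TOKENS = 512          # a common RAG chunk size (assumption)
RESERVED = 16_000           # headroom for system prompt, question, answer (assumption)

def max_chunks(context: int = CONTEXT_WINDOW,
               chunk: int = CHUNK_TOKENS,
               reserved: int = RESERVED) -> int:
    """How many retrieval chunks fit in the context after reserving headroom."""
    return (context - reserved) // chunk

# Well over a thousand 512-token chunks fit, so retrieval quality,
# not the window size, becomes the limiting factor.
budget = max_chunks()
```

In practice you would still retrieve far fewer chunks than the window allows, both for cost (every input token is billed) and because stuffing the context hurts answer focus.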
