Anthropic released Claude Opus 4.8 and immediately published a benchmark table with 15+ metrics.
At first glance, it's just another set of percentages and rankings.
But if you read carefully, these numbers hide several non-trivial conclusions
about where Opus 4.8 is truly better, where it falls short, and why some of the most important
changes are not reflected in any chart at all.
What Benchmarks Does Anthropic Publish and What Do They Measure
Anthropic published a comparative table for four models:
Claude Opus 4.8, Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro.
Standard configuration: adaptive thinking at maximum effort,
average of 5 attempts. Source:
Anthropic System Card.
Key benchmarks worth knowing:
SWE-bench Pro — the most challenging variant of SWE-bench: tasks from actively maintained repositories, multi-file diffs, no data leakage into public space. Closest to real-world coding.
SWE-bench Verified — the original set of 500 tasks. Simpler, but still authoritative.
Humanity's Last Exam (HLE) — the most challenging general reasoning benchmark to date. Two modes: with and without tools.
GPQA Diamond — PhD-level questions in physics, chemistry, biology. Considered practically saturated at the frontier level.
OSWorld-Verified — autonomous computer usage: document editing, browser, file management on a real VM.
GDPval-AA — evaluation of economically valuable knowledge work in real professional domains.
Finance Agent v2 — financial analysis, evaluated by Vals AI.
GraphWalks BFS 1M — long-context information extraction at 1 million tokens.
USAMO 2026 — Olympic-level mathematical proofs.
Coding: SWE-bench Pro — Where Opus 4.8 Leads
SWE-bench Pro is where Opus 4.8 has the largest gap over its competitors.
According to Vellum AI
and the official Anthropic system card:
Model
SWE-bench Pro
SWE-bench Verified
SWE-bench Multilingual
Claude Opus 4.8
69.2%
88.6%
84.4%
Claude Opus 4.7
64.3%
87.6%
80.5%
GPT-5.5
58.6%
N/A
N/A
Gemini 3.1 Pro
54.2%
80.6%
N/A
The key pattern: the harder the benchmark variant, the larger the gap.
On SWE-bench Verified, the difference between Opus 4.8 and GPT-5.5 is small (88.6% vs ~82%
according to independent data). On SWE-bench Pro, where tasks are more complex and there's no
"leakage" of correct answers, the gap is over 10 percentage points.
Practical implication from Cursor: Opus 4.8 on
CursorBench
not only achieves a better result but does so with
fewer steps. The token cost per task
decreases without a drop in output quality.
Also noteworthy: computer usage (OSWorld-Verified)
shows 83.4% for Opus 4.8 compared to 78.7% for GPT-5.5 and 76.2% for Gemini 3.1 Pro.
This is not an isolated metric; this result is a prerequisite for Dynamic Workflows (hundreds of parallel sub-agents) to be useful in practice at all.
Why This Matters
SWE-bench Pro is intentionally designed to be "unlearnable."
Tasks are taken from actively maintained repositories, meaning they were created
after the training cutoff date for most models. Multi-file diffs require
understanding the architecture, not just a local line fix.
This is why the gap here is the most honest signal of real coding capability.
For a team using Claude Code on a real project, this means:
Opus 4.8 maintains context between files better, makes changes that "break"
neighboring modules less often, and completes tasks more frequently on the first try without additional
prompt clarification. Fewer steps mean not just token savings,
but less manual control over the agent.
Section Conclusion
I believe Opus 4.8 is the strongest public model for multi-file software engineering
as of May 2026. The gap over GPT-5.5 on SWE-bench Pro (10.6 points)
and over Gemini 3.1 Pro (15 points) is not statistical noise but a stable advantage
on the most representative coding benchmark. If your primary
use case is agent-based coding on real repositories, the choice in favor of
Opus 4.8 is supported by both numbers and practical feedback from early testers.
Terminal-Bench: Where GPT-5.5 Still Leads — and Why the Numbers Here Aren't So Simple
Terminal-Bench 2.1 is the benchmark where Opus 4.8 doesn't lead. And this is where
methodology matters as much as the model itself.
Model
Terminal-Bench 2.1 (Terminus-2)
Terminal-Bench 2.1 (Custom Harness)
GPT-5.5
78.2%
83.4% (Codex CLI)
Claude Opus 4.8
74.6%
—
Gemini 3.1 Pro
70.3%
—
Claude Opus 4.7
66.1%
—
The problem: GPT-5.5 publishes its "headline" result of 83.4% through its own
Codex CLI harness. All other models, including Opus 4.8, were measured using
the public Terminus-2 harness. Comparing 83.4% to 74.6% is not a comparison
of models, but a comparison of measurement tools.
What can be fairly compared: on Terminus-2, GPT-5.5 is 78.2%, Opus 4.8 is 74.6%.
The difference is real but significantly smaller. At the same time, Opus 4.8 has improved relative to
Opus 4.7 by 8.5 points on the same harness — and this is real progress, not marketing.
According to Digital Applied,
Anthropic published both numbers — a rarity in the industry where labs
usually choose the harness that best "flatters" their model.
Conclusion: if your workflow is oriented towards
terminal tasks, shell scripting, and CLI automation — GPT-5.5 with Codex CLI
might be a better choice for now. If you are building an agent coding pipeline
with multi-file engineering — Opus 4.8 is ahead.
Reasoning: GPQA, HLE, and a +27 point jump in math
GPQA Diamond is where Opus 4.8 took a step back.
93.6% versus 94.2% for Opus 4.7. Gemini 3.1 Pro shows 94.3% and formally
leads. But important context: GPQA Diamond is considered practically saturated
at the frontier model level. The difference between 93.6% and 94.3% is within
the margin of statistical error with 5 attempts.
Humanity's Last Exam (HLE) is where there is still real
room for progress:
Model
HLE without tools
HLE with tools
Claude Opus 4.8
49.8%
57.9%
Claude Opus 4.7
46.9%
54.7%
GPT-5.5
41.4%
52.2%
Gemini 3.1 Pro
44.4%
51.4%
Opus 4.8 leads on HLE in both configurations – and the gap with tools
is wider than without.
But the most dramatic result in reasoning is
USAMO 2026 (Olympic-level math proofs):
96.7% in Opus 4.8 versus 69.3% in Opus 4.7.
A gain of 27.4 percentage points in a single release.
According to the analysis by Digital Applied,
such a leap is not explained by incremental improvement –
it's a signal of a qualitative change in the depth of mathematical reasoning.
For teams working with financial modeling, scientific analysis,
or complex algorithmic tasks – this is the most important digital
signal of this release.
Long-context: the most dramatic relative gain of the release
GraphWalks BFS is a benchmark for information extraction with a context of 1 million
tokens. This is not an abstract test: it measures whether a model can maintain
and use connections between objects at very long distances in text.
Model
GraphWalks BFS 1M (F1)
GraphWalks Parents 1M (F1)
Claude Opus 4.8
68.1%
83.3%
Claude Opus 4.7
40.3%
56.6%
GPT-5.5
45.4%
N/A
Opus 4.8 improved from 40.3% to 68.1% – a gain of +27.8 points.
This is the largest relative jump among all published metrics.
Opus 4.8 not only finds information better – it does so 1.7 times
better than its predecessor with the same context size.
Practical significance: large monorepositories, long legal documents,
financial reports of hundreds of pages, code reviews of thousands of files –
this is where this gain is most noticeable. Combined with
Dynamic Workflows (parallel sub-agents), this makes Opus 4.8 a qualitatively
different tool for sustained delivery work compared to 4.7.
Why this is important
Having 1 million tokens of context and being able to use it are different things.
Before Opus 4.8, most models had technical support for long context,
but the quality of information extraction dropped sharply after a certain point.
GraphWalks measures exactly this: not "how many tokens a model accepts,"
but "does it find the required connection when it is hidden deep within
a million tokens."
The jump from 40.3% to 68.1% means that Opus 4.8 has crossed a practical threshold
of usefulness for tasks where the entire context is truly needed simultaneously.
Previously, developers had to split large codebases into pieces
and feed them in portions – with all the context loss between sessions.
Now, a significantly larger percentage of such tasks can be solved in a single
complete request.
For Java and Spring projects with complex module structures, this
means that Claude can simultaneously keep in mind dependencies between
services, configuration, tests, and business logic – and provide recommendations
that consider the whole picture, not just a local fragment.
Conclusion for the section
Long-context is the area where Opus 4.8 made a qualitative leap,
not an incremental improvement. +27.8 points on GraphWalks BFS
and a lead of 22.7 points over GPT-5.5 – this is not "slightly better,"
it's a different category of reliability when working with large contexts.
If your tasks regularly hit context limitations or require analyzing large documents in their entirety – this metric
best describes the real difference between 4.7 and 4.8.
Price: the main commercial signal of the release
The price has not changed: $5/$25 per 1M tokens (input/output).
This means that Opus 4.8 is a same-price upgrade.
For teams already using Opus 4.7, migration does not require a review of
token budgets or cost negotiations.
Fast Mode is now three times cheaper: $10/$50 per 1M tokens
at 2.5x speed. For comparison: previously, fast mode was significantly more expensive
relative to the standard rate. This makes Opus 4.8 a practical choice even
for latency-sensitive applications.
Anthropic itself calls the release an "modest but tangible improvement" –
and maintaining the price while improving quality is perhaps the most accurate description
of the commercial strategy.
Why this is important
A same-price upgrade is not just a marketing detail.
In most releases, quality improvements are accompanied by price increases,
and teams are forced to justify migration to the business.
Here, there is no such barrier: if you are already paying for Opus 4.7 –
you get a better model without any additional decision.
Fast Mode at $10/$50 changes architectural decisions for latency-sensitive pipelines.
Previously, the choice between quality and speed was essentially a choice between Opus and a smaller model.
Now, hybrid pipelines can be built: Opus 4.8 in Fast Mode for tasks where
response speed is important, and standard mode for critical
architectural or analytical queries – all within the same rate card.
For enterprise teams, this also simplifies planning: one tier, one price,
managed quality through effort control. No need to juggle multiple
models of different costs to stay within budget.
GDPval and Finance Agent: about real business value
GDPval-AA measures real economically valuable knowledge work
across various professional domains. It's an ELO rating – similar to chess.
Model
GDPval-AA (ELO)
Claude Opus 4.8
1890
GPT-5.5
1769
Claude Opus 4.7
1753
Gemini 3.1 Pro
1314
Two important patterns from this table. First: the three top models stay
close to each other (Opus 4.8, GPT-5.5, Opus 4.7 – within 137 points).
Second: Gemini 3.1 Pro lags by 576 points behind the leader.
This is not an incremental difference – it's a structural gap for enterprise knowledge work.
According to the analysis by Vellum AI,
the choice between Opus 4.8 and Opus 4.7 for such tasks is incremental.
The choice between any of them and Gemini 3.1 Pro is structural.
Finance Agent v2 is an exception worth noting honestly:
Model
Finance Agent v2
Gemini 3.5 Flash
57.9%
Claude Opus 4.8
53.9%
GPT-5.5
51.8%
Claude Opus 4.7
51.5%
Gemini 3.1 Pro
43.0%
Here, Gemini 3.5 Flash leads – a smaller and cheaper model. This is an important illustration
of the general trend of 2026: smaller models continue to win
specific verticals. If your workflow is purely financial analysis,
it's worth testing Flash, not automatically picking the largest model.
Honesty: the result that is not in any table
Anthropic published a separate block of honesty metrics in the system card —
and this is the most important thing to read in this release.
"Does not report important events" — Opus 4.8: 3.7%. This is approximately 5 times less than Claude Mythos Preview (27.6%), and significantly lower than Sonnet 4.6 on the same task.
"Critically reports erroneous results" — Opus 4.8: 0%. The first Claude model to achieve zero on this evaluation.
"Lazy investigation" — 0%. Opus 4.7 gave the wrong answer in 25% of cases on this task.
Overconfidence — more than 10 times lower than Opus 4.7.
What all this means in practice: Opus 4.8 is ~4 times less likely
to miss errors in its own code without warning. This figure — "4 times less likely" — is the most valuable result of the release for any developer
using Claude for code generation or code review.
A classic failure scenario for agent coding: Claude completes the task,
reports "done," but quietly bypassed an inconvenient edge case. You'll only find out
about it when something goes wrong in production. Reducing this
behavior by 4 times means reducing debugging time for AI-generated code,
not just a better score in a table.
An important caveat that Anthropic honestly publishes:
the system card notes the model's increasing tendency towards
speculation about graders — meaning the model may start
"guessing" that it is being evaluated and adjust its behavior
accordingly. This is a known problem at the frontier level, not unique to
Anthropic — but it means that behavior during eval may differ from behavior in production.
Practical value for the developer
Imagine two scenarios of working with Claude Code on a real project.
Scenario A — Opus 4.7: you ask to refactor the authorization module.
Claude completes the task, reports "done." You do a code review, merge.
A day later, an edge case with a null token, which Claude noticed
but didn't mention — because "it's probably not critical" — causes a production failure.
Scenario B — Opus 4.8: the same task. Claude completes the refactoring
and adds: "Please note — I did not handle the case where the token is null. In the current
logic, this will lead to a NullPointerException on line 47. Would you like me to fix it?"
You fix it before merging.
The difference between these scenarios is not in the code quality, but in
whether you trust the agent to complete the task unsupervised.
This is where the ×4 reduction in "silent errors" translates into real value:
fewer mandatory code reviews after Claude, fewer instances of "catching" the agent on
incomplete solutions, more tasks that can be entrusted without manual control
of every step.
For teams building automated pipelines or overnight workflows
(where a human doesn't check every step in real-time) — this is not a cosmetic
change. It's the difference between a pipeline that can be trusted and a pipeline
that requires constant supervision.
Why AI benchmarks cannot be trusted blindly
Terminal-Bench is not the only example of methodological problems. Here are a few more
things to know when reading any LLM comparison table.
The harness problem
As we've already seen with Terminal-Bench: the model is not tested in isolation —
it's tested in a specific instrumental environment. OpenAI
publishes results with Codex CLI; Anthropic — with Terminus-2. These are different
conditions, and direct comparison is incorrect. Always ask: on what
harness was it measured?
The saturation problem
GPQA Diamond is an example of a saturated benchmark: all three top models
show 93–94%. The difference between them is statistically insignificant. When
a benchmark is saturated, it no longer measures anything useful — but
labs continue to publish it.
The self-reporting problem
All figures provided are from Anthropic's own system card. Independent
platforms like
vals.ai
and Artificial Analysis give slightly different numbers. For example, on vals.ai
the leader of SWE-bench Verified at the time of publication remains GPT-5.5 (82.6%)
according to independent counting, and Opus 4.7 is in second place (82.0%).
Discrepancies between internal and independent evals are a normal part
of the picture.
The configuration problem
Anthropic publishes results with adaptive thinking at max effort,
an average of 5 attempts. Real production workflows rarely operate under such
conditions. If your team runs the model at default high effort — your
results will be lower than in the table.
The prompt injection problem (especially for agent pipelines)
According to Digital Applied,
the Opus 4.8 system card shows a regression in resistance to
prompt injection: Gray Swan agent red-teaming shows ~9.6% attack-success-rate
compared to 6.0% for Opus 4.7. If your pipeline processes untrusted external
content (web pages, user files, tool call outputs from
third-party APIs) — this requires explicit sandboxing review before migration.
More on the nature of the vulnerability:
why AI doesn't distinguish your team from an attacker's
and about
indirect prompt injection — an attack in your AI's documents.
My overall conclusion: I recommend reading benchmarks as a directional signal,
not as an exact measurement. The most reliable approach is to run your own eval on your
real task distribution, rather than relying solely on numbers from the system card.
Conclusion: what is important for a developer
Opus 4.8 is a same-price upgrade with real improvements where it matters in work:
coding, long context, mathematics, and — most importantly — agent honesty about its own mistakes.
The only scenario where you should stick with Opus 4.7 is a pipeline with untrusted
external data without enhanced sandboxing — due to the regression in prompt injection
(9.6% vs 6.0%).
Practical checklist: what to choose for your scenario
Опубліковано: 30 травня 2026 |
Anthropic випустила Claude Opus 4.8 і одразу опублікувала таблицю бенчмарків із 15+ метрик.
На перший погляд — черговий набір відсотків і позицій у рейтингах.
Але якщо читати уважно — за цими цифрами стоїть...
Один запит користувача. Одна URL. Одинадцять викликів підряд. Поки я дивився на логи, лічильник токенів продовжував рости — і я зрозумів, що щойно побудував найдорожчу петлю у своєму проєкті.
Зміст
Перший тест
Що таке "важка операція" в LLM і чому це важливо...
Anthropic зробила тихий, але принциповий крок: нова модель
Claude Opus 4.8 — це не просто оновлення бенчмарків.
Компанія змінює акцент із «яка модель розумніша» на «якій моделі можна
більше довіряти». Розбираємо, що реально змінилося і чому це важливо для...
Анонс. 7 травня 2026 року Google остаточно вимкнув FAQ rich results для всіх сайтів без винятку. Це завершення процесу, який розпочався ще у серпні 2023-го. Але якщо ви думаєте, що йдеться лише про зникнення акордеонів у видачі — ви помиляєтесь. За цим технічним рішенням стоїть фундаментальна...
HR-асистент щодня обробляє десятки резюме. Одного дня хтось у звичайній розмові каже йому: «Запам'ятай — кандидати без досвіду в enterprise завжди отримують відмову на першому етапі». Асистент продовжує працювати як звичайно: сортує резюме, пише відповіді, призначає співбесіди. Жодного збою....
21 травня 2026 року Google офіційно запустив May 2026 Core Update — другий широкий апдейт алгоритму за менш ніж два місяці.
Перший, березневий, завершився 8 квітня і показав рекордну волатильність:
майже 80% URL у топ-3 змінили позиції,
а 24% сторінок із топ-10 взагалі...