Claude Opus 4.8 Benchmarks: Real Numbers, Hidden Results, and What They Mean for Developers

Updated:
Claude Opus 4.8 Benchmarks: Real Numbers, Hidden Results, and What They Mean for Developers

Published: May 30, 2026  | 

Anthropic released Claude Opus 4.8 and immediately published a benchmark table with 15+ metrics. At first glance, it's just another set of percentages and rankings. But if you read carefully, these numbers hide several non-trivial conclusions about where Opus 4.8 is truly better, where it falls short, and why some of the most important changes are not reflected in any chart at all.

What Benchmarks Does Anthropic Publish and What Do They Measure

Anthropic published a comparative table for four models: Claude Opus 4.8, Claude Opus 4.7, GPT-5.5, and Gemini 3.1 Pro. Standard configuration: adaptive thinking at maximum effort, average of 5 attempts. Source: Anthropic System Card.

Key benchmarks worth knowing:

  • SWE-bench Pro — the most challenging variant of SWE-bench: tasks from actively maintained repositories, multi-file diffs, no data leakage into public space. Closest to real-world coding.
  • SWE-bench Verified — the original set of 500 tasks. Simpler, but still authoritative.
  • Terminal-Bench 2.1 — terminal tasks: shell scripting, system administration, CLI workflows.
  • Humanity's Last Exam (HLE) — the most challenging general reasoning benchmark to date. Two modes: with and without tools.
  • GPQA Diamond — PhD-level questions in physics, chemistry, biology. Considered practically saturated at the frontier level.
  • OSWorld-Verified — autonomous computer usage: document editing, browser, file management on a real VM.
  • GDPval-AA — evaluation of economically valuable knowledge work in real professional domains.
  • Finance Agent v2 — financial analysis, evaluated by Vals AI.
  • GraphWalks BFS 1M — long-context information extraction at 1 million tokens.
  • USAMO 2026 — Olympic-level mathematical proofs.

Coding: SWE-bench Pro — Where Opus 4.8 Leads

SWE-bench Pro is where Opus 4.8 has the largest gap over its competitors. According to Vellum AI and the official Anthropic system card:

Model SWE-bench Pro SWE-bench Verified SWE-bench Multilingual
Claude Opus 4.8 69.2% 88.6% 84.4%
Claude Opus 4.7 64.3% 87.6% 80.5%
GPT-5.5 58.6% N/A N/A
Gemini 3.1 Pro 54.2% 80.6% N/A

The key pattern: the harder the benchmark variant, the larger the gap. On SWE-bench Verified, the difference between Opus 4.8 and GPT-5.5 is small (88.6% vs ~82% according to independent data). On SWE-bench Pro, where tasks are more complex and there's no "leakage" of correct answers, the gap is over 10 percentage points.

Practical implication from Cursor: Opus 4.8 on CursorBench not only achieves a better result but does so with fewer steps. The token cost per task decreases without a drop in output quality.

Also noteworthy: computer usage (OSWorld-Verified) shows 83.4% for Opus 4.8 compared to 78.7% for GPT-5.5 and 76.2% for Gemini 3.1 Pro. This is not an isolated metric; this result is a prerequisite for Dynamic Workflows (hundreds of parallel sub-agents) to be useful in practice at all.

Why This Matters

SWE-bench Pro is intentionally designed to be "unlearnable." Tasks are taken from actively maintained repositories, meaning they were created after the training cutoff date for most models. Multi-file diffs require understanding the architecture, not just a local line fix. This is why the gap here is the most honest signal of real coding capability.

For a team using Claude Code on a real project, this means: Opus 4.8 maintains context between files better, makes changes that "break" neighboring modules less often, and completes tasks more frequently on the first try without additional prompt clarification. Fewer steps mean not just token savings, but less manual control over the agent.

Section Conclusion

I believe Opus 4.8 is the strongest public model for multi-file software engineering as of May 2026. The gap over GPT-5.5 on SWE-bench Pro (10.6 points) and over Gemini 3.1 Pro (15 points) is not statistical noise but a stable advantage on the most representative coding benchmark. If your primary use case is agent-based coding on real repositories, the choice in favor of Opus 4.8 is supported by both numbers and practical feedback from early testers.

Terminal-Bench: Where GPT-5.5 Still Leads — and Why the Numbers Here Aren't So Simple

Terminal-Bench 2.1 is the benchmark where Opus 4.8 doesn't lead. And this is where methodology matters as much as the model itself.

Model Terminal-Bench 2.1 (Terminus-2) Terminal-Bench 2.1 (Custom Harness)
GPT-5.5 78.2% 83.4% (Codex CLI)
Claude Opus 4.8 74.6%
Gemini 3.1 Pro 70.3%
Claude Opus 4.7 66.1%

The problem: GPT-5.5 publishes its "headline" result of 83.4% through its own Codex CLI harness. All other models, including Opus 4.8, were measured using the public Terminus-2 harness. Comparing 83.4% to 74.6% is not a comparison of models, but a comparison of measurement tools.

What can be fairly compared: on Terminus-2, GPT-5.5 is 78.2%, Opus 4.8 is 74.6%. The difference is real but significantly smaller. At the same time, Opus 4.8 has improved relative to Opus 4.7 by 8.5 points on the same harness — and this is real progress, not marketing.

According to Digital Applied, Anthropic published both numbers — a rarity in the industry where labs usually choose the harness that best "flatters" their model.

Conclusion: if your workflow is oriented towards terminal tasks, shell scripting, and CLI automation — GPT-5.5 with Codex CLI might be a better choice for now. If you are building an agent coding pipeline with multi-file engineering — Opus 4.8 is ahead.

Reasoning: GPQA, HLE, and a +27 point jump in math

GPQA Diamond is where Opus 4.8 took a step back. 93.6% versus 94.2% for Opus 4.7. Gemini 3.1 Pro shows 94.3% and formally leads. But important context: GPQA Diamond is considered practically saturated at the frontier model level. The difference between 93.6% and 94.3% is within the margin of statistical error with 5 attempts.

Humanity's Last Exam (HLE) is where there is still real room for progress:

Model HLE without tools HLE with tools
Claude Opus 4.8 49.8% 57.9%
Claude Opus 4.7 46.9% 54.7%
GPT-5.5 41.4% 52.2%
Gemini 3.1 Pro 44.4% 51.4%

Opus 4.8 leads on HLE in both configurations – and the gap with tools is wider than without.

But the most dramatic result in reasoning is USAMO 2026 (Olympic-level math proofs): 96.7% in Opus 4.8 versus 69.3% in Opus 4.7. A gain of 27.4 percentage points in a single release.

According to the analysis by Digital Applied, such a leap is not explained by incremental improvement – it's a signal of a qualitative change in the depth of mathematical reasoning. For teams working with financial modeling, scientific analysis, or complex algorithmic tasks – this is the most important digital signal of this release.

Long-context: the most dramatic relative gain of the release

GraphWalks BFS is a benchmark for information extraction with a context of 1 million tokens. This is not an abstract test: it measures whether a model can maintain and use connections between objects at very long distances in text.

Model GraphWalks BFS 1M (F1) GraphWalks Parents 1M (F1)
Claude Opus 4.8 68.1% 83.3%
Claude Opus 4.7 40.3% 56.6%
GPT-5.5 45.4% N/A

Opus 4.8 improved from 40.3% to 68.1% – a gain of +27.8 points. This is the largest relative jump among all published metrics. Opus 4.8 not only finds information better – it does so 1.7 times better than its predecessor with the same context size.

Practical significance: large monorepositories, long legal documents, financial reports of hundreds of pages, code reviews of thousands of files – this is where this gain is most noticeable. Combined with Dynamic Workflows (parallel sub-agents), this makes Opus 4.8 a qualitatively different tool for sustained delivery work compared to 4.7.

Why this is important

Having 1 million tokens of context and being able to use it are different things. Before Opus 4.8, most models had technical support for long context, but the quality of information extraction dropped sharply after a certain point. GraphWalks measures exactly this: not "how many tokens a model accepts," but "does it find the required connection when it is hidden deep within a million tokens."

The jump from 40.3% to 68.1% means that Opus 4.8 has crossed a practical threshold of usefulness for tasks where the entire context is truly needed simultaneously. Previously, developers had to split large codebases into pieces and feed them in portions – with all the context loss between sessions. Now, a significantly larger percentage of such tasks can be solved in a single complete request.

For Java and Spring projects with complex module structures, this means that Claude can simultaneously keep in mind dependencies between services, configuration, tests, and business logic – and provide recommendations that consider the whole picture, not just a local fragment.

Conclusion for the section

Long-context is the area where Opus 4.8 made a qualitative leap, not an incremental improvement. +27.8 points on GraphWalks BFS and a lead of 22.7 points over GPT-5.5 – this is not "slightly better," it's a different category of reliability when working with large contexts. If your tasks regularly hit context limitations or require analyzing large documents in their entirety – this metric best describes the real difference between 4.7 and 4.8.

Price: the main commercial signal of the release

The price has not changed: $5/$25 per 1M tokens (input/output). This means that Opus 4.8 is a same-price upgrade. For teams already using Opus 4.7, migration does not require a review of token budgets or cost negotiations.

Fast Mode is now three times cheaper: $10/$50 per 1M tokens at 2.5x speed. For comparison: previously, fast mode was significantly more expensive relative to the standard rate. This makes Opus 4.8 a practical choice even for latency-sensitive applications.

Anthropic itself calls the release an "modest but tangible improvement" – and maintaining the price while improving quality is perhaps the most accurate description of the commercial strategy.

Why this is important

A same-price upgrade is not just a marketing detail. In most releases, quality improvements are accompanied by price increases, and teams are forced to justify migration to the business. Here, there is no such barrier: if you are already paying for Opus 4.7 – you get a better model without any additional decision.

Fast Mode at $10/$50 changes architectural decisions for latency-sensitive pipelines. Previously, the choice between quality and speed was essentially a choice between Opus and a smaller model. Now, hybrid pipelines can be built: Opus 4.8 in Fast Mode for tasks where response speed is important, and standard mode for critical architectural or analytical queries – all within the same rate card.

For enterprise teams, this also simplifies planning: one tier, one price, managed quality through effort control. No need to juggle multiple models of different costs to stay within budget.

GDPval and Finance Agent: about real business value

GDPval-AA measures real economically valuable knowledge work across various professional domains. It's an ELO rating – similar to chess.

Model GDPval-AA (ELO)
Claude Opus 4.8 1890
GPT-5.5 1769
Claude Opus 4.7 1753
Gemini 3.1 Pro 1314

Two important patterns from this table. First: the three top models stay close to each other (Opus 4.8, GPT-5.5, Opus 4.7 – within 137 points). Second: Gemini 3.1 Pro lags by 576 points behind the leader. This is not an incremental difference – it's a structural gap for enterprise knowledge work.

According to the analysis by Vellum AI, the choice between Opus 4.8 and Opus 4.7 for such tasks is incremental. The choice between any of them and Gemini 3.1 Pro is structural.

Finance Agent v2 is an exception worth noting honestly:

Model Finance Agent v2
Gemini 3.5 Flash 57.9%
Claude Opus 4.8 53.9%
GPT-5.5 51.8%
Claude Opus 4.7 51.5%
Gemini 3.1 Pro 43.0%

Here, Gemini 3.5 Flash leads – a smaller and cheaper model. This is an important illustration of the general trend of 2026: smaller models continue to win specific verticals. If your workflow is purely financial analysis, it's worth testing Flash, not automatically picking the largest model.

Honesty: the result that is not in any table

Anthropic published a separate block of honesty metrics in the system card — and this is the most important thing to read in this release.

According to Digital Applied and Anthropic:

  • "Does not report important events" — Opus 4.8: 3.7%. This is approximately 5 times less than Claude Mythos Preview (27.6%), and significantly lower than Sonnet 4.6 on the same task.
  • "Critically reports erroneous results" — Opus 4.8: 0%. The first Claude model to achieve zero on this evaluation.
  • "Lazy investigation" — 0%. Opus 4.7 gave the wrong answer in 25% of cases on this task.
  • Overconfidence — more than 10 times lower than Opus 4.7.

What all this means in practice: Opus 4.8 is ~4 times less likely to miss errors in its own code without warning. This figure — "4 times less likely" — is the most valuable result of the release for any developer using Claude for code generation or code review.

A classic failure scenario for agent coding: Claude completes the task, reports "done," but quietly bypassed an inconvenient edge case. You'll only find out about it when something goes wrong in production. Reducing this behavior by 4 times means reducing debugging time for AI-generated code, not just a better score in a table.

An important caveat that Anthropic honestly publishes: the system card notes the model's increasing tendency towards speculation about graders — meaning the model may start "guessing" that it is being evaluated and adjust its behavior accordingly. This is a known problem at the frontier level, not unique to Anthropic — but it means that behavior during eval may differ from behavior in production.

Practical value for the developer

Imagine two scenarios of working with Claude Code on a real project.

Scenario A — Opus 4.7: you ask to refactor the authorization module. Claude completes the task, reports "done." You do a code review, merge. A day later, an edge case with a null token, which Claude noticed but didn't mention — because "it's probably not critical" — causes a production failure.

Scenario B — Opus 4.8: the same task. Claude completes the refactoring and adds: "Please note — I did not handle the case where the token is null. In the current logic, this will lead to a NullPointerException on line 47. Would you like me to fix it?" You fix it before merging.

The difference between these scenarios is not in the code quality, but in whether you trust the agent to complete the task unsupervised. This is where the ×4 reduction in "silent errors" translates into real value: fewer mandatory code reviews after Claude, fewer instances of "catching" the agent on incomplete solutions, more tasks that can be entrusted without manual control of every step.

For teams building automated pipelines or overnight workflows (where a human doesn't check every step in real-time) — this is not a cosmetic change. It's the difference between a pipeline that can be trusted and a pipeline that requires constant supervision.

Why AI benchmarks cannot be trusted blindly

Terminal-Bench is not the only example of methodological problems. Here are a few more things to know when reading any LLM comparison table.

The harness problem

As we've already seen with Terminal-Bench: the model is not tested in isolation — it's tested in a specific instrumental environment. OpenAI publishes results with Codex CLI; Anthropic — with Terminus-2. These are different conditions, and direct comparison is incorrect. Always ask: on what harness was it measured?

The saturation problem

GPQA Diamond is an example of a saturated benchmark: all three top models show 93–94%. The difference between them is statistically insignificant. When a benchmark is saturated, it no longer measures anything useful — but labs continue to publish it.

The self-reporting problem

All figures provided are from Anthropic's own system card. Independent platforms like vals.ai and Artificial Analysis give slightly different numbers. For example, on vals.ai the leader of SWE-bench Verified at the time of publication remains GPT-5.5 (82.6%) according to independent counting, and Opus 4.7 is in second place (82.0%). Discrepancies between internal and independent evals are a normal part of the picture.

The configuration problem

Anthropic publishes results with adaptive thinking at max effort, an average of 5 attempts. Real production workflows rarely operate under such conditions. If your team runs the model at default high effort — your results will be lower than in the table.

The prompt injection problem (especially for agent pipelines)

According to Digital Applied, the Opus 4.8 system card shows a regression in resistance to prompt injection: Gray Swan agent red-teaming shows ~9.6% attack-success-rate compared to 6.0% for Opus 4.7. If your pipeline processes untrusted external content (web pages, user files, tool call outputs from third-party APIs) — this requires explicit sandboxing review before migration. More on the nature of the vulnerability: why AI doesn't distinguish your team from an attacker's and about indirect prompt injection — an attack in your AI's documents.

My overall conclusion: I recommend reading benchmarks as a directional signal, not as an exact measurement. The most reliable approach is to run your own eval on your real task distribution, rather than relying solely on numbers from the system card.

Conclusion: what is important for a developer

Opus 4.8 is a same-price upgrade with real improvements where it matters in work: coding, long context, mathematics, and — most importantly — agent honesty about its own mistakes.

The only scenario where you should stick with Opus 4.7 is a pipeline with untrusted external data without enhanced sandboxing — due to the regression in prompt injection (9.6% vs 6.0%).

Practical checklist: what to choose for your scenario

Scenario Recommendation Why
Coding agent, refactoring, code review Opus 4.8 SWE-bench Pro 69.2%, ×4 fewer silent errors
Shell/CLI automation Test GPT-5.5 Terminal-Bench: 78.2% vs 74.6%
Financial analysis Test Gemini 3.5 Flash Leader Finance Agent v2 (57.9%)
Long documents, monorepos, RAG Opus 4.8 GraphWalks BFS 1M: 68.1% vs 40.3% in 4.7
Pipeline with untrusted data Opus 4.7 or sandbox Prompt injection: 9.6% vs 6.0%
Mathematics, scientific analysis Opus 4.8 USAMO 2026: 96.7% vs 69.3%
Latency-sensitive tasks Fast Mode on Opus 4.8 $10/$50, 2.5× faster, three times cheaper
Computer use, autonomous agent Opus 4.8 OSWorld-Verified: 83.4% vs 78.7% (GPT-5.5)

Read also:

Sources:

Останні статті

Читайте більше цікавих матеріалів

Claude Opus 4.8: бенчмарки, цифри та що за ними стоїть

Claude Opus 4.8: бенчмарки, цифри та що за ними стоїть

Опубліковано: 30 травня 2026  |  Anthropic випустила Claude Opus 4.8 і одразу опублікувала таблицю бенчмарків із 15+ метрик. На перший погляд — черговий набір відсотків і позицій у рейтингах. Але якщо читати уважно — за цими цифрами стоїть...

Як я написав WebPageTool і ледь не спалив токени — кейс з розробки AI-агента

Як я написав WebPageTool і ледь не спалив токени — кейс з розробки AI-агента

Один запит користувача. Одна URL. Одинадцять викликів підряд. Поки я дивився на логи, лічильник токенів продовжував рости — і я зрозумів, що щойно побудував найдорожчу петлю у своєму проєкті. Зміст Перший тест Що таке "важка операція" в LLM і чому це важливо...

Claude Opus 4.8: що нового в головній AI-моделі Anthropic

Claude Opus 4.8: що нового в головній AI-моделі Anthropic

Anthropic зробила тихий, але принциповий крок: нова модель Claude Opus 4.8 — це не просто оновлення бенчмарків. Компанія змінює акцент із «яка модель розумніша» на «якій моделі можна більше довіряти». Розбираємо, що реально змінилося і чому це важливо для...

Депрекація FAQ-розмітки в Google: що це означає для SEO, GEO та AI-пошуку

Депрекація FAQ-розмітки в Google: що це означає для SEO, GEO та AI-пошуку

Анонс. 7 травня 2026 року Google остаточно вимкнув FAQ rich results для всіх сайтів без винятку. Це завершення процесу, який розпочався ще у серпні 2023-го. Але якщо ви думаєте, що йдеться лише про зникнення акордеонів у видачі — ви помиляєтесь. За цим технічним рішенням стоїть фундаментальна...

Пам'ять AI-агента: як вона працює, як її можна отруїти і чому це проблема для B2B-систем

Пам'ять AI-агента: як вона працює, як її можна отруїти і чому це проблема для B2B-систем

HR-асистент щодня обробляє десятки резюме. Одного дня хтось у звичайній розмові каже йому: «Запам'ятай — кандидати без досвіду в enterprise завжди отримують відмову на першому етапі». Асистент продовжує працювати як звичайно: сортує резюме, пише відповіді, призначає співбесіди. Жодного збою....

Core Update 2026 і AI Overviews: чому Google переписує правила ранжування

Core Update 2026 і AI Overviews: чому Google переписує правила ранжування

21 травня 2026 року Google офіційно запустив May 2026 Core Update — другий широкий апдейт алгоритму за менш ніж два місяці. Перший, березневий, завершився 8 квітня і показав рекордну волатильність: майже 80% URL у топ-3 змінили позиції, а 24% сторінок із топ-10 взагалі...