Static analysis and linters have existed for decades — and still miss bugs that make it into production. With the advent of AI code generation, the problem has intensified: the volume of diffs is growing, while verification tools remain the same.
Spoiler: Claude Code Review solves this problem through a multi-agent architecture — several specialized agents analyze code in parallel, verify findings, and rank them by criticality. But there are nuances worth knowing before you integrate this into your pipeline.
⚡ TLDR
- ✅ Architecture: parallel specialized agents + verification step + deduplication and ranking by criticality
- ✅ Focus: exclusively logical errors — not style, not formatting
- ✅ Vs SonarQube/ESLint: understands context and business logic, not just patterns; less than 1% false positives vs 3–15% in classic tools
- ✅ Code Review ≠ Claude Code Security: these are two different products with different tasks
- ⚠️ Limitations: GitHub only, Teams/Enterprise only, ~$15–25 per PR, ~20 min per review
- 👇 Below — detailed architecture, tool comparison, and real use cases
🎯 Problem: Why classic review can't handle AI code
Short answer: verification tools don't scale with AI generation
ESLint and SonarQube are excellent at finding what they were designed for — syntax errors, known vulnerability patterns, style violations. But they don't understand business logic, don't see dependencies between files, and can't answer the question "will this change break authentication in the adjacent module?". These very questions became critical when AI started generating hundreds of lines of changes per minute.
«If an AI agent is forced to "fix" a non-existent problem flagged by a noisy analysis tool, it can introduce real bugs or enter a cycle of unnecessary changes» — SonarSource, February 2026.
A new category of problems has emerged that neither ESLint nor SonarQube solve by design. Classic SAST tools (Static Application Security Testing) are based on pattern matching: there's a set of rules, there's code, there's a match or no match. This is deterministic, auditable, and fast — but blind to context.
What classic tools miss
There's a class of errors that cannot be found without understanding how the entire system works: race conditions between two asynchronous handlers, edge cases in authorization business logic, webhook handler idempotency, double billing in a payment flow. ESLint checks JavaScript syntax and catches typical errors like undefined variables. SonarQube analyzes 35+ languages and finds known vulnerability patterns (SQL injection, XSS), but, according to the company's own data, even after 15 years of development, false positives account for 3.2% of all findings — not bad for pattern matching, but problematic at the scale of thousands of PRs.
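To make the idempotency example above concrete, here is a hypothetical webhook handler (the `charge_customer` helper and event shape are invented for illustration). It is syntactically clean and passes any linter, yet double-bills the customer on a routine provider retry:

```python
# Illustrative only: a lint-clean handler with a logic bug that
# pattern matching cannot see. All names here are hypothetical.
charges: list[tuple[str, int]] = []

def charge_customer(customer_id: str, amount: int) -> None:
    charges.append((customer_id, amount))

processed: set[str] = set()

def handle_webhook(event: dict) -> None:
    """Passes every style and syntax check, but is not idempotent."""
    charge_customer(event["customer_id"], event["amount"])  # BUG: runs on retries too

def handle_webhook_idempotent(event: dict) -> None:
    """Records the provider's event ID and skips duplicate deliveries."""
    if event["id"] in processed:
        return
    processed.add(event["id"])
    charge_customer(event["customer_id"], event["amount"])

# Payment providers routinely redeliver the same event:
event = {"id": "evt_1", "customer_id": "cus_42", "amount": 999}
handle_webhook(event)
handle_webhook(event)
print(len(charges))  # 2: the customer is billed twice
```

No rule-based tool flags the first handler, because nothing about it is syntactically wrong; the bug only exists relative to the delivery semantics of the webhook contract.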
The problem with AI-generated code
AI writes syntactically correct and stylistically clean code — linters are satisfied. But logical errors in AI code have a specific nature: the model can confidently implement a function that is technically correct but architecturally incompatible with the contract of an adjacent service. According to IBM Research 2026 (AAAI paper), LLM-as-Judge alone detects only ~45% of errors in code. The combination of LLM with deterministic analysis tools raises the figure to 94%. This very combination forms the basis of the Claude Code Review architecture.
- ✔️ ESLint: JS/TS syntax and style, doesn't see inter-file dependencies
- ✔️ SonarQube: 35+ languages, known vulnerability patterns, ~3.2% false positives
- ✔️ Claude Code Review: context-dependent analysis, understands business logic, less than 1% false positives
- ⚠️ Compromise: 20 min for review instead of seconds for a linter
Conclusion: Claude Code Review solves a problem that classic SAST tools cannot solve due to their architecture — context-dependent analysis of logical errors in large diffs.
📌 Architecture: Parallel agents and their roles
Short answer: specialization instead of generality
Instead of a single agent reading the entire diff top-down, the system launches several specialized agents in parallel. Each analyzes the diff and surrounding code with a different focus. The results are then combined, deduplicated, and ranked. According to Anthropic's official documentation, all processing occurs on Anthropic's infrastructure, not on the client side.
«Multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue» — Claude Code Docs.
The architectural decision — agent specialization — solves a specific problem. A single agent given too broad a task loses accuracy: it either generates noise or misses subtle errors. Narrow specialization provides depth. An analogy with human teams: a security engineer will find an authentication vulnerability that a backend developer would miss, simply because the former thinks in terms of threat models, and the latter in terms of functionality.
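The fan-out/merge flow described above can be sketched schematically. This is not Anthropic's implementation (the real agents run on Anthropic's infrastructure and their internals are unpublished); the agent bodies and finding shape are stand-ins:

```python
# Schematic sketch: parallel specialized agents, then merge,
# deduplicate, and rank. Agent internals are invented stand-ins.
from concurrent.futures import ThreadPoolExecutor

SEVERITY_RANK = {"red": 0, "yellow": 1, "purple": 2}

def security_agent(diff: str) -> list[dict]:
    return [{"file": "auth.py", "line": 10, "severity": "red",
             "message": "token not validated"}]

def logic_agent(diff: str) -> list[dict]:
    return [{"file": "billing.py", "line": 52, "severity": "yellow",
             "message": "retry can double-charge"},
            {"file": "auth.py", "line": 10, "severity": "red",
             "message": "token not validated"}]  # duplicate of the security finding

def review(diff: str, agents) -> list[dict]:
    # 1) Run every specialized agent on the same diff in parallel.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda agent: agent(diff), agents))
    # 2) Merge and deduplicate by (file, line, message).
    seen, merged = set(), []
    for finding in (f for batch in batches for f in batch):
        key = (finding["file"], finding["line"], finding["message"])
        if key not in seen:
            seen.add(key)
            merged.append(finding)
    # 3) Rank by criticality: red before yellow before purple.
    return sorted(merged, key=lambda f: SEVERITY_RANK[f["severity"]])

findings = review("...diff...", [security_agent, logic_agent])
print([f["severity"] for f in findings])  # ['red', 'yellow']
```

The point of the sketch is the shape of the pipeline: independent passes over the same input, with overlap resolved only at the merge step.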
What each agent analyzes
Anthropic does not publish a complete list of agent types, but from documentation and public materials, it is known that the system looks for: logical bugs and edge cases, security vulnerabilities, regressions in adjacent code (latent bugs in adjacent files), API contract violations, and authentication/authorization issues. A key detail: agents analyze not only the diff but also the surrounding code — files that the PR touched but did not change. That's where latent bugs hide.
Parallelism as an architectural answer to the self-review problem
There's a fundamental problem of "AI reviewing AI code": if one agent wrote erroneous code, it might have the same blind spots when reviewing it. The multi-agent architecture partially solves this. Independent specialized agents with different scopes analyze the same code in parallel, reducing the likelihood that a shared architectural bias will affect all findings simultaneously. But "partially" is an important word: agents from the same provider can still have shared architectural blind spots.
- ✔️ Parallel agent execution reduces waiting time
- ✔️ Specialization increases accuracy in each class of errors
- ✔️ Analysis of surrounding code catches latent bugs outside the diff
- ⚠️ Agents from the same provider may have shared architectural blind spots
Parallel specialization is an architectural bet on depth over speed, and it is justified precisely for the class of logical errors where context matters more than turnaround time.
📌 Verification and filtering of false positives
The agent first tries to disprove its finding
The verification step is a key distinction from simply "run the model and show the result." After an agent finds a potential issue, the system checks the finding against the actual code behavior, attempting to disprove it. Only findings that pass verification make it into the final report. According to Anthropic, less than 1% of findings are rejected by developers as irrelevant.
«A verification step checks candidates against actual code behavior to filter out false positives» — Claude Code Docs.
For comparison: an early wave of AI code review tools had a signal-to-noise problem — for every real bug, they generated roughly nine false positives. This isn't just an inconvenience; it destroys trust: developers started ignoring all warnings and, as a result, missed real problems. Anthropic addressed this through the verification step, and the result is reflected in the numbers.
How verification works
Anthropic does not disclose implementation details, but from the documentation, it is known that the verification agent compares the finding with the actual code behavior. In practice, this means: if an agent flagged a line as a potential SQL injection, the validator checks whether this data actually enters the query without sanitization in the specific call context. This fundamentally differs from pattern matching — SonarQube might flag any string concatenation with an SQL query, even if the data has already been processed higher up the stack.
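Since the implementation is undisclosed, the falsification idea can only be sketched. In this toy version (the finding shape and `sanitized` flag are illustrative assumptions), a verifier tries to disprove each candidate against its concrete call context, and only survivors reach the report:

```python
# Schematic only: verification as an attempt to DISPROVE a candidate.
# The finding fields below are invented for illustration.
def disproved(finding: dict) -> bool:
    # A flagged SQL injection is a false positive if the value was
    # already sanitized higher up the stack in this specific context.
    return finding["kind"] == "sql_injection" and finding["sanitized"]

candidates = [
    {"kind": "sql_injection", "line": 10, "sanitized": True},   # noise
    {"kind": "sql_injection", "line": 52, "sanitized": False},  # real issue
]

report = [f for f in candidates if not disproved(f)]
print([f["line"] for f in report])  # [52]
```

A pure pattern matcher would emit both candidates; the verification pass is what separates "string concatenated into SQL" from "tainted data actually reaches the query."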
Criticality marking system
After verification, findings are deduplicated and ranked. Marking is done by color: red — critical, requires immediate attention; yellow — potential issue, worth reviewing; purple — existing issue in old code near changes (latent bug). The purple marker is especially important: it tells the reviewer "the PR itself is OK, but you've touched a dangerous area."
In addition, each finding includes a collapsible extended reasoning section — the reviewer can expand it to see why the agent flagged the issue and how it was verified. This allows for informed decision-making, rather than blindly trusting or ignoring.
- ✔️ Less than 1% of findings are rejected as false positives (Anthropic internal data)
- ✔️ Verification checks the finding against actual code behavior
- ✔️ Three criticality levels: red / yellow / purple
- ✔️ Extended reasoning — transparency for the reviewer
Conclusion: The verification step is what distinguishes "running a model" from "running a reliable tool"; less than 1% false positives is not a marketing figure, but a condition for product survival.
📌 GitHub Integration: How it works in the pipeline
An admin installs the Anthropic GitHub App — and the review starts automatically
Integration happens via the official GitHub App from Anthropic. The administrator connects repositories and specifies which branches to run reviews on. Then everything is automatic: a PR is opened — agents are launched — results arrive as inline comments and a general overview directly on the PR page in GitHub.
«Results are deduplicated, ranked by severity, and posted as inline comments on the specific lines where issues were found» — Claude Code Docs.
From a reviewer's workflow perspective, it looks like this: you go to a PR, see one generalized comment from Claude with a summary of findings — and then inline annotations on specific lines. There's no separate dashboard, no switching to another interface. This is important: tools that output results in a separate system are often ignored.
Trigger conditions and cost implications
By default, the review runs when a PR is opened. But there's an important detail from the documentation: if configured for «after every push», the review will run on every commit, multiplying the cost by the number of pushes. For teams with active feature development, this can significantly increase the monthly bill. The recommended configuration for most teams is only upon PR opening and for significant updates.
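As back-of-envelope arithmetic (using the typical per-review range quoted in this article; actual billing is token-based and varies with PR size, and the push count is an assumption):

```python
# Back-of-envelope only: "after every push" multiplies per-PR cost
# by the number of pushes. Push count is an assumed example value.
cost_low, cost_high = 15, 25    # USD per review, typical range
pushes_per_pr = 6               # assumed active feature branch

on_open_only = (cost_low, cost_high)
on_every_push = (cost_low * pushes_per_pr, cost_high * pushes_per_pr)
print(on_every_push)  # (90, 150): same PR, six times the cost
```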
What is not supported at launch
An important limitation to be aware of before connecting: GitHub only. GitLab, Azure DevOps, and Bitbucket are not supported in the research preview. For teams using GitLab CI/CD or Azure DevOps pipelines, Code Review is not yet available. Anthropic offers an alternative — GitHub Actions or GitLab CI/CD integration via a self-hosted approach, but this requires custom configuration and is not the same managed service.
- ✔️ Integration: Anthropic GitHub App, configured by administrator
- ✔️ Results: inline comments + general overview directly on the PR in GitHub
- ✔️ Cost control: monthly limit via claude.ai/admin-settings/usage
- ⚠️ GitHub only — GitLab, Bitbucket, Azure DevOps are not supported
- ⚠️ «After every push» multiplies the cost by the number of commits
GitHub integration is native and seamless, but the GitHub-only limitation significantly narrows the audience; GitLab teams are currently cut off from the managed service.
📌 Code Review vs Claude Code Security: Different tasks
Code Review is about PR quality; Security is about scanning the entire codebase
Claude Code Review automatically analyzes each pull request in real-time. Claude Code Security is a separate product that scans the entire codebase for vulnerabilities and suggests patches. These are different tools with different trigger conditions, different audiences, and different analysis scopes.
«Claude Code Security scans codebases for security vulnerabilities and suggests targeted software patches for human review, allowing teams to find and fix security issues that traditional methods often miss» — Anthropic, official announcement.
Confusion between these two products is natural — both live under the Claude Code brand and both are related to code security. But their tasks are fundamentally different.
Claude Code Review: PR-level, real-time
Runs automatically on every PR. Analyzes the diff and surrounding code. Focus — logical errors, bugs, regressions. Result — inline comments on the PR. Time — ~20 minutes. Audience — developer and reviewer. Goal — prevent issues from reaching main.
Claude Code Security: codebase, deep scanning
Claude Code Security is a separate product in limited research preview. It scans the entire codebase, finds vulnerabilities that have "sat" in the code for years, and proposes specific patches. According to Anthropic, using Claude Opus 4.6, the team found over 500 vulnerabilities in production open-source codebases — bugs that had gone undetected for years despite continuous manual audits. Results appear in the Claude Code Security dashboard, where the security team reviews findings, examines proposed patches, and approves them. Nothing is applied automatically. The audience is security engineers and security leads, not developers.
There's also a third option: the /security-review command
There's also the /security-review command and GitHub Action — a lighter option for security analysis, available to all Claude Code users (including Pro/Max). It checks for known vulnerability patterns: SQL injection, XSS, authentication issues, unsafe data handling. This is closer to classic SAST, but with LLM explanations.
| Characteristic | Code Review | Code Security | /security-review |
|---|---|---|---|
| Trigger | PR opening | Ad hoc or scheduled | Manual launch |
| Scope | Diff + surrounding code | Entire codebase | Current directory |
| Focus | Logical bugs, regressions | Security vulnerabilities | Known security patterns |
| Access | Teams + Enterprise | Limited preview | All paid plans |
| Audience | Developer, reviewer | Security engineer | Developer |
Code Review and Code Security are not competitors but complements: the first catches bugs before merge, the second finds what has already been living in production code for years.
📌 Architecture Cost: Tokens, scale, limitations
$15–25 per PR — but the correct unit of measurement is not the PR, but the production incident
The cost is token-based: a larger and more complex PR costs more. The typical range is $15–25 per review. With the "after every push" configuration, the cost multiplies by the number of commits. There is a monthly spending cap. For a team of 50 developers with 20 PRs per day, this is ~$300–500 per day or $9–15K per month — and this needs to be weighed against the real cost of a missed bug.
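The team-scale estimate above is simple arithmetic (assuming the review fires once per PR, with no per-push multiplication, and a 30-day month):

```python
# Reproducing the article's estimate for 50 developers, 20 PRs/day.
cost_low, cost_high = 15, 25     # USD per review
prs_per_day = 20
days = 30

daily = (prs_per_day * cost_low, prs_per_day * cost_high)
monthly = (daily[0] * days, daily[1] * days)
print(daily, monthly)  # (300, 500) (9000, 15000)
```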
«Reviews scale in cost with PR size and complexity, completing in 20 minutes on average» — Claude Code Docs.
A price comparison with competitors looks like this: CodeRabbit — $12/user/month for unlimited reviews; Greptile — $30/developer/month; GitHub Copilot code review — included in subscription from $10/month. At first glance, $15–25 for one PR seems expensive. But this comparison is in the wrong units.
The correct unit of comparison — incident cost
Anthropic deliberately positions Code Review as an "insurance product," not a productivity tool. During internal testing, the tool caught a specific bug: a single-line change in a production service was about to break the authentication mechanism of the entire service. The cost of this incident in production — several hours of downtime, SRE team work, potential data loss, reputational risks. One such incident costs more than a month of Code Review for an average team.
Limitations to be aware of
At the current research preview stage, there are several important limitations. First, GitHub only: GitLab, Bitbucket, and Azure DevOps are not supported. Second, Teams and Enterprise only: individual developers on Pro and Max plans do not have access to the managed Code Review service. Third, there is no publicly confirmed list of supported languages: Claude Code traditionally works well with Python, JavaScript, TypeScript, and Go, but support for specific enterprise languages may be limited.
- ✔️ Price: ~$15–25 per PR, token-based model
- ✔️ Spending cap: configurable via claude.ai/admin-settings/usage
- ✔️ Analytics: weekly cost chart and per-repo average cost in admin settings
- ⚠️ «After every push» multiplies the cost — careful configuration is recommended
- ⚠️ GitHub only, Teams/Enterprise only
- ⚠️ For 50 developers with 20 PRs/day: ~$9–15K/month
Section Conclusion: The math works out for enterprise teams where the cost of a missed bug is higher than the cost of a review — and doesn't work out for small teams or teams with a low-risk profile.
📌 CLAUDE.md and REVIEW.md: Customizing review rules
Two files with different roles — one about context, the other about priorities
CLAUDE.md describes the system: architecture, conventions, dependencies, project specifics. REVIEW.md defines review priorities: what is critical for your team, what to pay special attention to. This separation allows agents to be configured for a specific tech stack and team culture without changing the overall system logic.
«CLAUDE.md tells the agents how your system is shaped. REVIEW.md tells them what to care about during review» — DEV.to, practical breakdown.
For a senior developer and tech lead, this is the most interesting part of the architecture. The system is not a "black box" with fixed rules — it adapts to your context. CLAUDE.md and REVIEW.md are a way to convey domain knowledge about your system to the agents, which they otherwise have no way of obtaining.
What to write in CLAUDE.md
CLAUDE.md is an "AI manual" about your project. It should include: architectural patterns and data flow, specific codebase conventions, dependencies between modules, infrastructure specifics (e.g., "this service is stateless, this one is stateful"), known technical debt, and legacy parts where changes are particularly risky.
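A minimal illustrative fragment of such a file might look like this (every module name and detail below is hypothetical, invented to show the kind of context the agents can use):

```markdown
# CLAUDE.md (illustrative fragment; all names are hypothetical)

## Architecture
- `billing-api` is stateless; `session-store` is stateful (Redis).
- All invoice writes go through `BillingService`; raw SQL is forbidden.

## Conventions
- Every queue handler must be idempotent; redelivery is expected.

## Known risk areas
- `legacy/auth/` predates the permissions refactor; changes there are high-risk.
```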
What to write in REVIEW.md
REVIEW.md contains your review priorities. Here's an example from a real configuration:
```markdown
# REVIEW.md
Prioritize comments about:
- authorization regressions across admin and customer paths
- idempotency in webhook handlers
- missing transaction boundaries on billing writes
- async jobs that can double-send emails, refunds, or notifications
```
Such a REVIEW.md tells the agents: "we are not interested in stylistic remarks — we are interested in precisely these categories of problems, because they are the most expensive for our business."
Customizing /security-review
The /security-review command also supports customization — you can set specific rules for the codebase and adjust sensitivity for different types of vulnerabilities. More details are available in the official documentation.
- ✔️ CLAUDE.md: architectural context, conventions, dependencies
- ✔️ REVIEW.md: review priorities, specific to your domain
- ✔️ A correctly configured REVIEW.md reduces noise and increases the relevance of findings
- ✔️ These files are also used by other Claude Code tools — not just review
Conclusion: CLAUDE.md and REVIEW.md are where a tech lead can encode domain knowledge about the system and convey it to the agents; the more accurate the context, the more relevant the findings.
❓ Frequently Asked Questions (FAQ)
Can Claude Code Review replace SonarQube?
No, and these tools are not intended to compete. SonarQube is deterministic, auditable, and suitable for compliance requirements in regulated industries (BFSI, healthcare, aerospace). Claude Code Review is context-dependent, better at finding logical errors in complex diffs, but does not provide deterministic guarantees and is not suitable as a sole tool for compliance. The optimal strategy for enterprise: both in parallel.
How does Claude Code Review compare to CodeRabbit or Greptile?
CodeRabbit ($12/user/month) supports GitHub, GitLab, Bitbucket, and Azure DevOps — a significant platform advantage. Greptile ($30/developer/month) indexes the entire repository in advance and provides the deepest contextual analysis, but has the highest level of false positives. Claude Code Review is the newest player with the cleanest signal (less than 1% false positives), but is limited to GitHub and lacks CodeRabbit's track record (2M+ repositories).
What happens to the code sent for review?
Processing occurs on Anthropic's infrastructure. For Enterprise clients, standard Enterprise plan confidentiality terms apply. For Teams clients — Teams plan terms apply. Anthropic does not publish detailed guarantees regarding code storage within the research preview. Companies with strict data residency requirements should carefully read the terms before connecting.
Is there a self-hosted option?
The managed Code Review service is only available through Anthropic's infrastructure; self-hosted is not supported. However, there is an alternative: an open-source Claude Code GitHub Action, which can be run in your own CI pipeline. This is less automated but provides more control over where the code is processed.
✅ Conclusions
- 🔹 The multi-agent architecture solves a specific problem — context-dependent analysis of logical errors, which is unavailable to classic SAST tools due to their architecture
- 🔹 The verification step and less than 1% false positives — a key distinction from the previous generation of AI reviews
- 🔹 Code Review and Code Security — different products with different scopes; the first for PR-level, the second for deep security scanning of the entire codebase
- 🔹 REVIEW.md gives tech leads a way to encode domain knowledge — and this is the most important customization point
- 🔹 Limitations are real: GitHub only, Teams/Enterprise only, $15–25 per PR, lack of track record outside Anthropic
Main takeaway:
Claude Code Review is not a replacement for linters and not a competitor to SonarQube; it's a new layer in the code quality pipeline that addresses a class of errors that previously required a careful human reviewer — and is justified precisely to the extent that the cost of a missed bug exceeds the cost of the review.
⸻
Read also:
← Anthropic launched AI code review: what changed in 2026 (news overview without technical details)
Keywords:
Claude API, Anthropic, Dynamic Filtering, Web Search Tool, AI agents, LLM architecture, RAG vs Web Search, Python Sandbox, token optimization, Claude 4.6 Opus, Sonnet 4.6