Static analysis and linters have existed for decades — and still miss bugs that make it into production. With the advent of AI code generation, the problem has intensified: the volume of diffs is growing, while verification tools remain the same.
Spoiler: Claude Code Review solves this problem through a multi-agent architecture — several specialized agents analyze code in parallel, verify findings, and rank them by criticality. But there are nuances worth knowing before you integrate this into your pipeline.
⚡ TLDR
- ✅ Architecture: parallel specialized agents + verification step + deduplication and ranking by criticality
- ✅ Focus: exclusively logical errors — not style, not formatting
- ✅ Vs SonarQube/ESLint: understands context and business logic, not just patterns; less than 1% false positives vs 3–15% in classic tools
- ✅ Code Review ≠ Claude Code Security: these are two different products with different tasks
- ⚠️ Limitations: GitHub only, Teams/Enterprise only, ~$15–25 per PR, ~20 min per review
- 👇 Below — detailed architecture, tool comparison, and real use cases
🎯 Problem: Why classic review can't handle AI code
Short answer: verification tools don't scale with AI generation
ESLint and SonarQube are excellent at finding what they were designed for — syntax errors, known vulnerability patterns, style violations. But they don't understand business logic, don't see dependencies between files, and can't answer the question "will this change break authentication in the adjacent module?". These very questions became critical when AI started generating hundreds of lines of changes per minute.
«If an AI agent is forced to "fix" a non-existent problem flagged by a noisy analysis tool, it can introduce real bugs or enter a cycle of unnecessary changes» — SonarSource, February 2026.
A new category of problems has emerged that neither ESLint nor SonarQube solve by design. Classic SAST tools (Static Application Security Testing) are based on pattern matching: there's a set of rules, there's code, there's a match or no match. This is deterministic, auditable, and fast — but blind to context.
What classic tools miss
There's a class of errors that cannot be found without understanding how the entire system works: race conditions between two asynchronous handlers, edge cases in authorization business logic, webhook handler idempotency, double billing in a payment flow. ESLint checks JavaScript syntax and catches typical errors like undefined variables. SonarQube analyzes 35+ languages and finds known vulnerability patterns (SQL injection, XSS), but, according to the company's own data, even after 15 years of development, false positives account for 3.2% of all findings — not bad for pattern matching, but problematic at the scale of thousands of PRs.
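To make the idempotency example above concrete, here is a hypothetical webhook handler (the `charge_customer` helper and event shape are invented for illustration). It is syntactically clean and passes any linter, yet double-bills the customer on a routine provider retry:

```python
# Illustrative only: a lint-clean handler with a logic bug that
# pattern matching cannot see. All names here are hypothetical.
charges: list[tuple[str, int]] = []

def charge_customer(customer_id: str, amount: int) -> None:
    charges.append((customer_id, amount))

processed: set[str] = set()

def handle_webhook(event: dict) -> None:
    """Passes every style and syntax check, but is not idempotent."""
    charge_customer(event["customer_id"], event["amount"])  # BUG: runs on retries too

def handle_webhook_idempotent(event: dict) -> None:
    """Records the provider's event ID and skips duplicate deliveries."""
    if event["id"] in processed:
        return
    processed.add(event["id"])
    charge_customer(event["customer_id"], event["amount"])

# Payment providers routinely redeliver the same event:
event = {"id": "evt_1", "customer_id": "cus_42", "amount": 999}
handle_webhook(event)
handle_webhook(event)
print(len(charges))  # 2: the customer is billed twice
```

No rule-based tool flags the first handler, because nothing about it is syntactically wrong; the bug only exists relative to the delivery semantics of the webhook contract.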
The problem with AI-generated code
AI writes syntactically correct and stylistically clean code — linters are satisfied. But logical errors in AI code have a specific nature: the model can confidently implement a function that is technically correct but architecturally incompatible with the contract of an adjacent service. According to IBM Research 2026 (AAAI paper), LLM-as-Judge alone detects only ~45% of errors in code. The combination of LLM with deterministic analysis tools raises the figure to 94%. This very combination forms the basis of the Claude Code Review architecture.
- ✔️ ESLint: JS/TS syntax and style, doesn't see inter-file dependencies
- ✔️ SonarQube: 35+ languages, known vulnerability patterns, ~3.2% false positives
- ✔️ Claude Code Review: context-dependent analysis, understands business logic, less than 1% false positives
- ⚠️ Compromise: 20 min for review instead of seconds for a linter
Conclusion: Claude Code Review solves a problem that classic SAST tools cannot solve due to their architecture — context-dependent analysis of logical errors in large diffs.
📌 Architecture: Parallel agents and their roles
Short answer: specialization instead of generality
Instead of a single agent reading the entire diff top-down, the system launches several specialized agents in parallel. Each analyzes the diff and surrounding code with a different focus. The results are then combined, deduplicated, and ranked. According to Anthropic's official documentation, all processing occurs on Anthropic's infrastructure, not on the client side.
«Multiple agents analyze the diff and surrounding code in parallel on Anthropic infrastructure. Each agent looks for a different class of issue» — Claude Code Docs.
The architectural decision — agent specialization — solves a specific problem. A single agent given too broad a task loses accuracy: it either generates noise or misses subtle errors. Narrow specialization provides depth. An analogy with human teams: a security engineer will find an authentication vulnerability that a backend developer would miss, simply because the former thinks in terms of threat models, and the latter in terms of functionality.
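The fan-out/merge flow described above can be sketched schematically. This is not Anthropic's implementation (the real agents run on Anthropic's infrastructure and their internals are unpublished); the agent bodies and finding shape are stand-ins:

```python
# Schematic sketch: parallel specialized agents, then merge,
# deduplicate, and rank. Agent internals are invented stand-ins.
from concurrent.futures import ThreadPoolExecutor

SEVERITY_RANK = {"red": 0, "yellow": 1, "purple": 2}

def security_agent(diff: str) -> list[dict]:
    return [{"file": "auth.py", "line": 10, "severity": "red",
             "message": "token not validated"}]

def logic_agent(diff: str) -> list[dict]:
    return [{"file": "billing.py", "line": 52, "severity": "yellow",
             "message": "retry can double-charge"},
            {"file": "auth.py", "line": 10, "severity": "red",
             "message": "token not validated"}]  # duplicate of the security finding

def review(diff: str, agents) -> list[dict]:
    # 1) Run every specialized agent on the same diff in parallel.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(lambda agent: agent(diff), agents))
    # 2) Merge and deduplicate by (file, line, message).
    seen, merged = set(), []
    for finding in (f for batch in batches for f in batch):
        key = (finding["file"], finding["line"], finding["message"])
        if key not in seen:
            seen.add(key)
            merged.append(finding)
    # 3) Rank by criticality: red before yellow before purple.
    return sorted(merged, key=lambda f: SEVERITY_RANK[f["severity"]])

findings = review("...diff...", [security_agent, logic_agent])
print([f["severity"] for f in findings])  # ['red', 'yellow']
```

The point of the sketch is the shape of the pipeline: independent passes over the same input, with overlap resolved only at the merge step.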
What each agent analyzes
Anthropic does not publish a complete list of agent types, but from documentation and public materials, it is known that the system looks for: logical bugs and edge cases, security vulnerabilities, regressions in adjacent code (latent bugs in adjacent files), API contract violations, and authentication/authorization issues. A key detail: agents analyze not only the diff but also the surrounding code — files that the PR touched but did not change. That's where latent bugs hide.
Parallelism as an architectural answer to the self-review problem
There's a fundamental problem of "AI reviewing AI code": if one agent wrote erroneous code, it might have the same blind spots when reviewing it. The multi-agent architecture partially solves this. Independent specialized agents with different scopes analyze the same code in parallel, reducing the likelihood that a shared architectural bias will affect all findings simultaneously. But "partially" is an important word: agents from the same provider can still have shared architectural blind spots.
- ✔️ Parallel agent execution reduces waiting time
- ✔️ Specialization increases accuracy in each class of errors
- ✔️ Analysis of surrounding code catches latent bugs outside the diff
- ⚠️ Agents from the same provider may have shared architectural blind spots
Parallel specialization is an architectural bet on depth over speed, and it is justified precisely for the class of logical errors where context matters more than turnaround time.
📌 Verification and filtering of false positives
The agent first tries to disprove its finding
The verification step is a key distinction from simply "run the model and show the result." After an agent finds a potential issue, the system checks the finding against the actual code behavior, attempting to disprove it. Only findings that pass verification make it into the final report. According to Anthropic, less than 1% of findings are rejected by developers as irrelevant.
«A verification step checks candidates against actual code behavior to filter out false positives» — Claude Code Docs.
For comparison: an early wave of AI code review tools had a signal-to-noise problem — for every real bug, they generated roughly nine false positives. This isn't just an inconvenience; it destroys trust: developers started ignoring all warnings and, as a result, missed real problems. Anthropic addressed this through the verification step, and the result is reflected in the numbers.
How verification works
Anthropic does not disclose implementation details, but from the documentation, it is known that the verification agent compares the finding with the actual code behavior. In practice, this means: if an agent flagged a line as a potential SQL injection, the validator checks whether this data actually enters the query without sanitization in the specific call context. This fundamentally differs from pattern matching — SonarQube might flag any string concatenation with an SQL query, even if the data has already been processed higher up the stack.
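Since the implementation is undisclosed, the falsification idea can only be sketched. In this toy version (the finding shape and `sanitized` flag are illustrative assumptions), a verifier tries to disprove each candidate against its concrete call context, and only survivors reach the report:

```python
# Schematic only: verification as an attempt to DISPROVE a candidate.
# The finding fields below are invented for illustration.
def disproved(finding: dict) -> bool:
    # A flagged SQL injection is a false positive if the value was
    # already sanitized higher up the stack in this specific context.
    return finding["kind"] == "sql_injection" and finding["sanitized"]

candidates = [
    {"kind": "sql_injection", "line": 10, "sanitized": True},   # noise
    {"kind": "sql_injection", "line": 52, "sanitized": False},  # real issue
]

report = [f for f in candidates if not disproved(f)]
print([f["line"] for f in report])  # [52]
```

A pure pattern matcher would emit both candidates; the verification pass is what separates "string concatenated into SQL" from "tainted data actually reaches the query."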
Criticality marking system
After verification, findings are deduplicated and ranked. Marking is done by color: red — critical, requires immediate attention; yellow — potential issue, worth reviewing; purple — existing issue in old code near changes (latent bug). The purple marker is especially important: it tells the reviewer "the PR itself is OK, but you've touched a dangerous area."
In addition, each finding includes a collapsible extended reasoning section — the reviewer can expand it to see why the agent flagged the issue and how it was verified. This allows for informed decision-making, rather than blindly trusting or ignoring.
- ✔️ Less than 1% of findings are rejected as false positives (Anthropic internal data)
- ✔️ Verification checks the finding against actual code behavior
- ✔️ Three criticality levels: red / yellow / purple
- ✔️ Extended reasoning — transparency for the reviewer
Conclusion: The verification step is what distinguishes "running a model" from "running a reliable tool"; less than 1% false positives is not a marketing figure, but a condition for product survival.
📌 GitHub Integration: How it works in the pipeline
An admin installs the Anthropic GitHub App — and the review starts automatically
Integration happens via the official GitHub App from Anthropic. The administrator connects repositories and specifies which branches to run reviews on. Then everything is automatic: a PR is opened — agents are launched — results arrive as inline comments and a general overview directly on the PR page in GitHub.
«Results are deduplicated, ranked by severity, and posted as inline comments on the specific lines where issues were found» — Claude Code Docs.
From a reviewer's workflow perspective, it looks like this: you go to a PR, see one generalized comment from Claude with a summary of findings — and then inline annotations on specific lines. There's no separate dashboard, no switching to another interface. This is important: tools that output results in a separate system are often ignored.
Trigger conditions and cost implications
By default, the review runs when a PR is opened. But there's an important detail from the documentation: if configured for «after every push», the review will run on every commit, multiplying the cost by the number of pushes. For teams with active feature development, this can significantly increase the monthly bill. The recommended configuration for most teams is only upon PR opening and for significant updates.
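As back-of-envelope arithmetic (using the typical per-review range quoted in this article; actual billing is token-based and varies with PR size, and the push count is an assumption):

```python
# Back-of-envelope only: "after every push" multiplies per-PR cost
# by the number of pushes. Push count is an assumed example value.
cost_low, cost_high = 15, 25    # USD per review, typical range
pushes_per_pr = 6               # assumed active feature branch

on_open_only = (cost_low, cost_high)
on_every_push = (cost_low * pushes_per_pr, cost_high * pushes_per_pr)
print(on_every_push)  # (90, 150): same PR, six times the cost
```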
What is not supported at launch
An important limitation to be aware of before connecting: GitHub only. GitLab, Azure DevOps, and Bitbucket are not supported in the research preview. For teams using GitLab CI/CD or Azure DevOps pipelines, Code Review is not yet available. Anthropic offers an alternative — GitHub Actions or GitLab CI/CD integration via a self-hosted approach, but this requires custom configuration and is not the same managed service.
- ✔️ Integration: Anthropic GitHub App, configured by administrator
- ✔️ Results: inline comments + general overview directly on the PR in GitHub
- ✔️ Cost control: monthly limit via claude.ai/admin-settings/usage
- ⚠️ GitHub only — GitLab, Bitbucket, Azure DevOps are not supported
- ⚠️ «After every push» multiplies the cost by the number of commits
GitHub integration is native and seamless, but the GitHub-only limitation significantly narrows the audience; GitLab teams are currently cut off from the managed service.
📌 Code Review vs Claude Code Security: Different tasks
Code Review is about PR quality; Security is about scanning the entire codebase
Claude Code Review automatically analyzes each pull request in real-time. Claude Code Security is a separate product that scans the entire codebase for vulnerabilities and suggests patches. These are different tools with different trigger conditions, different audiences, and different analysis scopes.
«Claude Code Security scans codebases for security vulnerabilities and suggests targeted software patches for human review, allowing teams to find and fix security issues that traditional methods often miss» — Anthropic, official announcement.
Confusion between these two products is natural — both live under the Claude Code brand and both are related to code security. But their tasks are fundamentally different.
Claude Code Review: PR-level, real-time
Runs automatically on every PR. Analyzes the diff and surrounding code. Focus — logical errors, bugs, regressions. Result — inline comments on the PR. Time — ~20 minutes. Audience — developer and reviewer. Goal — prevent issues from reaching main.
Claude Code Security: codebase, deep scanning
Claude Code Security is a separate product in limited research preview. It scans the entire codebase, finds vulnerabilities that have "sat" in the code for years, and proposes specific patches. According to Anthropic, using Claude Opus 4.6, the team found over 500 vulnerabilities in production open-source codebases — bugs that had gone undetected for years despite continuous manual audits. Results appear in the Claude Code Security dashboard, where the security team reviews findings, examines proposed patches, and approves them. Nothing is applied automatically. The audience is security engineers and security leads, not developers.
There's also a third option: the /security-review command
There's also the /security-review command and GitHub Action — a lighter option for security analysis, available to all Claude Code users (including Pro/Max). It checks for known vulnerability patterns: SQL injection, XSS, authentication issues, unsafe data handling. This is closer to classic SAST, but with LLM explanations.
| Characteristic | Code Review | Code Security | /security-review |
|---|---|---|---|
| Trigger | PR opening | Ad hoc or scheduled | Manual launch |
| Scope | Diff + surrounding code | Entire codebase | Current directory |
| Focus | Logical bugs, regressions | Security vulnerabilities | Known security patterns |
| Access | Teams + Enterprise | Limited preview | All paid plans |
| Audience | Developer, reviewer | Security engineer | Developer |
Code Review and Code Security are not competitors but complements: the first catches bugs before merge, the second finds what has already been living in production code for years.
📌 Architecture Cost: Tokens, scale, limitations
$15–25 per PR — but the correct unit of measurement is not the PR, but the production incident
The cost is token-based: a larger and more complex PR costs more. The typical range is $15–25 per review. With the "after every push" configuration, the cost multiplies by the number of commits. There is a monthly spending cap. For a team of 50 developers with 20 PRs per day, this is ~$300–500 per day or $9–15K per month — and this needs to be weighed against the real cost of a missed bug.
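The team-scale estimate above is simple arithmetic (assuming the review fires once per PR, with no per-push multiplication, and a 30-day month):

```python
# Reproducing the article's estimate for 50 developers, 20 PRs/day.
cost_low, cost_high = 15, 25     # USD per review
prs_per_day = 20
days = 30

daily = (prs_per_day * cost_low, prs_per_day * cost_high)
monthly = (daily[0] * days, daily[1] * days)
print(daily, monthly)  # (300, 500) (9000, 15000)
```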
«Reviews scale in cost with PR size and complexity, completing in 20 minutes on average» — Claude Code Docs.
A price comparison with competitors looks like this: CodeRabbit — $12/user/month for unlimited reviews; Greptile — $30/developer/month; GitHub Copilot code review — included in subscription from $10/month. At first glance, $15–25 for one PR seems expensive. But this comparison is in the wrong units.
The correct unit of comparison — incident cost
Anthropic deliberately positions Code Review as an "insurance product," not a productivity tool. During internal testing, the tool caught a specific bug: a single-line change in a production service was about to break the authentication mechanism of the entire service. The cost of this incident in production — several hours of downtime, SRE team work, potential data loss, reputational risks. One such incident costs more than a month of Code Review for an average team.
Limitations to be aware of
At the current research preview stage, there are several important limitations. First, GitHub only: GitLab, Bitbucket, and Azure DevOps are not supported. Second, Teams and Enterprise only: individual developers on Pro and Max plans do not have access to the managed Code Review service. Third, there is no publicly confirmed list of supported languages: Claude Code traditionally works well with Python, JavaScript, TypeScript, and Go, but support for specific enterprise languages may be limited.
- ✔️ Price: ~$15–25 per PR, token-based model
- ✔️ Spending cap: configurable via claude.ai/admin-settings/usage
- ✔️ Analytics: weekly cost chart and per-repo average cost in admin settings
- ⚠️ «After every push» multiplies the cost — careful configuration is recommended
- ⚠️ GitHub only, Teams/Enterprise only
- ⚠️ For 50 developers with 20 PRs/day: ~$9–15K/month
Section Conclusion: The math works out for enterprise teams where the cost of a missed bug is higher than the cost of a review — and doesn't work out for small teams or teams with a low-risk profile.
📌 CLAUDE.md and REVIEW.md: Customizing review rules
Two files with different roles — one about context, the other about priorities
CLAUDE.md describes the system: architecture, conventions, dependencies, project specifics. REVIEW.md defines review priorities: what is critical for your team, what to pay special attention to. This separation allows agents to be configured for a specific tech stack and team culture without changing the overall system logic.
«CLAUDE.md tells the agents how your system is shaped. REVIEW.md tells them what to care about during review» — DEV.to, practical breakdown.
For a senior developer and tech lead, this is the most interesting part of the architecture. The system is not a "black box" with fixed rules — it adapts to your context. CLAUDE.md and REVIEW.md are a way to convey domain knowledge about your system to the agents, which they otherwise have no way of obtaining.
What to write in CLAUDE.md
CLAUDE.md is an "AI manual" about your project. It should include: architectural patterns and data flow, specific codebase conventions, dependencies between modules, infrastructure specifics (e.g., "this service is stateless, this one is stateful"), known technical debt, and legacy parts where changes are particularly risky.
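A minimal illustrative fragment of such a file might look like this (every module name and detail below is hypothetical, invented to show the kind of context the agents can use):

```markdown
# CLAUDE.md (illustrative fragment; all names are hypothetical)

## Architecture
- `billing-api` is stateless; `session-store` is stateful (Redis).
- All invoice writes go through `BillingService`; raw SQL is forbidden.

## Conventions
- Every queue handler must be idempotent; redelivery is expected.

## Known risk areas
- `legacy/auth/` predates the permissions refactor; changes there are high-risk.
```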
What to write in REVIEW.md
REVIEW.md contains your review priorities. Here's an example from a real configuration:
```markdown
# REVIEW.md
Prioritize comments about:
- authorization regressions across admin and customer paths
- idempotency in webhook handlers
- missing transaction boundaries on billing writes
- async jobs that can double-send emails, refunds, or notifications
```
Such a REVIEW.md tells the agents: "we are not interested in stylistic remarks — we are interested in precisely these categories of problems, because they are the most expensive for our business."
Customizing /security-review
The /security-review command also supports customization — you can set specific rules for the codebase and adjust sensitivity for different types of vulnerabilities. More details are available in the official documentation.
- ✔️ CLAUDE.md: architectural context, conventions, dependencies
- ✔️ REVIEW.md: review priorities, specific to your domain
- ✔️ A correctly configured REVIEW.md reduces noise and increases the relevance of findings
- ✔️ These files are also used by other Claude Code tools — not just review
Conclusion: CLAUDE.md and REVIEW.md are where a tech lead can encode domain knowledge about the system and convey it to the agents; the more accurate the context, the more relevant the findings.
❓ Frequently Asked Questions (FAQ)
Can Claude Code Review replace SonarQube?
No, and these tools are not intended to compete. SonarQube is deterministic, auditable, and suitable for compliance requirements in regulated industries (BFSI, healthcare, aerospace). Claude Code Review is context-dependent, better at finding logical errors in complex diffs, but does not provide deterministic guarantees and is not suitable as a sole tool for compliance. The optimal strategy for enterprise: both in parallel.
How does Claude Code Review compare to CodeRabbit or Greptile?
CodeRabbit ($12/user/month) supports GitHub, GitLab, Bitbucket, and Azure DevOps — a significant platform advantage. Greptile ($30/developer/month) indexes the entire repository in advance and provides the deepest contextual analysis, but has the highest level of false positives. Claude Code Review is the newest player with the cleanest signal (less than 1% false positives), but is limited to GitHub and lacks CodeRabbit's track record (2M+ repositories).
What happens to the code sent for review?
Processing occurs on Anthropic's infrastructure. For Enterprise clients, standard Enterprise plan confidentiality terms apply. For Teams clients — Teams plan terms apply. Anthropic does not publish detailed guarantees regarding code storage within the research preview. Companies with strict data residency requirements should carefully read the terms before connecting.
Is there a self-hosted option?
The managed Code Review service is only available through Anthropic's infrastructure; self-hosted is not supported. However, there is an alternative: an open-source Claude Code GitHub Action, which can be run in your own CI pipeline. This is less automated but provides more control over where the code is processed.
✅ Conclusions
- 🔹 The multi-agent architecture solves a specific problem — context-dependent analysis of logical errors, which is unavailable to classic SAST tools due to their architecture
- 🔹 The verification step and less than 1% false positives — a key distinction from the previous generation of AI reviews
- 🔹 Code Review and Code Security — different products with different scopes; the first for PR-level, the second for deep security scanning of the entire codebase
- 🔹 REVIEW.md gives tech leads a way to encode domain knowledge — and this is the most important customization point
- 🔹 Limitations are real: GitHub only, Teams/Enterprise only, $15–25 per PR, lack of track record outside Anthropic
Main takeaway:
Claude Code Review is not a replacement for linters and not a competitor to SonarQube; it's a new layer in the code quality pipeline that addresses a class of errors that previously required a careful human reviewer — and is justified precisely to the extent that the cost of a missed bug exceeds the cost of the review.
⸻
Read also:
← Anthropic launched AI code review: what changed in 2026 (news overview without technical details)
Keywords:
Claude API, Anthropic, Dynamic Filtering, Web Search Tool, AI agents, LLM architecture, RAG vs Web Search, Python Sandbox, token optimization, Claude 4.6 Opus, Sonnet 4.6