Reviewing AI-written code: a checklist for what to scrutinise
Code generated by an AI agent looks like code, runs like code, and passes the obvious tests like code, but it has its own characteristic failure modes. Reviewing AI output well isn't about distrusting it; it's about knowing which categories of mistake are most likely and looking for them deliberately. The review skill compounds quickly once you know where to focus.
Quick reference
- Five categories: wrong API, dropped requirements, over-engineering, sycophantic patches, wrong reasoning.
- Review against the prompt: walk the diff against your task spec, not the agent's description.
- Run the code: try edge cases the agent didn't test. Don't trust the description.
- Read the test assertions: not just the count. Tests passing means little if the assertions are wrong.
- New deps: every new import or package is a decision worth confirming.
- Codebase hygiene: strict types, fast tests, pre-commit lint. AI work rewards them.
The five categories of mistake
AI-written code goes wrong in characteristic ways. Five categories cover most of the real problems:
1. Plausible-but-wrong API usage. The agent calls a method that doesn't exist on the type, or uses an old version of an API, or invents a parameter name that looks reasonable. Type-checked languages catch a lot of this; loosely-typed code does not.
2. Silently dropped requirements. You asked for X and Y, the agent built X, the tests for X pass, and Y is just... missing. The output looks complete because it ran. Skim the diff against your requirements list, not against your assumptions.
3. Over-engineering. The agent adds abstractions, configuration knobs, and "flexibility" you didn't ask for. The code works but is harder to maintain. Watch for new files, new exports, new options.
4. Sycophantic patches. When the agent is corrected, it sometimes patches just enough to make you go away — adding a special case rather than fixing the underlying issue. Look for narrow conditionals that exist only to handle the specific case you flagged.
5. Confidently-wrong reasoning. The agent's explanation of what the code does is wrong, and the code matches the wrong explanation. You read the explanation, agree with it, and never check the actual behaviour. Run the code; don't take the agent's word.
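As a rough illustration of category 4, the sketch below contrasts a narrow special case with a fix to the underlying issue. The price-parsing scenario and both function names are hypothetical, invented for this example; assume the reviewer had flagged that "1,000" parsed as 1.

```typescript
// Hypothetical scenario: the reviewer reported that parsing "1,000"
// returned 1 instead of 1000.

// Sycophantic patch: handles only the exact input that was flagged.
function parsePricePatched(input: string): number {
  if (input === "1,000") {
    return 1000; // narrow conditional; every other separated value is still wrong
  }
  return parseFloat(input);
}

// Underlying fix: strip thousands separators before parsing.
function parsePriceFixed(input: string): number {
  return parseFloat(input.replace(/,/g, ""));
}

console.log(parsePricePatched("2,500")); // 2
console.log(parsePriceFixed("2,500")); // 2500
```

The tell in review is the literal from your own bug report appearing in the diff. Trying a sibling input, here "2,500", is usually enough to expose it.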
A 10-minute review checklist
A workable review pattern, in order:
1. Read the diff against requirements, not the agent's description. Open your original prompt or task spec. Walk through the diff item by item. Anything missing? Anything extra?
2. Run the code. Don't just look at it. Actually run it on a real input you didn't test in the prompt. Edge cases, empty inputs, malformed inputs. AI code passes happy paths reliably and breaks on edges that weren't in the test suite.
3. Check the tests. Are the tests testing the actual behaviour, or are they testing what the agent assumed? Tests written by the agent often pass because the agent designed them to. Read the assertions, not just the test count.
4. Look for new dependencies. New imports, new packages, new external calls. Each one is a decision worth confirming.
5. Spot-check the explanation. Pick one section of the agent's description and verify it matches the code. If the explanation says "we cache the result," check that the code actually caches the result.
6. Read it once more, fresh. A second pass with a clear mind catches things the first pass missed. AI code reads convincingly the first time, less so the second.
This review should take 5–15 minutes for a typical change. If you find yourself reading for an hour, the change is too big — break it down.
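To make step 3 concrete, here is a small sketch of the difference between a check that merely passes and one that pins down behaviour. The `slugify` function and both checks are hypothetical, and they are written without a test framework so the block runs on its own.

```typescript
// Hypothetical function under review.
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of other characters into one dash
    .replace(/^-+|-+$/g, ""); // drop leading and trailing dashes
}

// Weak check: passes for almost any implementation that returns a non-empty string.
function weakCheck(): boolean {
  const result = slugify("Hello, World!");
  return typeof result === "string" && result.length > 0;
}

// Meaningful check: pins the exact output, including a whitespace-only edge case.
function meaningfulCheck(): boolean {
  return slugify("Hello, World!") === "hello-world" && slugify("  ") === "";
}

console.log(weakCheck(), meaningfulCheck()); // true true
```

Both checks show up as green in a test report. Only the second would fail if the function stopped producing the right slug.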
Set up the codebase to make review easier
A few investments make AI-written code dramatically easier to review:
Strict type checking. TypeScript with no `any`, full strict mode, and strict null checks catches a large share of plausible-but-wrong API usage automatically. The same goes for Rust, Kotlin, and other strictly typed languages. Languages that allow loose typing pay a tax on AI-assisted code review.
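A minimal sketch of what strict mode buys you, assuming `strict` (which includes `strictNullChecks`) is enabled in the compiler options. The `User` type, the `users` map, and the `emailFor` helper are hypothetical.

```typescript
interface User {
  id: string;
  email: string;
}

const users = new Map<string, User>([
  ["u1", { id: "u1", email: "ada@example.com" }],
]);

// Under strictNullChecks, Map.get returns `User | undefined`, so writing
// `users.get(id).email` is a compile error rather than a runtime crash.
// A property that does not exist on User, such as `user.emailAddress`,
// is rejected by the compiler in the same way.
function emailFor(id: string): string | undefined {
  const user = users.get(id);
  return user?.email;
}

console.log(emailFor("u1")); // ada@example.com
console.log(emailFor("missing")); // undefined
```

With loose typing, both mistakes described in the comments would survive until the code ran on the wrong input.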
A test suite that runs in seconds. If running the tests takes a minute, you'll skip them. If they run in five seconds, you'll run them constantly. Fast tests are an AI-coding multiplier.
A pre-commit hook that runs lint and type checks. Errors caught at commit time never make it into a PR. The agent often produces lint errors that a quick `npm run lint` would catch, so automate this.
A short AGENTS.md or similar. A one-page document about your codebase's conventions — naming, file structure, error-handling patterns — helps the agent produce code that doesn't need stylistic review every time.
Focused PRs. Small PRs are easier to review whether the author is human or AI. Encourage one-concept PRs.
None of this is unique to AI-assisted work. AI assistance just rewards the codebase hygiene that you should have anyway, and punishes the absence of it more sharply.

Common questions
How long should a review take?
5–15 minutes for a focused PR. If it's taking longer, the change is too big — push back on the scope rather than spending an hour scrutinising every line.
Should I have the agent review its own code?
Useful as a sanity check but not as a substitute for human review. The agent will catch some issues (lint, typos, obvious bugs) and miss others (over-engineering, requirement drift). Treat it as a free first pass, not a replacement.
How do I review code in a language I don't know well?
Run it. Read the tests. Ask the agent to explain specific lines. Be cautious about merging anything you don't fundamentally understand: the agent can produce syntactically correct code in languages it has had more practice with than you have, and that fluency is its own failure mode.
What's the single highest-leverage check?
Running the code on inputs you didn't mention. Almost everything else can be automated; this one requires a thinking human and catches the most embarrassing bugs.
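One way to make this check routine is a throwaway probe that feeds awkward inputs to a function and prints whatever comes back. The `average` function below is a hypothetical stand-in for agent-written code, and the input list is only a starting point.

```typescript
// Hypothetical agent-written function: correct on the happy path.
function average(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Inputs the prompt never mentioned: empty, single element, overflow.
const probes: number[][] = [[1, 2, 3], [], [5], [1e308, 1e308]];

// Run every probe, recording either the result or the thrown error.
function probe(fn: (xs: number[]) => number, inputs: number[][]): string[] {
  return inputs.map((input) => {
    try {
      return `${JSON.stringify(input)} -> ${fn(input)}`;
    } catch (err) {
      return `${JSON.stringify(input)} -> threw ${String(err)}`;
    }
  });
}

for (const line of probe(average, probes)) {
  console.log(line);
}
```

Here the empty array yields NaN rather than an error. Whether that is acceptable is a requirements question the happy-path tests would never have raised.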
Bottom line
AI-written code fails in characteristic ways: wrong API usage, dropped requirements, over-engineering, sycophantic patches, and confidently wrong reasoning. A 10-minute review checklist (read the diff against the prompt, run the code, check the tests, scrutinise dependencies, spot-check the explanation) catches the great majority of these problems. Set up the codebase to make this review fast.