---
title: "Reviewing AI-written code: a checklist for what to scrutinise"
source: https://www.taim.io/agentic-coding/reviewing-ai-written-code-checklist
published: Sat May 09 2026 10:51:18 GMT+0000 (Coordinated Universal Time)
updated: Thu Jun 04 2026 17:17:19 GMT+0000 (Coordinated Universal Time)
description: "Code generated by an AI agent looks like code, runs like code, and passes the obvious tests like code — but it has its own characteristic failure modes. Reviewing AI output well isn't about distrusting it; it's about knowing which categorie"
---

# Reviewing AI-written code: a checklist for what to scrutinise

Code generated by an AI agent looks like code, runs like code, and passes the obvious tests like code — but it has its own characteristic failure modes. Reviewing AI output well isn't about distrusting it; it's about knowing which categories of mistake are most likely and looking for them deliberately. The review skill compounds quickly once you know where to focus.

Code generated by an AI agent looks like code, runs like code, and passes the obvious tests like code — but it has its own characteristic failure modes. Reviewing AI output well isn't about distrusting it; it's about knowing which categories of mistake are most likely and looking for them deliberately. The review skill compounds quickly once you know where to focus.

## What you'll learn

- The five categories of mistake AI agents make most often
- A review checklist that catches them in under 10 minutes
- How to set up the codebase to make AI-written code easier to review

## The five categories of mistake

AI-written code goes wrong in characteristic ways. Five categories cover most of the real problems:

**1. Plausible-but-wrong API usage.** The agent calls a method that doesn't exist on the type, or uses an old version of an API, or invents a parameter name that looks reasonable. Type-checked languages catch a lot of this; loosely-typed code does not.

**2. Silently dropped requirements.** You asked for X and Y, the agent built X, the tests for X pass, and Y is just... missing. The output looks complete because it ran. Skim the diff against your requirements list, not against your assumptions.

**3. Over-engineering.** The agent adds abstractions, configuration knobs, and "flexibility" you didn't ask for. The code works but is harder to maintain. Watch for new files, new exports, new options.

**4. Sycophantic patches.** When the agent is corrected, it sometimes patches just enough to make you go away — adding a special case rather than fixing the underlying issue. Look for narrow conditionals that exist only to handle the specific case you flagged.

**5. Confidently-wrong reasoning.** The agent's explanation of what the code does is wrong, and the code matches the wrong explanation. You read the explanation, agree with it, and never check the actual behaviour. Run the code; don't take the agent's word.

## A 10-minute review checklist

A workable review pattern, in order:

**1. Read the diff against requirements, not the agent's description.** Open your original prompt or task spec. Walk through the diff item by item. Anything missing? Anything extra?

**2. Run the code.** Don't just look at it. Actually run it on a real input you didn't test in the prompt. Edge cases, empty inputs, malformed inputs. AI code passes happy paths reliably and breaks on edges that weren't in the test suite.

**3. Check the tests.** Are the tests testing the actual behaviour, or are they testing what the agent assumed? Tests written by the agent often pass because the agent designed them to. Read the assertions, not just the test count.

**4. Look for new dependencies.** New imports, new packages, new external calls. Each one is a decision worth confirming.

**5. Spot-check the explanation.** Pick one section of the agent's description and verify it matches the code. If the explanation says "we cache the result," check that the code actually caches the result.

**6. Read it once more, fresh.** A second pass with a clear mind catches things the first pass missed. AI code reads convincingly the first time, less so the second.

This review should take 5–15 minutes for a typical change. If you find yourself reading for an hour, the change is too big — break it down.

## Set up the codebase to make review easier

A few investments make AI-written code dramatically easier to review:

**Strict type checking.** TypeScript with no `any`, full strict mode, and strict null checks catches a huge fraction of plausible-but-wrong API usage automatically. Same for Rust, Kotlin, etc. Languages that allow loose typing pay a tax on AI-assisted code review.

**A test suite that runs in seconds.** If running the tests takes a minute, you'll skip them. If they run in five seconds, you'll run them constantly. Fast tests are an AI-coding multiplier.

**A pre-commit hook that runs lint and types.** Errors caught at commit time never make it into a PR. The agent often produces lint errors that a quick `npm run lint` would catch — automate this.

**A short AGENTS.md or similar.** A one-page document about your codebase's conventions — naming, file structure, error-handling patterns — helps the agent produce code that doesn't need stylistic review every time.

**Focused PRs.** Small PRs are easier to review whether the author is human or AI. Encourage one-concept PRs.

None of this is unique to AI-assisted work. AI assistance just rewards the codebase hygiene that you should have anyway, and punishes the absence of it more sharply.

### Quick reference

#### Five categories

Wrong API, dropped requirements, over-engineering, sycophantic patches, wrong reasoning.

#### Review against the prompt

Walk the diff against your task spec, not the agent's description.

#### Run the code

Edge cases the agent didn't test. Don't trust the description.

#### Read the test assertions

Not just the count. Tests passing means little if assertions are wrong.

#### New deps

Every new import or package is a decision worth confirming.

#### Codebase hygiene

Strict types, fast tests, pre-commit lint. AI work rewards them.

### Common questions

#### How long should a review take?

5–15 minutes for a focused PR. If it's taking longer, the change is too big — push back on the scope rather than spending an hour scrutinising every line.

#### Should I have the agent review its own code?

Useful as a sanity check but not as a substitute for human review. The agent will catch some issues (lint, typos, obvious bugs) and miss others (over-engineering, requirement drift). Treat it as a free first pass, not a replacement.

#### How do I review code in a language I don't know well?

Run it. Read the tests. Ask the agent to explain specific lines. Be cautious about merging anything you can't fundamentally understand — the agent is great at producing syntactically correct code in languages it has more practice with than you do, which is its own failure mode.

#### What's the single highest-leverage check?

Running the code on inputs you didn't mention. Almost everything else can be automated; this one requires a thinking human and catches the most embarrassing bugs.

### Bottom line

AI-written code fails in characteristic ways: wrong API usage, dropped requirements, over-engineering, sycophantic patches, and confidently wrong reasoning. A 10-minute review checklist — read the diff against the prompt, run the code, check the tests, scrutinise dependencies, spot-check the explanation — catches the great majority. Set up the codebase to make this review fast.

### Next steps

- On your next AI-assisted PR, walk the diff against your original prompt and note anything dropped or extra.
- Add a "run on an unexpected input" step to your review habit. Time how long it takes.
- Audit your codebase for the four hygiene fundamentals (strict types, fast tests, pre-commit lint, short conventions doc). Fix the weakest one this week.
