Your AI Agent Passed All Tests. Your System Is Still Broken.
Tests confirm that functions return expected outputs for expected inputs. They do not confirm that the system behaves correctly. I keep running into the same categories of problems in code that passes every test in the suite:
- A handler loads the same entity three times because nothing checks the execution trace.
- An async consumer silently fails because a dependency was never registered. The endpoint returns 200. The consumer throws into the void. Nobody notices until a customer asks why they never got the email (sketched in code below).
- The OpenAPI spec marks a field as required when it is actually optional. Every client generated from that spec rejects valid payloads that omit the field.
These are not edge cases. They are structural problems that a passing test suite does not surface.
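Here is a minimal sketch of the second failure mode, using a toy in-process event bus and dependency registry in place of whatever framework the real system runs on. Every name in it is hypothetical:

```typescript
// Toy bus + registry illustrating the "consumer throws into the void" pattern.
// All identifiers here (registry, subscribe, publish, signUpHandler) are made up.

type Handler = (payload: unknown) => Promise<void>;

const registry = new Map<string, unknown>();       // dependency registry
const subscribers = new Map<string, Handler[]>();  // topic -> consumers

function subscribe(topic: string, handler: Handler): void {
  const handlers = subscribers.get(topic) ?? [];
  handlers.push(handler);
  subscribers.set(topic, handlers);
}

function publish(topic: string, payload: unknown): void {
  // Fire-and-forget: the HTTP request never awaits its consumers.
  for (const handler of subscribers.get(topic) ?? []) {
    handler(payload).catch((err) => {
      // The failure is only logged. Nothing upstream fails, no test notices.
      console.error(`consumer for ${topic} failed`, err);
    });
  }
}

// Consumer: depends on a "mailer" that was never registered.
subscribe("user.signed-up", async (payload) => {
  const mailer = registry.get("mailer") as { send: (to: string) => Promise<void> };
  await mailer.send((payload as { email: string }).email); // TypeError at runtime
});

// Endpoint handler: publishes the event and returns 200 no matter what.
async function signUpHandler(email: string): Promise<{ status: number }> {
  publish("user.signed-up", { email });
  return { status: 200 };
}

signUpHandler("user@example.com").then((res) => console.log(res.status)); // 200; the email never goes out
```

The test for signUpHandler asserts on the 200 and passes. The only place the failure is visible is the consumer's error log, or its span in the trace.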
Coverage measures execution, not behavior
90% code coverage means 90% of your code runs during tests. It says nothing about whether the system does what it should. A test can execute a code path without verifying the side effects, the performance characteristics, or the integration behavior that path produces.
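A sketch of that gap, with a made-up handler and repository; every line of the handler executes, so coverage is perfect, yet nothing observes its behavior:

```typescript
// Sketch: full line coverage of a handler that still does the wrong thing.
// The names (Order, repo, getOrder) are hypothetical.

interface Order {
  id: string;
  items: string[];
}

const queryLog: string[] = []; // stands in for the queries the database actually sees

const repo = {
  async findOrder(id: string): Promise<Order> {
    queryLog.push(`SELECT * FROM orders WHERE id = '${id}'`);
    return { id, items: ["a", "b"] };
  },
};

// The handler loads the same entity three times; every line still runs under test.
async function getOrder(id: string): Promise<{ status: number; body: Order }> {
  const order = await repo.findOrder(id);
  await repo.findOrder(id); // redundant load
  await repo.findOrder(id); // redundant load
  return { status: 200, body: order };
}

// "Test": asserts the status code and nothing else. It passes, coverage is 100%,
// and the three duplicate queries are never observed.
getOrder("42").then((res) => {
  console.assert(res.status === 200, "expected 200");
});
```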
Tests do not tell you:
- How many database queries a single API call triggers
- Whether event consumers actually process the events they subscribe to
- Whether error responses contain useful diagnostics or just “something went wrong”
- Whether the system works end to end from a real client’s perspective
Traces show what actually happened
Call your API. Open the trace. That trace is the ground truth. It shows every database query, every message published, every consumer that fired or did not fire, every external call.
If the trace shows 14 DB queries for a simple CRUD operation, that is a problem. If a consumer span has status: Error while the root span shows success, that is a problem. These are binary observations, not judgment calls.
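A minimal sketch of turning those two observations into checks. The span shape and attribute names are simplified stand-ins; in a real setup the spans would come from your tracing backend or an in-memory exporter, but the assertions are the same:

```typescript
// Simplified span records and the two binary checks described above.
// Field names ("db.system", parentId, status) are illustrative, not tied to a specific SDK.

interface SpanRecord {
  name: string;
  parentId: string | null; // null marks the root span
  status: "Ok" | "Error";
  attributes: Record<string, string>;
}

// Check 1: how many database queries did this one call trigger?
function dbQueryCount(spans: SpanRecord[]): number {
  return spans.filter((s) => s.attributes["db.system"] !== undefined).length;
}

// Check 2: did any consumer fail while the root span reports success?
function failedConsumersUnderSuccessfulRoot(spans: SpanRecord[]): SpanRecord[] {
  const root = spans.find((s) => s.parentId === null);
  if (!root || root.status !== "Ok") return [];
  return spans.filter((s) => s.name.startsWith("consume ") && s.status === "Error");
}

// Example trace for a "simple CRUD" call: 14 queries, plus a consumer that
// failed while the root span still shows success.
const exampleTrace: SpanRecord[] = [
  { name: "POST /orders", parentId: null, status: "Ok", attributes: {} },
  ...Array.from({ length: 14 }, (_, i) => ({
    name: `SELECT orders #${i}`,
    parentId: "root",
    status: "Ok" as const,
    attributes: { "db.system": "postgresql" },
  })),
  { name: "consume order.created", parentId: "root", status: "Error", attributes: {} },
];

console.log(dbQueryCount(exampleTrace));                       // 14 -> problem
console.log(failedConsumersUnderSuccessfulRoot(exampleTrace)); // [consume order.created] -> problem
```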
AI makes this worse
AI generates code faster than you can reason through it. You used to understand your code because you wrote every line. Now you review code you did not write, and the volume makes thorough review impractical. The tests pass, so you approve. The structural problems accumulate.
This is not a reason to stop using AI. It is a reason to verify behavior through observability instead of relying on tests and code review alone.
I wrote more about this in the AI-Driven Development section and in the methodology repo.