Observability

Observability in this methodology is not production monitoring. It is development-time verification. Traces, logs, and metrics are the evidence used to evaluate whether the system behaves correctly, not just whether it returns the right response.

Three Signals

Traces

Distributed traces show the full execution path: every span, every DB query, every message handler, every external call. Inspect them after every callable surface invocation during a journey walk.

Signal	Finding
Same entity loaded 2+ times in one trace	Redundant query. Handler re-fetches what is already in scope.
`status: Error` on child span, root span succeeded	Broken async consumer. Request returned 200, downstream handler failed.
10+ DB spans for a simple operation	Over-querying. Likely N+1 or missing eager load.
External HTTP call > 2 seconds	Latency risk. Slow external dependency.
Event published with zero consumer spans	Published event with no handler. Subscription is wrong or missing.
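
The signals in the table above can be checked mechanically once a trace is available as a list of spans. The following is a minimal sketch, not the methodology's own tooling: the span dictionary shape (`parent_id`, `kind`, `status`, `duration_ms`, `entity`, `event`), the function name `find_trace_signals`, and the finding labels are all assumptions made for illustration.

```python
from collections import Counter


def find_trace_signals(spans):
    """Return finding labels for one trace.

    Hypothetical span shape: a dict with parent_id, kind ("db", "http",
    "publish", "consume", or "internal"), status ("ok" or "error"),
    duration_ms, plus entity (set on DB load spans only) and event
    (set on publish and consume spans).
    """
    findings = []
    root = next(s for s in spans if s["parent_id"] is None)

    # Same entity loaded 2+ times in one trace.
    loads = Counter(s["entity"] for s in spans if s["kind"] == "db" and s.get("entity"))
    if any(count >= 2 for count in loads.values()):
        findings.append("redundant_query")

    # A child span errored while the root span succeeded.
    if root["status"] == "ok" and any(s["status"] == "error" for s in spans if s is not root):
        findings.append("broken_async_consumer")

    # 10+ DB spans (the threshold is meant for simple operations).
    if sum(s["kind"] == "db" for s in spans) >= 10:
        findings.append("over_querying")

    # External HTTP call longer than 2 seconds.
    if any(s["kind"] == "http" and s["duration_ms"] > 2000 for s in spans):
        findings.append("slow_external_call")

    # Event published with zero consumer spans.
    consumed = {s.get("event") for s in spans if s["kind"] == "consume"}
    if any(s["kind"] == "publish" and s.get("event") not in consumed for s in spans):
        findings.append("unhandled_event")

    return findings
```

A real tracing backend will expose span kinds and attributes differently, so the dictionary keys would need to be mapped from that backend's export format.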

Logs

Structured logs with trace context show what happened at each point in the execution path. Inspect them when a trace shows an error span or unexpected behavior.

Signal	Finding
Exception logged on a non-error span	Swallowed exception. Caught and logged but not propagated.
Validation failure logged but not returned	Silent validation. Client gets 200, input was partially invalid.
Log entry without trace context	Missing correlation. Cannot link to the request that caused it.
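
The log signals above can be sketched as a single pass over the structured log records of one request. This is an illustration under stated assumptions: the record keys (`message`, `trace_id`, `span_id`, `exception`, `event`), the `"validation_failed"` event name, and the function `find_log_signals` are hypothetical and not taken from any particular logging library.

```python
def find_log_signals(records, error_span_ids, response_status):
    """Return (finding, message) pairs for the log records of one request.

    Hypothetical record shape: a dict with message, trace_id, span_id,
    and optionally exception and event. error_span_ids is the set of
    span IDs that the trace marks as errored.
    """
    findings = []
    for rec in records:
        # Log entry without trace context: it cannot be linked to a request.
        if not rec.get("trace_id"):
            findings.append(("missing_correlation", rec["message"]))
            continue

        # Exception logged on a span that the trace reports as successful.
        if rec.get("exception") and rec.get("span_id") not in error_span_ids:
            findings.append(("swallowed_exception", rec["message"]))

        # Validation failure logged while the client still got a success status.
        if rec.get("event") == "validation_failed" and response_status < 400:
            findings.append(("silent_validation", rec["message"]))
    return findings
```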

Metrics

Metrics cover request counts, error rates, latency distributions, and queue depths. Compare aggregate metrics before and after a change to detect regressions.
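
A before/after comparison can be as simple as the sketch below. It assumes the caller passes only metrics where a higher value is worse (error rate, latency percentiles, queue depth). The 10% tolerance, the metric names, and the function `detect_regressions` are illustrative choices, not part of the methodology.

```python
def detect_regressions(before, after, tolerance=0.10):
    """Return metrics that increased by more than `tolerance` (a fraction).

    Assumes every metric passed in is "higher is worse".
    """
    regressions = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # metric missing after the change: nothing to compare
        if old == 0:
            # No baseline to scale by: any increase counts as a regression.
            if new > 0:
                regressions[name] = float("inf")
            continue
        change = (new - old) / old
        if change > tolerance:
            regressions[name] = round(change, 3)
    return regressions
```

Returning the relative change per metric keeps the evidence next to the verdict, so a reviewer can see how far each metric moved.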

Healthy Trace Patterns

What a correct trace looks like depends on the operation type. These are baseline budgets, not hard rules. Adjust per system.

Simple read (list, get by ID)

1-3 DB queries (authorization check + data fetch)
No message publications
< 100ms total duration

Simple mutation (create, update)

1-3 DB queries (authorization + read + write)
0-2 event publications
All consumer spans succeed if events are published
< 200ms total duration (excluding external calls)

Batch operation

DB query count grows with item count only through batched statements, not one query per item (no N+1)
1 batch event publication, not N individual events
Consumer spans complete without error
Duration scales linearly with item count

Cross-module operation (external service)

External HTTP span clearly visible
Retry spans visible if the call failed and retried
Compensation span visible if operation failed after partial external change
Total duration dominated by external call, not internal processing
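
The numeric budgets for the two simple operation types can be written down as data so that a check is repeatable. The sketch below encodes only the figures listed above. The `Budget` class, the `BUDGETS` table, and `check_budget` are hypothetical names, and the values are meant to be adjusted per system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_db_queries: int
    max_publications: int
    max_duration_ms: float  # durations must stay strictly below this


# Baseline budgets taken from the patterns above; adjust per system.
BUDGETS = {
    "simple_read": Budget(max_db_queries=3, max_publications=0, max_duration_ms=100),
    "simple_mutation": Budget(max_db_queries=3, max_publications=2, max_duration_ms=200),
}


def check_budget(operation, db_queries, publications, duration_ms):
    """Return a list of human-readable budget violations (empty if healthy).

    For mutations, the caller is expected to pass duration_ms with
    external call time already subtracted.
    """
    budget = BUDGETS[operation]
    violations = []
    if db_queries > budget.max_db_queries:
        violations.append(f"db_queries {db_queries} > {budget.max_db_queries}")
    if publications > budget.max_publications:
        violations.append(f"publications {publications} > {budget.max_publications}")
    if duration_ms >= budget.max_duration_ms:
        violations.append(f"duration_ms {duration_ms} >= {budget.max_duration_ms}")
    return violations
```

Batch and cross-module operations are left out because their expectations are relative (scaling with item count, dominated by the external call) rather than fixed numbers.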

Trace Analysis

Manual trace inspection does not scale. The methodology defines a trace analysis tool (MCP server) that accepts a trace ID, extracts structured metrics (span count, DB queries, errored spans, duplicate loads, external call durations), compares them against budgets, and returns pass/fail violations, each with supporting evidence.
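
The flow described above (trace ID in, metrics extracted, budgets compared, violations out) might look roughly like the following. This is a sketch of the idea only and does not reflect the trace analyzer spec: the span shape, the `fetch_spans` callback, the metric names, and the `Violation` type are assumptions, and the MCP server wiring is omitted.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Violation:
    rule: str
    evidence: str


def extract_metrics(spans):
    """Reduce raw spans (hypothetical dict shape) to a flat metrics dict."""
    loads = Counter(s["entity"] for s in spans if s["kind"] == "db" and s.get("entity"))
    return {
        "span_count": len(spans),
        "db_queries": sum(s["kind"] == "db" for s in spans),
        "errored_spans": sum(s["status"] == "error" for s in spans),
        # Number of distinct entities that were loaded more than once.
        "duplicate_loads": sum(1 for count in loads.values() if count > 1),
        "max_external_ms": max(
            (s["duration_ms"] for s in spans if s["kind"] == "http"), default=0
        ),
    }


def analyze(trace_id, fetch_spans, budgets):
    """Fetch a trace, extract metrics, and return one Violation per exceeded budget."""
    metrics = extract_metrics(fetch_spans(trace_id))
    return [
        Violation(rule=name, evidence=f"{name}={metrics[name]} exceeds budget {limit}")
        for name, limit in budgets.items()
        if metrics[name] > limit
    ]
```

Passing `fetch_spans` as a callback keeps the comparison logic independent of whichever tracing backend stores the spans.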

See the trace analyzer spec for details.
