Observability
Observability in this methodology is not production monitoring. It is development-time verification. Traces, logs, and metrics are the evidence used to evaluate whether the system behaves correctly, not just whether it returns the right response.
Three Signals
Traces
Distributed traces show the full execution path: every span, every DB query, every message handler, every external call. Inspect them after every callable surface invocation during a journey walk.
| Signal | Finding |
|---|---|
| Same entity loaded 2+ times in one trace | Redundant query. Handler re-fetches what is already in scope. |
| status: Error on child span, root span succeeded | Broken async consumer. Request returned 200, downstream handler failed. |
| 10+ DB spans for a simple operation | Over-querying. Likely N+1 or missing eager load. |
| External HTTP call > 2 seconds | Latency risk. Slow external dependency. |
| Event published with zero consumer spans | Published event with no handler. Subscription is wrong or missing. |
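The checks in the table above can be sketched as a single pass over a trace's spans. This is a minimal illustration, not the methodology's tooling: the `Span` shape, the `kind` labels, and the `find_trace_signals` name are all assumptions made for the example, and a real tracing backend will expose different fields.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    """Hypothetical flattened span; a real tracing backend exposes a richer model."""
    span_id: str
    parent_id: Optional[str]        # None marks the root span
    kind: str                       # assumed labels: "db", "http", "publish", "consume", "internal"
    status: str = "ok"              # "ok" or "error"
    duration_ms: float = 0.0
    entity: Optional[str] = None    # e.g. "Order:42" for a DB load
    event: Optional[str] = None     # event name on publish/consume spans


def find_trace_signals(spans: list[Span], max_db_spans: int = 10,
                       max_http_ms: float = 2000.0) -> list[str]:
    """Return one finding string per table signal that the trace exhibits."""
    findings = []

    # Same entity loaded 2+ times in one trace.
    loads = Counter(s.entity for s in spans if s.kind == "db" and s.entity)
    for entity, count in loads.items():
        if count >= 2:
            findings.append(f"redundant query: {entity} loaded {count} times")

    # Child span errored while the root span succeeded.
    root = next((s for s in spans if s.parent_id is None), None)
    if root is not None and root.status == "ok":
        for s in spans:
            if s.parent_id is not None and s.status == "error":
                findings.append(f"broken async consumer: span {s.span_id} failed")

    # 10+ DB spans (the caller decides whether the operation counts as simple).
    db_spans = sum(1 for s in spans if s.kind == "db")
    if db_spans >= max_db_spans:
        findings.append(f"over-querying: {db_spans} DB spans")

    # External HTTP call slower than 2 seconds.
    for s in spans:
        if s.kind == "http" and s.duration_ms > max_http_ms:
            findings.append(f"latency risk: span {s.span_id} took {s.duration_ms:.0f} ms")

    # Event published with zero consumer spans.
    consumed = {s.event for s in spans if s.kind == "consume"}
    for s in spans:
        if s.kind == "publish" and s.event not in consumed:
            findings.append(f"no handler: {s.event} has no consumer span")

    return findings


# Usage: a trace that reloads one entity and publishes an unhandled event.
trace = [
    Span("a", None, "internal"),
    Span("b", "a", "db", entity="Order:42"),
    Span("c", "a", "db", entity="Order:42"),
    Span("d", "a", "publish", event="OrderCreated"),
]
for finding in find_trace_signals(trace):
    print(finding)
```

The thresholds are parameters rather than constants because the table's numbers are starting points that a given system may need to tune.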
Logs
Structured logs with trace context show what happened at each point in the execution path. Inspect them when a trace shows an error span or unexpected behavior.
| Signal | Finding |
|---|---|
| Exception logged on a non-error span | Swallowed exception. Caught and logged but not propagated. |
| Validation failure logged but not returned | Silent validation. Client gets 200, input was partially invalid. |
| Log entry without trace context | Missing correlation. Cannot link to the request that caused it. |
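The log signals above can be checked in a similar way once log records are joined with span status. The sketch below assumes a hypothetical `LogRecord` shape, a `category` tag that marks validation failures, and a `span_status` lookup built from the trace. None of these names come from a specific logging library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    """Hypothetical structured log entry; field names are illustrative."""
    message: str
    trace_id: Optional[str] = None
    span_id: Optional[str] = None
    has_exception: bool = False
    category: str = ""              # assumed tag, e.g. "validation_failure"


def find_log_signals(records: list[LogRecord], span_status: dict[str, str],
                     response_status: int) -> list[str]:
    """Flag swallowed exceptions, silent validation, and uncorrelated entries."""
    findings = []
    for r in records:
        # Log entry without trace context.
        if r.trace_id is None:
            findings.append(f"missing correlation: {r.message!r}")
        # Exception logged on a span that did not end in error.
        if (r.has_exception and r.span_id is not None
                and span_status.get(r.span_id) == "ok"):
            findings.append(f"swallowed exception: {r.message!r}")
        # Validation failure logged while the client received a success status.
        if r.category == "validation_failure" and response_status < 400:
            findings.append(f"silent validation: {r.message!r}")
    return findings


# Usage: three records from a request that returned 200.
records = [
    LogRecord("payment failed", "t1", "s1", has_exception=True),
    LogRecord("email invalid", "t1", "s2", category="validation_failure"),
    LogRecord("cache warmed"),
]
for finding in find_log_signals(records, {"s1": "ok", "s2": "ok"}, 200):
    print(finding)
```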
Metrics
Metrics cover request counts, error rates, latency distributions, and queue depths. Compare aggregate metrics before and after a change to detect regressions.
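A before/after comparison can be as simple as the sketch below. It assumes each metric has already been reduced to a single number where higher is worse (error rate, p95 latency, queue depth), and it uses a 10% tolerance chosen only for illustration. Request counts do not fit that assumption and would need their own rule. The metric names are hypothetical.

```python
def find_regressions(before: dict[str, float], after: dict[str, float],
                     tolerance: float = 0.10) -> dict[str, tuple[float, float]]:
    """Return metrics whose value grew by more than `tolerance` (relative)."""
    regressions = {}
    for name, old in before.items():
        new = after.get(name)
        if new is None:
            continue  # metric not reported after the change; nothing to compare
        if new > old * (1 + tolerance):
            regressions[name] = (old, new)
    return regressions


# Usage: only the latency change exceeds the tolerance.
before = {"error_rate": 0.01, "p95_latency_ms": 120.0, "queue_depth": 5.0}
after = {"error_rate": 0.01, "p95_latency_ms": 180.0, "queue_depth": 5.2}
print(find_regressions(before, after))
```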
Healthy Trace Patterns
What a correct trace looks like depends on the operation type. These are baseline budgets, not hard rules. Adjust per system.
Simple read (list, get by ID)
- 1-3 DB queries (authorization check + data fetch)
- No message publications
- < 100ms total duration
Simple mutation (create, update)
- 1-3 DB queries (authorization + read + write)
- 0-2 event publications
- All consumer spans succeed if events are published
- < 200ms total duration (excluding external calls)
Batch operation
- DB queries proportional to item count, not N+1
- 1 batch event publication, not N individual events
- Consumer spans complete without error
- Duration scales linearly with item count
Cross-module operation (external service)
- External HTTP span clearly visible
- Retry spans visible if the call failed and retried
- Compensation span visible if operation failed after partial external change
- Total duration dominated by external call, not internal processing
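The fixed budgets above lend themselves to a small lookup table. The sketch below encodes only the simple read and simple mutation baselines, since the batch and cross-module patterns are relative rather than absolute. The `Budget` type, the operation keys, and `check_budget` are hypothetical names for illustration, and `duration_ms` is assumed to exclude external calls.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Budget:
    max_db_queries: int
    max_publications: int
    max_duration_ms: float      # exclusive upper bound on internal duration


# Values copied from the baseline lists above; adjust per system.
BUDGETS = {
    "simple_read": Budget(max_db_queries=3, max_publications=0, max_duration_ms=100),
    "simple_mutation": Budget(max_db_queries=3, max_publications=2, max_duration_ms=200),
}


def check_budget(operation: str, db_queries: int, publications: int,
                 duration_ms: float) -> list[str]:
    """Compare measured values for one trace against its operation's budget."""
    budget = BUDGETS[operation]
    violations = []
    if db_queries > budget.max_db_queries:
        violations.append(f"db_queries: {db_queries} > {budget.max_db_queries}")
    if publications > budget.max_publications:
        violations.append(f"publications: {publications} > {budget.max_publications}")
    if duration_ms >= budget.max_duration_ms:
        violations.append(f"duration_ms: {duration_ms} >= {budget.max_duration_ms}")
    return violations


# Usage: a read that issues too many queries and publishes an event.
print(check_budget("simple_read", db_queries=7, publications=1, duration_ms=40.0))
```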
Trace Analysis
Manual trace inspection does not scale. The methodology defines a trace analysis tool (an MCP server) that accepts a trace ID, extracts structured metrics (span count, DB queries, errored spans, duplicate loads, external call durations), compares them against budgets, and returns pass/fail violations with supporting evidence.
See the trace analyzer spec for details.
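To make the extract, compare, and report flow concrete, here is a rough sketch of the analysis step alone. It is not the spec's interface: the span dictionary keys, the `fetch_spans` loader, the budget keys, and the `Violation` shape are all assumptions for illustration, and the MCP server wiring is left out entirely.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable


@dataclass
class Violation:
    metric: str
    limit: float
    actual: float
    evidence: list[str]        # span IDs that support the finding


def analyze_trace(trace_id: str,
                  fetch_spans: Callable[[str], list[dict]],
                  budgets: dict[str, float]) -> list[Violation]:
    """Extract metrics from a trace, compare them to budgets, return violations."""
    spans = fetch_spans(trace_id)   # hypothetical loader supplied by the caller

    db = [s for s in spans if s["kind"] == "db"]
    errored = [s for s in spans if s["status"] == "error"]
    load_counts = Counter(s["entity"] for s in db if s.get("entity"))
    repeated = [s for s in db if load_counts.get(s.get("entity"), 0) > 1]
    redundant_loads = sum(count - 1 for count in load_counts.values())
    external = [s for s in spans if s["kind"] == "http"]
    slowest = max(external, key=lambda s: s["duration_ms"], default=None)

    # metric name -> (measured value, evidence span IDs)
    metrics = {
        "span_count": (len(spans), []),
        "db_queries": (len(db), [s["id"] for s in db]),
        "errored_spans": (len(errored), [s["id"] for s in errored]),
        "duplicate_loads": (redundant_loads, [s["id"] for s in repeated]),
        "max_external_ms": (slowest["duration_ms"] if slowest else 0.0,
                            [slowest["id"]] if slowest else []),
    }

    # A metric is a violation only when a budget exists and is exceeded.
    return [Violation(name, budgets[name], value, evidence)
            for name, (value, evidence) in metrics.items()
            if name in budgets and value > budgets[name]]


# Usage with an in-memory trace store standing in for a real backend.
sample = {"t1": [
    {"id": "1", "kind": "internal", "status": "ok", "duration_ms": 50.0},
    {"id": "2", "kind": "db", "status": "ok", "duration_ms": 3.0, "entity": "Order:42"},
    {"id": "3", "kind": "db", "status": "ok", "duration_ms": 3.0, "entity": "Order:42"},
    {"id": "4", "kind": "http", "status": "ok", "duration_ms": 2400.0},
]}
limits = {"db_queries": 3, "errored_spans": 0, "duplicate_loads": 0, "max_external_ms": 2000}
violations = analyze_trace("t1", lambda tid: sample[tid], limits)
for v in violations:
    print(v.metric, v.actual, v.evidence)
```

Each violation carries the span IDs behind it, so a reviewer can jump from a failed budget straight to the spans that caused it.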