Glossary

Layer - one of six maturity levels (0-5), adopted incrementally. Each layer addresses a gap the previous layer can not fill.

Journey - a defined sequence of steps that verifies a user-facing flow through the callable surface. Not a test case, a verification arc that produces observable traces.

Walk - executing a journey against a live system. An AI agent follows journey steps through the callable surface, collecting responses and trace IDs.

Walk Report - structured output of a walk. Pass/fail for each step, findings, trace IDs, and an overall result. Evidence, not opinions.

Callable Surface - the interface exercised during a walk. MCP server, REST API, gRPC, CLI, or any programmatically invocable interface. What makes journeys walkable by AI agents.

Evaluation Criteria - binary checks applied to a walk report. PASS or FAIL, no partial credit. Organized into four scopes: functional, performance, security, and observability.

Budget - a numeric threshold for a performance criterion. Example: “DB query count must be less than 5.” Compared against actual trace values.

Violation - a criterion that fails, with evidence: a metric value, a trace ID, a response body. A measured fact, not an opinion.

Finding - an observation made during a walk that is noteworthy but may not be a violation. Unexpected behaviors, quality concerns, or observations outside defined criteria.

Trace - a distributed trace capturing the internal execution path of a system operation. Produced by OpenTelemetry instrumentation. In this methodology, traces are verification evidence, not monitoring data.

Black Box - a walk mode where the walker sees only callable surface responses. No traces, no internal state. Evaluates the system as a user would.

White Box - a walk mode where the walker also has access to distributed traces. Enables performance evaluation, duplicate detection, and internal behavior verification.

Bias Prevention - performing black box evaluation before white box. If the walker sees traces first, it may rationalize functional failures based on internal context.

Full glossary →