Open Problems
Honest about what’s unsolved. Contributions welcome.
Automated Regression
A journey that passes today may break tomorrow, and no automated re-walk mechanism exists yet. Solving this needs dependency-graph mapping, CI-integrated walks, and a headless walk runner.
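The dependency-graph piece could be as simple as each journey step declaring which source files it exercises, so a CI hook can intersect that with the change set. A minimal sketch; the step names, file paths, and the `STEP_DEPENDENCIES` table are invented for illustration:

```python
# Hypothetical sketch: map changed files to journey steps that must be
# re-walked. All step names and paths below are placeholders.

# Each journey step declares the source files it exercises.
STEP_DEPENDENCIES = {
    "signup/submit-form": {"app/auth.py", "app/models/user.py"},
    "signup/verify-email": {"app/auth.py", "app/mailer.py"},
    "billing/add-card": {"app/billing.py"},
}

def steps_to_rewalk(changed_files):
    """Return the journey steps whose dependencies intersect the change set."""
    changed = set(changed_files)
    return sorted(
        step for step, deps in STEP_DEPENDENCIES.items()
        if deps & changed
    )
```

A change to `app/auth.py` would flag both signup steps and leave billing alone, which is exactly the incremental re-walk behavior described under Scale.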
Adversarial Journeys
Current journeys are “actor tries to succeed.” Missing: authorization bypass attempts, cross-tenant data access, role escalation, input injection. The walk procedure supports failure-mode testing per step, but dedicated adversarial journeys with systematic threat modeling would be more thorough.
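One way to make this systematic: mechanically expand each happy-path step into adversarial variants, one per attack category listed above. A hedged sketch; the variant names and `::` naming convention are assumptions, not an existing scheme:

```python
# Hypothetical sketch: derive adversarial variants from a happy-path step.
# Categories mirror the gaps listed above; names are invented.

ADVERSARIAL_VARIANTS = [
    "authorization-bypass",   # call the endpoint as an actor lacking the role
    "cross-tenant-access",    # swap in another tenant's resource id
    "role-escalation",        # attempt an admin-only action as a member
    "input-injection",        # send SQL/script payloads in free-text fields
]

def adversarial_steps(step_name):
    """Expand one happy-path step into its adversarial counterparts."""
    return [f"{step_name}::{variant}" for variant in ADVERSARIAL_VARIANTS]
```

Generating variants this way guarantees coverage of every category for every step, rather than relying on ad hoc per-step failure modes.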
Performance Budgets
Trace analysis currently uses default budgets (5 DB queries for CRUD, 500 ms response time). No per-operation baselines exist. Adding baselines requires measuring current performance and defining acceptable thresholds per journey step.
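The check itself is mechanical once baselines exist: look up a per-operation budget, fall back to the defaults, and report violations. A sketch under assumed field names; the `BASELINES` numbers are placeholders, since real ones would come from measurement:

```python
# Hypothetical sketch: check a step's trace against per-operation budgets,
# falling back to the defaults stated above (5 queries, 500 ms).

DEFAULT_BUDGET = {"max_db_queries": 5, "max_response_ms": 500}

# Per-operation baselines would be measured, not guessed; placeholders only.
BASELINES = {
    "search/full-text": {"max_db_queries": 12, "max_response_ms": 1200},
}

def check_budget(operation, db_queries, response_ms):
    """Return a list of human-readable budget violations (empty = pass)."""
    budget = BASELINES.get(operation, DEFAULT_BUDGET)
    violations = []
    if db_queries > budget["max_db_queries"]:
        violations.append(f"db_queries {db_queries} > {budget['max_db_queries']}")
    if response_ms > budget["max_response_ms"]:
        violations.append(f"response_ms {response_ms} > {budget['max_response_ms']}")
    return violations
```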
Scale
At 8 journeys, manual walks work. At 50, they don’t. Scaling requires automated execution, prioritization, and incremental re-walks (only steps affected by code changes).
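Prioritization, for example, could rank journeys by risk and by how recently their code changed, then take however many fit the walk budget. A minimal sketch with invented fields (`risk`, `days_since_change`); the scoring rule is an assumption, not a defined policy:

```python
# Hypothetical sketch: order journeys for re-walking when there is no time
# to walk all of them. Fields and scoring are illustrative.

def prioritize(journeys, budget):
    """Walk the highest-risk, most recently touched journeys first."""
    ranked = sorted(
        journeys,
        key=lambda j: (j["risk"], -j["days_since_change"]),
        reverse=True,
    )
    return [j["name"] for j in ranked[:budget]]
```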
Walk Data Format
Walk transcripts are AI-generated markdown. Future direction: structured JSON that can be consumed by a test framework for automated evaluation. This is a prerequisite for CI-integrated walks.
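To make this concrete, one possible transcript shape: binary criteria per step, serializable for a CI runner. The field names are a proposal, not an existing schema:

```python
# Hypothetical sketch of a structured walk transcript, replacing free-form
# markdown. Field names are proposed, not an existing format.
import json

transcript = {
    "journey": "signup",
    "model": "example-model-v1",
    "steps": [
        {
            "name": "submit-form",
            "criteria": [
                {"id": "returns-201", "passed": True},
                {"id": "sends-welcome-email", "passed": False},
            ],
        }
    ],
}

def step_passed(step):
    """A step passes only if every binary criterion passed."""
    return all(c["passed"] for c in step["criteria"])

serialized = json.dumps(transcript)  # what a CI runner would ingest
```

Because the criteria are binary, a test framework can evaluate a transcript without interpreting prose, which is what CI integration needs.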
Cross-Model Consistency
Unknown whether different AI models walking the same journey with the same criteria produce comparable results. Binary criteria should help, but trace analysis may vary.
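Binary criteria at least make disagreement measurable: run two models over the same journey and diff their verdicts per criterion. A sketch assuming each walk reduces to a criterion-id → pass/fail map (an assumed shape):

```python
# Hypothetical sketch: compare two models' walks of the same journey by
# their binary-criteria verdicts. The dict shape is assumed.

def disagreements(walk_a, walk_b):
    """Return criterion ids where the two models' verdicts differ."""
    return sorted(
        cid for cid in walk_a
        if cid in walk_b and walk_a[cid] != walk_b[cid]
    )
```

An empty result on every shared journey would be evidence that criteria are model-independent; a nonempty one pinpoints where trace analysis varies.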
Methodology Observability
No meta-metrics for “is the methodology working?” Candidates: issues found per walk, re-walk count before passing, criteria violations per scope over time, time from walk failure to walk pass.
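The candidate metrics reduce to simple aggregates over walk records. A sketch with invented record fields (`issues_found`, `rewalks_before_pass`):

```python
# Hypothetical sketch: compute candidate meta-metrics from walk records.
# Record fields are invented for illustration.

def meta_metrics(walks):
    """Aggregate per-walk records into methodology-health numbers."""
    issues = sum(w["issues_found"] for w in walks)
    rewalks = [w["rewalks_before_pass"] for w in walks]
    return {
        "issues_per_walk": issues / len(walks),
        "avg_rewalks_before_pass": sum(rewalks) / len(rewalks),
    }
```

Tracked over time, a falling issues-per-walk with a stable re-walk count would suggest the methodology is converging rather than churning.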
Human Bottleneck at Scale
The walk procedure requires human review for every fix. At scale, this needs an auto-approve path for low-risk fixes and mandatory review for high-risk changes.
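Even a crude routing rule would cut the review load. A sketch; the risk-path prefixes and diff-size threshold are invented placeholders, and a real policy would be tuned per codebase:

```python
# Hypothetical sketch: route fixes to auto-approve or human review.
# Prefixes and the 50-line threshold are invented placeholders.

HIGH_RISK_PREFIXES = ("app/auth", "app/billing", "migrations/")

def requires_review(changed_files, lines_changed):
    """High-risk paths or large diffs always go to a human."""
    if lines_changed > 50:
        return True
    return any(f.startswith(HIGH_RISK_PREFIXES) for f in changed_files)
```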
Callable Surface Parity
If the system exposes multiple surfaces (REST, MCP, gRPC), nothing verifies that they behave identically. A walk through MCP may pass while the same operation through REST fails, or fails in a different way.
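A parity check could run the same logical operation through each surface and diff the normalized results against a baseline. A sketch; the callable-per-surface shape and normalization step are assumptions:

```python
# Hypothetical sketch: run one logical operation through every surface and
# report surfaces whose normalized result diverges from the first one.

def check_parity(operation, surfaces):
    """surfaces maps surface name -> callable returning a normalized result."""
    results = {name: call(operation) for name, call in surfaces.items()}
    baseline = next(iter(results.values()))
    return {name: r for name, r in results.items() if r != baseline}
```

The hard part in practice is normalization: each surface wraps results in its own envelope, so the callables must strip transport details before comparison.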