TDD for AI Agents: Why Probability Needs Determinism
Prompt engineering has hit a ceiling. The next reliability leap comes from runtime verification: assertions, traces, and loud failures.
Stop asking your agent if it succeeded. Start proving it.
1) The “silent failure” crisis
You ask an agent to do something boring and practical:
“Buy the cheapest USB‑C cable.”
It navigates. It clicks. It returns a cheerful answer:
“Success! Order placed for $9.99.”
Then you check the browser.
The cart is empty.
A “Subscribe to Newsletter” modal intercepted the click. The agent didn’t see it — not because the modal was exotic, but because your loop was built on a single fragile assumption:
The assumption that breaks production
The agent can judge its own success.
In the world of probabilistic AI, hallucinated success is the default state.
[Figure: the reality gap between the green, self-reported log (no proof) and verified reality. Two common failure modes: (1) the loop can lie about what happened; (2) logs can be "green" while reality is "red".]
2) The diagnosis: the “loop of lies”
Most agent stacks follow a standard loop:
- Sense: capture a screenshot / DOM / AX tree
- Reason: a model decides what to do next
- Act: execute the click / type / submit
The bug is subtle.
We often use the same probabilistic model to do two jobs:
- decide the action
- judge whether the action worked
That’s like letting a junior intern grade their own final exam.
Of course they got an A.
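In pseudocode, the anti-pattern looks like this (a minimal sketch; `model`, `browser`, and their methods are illustrative, not any real API):

```python
# A minimal sketch of the double-duty bug: the same probabilistic model
# both chooses the action and grades its own result.
def run_step(model, browser) -> bool:
    obs = browser.screenshot()             # Sense
    action = model.choose_action(obs)      # Reason
    browser.execute(action)                # Act
    # The bug: a self-report from the same model is treated as ground truth.
    verdict = model.ask(f"Did '{action}' succeed? Answer yes or no.")
    return verdict.strip().lower() == "yes"
```

No independent check ever touches the browser's actual state, so a wrong "yes" propagates into the next step unchallenged.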
The result is compounding drift:
- one missed modal at Step 2 becomes a fake checkout at Step 10
- one wrong “success” becomes a whole run that looks correct in the logs
The core failure
Probabilistic reasoning cannot produce deterministic accountability.
3) The solution: TDD for agents
We don’t need smarter agents (yet).
We need stricter runtimes.
Here’s the pivot:
- Probability is for planning.
- Determinism is for verification.
The old way (prompting your way out of uncertainty)
“Please check if the cart has items.”
That’s not a check. That’s a request for another guess.
The new way (assertions)
You write a predicate over observable state:
- “cart count > 0”
- “URL contains /checkout”
- “confirmation message exists”
And you treat it like software:
- pass → continue
- fail → stop, log evidence, replan
This is just Test-Driven Development — applied to runtime behavior.
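Concretely, each predicate is a small, deterministic function over observable browser state. Here is a sketch using Playwright's sync API (the selectors are hypothetical; adapt them to the page under test):

```python
from playwright.sync_api import Page

# Hypothetical selectors for an example storefront.
def cart_has_items(page: Page) -> bool:
    text = page.locator("#cart-count").inner_text()  # e.g. "3"
    return text.strip().isdigit() and int(text) > 0

def on_checkout_page(page: Page) -> bool:
    return "/checkout" in page.url

def confirmation_exists(page: Page) -> bool:
    return page.locator("#confirmation-message").count() > 0
```

Each returns a plain boolean computed from real state, so there is nothing for the model to hallucinate: the check either passes or it does not.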
4) Sentience: a runtime verification “referee”
Sentience is not another agent framework.
It’s a runtime verification layer that sits between your agent (LangChain, browser-use, your custom loop) and the browser.
The loop changes shape:
- The Loop of Lies: screenshot → model picks action → model declares success → drift accumulates.
- The Loop of Truth: snapshot → action → assert → pass/fail + trace artifacts.
How it works, in practice:
- Snapshot: prune to “interactable truth” (roles + geometry + stability) instead of dumping raw DOM
- Assert: the agent must pass a code-level check before proceeding
- Trace: failures produce artifacts that explain why (not just that) the run failed
We prefer a loud red failure over a fake green success.
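Wired together, the Loop of Truth is an ordinary control loop with a hard gate after every action. A sketch, assuming hypothetical `snapshot()` and `act()` methods (this is not the Sentience API):

```python
import json
import time
from pathlib import Path

def gated_step(browser, action, predicate, label, trace_dir="traces"):
    """Run one action, then refuse to proceed unless reality matches the claim."""
    before = browser.snapshot()   # pruned "interactable truth", not the raw DOM
    browser.act(action)
    after = browser.snapshot()
    if predicate(after):
        return                    # pass -> continue the plan
    # fail -> halt loudly and leave an artifact that explains *why*
    Path(trace_dir).mkdir(exist_ok=True)
    artifact = {"label": label, "action": repr(action),
                "before": before, "after": after, "ts": time.time()}
    (Path(trace_dir) / f"{label}.json").write_text(
        json.dumps(artifact, indent=2, default=str))
    raise AssertionError(f"{label}: postcondition failed; see {trace_dir}/{label}.json")
```

The exception is the point: a raised error at the failing step is cheaper to debug than ten more steps of confident drift.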
5) Case study: the “invisible” modal
This is the entire point.
You don’t defeat modals by asking the model to be more careful. You defeat them by refusing to proceed unless reality matches the claim.
Here’s the line that saves you:
```python
# The line that saved us
runtime.assert_(
    exists("#confirmation-message"),
    label="verify_checkout",
    required=True,
)
```

Outcome:
- The agent halts (instead of drifting).
- You fix the logic (dismiss the modal, re-click, or replan; see the sketch below).
- You ship with confidence because the success condition is provable.
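What does "fix the logic" look like? One common pattern is to catch the failure, dismiss the offending modal, and retry. A hedged sketch using Playwright (the selectors and retry policy are hypothetical):

```python
from playwright.sync_api import Page

def checkout_with_retry(page: Page, attempts: int = 2) -> None:
    for _ in range(attempts):
        page.locator("#place-order").click()
        if page.locator("#confirmation-message").count() > 0:
            return                               # verified success
        # Known failure mode: a newsletter modal swallowed the click.
        modal_close = page.locator(".newsletter-modal .close")
        if modal_close.count() > 0:
            modal_close.click()                  # dismiss, then re-try the click
            continue
        break                                    # unknown state: stop, don't guess
    raise AssertionError("verify_checkout failed after retries")
```

Note the shape: every retry still ends in the same deterministic check, never in a model self-report.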
6) Conclusion: the trust barrier
We will not trust agents with money or data if they operate on vibes.
If your system is “click and hope,” it will remain a demo.
If your system is “act and verify,” it becomes software.
Stop building click-and-hope bots
Attach Sentience Debugger to your existing agent loop and start gating actions with deterministic assertions and traceable artifacts.