TDD for AI Agents: Why Probability Needs Determinism
Prompt engineering has hit a ceiling. The next reliability leap comes from runtime verification: assertions, traces, and loud failures.
Stop asking your agent if it succeeded. Start proving it.
1) The “silent failure” crisis
You ask an agent to do something boring and practical:
“Buy the cheapest USB‑C cable.”
It navigates. It clicks. It returns a cheerful answer:
“Success! Order placed for $9.99.”
Then you check the browser.
The cart is empty.
A “Subscribe to Newsletter” modal intercepted the click. The agent didn’t see it — not because the modal was exotic, but because your loop was built on a single fragile assumption:
The assumption that breaks production
The agent can judge its own success.
In the world of probabilistic AI, hallucinated success is the default state.
[Figure: the reality gap between the green, self-reported log (no proof) and verified reality. Two common failure modes: (1) the loop can lie about what happened; (2) logs can be "green" while reality is "red".]
2) The diagnosis: the “loop of lies”
Most agent stacks follow a standard loop:
- Sense: capture a screenshot / DOM / AX tree
- Reason: a model decides what to do next
- Act: execute the click / type / submit
The bug is subtle.
We often use the same probabilistic model to do two jobs:
- decide the action
- judge whether the action worked
That’s like letting a junior intern grade their own final exam.
Of course they got an A.
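In pseudocode, the anti-pattern looks like this (a minimal sketch; `model`, `browser`, and their methods are illustrative, not any real API):

```python
# A minimal sketch of the double-duty bug: the same probabilistic model
# both chooses the action and grades its own result.
def run_step(model, browser) -> bool:
    obs = browser.screenshot()             # Sense
    action = model.choose_action(obs)      # Reason
    browser.execute(action)                # Act
    # The bug: a self-report from the same model is treated as ground truth.
    verdict = model.ask(f"Did '{action}' succeed? Answer yes or no.")
    return verdict.strip().lower() == "yes"
```

No independent check ever touches the browser's actual state, so a wrong "yes" propagates into the next step unchallenged.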
The result is compounding drift:
- one missed modal at Step 2 becomes a fake checkout at Step 10
- one wrong “success” becomes a whole run that looks correct in the logs
The core failure
Probabilistic reasoning cannot produce deterministic accountability.
3) The solution: TDD for agents
We don’t need smarter agents (yet).
We need stricter runtimes.
Here’s the pivot:
- Probability is for planning.
- Determinism is for verification.
The old way (prompting your way out of uncertainty)
“Please check if the cart has items.”
That’s not a check. That’s a request for another guess.
The new way (assertions)
You write a predicate over observable state:
- “cart count > 0”
- “URL contains /checkout”
- “confirmation message exists”
And you treat it like software:
- pass → continue
- fail → stop, log evidence, replan
This is just Test-Driven Development — applied to runtime behavior.
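Concretely, each predicate is a small, deterministic function over observable browser state. Here is a sketch using Playwright's sync API (the selectors are hypothetical; adapt them to the page under test):

```python
from playwright.sync_api import Page

# Hypothetical selectors for an example storefront.
def cart_has_items(page: Page) -> bool:
    text = page.locator("#cart-count").inner_text()  # e.g. "3"
    return text.strip().isdigit() and int(text) > 0

def on_checkout_page(page: Page) -> bool:
    return "/checkout" in page.url

def confirmation_exists(page: Page) -> bool:
    return page.locator("#confirmation-message").count() > 0
```

Each returns a plain boolean computed from real state, so there is nothing for the model to hallucinate: the check either passes or it does not.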
4) Sentience: a runtime verification “referee”
Sentience is not another agent framework.
It’s a runtime verification layer that sits between your agent (LangChain, browser-use, your custom loop) and the browser.
The loop changes shape:
- The Loop of Lies: screenshot → model picks action → model declares success → drift accumulates.
- The Loop of Truth: snapshot → action → assert → pass/fail + trace artifacts.
How it works, in practice:
- Snapshot: prune to “interactable truth” (roles + geometry + stability) instead of dumping raw DOM
- Assert: the agent must pass a code-level check before proceeding
- Trace: failures produce artifacts that explain why (not just that) the run failed
We prefer a loud red failure over a fake green success.
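Wired together, the Loop of Truth is an ordinary control loop with a hard gate after every action. A sketch, assuming hypothetical `snapshot()` and `act()` methods (this is not the Sentience API):

```python
import json
import time
from pathlib import Path

def gated_step(browser, action, predicate, label, trace_dir="traces"):
    """Run one action, then refuse to proceed unless reality matches the claim."""
    before = browser.snapshot()   # pruned "interactable truth", not the raw DOM
    browser.act(action)
    after = browser.snapshot()
    if predicate(after):
        return                    # pass -> continue the plan
    # fail -> halt loudly and leave an artifact that explains *why*
    Path(trace_dir).mkdir(exist_ok=True)
    artifact = {"label": label, "action": repr(action),
                "before": before, "after": after, "ts": time.time()}
    (Path(trace_dir) / f"{label}.json").write_text(
        json.dumps(artifact, indent=2, default=str))
    raise AssertionError(f"{label}: postcondition failed; see {trace_dir}/{label}.json")
```

The exception is the point: a raised error at the failing step is cheaper to debug than ten more steps of confident drift.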
5) Case study: the “invisible” modal
This is the entire point.
You don’t defeat modals by asking the model to be more careful. You defeat them by refusing to proceed unless reality matches the claim.
Here’s the line that saves you:
```python
# The line that saved us
runtime.assert_(
    exists("#confirmation-message"),
    label="verify_checkout",
    required=True,
)
```

Outcome:
- The agent halts (instead of drifting).
- You fix the logic (dismiss the modal, re-click, or replan; see the sketch below).
- You ship with confidence because the success condition is provable.
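What does "fix the logic" look like? One common pattern is to catch the failure, dismiss the offending modal, and retry. A hedged sketch using Playwright (the selectors and retry policy are hypothetical):

```python
from playwright.sync_api import Page

def checkout_with_retry(page: Page, attempts: int = 2) -> None:
    for _ in range(attempts):
        page.locator("#place-order").click()
        if page.locator("#confirmation-message").count() > 0:
            return                               # verified success
        # Known failure mode: a newsletter modal swallowed the click.
        modal_close = page.locator(".newsletter-modal .close")
        if modal_close.count() > 0:
            modal_close.click()                  # dismiss, then re-try the click
            continue
        break                                    # unknown state: stop, don't guess
    raise AssertionError("verify_checkout failed after retries")
```

Note the shape: every retry still ends in the same deterministic check, never in a model self-report.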
6) Conclusion: the trust barrier
We will not trust agents with money or data if they operate on vibes.
If your system is “click and hope,” it will remain a demo.
If your system is “act and verify,” it becomes software.
Stop building click-and-hope bots
Attach Sentience Debugger to your existing agent loop and start gating actions with deterministic assertions and traceable artifacts.