Why AX Alone Isn't Enough for Reliable Agents
The Accessibility Tree (AX) is one of the most underappreciated tools in the browser. It provides standardized roles, names, and states—and many agent frameworks rightly rely on it.
But when it comes to reliable LLM agents acting on modern websites, AX alone isn't enough.
Not because AX is flawed—but because it was never designed for what agents need to do.
This post explains where AX shines, where it falls short, and why geometry + stability + verification are required for dependable agents.
What AX is great at
AX answers the question:
"What elements are accessible?"
It provides:
- roles (button, link, textbox)
- names and labels
- states (checked, disabled, expanded)
For a single, static document, this is extremely powerful.
AX mental model
Accessibility Tree
┌─────────────────────┐
│ Button              │
│ name: "Continue"    │
│ role: button        │
│ enabled: true       │
└─────────────────────┘
If an agent already knows which element to act on and when the page is ready, AX works well.
But that's the catch.
Where AX breaks down for agents
Modern web apps are:
- JS-heavy
- dynamically hydrated
- full of embedded content and iframes
- constantly changing layout and visibility
AX is intentionally lossy in these dimensions.
1. No global ordinality
AX doesn't encode:
- "first result"
- "main CTA"
- "top item above the fold"
The Ordinality Problem
On a page with repeated elements, agents still need to answer: Which one matters?
AX doesn't model that.
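A minimal sketch of the problem, using plain dictionaries as a stand-in for AX nodes. The node shape and the `find_by_role_and_name` helper are hypothetical and not a real browser or SDK API. A role-and-name lookup returns every matching node, and nothing in the node fields says which one is the "first result":

```python
# Hypothetical, simplified AX-like nodes: role and name only.
ax_nodes = [
    {"role": "link", "name": "Result A"},
    {"role": "button", "name": "Open"},
    {"role": "link", "name": "Result B"},
    {"role": "button", "name": "Open"},
]


def find_by_role_and_name(nodes, role, name):
    """Return every node whose role and name both match."""
    return [n for n in nodes if n["role"] == role and n["name"] == name]


matches = find_by_role_and_name(ax_nodes, "button", "Open")
print(len(matches))              # 2
print(matches[0] == matches[1])  # True: the two nodes are semantically identical
```

The agent asked for "the Open button" and got two indistinguishable candidates, so it needs extra information to choose between them.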
2. Fragmentation across iframes
Each iframe has its own AX tree.
Page
├─ AX Tree (main document)
├─ AX Tree (auth iframe)
└─ AX Tree (checkout iframe)
What's missing:
- global ordering across frames
- spatial relationships
- visibility/occlusion awareness
What Agents Need
Agents need a single interaction surface, not multiple disconnected trees.
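One way to picture a single interaction surface is to translate each frame's element boxes into top-level page coordinates and sort them together. This is an illustrative sketch only: the `Box` type, the frame dictionaries, and `merge_frames` are assumptions made for the example, and it ignores nested frames, scrolling, and clipping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    width: float
    height: float


def to_page_coords(box, frame_x, frame_y):
    """Translate a frame-local box into top-level page coordinates."""
    return Box(box.x + frame_x, box.y + frame_y, box.width, box.height)


def merge_frames(frames):
    """Flatten per-frame elements into one list ordered top-to-bottom, then left-to-right."""
    merged = []
    for frame in frames:
        fx, fy = frame["origin"]
        for name, box in frame["elements"]:
            merged.append((frame["id"], name, to_page_coords(box, fx, fy)))
    merged.sort(key=lambda item: (item[2].y, item[2].x))
    return merged


# Hypothetical page: a main document plus a checkout iframe placed at y=200.
frames = [
    {"id": "main", "origin": (0, 0), "elements": [("Pay now", Box(20, 600, 120, 40))]},
    {"id": "checkout", "origin": (0, 200), "elements": [("Card number", Box(20, 30, 300, 40))]},
]

surface = merge_frames(frames)
print([(frame_id, name) for frame_id, name, _ in surface])
# [('checkout', 'Card number'), ('main', 'Pay now')]
```

Once every element shares one coordinate space, "the field above the Pay button" becomes answerable even though the two elements live in different frames.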
3. AX doesn't model stability
AX can report elements that:
- exist but aren't usable yet
- are technically accessible but visually blocked
- are mid-transition during hydration
AX Answers
"Is this accessible?"
Agents Need
"Is this usable right now?"
Those are different questions.
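The gap between the two questions can be sketched as a pair of checks layered on top of the semantic state. Everything here is an assumption made for illustration: the 500 ms quiet window, the element dictionaries, and the flat z-index comparison (real stacking contexts are more involved).

```python
def is_dom_quiet(mutation_times_ms, now_ms, quiet_window_ms=500):
    """True if no DOM mutation was recorded within the last quiet_window_ms."""
    if not mutation_times_ms:
        return True
    return now_ms - max(mutation_times_ms) >= quiet_window_ms


def is_occluded(target, overlays):
    """True if an overlay with a higher z value covers the target's center point."""
    cx = target["x"] + target["width"] / 2
    cy = target["y"] + target["height"] / 2
    for o in overlays:
        covers = (o["x"] <= cx <= o["x"] + o["width"]
                  and o["y"] <= cy <= o["y"] + o["height"])
        if covers and o["z"] > target["z"]:
            return True
    return False


def is_usable_now(target, overlays, mutation_times_ms, now_ms):
    """Semantic state (enabled) plus stability (quiet DOM) plus visibility (not blocked)."""
    return (target["enabled"]
            and is_dom_quiet(mutation_times_ms, now_ms)
            and not is_occluded(target, overlays))


button = {"x": 100, "y": 100, "width": 80, "height": 30, "z": 0, "enabled": True}
cookie_banner = {"x": 0, "y": 0, "width": 1280, "height": 720, "z": 10}

print(is_usable_now(button, [cookie_banner], [100], 1000))  # False: blocked by the banner
print(is_usable_now(button, [], [900], 1000))               # False: DOM changed 100 ms ago
print(is_usable_now(button, [], [100], 1000))               # True
```

In all three calls the button is enabled and has the same role and name; only the extra signals separate "accessible" from "usable right now".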
Why geometry matters
Geometry adds spatial truth to semantics.
It answers:
- where elements are
- what's above/below
- what's inside what
- what's visible in the viewport
Geometry mental model
Viewport
┌──────────────────────────────┐
│ [Search Result #1]           │ ← dominant group, index 0
│   [Title] [Open]             │
│                              │
│ [Search Result #2]           │ ← index 1
│   [Title] [Open]             │
└──────────────────────────────┘
With geometry, agents can reason about:
- Ordinality — "first result", "second button"
- Grouping — "button inside same card"
- Hierarchy — "main content vs sidebar"
AX alone doesn't encode this.
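As a rough sketch of how bounding boxes yield ordinality and grouping, the example below ranks cards by position and then picks the button contained in the first card. The element dictionaries and the `contains` helper are hypothetical and are not taken from any particular SDK.

```python
def contains(outer, inner):
    """True if inner's box lies fully inside outer's box."""
    return (outer["x"] <= inner["x"]
            and outer["y"] <= inner["y"]
            and inner["x"] + inner["w"] <= outer["x"] + outer["w"]
            and inner["y"] + inner["h"] <= outer["y"] + outer["h"])


# Deliberately listed out of visual order.
cards = [
    {"id": "card-b", "x": 0, "y": 220, "w": 600, "h": 100},
    {"id": "card-a", "x": 0, "y": 100, "w": 600, "h": 100},
]
buttons = [
    {"id": "open-2", "name": "Open", "x": 500, "y": 250, "w": 60, "h": 30},
    {"id": "open-1", "name": "Open", "x": 500, "y": 130, "w": 60, "h": 30},
]

# Ordinality: rank cards top-to-bottom, then left-to-right.
ranked = sorted(cards, key=lambda c: (c["y"], c["x"]))
first_card = ranked[0]

# Grouping: the "Open" button that sits inside the first card.
target = next(b for b in buttons if contains(first_card, b))
print(first_card["id"], target["id"])  # card-a open-1
```

Both buttons share the same role and name, yet "the Open button in the first result" resolves to exactly one element once position and containment are available.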
Why vision alone isn't the answer either
Vision-first agents take screenshots and ask:
"What do you see?"
This works—but at a cost.
Vision mental model
Screenshot
┌──────────────────────────────┐
│ pixels, text, colors         │
│ everything looks important   │
└──────────────────────────────┘
Vision Trade-offs
- Expensive — burns tokens every step
- Brittle — struggles with ordinality
- Hard to debug — no clear action targets
- Risky — can hallucinate success
Vision is powerful—but unreliable as a default perception layer.
Comparing the three approaches
AX (Semantics)
"What exists?"
Geometry (Structure)
"What matters where?"
Vision (Fallback)
"What does it look like?"
Each answers a different question.
Reliable agents need all three, but in the right order.
The Sentience approach: structure-first, vision-last
Sentience treats the browser as a verifiable interaction surface.
It combines:
- AX-style semantics — roles, names, states
- Rendered geometry — layout, ordinality, grouping
- Stability signals — DOM quiet time, confidence
- Assertions — verify outcomes instead of guessing
Sentience mental model
Rendered DOM
↓
Semantic + Geometry Snapshot
↓
Assertions
↓
PASS / FAIL (with reasons)
Vision is used only when structure is exhausted, and only to verify, not to guess.
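The "assert, then PASS / FAIL with reasons" step can be sketched as a bounded polling helper. This is a generic illustration and not the Sentience SDK's API: the `eventually` function, its parameters, and the simulated `url_is_checkout` check are all assumptions made for the example.

```python
import time


def eventually(predicate, timeout_s=2.0, interval_s=0.1):
    """Poll predicate until it returns True or the timeout elapses.

    Returns (passed, reason) so that a failure carries an explanation.
    """
    deadline = time.monotonic() + timeout_s
    attempts = 0
    while True:
        attempts += 1
        if predicate():
            return True, f"passed after {attempts} attempt(s)"
        if time.monotonic() >= deadline:
            return False, f"timed out after {attempts} attempt(s)"
        time.sleep(interval_s)


# Simulated page state: the expected outcome only appears on the third poll.
polls = {"count": 0}


def url_is_checkout():
    polls["count"] += 1
    return polls["count"] >= 3


passed, reason = eventually(url_is_checkout, timeout_s=1.0, interval_s=0.01)
print(passed, reason)  # True passed after 3 attempt(s)
```

Because the retry loop has a hard deadline, a missing outcome surfaces as an explicit failure with a reason instead of an open-ended series of guesses.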
Why this matters for LLM agents
When structure is explicit:
- smaller models work
- token usage drops
- retries become bounded
- failures become explainable
LLMs are great at planning. They're bad at inferring unstable UI structure from scratch.
The perception layer should supply that structure, so the model does not have to infer it.
The takeaway
AX is necessary—but not sufficient—for reliable agents.
Accessibility Tree (AX) vs Sentience Snapshot (Semantic DOM + Geometry)
| Dimension | Accessibility Tree (AX) | Sentience Snapshot (Semantic DOM + Geometry) |
| ------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------- |
| Primary purpose | Assistive technologies (screen readers, keyboard navigation) | Reliable execution & verification for AI browser agents |
| Design goal | Human accessibility compliance | Deterministic agent perception and control |
| Source of truth | Accessibility API (roles, labels, ARIA) | Rendered DOM + layout geometry + interaction state |
| Coverage guarantee | Only what is exposed for accessibility | What is actually rendered and interactive |
| Completeness | Partial by design (many elements omitted) | Explicitly optimized for task completeness |
| Layout / geometry | No bounding boxes or spatial relationships | Precise bounding boxes, containment, row/column grouping |
| Ordinal information | No notion of "first / second / nth" | Ordinality (doc_y, group index, dominant group rank) |
| Visual grouping | Flat semantic tree | Cards, lists, feeds, drawers, modals inferred geometrically |
| Z-index / overlays | Not represented | Explicit overlay, modal, and blocking layer detection |
| State awareness | Limited (checked, expanded if annotated) | Enabled/disabled, visibility, occlusion, blocking |
| Dynamic SPA handling | Inconsistent across frameworks | Works on fully hydrated, JS-heavy SPAs |
| Failure behavior | Silent omission when semantics missing | Deterministic failure with reason codes |
| Reliance on developer correctness | High (ARIA must be perfect) | Lower (geometry + heuristics compensate) |
| Token efficiency | Often large / noisy | Aggressively pruned (≈30–60 elements) |
| LLM reasoning friendliness | Requires inference over missing info | Bounded, ranked, LLM-ready state |
| Verification support | Not designed for assertions | First-class assertions & .eventually() retries |
| Debugging artifacts | None | Snapshot JSON, screenshots, video, traces |
| Generalization across sites | Assumes strong accessibility hygiene | Designed to generalize to messy real-world sites |
| Best used for | Keyboard / screen reader navigation | Autonomous agents that must prove progress |
What Reliable Agents Require
On modern, JS-heavy, iframe-filled websites, agents need:
- Semantics to know what exists
- Geometry to know what matters
- Stability to know when to act
- Assertions to know what succeeded
That's how you move from:
"The agent probably clicked the right thing"
to:
"The system verified the outcome."
Move Beyond AX-Only Agents
Sentience combines accessibility semantics with geometry, stability signals, and verification—giving your agents the perception layer they need to act reliably.