Abstract
AI coding agents can write code, run tests, and commit to git — but they cannot click buttons, fill web forms, or navigate vendor dashboards. When building production SaaS products, agents frequently need to create accounts on Stripe, Neon, fal.ai, and Cloudflare; verify email addresses; complete CAPTCHA challenges; and configure deployment settings through browser-based dashboards. This article presents Browser Control Layer (BCL), a Python CLI tool that bridges this gap on macOS by wrapping the Accessibility API (AXUIElement) to discover, target, and interact with GUI elements in background windows without stealing OS focus. We describe the architecture, the AXPress-first click strategy that avoids mouse movement, the window isolation pattern for multi-agent use, and the mutex protocol that prevents cursor conflicts when 20+ agents share one physical Mac.
The Problem: Agents Need Hands
Modern AI coding agents (Claude Code, Cursor, Windsurf) operate through terminal access: they read files, write code, run shell commands, and interact with APIs. This covers most software engineering tasks. But real product launches require interacting with systems that have no API:
- Account creation: Stripe, Cloudflare, Neon, and fal.ai all require web-based signup flows with email verification and often Turnstile or reCAPTCHA challenges.
- Dashboard configuration: Setting environment variables on Vercel, creating R2 buckets on Cloudflare, or configuring webhook endpoints on Stripe requires navigating multi-step browser UIs.
- Email verification: OTP codes arrive in Gmail or Apple Mail and must be copied into a browser form.
- Authentication flows: OAuth consent screens, two-factor authentication prompts, and session management all happen in the browser.
Without the ability to perform these actions, agents hit a wall: code is complete, deployment scripts are written, but the product cannot go live because nobody can click “Create Account” on a vendor’s website.
The naive solution — having a human perform these steps — defeats the purpose of autonomous agent swarms. The goal is a system where agents handle everything end-to-end, including the browser-based steps that traditional automation tools like Selenium or Puppeteer cannot reach (they require a browser process; macOS Accessibility works across any application).
Architecture: Accessibility API as the Universal Interface
BCL is a Python CLI tool (~2,500 lines) that uses macOS Accessibility API (via pyobjc-framework-ApplicationServices) as its primary interaction mechanism. The key insight is that every macOS application exposes its UI elements through AXUIElement objects, regardless of whether it is a browser, a native app, or an Electron wrapper.
The architecture has three layers:
Discovery layer: bcl discover walks the AX tree for a target application window and returns a numbered list of interactive elements (buttons, text fields, links, checkboxes). Each element includes its role, title, value, position, and a stable AX path.
Interaction layer: bcl click N, bcl type N 'text', and bcl key <shortcut> execute actions on discovered elements. The click command uses an AXPress-first strategy (described below).
Navigation layer: bcl navigate <url> uses AppleScript to set a browser tab’s URL without bringing the window to the foreground. bcl windows --json inventories all windows with their tab URLs via JXA (JavaScript for Automation).
The AXPress-First Click Strategy
A critical design decision in BCL is how clicks are delivered. The naive approach — move the mouse cursor to the element’s coordinates and simulate a click — is fragile in a multi-agent environment because:
- The mouse cursor is a shared global resource. If another process moves the cursor between BCL’s move and click, the click lands in the wrong place.
- Moving the cursor to a background window may trigger focus changes, tooltips, or hover effects that alter the UI state.
- Mouse movement takes time (even with
--fastmode), and during that time the UI can change.
BCL’s solution is AXPress-first: when bcl click N is called, BCL first attempts to invoke the AXPress action on the target element’s AXUIElement reference. AXPress is a direct programmatic trigger — equivalent to the user clicking the element — but it requires no mouse movement and works on background windows. The element receives the click event through the Accessibility framework, not through the event system that processes physical mouse input.
If AXPress fails (some elements do not support it, particularly in web content rendered by WebKit or Blink), BCL falls back to coordinate-based mouse simulation. The --fast flag makes this fallback instantaneous (teleport the cursor) rather than smooth (animate the cursor path). The --no-ax flag skips AXPress entirely for applications where AX is unreliable (common with some Electron apps).
This hierarchy — AXPress first, fast mouse second, smooth mouse last — means that most interactions in Safari and native apps happen without any visible mouse movement, which is critical when a human is also using the Mac.
Window Isolation for Multi-Agent Use
When 20 agents share one Mac, each agent’s browser automation must target a specific window without interfering with other agents’ windows. BCL solves this with window indexing:
# Step 1: Inventory all Safari windows with their tab URLs
bcl windows --app Safari --json
# Returns: [{index: 0, title: "Stripe Dashboard", current_tab_url: "https://dashboard.stripe.com", ...}, ...]
# Step 2: Target a specific window by index
bcl discover --app Safari --window 0 --json
# Step 3: All subsequent actions on that window
bcl click 5 --window 0
bcl type 3 'my-api-key' --window 0
The --window N flag ensures that discover walks the AX tree for window N specifically, and click/type target elements within that window’s coordinate space. Without --window, BCL defaults to the frontmost window, which is unreliable when multiple agents are active.
The coordination protocol assigns one Safari window index per in-flight task. The assigned index is recorded in the task management system (BridgeMind) so other agents know not to use that window. When a task completes, the window lease is released.
For URL navigation within a leased window, bcl navigate 'https://example.com' --window N uses AppleScript to set the URL on window N’s current tab without affecting other windows.
The Mutex Protocol: Preventing Cursor Conflicts
Even with AXPress-first clicks and window isolation, some operations require the physical mouse or OS-level focus: CAPTCHA challenges, drag-and-drop, certain modal dialogs, and payment card entry. For these foreground-only operations, BCL uses a mutex protocol:
- CLAIM: The agent announces
BCL LOCK CLAIMEDto all agents with a task ID and estimated duration. - WAIT: Other agents hold their BCL requests for 90 seconds or until a
BCL LOCK RELEASEDmessage. - RUN: The claiming agent executes its BCL operations using
--fastand--activateas needed. - RELEASE: The agent announces
BCL LOCK RELEASEDimmediately after completing its operations.
In practice, one designated agent (called BC1, typically Builder 3 in our swarm) is the sole executor of BCL commands. All other agents queue requests through the message-passing system. BC1 processes the queue serially, interleaving background-safe operations (AXPress clicks on different windows) with foreground-only operations (CAPTCHAs, modal dialogs).
This serialization is a deliberate tradeoff: it limits BCL throughput to what one agent can process, but it eliminates an entire class of cursor-conflict bugs that plagued earlier multi-agent attempts.
Production Results and Failure Modes
BCL has been used in production swarm sessions to:
- Create accounts on Neon, fal.ai, Stripe, and Cloudflare
- Verify email addresses via Gmail in Safari
- Complete Cloudflare Turnstile challenges
- Configure Vercel environment variables through the dashboard
- Navigate Google Cloud Console for OAuth credential setup
Known failure modes and their mitigations:
- Safari autofill interference: Safari’s password autofill can hijack login forms, typing saved credentials into fields the agent is trying to fill. Mitigation: use private browsing windows for new account creation.
- AX tree staleness: After page navigation, the AX tree from a previous
discoveris invalid. Mitigation: always re-discover after any navigation or significant UI change. - Turnstile and reCAPTCHA: These challenges often require foreground focus and sometimes defeat AXPress. Mitigation: use
--activatefor the specific challenge step, then return to background mode. - Element index instability: Indices change between
discovercalls as the page renders or scrolls. Mitigation: re-discover before every interaction; never cache indices across navigation boundaries.
Comparison with Alternative Approaches
| Approach | Works on any app? | Background? | Multi-agent safe? | Setup complexity |
|---|---|---|---|---|
| Selenium/Puppeteer | Browser only | No (own process) | Yes (separate instances) | Medium |
| AppleScript | macOS only | Yes | No (global scope) | Low |
| Computer Use API (Anthropic) | Screenshot-based | No | No | Low |
| BCL (this work) | Any macOS app | Yes (AXPress) | Yes (mutex + window isolation) | Medium |
BCL’s advantage is the combination of background operation, cross-application support, and multi-agent coordination. Its disadvantage is macOS-only support and reliance on applications exposing a reasonable AX tree.
Conclusion
Browser Control Layer demonstrates that the macOS Accessibility API is a viable substrate for AI agent computer control, particularly when multiple agents share a physical machine. The AXPress-first strategy avoids the fragility of mouse simulation, window isolation prevents cross-agent interference, and the mutex protocol handles the remaining foreground-only operations. The tool has been used in production to create vendor accounts, configure deployments, and verify authentication flows — tasks that previously required human intervention in an otherwise autonomous agent pipeline.
The source code is available at https://github.com/buildngrowsv under the WebMCP-Control-Layer-Research repository.