Apr 21, 2026
The demos look remarkable. An AI agent opens a browser, navigates a website, fills out a form, and books a flight, all without a human touching the keyboard. Over the past several months, a wave of computer-use products has arrived, promising exactly this kind of hands-free autonomy: Perplexity's computer-use offering, OpenAI's Operator, and a growing roster of competitors, each positioning itself as the agent layer that will finally make autonomous computing practical at scale. The pitch is consistent: point the agent at a task, let it work, come back to a result.

What the press releases don't cover is what happens in between. Neel Somani, a researcher and former quantitative analyst, has built production-grade agent tools, including adcock-agent, a computer-use agent framework, and web2mcp, a system for converting web interfaces into callable tools that agents can use reliably. He has spent considerable time on the less glamorous side of this problem. His assessment is blunt: the technology works often enough to be exciting and fails often enough to be dangerous, and the industry is moving fast without resolving the foundational engineering questions that will determine whether these systems can actually be trusted.

The Multi-Step Reliability Problem

The first and most fundamental issue is reliability across sequential tasks. Individual steps in a computer-use workflow (click here, type this, read that) work with reasonable consistency. The problem compounds as those steps chain together: a ten-step task where each step succeeds 90 percent of the time completes successfully less than 35 percent of the time. In practice, individual steps on real-world websites rarely hit even that mark. Pages change layout between visits. Content loads in unpredictable order. Pop-ups appear unexpectedly. A button that was present on the last run has moved, been replaced, or is temporarily disabled.
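The compounding math above is easy to verify directly: if each step succeeds independently, the workflow's success probability is the per-step rate raised to the number of steps.

```python
# Success probability of a sequential workflow: every step must succeed.
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` independent steps succeed."""
    return per_step ** steps

print(round(chain_success(0.90, 10), 3))  # 0.349 -- under 35 percent
print(round(chain_success(0.99, 10), 3))  # 0.904 -- still fails roughly 1 run in 10
```

Even 99 percent per-step reliability leaves a ten-step workflow failing about one run in ten, which is why chaining steps is where the demos and the deployments diverge.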
Human users navigate these disruptions automatically, drawing on a general sense of what a page is trying to accomplish. Current computer-use agents do not reliably adapt across these variations. They either fail silently, continuing forward with a subtly wrong state, or stop and surface an error that requires human intervention to resolve. Neither outcome scales.

Somani's work on web2mcp reflects one thoughtful response to this problem. Rather than having an agent interpret a live screen every time it needs to take an action, web2mcp first uses a computer-use model to scan a web app automatically, identify everything that can be clicked or interacted with, and map out the full tree of possible actions. From that map, it generates a clean, structured interface that a downstream agent can call directly. Agents interacting with a web app through web2mcp work from a defined map rather than navigating by sight on each run. When something goes wrong, it goes wrong in a way that is visible and recoverable, rather than silently, mid-task, three steps before the finish line.

Irreversible Actions and the Absent Undo Button

Reliability failures are frustrating. Irreversibility failures are a different category of problem entirely. Computer-use agents, by design, can take real actions in the world: submitting forms, sending emails, making purchases, modifying files, triggering processes in business software. Many of these actions cannot be undone. An email, once sent, cannot be recalled. A form submission that kicks off a process in a business system may set off a chain of effects that ripples through multiple systems before anyone notices something went wrong.

Current agent frameworks do not have a mature answer to this problem. Some add confirmation prompts before high-stakes actions, but these are typically hand-coded rules rather than the product of any principled understanding of which actions are reversible and which are not.
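The missing layer the article describes, one that classifies actions by reversibility and gates them accordingly, can be sketched in a few lines. Everything here (the effect classes, the policy table, the `gate` function) is a hypothetical illustration of the idea, not any shipping framework's API.

```python
from enum import Enum

class Effect(Enum):
    READ = "read"                   # no state change
    REVERSIBLE = "reversible"       # a change with a known undo
    IRREVERSIBLE = "irreversible"   # no undo: send, delete, purchase

# Hypothetical policy: what each effect class requires before executing.
POLICY = {
    Effect.READ: "auto",
    Effect.REVERSIBLE: "auto",
    Effect.IRREVERSIBLE: "human_approval",
}

def gate(action: str, effect: Effect, approved: bool = False) -> str:
    """Decide whether an agent action may proceed, based on its effect class."""
    if POLICY[effect] == "auto":
        return f"execute: {action}"
    if approved:
        return f"execute (approved): {action}"
    return f"blocked pending approval: {action}"

print(gate("fetch inbox", Effect.READ))          # execute: fetch inbox
print(gate("send email", Effect.IRREVERSIBLE))   # blocked pending approval: send email
```

The hard part is not the gate itself but the classification: deciding, reliably and per system, which actions belong in which tier, including downstream cascades the agent cannot see.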
An agent may confidently delete a file it has judged to be a duplicate without recognizing that the deletion triggers a cascade in a connected system. The confirmation prompt, if it exists at all, may have been designed without knowledge of that downstream effect. What is missing is a layer that can meaningfully distinguish between reading information, making a change that can be undone, and taking an action that cannot, and that enforces different levels of human approval accordingly.

This is an engineering problem, not an intelligence problem. The ability to act must be paired with a governance framework that distinguishes which actions can proceed automatically, which require human sign-off, and which should be blocked until someone explicitly says otherwise. Most deployed systems today treat all of these identically. For organizations running agents against internal systems, this is not a theoretical concern. It is an operational liability with real legal and financial exposure.

The Audit Trail Is Mostly Missing

Ask a computer-use agent what it did, in detail, and the answer is usually unsatisfying. Current agent tools generate logs of varying quality, but clear, searchable records that capture the full context for each action (what the agent saw, what it considered, what it chose, and why) are the exception rather than the rule. When an agent completes a task successfully, this rarely matters. When it does something unexpected or gets something wrong midway through a workflow, the absence of a clear record makes investigating what happened extremely difficult. This matters for any organization deploying agents in regulated industries, in customer-facing workflows, or anywhere that someone will eventually need to answer a straightforward question: what did the system do, and why?
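The kind of record that answers that question is an append-only, structured log of every agent cycle. A minimal sketch follows; the schema and function name are illustrative, not adcock-agent's actual format.

```python
import json
import time

def log_cycle(log_path, observation, decision, action, duration_ms):
    """Append one agent cycle as a JSON line so runs can be queried later."""
    record = {
        "ts": time.time(),
        "observation": observation,  # what the agent saw
        "decision": decision,        # what it chose, and why
        "action": action,            # what it actually did
        "duration_ms": duration_ms,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Usage: one line per cycle, greppable and queryable after the fact.
log_cycle("run.jsonl", "login form visible", "fill credentials first", "type #user", 120)
```

One JSON object per line (the JSON Lines convention) keeps the log trivially appendable and lets ordinary tooling reconstruct any run without replaying it.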
Somani has noted that the visibility gap mirrors a broader problem in how AI tools are built: systems are typically designed around the success case and poorly instrumented for the failure case. Building agent infrastructure that can be investigated after the fact requires treating record-keeping as a core requirement from the beginning, not something to add once something breaks.

Somani's adcock-agent, a DOM-based browser agent he describes as deterministic and traceable, was built with this requirement at its core. Every cycle the agent runs is logged in full: what the page looked like at that moment, what the agent decided to do, how it did it, and how long it took. That record is written to structured JSON logs, meaning the complete history of any run can be queried and reviewed after the fact rather than reconstructed from memory.

The more novel part of adcock-agent is its domain-specific language (DSL) for browser-based actions. When the agent outputs an action in the DSL, a compiler attempts to convert that action into a combination of browser interactions and code generation. If the DSL output is malformed, a clear error is produced. This paradigm of surfacing clear, correctable errors has become popular in recent agent designs.

Permission Models That Don't Exist Yet

Perhaps the most overlooked gap in current computer-use tools is the absence of meaningful controls over what an agent is actually allowed to do. Traditional software access controls are built around straightforward rules: a given user can perform a given action on a given resource, or they cannot. Computer-use agents do not fit this model cleanly. An agent acts on behalf of a user, with access to that user's accounts and credentials, often across a much wider range of systems than any specific task actually requires.
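One way to narrow that gap is to scope access to the task rather than the session, with a kill switch for mid-task revocation. The sketch below is a hypothetical illustration of the principle; the class and action names are invented for this example.

```python
# Hypothetical task-scoped grant: access is defined per task, not per session,
# and can be revoked while the agent is still running.
class TaskGrant:
    def __init__(self, task_id, allowed_actions):
        self.task_id = task_id
        self.allowed = set(allowed_actions)
        self.revoked = False

    def check(self, action):
        """Return True only if this specific task may perform `action`."""
        return not self.revoked and action in self.allowed

    def revoke(self):
        """Kill switch: cut off the agent's access mid-task."""
        self.revoked = True

grant = TaskGrant("update-crm-record", {"crm.read", "crm.update_record"})
print(grant.check("crm.update_record"))  # True
print(grant.check("crm.delete_record"))  # False: outside this task's scope
grant.revoke()
print(grant.check("crm.read"))           # False: access cut mid-task
```

In a real deployment the grant would be backed by the organization's existing access control system, with separate credentials per concurrent workflow; the point here is only the shape of the check.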
The basic principle of giving software only the access it needs for a specific job is difficult to enforce when the agent's scope is defined loosely and changes from task to task. In practice, most agents today inherit broad access by default and operate without meaningful limits. An agent authorized to help manage email has access to the entire inbox. An agent authorized to update a record in a business system can often do considerably more than that. The gap between what the agent is supposed to do and what it is technically capable of doing is wide, and there is currently no established framework for closing it at the level of an individual task.

Addressing this requires connecting agent tools to the access control systems that organizations already use, in ways that current platforms have not prioritized. It requires defining what an agent is allowed to do for a specific task rather than for an entire session, keeping credentials separate across concurrent agent workflows, and building the ability to cut off an agent's access mid-task if something goes wrong. None of this is technically beyond reach. All of it requires deliberate investment that the current wave of product launches has largely deferred.

What Builders and Organizations Should Be Thinking About Now

Somani's position is not that computer-use agents are not useful. They are already useful for a real class of tasks, and the technology will improve. His position is that the gap between what is being promised and what the underlying infrastructure currently supports is wide enough to create genuine operational risk for organizations that deploy these systems without understanding the constraints.

For builders, the practical conclusion is to treat agent reliability as a systems problem, not a model problem. Better AI will not, on its own, solve the record-keeping problem, the permission problem, or the irreversibility problem.
These require deliberate architectural decisions made early, not improvements to the underlying model. For organizational leaders evaluating where to deploy agents, the right questions are not about how well the system performs on a demo. They are about what happens when the agent fails, how quickly that failure gets detected, how much damage it can do before anyone notices, and whether the organization has the visibility to answer those questions after the fact.

Somani's broader argument, developed through direct experience building agent infrastructure rather than observing it from the outside, is that the organizations that navigate this moment well will be the ones that invest in operational foundations now rather than waiting for the market to provide them. Autonomy without accountability is not a feature. It is a liability. The demos will keep getting better. The hard work is just beginning.

The post Where Computer Use Agents Actually Break, According to Neel Somani appeared first on LA Weekly.