The Gate Test: Why Human-in-the-Loop Fails and How to Fix It
Why Human Oversight in AI Systems Fails
The Tesla’s forward-collision warning had been active for 2.3 seconds when the Model S struck the fire truck at 65 miles per hour. The driver’s hands were off the wheel. His eyes, the cabin camera would later show, were aimed at his phone. Autopilot had been engaged for six minutes. The system detected the obstacle, issued its warning, and waited for the human to take over. He never did.
Across several years of investigation, the National Highway Traffic Safety Administration documented more than 200 crashes following this pattern, with dozens of fatalities. In each case, the car’s own telemetry told the same story: adequate warning, adequate time, absent driver. The gate was everywhere on paper (continuous human attention required) and nowhere in practice.
And the pattern extends well beyond automotive. Klarna reversed its 700-FTE customer service automation after fifteen months of apparent success. Air Canada was sued when its chatbot hallucinated a bereavement-fare refund policy. McDonald’s ended its IBM voice-ordering pilot after order accuracy stalled in the low 80s, well below the 95% threshold franchisees needed. Each failure had its own technical explanation, and each shared the same underlying architecture: systems that requested human attention without ensuring it, or that acted on customers without anyone seeing what they said until complaints arrived.
The Design Decision That Matters
Debates about AI agents tend toward a binary: autonomous systems versus human oversight. The deployments that work ask a different question: at which step does human judgment actually change outcomes, and what would it take for a human at that step to actually exercise that judgment?
Two deployments from 2024-2025 show what this looks like in practice. In July 2025, Allianz Australia launched Project Nemo for processing food-spoilage claims, cutting resolution time from four days to hours. In February 2024, Klarna announced that its OpenAI-built customer service agent had handled the workload of 700 full-time employees in its first month, dropping resolution time from eleven minutes to two.
Allianz automated the upstream stages of the claim: document intake, policy matching, fraud detection, and summarization all run without human involvement. A human claims professional enters only at the payout decision. The system architecture enforces this gate; the payout cannot proceed without explicit human approval, and there is no checkbox to click through. Each agent’s reasoning is logged and visible to the approver before release. If an agent flags incorrect coverage or misreads a receipt, the claim is held before money moves, not clawed back after. Leadership sees what the system decided before it costs money, before a complaint arrives, before a pattern of errors surfaces in churn data three quarters later.
The human at the payout step is not reviewing every document or re-running every fraud check. The human is making one decision (pay or escalate) with full visibility into how the system reached its recommendation. The cognitive load is sustainable because the scope is narrow: judgment on the irreversible action, not surveillance of the entire pipeline.
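In code terms, the gate is roughly the sketch below. The names (PayoutGate, ClaimRecommendation, send_payment) are illustrative, not Allianz’s implementation; the structural point is that the only path to the irreversible action runs through an explicit, attributable human approval, and that the approver sees the agents’ logged reasoning before deciding.

```python
# A minimal sketch of an enforced payout gate. All names are illustrative
# (PayoutGate, ClaimRecommendation, send_payment), not Allianz's system.
from dataclasses import dataclass
from datetime import datetime, timezone


def send_payment(claim_id: str, amount: float) -> None:
    # Stand-in for the real payment system call.
    print(f"paid {amount} for claim {claim_id}")


@dataclass
class ClaimRecommendation:
    claim_id: str
    payout_amount: float
    # The upstream agents' logged reasoning, shown to the approver
    # before any money moves.
    agent_reasoning: list[str]


@dataclass
class Approval:
    approver_id: str
    decision: str          # "pay" or "escalate"
    approved_at: datetime


class PayoutGate:
    """The only code path to the payout requires a recorded human approval."""

    def __init__(self) -> None:
        self._approvals: dict[str, Approval] = {}

    def present_for_review(self, rec: ClaimRecommendation) -> dict:
        # Visibility: the approver sees how the recommendation was reached,
        # not just a pay/deny button.
        return {
            "claim_id": rec.claim_id,
            "payout_amount": rec.payout_amount,
            "reasoning": rec.agent_reasoning,
        }

    def record_decision(self, rec: ClaimRecommendation,
                        approver_id: str, decision: str) -> None:
        if decision not in ("pay", "escalate"):
            raise ValueError("decision must be 'pay' or 'escalate'")
        self._approvals[rec.claim_id] = Approval(
            approver_id, decision, datetime.now(timezone.utc))

    def execute_payout(self, rec: ClaimRecommendation) -> None:
        approval = self._approvals.get(rec.claim_id)
        # Enforcement: no approval, no payout. There is no flag that
        # bypasses this check.
        if approval is None or approval.decision != "pay":
            raise PermissionError(
                f"claim {rec.claim_id} is held: no human approval to pay")
        send_payment(rec.claim_id, rec.payout_amount)
```

The approval here is a recorded object the payout path requires, not a checkbox a reviewer can click past; remove the record and the payout simply cannot execute.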
Klarna took the opposite approach, automating everything, including the judgment calls on hard cases. The company measured throughput, cost, and resolution time. It did not measure resolution quality on the interactions where customer relationships are won or lost: complaints that required empathy, billing disputes that required context the AI lacked, frustrated customers who needed a human voice. For fifteen months dashboards showed green, but customer churn data, which lagged by quarters, eventually revealed the cost.
It turned out the layoffs had removed not just labor cost but institutional knowledge about which situations required human judgment. The 700 FTEs Klarna displaced were not just processing transactions; they were catching edge cases, pattern-matching on ambiguous situations, and maintaining a quality floor that the metrics did not directly capture. In May 2025, CEO Sebastian Siemiatkowski told Bloomberg they were rehiring: “As cost unfortunately seems to have been a too predominant evaluation factor, what you end up having is lower quality.”
Allianz placed a human at the irreversible step with enforcement and visibility. Klarna removed humans entirely and measured the wrong signals.
Why Supervisors Fail
Human approval at machine speed tends toward rubber-stamping. Human-factors researchers have documented this dynamic for decades, beginning with Norman Mackworth’s 1948 radar vigilance studies at Cambridge. Mackworth found that operators monitoring radar screens for rare signals showed significant attention decay within thirty minutes. The finding has replicated across every domain where humans monitor automated systems: when the system almost never fails, the human monitoring it gradually stops paying the kind of attention that would catch a failure.
The mechanism behind this is straightforward. When a system succeeds repeatedly, the human monitoring it builds an expectation of continued success. Each successful cycle reinforces the expectation and reduces the cognitive resources allocated to monitoring. The operator’s understanding of what the system is actually doing drifts further from reality. When the system finally fails, the operator lacks current context, must re-acquire situational awareness under time pressure, and intervenes (if at all) with degraded judgment. Mica Endsley’s research on situation awareness established this pattern across aviation, process control, and medical monitoring: operators asked to supervise reliable automation become least prepared to intervene at the moment intervention matters.
Mary Cummings’s research on autonomous vehicle supervision at Duke showed that the problem compounds at scale. A single operator monitoring multiple autonomous units degrades measurably as the load increases, with error rates climbing past acceptable thresholds. Her work established that automating subtasks does not free cognitive capacity for higher-level oversight. It shifts load from doing to watching, and watching near-perfection requires continuous expenditure of attention against a baseline of expected success. This is expensive cognitive work that humans struggle to sustain.
A night-shift air traffic controller with three aircraft on scope at 2 AM, every handoff proceeding normally, faces this problem acutely. The situation requires sustained vigilance, but the environment provides no stimulation to maintain it. Minute by minute, the controller’s picture of what’s actually on scope decays, precisely because nothing is going wrong. The same dynamic appears in any monitoring context: security operations centers, industrial process control, content moderation queues, and supervisory dashboards for AI agents.
This is what played out in the Tesla crash, where the driver looking at his phone was responding to a system that had succeeded for six minutes, and six minutes before that, and six minutes before that. The design asked for vigilant monitoring of near-perfection, on every trip, across an entire fleet of drivers. Vigilance researchers have been documenting this pattern since the 1940s.
Two Architectures, Same Failure
Tesla and Air Canada both deployed systems that acted without enforcement at the gate or visibility into what they were doing. In Tesla’s case the consequences were fatal. In Air Canada’s, the damage was an $812 small-claims judgment.
MIT AgeLab researchers Bryan Reimer and Bruce Mehler measured this disengagement directly, comparing driver behavior with and without Autopilot engaged. Driver glances away from the road increased 18% the moment Autopilot activated, and hands-free driving rose 32%. The shift in attention was immediate, and in fatal crashes NTSB investigations documented drivers with no steering input for the final seven seconds before impact. Tesla’s system required continuous driver attention as a matter of policy, but the architecture included no mechanism to ensure that attention was present. A steering-wheel torque sensor provided only a weak proxy for attention; eye-tracking with graduated lockout would have provided strong enforcement. Tesla went with the torque sensor.
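The difference between the two mechanisms is easy to make concrete. A torque sensor asks whether the wheel was nudged recently, which a distracted driver can satisfy without looking up; a driver-monitoring camera with graduated lockout ties escalating consequences to measured inattention. The sketch below is purely illustrative: the thresholds and action names are hypothetical, not Tesla’s or any vendor’s policy.

```python
# Hypothetical graduated-lockout policy driven by a driver-monitoring camera.
# Thresholds and action names are illustrative, not any vendor's actual values.
def enforcement_action(eyes_off_road_seconds: float) -> str:
    """Map measured inattention to an escalating consequence."""
    if eyes_off_road_seconds < 2:
        return "none"
    if eyes_off_road_seconds < 4:
        return "visual_alert"
    if eyes_off_road_seconds < 7:
        return "audible_alert"
    if eyes_off_road_seconds < 12:
        return "begin_controlled_slowdown"
    return "disable_assist_until_parked"   # lockout: the strong form of enforcement


def torque_sensor_check(wheel_nudged_recently: bool) -> str:
    # The weak form: a request for attention that a driver can satisfy
    # while looking at a phone.
    return "none" if wheel_nudged_recently else "visual_alert"
```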
Air Canada’s chatbot operated at a different layer with similar results. In 2022, Jake Moffatt of British Columbia asked the airline’s website chatbot about bereavement fares after his grandmother died. The chatbot told him he could book at full price and apply for the bereavement discount retroactively within ninety days. The actual policy prohibited retroactive claims. Nobody at Air Canada caught the error.
When Moffatt sued, Air Canada argued the chatbot was “a separate legal entity responsible for its own actions.” The British Columbia Civil Resolution Tribunal rejected this defense in its February 2024 decision. Tribunal member Christopher Rivers wrote: “It should be obvious to Air Canada that it is responsible for all the information on its website.”
The damages were $812 Canadian, but the case established that companies bear responsibility for their AI agents’ statements. Air Canada learned what its chatbot had been telling customers only when the lawsuit arrived. There was no audit chain reaching the customer-facing surface, no sampling of chatbot responses against policy, no visibility into the gap between what the system said and what the company’s policies allowed.
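The missing control is not exotic. A sampling audit that reads a slice of live chatbot replies and routes anything that states a policy to human review would have surfaced the bereavement-fare error long before a tribunal did. The sketch below is a hypothetical illustration; the function and field names are not Air Canada’s tooling.

```python
# Hypothetical sampling audit over live chatbot transcripts. Field and
# function names are illustrative, not Air Canada's tooling.
import random

POLICY_KEYWORDS = ("refund", "bereavement", "fare", "retroactive", "90 days")


def select_for_review(transcripts: list[dict], sample_rate: float = 0.05) -> list[dict]:
    """Return the transcripts a human should read this cycle."""
    flagged = []
    for t in transcripts:
        states_policy = any(k in t["bot_reply"].lower() for k in POLICY_KEYWORDS)
        randomly_sampled = random.random() < sample_rate
        # Anything that states a policy gets reviewed; a random slice of the
        # rest catches the errors the keyword list misses.
        if states_policy or randomly_sampled:
            flagged.append(t)
    return flagged
```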
Tesla lacked enforcement at the gate, Air Canada lacked visibility into the action surface, and Allianz built both into the architecture.
The Question That Remains
The productive question is where humans should be placed, doing what, and under what conditions they can actually exercise judgment rather than perform the appearance of oversight.
The Gate Test is built around that question, and it has two components. First, enforcement at the step: a mechanism that ensures attention, not a policy that requests it. Steering wheel torque sensors request attention; eye-tracking interlocks enforce it. Checkbox approvals request review; system architecture that blocks action until review is complete enforces it. Second, visibility before the cost: an audit chain that reaches the action surface before errors compound. If leadership first reads what the agent said in a lawsuit attachment, the visibility property is already missing.
Place the gate on the irreversible step with both properties, at a load humans can sustain, and the workflow has a structural basis for human oversight. Remove either property and human presence becomes ceremonial.
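Written down, the test is small; the hard part is answering its questions honestly. The sketch below restates the two properties and the load condition as a checklist, with hypothetical names; the booleans are what an architecture review has to establish, not something code can infer on its own.

```python
# The Gate Test as a checklist. The field names are illustrative; filling
# them in honestly is the work of an architecture review.
from dataclasses import dataclass


@dataclass
class GateAtStep:
    attention_enforced: bool       # architecture blocks the action until review happens
    audit_reaches_surface: bool    # humans see what the system did before it costs money
    load_is_sustainable: bool      # review volume a person can actually keep up with


def oversight_is_structural(gate: GateAtStep) -> bool:
    """Ceremonial unless enforcement, visibility, and sustainable load all hold."""
    return (gate.attention_enforced
            and gate.audit_reaches_surface
            and gate.load_is_sustainable)
```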
This is why human-in-the-loop is a design choice. Adding a human adds nothing if they can’t understand what the system did, can’t step in before the damage compounds, or can’t keep up with the volume. The relevant question is not “should there be a human?” but “can this human, at this step, with this information, actually change outcomes?”
When the answer is no, the human is not performing oversight. The human is absorbing liability, present to take blame when the system fails rather than to prevent the failure. The EU AI Act’s Article 14 now requires that high-risk AI systems be designed so humans can “effectively oversee” them, with specific provisions against automation bias. Regulators have recognized what the deployment failures demonstrated: the label “human-in-the-loop” means nothing without the architecture to make human judgment effective.
That’s why, in building Jitera, we designed context that works like memory, preserving what happened and why across sessions, and team-in-the-loop architecture that distributes oversight across roles instead of concentrating it in a single supervisor.
Where you place the human determines whether they act or watch.