Anthropic's Frontier Red Team had Opus 4.7 rerun last year's robot-dog experiment with no human help. On every task a human team finished a year ago, it was at least 10x faster than the fastest human team, and roughly 20x faster across all five tasks combined (12:07 against Team Claude's 264 minutes). Yet it still couldn't manage the real-time control of nudging the ball back home. Together, those two facts mark where the capability frontier sits for LLMs driving physical hardware.
There are two layers of "phase" under the name "Project Fetch," and they are easy to conflate. Separate them before reading on.
The first layer is the project's two rounds of research. Round one ran in August 2025 and was published in November — a human-vs-human controlled experiment. Round two, published June 2026, is titled Project Fetch: Phase two, and has the model run fully autonomously.
The second layer is the three escalating stages inside the round-one experiment, which the original post also calls Phase One / Two / Three: warm up with the controller, then put it down and write your own program to reach the sensors, then have the robodog fetch the ball autonomously.
This article is about the project's second round. Wherever it refers to those three internal stages, it says so explicitly, and never mixes them with the project-level "round two."
Round one asked whether frontier models could reach past the screen and act on the physical world, with robots as one path. The team used an uplift study: split people at random into a group with AI and a group without, and measure the gap in task performance — that gap is the "uplift." It's the same method they've used extensively in their work on biological risk.
They recruited 8 Anthropic researchers with no robotics background, split four-and-four into Team Claude and Team Claude-less, and had each team program an off-the-shelf quadruped (the post calls it a robodog) to fetch a beach ball. The finding was substantial uplift: on tasks both teams completed, Team Claude took about half the time of the other; Team Claude completed 7 of 8 tasks to Team Claude-less's 6 of 8; and only Team Claude made real progress toward the final goal of fully autonomous retrieval.
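The "about half the time" figure can be sanity-checked against the per-task times tabulated later in this article. The sketch below is an illustration, not the team's analysis: expressing uplift as a ratio of total times is an assumption made here, and the names `TIMES_MIN` and `uplift_ratio` are hypothetical.

```python
# Minutes per task, copied from the comparison table later in this article.
# Only the four tasks that both human teams completed are included.
TIMES_MIN = {
    "video camera": {"claude_less": 165, "team_claude": 64},
    "lidar sensor": {"claude_less": 154, "team_claude": 35},
    "control program": {"claude_less": 15, "team_claude": 40},
    "localize & plot path": {"claude_less": 27, "team_claude": 42},
}


def uplift_ratio(times):
    """Total time without AI divided by total time with AI (assumed metric)."""
    without_ai = sum(t["claude_less"] for t in times.values())
    with_ai = sum(t["team_claude"] for t in times.values())
    return without_ai / with_ai


print(f"uplift ratio: {uplift_ratio(TIMES_MIN):.2f}x")  # 361 / 181 -> 1.99x
```

One caveat carried over from the table: Team Claude-less was given a hint on the video task, so its 165 minutes may understate the gap.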
The team also ran a baseline check: give the tasks to the strongest model of the day, Claude Opus 4.1, and see whether it could complete them unaided. It plainly could not. Like the team without Claude, it got stuck on the very first step of figuring out how to connect to the robot. That "4.1 can't do it alone" result is the baseline for reading what round two means.
Round two swapped in Claude Opus 4.7, removed the human operator, and asked whether the model could finish round one's tasks on its own.
The physical-controller step couldn't be given to the model, so the test covered the remaining subset of tasks that can be done by writing code. The model ran in Claude Code with adaptive thinking and effort set to maximum, three trials per objective. The researcher's role was pared to the minimum: plug a laptop running Claude Code into the robodog, enter the initial prompt, approve commands, and approve each move to the next task.
The headline finding, in the team's own words: "on every task that was completed by at least one human team in August, Opus 4.7 completed the same task at least ten times faster."
Task by task. The table below puts both human teams and Opus 4.7 side by side on each task. These five tasks map onto Phase 2 (programmatic control) and Phase 3 (autonomous operation), two of the stages inside the round-one experiment:
| Task (round-one stage) | Team Claude-less | Team Claude | Opus 4.7 (min:sec) |
|---|---|---|---|
| Connect to video camera (Phase 2) | 165 min* | 64 min | 5:57 |
| Connect to lidar sensor (Phase 2) | 154 min | 35 min | 0:56 |
| Write control program (Phase 2) | 15 min | 40 min | 1:07 |
| Localize & plot path (Phase 3) | 27 min | 42 min | 1:34 |
| Detect beach ball (Phase 3) | Did not finish | 83 min | 2:32 |
| All five tasks | — | 264 min | 12:07 |
Opus times are the mean of three trials (min:sec). Deltas are measured against the faster human team on each task, counting only tasks attempted and completed in both rounds. *Team Claude-less was given a hint to finish the video task.
One detail worth noting: on writing the control program, Team Claude-less (15 min) was actually faster than Team Claude (40 min). Humans were genuinely quicker on some sub-tasks, because the team with Claude sometimes took a detour, trying more approaches in parallel and writing more code.
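Sizing the gap. Looking only at the four tasks both teams completed (video, lidar, control program, localization) and summing the table's times, Opus 4.7 needed about nine and a half minutes in total, against 181 minutes for Team Claude and 361 for Team Claude-less. That is a gap of more than an order of magnitude.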
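The per-task ratios can be recomputed directly from the table. The following is a minimal sketch for checking the arithmetic, not the team's own analysis; `ROWS`, `to_seconds`, and `speedup` are hypothetical names, and the speedup is measured against the faster human team that finished each task, as the table notes describe.

```python
def to_seconds(mmss: str) -> int:
    """Convert a 'min:sec' string such as '5:57' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# (task, Team Claude-less minutes or None if unfinished,
#  Team Claude minutes, Opus 4.7 mean time as min:sec)
ROWS = [
    ("video camera", 165, 64, "5:57"),
    ("lidar sensor", 154, 35, "0:56"),
    ("control program", 15, 40, "1:07"),
    ("localize & plot path", 27, 42, "1:34"),
    ("detect beach ball", None, 83, "2:32"),
]


def speedup(row) -> float:
    """Fastest finishing human team's time divided by the Opus 4.7 time."""
    _, claude_less, team_claude, opus = row
    finished = [m for m in (claude_less, team_claude) if m is not None]
    return min(finished) * 60 / to_seconds(opus)


for row in ROWS:
    print(f"{row[0]:<22} {speedup(row):5.1f}x")

# Totals over the four tasks that both human teams completed.
both = [r for r in ROWS if r[1] is not None]
opus_total = sum(to_seconds(r[3]) for r in both)
print(f"Team Claude-less total: {sum(r[1] for r in both) * 60 / opus_total:.1f}x")
print(f"Team Claude total:      {sum(r[2] for r in both) * 60 / opus_total:.1f}x")
```

Every per-task ratio comes out above 10, consistent with the team's "at least ten times faster" statement. The smallest is the video task (about 10.8x) and the largest is lidar (37.5x).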
Faster and leaner. Code volume is just as lopsided: Opus 4.7 was as successful as both human teams, or more, while writing nearly ten times less code than Team Claude.
There were quality differences too. Where humans struggled to choose between ways of interfacing with the robodog's sensors, Opus 4.7 quickly identified the best path; much of its code worked on the first try, which neither human team managed in round one. It was not flawless: it defaulted to an outdated object-detection algorithm, but still worked around it and arrived at an effective solution.
The team stresses, as in round one, that this progress is not the result of any concerted effort to improve the models' robotics ability. It emerged from much more general scaling — like so many other capabilities in LLM history — with no one training it specifically to drive a robodog.
What the model couldn't do is exactly the "fetch" in Project Fetch: precisely nudging the beach ball back to the start.
The team is specific about it. With their hands and a little practice, humans could pilot the robodog to gently knock the ball back to the patch of fake grass it started on. That takes a fast closed loop: sense whether the ball has drifted off course, relate that error to the last command, locate where the ball is now, and adjust the next input to move it more precisely.
This is a kind of closed loop at which people excel (at least after making some mistakes and learning from them). Opus 4.7 struggled to capture that subtlety. Like the round-one humans who reached this stage, it could move the robot behind the ball and position it to knock the ball home, but the control was poor, and, like those participants, it did not succeed. (Anthropic Frontier Red Team, Project Fetch: Phase two)
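To make the difference between planning once and correcting continuously concrete, here is a toy one-dimensional sketch. It is not the team's code and does not model the robodog: the ball is a single number (its distance from the target), `drift` stands in for any unmodelled disturbance, and the function names, the proportional gain, and all values are assumptions chosen for illustration. It covers only the sense-and-adjust part of the loop described above.

```python
def open_loop(start: float, drift: float, steps: int) -> float:
    """Plan every nudge up front and never look at the ball again."""
    nudge = -start / steps          # equal nudges that would cancel `start`
    offset = start
    for _ in range(steps):
        offset += nudge + drift     # unmodelled drift accumulates unnoticed
    return offset


def closed_loop(start: float, drift: float, steps: int, gain: float = 0.5) -> float:
    """Re-measure the ball before every nudge (proportional feedback)."""
    offset = start
    for _ in range(steps):
        error = offset              # sense: distance from the target
        nudge = -gain * error       # adjust: correct a fraction of the error
        offset += nudge + drift     # act: the same drift applies
    return offset


print("open loop final offset:  ", round(open_loop(1.0, 0.05, 10), 3))
print("closed loop final offset:", round(closed_loop(1.0, 0.05, 10), 3))
```

With these toy values the open-loop plan ends about 0.5 from the target, while the feedback version settles near 0.1. The feedback version still leaves a residual error, which hints at why even the closed-loop version of the task needs tuning.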
There's a telling counterpoint: one round-one volunteer with more robotics experience did successfully write a program for autonomous fetching. On that basis, the team judges it very likely that the current generation of Claude could do the same with more time and additional scaffolding. What they'll be watching next is whether the model can finish this last step with the same speed and reliability it showed on everything else.
"With more time and additional scaffolding, we think it is very likely that current generations of Claude could do the same."
Put the two rounds together and a clear trajectory shows up.
The team planted this judgment back in the round-one post: "in AI, uplift often precedes autonomy. What models can help humans accomplish today, they can frequently do alone tomorrow." Round two cashes that in. Their own analogy is coding: programmers long ago stopped handing AI snippets to debug and instead hand it tasks and let the model write the code.
We are plausibly entering the early era of physical agentic AI. The team is careful to add: "This doesn't mean that LLMs have now solved robotics. Far from it." None of the tasks here touch the harder, low-level parts of robotic control. (Anthropic Frontier Red Team, Project Fetch: Phase two)
What the team thinks has changed: we seem much closer to a world where models can use off-the-shelf physical tools with relative ease, at least for limited purposes. They liken it to how AI came to use existing software-editing tools like string-replace on its way to more agentic coding.
The pattern they keep pointing to is a three-step arc: first models help humans, then humans help models, and finally models can largely do things themselves. In the original: "first, models are helpful to humans. Then, humans are helpful to models. Finally, models are largely able to do things themselves." They've seen it in cybersecurity, and now it's taking shape at the intersection of AI and the physical world. The thread also ties into their work monitoring the potential for AI to automate AI R&D — a capability threshold in Anthropic's Responsible Scaling Policy.
The key is to see that the tasks the model did well and the one it stalled on are not the same kind of task. The former — connecting sensors, choosing an interface, writing the ball-detection program, planning a path — are "think-it-through-once-and-produce-it" cognitive tasks. You can reason them out offline, write the code, and run it once to check. That's squarely LLM territory. "Nudging the ball precisely back to the start" is a different kind: continuous closed-loop control. It can't be done in one shot; it demands looking and micro-adjusting over and over against physical feedback. There are at least three layers of difficulty.
Humans can do it thanks to three things the model currently lacks: an internalized physical intuition and proprioception (the mapping from "how much force" to "how it moves" is wired into the body, no computation needed); a continuous, low-latency feedback loop (eyes on the ball, hands adjusting in real time, matched to the pace of the roll); and the ability to learn online over a few attempts (miss twice, get the feel on the third try — whereas a model has no such learn-as-you-go mechanism within a single task).
What determines an AI tool's usefulness is domain expertise, not coding proficiency. The person in round one who could write the autonomous-fetch program was the one with more robotics experience. What the fetch step lacks is robotics expertise, not coding ability, which is also why the team judges that more time and additional scaffolding would very likely close the gap.
Anthropic Frontier Red Team · 2026-06-18 · Michael Ilie, C. Daniel Freeman, Kevin K. Troy. The speed, code-volume, and strength/weakness data here come from this post and its three charts. It reports Opus 4.7 rather than the stronger Mythos Preview because 4.7 was the most advanced non-Mythos-class model available at the time of the experiment.
Anthropic Frontier Red Team · 2025-11-12 (experiment run August 2025). The round-one uplift study, the 7/8 vs 6/8 result, the Opus 4.1 baseline, and the "uplift precedes autonomy" judgment all come from this post. Section 05, "why real-time control is harder than writing code," is analysis grounded in control theory rather than a conclusion from the source, as noted inline. On scope: the experiment had only two teams over a single day and was a convenience sample — small in scale.