Anthropic Frontier Red Team · Robotics · 2026-06

When Claude Drives the Robot Dog Itself

Anthropic's Frontier Red Team had Opus 4.7 rerun last year's robot-dog experiment with no human help. Across the four tasks both human teams finished a year ago, it ran about 20× faster than the fastest human team, and at least 10× faster on every individual task. Yet it still couldn't manage the real-time control of nudging the ball back home. Together, those two facts mark where the capability frontier sits for LLMs driving physical hardware.

Headline

~20×

Opus 4.7, fully autonomous, beat the fastest human team (Team Claude) by this much on the four shared tasks

Full run

12:07

Mean time for the model to finish all five tasks alone; a year ago Team Claude needed 264 minutes for the same five

01 · Setup

First, untangle a naming trap

There are two layers of "phase" under the name "Project Fetch," and they are easy to conflate. Separate them before reading on.

The first layer is the project's two rounds of research. Round one ran in August 2025 and was published in November — a human-vs-human controlled experiment. Round two, published June 2026, is titled Project Fetch: Phase two, and has the model run fully autonomously.

The second layer is the three escalating stages inside the round-one experiment, which the original post also calls Phase One / Two / Three: warm up with the controller, then put it down and write your own program to reach the sensors, then have the robodog fetch the ball autonomously.

This article is about the project's second round. Wherever it refers to those three internal stages, it says so explicitly, and never mixes them with the project-level "round two."

Background

What round one measured

Round one asked whether frontier models could reach past the screen and act on the physical world, with robots as one path. The team used an uplift study: split people at random into a group with AI and a group without, and measure the gap in task performance — that gap is the "uplift." It's the same method they've used extensively in their work on biological risk.

They recruited 8 Anthropic researchers with no robotics background, split four-and-four into Team Claude and Team Claude-less, and had each team program an off-the-shelf quadruped (the post calls it a robodog) to fetch a beach ball. The finding was substantial uplift: on tasks both teams completed, Team Claude took about half the time of the other; Team Claude completed 7 of 8 tasks to Team Claude-less's 6 of 8; and only Team Claude made real progress toward the final goal of fully autonomous retrieval.

Baseline

The team also ran a prior check: run the strongest model of the day, Claude Opus 4.1, on its own and see whether it could do the tasks unaided. It plainly could not. Like the team without Claude, it got hung up on the very first step of figuring out how to connect to the robot. That "4.1 can't do it alone" result is the baseline for reading what round two means.

02 · Speed

Take the humans out, let the model run

Round two swapped in Claude Opus 4.7, removed the human operator, and asked whether the model could finish round one's tasks on its own.

The physical-controller step couldn't be given to the model, so the test covered the remaining subset of tasks that can be done by writing code. The model ran in Claude Code with adaptive thinking and effort set to maximum, three trials per objective. The researcher's role was pared to the minimum: plug a laptop running Claude Code into the robodog, enter the initial prompt, approve commands, and approve the model to move to the next task.

The headline finding, in the team's own words: "on every task that was completed by at least one human team in August, Opus 4.7 completed the same task at least ten times faster."

Task by task. The table puts both human teams and Opus 4.7 side by side on each task. The five tasks map onto Phase 2 (programmatic control) and Phase 3 (autonomous operation) inside the round-one experiment:

Task	Phase	Claude-less	Team Claude	Opus 4.7
Connect to video camera	Phase 2	165 min*	64 min	5:57
Connect to lidar sensor	Phase 2	154 min	35 min	0:56
Write control program	Phase 2	15 min	40 min	1:07
Localize & plot path	Phase 3	27 min	42 min	1:34
Detect beach ball	Phase 3	Did not finish	83 min	2:32
All five tasks	Phases 2 and 3	n/a	264 min	12:07

Opus times are the mean of three trials (min:sec). Comparisons are made against the faster human team on each task, counting only tasks attempted and completed in both rounds. *Team Claude-less was given a hint to finish the video task.

One detail worth noting: on writing the control program, Team Claude-less (15 min) was actually faster than Team Claude (40 min). The unassisted team was genuinely quicker on some sub-tasks; the team with Claude sometimes took a detour, trying more approaches in parallel and writing more code.
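The per-task comparison against the faster human team can be reproduced from the table. The sketch below is only arithmetic on the published figures; the names `TASKS`, `to_minutes`, and `speedup_vs_faster_team` are illustrative and not from the post. Because the Opus times are rounded means, the ratios are approximate.

```python
# Times copied from the table: human teams in minutes, Opus 4.7 as "min:sec".
# None marks a task the team did not finish.
TASKS = {
    "Connect to video camera": (165, 64, "5:57"),
    "Connect to lidar sensor": (154, 35, "0:56"),
    "Write control program": (15, 40, "1:07"),
    "Localize & plot path": (27, 42, "1:34"),
    "Detect beach ball": (None, 83, "2:32"),
}


def to_minutes(mmss: str) -> float:
    """Convert a 'min:sec' string to fractional minutes."""
    minutes, seconds = mmss.split(":")
    return int(minutes) + int(seconds) / 60


def speedup_vs_faster_team(claude_less, team_claude, opus_mmss) -> float:
    """Ratio of the faster finishing human team's time to the Opus 4.7 time."""
    finished = [t for t in (claude_less, team_claude) if t is not None]
    return min(finished) / to_minutes(opus_mmss)


speedups = {name: speedup_vs_faster_team(*row) for name, row in TASKS.items()}
for name, ratio in speedups.items():
    print(f"{name}: {ratio:.1f}x")
```

Every ratio comes out above 10, which is consistent with the team's "at least ten times faster" statement; the video-camera task is the closest call.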

The size of the gap. Looking only at the four tasks both teams completed (video, lidar, control program, localization), the total-time gap is an order of magnitude:

Team Claude-less: 361 min
Team Claude (fastest humans): 181 min
Opus 4.7 alone: 9:35

Total time on the four shared tasks. Opus 4.7 was 37.7× faster than Team Claude-less and 18.9× faster than Team Claude. On a bar chart of these totals the Opus 4.7 bar is barely visible, which is exactly what "an order of magnitude faster" looks like.
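The two multipliers follow directly from the totals. A quick check, using the per-task human times from the table and the reported 9:35 Opus total (the variable names are illustrative):

```python
# Human times (minutes) on the four tasks both teams completed:
# video, lidar, control program, localization.
claude_less_min = [165, 154, 15, 27]
team_claude_min = [64, 35, 40, 42]

# Reported Opus 4.7 total for the same four tasks, 9 min 35 s.
# Summing the rounded per-task means gives 9:34; the reported figure is used here.
opus_total_min = 9 + 35 / 60

claude_less_total = sum(claude_less_min)   # 361
team_claude_total = sum(team_claude_min)   # 181

ratio_vs_claude_less = claude_less_total / opus_total_min
ratio_vs_team_claude = team_claude_total / opus_total_min

print(f"vs Team Claude-less: {ratio_vs_claude_less:.1f}x")
print(f"vs Team Claude:      {ratio_vs_team_claude:.1f}x")
```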

Faster and leaner. Code volume is just as lopsided: Opus 4.7 was at least as successful as both human teams while writing roughly one-tenth as much code as Team Claude.

Team Claude: 10,309
Team Claude-less: 1,136
Opus 4.7 alone: 1,045

Lines of code. The Opus 4.7 figure is from the one trial with code-volume data.

There were quality differences too. Where humans struggled to choose between ways of interfacing with the robodog's sensors, Opus 4.7 quickly identified the best path; much of its code worked on the first try, which neither human team managed in round one. It was not flawless: it defaulted to an outdated object-detection algorithm, but still worked around it and arrived at an effective solution.

The team stresses, as in round one, that this progress is not the result of any concerted effort to improve the models' robotics ability. It emerged from much more general scaling — like so many other capabilities in LLM history — with no one training it specifically to drive a robodog.

03 · Limit

Stuck on the "fetch" itself

What the model couldn't do is exactly the "fetch" in Project Fetch: precisely nudging the beach ball back to the start.

The team is specific about it. With their hands and a little practice, humans could pilot the robodog to gently knock the ball back to the patch of fake grass it started on. That takes a fast closed loop: sense whether the ball has drifted off course, relate that error to the last command, locate where the ball is now, and adjust the next input to move it more precisely.
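The loop described above (sense the error, relate it to the last command, adjust the next input) can be sketched in a few lines. This is a toy one-dimensional illustration, not the team's code or anything the model wrote: the ball's lateral offset from the line home is the error, and `true_response` (how far the ball actually moves per unit of command) is a made-up quantity the controller has to learn from its own attempts.

```python
def nudge_ball_home(initial_offset_m: float, true_response: float, steps: int = 5):
    """Toy closed loop: correct a ball's lateral offset with an online estimate.

    All names and the 1-D "plant" are hypothetical, chosen for illustration.
    """
    offset = initial_offset_m
    est_response = 1.0          # initial guess: 1 m of ball travel per unit command
    history = [offset]
    for _ in range(steps):
        # 1. Sense: how far off course is the ball right now?
        error = offset
        # 2. Decide: pick the command that would cancel the error
        #    if the current estimate of the ball's response were right.
        command = -error / est_response
        # 3. Act: the ball actually moves according to the true response.
        new_offset = offset + true_response * command
        # 4. Learn: relate the observed movement to the last command.
        if abs(command) > 1e-9:
            est_response = (new_offset - offset) / command
        offset = new_offset
        history.append(offset)
    return history


history = nudge_ball_home(initial_offset_m=1.0, true_response=0.4)
print([round(x, 3) for x in history])
```

In this toy the first nudge undershoots because the initial guess is wrong, and the second lands close to the line once the response has been re-estimated. Real contact with a rolling ball is far noisier, which is part of why the step is hard.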

This is a kind of closed loop at which people excel (at least after making some mistakes and learning from them). Opus 4.7 struggled to capture that subtlety. Like the round-one humans who reached this stage, it could move the robot behind the ball and position it to knock the ball home, but the control was poor, and, like those participants, it did not succeed.

Anthropic Frontier Red Team · Project Fetch: Phase two

There's a telling counterpoint: one round-one volunteer with more robotics experience did successfully write a program for autonomous fetching. On that basis, the team judges it very likely that the current generation of Claude could do the same with more time and additional scaffolding. What they'll be watching next is whether the model can finish this last step with the same speed and reliability it showed on everything else.

From the post

"With more time and additional scaffolding, we think it is very likely that current generations of Claude could do the same."

04 · Arc

Uplift first, autonomy after

Put the two rounds together and a clear trajectory shows up.

2025 · Round one

Opus 4.1

On its own, the model couldn't even connect to the robot; its value was in helping people. The team with Claude was about twice as fast as the team without.

2026 · Round two

Opus 4.7

Less than a year later, the model finishes that batch of tasks unaided, and about 20× faster than the fastest human team.

The team planted this judgment back in the round-one post: "in AI, uplift often precedes autonomy. What models can help humans accomplish today, they can frequently do alone tomorrow." Round two cashes that in. Their own analogy is coding: programmers long ago stopped handing AI snippets to debug and instead hand it tasks and let the model write the code.

We are plausibly entering the early era of physical agentic AI. The team is careful to add: "This doesn't mean that LLMs have now solved robotics. Far from it." None of the tasks here touch the harder, low-level parts of robotic control.

Anthropic Frontier Red Team · Project Fetch: Phase two

What the team thinks has changed: we seem much closer to a world where models can use off-the-shelf physical tools with relative ease, at least for limited purposes. They liken it to how AI came to use existing software-editing tools like string-replace on its way to more agentic coding.

The pattern they keep pointing to is a three-step arc: "first, models are helpful to humans. Then, humans are helpful to models. Finally, models are largely able to do things themselves." They've seen it in cybersecurity, and now it's taking shape at the intersection of AI and the physical world. The thread also ties into their work monitoring the potential for AI to automate AI R&D, a capability threshold in Anthropic's Responsible Scaling Policy.

05 · Mechanics

Why real-time control is harder than writing code

This section is analysis, not the source's conclusion. What follows draws on common knowledge from control theory and robotics to explain why the model excelled at the earlier tasks but stalled on nudging the ball. Anthropic's post only goes as far as "the fetch is a closed loop humans excel at, and Claude controlled it poorly"; it doesn't unpack the mechanics below.

The key is to see that the tasks the model did well and the one it stalled on are not the same kind of task. The former — connecting sensors, choosing an interface, writing the ball-detection program, planning a path — are "think-it-through-once-and-produce-it" cognitive tasks. You can reason them out offline, write the code, and run it once to check. That's squarely LLM territory. "Nudging the ball precisely back to the start" is a different kind: continuous closed-loop control. It can't be done in one shot; it demands looking and micro-adjusting over and over against physical feedback. There are at least three layers of difficulty.

The timescales don't match

Physical control loops typically run at tens to hundreds of hertz, meaning dozens to hundreds of corrections per second, while one LLM inference takes seconds. A loop that needs several seconds to think cannot keep pace with a ball rolling across the floor in real time. It's the opposite of the relaxed constraint the model enjoys when writing code, where taking time to deliberate is fine.
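A back-of-the-envelope calculation makes the mismatch concrete. Every number below is an assumption picked for illustration (a 50 Hz loop, a 3-second inference, a ball rolling at 0.5 m/s); none of them comes from the post.

```python
def drift_per_decision_m(ball_speed_m_s: float, decision_period_s: float) -> float:
    """Distance the ball travels between two consecutive corrections."""
    return ball_speed_m_s * decision_period_s


# Assumed values, for illustration only.
BALL_SPEED_M_S = 0.5      # a slowly rolling ball
CONTROL_LOOP_HZ = 50      # within "tens to hundreds of hertz"
LLM_LATENCY_S = 3.0       # "one LLM inference takes seconds"

fast_loop_drift = drift_per_decision_m(BALL_SPEED_M_S, 1 / CONTROL_LOOP_HZ)
llm_loop_drift = drift_per_decision_m(BALL_SPEED_M_S, LLM_LATENCY_S)
corrections_per_inference = LLM_LATENCY_S * CONTROL_LOOP_HZ

print(f"Drift between corrections at {CONTROL_LOOP_HZ} Hz: {fast_loop_drift:.2f} m")
print(f"Drift during one {LLM_LATENCY_S:.0f} s inference: {llm_loop_drift:.2f} m")
print(f"Fast-loop corrections per inference: {corrections_per_inference:.0f}")
```

Under these assumptions the ball moves about a centimetre between corrections in the fast loop, and about a metre and a half while a single inference completes.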

Contact dynamics are nonlinear

The dynamics of a round ball, ground friction, and the contact point where the robodog's leg or body meets the ball are all strongly nonlinear. A light touch might leave the ball motionless or send it shooting off in the wrong direction. This "contact-plus-rolling" physics is notoriously hard to model precisely; even dedicated robotic-control research struggles with it. The model lacks a reliable world model to predict "if I move like this, the ball rolls like that."
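Even a heavily simplified model shows the nonlinearity. The sketch below assumes an instantaneous frictionless tap, a constant rolling-resistance deceleration, and made-up values for mass, radius, and resistance; it ignores air drag, spin, and deformation, all of which matter for a real beach ball. The function names are hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def roll_distance_m(impulse_n_s: float, ball_mass_kg: float = 0.1,
                    rolling_resistance: float = 0.05) -> float:
    """Distance rolled after a tap, assuming constant deceleration c_rr * g."""
    launch_speed = impulse_n_s / ball_mass_kg
    return launch_speed ** 2 / (2 * rolling_resistance * G)


def departure_angle_deg(contact_offset_m: float, ball_radius_m: float = 0.2) -> float:
    """Angle between push direction and ball travel for an off-center,
    frictionless contact: the ball leaves along the contact normal."""
    return math.degrees(math.asin(contact_offset_m / ball_radius_m))


# Doubling the tap does not double the distance in this toy model.
for impulse in (0.01, 0.02, 0.04):
    print(f"impulse {impulse:.2f} N*s -> rolls {roll_distance_m(impulse):.3f} m")

# Small errors in where the robot touches the ball become large heading errors.
for offset_cm in (0, 5, 10):
    angle = departure_angle_deg(offset_cm / 100)
    print(f"contact {offset_cm} cm off center -> ball departs at {angle:.1f} deg")
```

In this model the distance grows with the square of the impulse, and a contact point halfway out along the ball's radius sends the ball off at roughly 30 degrees from the push direction.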

It's writing a controller, not piloting by hand

The post notes the model couldn't use a physical controller. So "push the ball back" amounts to writing a controller that adapts and corrects online — precisely the low-level actuation policy the post says this round did not touch. The fetch step lands right on the edge of its capability envelope: it can call existing tools, but it isn't yet good at writing a strong real-time control policy from scratch.

Humans can do it thanks to three things the model currently lacks: an internalized physical intuition and proprioception (the mapping from "how much force" to "how it moves" is wired into the body, no computation needed); a continuous, low-latency feedback loop (eyes on the ball, hands adjusting in real time, matched to the pace of the roll); and the ability to learn online over a few attempts (miss twice, get the feel on the third try — whereas a model has no such learn-as-you-go mechanism within a single task).

Lines up with another study

What determines an AI tool's usefulness is domain expertise, not coding proficiency. The person in round one who could write the autonomous-fetch program was the one with more robotics experience. What the fetch step lacks is robotics expertise, not coding ability, which is also why the team judges that "more time and additional scaffolding" would very likely close the gap.

Sources

Official first-party, faithfully rendered

Official · Project Fetch: Phase two (round two)

Anthropic Frontier Red Team · 2026-06-18 · Michael Ilie, C. Daniel Freeman, Kevin K. Troy. The speed, code-volume, and strength/weakness data here come from this post and its three charts. It reports Opus 4.7 rather than the stronger Mythos Preview because 4.7 was the most advanced non-Mythos-class model available at the time of the experiment.

Official · Project Fetch: Can Claude train a robot dog? (round one)

Anthropic Frontier Red Team · 2025-11-12 (experiment run August 2025). The round-one uplift study, the 7/8 vs 6/8 result, the Opus 4.1 baseline, and the "uplift precedes autonomy" judgment all come from this post. Section 05, "why real-time control is harder than writing code," is analysis grounded in control theory rather than a conclusion from the source, as noted inline. On scope: the experiment had only two teams over a single day and was a convenience sample — small in scale.