Agent-Friendly CLI · 4 authoritative sources

Stop wrapping hundreds of API endpoints one-to-one. Ship tools agents will actually use.

Agents don't need a GUI, and they don't need a 1:1 wrapper around your REST API. They need workflow-shaped, self-describing, hallucination-hardened CLI commands. Your hundreds of endpoints should consolidate into 30–50 agent-facing commands—that's the right merge ratio.

Sources: Anthropic · Justin Poehnelt (Google) · AXI benchmarks · CWC 2026 #14
Evidence: 425 runs / 17 tasks / Sonnet 4.6
~/internal-cli
✗ 1:1 wrap — agent reassembles workflow itself
internal-cli users list → find Jane's id
internal-cli users get --id 73a9f...
internal-cli events query --user 73a9f...
internal-cli events create --user 73a9f... --time ...
# 4 steps, 4 round-trips, every ID a hallucination risk
▼ Merged into a workflow command
✓ Workflow-merged — single call, single intent
internal-cli schedule-meeting \
  --with "Jane" \
  --topic "Q1 review" \
  --duration 30m
# Internally: lookup → free/busy → create → invite

CLIs aren't "just another interface"—they're a token-efficiency pattern

Anthropic's Claude Code documentation states it directly: CLI tools are the most context-efficient way for agents to interact with external services. This isn't taste—it's a constraint-driven design conclusion. Schema definitions consume context, and every extra round-trip taxes the agent. The AXI project's 425-run benchmark provides the hard numbers.

"CLI tools are the most context-efficient way to interact with external services." — code.claude.com/docs/en/best-practices

Claude Code itself ships only four primitive tools (bash / read / write / edit) and lets the agent transact with the world through the shell. That's not minimalism for its own sake—bash is the most universal, most composable interface humans have built. The Agent SDK announcement frames this principle plainly:

"The key design principle behind the Claude Agent SDK is to give your agents a computer, allowing them to work like humans do." — Anthropic, "Building agents with the Claude Agent SDK" (2025-09-29)

AXI Benchmark: five conditions in a head-to-head race

The AXI project (github.com/kunchenguid/axi) ran 425 benchmark trials—17 tasks × 5 conditions × 5 repeats—against Claude Sonnet 4.6, using GitHub operations as the representative workflow.

Success rate (425 runs · Claude Sonnet 4.6 · Mar 2026)

| Condition | Success rate |
|---|---|
| gh-axi (winner) | 100% |
| GitHub MCP | 87% |
| gh (native CLI) | 86% |
| MCP + Code Mode | 84% |
| MCP + ToolSearch | 82% |

| Metric | Values | Takeaway |
|---|---|---|
| Avg cost per task | $0.050 vs $0.148 | 66% cheaper vs MCP |
| Avg duration | 15.7s vs 34.2s | 54% faster |
| Avg turns | 3 vs 6 | half the round-trips |

Bottom line: an agent-optimized CLI dominates MCP on success rate, cost, and latency simultaneously. The reasons are mechanical—each MCP tool definition costs ~200–500 tokens, the discovery/call dance adds round-trips, and the shell lets agents pipe operations together in a single bash invocation.
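
The headline percentages can be re-derived from the raw averages reported above. A small arithmetic check follows; the 50-tool count is a hypothetical example used to size the schema overhead, not a figure from the benchmark.

```python
def pct_reduction(baseline: float, improved: float) -> float:
    """Relative reduction from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100


cost_saving = pct_reduction(0.148, 0.050)  # avg cost per task, USD
time_saving = pct_reduction(34.2, 15.7)    # avg duration, seconds

# Hypothetical: 50 preloaded tool definitions at ~200-500 tokens each.
tools = 50
schema_overhead = (tools * 200, tools * 500)

print(f"cost: {cost_saving:.0f}% cheaper")
print(f"time: {time_saving:.0f}% faster")
print(f"schema overhead: {schema_overhead[0]}-{schema_overhead[1]} tokens")
```

Under that assumption, preloading 50 definitions would spend between 10,000 and 25,000 tokens before the agent does any work.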

CLI or MCP? Six factors to decide

The matrix below comes from a DEV Community deep-dive on AI-agent CLIs—useful as an engineering checklist. But the more important judgment comes from Justin Poehnelt: this isn't a binary choice. It's two surfaces over one core.

| Factor | Choose CLI | Choose MCP |
|---|---|---|
| Number of operations | < 15 commands | 50+ tools and the agent has no shell |
| State across calls | Stateless | Stateful sessions required |
| Shell access | Yes | No (API-only environment) |
| Token budget | Tight (CLI is leaner) | Generous |
| Existing tooling | An existing CLI to wrap or refactor | Greenfield server build |
| Reliability requirement | High (no extra process to fail) | Server dependency acceptable |

The pragmatic answer: Justin Poehnelt's pattern is one binary, two surfaces—CLI for execution (where the cost/latency wins live) and MCP for discovery (when the agent can't possibly hold every tool's schema in context at once). Both surfaces are generated from a single schema source so you don't pay double maintenance.

Ten principles, ordered by ROI

The principles below are synthesized from three authoritative streams: Anthropic's official methodology, Justin Poehnelt's lessons from Google Workspace's CLI (where he is a DevRel engineer), and AXI's benchmark evidence. Each principle is tagged with its source. The first three are the minimum bar—doing only Phase 1 already makes existing agents usable on your CLI.

Source legend
A Anthropic official (highest)
G Justin Poehnelt / Google
AXI AXI benchmark evidence
CWC Code with Claude 2026 #14
#01 Merge by workflow, don't wrap one-to-one [A]
Each CLI command should map to a step the agent would naturally take, not to an underlying API endpoint.

Anthropic calls this out by name in Writing effective tools for agents:

"A common error we've observed is tools that merely wrap existing software functionality or API endpoints—whether or not the tools are appropriate for agents."
✗ 1:1 wrap
internal-cli users list
internal-cli events list
internal-cli events create \
  --user-id X --title Y --time Z

✓ Workflow-merged
internal-cli schedule-meeting \
  --with "Jane" \
  --topic "Q1 review" \
  --duration 30m
# Internal: user lookup → free/busy
#   → create event → send invite

How to apply: list the top 20 high-frequency tasks your agent has to solve, then design a single-shot command for each. Hundreds of APIs collapsing into 30–50 agent-facing commands is a healthy merge ratio.
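
To make the merge concrete, here is a minimal sketch of what a workflow command might do internally. The in-memory USERS and BUSY tables and every function name are hypothetical stand-ins for real directory and calendar APIs, and invite delivery is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical in-memory stand-ins for the directory and calendar APIs.
USERS = {"u-1": "Jane Doe", "u-2": "Raj Patel"}
BUSY = {"u-1": [(datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 10, 0))]}


@dataclass
class Meeting:
    user_id: str
    topic: str
    start: datetime
    end: datetime


def resolve_user(name: str) -> str:
    """Map a human-readable name to exactly one internal ID."""
    matches = [uid for uid, full in USERS.items() if name.lower() in full.lower()]
    if len(matches) != 1:
        raise LookupError(f"expected 1 match for {name!r}, found {len(matches)}")
    return matches[0]


def first_free_slot(user_id: str, day_start: datetime, duration: timedelta) -> datetime:
    """Return the earliest start at or after day_start that avoids busy blocks."""
    start = day_start
    for busy_start, busy_end in sorted(BUSY.get(user_id, [])):
        if start + duration <= busy_start:
            break  # the slot fits before this busy block
        start = max(start, busy_end)
    return start


def schedule_meeting(with_name: str, topic: str, minutes: int, day_start: datetime) -> Meeting:
    """One call: lookup, then free/busy, then create. The caller never handles an ID."""
    user_id = resolve_user(with_name)
    duration = timedelta(minutes=minutes)
    start = first_free_slot(user_id, day_start, duration)
    return Meeting(user_id, topic, start, start + duration)
```

The agent supplies a name and a duration; the ID lookup and the conflict check happen inside the command, so there is no intermediate identifier for it to hallucinate.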

#02 Structured output is the floor [G · AXI]
Every command must support --output json (or --json). No exceptions.
# Human mode (default)
internal-cli orders recent
# → pretty table

# Agent mode
internal-cli orders recent --output json
# → pure JSON on stdout, logs go to stderr
| Capability | What it does | Why it matters |
|---|---|---|
| --fields | Field masks limit returned columns | Protects the context window |
| --stream | NDJSON, one object per line | Streamable; no memory blow-ups |
| auto-detect TTY | Switch to JSON when stdout isn't a TTY | Agents don't need to remember the flag |
| TOON format | A token-efficient alternative | AXI measured ~40% token savings |

AXI's additions: default to 3–4 core fields per record (e.g. id, name, status), not the full 10+ field response. Auto-truncate large text fields and offer --full as an escape hatch. Include pre-computed aggregates (count: 47 of 1203) so the agent doesn't have to count.
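
A minimal sketch of slim defaults, field masks, and pre-computed counts. The function names, the DEFAULT_FIELDS tuple, and the output envelope are illustrative assumptions, not AXI's actual implementation.

```python
import json

DEFAULT_FIELDS = ("id", "name", "status")  # hypothetical slim default


def pick_format(explicit, stdout_is_tty):
    """An explicit --output wins; otherwise terminals get a table and pipes get JSON."""
    return explicit or ("table" if stdout_is_tty else "json")


def render(records, fields=None, limit=20, total=None):
    """Project records onto a field mask and attach a pre-computed count."""
    keep = tuple(fields) if fields else DEFAULT_FIELDS
    shown = [{k: r[k] for k in keep if k in r} for r in records[:limit]]
    total = len(records) if total is None else total
    body = {"count": f"{len(shown)} of {total}", "items": shown}
    return json.dumps(body, separators=(",", ":"))
```

In a real CLI, stdout_is_tty would typically come from sys.stdout.isatty(), so an agent reading a pipe gets JSON without passing any flag.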

#03 Runtime schema introspection beats static docs [G]
The CLI should describe itself—a schema subcommand returns machine-readable signatures.

The problem is simple: agents can't Google your docs. Doc lookups burn context tokens and the docs themselves go stale.

internal-cli schema orders.create
# Returns: param types, required/optional, enums, example, scopes

internal-cli schema --all
# Schema index for tool-discovery flows
"Each gws schema call dumps the full method signature—params, request body, response types, required OAuth scopes—as machine-readable JSON. The agent self-serves without pre-stuffed documentation." — Justin Poehnelt

Help-text rules for agent friendliness: the first 2–3 lines must be real usage examples (agents learn faster from examples than from flag descriptions); mark required vs optional clearly; show enum values for constrained flags (--format json|table|csv); and keep per-command help under 50 lines so the signal isn't diluted.
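
One way a schema subcommand could be backed, sketched under the assumption that command metadata lives in a registry dict. The REGISTRY contents and the schema function are hypothetical; a real CLI would likely derive them from its argument parser or API spec.

```python
import json

# Hypothetical command registry.
REGISTRY = {
    "orders.create": {
        "summary": "Create an order",
        "params": {
            "customer_id": {"type": "string", "required": True},
            "priority": {
                "type": "string",
                "required": False,
                "enum": ["low", "normal", "high"],
            },
        },
        "example": "internal-cli orders create --customer-id C42 --priority high",
    },
}


def schema(command=None):
    """Return one command's signature, or an index of all commands, as JSON."""
    if command is None:
        index = {name: spec["summary"] for name, spec in REGISTRY.items()}
        return json.dumps({"commands": index}, indent=2)
    if command not in REGISTRY:
        # Unknown names get the list of valid ones so the agent can self-correct.
        return json.dumps({"error": "unknown_command", "known": sorted(REGISTRY)})
    return json.dumps({"command": command, **REGISTRY[command]}, indent=2)
```

Because the registry is the same data the CLI runs on, the description cannot drift from the behavior the way a web page can.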

#04 Input hardening: defend against hallucination [G]
Agents are not trusted operators. Humans typo, agents hallucinate—the failure modes are completely different.
| Failure mode | Human | Agent | Defense |
|---|---|---|---|
| Path traversal ../../.ssh | almost never | yes (confuses path segments) | canonicalize + sandbox to CWD |
| Control characters | copy-paste | generates invisible chars | reject ASCII < 0x20 |
| Embedded query in resource ID | no | yes (fileId?fields=name) | reject ? and # in IDs |
| Double URL-encoding | almost never | common (%2e%2e for ..) | reject % in resource names |
| Oversized inputs | rare | long context text passed as arg | enforce length limits |
def validate_input(value: str, field_type: str) -> str:
    """Every agent-supplied input passes through here."""
    reject_control_chars(value)
    reject_path_traversal(value)
    reject_embedded_query_params(value)
    reject_percent_encoding(value)
    enforce_length_limit(value, field_type)
    return value

Anthropic's reinforcement, lifted from an example AGENTS.md: "This CLI is frequently invoked by AI/LLM agents. Always assume inputs can be adversarial."
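
The defenses in the table above can be sketched as one self-contained validator plus a sandbox check. The MAX_LEN limits, the field-type names, and the InputRejected exception are assumptions made for illustration; real limits depend on the API behind the CLI.

```python
import os

MAX_LEN = {"resource_id": 128, "text": 4096}  # hypothetical limits


class InputRejected(ValueError):
    """Raised when an agent-supplied value fails a hardening check."""


def validate_input(value: str, field_type: str) -> str:
    if any(ord(ch) < 0x20 for ch in value):
        raise InputRejected("control character in input")
    if len(value) > MAX_LEN.get(field_type, 256):
        raise InputRejected("input too long")
    if field_type == "resource_id":
        if any(ch in value for ch in "?#%"):
            raise InputRejected("query, fragment, or percent-encoding in resource ID")
        if ".." in value.replace("\\", "/").split("/"):
            raise InputRejected("path traversal segment")
    return value


def resolve_in_sandbox(user_path: str, root: str) -> str:
    """Canonicalize user_path under root and refuse anything that escapes it."""
    root_real = os.path.realpath(root)
    candidate = os.path.realpath(os.path.join(root_real, user_path))
    if os.path.commonpath([root_real, candidate]) != root_real:
        raise InputRejected("path escapes the sandbox root")
    return candidate
```

Rejecting outright, rather than silently cleaning the value, keeps the failure visible so the agent can correct its own input.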

#05 --dry-run is mandatory for writes [G · AXI]
Every mutating operation (create / update / delete) must support --dry-run.
internal-cli orders cancel --order-id 12345 --dry-run
# Output: Would cancel order 12345 (status: processing, amount: $847.00)
# No state change.

Why it's non-negotiable: agents naturally use --dry-run to verify their understanding; the cost asymmetry is huge—a hallucinated argument in dry-run wastes one call, but executed it can lose data; and in human-in-the-loop scenarios, the agent can show dry-run output to a human for confirmation.

Related rule: no interactive "Are you sure? (y/n)" prompts—agents can't answer stdin prompts. Destructive commands should require an explicit --confirm flag instead.
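
A minimal sketch of how --dry-run and --confirm might gate a destructive handler. The ORDERS dict stands in for a real backend, and the response shape is an assumption.

```python
ORDERS = {"12345": {"status": "processing", "amount": 847.00}}  # hypothetical store


def cancel_order(order_id: str, dry_run: bool = False, confirm: bool = False) -> dict:
    """Preview with dry_run, refuse without confirm, mutate only when confirmed."""
    order = ORDERS.get(order_id)
    if order is None:
        return {"ok": False, "error": "order_not_found"}
    if dry_run:
        preview = (
            f"Would cancel order {order_id} "
            f"(status: {order['status']}, amount: ${order['amount']:.2f})"
        )
        return {"ok": True, "changed": False, "message": preview}
    if not confirm:
        return {
            "ok": False,
            "error": "confirmation_required",
            "suggestion": "Re-run with --confirm, or preview with --dry-run",
        }
    order["status"] = "cancelled"
    return {"ok": True, "changed": True, "message": f"Cancelled order {order_id}"}
```

The dry-run branch returns before any write, and the missing-confirm branch answers with a structured error instead of blocking on stdin.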

#06 Non-interactive authentication [G · AXI]
Never require a browser-based OAuth redirect or interactive login.
✓ Acceptable
  • Env var INTERNAL_API_TOKEN
  • Credential file at ~/.config/.../credentials.json
  • Service accounts injected into the container
  • System keychain (for human convenience)
✗ Unacceptable
  • "Press Enter to continue"
  • Browser OAuth consent screens
  • Interactive password prompts on stdin
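
A sketch of a credential lookup that never prompts: environment variable first, then a credentials file. The "token" key inside the file is an assumption, since the source does not specify the file's layout.

```python
import json
import os


def load_token(environ=os.environ, credentials_path=None):
    """Return an API token without prompting, or None if nothing is configured."""
    token = environ.get("INTERNAL_API_TOKEN")
    if token:
        return token
    if credentials_path and os.path.isfile(credentials_path):
        with open(credentials_path, encoding="utf-8") as fh:
            return json.load(fh).get("token")  # hypothetical key name
    return None
```

When this returns None, the CLI can exit with a structured auth error naming the missing variable, which an agent can report instead of hanging on a login screen.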
#07 Semantic error handling [A · G]
Errors must be specific and actionable, not opaque codes or stack traces.
✗ Opaque
{
  "error": "ERR_4012",
  "trace": "at OrderService.java:847..."
}

✓ Actionable
{
  "error": "order_not_found",
  "message": "Order ID 12345 does not exist",
  "suggestion": "Use 'internal-cli orders search --customer X' to find valid order IDs",
  "exit_code": 1
}
"If a tool call raises an error, you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks." — Anthropic, Writing effective tools for agents

Exit-code conventions (agent-cli-framework)

| Code | Meaning | What the agent should do |
|---|---|---|
| 0 | Success | Continue |
| 1 | Transient error (retryable) | Read stderr, wait, retry |
| 2 | Config / auth error | Stop, report missing credential |
| 3 | Input validation failed | Fix input, retry |
| 4 | Rate limited | Back off, retry |
| 124 | Timeout | Retry or split the request |
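
The exit-code convention above can be wired to structured errors with a small helper. This is a sketch: the kind names in EXIT_CODES and the choice to write the JSON to stderr are assumptions, not part of the cited framework.

```python
import json
import sys

# Error kinds mapped to the exit codes in the table above.
EXIT_CODES = {
    "transient": 1,
    "auth": 2,
    "validation": 3,
    "rate_limited": 4,
    "timeout": 124,
}


def error_payload(kind, code, message, suggestion):
    """Build a structured error an agent can act on."""
    return {
        "error": code,
        "message": message,
        "suggestion": suggestion,
        "exit_code": EXIT_CODES[kind],
    }


def fail(kind, code, message, suggestion):
    """Write the error as JSON to stderr and exit with the matching code."""
    payload = error_payload(kind, code, message, suggestion)
    print(json.dumps(payload), file=sys.stderr)
    sys.exit(payload["exit_code"])
```

Keeping the mapping in one table means the exit code and the JSON body cannot disagree about what kind of failure occurred.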
#08 Ship a CONTEXT.md / SKILL.md alongside the CLI [G · CWC]
--help tells an agent the parameter shape, but it can't communicate invariants and best practices. Ship a SKILL.md alongside your CLI so it travels with the tool.
---
name: internal-orders-cli
version: 2.1.0
---
# Internal Orders CLI: Agent Guide

## Invariants (must hold for every operation)
- ALWAYS use --output json
- ALWAYS use --fields to limit response size
- ALWAYS use --dry-run before any delete/cancel
- NEVER pass customer PII directly; look up by ID first

## Common workflows

### Refund processing
1. orders search --customer-id CID --status disputed
2. orders refund --order-id OID --reason "..." --dry-run
3. Once dry-run output looks correct, drop the flag and execute

Will Steuk's CWC 2026 Session #14 demo: he migrated all the business logic out of a 400-line system prompt and into Skills (loaded on demand via progressive disclosure). The system prompt collapsed to 15 lines. Eval pass rate jumped from 62% to 92%—same model, same task suite.
#09 Response-token discipline [A · AXI]
Anthropic's hard rule: "For Claude Code, we restrict tool responses to 25,000 tokens by default."
| Strategy | Implementation | Effect |
|---|---|---|
| Pagination | --limit 20 + --page-token | Prevents one-shot table dumps |
| Field masks | --fields "id,name,status" | Cuts payload by 50–80% |
| Truncation + hint | Truncate large fields, include [truncated, 4.2KB total, use --full] | Agent expands on demand |
| Pre-computed aggregates | Output carries total: 1203, shown: 20, filtered: 47 | Fewer follow-up calls |
| Token-efficient format | TOON / compact JSON | ~40% savings (AXI) |
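
Two of these strategies, truncation with a hint and pagination with counts, are small enough to sketch directly. The function names and the offset-based page token are assumptions; production tokens are usually opaque cursors rather than plain offsets.

```python
def truncate(text: str, max_chars: int = 200) -> str:
    """Shorten long text and tell the agent how to get the rest."""
    if len(text) <= max_chars:
        return text
    size_kb = len(text.encode("utf-8")) / 1024
    return f"{text[:max_chars]} [truncated, {size_kb:.1f}KB total, use --full]"


def paginate(rows, limit=20, page_token=None):
    """Return one page plus counts and a token for the next page, if any."""
    start = int(page_token) if page_token else 0
    page = rows[start:start + limit]
    next_start = start + len(page)
    return {
        "total": len(rows),
        "shown": len(page),
        "items": page,
        "next_page_token": str(next_start) if next_start < len(rows) else None,
    }
```

Carrying total and shown in every response lets the agent decide whether another page is worth a call without counting rows itself.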

AXI's distinctive idea: contextual disclosure

Append a "next-step suggestion" to every output. The agent doesn't have to know all possible follow-ups in advance—the tool itself nudges it forward:

count: 3 of 47
orders[3]{id,status,amount}:
  ORD-001, processing, $847.00
  ORD-002, shipped, $124.50
  ORD-003, disputed, $2100.00
help[1]:
  Run `internal-cli orders detail ORD-003` to see dispute details
#10 A raw-JSON payload path [G]
When a CLI sits in front of a REST API, ship a passthrough path—bespoke flags for humans, --json for agents.
# Human-friendly mode (bespoke flags)
internal-cli users create --name "Jane" --role admin --team engineering

# Agent passthrough mode (raw API payload)
internal-cli users create --json '{
  "name": "Jane",
  "role": "admin",
  "team": "engineering",
  "permissions": ["read", "write", "admin"],
  "metadata": {"source": "agent-created", "ticket": "JIRA-4521"}
}'

Why agents prefer JSON: it maps directly to the API schema with zero translation loss; it expresses nested structure that flat flags can't; LLMs are natively excellent at generating JSON; and it sidesteps the shell-quoting and flag-ordering ambiguities that bite agents.
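
A minimal sketch of how one command might accept both paths. The build_payload name and the rule that --json and convenience flags are mutually exclusive are assumptions; a CLI could equally choose to merge the two.

```python
import json
from typing import Optional


def build_payload(flags: dict, raw_json: Optional[str] = None) -> dict:
    """Use the raw --json payload when given; otherwise build it from flags."""
    if raw_json is not None:
        if any(v is not None for v in flags.values()):
            raise ValueError("pass either --json or individual flags, not both")
        payload = json.loads(raw_json)
        if not isinstance(payload, dict):
            raise ValueError("--json must be a JSON object")
        return payload
    # Unset flags arrive as None and are dropped from the request body.
    return {k: v for k, v in flags.items() if v is not None}
```

Both paths end in the same dict, so the validation and the API call downstream do not need to know which one the caller used.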

"Make the raw-payload path a first-class citizen alongside any convenience flags you ship for humans. Most teams can't afford to maintain two separate tools." — Justin Poehnelt

Three phases, ten steps: from minimum bar to agent-native

Sequenced by ROI. Phase 1's three steps fit in two weeks and let existing agents start using your CLI immediately. Prioritize the highest-frequency workflows first—don't try to convert everything at once.

PHASE 1 1–2 weeks

Minimum-viable retrofit

Make existing agents usable today. Immediate, visible payoff.

  • 1. --output json on every command → agents can parse
  • 2. Input validation + hardening → block hallucination escapes
  • 3. Structured + actionable errors → agents can self-correct
PHASE 2 2–4 weeks

Agent-Optimized

Self-service introspection plus a safety net for writes.

  • 4. Add schema / --describe → runtime introspection
  • 5. Add --fields and --limit → context protection
  • 6. Add --dry-run → write-side safety net
PHASE 3 4–8 weeks

Agent-Native

Move from 1:1 wrappers to workflow-shaped commands.

  • 7. Ship SKILL.md / CONTEXT.md → encode the implicit knowledge
  • 8. Merge commands by workflow → fewer tools, higher success
  • 9. Add an MCP surface (if needed) → cover no-shell environments
  • 10. Eval-driven optimization → let Claude tune descriptions

"More tools = stronger agent" is wrong

Will Steuk (Anthropic, Code with Claude 2026 Session #14) walked through a concrete agent regression-and-recovery story. Three numbers carry the entire insight:

  • System prompt: 400 lines → 15 lines. Business logic moved into Skills (progressive disclosure)
  • Custom tools: 12 → 3. Down to bash / read / write primitives (agent composes via code execution)
  • Eval pass rate: 62% → 92%. +30 percentage points (same model, same tasks, different tool design)

The insight: twelve custom tools chewed up so much context with their schema definitions that the agent had less room to think. Cutting back to three primitives plus on-demand Skills made the agent smarter, not weaker.

Will's tool-design priority

  1. Start with human-like primitives (bash / read / write)—general-purpose capability
  2. Add custom tools only when needed—when primitives can't do it efficiently
  3. MCP is the last consideration—only when there's no shell, or you genuinely need discovery

Single binary, multiple surfaces

Justin Poehnelt's pattern from Google Workspace's CLI: one core, many surfaces. CLI and MCP are generated from the same schema source, which keeps you from paying double maintenance.

API / Backend (existing services)
  → Core binary: internal-cli (single source of schema + logic)
      • CLI: human / agent shell ($ internal-cli ...)
      • MCP: stdio server, discovery surface
      • Skill: SKILL.md, progressive context
      • SDK: env / library, embedded use

One schema source, four surfaces generated from it.
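
To illustrate the single-source idea, the sketch below derives both an argparse parser and an MCP-style tool definition (name, description, inputSchema) from one spec dict. The SPEC layout and both generator functions are hypothetical; this is not how the Google Workspace CLI is actually built.

```python
import argparse

# Hypothetical single source of truth for one command.
SPEC = {
    "name": "orders_cancel",
    "description": "Cancel an order by ID.",
    "params": {
        "order_id": {"type": "string", "required": True, "help": "Order to cancel"},
        "dry_run": {"type": "boolean", "required": False, "help": "Preview only"},
    },
}


def to_argparse(spec):
    """CLI surface: one flag per parameter."""
    parser = argparse.ArgumentParser(prog=spec["name"], description=spec["description"])
    for name, p in spec["params"].items():
        flag = "--" + name.replace("_", "-")
        if p["type"] == "boolean":
            parser.add_argument(flag, action="store_true", help=p["help"])
        else:
            parser.add_argument(flag, required=p["required"], help=p["help"])
    return parser


def to_mcp_tool(spec):
    """MCP surface: a tool definition with a JSON Schema for its input."""
    return {
        "name": spec["name"],
        "description": spec["description"],
        "inputSchema": {
            "type": "object",
            "properties": {
                n: {"type": p["type"], "description": p["help"]}
                for n, p in spec["params"].items()
            },
            "required": [n for n, p in spec["params"].items() if p["required"]],
        },
    }
```

Adding a parameter to SPEC changes both surfaces at once, which is the property that avoids maintaining two tools.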

"Most teams can't afford to maintain two separate tools." — generating multiple surfaces from a single schema source is how you escape the double-maintenance trap.

Agent-Readiness Score (agent-cli-builder)

The agent-cli-builder Skill provides an 11-dimension rubric for grading how agent-friendly an existing CLI is:

01 Structured output (JSON / TOON)
02 Semantic exit codes
03 Input validation depth
04 Schema introspection
05 Help-text quality
06 Dry-run support
07 Non-interactive auth
08 Field masks / pagination
09 Error actionability
10 Skill file presence
11 Multi-surface exposure

Common mistakes, paired with the fix

These seven anti-patterns get called out repeatedly across Anthropic, Justin Poehnelt, and AXI. If your CLI hits three or more, agent invocations will reliably get stuck—but each has a clear remediation.

  • 1:1 wrap of every API endpoint
      Problem: tool sprawl, agent struggles to choose, schema definitions devour context
      Fix: merge by workflow; hundreds of APIs converge on 30–50 agent-facing commands
  • Returning the full API response
      Problem: thousands of tokens per call; one response can blow the context window
      Fix: slim defaults (3–4 fields) + --fields + --full escape hatch
  • Using UUIDs as the canonical identifier
      Problem: agents readily hallucinate UUIDs (similar-looking strings, off by one)
      Fix: accept names / emails / slugs as identifiers; keep UUIDs internal
  • Interactive confirmation prompts
      Problem: "Are you sure? (y/n)" blocks the run, because agents can't answer stdin prompts
      Fix: use --confirm or two-step --dry-run + --apply flags
  • Docs only on the web
      Problem: every doc lookup costs context; docs go stale and the agent doesn't know
      Fix: --describe / schema introspection at runtime, so docs travel with the code
  • Stack-trace error responses
      Problem: "java.lang.NullPointerException at OrderService.java:847" gives the agent nothing to act on
      Fix: structured error JSON with a suggestion field that tells the agent what to try next
  • Pre-loading every tool schema
      Problem: ~200–500 tokens per tool definition; dozens of tools eat the context budget
      Fix: progressive disclosure or tool search; load only what's needed

Let Claude tune the tool descriptions for you

Anthropic's recommended methodology: don't hand-tune tool descriptions after writing the prototype. Instead, run evals, feed failed transcripts to Claude, and let it propose refinements. On held-out test sets, the result outperformed hand-written tools.

  1. Prototype: hand-test the CLI inside Claude Code
  2. Generate evals: prompt-response pairs from real workflows
  3. Run evals: programmatic API calls; record success / token usage
  4. Analyze transcripts: feed failing sessions back to Claude Code
  5. Claude refines: improves descriptions / schemas / errors
  6. Held-out check: guards against overfitting

Anthropic's internal result: on held-out test sets for the Slack and Asana tools, Claude's optimized tools outperformed the human-written versions. The gains came from subtle refinements—better namespacing, clearer descriptions, more targeted tool implementations. Most of the lift came from tool-description tuning (wording, examples, edge-case notes), not from changing the underlying implementation.
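
The run-evals and held-out steps can be sketched as a tiny harness. Everything here is an assumption for illustration: the task dict shape, exact-match grading, the 30% held-out share, and run_agent, a hypothetical callable that takes a prompt and returns the agent's final answer.

```python
import random


def split_tasks(tasks, held_out_fraction=0.3, seed=0):
    """Shuffle deterministically and reserve a held-out slice for the final check."""
    shuffled = list(tasks)
    random.Random(seed).shuffle(shuffled)
    held_count = max(1, round(len(shuffled) * held_out_fraction))
    cut = len(shuffled) - held_count
    return shuffled[:cut], shuffled[cut:]


def pass_rate(tasks, run_agent):
    """Fraction of tasks where the agent's answer equals the expected one."""
    if not tasks:
        return 0.0
    passed = sum(1 for t in tasks if run_agent(t["prompt"]) == t["expected"])
    return passed / len(tasks)
```

Tool descriptions are tuned against the first slice only; the second slice is scored once at the end, so a gain that appears only on the tuning slice shows up as overfitting.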

The shortest path from "hundreds of APIs" to "agents that actually work"

The situation

  • Hundreds of internal APIs; agents can't use them efficiently
  • Goal: improve agent interaction quality via a CLI surface
  • Default 1:1 wrapping just relocates the problem to the CLI layer

The recommendation

  • Don't translate 1:1. Merge by business workflow—aim for 30–50 agent-facing commands
  • Retrofit incrementally. Phase 1 ships in two weeks; existing agents become usable on day one
  • Prioritize. Convert high-frequency workflows first, not everything at once
  • Two surfaces. CLI for execution, MCP for discovery
  • Eval-driven. Once it's live, let Claude tune the tool descriptions

Expected ROI (per AXI benchmarks)

  • +14 pts task success (86% → 100%)
  • −66% token cost ($0.148 → $0.050)
  • −54% latency (34.2s → 15.7s)

Talk is cheap — we shipped these 10 principles as a Skill

After laying out the principles, the benchmark, the anti-patterns, and the audit rubric, we figured: words alone don't ship. So we packaged the entire 5-phase workflow as a Claude Code Skill — with a hard merge-ratio gate, a Python+Typer template, an audit rubric callable on its own, and a fully runnable reference implementation. Open-sourced under MIT-0 in aws-samples.

aws-samples/sample-aws-ops-skills-for-agents/api-to-agent-cli

API → Agent-Friendly CLI

A 5-phase prescriptive wizard that forces workflow merging before any code generation — the single highest-leverage step in this domain, and the one users skip most often. OpenAPI specs are seductive, so the Skill makes "merge ratio < 3:1" a hard gate that blocks progression until the design is properly merged.
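
The gate itself is simple arithmetic. A sketch under the assumption that the ratio is endpoints divided by agent-facing commands and that 3:1 is the minimum; the function names are hypothetical and the Skill's own check may differ.

```python
def merge_ratio(endpoint_count: int, command_count: int) -> float:
    """Endpoints per agent-facing command."""
    if command_count <= 0:
        raise ValueError("command_count must be positive")
    return endpoint_count / command_count


def passes_merge_gate(endpoint_count: int, command_count: int, minimum: float = 3.0) -> bool:
    """True when the design merges at least `minimum` endpoints per command."""
    return merge_ratio(endpoint_count, command_count) >= minimum
```

For example, 300 endpoints behind 40 commands gives 7.5:1 and passes, while 100 endpoints behind 80 commands gives 1.25:1 and is sent back for more merging.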

Workflow 5 phases · Inventory → Merge ⭐ → Stub → SKILL.md → Audit
References 6 docs · including 5 real-world merge case studies
Templates 4 outputs · one per phase
Example 1 runnable CLI · Python+Typer, smoke-tested across 10 scenarios
api-to-agent-cli/
├── SKILL.md                         # 5-phase wizard entrypoint
├── references/
│   ├── principles-checklist.md      # 10-principle quick check
│   ├── workflow-merge-examples.md   ⭐ 5 real merge case studies
│   ├── antipatterns-quickref.md     # 7 anti-patterns + fixes
│   ├── cli-template-python.md       # Typer how-to guide
│   ├── error-schema.md              # structured error JSON schema
│   └── audit-rubric.md              # 11-dim score (callable standalone)
└── assets/
    ├── inventory.template.md        # Phase 1 output
    ├── commands.spec.template.md    # Phase 2 output
    ├── SKILL.template.md            # Phase 4 output
    ├── audit-report.template.md     # Phase 5 output
    └── example-hello-cli/           # full Typer reference
        ├── SKILL.md
        ├── pyproject.toml
        └── src/hello_cli/           # all 10 principles in code

Standalone usage

Already have a CLI? Jump straight to Audit

The audit rubric works on its own — point it at your existing CLI, get an 11-dimension score with a gap list and fix recommendations. No need to walk through Phases 1–4.

Why the Skill doesn't ship a codegen script

OpenAPI shapes vary too much for any auto-converter to be enough

Justin Poehnelt himself hand-designs each command at Google. The value here is the prescriptive workflow, not a spec → CLI compiler.

The 12 primary sources

Every claim above traces back to one of the sources below. Each is tagged [Official] or [Third-party], with original URL and publication date.

[01] 2025-09-11 · Official
[02] 2025-09-29 · Official
[03] 2024-12-19 · Official
[05] 2026-03-04 · Google DevRel
[06] 2026-03-21 · benchmark
[07] 2026-04-05 · guide
[08] 2026-05-07 · skill
[09] 2026-03-25 · implementation
[10] 2026-03-08 · field report
[11] CWC 2026 Session #14: Tool, Skill, or Subagent? · 2026-05-23 · Official
[12] 2026-02-27 · third-party