Designing Agent-Friendly CLIs

§1 · Why a CLI

CLIs aren't "just another interface"—they're a token-efficiency pattern

Anthropic's Claude Code documentation states it directly: CLI tools are the most context-efficient way for agents to interact with external services. This isn't taste—it's a constraint-driven design conclusion. Schema definitions consume context, and every extra round-trip taxes the agent. The AXI project's 425-run benchmark provides the hard numbers.

"CLI tools are the most context-efficient way to interact with external services." — code.claude.com/docs/en/best-practices

Claude Code itself ships only four primitive tools (bash / read / write / edit) and lets the agent transact with the world through the shell. That's not minimalism for its own sake—bash is the most universal, most composable interface humans have built. The Agent SDK announcement frames this principle plainly:

"The key design principle behind the Claude Agent SDK is to give your agents a computer, allowing them to work like humans do." — Anthropic, "Building agents with the Claude Agent SDK" (2025-09-29)

AXI Benchmark: five conditions in a head-to-head race

The AXI project (github.com/kunchenguid/axi) ran 425 benchmark trials—17 tasks × 5 conditions × 5 repeats—against Claude Sonnet 4.6, using GitHub operations as the representative workflow.

Success rate (425 runs · Claude Sonnet 4.6 · Mar 2026)

Condition	Success rate
gh-axi (winner)	100%
GitHub MCP	87%
gh (native CLI)	86%
MCP + Code Mode	84%
MCP + ToolSearch	82%

Metric	gh-axi	Baseline	Difference
Avg cost per task	$0.050	$0.148	66% cheaper vs MCP
Avg duration	15.7s	34.2s	54% faster
Avg turns	3	6	half the round-trips

Bottom line: an agent-optimized CLI dominates MCP on success rate, cost, and latency simultaneously. The reasons are mechanical—each MCP tool definition costs ~200–500 tokens, the discovery/call dance adds round-trips, and the shell lets agents pipe operations together in a single bash invocation.
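The headline percentages can be re-derived from the figures quoted above. A quick arithmetic check, using only numbers that appear in this section (the variable and function names are illustrative):

```python
# Figures quoted above from the AXI benchmark summary.
TASKS, CONDITIONS, REPEATS = 17, 5, 5
COST_AXI, COST_BASELINE = 0.050, 0.148   # average USD per task
SECS_AXI, SECS_BASELINE = 15.7, 34.2     # average seconds per task


def percent_reduction(new: float, old: float) -> float:
    """Relative reduction of `new` versus `old`, in percent."""
    return (old - new) / old * 100


total_runs = TASKS * CONDITIONS * REPEATS
cost_saving = percent_reduction(COST_AXI, COST_BASELINE)
time_saving = percent_reduction(SECS_AXI, SECS_BASELINE)

print(total_runs)           # 425
print(round(cost_saving))   # 66
print(round(time_saving))   # 54
```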

§2 · Decision Matrix

CLI or MCP? Six factors to decide

The matrix below comes from a DEV Community deep-dive on AI-agent CLIs—useful as an engineering checklist. But the more important judgment comes from Justin Poehnelt: this isn't a binary choice. It's two surfaces over one core.

Factor	Choose CLI	Choose MCP
Number of operations	< 15 commands	50+ tools and the agent has no shell
State across calls	Stateless	Stateful sessions required
Shell access	Yes	No (API-only environment)
Token budget	Tight (CLI is leaner)	Generous
Existing tooling	An existing CLI to wrap or refactor	Greenfield server build
Reliability requirement	High (no extra process to fail)	Server dependency acceptable

The pragmatic answer: Justin Poehnelt's pattern is one binary, two surfaces—CLI for execution (where the cost/latency wins live) and MCP for discovery (when the agent can't possibly hold every tool's schema in context at once). Both surfaces are generated from a single schema source so you don't pay double maintenance.

§3 · Design Principles

Ten principles, ordered by ROI

The principles below are synthesized from three authoritative streams: Anthropic's official methodology, Justin Poehnelt's lessons from the Google Workspace CLI (he is a DevRel engineer at Google), and AXI's benchmark evidence. Each principle is tagged with its source. Three of them form the minimum bar (structured output, input hardening, and semantic errors, which make up Phase 1 in §4): doing only Phase 1 already makes your CLI usable by existing agents.

Source legend

A Anthropic official (highest)

G Justin Poehnelt / Google

AXI AXI benchmark evidence

CWC Code with Claude 2026 #14

#01

Merge by workflow, don't wrap one-to-one

A

Each CLI command should map to a step the agent would naturally take, not to an underlying API endpoint.

Anthropic calls this out by name in Writing effective tools for agents:

"A common error we've observed is tools that merely wrap existing software functionality or API endpoints—whether or not the tools are appropriate for agents."

internal-cli users list
internal-cli events list
internal-cli events create \
  --user-id X --title Y --time Z

internal-cli schedule-meeting \
  --with "Jane" \
  --topic "Q1 review" \
  --duration 30m
# Internal: user lookup → free/busy
#   → create event → send invite

How to apply: list the top 20 high-frequency tasks your agent has to solve, then design a single-shot command for each. Hundreds of APIs collapsing into 30–50 agent-facing commands is a healthy merge ratio.
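A minimal sketch of what sits behind a merged command like schedule-meeting, assuming in-memory stand-ins for the underlying endpoints. The `USERS` and `FREE_SLOTS` tables, the `Meeting` type, and the slot-picking rule are all hypothetical; the point is only that the agent issues one call while the CLI chains the four internal steps.

```python
from dataclasses import dataclass

# Hypothetical in-memory stand-ins for the underlying API endpoints.
USERS = {"Jane": "u-17"}
FREE_SLOTS = {"u-17": ["2026-01-12T10:00", "2026-01-12T15:00"]}


@dataclass
class Meeting:
    user_id: str
    topic: str
    start: str
    duration_min: int
    invite_sent: bool = False


def schedule_meeting(with_name: str, topic: str, duration_min: int) -> Meeting:
    """One agent-facing command that chains four API-level steps."""
    user_id = USERS[with_name]                              # 1. user lookup
    start = FREE_SLOTS[user_id][0]                          # 2. free/busy: first open slot
    meeting = Meeting(user_id, topic, start, duration_min)  # 3. create event
    meeting.invite_sent = True                              # 4. send invite
    return meeting
```

The agent never sees user IDs or free/busy payloads, so none of that intermediate data enters its context.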

#02

Structured output is the floor

G · AXI

Every command must support --output json (or --json). No exceptions.

# Human mode (default)
internal-cli orders recent
# → pretty table

# Agent mode
internal-cli orders recent --output json
# → pure JSON on stdout, logs go to stderr
      

Capability	What it does	Why it matters
`--fields`	Field masks limit returned columns	Protects the context window
`--stream`	NDJSON, one object per line	Streamable—no memory blow-ups
auto-detect TTY	Switch to JSON when stdout isn't a TTY	Agents don't need to remember the flag
TOON format	A token-efficient alternative	AXI measured ~40% token savings

AXI's additions: default to 3–4 core fields per record (e.g. id, name, status), not the full 10+ field response. Auto-truncate large text fields and offer --full as an escape hatch. Include pre-computed aggregates (count: 47 of 1203) so the agent doesn't have to count.

#03

Runtime schema introspection beats static docs

G

The CLI should describe itself—a schema subcommand returns machine-readable signatures.

The problem is simple: agents can't Google your docs. Doc lookups burn context tokens and the docs themselves go stale.

internal-cli schema orders.create
# Returns: param types, required/optional, enums, example, scopes

internal-cli schema --all
# Schema index for tool-discovery flows
      

"Each gws schema call dumps the full method signature—params, request body, response types, required OAuth scopes—as machine-readable JSON. The agent self-serves without pre-stuffed documentation." — Justin Poehnelt

Help-text rules for agent friendliness: the first 2–3 lines must be real usage examples (agents learn faster from examples than from flag descriptions); mark required vs optional clearly; show enum values for constrained flags (--format json|table|csv); and keep per-command help under 50 lines so the signal isn't diluted.

#04

Input hardening — defend against hallucination

G

Agents are not trusted operators. Humans typo, agents hallucinate—the failure modes are completely different.

Failure mode	Human	Agent	Defense
Path traversal `../../.ssh`	almost never	yes (confuses path segments)	canonicalize + sandbox to CWD
Control characters	copy-paste	generates invisible chars	reject ASCII < 0x20
Embedded query in resource ID	no	yes (`fileId?fields=name`)	reject `?` and `#` in IDs
Double URL-encoding	almost never	common (`%2e%2e` for `..`)	reject `%` in resource names
Oversized inputs	rare	long context text passed as arg	enforce length limits

def validate_input(value: str, field_type: str) -> str:
    """Every agent-supplied input passes through here."""
    reject_control_chars(value)
    reject_path_traversal(value)
    reject_embedded_query_params(value)
    reject_percent_encoding(value)
    enforce_length_limit(value, field_type)
    return value
      

Anthropic's reinforcement, lifted from an example AGENTS.md: "This CLI is frequently invoked by AI/LLM agents. Always assume inputs can be adversarial."
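The helper names in validate_input above are left undefined. A minimal runnable sketch of what they might look like follows, with some assumptions stated up front: the length limits in `MAX_LENGTH` are invented, and path traversal is handled by rejecting any `..` segment rather than by the canonicalize-and-sandbox approach in the table, which would need knowledge of the working directory.

```python
import re
from pathlib import PurePosixPath

# Assumed per-field length limits; tune for your own API.
MAX_LENGTH = {"resource_id": 128, "text": 4096}


def reject_control_chars(value: str) -> None:
    if re.search(r"[\x00-\x1f]", value):  # ASCII < 0x20
        raise ValueError("control characters are not allowed")


def reject_path_traversal(value: str) -> None:
    if ".." in PurePosixPath(value.replace("\\", "/")).parts:
        raise ValueError("path traversal ('..') is not allowed")


def reject_embedded_query_params(value: str) -> None:
    if "?" in value or "#" in value:
        raise ValueError("'?' and '#' are not allowed in identifiers")


def reject_percent_encoding(value: str) -> None:
    if "%" in value:
        raise ValueError("percent-encoded input is not allowed")


def enforce_length_limit(value: str, field_type: str) -> None:
    limit = MAX_LENGTH.get(field_type, 256)
    if len(value) > limit:
        raise ValueError(f"input exceeds {limit} characters")


def validate_input(value: str, field_type: str) -> str:
    """Every agent-supplied input passes through here."""
    reject_control_chars(value)
    reject_path_traversal(value)
    reject_embedded_query_params(value)
    reject_percent_encoding(value)
    enforce_length_limit(value, field_type)
    return value
```

Raising `ValueError` with a specific message keeps the failure actionable; a real CLI would likely translate it into a structured error and the input-validation exit code.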

#05

--dry-run is mandatory for writes

G · AXI

Every mutating operation (create / update / delete) must support --dry-run.

internal-cli orders cancel --order-id 12345 --dry-run
# Output: Would cancel order 12345 (status: processing, amount: $847.00)
# No state change.
      

Why it's non-negotiable: agents naturally use --dry-run to verify their understanding; the cost asymmetry is huge—a hallucinated argument in dry-run wastes one call, but executed it can lose data; and in human-in-the-loop scenarios, the agent can show dry-run output to a human for confirmation.

Related rule: no interactive "Are you sure? (y/n)" prompts—agents can't answer stdin prompts. Destructive commands should require an explicit --confirm flag instead.
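A sketch of how --dry-run and --confirm might interact for a destructive command. The `ORDERS` store and the `cancel_order` function are hypothetical; the behavior shown is that a dry run reports the intended change without mutating anything, and a real run refuses to proceed unless confirmation was passed explicitly.

```python
from dataclasses import dataclass


@dataclass
class Order:
    order_id: str
    status: str
    amount: float


# Hypothetical in-memory store standing in for the orders API.
ORDERS = {"12345": Order("12345", "processing", 847.00)}


def cancel_order(order_id: str, dry_run: bool = False, confirm: bool = False) -> str:
    """Cancel an order; dry_run previews, confirm is required to mutate."""
    order = ORDERS[order_id]
    summary = (f"order {order.order_id} "
               f"(status: {order.status}, amount: ${order.amount:.2f})")
    if dry_run:
        return f"Would cancel {summary}. No state change."
    if not confirm:
        # Never fall back to an interactive prompt; fail with a clear message.
        raise PermissionError("refusing to cancel without --confirm; try --dry-run first")
    order.status = "cancelled"
    return f"Cancelled {summary}"
```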

#06

Non-interactive authentication

G · AXI

Never require a browser-based OAuth redirect or interactive login.

✓ Acceptable

Env var INTERNAL_API_TOKEN
Credential file at ~/.config/.../credentials.json
Service accounts injected into the container
System keychain (for human convenience)

✗ Unacceptable

"Press Enter to continue"
Browser OAuth consent screens
Interactive password prompts on stdin

#07

Semantic error handling

A · G

Errors must be specific and actionable, not opaque codes or stack traces.

{ "error": "ERR_4012", "trace": "at OrderService.java:847..." }

{
  "error": "order_not_found",
  "message": "Order ID 12345 does not exist",
  "suggestion": "Use 'internal-cli orders search --customer X' to find valid order IDs",
  "exit_code": 1
}

"If a tool call raises an error, you can prompt-engineer your error responses to clearly communicate specific and actionable improvements, rather than opaque error codes or tracebacks." — Anthropic, Writing effective tools for agents

Exit-code conventions (agent-cli-framework)

Code	Meaning	What the agent should do
0	Success	Continue
1	Transient error (retryable)	Read stderr, wait, retry
2	Config / auth error	Stop, report missing credential
3	Input validation failed	Fix input, retry
4	Rate limited	Back off, retry
124	Timeout	Retry or split the request
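The exit-code table can be encoded once and reused by every command. In this sketch the `ExitCode` names, the `error_payload` helper, and the extra `retryable` field are illustrative assumptions layered on the conventions above; here "retryable" means the same call may succeed if repeated unchanged.

```python
import json
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    TRANSIENT = 1
    CONFIG_OR_AUTH = 2
    INVALID_INPUT = 3
    RATE_LIMITED = 4
    TIMEOUT = 124


# Codes where repeating the same call unchanged may succeed.
RETRYABLE = {ExitCode.TRANSIENT, ExitCode.RATE_LIMITED, ExitCode.TIMEOUT}


def error_payload(error: str, message: str, suggestion: str, code: ExitCode) -> str:
    """Build the structured error a command would emit before exiting."""
    return json.dumps({
        "error": error,
        "message": message,
        "suggestion": suggestion,
        "exit_code": int(code),
        "retryable": code in RETRYABLE,
    })
```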

#08

Ship a CONTEXT.md / SKILL.md alongside the CLI

G · CWC

--help tells an agent the parameter shape, but it can't communicate invariants and best practices. Ship a SKILL.md alongside your CLI so it travels with the tool.

---
name: internal-orders-cli
version: 2.1.0
---

# Internal Orders CLI — Agent Guide

## Invariants (must hold for every operation)

- ALWAYS use --output json
- ALWAYS use --fields to limit response size
- ALWAYS use --dry-run before any delete/cancel
- NEVER pass customer PII directly — look up by ID first

## Common workflows

### Refund processing
1. orders search --customer-id CID --status disputed
2. orders refund --order-id OID --reason "..." --dry-run
3. Once dry-run output looks correct, drop the flag and execute
      

Will Steuk's CWC 2026 Session #14 demo: he migrated all the business logic out of a 400-line system prompt and into Skills (loaded on demand via progressive disclosure). The system prompt collapsed to 15 lines. Eval pass rate jumped from 62% to 92%—same model, same task suite.

#09

Response-token discipline

A · AXI

Anthropic's hard rule: "For Claude Code, we restrict tool responses to 25,000 tokens by default."

Strategy	Implementation	Effect
Pagination	`--limit 20` + `--page-token`	Prevents one-shot table dumps
Field masks	`--fields "id,name,status"`	Cuts payload by 50–80%
Truncation + hint	Truncate large fields, include `[truncated, 4.2KB total, use --full]`	Agent expands on demand
Pre-computed aggregates	Output carries `total: 1203, shown: 20, filtered: 47`	Fewer follow-up calls
Token-efficient format	TOON / compact JSON	~40% savings (AXI)

AXI's distinctive idea: contextual disclosure

Append a "next-step suggestion" to every output. The agent doesn't have to know all possible follow-ups in advance—the tool itself nudges it forward:

count: 3 of 47
orders[3]{id,status,amount}:
  ORD-001, processing, $847.00
  ORD-002, shipped, $124.50
  ORD-003, disputed, $2100.00
help[1]:
  Run `internal-cli orders detail ORD-003` to see dispute details
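A rough sketch of how truncation hints, a count line, and a next-step suggestion might be produced together. The layout imitates the sample output above but makes no claim to follow any format specification, and the rule "suggest the first disputed order" is a hypothetical heuristic.

```python
def truncate(text: str, limit: int = 80) -> str:
    """Cut long fields and say how to get the rest."""
    if len(text) <= limit:
        return text
    size_kb = len(text.encode("utf-8")) / 1024
    return f"{text[:limit]} [truncated, {size_kb:.1f}KB total, use --full]"


def render_page(orders: list[dict], total: int, limit: int = 20) -> str:
    """Render one page with a count line and a next-step hint."""
    shown = orders[:limit]
    lines = [f"count: {len(shown)} of {total}"]
    lines.append(f"orders[{len(shown)}]{{id,status,amount}}:")
    lines += [f"  {o['id']}, {o['status']}, {o['amount']}" for o in shown]
    # Contextual disclosure: point at the most likely follow-up command.
    disputed = [o["id"] for o in shown if o["status"] == "disputed"]
    if disputed:
        lines.append("help[1]:")
        lines.append(f"  Run `internal-cli orders detail {disputed[0]}` to see dispute details")
    return "\n".join(lines)
```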
      

#10

A raw-JSON payload path

G

When a CLI sits in front of a REST API, ship a passthrough path—bespoke flags for humans, --json for agents.

# Human-friendly mode (bespoke flags)
internal-cli users create --name "Jane" --role admin --team engineering

# Agent passthrough mode (raw API payload)
internal-cli users create --json '{
  "name": "Jane",
  "role": "admin",
  "team": "engineering",
  "permissions": ["read", "write", "admin"],
  "metadata": {"source": "agent-created", "ticket": "JIRA-4521"}
}'
      

Why agents prefer JSON: it maps directly to the API schema with zero translation loss; it expresses nested structure that flat flags can't; LLMs are natively excellent at generating JSON; and it sidesteps the shell-quoting and flag-ordering ambiguities that bite agents.

"Make the raw-payload path a first-class citizen alongside any convenience flags you ship for humans. Most teams can't afford to maintain two separate tools." — Justin Poehnelt

§4 · Implementation Path

Three phases, ten steps: from minimum bar to agent-native

Sequenced by ROI. Phase 1's three steps fit in two weeks and let existing agents start using your CLI immediately. Prioritize the highest-frequency workflows first—don't try to convert everything at once.

PHASE 1 1–2 weeks

Minimum-viable retrofit

Make existing agents usable today. Immediate, visible payoff.

1. --output json on every command → agents can parse
2. Input validation + hardening → block hallucination escapes
3. Structured + actionable errors → agents can self-correct

PHASE 2 2–4 weeks

Agent-Optimized

Self-service introspection plus a safety net for writes.

4. Add schema / --describe → runtime introspection
5. Add --fields and --limit → context protection
6. Add --dry-run → write-side safety net

PHASE 3 4–8 weeks

Agent-Native

Move from 1:1 wrappers to workflow-shaped commands.

7. Ship SKILL.md / CONTEXT.md → encode the implicit knowledge
8. Merge commands by workflow → fewer tools, higher success
9. Add an MCP surface (if needed) → cover no-shell environments
10. Eval-driven optimization → let Claude tune descriptions

§5 · CWC 2026 #14 in numbers

"More tools = stronger agent" is wrong

Will Steuk (Anthropic, Code with Claude 2026 Session #14) walked through a concrete agent regression-and-recovery story. Three numbers carry the entire insight:

Metric	Before	After	What changed
System prompt	400 lines	15 lines	Business logic moved into Skills (progressive disclosure)
Custom tools	12	3	Down to bash / read / write primitives (agent composes via code execution)
Eval pass rate	62%	92%	+30 percentage points (same model, same tasks, different tool design)

The insight: twelve custom tools chewed up so much context with their schema definitions that the agent had less room to think. Cutting back to three primitives plus on-demand Skills made the agent smarter, not weaker.

Will's tool-design priority

Start with human-like primitives (bash / read / write)—general-purpose capability
Add custom tools only when needed—when primitives can't do it efficiently
MCP is the last consideration—only when there's no shell, or you genuinely need discovery

§6 · Reference Architecture

Single binary, multiple surfaces

Justin Poehnelt's pattern from Google Workspace's CLI: one core, many surfaces. CLI and MCP are generated from the same schema source, which keeps you from paying double maintenance.
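A minimal sketch of the single-source idea, assuming a hand-written spec dictionary. The `COMMAND` spec and both generator functions are hypothetical; the CLI surface uses Python's standard `argparse`, and the MCP surface is shown only as a tool definition with `name`, `description`, and a JSON Schema `inputSchema`.

```python
import argparse

# Hypothetical single source of truth for one command.
COMMAND = {
    "name": "orders_cancel",
    "description": "Cancel an order by ID.",
    "params": {
        "order_id": {"type": "string", "required": True, "help": "Order to cancel"},
        "dry_run": {"type": "boolean", "required": False, "help": "Preview only"},
    },
}


def to_argparse(spec: dict) -> argparse.ArgumentParser:
    """CLI surface: derive flags from the shared spec."""
    parser = argparse.ArgumentParser(prog=spec["name"], description=spec["description"])
    for name, p in spec["params"].items():
        flag = "--" + name.replace("_", "-")
        if p["type"] == "boolean":
            parser.add_argument(flag, action="store_true", help=p["help"])
        else:
            parser.add_argument(flag, required=p["required"], help=p["help"])
    return parser


def to_mcp_tool(spec: dict) -> dict:
    """MCP surface: derive a tool definition from the same spec."""
    return {
        "name": spec["name"],
        "description": spec["description"],
        "inputSchema": {
            "type": "object",
            "properties": {n: {"type": p["type"], "description": p["help"]}
                           for n, p in spec["params"].items()},
            "required": [n for n, p in spec["params"].items() if p["required"]],
        },
    }
```

Adding a parameter to `COMMAND` updates both surfaces at once, which is the maintenance saving the pattern is after.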

"Most teams can't afford to maintain two separate tools." — generating multiple surfaces from a single schema source is how you escape the double-maintenance trap.

Agent-Readiness Score (agent-cli-builder)

The agent-cli-builder Skill provides an 11-dimension rubric for grading how agent-friendly an existing CLI is:

01 Structured output (JSON / TOON)

02 Semantic exit codes

03 Input validation depth

04 Schema introspection

05 Help-text quality

06 Dry-run support

07 Non-interactive auth

08 Field masks / pagination

09 Error actionability

10 Skill file presence

11 Multi-surface exposure
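A sketch of how scores across the eleven dimensions might be tallied into a total and a gap list. The 0 to 2 scale per dimension is an assumption made for illustration, since the rubric's actual scoring scale is not given here, and the dimension keys are just snake_case versions of the names above.

```python
DIMENSIONS = [
    "structured_output", "semantic_exit_codes", "input_validation",
    "schema_introspection", "help_text_quality", "dry_run_support",
    "non_interactive_auth", "field_masks_pagination", "error_actionability",
    "skill_file_presence", "multi_surface_exposure",
]
MAX_PER_DIMENSION = 2  # assumed scale: 0 = absent, 1 = partial, 2 = complete


def readiness_report(scores: dict[str, int]) -> dict:
    """Total the per-dimension scores and list every dimension below the maximum."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    total = sum(scores[d] for d in DIMENSIONS)
    gaps = [d for d in DIMENSIONS if scores[d] < MAX_PER_DIMENSION]
    return {"score": total, "max": MAX_PER_DIMENSION * len(DIMENSIONS), "gaps": gaps}
```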

§7 · Anti-patterns

Common mistakes, paired with the fix

These seven anti-patterns get called out repeatedly across Anthropic, Justin Poehnelt, and AXI. If your CLI hits three or more, agent invocations will reliably get stuck—but each has a clear remediation.

Anti-pattern	Why it hurts	Fix
1:1 wrap of every API endpoint	Tool sprawl, agent struggles to choose, schema definitions devour context	Merge by workflow: hundreds of APIs converge on 30–50 agent-facing commands
Returning the full API response	Thousands of tokens per call; one response can blow the context window	Slim defaults (3–4 fields) + --fields + --full escape hatch
Using UUIDs as the canonical identifier	Agents readily hallucinate UUIDs (similar-looking strings, off by one)	Accept names / emails / slugs as identifiers; keep UUIDs internal
Interactive confirmation prompts	"Are you sure? (y/n)": agents can't answer stdin prompts	Use --confirm or two-step --dry-run + --apply flags
Docs only on the web	Every doc lookup costs context; docs go stale and the agent doesn't know	--describe / schema introspection at runtime, so docs travel with the code
Stack-trace error responses	"java.lang.NullPointerException at OrderService.java:847": nothing the agent can act on	Structured error JSON with a suggestion field that tells the agent what to try next
Pre-loading every tool schema	~200–500 tokens per tool definition; dozens of tools eat the context budget	Progressive disclosure or tool search: load only what's needed

§8 · Eval-Driven Optimization

Let Claude tune the tool descriptions for you

Anthropic's recommended methodology: don't hand-tune tool descriptions after writing the prototype. Instead, run evals, feed failed transcripts to Claude, and let it propose refinements. On held-out test sets, the result outperformed hand-written tools.

1. Prototype → hand-test the CLI inside Claude Code
2. Generate evals → prompt-response pairs from real workflows
3. Run evals → programmatic API calls; record success / token usage
4. Analyze transcripts → feed failing sessions back to Claude Code
5. Claude refines → improves descriptions / schemas / errors
6. Held-out check → guards against overfitting

Anthropic's internal result: on held-out test sets for the Slack and Asana tools, Claude's optimized tools outperformed the human-written versions. The gains came from subtle refinements—better namespacing, clearer descriptions, more targeted tool implementations. Most of the lift came from tool-description tuning (wording, examples, edge-case notes), not from changing the underlying implementation.
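The held-out check in the loop above can be sketched in a few lines. Everything here is an illustrative assumption rather than Anthropic's harness: the 30% held-out fraction, the function names, and the rule that a revision is kept only if the held-out pass rate does not drop. `run_task` stands in for whatever executes one eval task and reports success.

```python
import random
from typing import Callable


def split_tasks(tasks: list[str], held_out_fraction: float = 0.3,
                seed: int = 0) -> tuple[list[str], list[str]]:
    """Deterministically split tasks into a tuning set and a held-out set."""
    shuffled = tasks[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[cut:], shuffled[:cut]


def pass_rate(tasks: list[str], run_task: Callable[[str], bool]) -> float:
    """Fraction of tasks the agent completes successfully."""
    return sum(run_task(t) for t in tasks) / len(tasks)


def accept_revision(before: float, after: float) -> bool:
    """Keep a revised tool description only if the held-out score does not drop."""
    return after >= before
```

Failing transcripts from the tuning set go back to Claude for refinement; the held-out set is touched only to decide whether a refinement is kept.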

§9 · Executive Summary

The shortest path from "hundreds of APIs" to "agents that actually work"

The situation

Hundreds of internal APIs; agents can't use them efficiently
Goal: improve agent interaction quality via a CLI surface
Default 1:1 wrapping just relocates the problem to the CLI layer

The recommendation

Don't translate 1:1. Merge by business workflow—aim for 30–50 agent-facing commands
Retrofit incrementally. Phase 1 ships in two weeks; existing agents become usable on day one
Prioritize. Convert high-frequency workflows first, not everything at once
Two surfaces. CLI for execution, MCP for discovery
Eval-driven. Once it's live, let Claude tune the tool descriptions

Expected ROI (per AXI benchmarks)

Metric	Change	Detail
Task success	+14 percentage points	86% → 100% (vs the native gh CLI)
Cost per task	−66%	$0.148 → $0.050 (vs MCP)
Latency	−54%	34.2s → 15.7s

§10 · Reference Implementation

Talk is cheap — we shipped these 10 principles as a Skill

After laying out the principles, the benchmark, the anti-patterns, and the audit rubric, we figured: words alone don't ship. So we packaged the entire 5-phase workflow as a Claude Code Skill — with a hard merge-ratio gate, a Python+Typer template, an audit rubric callable on its own, and a fully runnable reference implementation. Open-sourced under MIT-0 in aws-samples.

⌥ aws-samples/sample-aws-ops-skills-for-agents/api-to-agent-cli

API → Agent-Friendly CLI

A 5-phase prescriptive wizard that forces workflow merging before any code generation. Merging is the single highest-leverage step in this domain, and the one users skip most often. Because OpenAPI specs make 1:1 wrapping seductive, the Skill treats the merge ratio as a hard gate: a design with a ratio below 3:1 cannot progress until it is properly merged.
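The gate itself is simple arithmetic. In this sketch the merge ratio is assumed to mean source API endpoints per agent-facing command, and the function names are hypothetical rather than taken from the Skill.

```python
def merge_ratio(endpoint_count: int, command_count: int) -> float:
    """Endpoints per agent-facing command (assumed definition of the ratio)."""
    if command_count <= 0:
        raise ValueError("command_count must be positive")
    return endpoint_count / command_count


def passes_merge_gate(endpoint_count: int, command_count: int,
                      minimum: float = 3.0) -> bool:
    """True when the design has merged at least `minimum` endpoints per command."""
    return merge_ratio(endpoint_count, command_count) >= minimum
```

For example, 300 endpoints merged into 40 commands gives a ratio of 7.5 and passes, while 100 endpoints spread over 60 commands gives about 1.7 and is blocked.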

Workflow 5 phases · Inventory → Merge ⭐ → Stub → SKILL.md → Audit

References 6 docs · including 5 real-world merge case studies

Templates 4 outputs · one per phase

Example 1 runnable CLI · Python+Typer, smoke-tested across 10 scenarios

api-to-agent-cli/
├── SKILL.md                          # 5-phase wizard entrypoint
├── references/
│   ├── principles-checklist.md       # 10-principle quick check
│   ├── workflow-merge-examples.md    ⭐ 5 real merge case studies
│   ├── antipatterns-quickref.md      # 7 anti-patterns + fixes
│   ├── cli-template-python.md        # Typer how-to guide
│   ├── error-schema.md               # structured error JSON schema
│   └── audit-rubric.md               # 11-dim score (callable standalone)
└── assets/
    ├── inventory.template.md         # Phase 1 output
    ├── commands.spec.template.md     # Phase 2 output
    ├── SKILL.template.md             # Phase 4 output
    ├── audit-report.template.md      # Phase 5 output
    └── example-hello-cli/            # full Typer reference
        ├── SKILL.md
        ├── pyproject.toml
        └── src/hello_cli/            # all 10 principles in code

Python ≥ 3.10 · MIT-0 · cp -r api-to-agent-cli/ ~/.claude/skills/

Standalone usage

Already have a CLI? Jump straight to Audit

The audit rubric works on its own — point it at your existing CLI, get an 11-dimension score with a gap list and fix recommendations. No need to walk through Phases 1–4.

Why the Skill doesn't ship a codegen script

OpenAPI shapes vary too much for any auto-converter to be enough

Justin Poehnelt himself hand-designs each command at Google. The value here is the prescriptive workflow, not a spec → CLI compiler.

§11 · Source Index

The 12 primary sources

Every claim above traces back to one of the sources below. Each is tagged [Official] or [Third-party], with original URL and publication date.

[01] Anthropic: Writing effective tools for agents · 2025-09-11 · Official
[02] Anthropic: Building agents with the Claude Agent SDK · 2025-09-29 · Official
[03] Anthropic: Building effective agents · 2024-12-19 · Official
[04] Claude Code Best Practices · Live · Official
[05] Justin Poehnelt: You Need to Rewrite Your CLI for AI Agents · 2026-03-04 · Google DevRel
[06] AXI: Agent eXperience Interface · 2026-03-21 · benchmark
[07] Agentic CLI Guide · 2026-04-05 · guide
[08] Agent CLI Builder (Skill) · 2026-05-07 · skill
[09] Agent CLI Framework (Rust) · 2026-03-25 · implementation
[10] Starkslab: Build CLI Tools for AI Agents · 2026-03-08 · field report
[11] CWC 2026 Session #14: Tool, Skill, or Subagent? · 2026-05-23 · Official
[12] DEV Community: Writing CLI Tools That AI Agents Actually Want to Use · 2026-02-27 · third-party

Stop wrapping hundreds of API endpoints one-to-one. Ship tools agents will actually use.
