Stop wrapping hundreds of API endpoints one-to-one. Ship tools agents will actually use.
Agents don't need a GUI, and they don't need a 1:1 wrapper around your REST API. They need workflow-shaped, self-describing, hallucination-hardened CLI commands. Your hundreds of endpoints should consolidate into 30–50 agent-facing commands—that's the right merge ratio.
CLIs aren't "just another interface"—they're a token-efficiency pattern
Anthropic's Claude Code documentation states it directly: CLI tools are the most context-efficient way for agents to interact with external services. This isn't taste—it's a constraint-driven design conclusion. Schema definitions consume context, and every extra round-trip taxes the agent. The AXI project's 425-run benchmark provides the hard numbers.
Claude Code itself ships only four primitive tools (bash / read / write / edit) and lets the agent transact with the world through the shell. That's not minimalism for its own sake—bash is the most universal, most composable interface humans have built. The Agent SDK announcement frames this principle plainly:
AXI Benchmark: five conditions in a head-to-head race
The AXI project (github.com/kunchenguid/axi) ran 425 benchmark trials—17 tasks × 5 conditions × 5 repeats—against Claude Sonnet 4.6, using GitHub operations as the representative workflow.
Success Rate (%)
CLI or MCP? Six factors to decide
The matrix below comes from a DEV Community deep-dive on AI-agent CLIs—useful as an engineering checklist. But the more important judgment comes from Justin Poehnelt: this isn't a binary choice. It's two surfaces over one core.
The pragmatic answer: Justin Poehnelt's pattern is one binary, two surfaces—CLI for execution (where the cost/latency wins live) and MCP for discovery (when the agent can't possibly hold every tool's schema in context at once). Both surfaces are generated from a single schema source so you don't pay double maintenance.
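As a rough illustration of "two surfaces from one schema source", the sketch below derives both a CLI parser and an MCP-style tool definition from a single dictionary. The `SCHEMA` layout, the `orders_list` command, and both helper names are hypothetical and are not taken from Google Workspace's CLI. The only MCP detail assumed is that a tool definition carries a name, a description, and a JSON Schema under `inputSchema`.

```python
import argparse

# Hypothetical single source of truth for one command.
SCHEMA = {
    "name": "orders_list",
    "description": "List orders, newest first.",
    "params": {
        "status": {"type": "string", "enum": ["open", "shipped"], "help": "Filter by status"},
        "limit": {"type": "integer", "default": 20, "help": "Maximum records to return"},
    },
}

PY_TYPES = {"string": str, "integer": int}


def to_cli_parser(schema):
    """Surface 1: build an argparse parser from the schema."""
    parser = argparse.ArgumentParser(prog=schema["name"], description=schema["description"])
    for name, spec in schema["params"].items():
        parser.add_argument(
            f"--{name}",
            type=PY_TYPES[spec["type"]],
            choices=spec.get("enum"),
            default=spec.get("default"),
            help=spec["help"],
        )
    return parser


def to_mcp_tool(schema):
    """Surface 2: build an MCP-style tool definition from the same schema."""
    properties = {}
    for name, spec in schema["params"].items():
        prop = {"type": spec["type"], "description": spec["help"]}
        if "enum" in spec:
            prop["enum"] = spec["enum"]
        properties[name] = prop
    return {
        "name": schema["name"],
        "description": schema["description"],
        "inputSchema": {"type": "object", "properties": properties},
    }
```

Adding a parameter to `SCHEMA` changes both surfaces at once, which is the maintenance saving the pattern is after.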
Ten principles, ordered by ROI
The principles below are synthesized from three authoritative streams: Anthropic's official methodology, Justin Poehnelt's lessons from Google Workspace's CLI (where he is a DevRel engineer), and AXI's benchmark evidence. Each principle is tagged with its source. The first three are the minimum bar: completing Phase 1 alone already makes your CLI usable by existing agents.
Source legend
Anthropic calls this out by name in Writing effective tools for agents:
How to apply: list the top 20 high-frequency tasks your agent has to solve, then design a single-shot command for each. Hundreds of APIs collapsing into 30–50 agent-facing commands is a healthy merge ratio.
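To make "single-shot command" concrete, here is a minimal sketch of one workflow-shaped command that answers "where is this order?" by combining three lookups. The three `get_*` functions are hypothetical stand-ins for separate REST endpoints, and the field names are invented for illustration.

```python
# Hypothetical stand-ins for three separate REST endpoints.
def get_order(order_id):
    return {"id": order_id, "status": "shipped", "customer_id": "c-7"}


def get_shipment(order_id):
    return {"order_id": order_id, "carrier": "UPS", "eta": "2025-01-15"}


def get_refunds(order_id):
    return []


def order_summary(order_id):
    """One agent-facing command instead of three round-trips."""
    order = get_order(order_id)
    shipment = get_shipment(order_id)
    refunds = get_refunds(order_id)
    return {
        "id": order["id"],
        "status": order["status"],
        "carrier": shipment["carrier"],
        "eta": shipment["eta"],
        "refund_count": len(refunds),
    }
```

With a 1:1 wrapper the agent would make three calls and stitch the results together itself. Here it makes one call and gets only the fields the task needs.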
--output json (or --json). No exceptions.

| Capability | What it does | Why it matters |
|---|---|---|
| --fields | Field masks limit returned columns | Protects the context window |
| --stream | NDJSON, one object per line | Streamable, no memory blow-ups |
| auto-detect TTY | Switch to JSON when stdout isn't a TTY | Agents don't need to remember the flag |
| TOON format | A token-efficient alternative | AXI measured ~40% token savings |
AXI's additions: default to 3–4 core fields per record (e.g. id, name, status), not the full 10+ field response. Auto-truncate large text fields and offer --full as an escape hatch. Include pre-computed aggregates (count: 47 of 1203) so the agent doesn't have to count.
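A minimal sketch of default fields, a field mask, a pre-computed aggregate, and TTY auto-detection. The default field list and the function names are assumptions made for illustration, not AXI's implementation.

```python
import sys

DEFAULT_FIELDS = ["id", "name", "status"]  # assumed core fields


def shape_output(records, total, fields=None):
    """Project each record onto the requested fields and attach a count."""
    fields = fields or DEFAULT_FIELDS
    items = [{f: rec[f] for f in fields if f in rec} for rec in records]
    return {"count": f"{len(items)} of {total}", "items": items}


def pick_format(stream=sys.stdout):
    """Use a human-oriented table on a terminal, JSON everywhere else."""
    return "table" if stream.isatty() else "json"
```

A caller that passes `fields=["id", "notes"]` gets exactly those columns. Everyone else gets the three core fields plus the aggregate, so the agent never has to count rows itself.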
schema subcommand returns machine-readable signatures.

The problem is simple: agents can't Google your docs. Doc lookups burn context tokens and the docs themselves go stale.
"gws schema call dumps the full method signature (params, request body, response types, required OAuth scopes) as machine-readable JSON. The agent self-serves without pre-stuffed documentation."
— Justin Poehnelt
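One way to approximate runtime introspection is to keep command signatures in a registry and expose them through a lookup. This is a sketch only: the registry layout and the `orders.list` command are hypothetical, and this is not how gws implements its schema command.

```python
import json

# Hypothetical command registry; names and parameters are illustrative only.
COMMANDS = {
    "orders.list": {
        "description": "List orders, newest first.",
        "params": {
            "status": {"type": "string", "enum": ["open", "shipped", "cancelled"], "required": False},
            "limit": {"type": "integer", "default": 20, "required": False},
        },
    },
}


def schema(command):
    """Return the machine-readable signature for one command, or a structured error."""
    spec = COMMANDS.get(command)
    if spec is None:
        return {"error": "unknown_command", "known": sorted(COMMANDS)}
    return {"command": command, **spec}


def schema_json(command):
    """Serialize the signature the way a schema subcommand might print it."""
    return json.dumps(schema(command), indent=2)
```

Because the signature lives next to the code that implements the command, it cannot drift the way external documentation does.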
Help-text rules for agent friendliness: the first 2–3 lines must be real usage examples (agents learn faster from examples than from flag descriptions); mark required vs optional clearly; show enum values for constrained flags (--format json|table|csv); and keep per-command help under 50 lines so the signal isn't diluted.
| Failure mode | Human | Agent | Defense |
|---|---|---|---|
| Path traversal ../../.ssh | almost never | yes (confuses path segments) | canonicalize + sandbox to CWD |
| Control characters | copy-paste | generates invisible chars | reject ASCII < 0x20 |
| Embedded query in resource ID | no | yes (fileId?fields=name) | reject ? and # in IDs |
| Double URL-encoding | almost never | common (%2e%2e for ..) | reject % in resource names |
| Oversized inputs | rare | long context text passed as arg | enforce length limits |
Anthropic's reinforcement, lifted from an example AGENTS.md: "This CLI is frequently invoked by AI/LLM agents. Always assume inputs can be adversarial."
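The defenses in the table can be sketched as two small validators. The length limit, the exception type, and the function names are assumptions for illustration. The checks themselves follow the table's rows.

```python
import os

MAX_LEN = 256  # assumed length limit


class InputError(ValueError):
    """Raised when an argument looks hallucinated or adversarial."""


def validate_resource_id(value):
    """Reject oversized IDs, control characters, and ?, #, % in resource IDs."""
    if len(value) > MAX_LEN:
        raise InputError("resource ID too long")
    if any(ord(ch) < 0x20 for ch in value):
        raise InputError("control characters are not allowed")
    for bad in ("?", "#", "%"):
        if bad in value:
            raise InputError(f"character {bad!r} is not allowed in resource IDs")
    return value


def resolve_in_sandbox(user_path, root):
    """Canonicalize user_path and refuse anything that escapes root."""
    root = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, user_path))
    if os.path.commonpath([root, target]) != root:
        raise InputError("path escapes the working directory")
    return target
```

Canonicalizing before the comparison matters: a plain string prefix check would still let `../` segments and symlinks through.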
--dry-run is mandatory for writes

Why it's non-negotiable: agents naturally use --dry-run to verify their understanding; the cost asymmetry is huge (a hallucinated argument in dry-run wastes one call, but executed it can lose data); and in human-in-the-loop scenarios, the agent can show dry-run output to a human for confirmation.
Related rule: no interactive "Are you sure? (y/n)" prompts—agents can't answer stdin prompts. Destructive commands should require an explicit --confirm flag instead.
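A minimal sketch of a destructive command that supports --dry-run and demands an explicit --confirm rather than prompting on stdin. The `mycli delete` name, the output keys, and the choice of exit code 3 for a missing confirmation are assumptions.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="mycli delete")  # hypothetical command
    parser.add_argument("resource_id")
    parser.add_argument("--dry-run", action="store_true", help="show what would happen, change nothing")
    parser.add_argument("--confirm", action="store_true", help="required to actually delete")
    return parser


def delete_command(argv):
    """Return (exit_code, payload) without ever reading from stdin."""
    args = build_parser().parse_args(argv)
    plan = {"action": "delete", "resource_id": args.resource_id}
    if args.dry_run:
        return 0, {"dry_run": True, "would": plan}
    if not args.confirm:
        return 3, {
            "error": "confirmation_required",
            "suggestion": "re-run with --confirm, or preview with --dry-run",
        }
    # A real implementation would perform the deletion here.
    return 0, {"dry_run": False, "done": plan}
```

The dry-run payload is also what an agent can show to a human reviewer before re-running with --confirm.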
- Env var INTERNAL_API_TOKEN
- Credential file at ~/.config/.../credentials.json
- Service accounts injected into the container
- System keychain (for human convenience)
- "Press Enter to continue"
- Browser OAuth consent screens
- Interactive password prompts on stdin
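The non-interactive options listed above can be resolved in a fixed order with no prompt anywhere in the path. This is a sketch: the `token` key inside the credential file is an assumption, and the credential path is passed in by the caller rather than hard-coded.

```python
import json
import os


def resolve_token(env=None, credentials_path=None):
    """Return a token from the environment or a credential file, else None.

    Never prompts: a missing credential is reported to the caller instead.
    """
    env = os.environ if env is None else env
    token = env.get("INTERNAL_API_TOKEN")
    if token:
        return token
    if credentials_path and os.path.isfile(credentials_path):
        with open(credentials_path, encoding="utf-8") as fh:
            return json.load(fh).get("token")  # "token" key is assumed
    return None
```

When this returns None, the CLI can exit with its config/auth error code so the agent stops and reports the missing credential instead of hanging on a prompt.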
Exit-code conventions (agent-cli-framework)
| Code | Meaning | What the agent should do |
|---|---|---|
| 0 | Success | Continue |
| 1 | Transient error (retryable) | Read stderr, wait, retry |
| 2 | Config / auth error | Stop, report missing credential |
| 3 | Input validation failed | Fix input, retry |
| 4 | Rate limited | Back off, retry |
| 124 | Timeout | Retry or split the request |
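On the calling side, the table translates into a small dispatch and a retry loop. This is a sketch: the action labels, the backoff schedule, and the function name are assumptions layered on top of the codes listed above.

```python
import time

# Mapping taken from the exit-code table above; the labels are illustrative.
ACTIONS = {
    0: "continue",
    1: "retry",
    2: "stop",
    3: "fix_input",
    4: "backoff",
    124: "retry_or_split",
}
RETRYABLE = {1, 4, 124}


def run_with_retry(run, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call run() until it returns a non-retryable code or attempts run out.

    Returns (exit_code, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        code = run()
        if code not in RETRYABLE or attempt == max_attempts:
            return code, attempt
        sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
```

Codes 2 and 3 return immediately, since retrying without fixing the credential or the input would only burn calls.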
--help tells an agent the parameter shape, but it can't communicate invariants and best practices. Ship a SKILL.md alongside your CLI so it travels with the tool.

| Strategy | Implementation | Effect |
|---|---|---|
| Pagination | --limit 20 + --page-token | Prevents one-shot table dumps |
| Field masks | --fields "id,name,status" | Cuts payload by 50–80% |
| Truncation + hint | Truncate large fields, include [truncated, 4.2KB total, use --full] | Agent expands on demand |
| Pre-computed aggregates | Output carries total: 1203, shown: 20, filtered: 47 | Fewer follow-up calls |
| Token-efficient format | TOON / compact JSON | ~40% savings (AXI) |
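Two of the strategies in the table, pagination and truncation with a hint, fit in a few lines. In this sketch the page token is simply a stringified offset, which is an assumption made to keep the example short (real services usually issue opaque tokens). The 200-character threshold is likewise illustrative.

```python
def paginate(items, limit=20, page_token=None):
    """Return one page plus the aggregates an agent needs to decide what to do next."""
    start = int(page_token) if page_token else 0
    page = items[start:start + limit]
    end = start + len(page)
    return {
        "total": len(items),
        "shown": len(page),
        "items": page,
        "next_page_token": str(end) if end < len(items) else None,
    }


def truncate_with_hint(text, max_chars=200):
    """Cut long text and say how much was cut and how to get the rest."""
    if len(text) <= max_chars:
        return text
    size_kb = len(text.encode("utf-8")) / 1024
    return f"{text[:max_chars]} [truncated, {size_kb:.1f}KB total, use --full]"
```

The hint turns truncation from silent data loss into an explicit, cheap decision: the agent sees the size and the flag, and expands only when the task requires it.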
AXI's distinctive idea: contextual disclosure
Append a "next-step suggestion" to every output. The agent doesn't have to know all possible follow-ups in advance—the tool itself nudges it forward:
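A minimal sketch of the idea, assuming a hypothetical `mycli issues` command and a `next_steps` output key. Neither name comes from AXI itself.

```python
def list_issues_output(issues, total):
    """Build list output and append suggestions that depend on what was returned."""
    result = {"count": f"{len(issues)} of {total}", "items": issues}
    hints = []
    if len(issues) < total:
        hints.append("mycli issues list --page-token <token>  # fetch the next page")
    if issues:
        hints.append(f"mycli issues view {issues[0]['id']}  # see full details")
    result["next_steps"] = hints
    return result
```

The suggestions are computed from the result, so an empty list produces no hints and a partial page produces a pagination hint.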
--json for agents.

Why agents prefer JSON: it maps directly to the API schema with zero translation loss; it expresses nested structure that flat flags can't; LLMs are natively excellent at generating JSON; and it sidesteps the shell-quoting and flag-ordering ambiguities that bite agents.
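A sketch of accepting a whole request as one JSON argument and validating it before use. The `mycli create` command, the required `title` field, and the error shapes are hypothetical.

```python
import argparse
import json

REQUIRED = {"title"}  # assumed required fields for this hypothetical command


def parse_create_args(argv):
    """Return (payload, error); exactly one of the two is None."""
    parser = argparse.ArgumentParser(prog="mycli create")
    parser.add_argument("--json", dest="payload", required=True, help="request body as a JSON object")
    args = parser.parse_args(argv)
    try:
        payload = json.loads(args.payload)
    except json.JSONDecodeError as exc:
        return None, {"error": "invalid_json", "detail": str(exc)}
    if not isinstance(payload, dict):
        return None, {"error": "invalid_payload", "suggestion": "pass a JSON object"}
    missing = sorted(REQUIRED - set(payload))
    if missing:
        return None, {"error": "missing_fields", "fields": missing}
    return payload, None
```

Nested values such as lists of labels pass through unchanged, which is awkward to express with flat flags.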
Three phases, ten steps: from minimum bar to agent-native
Sequenced by ROI. Phase 1's three steps fit in two weeks and let existing agents start using your CLI immediately. Prioritize the highest-frequency workflows first—don't try to convert everything at once.
Minimum-viable retrofit
Make existing agents usable today. Immediate, visible payoff.
- 1. --output json on every command → agents can parse
- 2. Input validation + hardening → block hallucination escapes
- 3. Structured + actionable errors → agents can self-correct
Agent-Optimized
Self-service introspection plus a safety net for writes.
- 4. Add schema / --describe → runtime introspection
- 5. Add --fields and --limit → context protection
- 6. Add --dry-run → write-side safety net
Agent-Native
Move from 1:1 wrappers to workflow-shaped commands.
- 7. Ship SKILL.md / CONTEXT.md → encode the implicit knowledge
- 8. Merge commands by workflow → fewer tools, higher success
- 9. Add an MCP surface (if needed) → cover no-shell environments
- 10. Eval-driven optimization → let Claude tune descriptions
"More tools = stronger agent" is wrong
Will Steuk (Anthropic, Code with Claude 2026 Session #14) walked through a concrete agent regression-and-recovery story. Three numbers carry the entire insight:
The insight: twelve custom tools chewed up so much context with their schema definitions that the agent had less room to think. Cutting back to three primitives plus on-demand Skills made the agent smarter, not weaker.
Will's tool-design priority
- Start with human-like primitives (bash / read / write)—general-purpose capability
- Add custom tools only when needed—when primitives can't do it efficiently
- MCP is the last consideration—only when there's no shell, or you genuinely need discovery
Single binary, multiple surfaces
Justin Poehnelt's pattern from Google Workspace's CLI: one core, many surfaces. CLI and MCP are generated from the same schema source, which keeps you from paying double maintenance.
"Most teams can't afford to maintain two separate tools." — generating multiple surfaces from a single schema source is how you escape the double-maintenance trap.
Agent-Readiness Score (agent-cli-builder)
The agent-cli-builder Skill provides an 11-dimension rubric for grading how agent-friendly an existing CLI is:
Common mistakes, paired with the fix
These seven anti-patterns get called out repeatedly across Anthropic, Justin Poehnelt, and AXI. If your CLI hits three or more, agent invocations will reliably get stuck—but each has a clear remediation.
Tool sprawl, agent struggles to choose, schema definitions devour context
Thousands of tokens per call; one response can blow the context window
Fix: --fields + --full escape hatch
Agents readily hallucinate UUIDs (similar-looking strings, off by one)
"Are you sure? (y/n)": agents can't answer stdin prompts
Fix: --confirm or two-step --dry-run + --apply flags
Every doc lookup costs context; docs go stale and the agent doesn't know
Fix: --describe / schema introspection at runtime, so docs travel with the code
"java.lang.NullPointerException at OrderService.java:847": nothing the agent can act on
Fix: suggestion field that tells the agent what to try next
~200–500 tokens per tool definition, so dozens of tools eat the context budget
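For the stack-trace anti-pattern in the list above, a structured error with a suggestion field might look like the sketch below. The error codes, key names, the `mycli orders` command, and the use of exit code 3 are assumptions.

```python
def error_payload(code, message, suggestion, retryable=False):
    """Wrap a failure in a shape an agent can branch on."""
    return {
        "error": {
            "code": code,
            "message": message,
            "retryable": retryable,
            "suggestion": suggestion,
        }
    }


def lookup_order(order_id, known_ids):
    """Return (exit_code, payload); unknown IDs get guidance, not a stack trace."""
    if order_id not in known_ids:
        return 3, error_payload(
            "order_not_found",
            f"no order with ID {order_id!r}",
            "run 'mycli orders list --fields id' to see valid IDs",
        )
    return 0, {"id": order_id}
```

This also helps with hallucinated identifiers: the agent is pointed at the command that lists real IDs instead of guessing again.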
Let Claude tune the tool descriptions for you
Anthropic's recommended methodology: don't hand-tune tool descriptions after writing the prototype. Instead, run evals, feed failed transcripts to Claude, and let it propose refinements. On held-out test sets, the result outperformed hand-written tools.
The shortest path from "hundreds of APIs" to "agents that actually work"
The situation
- Hundreds of internal APIs; agents can't use them efficiently
- Goal: improve agent interaction quality via a CLI surface
- Default 1:1 wrapping just relocates the problem to the CLI layer
The recommendation
- Don't translate 1:1. Merge by business workflow—aim for 30–50 agent-facing commands
- Retrofit incrementally. Phase 1 ships in two weeks; existing agents become usable on day one
- Prioritize. Convert high-frequency workflows first, not everything at once
- Two surfaces. CLI for execution, MCP for discovery
- Eval-driven. Once it's live, let Claude tune the tool descriptions
Expected ROI (per AXI benchmarks)
Talk is cheap — we shipped these 10 principles as a Skill
After laying out the principles, the benchmark, the anti-patterns, and the audit rubric, we figured: words alone don't ship. So we packaged the entire 5-phase workflow as a Claude Code Skill — with a hard merge-ratio gate, a Python+Typer template, an audit rubric callable on its own, and a fully runnable reference implementation. Open-sourced under MIT-0 in aws-samples.
API → Agent-Friendly CLI
A 5-phase prescriptive wizard that forces workflow merging before any code generation — the single highest-leverage step in this domain, and the one users skip most often. OpenAPI specs are seductive, so the Skill makes "merge ratio < 3:1" a hard gate that blocks progression until the design is properly merged.
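The gate reduces to one division. The sketch below assumes the merge ratio is the number of source endpoints divided by the number of agent-facing commands. The function names are hypothetical and the Skill's actual check may be computed differently.

```python
def merge_ratio(endpoint_count, command_count):
    """Endpoints per agent-facing command."""
    if command_count <= 0:
        raise ValueError("command_count must be positive")
    return endpoint_count / command_count


def passes_merge_gate(endpoint_count, command_count, minimum=3.0):
    """Block progression while the design is still close to a 1:1 wrapper."""
    return merge_ratio(endpoint_count, command_count) >= minimum
```

For example, 300 endpoints merged into 40 commands gives 7.5:1 and passes, while 300 endpoints exposed as 150 commands gives 2:1 and is sent back for more merging.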
Already have a CLI? Jump straight to Audit
The audit rubric works on its own — point it at your existing CLI, get an 11-dimension score with a gap list and fix recommendations. No need to walk through Phases 1–4.
OpenAPI shapes vary too much for any auto-converter to be enough
Justin Poehnelt himself hand-designs each command at Google. The value here is the prescriptive workflow, not a spec → CLI compiler.
The 12 primary sources
Every claim above traces back to one of the sources below. Each is tagged [Official] or [Third-party], with original URL and publication date.