Pointing an LLM at the warehouse and letting it query sounds like the obvious move. But at thousands of employees, hundreds of PB, and tens of thousands of tables, the real question isn't whether it can return a number. It's whether that number is trustworthy enough to back a board, audit, or KPI decision. Anthropic's answer hinges on one counterintuitive trade-off.
Self-service analytics has been declared "solved" many times over. In practice it has been a slog for years.
Making the data model accessible to less-technical coworkers via wide, denormalized tables leads to overlapping views with inconsistent definitions as the business scales. Ring-fencing environments per team misses the long tail of questions and breeds metric and dashboard bloat. When LLMs arrived, many assumed there was finally a third path: skip SQL, just ask in plain English.
But building it that way has a trap. As Anthropic puts it, pointing Claude at a warehouse and letting agents execute can create a false sense of precision. The setup separates stakeholders from the underlying infrastructure, documentation, and expertise that previously steered them toward carefully curated datasets. The initial elation of liberation from ad-hoc requests turns into dread with one realization: nobody can confirm whether the number is right.
So the question isn't whether you can return a number. At Anthropic's scale (thousands of employees, hundreds of PB, tens of thousands of tables), the real question is whether an AI-computed number is trustworthy enough to directly support board, audit, and KPI decisions. This post is the answer they arrived at after running it on their own warehouse. The headline numbers are concrete: 95% of business analytics queries are now automated via Claude, with ~95% accuracy in aggregate.
"Pointing Claude at a warehouse and letting the agents execute can create a false sense of precision … this setup separates stakeholders from the underlying infrastructure, documentation, and expertise." (Anthropic Data Science & Data Engineering team)
To see where analytics agents struggle, Anthropic offers a useful contrast: compare them to coding agents.
Coding is an open-ended solution space that rewards model creativity, and it comes with natural guardrails: documentation states intent, tests verify the result, and you find out whether code runs by running it. Even if the model hallucinates a wrong approach, the tests catch it.
Analytics is the opposite. For a given business question there is often a single correct answer, computed from a single correct source, and there's no deterministic way to prove the answer is right. A query that runs is not a query that's correct: it may use the wrong table or drop a filter and still return a plausible-looking number. Wrong code throws an error; wrong data does not. It quietly hands you the wrong number.
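A minimal sketch of that point, using an in-memory SQLite table. The `orders` table, its columns, and the rows are invented for illustration; the only thing it shows is that dropping a filter raises no error and still yields a believable total.

```python
import sqlite3

# Toy data, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT, is_test INTEGER);
    INSERT INTO orders VALUES
        (1, 100.0, 'paid',     0),
        (2,  50.0, 'paid',     0),
        (3,  75.0, 'refunded', 0),
        (4, 500.0, 'paid',     1);  -- internal test account
""")

# Governed definition: paid orders only, internal test accounts excluded.
correct = conn.execute(
    "SELECT SUM(amount) FROM orders WHERE status = 'paid' AND is_test = 0"
).fetchone()[0]

# Same question with the filters dropped. It runs without any error.
plausible = conn.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(correct)    # 150.0
print(plausible)  # 725.0
```

Both queries succeed. Nothing in the execution tells you which of the two numbers the company would stand behind.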
In Anthropic's words, once you can map the question to the right entity, "the resulting execution and SQL becomes trivial." That reframes text-to-SQL as the easy part.
This is also why a stronger model doesn't fix it. Finding the right table and getting the metric right are structure problems, not IQ problems. However capable the model, faced with forty subtly different "revenue" tables it cannot know which one the company stands behind. That information isn't in the model; it lives in the organization's governance.
Anthropic attributes the overwhelming majority of wrong answers to three failure modes:
Anthropic splits its agentic data stack into four layers. Each one exists primarily to attack one of the three failure modes.
The most important factor in accuracy is strong foundations. Anthropic stresses that standard practices (dimensional modeling, shift-left testing, freshness and completeness checks) are just as important as ever. What changes is the end user of the data model: no longer a data scientist who can judge correctness, but an agent acting for someone who may not understand the data at all. Hence a hard constraint: results can't require the user to validate correctness, because the user doesn't know.
This layer is aimed at ambiguity. If "revenue" resolves to one governed dataset instead of forty plausible candidates, the problem largely disappears before the agent ever searches. The most effective practice is canonical datasets: curate a small set of single-source-of-truth datasets, clearly owned, consumption-ready, discoverable, then aggressively deprecate the near-duplicates.
canonical/
  fct_orders              ← the one table the company stands behind
    grain: one row = one paid order
    owner: growth-data
    tier:  gold (canonical, consumption-ready)
    rules: paid only / exclude internal test accts / exclude refunds
  dim_customer            ← the one canonical customer dimension
--- deprecated, should no longer be referenced ---
  orders_v2_wide          [DEPRECATED → use fct_orders]
  orders_finance_copy     [DEPRECATED → use fct_orders]
Foundations only hold when enforced by tooling, CI, and mandate: the agent is structurally routed to canonical models first, changes that bypass them fail review, and downstream teams build on the governed layer or explain why not. Anthropic also colocates all data code (modeling, semantic layer, reference docs, dashboard definitions) in a single repo with CI protecting cross-layer integrity, so a modeling change that would break a downstream dashboard gets fixed in the same PR. Finally, metadata is treated as a first-class product: column and table descriptions, metric definitions, grain, lineage, ownership, all maintained with the same rigor as the transformations.
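One way the "changes that bypass them fail review" rule could be enforced is a CI check that scans changed SQL for deprecated table names. The sketch below is an assumption about how such a check might look, not Anthropic's tooling. The `DEPRECATED` registry and `find_deprecated_refs` are hypothetical names, and the table names come from the illustrative e-commerce example above.

```python
import re

# Hypothetical registry: deprecated table -> canonical replacement.
DEPRECATED = {
    "orders_v2_wide": "fct_orders",
    "orders_finance_copy": "fct_orders",
}

def find_deprecated_refs(sql: str) -> list[tuple[str, str]]:
    """Return (deprecated_table, replacement) pairs referenced in the SQL text."""
    hits = []
    for table, replacement in DEPRECATED.items():
        # Whole-word match so that e.g. "fct_orders" never matches "orders_v2_wide".
        if re.search(rf"\b{re.escape(table)}\b", sql, flags=re.IGNORECASE):
            hits.append((table, replacement))
    return hits

changed_sql = "SELECT SUM(amount_paid) FROM orders_v2_wide WHERE country = 'US'"
violations = find_deprecated_refs(changed_sql)
for table, replacement in violations:
    print(f"FAIL: {table} is deprecated, use {replacement}")
```

A real check would parse SQL rather than pattern-match, since a regex also fires on comments and string literals. The sketch only shows the shape of the gate.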
If foundations are the warehouse itself, sources of truth are the reference surfaces the agent consults to navigate it. This layer turns a stakeholder's "weekly active users" into a specific governed entity. Anthropic ranks four, roughly in descending order of trust.
The semantic layer, most trusted. Compiled metric and dimension definitions. If a question maps cleanly to a defined metric, the agent calls a function and gets one number, the same number every other surface in the company produces. Anthropic's agents are structurally required (by skill instruction) to leverage the semantic layer first.
metric: gmv
  label:   "Gross Merchandise Value (GMV)"
  owner:   finance
  sql:     SUM(fct_orders.amount_paid)
  filters: [paid, exclude refunds, exclude internal accts]
  dims:    [date, country, product_category, channel]
  basis:   tax-inclusive / shipping-excluded
# agent asked "US GMV last quarter" → calls gmv(country=US, date=last_quarter)
# → one number, identical to the finance dashboard and the board deck
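To make "the agent calls a function and gets one number" concrete, here is a rough sketch of a metric definition compiling to SQL. It assumes a toy compiler: `Metric`, `compile_metric`, and the filter column names (`status`, `is_refund`, `is_internal`) are hypothetical, and a real semantic layer does far more (joins, time grains, access control).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Metric:
    name: str
    sql: str                  # aggregate expression
    table: str
    filters: tuple[str, ...]  # governed filters, always applied
    dims: tuple[str, ...]     # dimensions a caller may filter on

# Hypothetical encoding of the illustrative gmv metric.
GMV = Metric(
    name="gmv",
    sql="SUM(amount_paid)",
    table="fct_orders",
    filters=("status = 'paid'", "is_refund = 0", "is_internal = 0"),
    dims=("date", "country", "product_category", "channel"),
)

def compile_metric(metric: Metric, **dim_filters: str) -> tuple[str, list[str]]:
    """Build a parameterized query; reject dimensions the metric does not define."""
    unknown = set(dim_filters) - set(metric.dims)
    if unknown:
        raise ValueError(f"{metric.name}: unsupported dimension(s) {sorted(unknown)}")
    where = list(metric.filters) + [f"{dim} = ?" for dim in dim_filters]
    query = f"SELECT {metric.sql} FROM {metric.table} WHERE " + " AND ".join(where)
    return query, list(dim_filters.values())

query, params = compile_metric(GMV, country="US")
print(query)
print(params)
```

The caller never writes the filters. They come from the definition, so every surface that calls `gmv` applies the same ones, and a request for an undefined dimension raises instead of guessing.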
Anthropic records a failed experiment here: bootstrapping the semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs. It produced plausible-looking definitions that encoded the very ambiguities they were trying to eliminate, and was net-negative on evals versus a smaller, human-curated layer. The lesson: generate the documentation with Claude, but have a human own the definition.
The remaining three, in order: lineage and the transformation graph (when the semantic layer doesn't cover a question, table ranking by references lets the agent reason about which governed model to aggregate from); the query corpus (historical SQL, intuitively high-value, but giving the agent raw retrieval over thousands of prior queries moved accuracy by less than a point, see the validation section); and business context, the most underrated layer. An agent that doesn't understand the business answers what was asked, not what was meant: it won't know "the Q2 launch" refers to a specific product, or that a question is being asked because a board meeting is on Thursday. Anthropic pipes in a company knowledge graph (docs, roadmaps, decision logs, org structure) to resolve those ambient references.
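The post does not spell out how "table ranking by references" is computed. One plausible reading is to count how many downstream models read each table in the transformation graph and prefer the most-referenced one. The sketch below assumes that reading; the edge list and model names are invented.

```python
from collections import Counter

# Hypothetical lineage edges: (downstream_model, upstream_table_it_reads).
EDGES = [
    ("rpt_weekly_sales", "fct_orders"),
    ("rpt_finance_close", "fct_orders"),
    ("rpt_marketing_roi", "fct_orders"),
    ("rpt_weekly_sales", "dim_customer"),
    ("scratch_export", "orders_v2_wide"),
]

def rank_by_references(edges):
    """Rank upstream tables by how many downstream models reference them."""
    counts = Counter(upstream for _, upstream in edges)
    return counts.most_common()

ranking = rank_by_references(EDGES)
print(ranking)  # [('fct_orders', 3), ('dim_customer', 1), ('orders_v2_wide', 1)]
```

Reference counts are only a signal. A heavily used legacy table can outrank its governed replacement, which is one reason this surface sits below the semantic layer in trust.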
All four share one failure mode: poor or stale documentation. Claude is exceptionally useful for closing the gap (drafting descriptions, proposing metric docs, flagging undocumented models in CI), but curation and ownership stay with humans.
If sources of truth are the agent's declarative knowledge (what a metric means), a skill is its procedural knowledge: which sources to consult in what order, how to navigate ambiguous data, and what a finished analysis looks like. In Claude Code, a skill is a folder of markdown the agent reads on demand.
This layer adds enormous value. Without skills, Claude's accuracy on these evals never exceeded 21%. With skills it sits consistently above 95% in aggregate, and regularly around 99% in certain domains. Same model; the difference is entirely the skills. That confirms the core thesis: the bottleneck is structure, not the model.
Create skills in pairs. A knowledge skill acts as a thin top-level router: try the semantic layer first, and if there's no coverage, here are ~30 reference files for this domain describing the relevant tables, columns, joins, and gotchas. That router is the answer to retrieval failure, narrowing a million-field warehouse to a few dozen curated files before a query is written. A second skill, the execution skill, encodes the process a senior analyst follows and bundles reusable patterns (retention curves, rate decomposition, funnel analysis).
knowledge-skill (thin router)
  └─ "try the semantic layer first; if no coverage, see domain refs below"
  references/
    orders.md     ← orders domain: tables, columns, joins, gotchas, triggers
    users.md      ← users domain
    marketing.md  ← marketing domain
    … a few dozen domain files; each question loads only the relevant few …
execution-skill (senior-analyst process)
  1. clarify
  2. find sources (via knowledge-skill)
  3. run query
  4. loop through adversarial review sub-agents
  + bundled: retention curves / rate decomposition / funnels
---
name: [warehouse-skill]
description: "IF the user asks to query [the company]'s data warehouse for any
[list of business domains] question — THEN invoke this skill. DO NOT invoke
for [adjacent engineering tasks] or questions with no data-warehouse component."
---
# Semantic Layer (REQUIRED first step)
The governed semantic layer is the mandatory default path for every data question;
raw SQL is the fallback, used only after the semantic-layer path is shown not to cover the ask.
## Required workflow: Load → Discover (always check segments) → Compile + run → Fallback
> Don't bail early. Do NOT fall back to raw SQL on these grounds:
> "[custom date filtering / cohorts]" → covered by time-dimension specs
> "[needs a join]" → the metric layer already encapsulates its joins
> [3-4 more pre-rebutted excuses agents use to skip the semantic layer]
# PART 1 MUST KNOW: red flags / out-of-scope escalation / clarify the decision / entity disambiguation / data-integrity rules
# PART 2 HOW TO DO: technical execution / analysis guide / MANDATORY adversarial SQL review / report with provenance footer
# PART 3 DATA REFERENCES (multi-level index, drill down, load only the relevant few)
references/_index.md ← L0 top-level domain index (read this first)
├─ Users → user/_index.md keywords: signup / profile / DAU / MAU / cohorts
│ └─ user/tables.md L2 table-level: user / device / membership
├─ Transactions → transaction/_index.md keywords: orders / payment / refund / GMV
│ └─ transaction/tables.md
└─ Marketing → marketing/_index.md keywords: campaigns / coupons / push / redemption
metrics/_index.md ← metric routing: DAU/retention → core; GMV/LTV/ROI → kpi
relationships.md ← cross-domain JOIN relationships (load when joining)
The appendix's knowledge-base navigation is single-level (one reference file per domain); at scale it's often built as a multi-level index: a top-level domain index → per-domain _index → table-level docs, narrowing the search space layer by layer. The frontmatter, semantic-layer enforcement, and the PART 1-3 spine follow the original skeleton; the multi-level indexing layer is an extension drawn from large-warehouse practice.
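The drill-down can be sketched as keyword routing over the top-level index. This is an illustration only: in practice the agent reads the markdown index and decides for itself. The dictionaries below are hypothetical in-memory stand-ins for `references/_index.md` and the per-domain files, and `files_to_load` is an invented helper.

```python
# Hypothetical stand-in for references/_index.md (domain -> routing keywords).
TOP_INDEX = {
    "user": {"signup", "profile", "dau", "mau", "cohorts"},
    "transaction": {"orders", "payment", "refund", "gmv"},
    "marketing": {"campaigns", "coupons", "push", "redemption"},
}

# Hypothetical stand-in for the per-domain docs listed in the skeleton.
DOMAIN_FILES = {
    "user": ["user/_index.md", "user/tables.md"],
    "transaction": ["transaction/_index.md", "transaction/tables.md"],
    "marketing": ["marketing/_index.md"],
}

def files_to_load(question: str) -> list[str]:
    """Pick only the reference files whose domain keywords appear in the question."""
    words = {w.strip("?,.").lower() for w in question.split()}
    domains = [d for d, keywords in TOP_INDEX.items() if words & keywords]
    files = [f for d in domains for f in DOMAIN_FILES[d]]
    if len(domains) > 1:
        # Cross-domain question: also load the JOIN relationships doc.
        files.append("relationships.md")
    return files

selected = files_to_load("What was GMV from coupons last month?")
print(selected)
# ['transaction/_index.md', 'transaction/tables.md', 'marketing/_index.md', 'relationships.md']
```

The user domain is never loaded for this question, which is the point of the index: the search space shrinks before any SQL is written.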
Write reference docs for retrieval by an LLM. Describe grain, scope, exclusions, the mechanics of gotchas ("exclude known free-email domains, but keep custom ones like anthropic.com"), and explicit routing triggers, without prescriptive recipes that go stale. Anthropic shares a skeleton in the appendix.
# [Domain] Tables
## Quick Reference
### Business Context: [what this domain means in plain words]
### Entity Grain: [what one row represents]
### Standard Hygiene Filter: [the filter every query here applies]
## Dimensions · Key Tables · Gotchas
- [the wrong-answer modes a senior analyst would warn you about]
## Best Practices / Common Query Patterns · Cross-References
- [default cuts; neighboring domain docs that own adjacent questions]
Treat skill maintenance as a first-class citizen. Skill docs describe a data model that changes daily; without active maintenance they're wrong within weeks. Anthropic watched offline accuracy drift from ~95% at launch to ~65% over a month before treating it as an engineering problem: colocate skill markdown with the transformation models, so the PR that changes a model is the PR that updates its doc, and a code-review hook flags any reporting-model change that doesn't touch a skill file. Roughly 90% of data-model PRs now include a skill change in the same diff. They also ensure a consistent experience across surfaces: the same skill gives the same answer in Slack, the IDE, a dashboard tool, and standalone sessions, via one canonical source and auto-sync on merge (plugin marketplace, cloud-storage blobs, MCP resources).
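The code-review hook described above could be as small as a check over the changed paths in a PR. The sketch below assumes a hypothetical repo layout (`models/reporting/` for reporting models, `skills/` for skill markdown); neither path comes from the post.

```python
# Hypothetical path conventions; adjust to the real repo layout.
MODEL_PREFIX = "models/reporting/"
SKILL_PREFIX = "skills/"

def needs_skill_update(changed_paths: list[str]) -> bool:
    """Flag a PR that changes a reporting model without touching any skill file."""
    touches_model = any(
        p.startswith(MODEL_PREFIX) and p.endswith(".sql") for p in changed_paths
    )
    touches_skill = any(
        p.startswith(SKILL_PREFIX) and p.endswith(".md") for p in changed_paths
    )
    return touches_model and not touches_skill

print(needs_skill_update(["models/reporting/fct_orders.sql"]))  # True
print(needs_skill_update(
    ["models/reporting/fct_orders.sql", "skills/references/orders.md"]
))  # False
```

A flag rather than a hard failure fits the description: some model changes genuinely need no doc change, and the reviewer decides.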
Validation is how you find out which of the three failure modes is still leaking through. Three parts.
Offline evals. Simple question/answer pairs, like offline testing for an ML model. Anthropic deploys dashboard-based evals (Claude-generated, human-validated) and long-tail evals (feed Claude business context, have it generate plausible questions), and harvests every stakeholder correction as a candidate eval. Key practices: anchor ground truth to a snapshot date so it can't drift; store results like telemetry in a warehouse table (skill version, git SHA, model ID, pass/fail, tokens, wall-clock) so "did that change help?" becomes a query; gate launches per domain (initially ~90%); and target ~100% offline accuracy, with every correct answer also hitting the semantic layer.
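Storing eval results like telemetry means the launch gate becomes a query. A small sketch with SQLite standing in for the warehouse: the column list follows the fields named above, while the table name, the rows, and `domains_below_gate` are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eval_results (
        domain TEXT, question_id TEXT, skill_version TEXT, git_sha TEXT,
        model_id TEXT, passed INTEGER, tokens INTEGER, wall_clock_s REAL
    )
""")

# Toy rows, invented for illustration only.
rows = [
    ("orders", "q1", "v12", "abc123", "model-a", 1, 1800, 21.0),
    ("orders", "q2", "v12", "abc123", "model-a", 1, 2100, 25.5),
    ("users",  "q3", "v12", "abc123", "model-a", 1, 1500, 18.0),
    ("users",  "q4", "v12", "abc123", "model-a", 0, 2600, 40.2),
]
conn.executemany("INSERT INTO eval_results VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

GATE = 0.90  # per-domain launch gate, matching the initial ~90% in the text

def domains_below_gate(conn, git_sha, gate=GATE):
    """Return (domain, pass_rate) for every domain under the gate at this commit."""
    cur = conn.execute(
        """
        SELECT domain, AVG(passed) AS pass_rate
        FROM eval_results
        WHERE git_sha = ?
        GROUP BY domain
        HAVING AVG(passed) < ?
        ORDER BY domain
        """,
        (git_sha, gate),
    )
    return cur.fetchall()

blocked = domains_below_gate(conn, "abc123")
print(blocked)  # [('users', 0.5)]
```

Because the git SHA and skill version sit on every row, comparing two commits is the same query run twice.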
Ablation. Every structural skill decision is made by holding the eval set fixed and varying one component. Anthropic's most useful ablation was a negative one: it gave the agent direct grep access to thousands of SQL files and verified in transcripts that it read them before every answer, yet accuracy moved by less than a point. For the questions it got wrong, the answer was in the corpus about 80% of the time, but "answer present" didn't predict "now gets it right." The information was there, the agent saw it, and it still didn't use it. The bottleneck wasn't access to prior work, it was structure, mapping a question to the right entity. That insight redirected months of roadmap.
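In code, "hold the eval set fixed and vary one component" is a paired comparison over the same question IDs. The sketch below uses made-up pass/fail results and hypothetical function names. It also reports which questions flipped, since a near-zero net change can hide movement in both directions.

```python
def accuracy(results: dict[str, bool]) -> float:
    """Share of eval questions answered correctly."""
    return sum(results.values()) / len(results)

def ablation_delta(baseline: dict[str, bool], variant: dict[str, bool]):
    """Compare two runs over the SAME eval set, varying one component."""
    if baseline.keys() != variant.keys():
        raise ValueError("ablation requires the same eval set in both runs")
    fixed = sorted(q for q in baseline if variant[q] and not baseline[q])
    broken = sorted(q for q in baseline if baseline[q] and not variant[q])
    return accuracy(variant) - accuracy(baseline), fixed, broken

# Toy results, invented for illustration: question id -> answered correctly?
baseline = {"q1": True, "q2": False, "q3": True, "q4": False}
variant = {"q1": True, "q2": True, "q3": False, "q4": False}

delta, fixed, broken = ablation_delta(baseline, variant)
print(delta, fixed, broken)  # 0.0 ['q2'] ['q3']
```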
Online validation. An adversarial-review skill aggressively challenges every assumption behind a candidate answer (+6% accuracy, at the cost of +32% tokens and +72% latency, see the next section). Every response carries a provenance footer stating source tier, freshness, and owner. Passive monitoring tracks the share of queries resolving through the semantic layer and the share of responses with correction language. A scheduled agent scans channels every few hours for corrections, drafts a one-line doc fix, and opens a PR. The fix path is deliberately boring: edit markdown, merge, auto-sync.
> Source: semantic layer | governed table | raw exploration
> Confidence: [tier]   Reviewed: [reviewer ✓, round N]
> Freshness: [max date in the data]   Owner: [owning team]
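Rendering that footer from structured fields keeps it consistent across surfaces. A minimal sketch, assuming hypothetical field names and example values; only the three source tiers and the footer layout come from the template above.

```python
from dataclasses import dataclass
from datetime import date

SOURCE_TIERS = ("semantic layer", "governed table", "raw exploration")

@dataclass
class Provenance:
    source: str        # one of SOURCE_TIERS
    confidence: str    # tier label; the values used here are hypothetical
    reviewer: str
    review_round: int
    freshness: date    # max date present in the data
    owner: str

    def footer(self) -> str:
        """Render the three-line provenance footer, rejecting unknown source tiers."""
        if self.source not in SOURCE_TIERS:
            raise ValueError(f"unknown source tier: {self.source!r}")
        return (
            f"> Source: {self.source}\n"
            f"> Confidence: {self.confidence}   "
            f"Reviewed: {self.reviewer} ✓, round {self.review_round}\n"
            f"> Freshness: {self.freshness.isoformat()}   Owner: {self.owner}"
        )

# Example values are invented for illustration.
example = Provenance("semantic layer", "high", "sql-reviewer", 2, date(2026, 6, 1), "finance")
print(example.footer())
```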
The one thing none of this fully catches is the silent failure: the answer is wrong, but looks plausible and is used without objection. Anthropic's mitigations are the provenance footer, explicit human sign-off on anything leadership-bound, and a standing eval for each domain's top KPIs checked daily against the blessed dashboard. They admit they don't have a robust solution yet.
The most striking part of the stack is a counterintuitive trade-off. It hides inside that phrase "structurally required to leverage the semantic layer first," yet it carries the whole post.
The usual text-to-SQL approach is to give the model enough context and then let it write the final SQL; the stronger the model, the better. Anthropic doesn't do that. Its agent is structurally forced by skill instruction: if a question maps to a defined metric in the semantic layer, it must call that metric function rather than assemble its own SQL. The model is not allowed to improvise at the final computing step; raw SQL is only a fallback.
Why tie its own hands? Because what it wants isn't "probably right," it's "wrong in a way you can see." For the same question, "US GMV last quarter," the two paths end very differently:
That's the heart of it. A system that errors out fails explicitly: you know on the spot that it can't answer, and you can escalate or clarify. A system that always returns a number fails silently, dressing up "I don't know" as "here's the answer," which is exactly the most dangerous thing in a high-stakes decision. When a number is headed for a board, an audit, or a KPI, "looks right but isn't" is far more destructive than "I can't answer that."
So the load-bearing call, in one line: use a deterministic semantic layer to force errors from silent into explicit. Model creativity is an asset when writing code and a liability at the final computing step, where there's a single correct answer and no test can falsify it on the spot. Anthropic's choice is to switch the creativity off at that step.
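The contrast can be shown in a few lines. In this sketch a fuzzy string matcher stands in for the free-form path. That is a loud simplification, since an LLM is not a string matcher, but it shares the relevant property of always producing something. All names (`GOVERNED_METRICS`, `WAREHOUSE_TABLES`, `MetricNotDefined`) are hypothetical.

```python
import difflib

# Hypothetical governed metrics and a hypothetical pile of warehouse tables.
GOVERNED_METRICS = {"gmv": "SUM(fct_orders.amount_paid)"}
WAREHOUSE_TABLES = ["orders_v2_wide", "orders_finance_copy", "fct_orders", "revenue_tmp"]

class MetricNotDefined(LookupError):
    """Raised when a question does not map to a governed metric."""

def governed_lookup(metric_name: str) -> str:
    """Deterministic path: an undefined metric is an explicit error."""
    try:
        return GOVERNED_METRICS[metric_name.lower()]
    except KeyError:
        raise MetricNotDefined(f"no governed metric named {metric_name!r}") from None

def freeform_guess(term: str) -> str:
    """Stand-in for the free-form path: always returns SOME table, however poor the match."""
    return difflib.get_close_matches(term, WAREHOUSE_TABLES, n=1, cutoff=0.0)[0]

guess = freeform_guess("net revenue")  # some table comes back, right or wrong
try:
    governed_lookup("net revenue")
    outcome = "answered"
except MetricNotDefined as err:
    outcome = f"explicit failure: {err}"

print(guess)
print(outcome)
```

The first path hands back a table name with no signal about whether it is the right one. The second stops and says so, which is the moment a human can clarify or escalate.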
"With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message." (dbt, "Semantic Layer vs. Text-to-SQL"; vendor source, see notes)
Anthropic speaks from its own internal system, and those numbers can't be independently verified. But the direction, that the structural gap is real and a deterministic semantic layer closes it, is corroborated by a body of 2026 industry evidence.
A caveat up front: most of these benchmarks come from semantic-layer and governance vendors (dbt, Looker, Atlan). By this site's standard they are vendor blogs, the least trustworthy tier, with a commercial motive for the semantic layer to "win," and the specific percentages were likely cherry-picked. They're used here only for directional judgment, each labeled, none treated as definitive. The direction is credible because three kinds of source agree: motivated vendors, a neutral academic benchmark (Spider 2.0), and Anthropic's own internal numbers.
Put together, one oft-repeated phrase fits: accuracy is a context problem, not a model problem (third-party). That is Anthropic's core thesis, said another way.
Anthropic offers a pragmatic starting point: from zero, a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill capture most of the upside; everything else in the post was added once those were built.
But to run it for the long haul, you have to face the cost. Lined up, the post's hard numbers make it plain.
Together these point to one conclusion: the real cost of this architecture is maintenance, not construction. It isn't a system you build and leave; it's a living thing you feed. The data model changes daily, docs go stale, evals drift, corrections have to be harvested back. Standing up that maintenance machinery is far harder, and far more valuable, than getting the agent working the first time.
Anthropic also poses a few questions to align on, since not every practice fits every team.
The lesson for other teams is clear. When building self-service analytics in the agent era, the real engineering isn't "wire up a model and let it query." It's the three things that come before: govern down to a single answer, make it discoverable, and build an immune system that keeps correcting it. A stronger model can't substitute for these, because what they solve isn't a model-capability problem; it's the organization's data-structure problem.
claude.com · 2026-06-03 · Data Science & Data Engineering team (Chen Chang, Clement Peng, Justin Leder, Johanne Jiao, Josh Cherry). Sections 1–4 and 6 draw all their claims, numbers, architecture, and appendix skeletons from this post. All internal accuracy figures (95% automated / ~95% aggregate accuracy, <21% without skills, >95% with skills and ~99% in certain domains, offline accuracy 95%→65% over a month, adversarial review +6% / +32% tokens / +72% latency, ~90% of PRs carry a skill change, launch gate initially ~90%) are Anthropic's own and cannot be independently verified. Blocks marked "illustrative" reconstruct the data structures with a generic e-commerce example, not any company's internal tables.
dbt (text-to-SQL 32.7%→64.5%; bare schema ~40% vs semantic layer 83%; "failure looks like an error message"), Looker (NL→LookML avg 97% vs NL→SQL avg 80%), and Atlan (522 queries, +38% with governance context) are vendors selling semantic or governance layers, with a commercial motive and likely cherry-picked numbers; used here only for directional judgment. Spider 2.0 (GPT-4o 86%→6%, enterprise peak 38–59%) is a neutral academic benchmark and a non-self-interested cross-check. Pinterest's 40%→90% retrieval hit-rate and "accuracy is a context problem, not a model problem" are third-party, neutral-leaning. The direction is credible because vendor benchmarks, the neutral academic benchmark, and Anthropic's internal numbers all agree; no single percentage should be taken as definitive.