Agentic Analytics · Anthropic self-service data analytics

A data-agent project is, at its core, data governance

Pointing an LLM at the warehouse and letting it query sounds like the obvious move. But at thousands of employees, hundreds of PB, and tens of thousands of tables, the real question isn't whether it can return a number. It's whether that number is trustworthy enough to back a board, audit, or KPI decision. Anthropic's answer hinges on one counterintuitive trade-off.

Accuracy ceiling without skills: 21% → with skills: 95%+ (~99% in some domains)

01 · A problem that looks long solved

The relief turns into dread: nobody can confirm the number is right

Self-service analytics has been "solved" many times over. For years it has been a slog.

Making the data model accessible to less-technical coworkers via wide, denormalized tables leads to overlapping views with inconsistent definitions as the business scales. Ring-fencing environments per team misses the long tail of questions and breeds metric and dashboard bloat. When LLMs arrived, many assumed there was finally a third path: skip SQL, just ask in plain English.

But building it that way has a trap. As Anthropic puts it, pointing Claude at a warehouse and letting agents execute can create a false sense of precision. The setup separates stakeholders from the underlying infrastructure, documentation, and expertise that previously steered them toward carefully curated datasets. The initial elation of liberation from ad-hoc requests turns into dread with one realization: nobody can confirm whether the number is right.

So the question isn't whether you can return a number. At Anthropic's scale (thousands of employees, hundreds of PB, tens of thousands of tables), the real question is whether an AI-computed number is trustworthy enough to directly support board, audit, and KPI decisions. Anthropic's post is the answer the team arrived at after running it on their own warehouse. The headline numbers are concrete: 95% of business analytics queries are now automated via Claude, with ~95% accuracy in aggregate.

"Pointing Claude at a warehouse and letting the agents execute can create a false sense of precision … this setup separates stakeholders from the underlying infrastructure, documentation, and expertise."

(Anthropic Data Science & Data Engineering team)

02 · The core insight

Data is not software: why a stronger model won't fix it

To see where analytics agents struggle, Anthropic offers a useful contrast: compare them to coding agents.

Coding is an open-ended solution space that rewards model creativity, and it comes with natural guardrails: documentation states intent, tests verify the result, and you find out whether code runs by running it. Even if the model hallucinates a wrong approach, the tests catch it.

Analytics is the opposite. For a given business question there is often a single correct answer, computed from a single correct source, and there's no deterministic way to prove the answer is right. A query that runs is not a query that's correct: it may use the wrong table or drop a filter and still return a plausible-looking number. Wrong code throws an error; wrong data does not. It quietly hands you the wrong number.
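A small sketch makes the "runs but wrong" point concrete. It uses Python's built-in `sqlite3` with a toy `orders` table; the table, columns, and values are hypothetical and exist only for illustration. Both queries execute without error, and only one of them applies the governed definition.

```python
import sqlite3

# Hypothetical toy data: every name here is illustrative, not a real schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, amount REAL, status TEXT, is_test INTEGER);
    INSERT INTO orders VALUES
        (1, 100.0, 'paid',     0),
        (2, 250.0, 'paid',     0),
        (3,  80.0, 'refunded', 0),
        (4, 500.0, 'paid',     1);
""")

# The governed definition: paid orders only, internal test accounts excluded.
governed_sql = "SELECT SUM(amount) FROM orders WHERE status = 'paid' AND is_test = 0"

# A plausible improvised query that silently drops both filters.
improvised_sql = "SELECT SUM(amount) FROM orders"

governed_revenue = conn.execute(governed_sql).fetchone()[0]
improvised_revenue = conn.execute(improvised_sql).fetchone()[0]

# Neither query raises; the database cannot tell which number the business meant.
print(governed_revenue, improvised_revenue)
```

Nothing in the second result signals that it includes a refund and a test account, which is the sense in which wrong data does not throw an error.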

Writing & running SQL (the much-discussed text-to-SQL): largely solved · trivial

Finding the right table & definition (mapping a vague question to the one correct entity): the real difficulty · a structure problem, not a model problem

In Anthropic's words, once you can map the question to the right entity, "the resulting execution and SQL becomes trivial." That reframes text-to-SQL as the easy part.

This is also why a stronger model doesn't fix it. Finding the right table and getting the metric right are structure problems, not IQ problems. However capable the model, faced with forty subtly different "revenue" tables it cannot know which one the company stands behind. That information isn't in the model; it lives in the organization's governance.

Anthropic attributes the overwhelming majority of wrong answers to three failure modes:

Failure mode 01

Concept ↔ entity ambiguity

With hundreds of viable options (out of potentially millions of fields), the agent can't pick the right ones. Measuring "active users": which actions count as active? Include fraudulent users? What lookback window?

Failure mode 02

Data staleness

Sources, business definitions, and schemas change constantly. Assets and agent knowledge go stale and start returning subtly wrong answers that still look right.

Failure mode 03

Retrieval failure

The right information is actually in the data model, properly annotated, but given the vastness of the search space the agent simply doesn't find it. Present, yet unused.

03 · The four-layer stack

Each layer plugs one leak

Anthropic splits its agentic data stack into four layers. Each one exists primarily to attack one of the three failure modes.

Data foundations · Shrink the space of plausible entities until there's a single governed answer · Targets: entity ambiguity

Sources of truth · Turn "weekly active users" in a question into a specific governed entity · Targets: concept ↔ entity

Skills · Make sure the agent reliably finds that answer, and uses it correctly · Targets: retrieval failure

Validation · The backstop: finding out which failure mode is still leaking through · Targets: silent failure

Layer 1: Data foundations

The most important factor in accuracy is strong foundations. Anthropic stresses that standard practices (dimensional modeling, shift-left testing, freshness and completeness checks) are just as important as ever. What changes is the end user of the data model: no longer a data scientist who can judge correctness, but an agent acting for someone who may not understand the data at all. Hence a hard constraint: results can't require the user to validate correctness, because the user doesn't know.

This layer is aimed at ambiguity. If "revenue" resolves to one governed dataset instead of forty plausible candidates, the problem largely disappears before the agent ever searches. The most effective practice is canonical datasets: curate a small set of single-source-of-truth datasets (clearly owned, consumption-ready, discoverable), then aggressively deprecate the near-duplicates.

Illustrative · canonical dataset registry (generic e-commerce)

canonical/
  fct_orders            ← the one table the company stands behind
    grain:   one row = one paid order
    owner:   growth-data
    tier:    gold (canonical, consumption-ready)
    rules:   paid only / exclude internal test accts / exclude refunds
  dim_customer          ← the one canonical customer dimension
  --- deprecated, should no longer be referenced ---
  orders_v2_wide        [DEPRECATED → use fct_orders]
  orders_finance_copy   [DEPRECATED → use fct_orders]

Foundations only hold when enforced by tooling, CI, and mandate: the agent is structurally routed to canonical models first, changes that bypass them fail review, and downstream teams build on the governed layer or explain why not. Anthropic also colocates all data code (modeling, semantic layer, reference docs, dashboard definitions) in a single repo with CI protecting cross-layer integrity, so a modeling change that would break a downstream dashboard gets fixed in the same PR. Finally, metadata is treated as a first-class product: column and table descriptions, metric definitions, grain, lineage, ownership, all maintained with the same rigor as the transformations.
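The "changes that bypass them fail review" rule can be sketched as a small check that scans a SQL string for deprecated tables. This is a minimal illustration, not Anthropic's tooling: the `DEPRECATED` mapping reuses the hypothetical names from the illustrative registry above, and the regex only looks at identifiers that follow `FROM` or `JOIN`, so it would miss more exotic SQL.

```python
import re

# Hypothetical registry mirroring the illustrative listing above:
# deprecated table name -> canonical replacement.
DEPRECATED = {
    "orders_v2_wide": "fct_orders",
    "orders_finance_copy": "fct_orders",
}

def find_deprecated_refs(sql: str) -> list[tuple[str, str]]:
    """Return (deprecated_table, replacement) pairs referenced after FROM or JOIN."""
    tables = re.findall(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", sql, flags=re.IGNORECASE)
    hits: list[tuple[str, str]] = []
    for table in tables:
        # Drop any schema qualifier, e.g. analytics.orders_v2_wide -> orders_v2_wide.
        name = table.split(".")[-1].lower()
        if name in DEPRECATED and (name, DEPRECATED[name]) not in hits:
            hits.append((name, DEPRECATED[name]))
    return hits

bad = find_deprecated_refs("SELECT SUM(amount) FROM orders_v2_wide WHERE country = 'US'")
good = find_deprecated_refs("SELECT * FROM fct_orders JOIN dim_customer USING (customer_id)")
print(bad, good)
```

A CI job could fail the build whenever the returned list is non-empty and print the suggested replacement, which keeps the deprecation mandate enforced by tooling rather than by memory.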

Layer 2: Sources of truth

If foundations are the warehouse itself, sources of truth are the reference surfaces the agent consults to navigate it. This layer turns a stakeholder's "weekly active users" into a specific governed entity. Anthropic ranks four, roughly in descending order of trust.

The semantic layer (most trusted). Compiled metric and dimension definitions. If a question maps cleanly to a defined metric, the agent calls a function and gets one number, the same number every other surface in the company produces. Anthropic's agents are structurally required (by skill instruction) to leverage the semantic layer first.

Illustrative · semantic-layer metric definition

metric: gmv
  label:   "Gross Merchandise Value (GMV)"
  owner:   finance
  sql:     SUM(fct_orders.amount_paid)
  filters: [paid, exclude refunds, exclude internal accts]
  dims:    [date, country, product_category, channel]
  basis:   tax-inclusive / shipping-excluded

# agent asked "US GMV last quarter" → calls gmv(country=US, date=last_quarter)
# → one number, identical to the finance dashboard and the board deck
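One way to read the metric definition above is as data that compiles to exactly one query. The sketch below is an assumption-laden illustration, not a real semantic-layer API: the `Metric` class, `compile_metric`, `NoCoverage`, and the filter column names (`status`, `is_refund`, `is_internal`) are all hypothetical, since the listing states the filters only in words.

```python
from dataclasses import dataclass

class NoCoverage(Exception):
    """Raised when the semantic layer cannot answer the request as asked."""

@dataclass(frozen=True)
class Metric:
    name: str
    expression: str
    table: str
    filters: tuple[str, ...]
    dims: frozenset[str]

# Hypothetical column names standing in for "paid, exclude refunds, exclude internal accts".
GMV = Metric(
    name="gmv",
    expression="SUM(amount_paid)",
    table="fct_orders",
    filters=("status = 'paid'", "is_refund = 0", "is_internal = 0"),
    dims=frozenset({"date", "country", "product_category", "channel"}),
)
REGISTRY = {GMV.name: GMV}

def compile_metric(name: str, **dim_filters: str) -> tuple[str, list[str]]:
    """Compile a metric call into one parameterized SQL string plus its bind values."""
    metric = REGISTRY.get(name)
    if metric is None:
        raise NoCoverage(f"metric {name!r} is not defined")
    unknown = set(dim_filters) - metric.dims
    if unknown:
        raise NoCoverage(f"{name} has no dimension(s) {sorted(unknown)}")
    # Governed filters always come first; caller-supplied dimensions are bound, never inlined.
    ordered = sorted(dim_filters)
    where = list(metric.filters) + [f"{dim} = ?" for dim in ordered]
    sql = f"SELECT {metric.expression} FROM {metric.table} WHERE " + " AND ".join(where)
    return sql, [dim_filters[dim] for dim in ordered]

sql, params = compile_metric("gmv", country="US")
print(sql, params)
```

Because the governed filters are attached by the compiler rather than by the caller, the same call always yields the same query, and a request outside the definition raises `NoCoverage` instead of producing an improvised number.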

Anthropic records a failed experiment here: bootstrapping the semantic layer by having an LLM auto-generate metric definitions from raw tables and query logs. It produced plausible-looking definitions that encoded the very ambiguities they were trying to eliminate, and was net-negative on evals versus a smaller, human-curated layer. The lesson: generate the documentation with Claude, but have a human own the definition.

The remaining three, in order:

Lineage and the transformation graph. When the semantic layer doesn't cover a question, ranking tables by how often they are referenced lets the agent reason about which governed model to aggregate from.

The query corpus. Historical SQL is intuitively high-value, but giving the agent raw retrieval over thousands of prior queries moved accuracy by less than a point (see the validation section).

Business context, the most underrated source. An agent that doesn't understand the business answers what was asked, not what was meant: it won't know "the Q2 launch" refers to a specific product, or that a question is being asked because a board meeting is on Thursday. Anthropic pipes in a company knowledge graph (docs, roadmaps, decision logs, org structure) to resolve those ambient references.
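The lineage idea, ranking tables by references, can be sketched as a count over the transformation graph. This is one plausible reading rather than Anthropic's method: the `LINEAGE` graph and its model names are hypothetical, and "references" is taken to mean the number of downstream models that select from a table.

```python
from collections import Counter

# Hypothetical transformation graph: model -> upstream models it selects from.
LINEAGE = {
    "rpt_weekly_sales": ["fct_orders", "dim_customer"],
    "rpt_retention": ["fct_orders", "dim_customer"],
    "rpt_finance_close": ["fct_orders"],
    "tmp_backfill_check": ["orders_v2_wide"],
}

def rank_by_references(lineage: dict[str, list[str]]) -> list[tuple[str, int]]:
    """Rank upstream models by how many downstream models reference them."""
    counts = Counter(up for upstreams in lineage.values() for up in set(upstreams))
    # Sort by count descending, then by name, so ties are deterministic.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

ranking = rank_by_references(LINEAGE)
print(ranking)
```

A heavily referenced model is more likely to be the governed one, so an agent without semantic-layer coverage can prefer the top of this ranking over a rarely used near-duplicate.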

All four share one failure mode: poor or stale documentation. Claude is exceptionally useful for closing the gap (drafting descriptions, proposing metric docs, flagging undocumented models in CI), but curation and ownership stay with humans.

Layer 3: Skills

If sources of truth are the agent's declarative knowledge (what a metric means), a skill is its procedural knowledge: which sources to consult in what order, how to navigate ambiguous data, and what a finished analysis looks like. In Claude Code, a skill is a folder of markdown the agent reads on demand.

This layer is hugely value-additive. Without skills, Claude's accuracy on these evals never exceeded 21%. With skills it sits consistently above 95% in aggregate, and regularly around 99% in certain domains. Same model; the difference is entirely the skills. That confirms the core thesis: the bottleneck is structure, not the model.

Create pairwise skills. A knowledge skill acts as a thin top-level router: try the semantic layer first, and if there's no coverage, here are ~30 reference files for this domain describing the relevant tables, columns, joins, and gotchas. That router is the answer to retrieval failure, narrowing a million-field warehouse to a few dozen curated files before a query is written. A second skill encodes the process a senior analyst follows, and bundles reusable patterns (retention curves, rate decomposition, funnel analysis).

Illustrative · the pairwise-skill pattern

knowledge-skill (thin router)
  └─ "try the semantic layer first; if no coverage, see domain refs below"
     references/
       orders.md       ← orders domain: tables, columns, joins, gotchas, triggers
       users.md        ← users domain
       marketing.md    ← marketing domain
       … a few dozen domain files; each question loads only the relevant few …

execution-skill (senior-analyst process)
  1. clarify  2. find sources (via knowledge-skill)
  3. run query  4. loop through adversarial review sub-agents
  + bundled: retention curves / rate decomposition / funnels

Main warehouse-skill skeleton · from Anthropic's appendix (PART 3 expanded to multi-level indexing)

---
name: [warehouse-skill]
description: "IF the user asks to query [the company]'s data warehouse for any
  [list of business domains] question — THEN invoke this skill. DO NOT invoke
  for [adjacent engineering tasks] or questions with no data-warehouse component."
---

# Semantic Layer (REQUIRED first step)
The governed semantic layer is the mandatory default path for every data question;
raw SQL is the fallback, used only after the semantic-layer path is shown not to cover the ask.
## Required workflow: Load → Discover (always check segments) → Compile + run → Fallback
> Don't bail early. Do NOT fall back to raw SQL on these grounds:
>   "[custom date filtering / cohorts]" → covered by time-dimension specs
>   "[needs a join]"                    → the metric layer already encapsulates its joins
>   [3-4 more pre-rebutted excuses agents use to skip the semantic layer]

# PART 1 MUST KNOW: red flags / out-of-scope escalation / clarify the decision / entity disambiguation / data-integrity rules
# PART 2 HOW TO DO: technical execution / analysis guide / MANDATORY adversarial SQL review / report with provenance footer
# PART 3 DATA REFERENCES (multi-level index, drill down, load only the relevant few)
  references/_index.md              ← L0 top-level domain index (read this first)
    ├─ Users      → user/_index.md         keywords: signup / profile / DAU / MAU / cohorts
    │               └─ user/tables.md        L2 table-level: user / device / membership
    ├─ Transactions → transaction/_index.md  keywords: orders / payment / refund / GMV
    │               └─ transaction/tables.md
    └─ Marketing  → marketing/_index.md     keywords: campaigns / coupons / push / redemption
  metrics/_index.md                 ← metric routing: DAU/retention → core; GMV/LTV/ROI → kpi
  relationships.md                  ← cross-domain JOIN relationships (load when joining)

The appendix's knowledge-base navigation is single-level (one reference file per domain); at scale it's often built as a multi-level index: a top-level domain index → per-domain _index → table-level docs, narrowing the search space layer by layer. The frontmatter, semantic-layer enforcement, and the PART 1-3 spine follow the original skeleton; the multi-level indexing layer is an extension drawn from large-warehouse practice.
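The drill-down through a multi-level index can be sketched as a lookup that returns only the few reference files a question needs. In practice the agent reads the index and decides for itself; the keyword match below is a deliberately crude stand-in. `DOMAIN_INDEX` and `route` are hypothetical names, with keywords and paths copied from the skeleton above.

```python
# Hypothetical in-memory copy of the top-level index; keywords come from the skeleton above.
DOMAIN_INDEX = {
    "user": {
        "keywords": {"signup", "profile", "dau", "mau", "cohorts"},
        "docs": ["user/_index.md", "user/tables.md"],
    },
    "transaction": {
        "keywords": {"orders", "payment", "refund", "gmv"},
        "docs": ["transaction/_index.md", "transaction/tables.md"],
    },
    "marketing": {
        "keywords": {"campaigns", "coupons", "push", "redemption"},
        "docs": ["marketing/_index.md"],
    },
}

def route(question: str) -> list[str]:
    """Return the reference files to load for a question, narrowest set first."""
    tokens = {word.strip("?.,!\"'").lower() for word in question.split()}
    matched = [d for d, entry in DOMAIN_INDEX.items() if tokens & entry["keywords"]]
    docs = [doc for d in matched for doc in DOMAIN_INDEX[d]["docs"]]
    # Cross-domain questions also need the JOIN relationships file.
    if len(matched) > 1:
        docs.append("relationships.md")
    return docs

single = route("What was GMV last quarter?")
cross = route("GMV from push campaigns")
print(single, cross)
```

The point of the structure is what is left out: a single-domain question never loads the other domains' docs, and the relationships file is pulled in only when a join is actually in play.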

Write reference docs for retrieval by an LLM. Describe grain, scope, exclusions, the mechanics of gotchas ("exclude known free-email domains, but keep custom ones like anthropic.com"), and explicit routing triggers, without prescriptive recipes that go stale. Anthropic shares a skeleton in the appendix.

Reference-doc skeleton · from Anthropic's appendix

# [Domain] Tables

## Quick Reference
### Business Context — [what this domain means in plain words]
### Entity Grain — [what one row represents]
### Standard Hygiene Filter — [the filter every query here applies]

## Dimensions · Key Tables · Gotchas
- [the wrong-answer modes a senior analyst would warn you about]

## Best Practices / Common Query Patterns · Cross-References
- [default cuts; neighboring domain docs that own adjacent questions]

Treat skill maintenance as a first-class citizen. Skill docs describe a data model that changes daily; without active maintenance they're wrong within weeks. Anthropic watched offline accuracy drift from ~95% at launch to ~65% over a month before treating it as an engineering problem: colocate skill markdown with the transformation models, so the PR that changes a model is the PR that updates its doc, and a code-review hook flags any reporting-model change that doesn't touch a skill file. Roughly 90% of data-model PRs now include a skill change in the same diff. They also ensure a consistent experience across surfaces: the same skill gives the same answer in Slack, the IDE, a dashboard tool, and standalone sessions, via one canonical source and auto-sync on merge (plugin marketplace, cloud-storage blobs, MCP resources).
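The code-review hook described above can be approximated as a function over the list of paths changed in a PR. This is a minimal sketch under stated assumptions: the post does not give a repo layout, so `REPORTING_DIR`, `SKILL_DIR`, and the file extensions are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical repo layout: the post does not specify directory names.
REPORTING_DIR = "models/reporting"
SKILL_DIR = "skills"

def _under(path: str, root: str) -> bool:
    """True if path lives somewhere beneath root."""
    return PurePosixPath(root) in PurePosixPath(path).parents

def missing_skill_update(changed_paths: list[str]) -> list[str]:
    """Return changed reporting models when the diff touches no skill markdown."""
    models = [p for p in changed_paths if _under(p, REPORTING_DIR) and p.endswith(".sql")]
    touched_skill = any(_under(p, SKILL_DIR) and p.endswith(".md") for p in changed_paths)
    return [] if touched_skill else models

flagged = missing_skill_update(["models/reporting/fct_orders.sql"])
clean = missing_skill_update(
    ["models/reporting/fct_orders.sql", "skills/warehouse/references/orders.md"]
)
print(flagged, clean)
```

A hook like this only checks that some skill file moved in the same diff, not that the right one did, which matches the post's framing of it as a flag for reviewers rather than a proof of correctness.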

Layer 4: Validation

Validation is how you find out which of the three failure modes is still leaking through. Three parts.

Offline evals. Simple question/answer pairs, like offline testing for an ML model. Anthropic deploys dashboard-based evals (Claude-generated, human-validated) and long-tail evals (feed Claude business context, have it generate plausible questions), and harvests every stakeholder correction as a candidate eval. Key practices: anchor ground truth to a snapshot date so it can't drift; store results like telemetry in a warehouse table (skill version, git SHA, model ID, pass/fail, tokens, wall-clock) so "did that change help?" becomes a query; gate launches per domain (initially ~90%); and offline accuracy should be ~100%, with every correct answer also hitting the semantic layer.
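Storing eval results like telemetry means "did that change help?" is a GROUP BY. The sketch below uses an in-memory SQLite table as a stand-in for the warehouse; the `eval_runs` schema follows the fields listed above, while the table name, the sample rows, and the `model-x` identifier are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE eval_runs (
        domain TEXT, skill_version TEXT, git_sha TEXT, model_id TEXT,
        question_id TEXT, passed INTEGER, tokens INTEGER, wall_clock_s REAL
    )
""")

# Hypothetical results: ten questions per skill version, v1 passes 8 and v2 passes 9.
rows = []
for version, sha, n_pass in [("v1", "aaa111", 8), ("v2", "bbb222", 9)]:
    for i in range(10):
        rows.append(("orders", version, sha, "model-x", f"q{i}", int(i < n_pass), 1000, 2.5))
conn.executemany("INSERT INTO eval_runs VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)

LAUNCH_GATE = 0.90  # the per-domain gate the post describes as initially ~90%

results = conn.execute("""
    SELECT skill_version, SUM(passed), COUNT(*)
    FROM eval_runs
    WHERE domain = 'orders'
    GROUP BY skill_version
    ORDER BY skill_version
""").fetchall()

pass_rates = {version: passed / total for version, passed, total in results}
clears_gate = {version: rate >= LAUNCH_GATE for version, rate in pass_rates.items()}
print(pass_rates, clears_gate)
```

Keeping the git SHA and model ID on every row is what lets the same table answer both "did the skill change help?" and "did the model change help?" without rerunning anything.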

Ablation. Every structural skill decision is made by holding the eval set fixed and varying one component. Anthropic's most useful ablation was a negative one: it gave the agent direct grep access to thousands of SQL files and verified in transcripts that it read them before every answer, yet accuracy moved by less than a point. For the questions it got wrong, the answer was in the corpus about 80% of the time, but "answer present" didn't predict "now gets it right." The information was there, the agent saw it, and it still didn't use it. The bottleneck wasn't access to prior work, it was structure, mapping a question to the right entity. That insight redirected months of roadmap.

Online validation. An adversarial-review skill aggressively challenges every assumption behind a candidate answer (+6% accuracy, at the cost of +32% tokens and +72% latency; see section 06). Every response carries a provenance footer stating source tier, freshness, and owner. Passive monitoring tracks the share of queries resolving through the semantic layer and the share of responses with correction language. A scheduled agent scans channels every few hours for corrections, drafts a one-line doc fix, and opens a PR. The loop is deliberately boring: edit markdown, merge, auto-sync.
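The two passive-monitoring signals can be computed from a simple interaction log. This is a rough sketch: the log shape, the `follow_up` field, and the `CORRECTION_PHRASES` list are hypothetical, and substring matching is a crude proxy for detecting correction language.

```python
# Hypothetical phrase list and log shape; a real deployment would tune both.
CORRECTION_PHRASES = ("that's wrong", "that is wrong", "not right", "incorrect")

def monitoring_shares(log: list[dict]) -> dict[str, float]:
    """Share of queries resolved via the semantic layer, and share followed by a correction."""
    total = len(log)
    if total == 0:
        return {"semantic_layer_share": 0.0, "correction_share": 0.0}
    semantic = sum(1 for r in log if r["source"] == "semantic layer")
    corrected = sum(
        1 for r in log
        if any(phrase in r["follow_up"].lower() for phrase in CORRECTION_PHRASES)
    )
    return {
        "semantic_layer_share": semantic / total,
        "correction_share": corrected / total,
    }

log = [
    {"source": "semantic layer", "follow_up": "thanks"},
    {"source": "semantic layer", "follow_up": ""},
    {"source": "semantic layer", "follow_up": "great, sending to finance"},
    {"source": "raw exploration", "follow_up": "That's wrong, refunds should be excluded"},
]
shares = monitoring_shares(log)
print(shares)
```

A falling semantic-layer share or a rising correction share is a prompt to look at coverage and docs, not a verdict on any single answer.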

Illustrative · provenance footer · from Anthropic's appendix

> Source: semantic layer | governed table | raw exploration
> Confidence: [tier]   Reviewed: [reviewer ✓, round N]
> Freshness: [max date in the data]   Owner: [owning team]

The one thing none of this fully catches is the silent failure: the answer is wrong, but looks plausible and is used without objection. Anthropic's mitigations are the provenance footer, explicit human sign-off on anything leadership-bound, and a standing eval for each domain's top KPIs checked daily against the blessed dashboard. They admit they don't have a robust solution yet.
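The provenance footer mentioned above is simple enough to render mechanically, which helps keep it identical across surfaces. The sketch follows the three-line template from the appendix; the `Provenance` class, its field names, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

# The three source tiers named in the footer template.
SOURCE_TIERS = ("semantic layer", "governed table", "raw exploration")

@dataclass
class Provenance:
    source: str
    confidence: str
    reviewer: str
    review_round: int
    freshness: date  # max date present in the data
    owner: str

def render_footer(p: Provenance) -> str:
    """Render the footer; reject any source tier outside the known three."""
    if p.source not in SOURCE_TIERS:
        raise ValueError(f"unknown source tier: {p.source!r}")
    return "\n".join([
        f"> Source: {p.source}",
        f"> Confidence: {p.confidence}   Reviewed: {p.reviewer} ✓, round {p.review_round}",
        f"> Freshness: {p.freshness.isoformat()}   Owner: {p.owner}",
    ])

footer = render_footer(Provenance(
    source="semantic layer",
    confidence="high",
    reviewer="sql-review",
    review_round=2,
    freshness=date(2026, 5, 31),
    owner="finance",
))
print(footer)
```

Rejecting unknown tiers at render time means a response cannot quietly claim a level of trust the stack does not define.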

04 · The load-bearing call

At the final step of computing a number, the model doesn't get to improvise

The most striking part of the stack is a counterintuitive trade-off. It hides inside that phrase "structurally required to leverage the semantic layer first," yet it carries the whole post.

The usual text-to-SQL approach is to give the model enough context and then let it write the final SQL, on the assumption that the stronger the model, the better. Anthropic doesn't do that. Its agent is structurally forced by skill instruction: if a question maps to a defined metric in the semantic layer, it must call that metric function rather than assemble its own SQL. The model is not allowed to improvise at the final computing step; raw SQL is only a fallback.

Why tie its own hands? Because what it wants isn't "probably right," it's "wrong in a way you can see." For the same question, "US GMV last quarter," the two paths end very differently:

✓ Via the semantic layer (enforced default)

gmv(country=US, date=last_quarter)

Basis already governed: paid, refunds excluded, internal accounts excluded, tax-inclusive. The number matches the finance dashboard and the board deck exactly.

If the metric isn't defined → it returns "no coverage," then falls back or clarifies.

Failure is explicit: an error message.

✗ Letting the model write SQL freely (Anthropic doesn't)

SELECT SUM(amount) FROM orders_v2_wide WHERE ...

Also returns a number, and a plausible one. But it may use a deprecated wide table, or drop the refund filter.

Nothing signals that it's wrong.

Failure is silent: a number that looks right and isn't, that no one catches.

That's the heart of it. A system that errors out fails explicitly: you know on the spot that it can't answer, and you can escalate or clarify. A system that always returns a number fails silently, dressing up "I don't know" as "here's the answer," which is exactly the most dangerous thing in a high-stakes decision. When a number is headed for a board, an audit, or a KPI, "looks right but isn't" is far more destructive than "I can't answer that."

So the load-bearing call, in one line: use a deterministic semantic layer to force errors from silent into explicit. Model creativity is an asset when writing code and a liability at the final computing step, where there's a single correct answer and no test can falsify it on the spot. Anthropic's choice is to switch the creativity off at that step.

"With text-to-SQL, failure looks like a plausible but incorrect answer. With the Semantic Layer, failure looks like an error message."

(dbt, "Semantic Layer vs. Text-to-SQL"; vendor source, see notes)

05 · Outside evidence

The structural gap is real

Anthropic speaks from its own internal system, and those numbers can't be independently verified. But the direction, that the structural gap is real and a deterministic semantic layer closes it, is corroborated by a body of 2026 industry evidence.

A caveat up front: most of these benchmarks come from semantic-layer and governance vendors (dbt, Looker, Atlan). By this site's standard they are vendor blogs, the least trustworthy tier, with a commercial motive for the semantic layer to "win," and the specific percentages were likely cherry-picked. They're used here only for directional judgment, each labeled, none treated as definitive. The direction is credible because three kinds of source agree: motivated vendors, a neutral academic benchmark (Spider 2.0), and Anthropic's own internal numbers.

A year of model progress on text-to-SQL (Vendor · dbt)

a year ago: 32.7%

now: 64.5%

Models did leap. But for enterprise-grade complex queries, this is still nowhere near enough.

Enterprise text-to-SQL: same GPT-4o, moved to a realistic benchmark (Neutral academic · Spider 2.0)

Spider 1.0: 86%

Spider 2.0: 6%

The most persuasive line, because it's a neutral academic benchmark. This gap isn't crossed by a stronger model, it's a structure problem: single correct answer + silent failure.

Bare schema vs. a governed semantic layer (Vendor · dbt)

bare schema: ~40%

semantic layer: 83%

For queries the semantic layer covers, accuracy approaches or hits 100%: the query is generated deterministically, so the model can't produce a subtly wrong result set.

Looker: natural language → SQL vs. → governed LookML (Vendor · Looker)

NL→SQL avg: 80%

NL→LookML avg: 97%

Atlan, across 522 queries, also found governance context lifts AI-generated-SQL accuracy by a relative 38% (Vendor · Atlan). Vendor numbers don't fully agree, but the direction is uniform.

"Finding the right table": retrieval hit-rate after investing in metadata (Third-party · neutral-leaning)

embedding bare DDL: ~40%

embedding table docs: 90%

This corroborates Anthropic's skill-router approach: narrow millions of fields to a few dozen files, the same "retrieve-then-generate" lesson the field has learned.

Put together, one oft-repeated phrase fits: accuracy is a context problem, not a model problem (third-party). That is Anthropic's core thesis, said another way.

06 · The cost of running it

Not built once and left, but fed continuously

Anthropic offers a pragmatic starting point: from zero, a handful of canonical datasets, a few dozen offline evals, and a thin knowledge skill capture most of the upside; everything else in the post was added once those were built.

But to run it for the long haul, you have to face the cost. Lined up, the post's hard numbers make it plain.

21% → 95%+

Without skills, accuracy can't clear 21%. Most of the accuracy isn't free from the model, it's built by skills, a layer of human-curated assets.

95% → 65%

Offline accuracy drifts on its own, in a single month. The cost of not maintaining. Held in check only by a hook forcing "change the model, change the doc", at the price of ~90% of PRs carrying a skill change.

+6% / +32% / +72%

Adversarial review buys 6% accuracy for 32% more tokens and 72% higher latency. Accuracy isn't free, and gets pricier the higher you push.

Together these point to one conclusion: the real cost of this architecture is maintenance, not construction. It isn't a system you build and leave; it's a living thing you feed. The data model changes daily, docs go stale, evals drift, corrections have to be harvested back. Standing up that maintenance machinery is far harder, and far more valuable, than getting the agent working the first time.

Anthropic also poses a few questions to align on, since not every practice fits every team.

How important is a correct answer today vs. in the future? Models progress fast. A common waste is building heavy infrastructure to patch current model shortfalls that become moot once the model improves. Waiting for the model to catch up has far less overhead, if your risk tolerance allows.
How will your business complexity change? If you don't produce much data, have few output consumers, or your model will stay simple, some of these processes are overkill.
How technical is the audience? Building for data scientists who can spot a wrong answer means more error tolerance; building for people unfamiliar with the data model is a different matter.
How much will you spend for accuracy? Processes like adversarial validation improve accuracy significantly, but often at higher cost and latency.
What's your posture on access control and data privacy? Agents are more capable with more context, but broad access cuts against most governance postures. This decides whether you build one agent or many scoped ones.

The lesson for other teams is clear: when building self-service analytics in the agent era, the real engineering isn't "wire up a model and let it query." It's the three things that come before that: govern down to a single answer, make it discoverable, and build an immune system that keeps correcting it. A stronger model can't substitute for these, because what they solve isn't a model-capability problem; it's the organization's data-structure problem.

Source notes

Evidence tiers and labeling

Official · How Anthropic enables self-service data analytics with Claude

claude.com · 2026-06-03 · Data Science & Data Engineering team (Chen Chang, Clement Peng, Justin Leder, Johanne Jiao, Josh Cherry). Sections 1–4 and 6 draw all their claims, numbers, architecture, and appendix skeletons from this post. All internal accuracy figures (95% automated / ~95% aggregate accuracy, <21% without skills, >95% with skills and ~99% in certain domains, offline accuracy 95%→65% over a month, adversarial review +6% / +32% tokens / +72% latency, ~90% of PRs carry a skill change, launch gate initially ~90%) are Anthropic's own and cannot be independently verified. Blocks marked "illustrative" reconstruct the data structures with a generic e-commerce example, not any company's internal tables.

Vendor · Third-party · 2026 text-to-SQL / semantic-layer benchmarks

dbt (text-to-SQL 32.7%→64.5%; bare schema ~40% vs semantic layer 83%; "failure looks like an error message"), Looker (NL→LookML avg 97% vs NL→SQL avg 80%), and Atlan (522 queries, +38% with governance context) are vendors selling semantic or governance layers, with a commercial motive and likely cherry-picked numbers; used here only for directional judgment. Spider 2.0 (GPT-4o 86%→6%, enterprise peak 38–59%) is a neutral academic benchmark and a non-self-interested cross-check. Pinterest's 40%→90% retrieval hit-rate and "accuracy is a context problem, not a model problem" are third-party, neutral-leaning. The direction is credible because vendor benchmarks, the neutral academic benchmark, and Anthropic's internal numbers all agree; no single percentage should be taken as definitive.