TLDR: The debate about AI replacing data analysts is mostly noise. The real constraint is not whether AI is capable enough. It is whether your data is documented well enough for AI to orient itself. Clean data helps. Documented data is what actually changes anything. Teams investing in data dictionaries and event taxonomies before deploying AI are the ones getting repeatable value. Everyone else is still routing questions through the same people.
Every few weeks someone publishes a new take on whether AI will replace data analysts. The answers range from “absolutely, within two years” to “never, human judgment is irreplaceable.” Both camps are spending energy on the wrong question.
I’ve worked with data teams long enough to know that the bottleneck is rarely analyst capability. It’s usually something much more unglamorous. The data exists. The analysts exist. But no one has written down what the data means or how it is structured.
The knowledge is in someone’s head, and that is the actual problem
I work with a business that has data spread across multiple warehouses and tables. Their analysts are skilled. Their data infrastructure is technically solid. But there is almost no documentation of what data sits where, how tables are structured, how metrics are aggregated or which source of truth to use when two numbers conflict.
That knowledge lives in the heads of two or three people on the data team.
When a product manager needs a number, they ask one of those people. When a new hire needs to understand the data model, they book time with one of those people. When something looks wrong in a report, they ask one of those people. The team is a bottleneck and they know it. They have just never had enough time to fix it because they are too busy answering questions.
Now add AI to that picture. A product manager gets excited, opens a chat interface connected to the data warehouse and asks a question. The AI responds with something plausible-sounding. The PM has no idea whether the answer is right, because they do not know whether the AI pulled from the right table, applied the right filters or used the right definition of the metric.
The AI is not wrong because it is not smart enough. It is wrong because no one told it the rules.
Clean data and explained data are not the same thing
The conventional wisdom is that AI needs clean data to work. That is true, up to a point. But it is a distraction from the more fundamental issue.
Clean data with no documentation is still opaque to an AI agent. An event called form_submit could mean a thousand different things depending on your implementation. A column called revenue might be gross, net, invoiced or recognised depending on which table you are in and when that table was last touched. A user ID in one table might not join cleanly to a user ID in another because they were built by different teams at different points in time.
AI does not know any of that unless you tell it.
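To make that concrete, here is a minimal sketch of what disambiguating documentation could look like. Everything here is hypothetical: the table names, field names and dictionary structure are illustrations, not a standard, but the shape shows the kind of context an AI would need before it can pick the right "revenue".

```python
# A hypothetical data-dictionary entry for two columns that share the
# ambiguous name "revenue". The structure is one possible shape for
# this documentation, not a standard.
data_dictionary = {
    "billing.orders.revenue": {
        "meaning": "Gross revenue at order time, before refunds and tax.",
        "unit": "GBP",
        "grain": "one row per order",
        "caveats": "Excludes orders migrated before the billing replatform.",
    },
    "finance.recognised_revenue.revenue": {
        "meaning": "Revenue recognised under accrual accounting.",
        "unit": "GBP",
        "grain": "one row per month per product line",
        "caveats": "Source of truth when numbers conflict with billing.orders.",
    },
}

def describe(column: str) -> str:
    """Render one entry as plain text an AI context window can consume."""
    entry = data_dictionary[column]
    lines = [f"Column: {column}"] + [f"  {k}: {v}" for k, v in entry.items()]
    return "\n".join(lines)

print(describe("billing.orders.revenue"))
```

Two entries, same column name, different meanings, and an explicit note about which one wins in a conflict. That last field is the one humans always forget to write down and AI can never infer.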
McKinsey’s 2024 State of AI survey found that 70% of the companies it classifies as AI high performers reported difficulty integrating data into AI models quickly. These are the companies doing AI well. The problem is not the model. It is the data context the model needs to do its job. (Source: McKinsey, The State of AI in Early 2024)
A data dictionary is the translation layer AI actually needs
For simpler web analytics implementations, a well-structured event taxonomy does most of the work. For my own site, I maintain a spreadsheet that documents every GA4 event: the event name, which category it belongs to, what parameters it fires with, the conditions under which it fires and which forms or tools it applies to. A separate sheet documents custom dimensions, their scope and which events use them. Another covers what lives in GA4 versus what only appears in BigQuery.
That document is written for a human. But it is just as useful when I load it into an AI context window before asking questions about how experimenters are engaging with my Hypothesis assessment tool. The AI then knows what tool_result means, what result_rating refers to and why belief_tag exists. It can answer questions about the data without guessing, so in a single prompt I can evaluate how well the recommendation engine balances ease of use against improving hypothesis quality.
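Loading the taxonomy is not sophisticated. A sketch of the idea, assuming the spreadsheet is exported as CSV: the column names below are assumptions about how such a sheet might be laid out, though the event names (tool_result, result_rating, belief_tag, form_submit) are the ones discussed above.

```python
import csv
import io

# Hypothetical CSV export of the taxonomy spreadsheet described above.
# The column headings are assumptions about one plausible sheet layout.
taxonomy_csv = """event_name,category,parameters,fires_when
tool_result,assessment,result_rating;belief_tag,user completes the Hypothesis assessment
form_submit,lead_gen,form_id,any site form is submitted successfully
"""

def taxonomy_preamble(csv_text: str) -> str:
    """Turn the taxonomy sheet into a plain-text preamble that can be
    prepended to a prompt, so the model knows what each event means."""
    rows = csv.DictReader(io.StringIO(csv_text))
    lines = ["You are answering questions about GA4 data. Event definitions:"]
    for row in rows:
        params = row["parameters"].replace(";", ", ")
        lines.append(
            f"- {row['event_name']} (category: {row['category']}): "
            f"fires when {row['fires_when']}; parameters: {params}."
        )
    return "\n".join(lines)

print(taxonomy_preamble(taxonomy_csv))
```

The output is just text. That is the point: the translation layer is not a tool or a platform, it is a document the model reads before it answers anything.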
For larger, more complex data environments the same principle applies at greater scale. The teams getting the most repeatable value from AI-assisted analysis are building what I would describe as a knowledge layer: structured documentation of schema, metric definitions, known data quirks, glossary terms and business context. Not a wiki that no one reads. A set of structured files that get loaded into AI context at the start of every session, so the model always knows what it is working with.
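One way such a knowledge layer might be wired up, as a sketch: a fixed set of files, each covering one concern, concatenated into context at the start of a session. The file names and directory layout here are hypothetical, not a convention any tool mandates.

```python
from pathlib import Path

# Hypothetical layout for the knowledge layer: one markdown file per
# concern, loaded in a fixed order at the start of every AI session.
KNOWLEDGE_FILES = [
    "schema.md",    # tables, columns, valid join keys
    "metrics.md",   # canonical metric definitions
    "quirks.md",    # known data issues and their workarounds
    "glossary.md",  # business terms and abbreviations
]

def load_knowledge_layer(root: str) -> str:
    """Concatenate the knowledge files into one context block,
    skipping any that have not been written yet."""
    parts = []
    for name in KNOWLEDGE_FILES:
        path = Path(root) / name
        if path.exists():
            parts.append(f"## {name}\n{path.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

Skipping missing files matters in practice: teams write the schema file first and the glossary months later, and the layer should be useful from day one rather than all-or-nothing.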
Without that layer, every AI session starts from scratch. The model makes its best guess about what your tables mean, what joins are valid and which version of a metric to use. Sometimes it is right. Often enough that it feels useful, which is the dangerous part.
Waiting to document is a decision to keep the data team as a bottleneck
Most organisations know their data documentation is inadequate. It comes up in every analytics audit I run. The gap between “we know this needs doing” and “someone is actually doing it” comes down to one thing: it is hard to prioritise documentation work when there is always something more urgent to answer.
This is the same reason implementation work gets skipped in the first place. Tagging plans, event taxonomies, data dictionaries and schema documentation are the foundation that makes everything downstream work properly. They just do not feel urgent until the moment you need them and they are not there.
The business case has always existed. AI has just made it more concrete. If you want AI to function as a useful analyst supplement, the documentation layer is not optional. It is the thing that tells AI what your data means. Without it, you are not replacing the analyst. You are adding another tool that the analyst has to supervise.
The question worth asking before you invest in AI analytics tooling
If someone on your team asked an AI a question about your data today, could they trust the answer? Not “would the AI give an answer”, because it will. The question is whether anyone in the business would know if the answer was right.
If the honest answer is “only the data team would know,” you have a documentation problem. And the longer you leave it, the more the data team’s time gets consumed by questions that should have been self-service months ago.