Data quality matters — Nick Whiteley

It is 2026 and the technology world is all about AI. Meanwhile, the underlying definition and classification of your data is just as important as it ever was — arguably more so.

Some of the reasons for this have always been true:

A good data model is the foundation for a good application
When data leaves your application, its meaning must travel with it
We are generating and consuming more data than ever before; a small ambiguity quickly becomes a large problem

And then there is a new one that raises the stakes considerably: AI agents derive meaning directly from how your data is named and classified. They don't ask for clarification. They don't check back with a colleague. They read your column name, make an assumption, and act on it. A column called rate means something very different in a pricing table, a tax table, and a loans table. The agent won't know which you meant unless you tell it — and the way you tell it is through clear, precise data definitions baked into the schema itself.
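
To make that concrete, here is a minimal sketch in Python using the standard library's sqlite3 module. Every table and column name in it is hypothetical, invented purely for illustration. It contrasts three tables that all use the bare name rate with a version where each name states what is measured and in what unit.

```python
import sqlite3

# All table and column names below are hypothetical, for illustration only.
# Ambiguous version: the same bare name in three unrelated tables.
AMBIGUOUS_COLUMNS = {"pricing": "rate", "tax": "rate", "loans": "rate"}

# Explicit version: each name states what is measured and in what unit.
EXPLICIT_DDL = """
CREATE TABLE pricing (hourly_price_gbp NUMERIC NOT NULL);
CREATE TABLE tax (vat_rate_percent NUMERIC NOT NULL);
CREATE TABLE loans (nominal_annual_interest_rate_percent NUMERIC NOT NULL);
"""


def first_column_name(conn: sqlite3.Connection, table: str) -> str:
    # PRAGMA table_info yields one row per column; field 1 is the column name.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return rows[0][1]


conn = sqlite3.connect(":memory:")
conn.executescript(EXPLICIT_DDL)
explicit_columns = {t: first_column_name(conn, t) for t in AMBIGUOUS_COLUMNS}

# A reader that sees only names gets one indistinguishable name in the first
# schema and three self-describing names in the second.
print(len(set(AMBIGUOUS_COLUMNS.values())))  # 1
print(len(set(explicit_columns.values())))   # 3
```

The longer names cost a few keystrokes, but they are the only part of the schema that every consumer, human or agent, is guaranteed to read.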

Anecdote one: a sale is a sale, right? In retail, particularly online, there are many ways to count what that one word describes. The web marketing team measure sales using browser-based stats. Fulfilment measure confirmed orders. Finance measure shipped orders, net of returns. Merchandise planners combine confirmed sales with returns to plan stock. At the end of the year, targets and bonuses are based on the finance figure, but that definition is useless for through-the-year planning. Four teams, one word, four different numbers. No amount of good tooling fixes this. You have to agree the definition first, and document it somewhere that survives the next reorganisation.
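
A small Python sketch shows how the four figures drift apart on identical data. The orders are made up, and the four functions are my own simplified reading of each team's definition (in particular, "confirmed minus returned" for the planners is an assumption), not how any real retailer calculates them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    value: int        # order value in whole pounds (hypothetical data)
    confirmed: bool   # payment confirmed
    shipped: bool     # goods dispatched
    returned: bool    # goods later sent back


# Five checkout events as browser analytics might record them.
orders = [
    Order(100, confirmed=True, shipped=True, returned=False),
    Order(50, confirmed=True, shipped=True, returned=True),
    Order(80, confirmed=True, shipped=False, returned=False),
    Order(40, confirmed=False, shipped=False, returned=False),
    Order(30, confirmed=False, shipped=False, returned=False),
]


def web_marketing_sales(orders):
    # Assumption: every checkout seen in the browser counts.
    return sum(o.value for o in orders)


def fulfilment_sales(orders):
    # Confirmed orders only.
    return sum(o.value for o in orders if o.confirmed)


def finance_sales(orders):
    # Shipped orders, net of returns.
    return sum(o.value for o in orders if o.shipped and not o.returned)


def planning_sales(orders):
    # Assumption: confirmed sales minus the value of returned goods.
    confirmed = sum(o.value for o in orders if o.confirmed)
    returned = sum(o.value for o in orders if o.returned)
    return confirmed - returned


figures = {
    "web marketing": web_marketing_sales(orders),
    "fulfilment": fulfilment_sales(orders),
    "finance": finance_sales(orders),
    "merchandise planning": planning_sales(orders),
}
print(figures)
# {'web marketing': 300, 'fulfilment': 230, 'finance': 100, 'merchandise planning': 180}
```

None of the four functions is wrong. The problem only appears when all four results are stored in a column called sales.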

There is a real skill in stepping back from your data definitions and viewing them as a stranger would a year from now. Can you tell exactly what each field contains? What its expected values are? What it excludes? If you cannot answer those questions easily, neither can anyone — or anything — else consuming that data.

Anecdote two: interest rates are all about time. Interest can be calculated daily, monthly, or annually. Even when quoted as an annual rate it is often applied monthly, and those monthly figures compound, so the annual total is not simply twelve times the monthly figure: 1% a month compounds to roughly 12.68% over a year, not 12%. Then fees enter the picture, which is why APR exists as a standardised concept. Experience in banking shows that databases frequently store an interest rate as a single numeric value with no context attached. The reader must already know whether it is nominal or effective, annual or monthly, pre- or post-fee. When that context is missing, someone gets it wrong: quietly, confidently, at scale.
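
The compounding point can be checked with a few lines of Python. This is only a sketch of the standard nominal-to-effective conversion. The function name and the 12% example rate are mine, and fees are left out entirely, so it is not an APR calculation.

```python
def effective_annual_rate(nominal_annual_rate: float, periods_per_year: int) -> float:
    """Convert a nominal annual rate into the effective annual rate
    when interest is applied and compounded periods_per_year times."""
    periodic_rate = nominal_annual_rate / periods_per_year
    return (1 + periodic_rate) ** periods_per_year - 1


nominal = 0.12                       # hypothetical 12% nominal annual rate
monthly = nominal / 12               # applied as 1% each month
simple_total = 12 * monthly          # naive "twelve times the monthly figure"
compounded_total = effective_annual_rate(nominal, 12)

print(f"{simple_total:.4%}")      # 12.0000%
print(f"{compounded_total:.4%}")  # 12.6825%
```

A bare 0.12 in a column could be either of those numbers' starting point, which is exactly why the stored value needs its context.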

The remedies are not complicated. Clear column names and precise data types are the cheapest and most effective tools available. Beyond that, a well-governed data dictionary, column-level comments in the schema, or descriptive metadata stored alongside the values all serve the same goal: everyone refers to the same thing every time, whether that everyone is a human analyst, a downstream service, or an AI agent reading your schema and deciding what to do with it.
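
As one possible shape for "descriptive metadata stored alongside the values", here is a hedged Python sketch. The class, enum, and field names are hypothetical, and a real system might equally carry the same context in column names, schema comments, or a data dictionary. The idea is simply that the number never travels without its basis, period, and fee treatment.

```python
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum


class Basis(Enum):
    NOMINAL = "nominal"
    EFFECTIVE = "effective"


class Period(Enum):
    ANNUAL = "annual"
    MONTHLY = "monthly"


@dataclass(frozen=True)
class InterestRate:
    # Hypothetical structure: the value plus the context a reader needs.
    value_percent: Decimal
    basis: Basis
    period: Period
    includes_fees: bool

    def describe(self) -> str:
        fees = "including fees" if self.includes_fees else "excluding fees"
        return f"{self.value_percent}% {self.basis.value} {self.period.value} rate, {fees}"


rate = InterestRate(
    value_percent=Decimal("5.25"),
    basis=Basis.NOMINAL,
    period=Period.ANNUAL,
    includes_fees=False,
)
description = rate.describe()
print(description)  # 5.25% nominal annual rate, excluding fees
```

Because every field is required, a rate cannot be constructed without stating what kind of rate it is, which moves the ambiguity from the reader back to the writer, where it is cheapest to resolve.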

Be specific about what your data contains. Be clear about its sensitivity classification. Your future self — and your future agents — will thank you.