Data readiness audit

The precondition nobody scopes. Everyone debates the model; almost no one audits whether the data underneath it is fit, lawful and traceable — before the impact-assessment and privacy clocks run out.

The problem this quarter

Every AI steering committee this year has argued about the same thing: which model. GPT-class or open-weight, hosted or on-prem, RAG or fine-tune. Almost none have asked the prior question that determines whether any of those choices is lawful or useful: is the data underneath fit, lawful and traceable?

The model is the part you can swap. The data is the part you inherit: decades of documents, exports, scraped reference material and silently copied third-party content, most of it never assembled with an AI use in mind. When a retrieval or training pipeline goes wrong, it almost never fails at the model. It fails because the corpus was stale, mislabelled, mixed-sensitivity, or carried a licence nobody checked.

Two regulatory clocks make this urgent rather than tidy. The Privacy Act automated decision-making (ADM) transparency obligation (a new APP 1 duty) commences 10 December 2026: where qualifying automated decisions are made, entities must disclose the kinds of personal information used. That is a statement about your data, not your model. And the DTA Policy for the responsible use of AI in government v2.0 (effective 15 December 2025) requires an AI use-case impact assessment before deployment as part of its 12-month tranche (around December 2026). An impact assessment with no evidence about the underlying data is a cover sheet.

A data-readiness audit is the cheapest control you have, because it runs before you spend on infrastructure and before you create an obligation you can’t evidence.

The method

Audit the data on the dimensions that actually bind an AI use case. Each is scored — red, amber, green — for a named purpose. “The data is fine” is not an answer; “the data is green for internal RAG and red for an external-facing automated decision” is.

The seven dimensions

Quality and fitness for the AI purpose. Not generic data quality — fitness for this use. A field that is good enough for quarterly reporting may be useless as ground truth for a classifier.
Lineage and provenance. For each field that feeds the system, can you show where it came from and how it was transformed? If you can’t trace it, you can’t defend it in an impact assessment or a privacy disclosure.
Rights and licensing. Who owns this content, and are you licensed to use it for AI? Since the Australian Government ruled out a text-and-data-mining (TDM) exception on 26 October 2025, you cannot assume scraped or third-party material is free to train on or index. Licensing approaches are still under exploration, so the default answer is “not cleared”.
Sensitivity and personal information. What classification does each source carry, and what personal information is present? This is the dimension that wires directly into the Privacy ADM obligation: the kinds of personal information in the corpus are the kinds you may have to disclose.
Representativeness and bias. Does the data represent the population the system will act on, or a skewed slice of it? Gaps here become disparate outcomes downstream.
Labelling and ground truth. Where the use case needs labels, are they accurate, consistent and current — or improvised?
Availability and freshness. For RAG or any retrieval system, how current is the index, how often does it refresh, and what is the staleness window?

The discipline is to score each dimension against a specific deployment. The same corpus carries different risk for an internal drafting assistant than for an automated eligibility decision.
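To make "score against a specific deployment" concrete, here is a minimal sketch of the availability and freshness dimension scored per use case. The use-case names, the staleness windows and the rule that amber extends to twice the window are all illustrative assumptions, not values taken from any policy.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical staleness windows per use case (assumed values for illustration).
STALENESS_WINDOW = {
    "internal drafting assistant": timedelta(days=30),
    "automated eligibility decision": timedelta(days=1),
}


def freshness_status(last_indexed: datetime, use_case: str, now: datetime) -> str:
    """Return "G", "A" or "R" for the freshness dimension of one use case.

    Assumption: green within the window, amber up to twice the window,
    red beyond that.
    """
    window = STALENESS_WINDOW[use_case]
    age = now - last_indexed
    if age <= window:
        return "G"
    if age <= 2 * window:
        return "A"
    return "R"


# The same index, last refreshed five days ago, scores differently per use case.
now = datetime(2026, 1, 10, tzinfo=timezone.utc)
last_indexed = datetime(2026, 1, 5, tzinfo=timezone.utc)
internal = freshness_status(last_indexed, "internal drafting assistant", now)
external = freshness_status(last_indexed, "automated eligibility decision", now)
```

The point of keying the window by use case is that the index itself has no freshness score; only the pairing of index and purpose does.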

A worked example

Riverina Mutual (fictional) is a mid-sized regional member organisation in NSW. Its innovation team wants RAG over the internal document store — policies, member correspondence, product guides, board papers — so frontline staff can answer member questions faster. The proof-of-concept demos beautifully. Then the data-readiness audit runs before the production go-ahead.

It surfaces three problems the demo hid:

No lineage. The document store is fifteen years of accreted shared drives. Nobody can say where many documents originated, which are current, or which were superseded. The retrieved “policy” might be a 2019 draft.
Mixed sensitivity. The store blends public product guides, internal-only board papers and member correspondence containing personal and financial information — with inconsistent classification. RAG retrieves across all of it, so a member-facing assistant can surface a board paper or another member’s details.
Third-party content of unknown licence. Scattered through the store are vendor manuals, purchased research reports and web-sourced reference PDFs. None were assembled with redistribution or AI indexing in mind, and the text-and-data-mining exception that might have excused training-style use has been ruled out.

None of this is a model problem, and no model swap fixes it. The audit reframes the project honestly: a member-facing RAG over this corpus is red today. A scoped internal assistant over a curated, classified, licence-cleared subset is amber-to-green and shippable. The board decision moves from “approve the AI” to “fund the data remediation that makes one AI use case lawful” — a far better decision to be making before December 2026.
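The "curated, classified, licence-cleared subset" can be sketched as a filter applied before anything reaches the index. This is a rough illustration only: the `Document` fields, the classification labels and the `curated_subset` helper are hypothetical, and a real document store would need this metadata captured first, which is exactly the remediation work the audit exposes.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Document:
    doc_id: str
    classification: Optional[str]  # None means nobody has classified it
    licence_cleared: bool          # AI indexing rights confirmed
    superseded: bool               # replaced by a newer version
    source_known: bool             # lineage can be shown


def curated_subset(docs: Iterable[Document], allowed: Set[str]) -> List[Document]:
    """Keep only documents that pass every check; unknowns are excluded by default."""
    return [
        d for d in docs
        if d.classification in allowed
        and d.licence_cleared
        and not d.superseded
        and d.source_known
    ]


docs = [
    Document("guide-1", "public", True, False, True),
    Document("board-7", "board-only", True, False, True),        # wrong sensitivity
    Document("vendor-3", "public", False, False, True),          # licence not cleared
    Document("policy-2019-draft", "internal", True, True, True), # superseded
    Document("unknown-1", None, True, False, False),             # no lineage, unclassified
]
subset = curated_subset(docs, {"public", "internal"})
```

Excluding anything unclassified or untraceable mirrors the "not cleared by default" stance: a document has to earn its way into the index.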

The template

Copy this. Score every dimension against one named use case. Anything not green needs an owner and a remediation date before deployment.

Data-readiness scorecard

Dimension	What good looks like	Status (R / A / G)
Quality & fitness	Data demonstrably fit for this purpose; known accuracy; no silent gaps
Lineage & provenance	Every feeding field traceable to source and transformation
Rights & licensing	Ownership known; AI use cleared; no unlicensed third-party/scraped content
Sensitivity & personal info	Consistent classification; personal information identified and mapped
Representativeness & bias	Data represents the acted-on population; known gaps documented
Labelling & ground truth	Labels accurate, consistent, current; provenance of labels known
Availability & freshness	Refresh cadence and staleness window defined and acceptable

Scoring rules

Green — meets the standard for the named use case; evidence exists.
Amber — partial; a known gap with a named owner and a remediation date.
Red — fails, or you cannot evidence it. Red on lineage, licensing or sensitivity blocks deployment, because each maps to a specific obligation you would otherwise breach.
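The scoring rules can be expressed as a simple deployment gate. The sketch below is one possible reading, with hypothetical names throughout (`Score`, `BLOCKING`, `deployment_blockers`). It assumes a red on a non-blocking dimension is handled like an amber, meaning it needs an owner and a remediation date; your own policy may be stricter.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import List, Optional


class Status(Enum):
    RED = "R"
    AMBER = "A"
    GREEN = "G"


# Dimensions where a red score blocks deployment outright.
BLOCKING = {"lineage", "licensing", "sensitivity"}


@dataclass
class Score:
    dimension: str
    status: Status
    owner: Optional[str] = None
    remediation_date: Optional[date] = None


def deployment_blockers(scores: List[Score]) -> List[str]:
    """Return the reasons a use case cannot deploy yet (empty list means clear)."""
    reasons = []
    for s in scores:
        if s.status is Status.RED and s.dimension in BLOCKING:
            reasons.append(f"{s.dimension}: red on a blocking dimension")
        elif s.status is not Status.GREEN and (
            s.owner is None or s.remediation_date is None
        ):
            reasons.append(f"{s.dimension}: not green, missing owner or remediation date")
    return reasons


scores = [
    Score("lineage", Status.RED, "records team", date(2026, 6, 30)),
    Score("quality", Status.AMBER),  # no owner or date recorded
    Score("freshness", Status.GREEN),
]
blockers = deployment_blockers(scores)
```

Returning reasons rather than a single boolean keeps the output usable as evidence in an impact assessment.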

One rule worth enforcing

No “green overall”. A corpus is green for a use case. Re-score whenever the use case changes — internal assistant to external decision-maker is a different audit, not a config flag.
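One way to enforce this rule in tooling is to make the use case part of the key, so a score for a corpus alone cannot be stored or retrieved. The sketch below assumes a hypothetical in-memory `ScorecardRegister`; the class and its method names are illustrative, not an existing library.

```python
from typing import Dict, Tuple

Scores = Dict[str, str]  # dimension name -> "R" / "A" / "G"


class ScorecardRegister:
    """Holds one scorecard per (corpus, use case) pair; no corpus-wide score exists."""

    def __init__(self) -> None:
        self._cards: Dict[Tuple[str, str], Scores] = {}

    def record(self, corpus: str, use_case: str, scores: Scores) -> None:
        # Copy the scores so later changes by the caller do not alter the record.
        self._cards[(corpus, use_case)] = dict(scores)

    def lookup(self, corpus: str, use_case: str) -> Scores:
        key = (corpus, use_case)
        if key not in self._cards:
            raise LookupError(
                f"{corpus!r} has not been audited for {use_case!r}; re-score before deploying"
            )
        return self._cards[key]


register = ScorecardRegister()
register.record("document store", "internal drafting assistant", {"lineage": "G"})

try:
    # A new use case over the same corpus has no scorecard until it is audited.
    register.lookup("document store", "automated eligibility decision")
    rescoring_required = False
except LookupError:
    rescoring_required = True
```

Because the lookup fails rather than falling back to an earlier score, changing the use case forces a new audit instead of inheriting an old green.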

What changes when the regulation moves

The dimensions are stable; the consequences of a red score escalate as obligations commence.

Regulatory movement	Effect on the audit
Privacy ADM transparency (commences 10 Dec 2026; OAIC guidance ~Sep 2026)	The sensitivity & personal information dimension stops being hygiene and becomes a disclosure input. You must state the kinds of personal information feeding qualifying automated decisions — which presumes you have audited it.
Copyright: TDM exception ruled out (26 Oct 2025; licensing under exploration)	The rights & licensing dimension cannot be waved through. Scraped and third-party content is “not cleared” by default until a licensing path exists.
DTA Policy v2.0 impact assessment (~Dec 2026 tranche; legacy use cases by 30 Apr 2027)	The scorecard becomes the evidence base for the use-case impact assessment. No data evidence, no credible assessment.
EU AI Act Article 10 (high-risk data-governance duties; Annex III timing delayed to 2 Dec 2027 via the Digital Omnibus, not yet formally adopted)	AU organisations with EU exposure inherit explicit data-governance obligations for high-risk systems — lineage, representativeness and bias move from good practice to documented duty.

Underneath all of it, the frameworks already agree on the direction. The AI6 Guidance for AI Adoption (NAIC/DISR, six essential practices) and the ten guardrails of the Voluntary AI Safety Standard (VAISS) both treat data governance as a core control, and ISO/IEC 42001:2023 makes an AI management system certifiable. None of them lets you skip the audit; they all assume you have done it.

The practical posture: run the readiness audit now, while it is a cheap internal exercise, rather than reconstructing data lineage under a disclosure deadline. The data you can’t trace today is the disclosure you can’t make in December.

Where this connects

This audit is one axis of a broader picture. Use the risk classifier to confirm whether your use case is the kind that triggers the obligations above, then locate your data practice on the Data & AI axis of the maturity assessment. The obligations referenced here are tracked in the regulatory canon, and the audit sits within the model as the baseline control it depends on.