Data Governance for MLOps¶
Every model is a product of its data. If you do not govern the data flowing through your ML pipeline, you cannot trust the models that come out of it. Data governance for MLOps goes beyond traditional data governance. It must account for the unique ways ML systems consume, transform, and depend on data across training, evaluation, and production.
Why ML data governance is different¶
Traditional data governance focuses on storage, access, and retention. ML data governance adds several dimensions that traditional frameworks were not designed for.
| Traditional concern | ML-specific extension |
|---|---|
| Who can access data? | Who can use data for training? Access and training rights are not the same |
| Where is data stored? | Where was data stored when the model was trained? The data may have changed since |
| Is data accurate? | Is data accurate enough to train on? Statistical quality, not just correctness |
| Data retention | Model retention depends on data retention: retiring data may require retiring models |
| Compliance | Training on personal data creates new obligations (right to erasure affects trained models) |
Data lineage¶
Data lineage tracks the origin, movement, and transformation of data throughout the ML pipeline. It answers: where did this data come from, what happened to it, and which models were trained on it?
What to track¶
| Lineage element | Why it matters |
|---|---|
| Source | Where the data originated (database, API, vendor, scrape, synthetic generator) |
| Collection timestamp | When the data was acquired, critical for detecting temporal poisoning |
| Transformations | Every preprocessing step: cleaning, normalisation, augmentation, feature engineering |
| Schema | Column names, types, and constraints at each stage |
| Who modified it | Human edits, automated pipelines, or third-party processing |
| Which models used it | Forward traceability from data to every model trained on it |
Implementing lineage tracking¶
Lineage tracking does not require a dedicated platform on day one. Start with what you can implement now.
Minimum viable lineage:
- Hash every dataset version used for training
- Record the dataset hash in the training attestation (see Secure ML Pipelines)
- Store a manifest listing data sources, collection dates, and row counts
- Log all transformations applied to the data before training
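The first two steps above need nothing beyond the standard library. A minimal sketch, assuming the dataset is a single file; the manifest filename and field names are illustrative, not a standard format:

```python
import hashlib
import json
from pathlib import Path

def hash_dataset(path: str) -> str:
    """Compute a SHA-256 digest of a dataset file, streamed in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(dataset_path: str, source: str, collected: str, rows: int) -> dict:
    """Record minimum viable lineage for one dataset version next to the data."""
    manifest = {
        "dataset": dataset_path,
        "sha256": hash_dataset(dataset_path),
        "source": source,        # e.g. "vendor_feed" (illustrative value)
        "collected": collected,  # ISO date the data was acquired
        "row_count": rows,
    }
    Path(dataset_path).with_suffix(".manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest
```

The recorded `sha256` is the value to copy into the training attestation, so the attestation and the manifest agree on exactly which bytes were trained on.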
Mature lineage:
- Use a data versioning tool (DVC, LakeFS, Pachyderm) to version datasets alongside code
- Automate lineage capture in your pipeline orchestrator
- Build forward and backward traceability: from any model, find its data; from any data, find all models
- Integrate lineage with your model registry so every registered model links to its exact training data
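Forward and backward traceability amounts to maintaining a bidirectional index between dataset versions and models. A minimal in-memory sketch; a real deployment would back this with the model registry rather than a Python object:

```python
from collections import defaultdict

class LineageIndex:
    """Bidirectional mapping between dataset versions and trained models."""

    def __init__(self):
        self._models_by_dataset = defaultdict(set)  # dataset hash -> model IDs
        self._datasets_by_model = defaultdict(set)  # model ID -> dataset hashes

    def record_training(self, model_id: str, dataset_hashes: list) -> None:
        """Call once per training run, with every dataset version it consumed."""
        for h in dataset_hashes:
            self._models_by_dataset[h].add(model_id)
            self._datasets_by_model[model_id].add(h)

    def datasets_for(self, model_id: str) -> set:
        """Backward traceability: from any model, find its data."""
        return set(self._datasets_by_model[model_id])

    def models_for(self, dataset_hash: str) -> set:
        """Forward traceability: from any data, find all models trained on it."""
        return set(self._models_by_dataset[dataset_hash])
```

If a dataset version is later found to be compromised, `models_for` returns every model that needs remediation, which is exactly the targeted-response capability described below.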
**Lineage is a security control**
When a production model behaves unexpectedly, lineage is how you investigate. Without it, incident response becomes guesswork. Lineage also enables targeted remediation: if a data source is compromised, you can identify every model that was trained on it.
Data versioning¶
Data changes over time. Models trained on different versions of the same dataset can behave differently. Without versioning, you lose reproducibility and auditability.
Versioning approaches¶
| Approach | Trade-offs |
|---|---|
| File-based (DVC) | Versions data files alongside code in Git. Lightweight. Requires discipline in tagging |
| Snapshot-based (LakeFS) | Git-like branching for data lakes. Good for large datasets. Adds infrastructure |
| Pipeline-integrated (Pachyderm) | Versioning built into the pipeline. Automatic. Platform-dependent |
| Manual hashing | Compute and store hashes of datasets. Minimal tooling. No rollback capability |
Whichever approach you choose, the principle is the same: every training run must reference a specific, immutable version of its data, and that version must be retrievable.
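That principle can be enforced even with the manual-hashing approach: recompute the digest at retrieval time and fail fast on any mismatch. A minimal sketch; the exception name is illustrative:

```python
import hashlib

class DatasetVersionMismatch(Exception):
    """Raised when retrieved data does not match its recorded version hash."""

def load_verified(path: str, expected_sha256: str) -> bytes:
    """Retrieve a dataset and refuse to proceed unless it is the exact recorded version."""
    with open(path, "rb") as f:
        data = f.read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise DatasetVersionMismatch(
            f"expected {expected_sha256[:12]}..., got {actual[:12]}..."
        )
    return data
```

A versioning tool such as DVC or LakeFS performs an equivalent integrity check for you; the guard above is the fallback when no such tooling is in place.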
Data classification for ML¶
Not all data is equally sensitive, and not all data is suitable for training. Data classification for ML adds training-specific considerations to your existing classification scheme.
Classification dimensions¶
Sensitivity:
- Public: Open datasets, published benchmarks. Low risk, but verify provenance
- Internal: Organisation-specific data without personal or regulated content
- Confidential: Contains personal data, trade secrets, or regulated content
- Restricted: Highly regulated data (health records, financial data, classified information)
Training suitability:
- Approved for training: Data has been reviewed and cleared for use in ML training
- Approved with conditions: Can be used for training only with specific controls (anonymisation, aggregation, geographic restrictions)
- Not approved for training: Data exists in the organisation but must not be used for training (legal restrictions, licensing, consent limitations)
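A suitability classification only has teeth if the pipeline checks it before training starts. A minimal sketch, assuming each dataset carries a `training_status` label and, for conditional approvals, a list of required controls; all names here are illustrative:

```python
from enum import Enum

class TrainingStatus(Enum):
    APPROVED = "approved"
    APPROVED_WITH_CONDITIONS = "approved_with_conditions"
    NOT_APPROVED = "not_approved"

def authorize_training(dataset_meta: dict, applied_controls: set) -> None:
    """Block training unless the dataset's classification permits it."""
    status = TrainingStatus(dataset_meta["training_status"])
    if status is TrainingStatus.NOT_APPROVED:
        raise PermissionError(f"{dataset_meta['name']}: not approved for training")
    if status is TrainingStatus.APPROVED_WITH_CONDITIONS:
        # e.g. required_controls = ["anonymisation"]; all must be applied
        missing = set(dataset_meta.get("required_controls", [])) - applied_controls
        if missing:
            raise PermissionError(
                f"{dataset_meta['name']}: missing required controls {sorted(missing)}"
            )
```

Running this check in the orchestrator, rather than trusting each training job, is what separates training rights from mere data access.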
**Access does not equal training rights**
A data analyst may have access to customer data for reporting purposes. That does not mean the same data can be used to train a model. Training creates a derivative work that embeds information from the data in the model's weights. Separate authorisation for training use is essential.
Practical classification checklist¶
- Does the data contain personal information (PII, PHI)?
- What licences or terms of service apply to the data?
- Was consent obtained for ML training use specifically?
- Does the data cross jurisdictional boundaries (GDPR, POPIA, CCPA)?
- Is the data from a trusted, verified source?
- Has the data been reviewed for bias or representational issues?
- Is there a retention policy that affects how long models trained on this data can remain in use?
Data quality for ML¶
Data quality in ML is statistical, not just structural. A dataset can be perfectly formatted and still produce a harmful model if it contains bias, is unrepresentative, or has been subtly poisoned.
Quality dimensions¶
| Dimension | What to check |
|---|---|
| Completeness | Missing values, gaps in coverage, underrepresented categories |
| Consistency | Contradictory labels, conflicting records, schema drift between versions |
| Accuracy | Label correctness, measurement precision, source reliability |
| Timeliness | Data age relative to the prediction task, temporal distribution |
| Distribution | Class balance, feature distributions, outlier presence |
| Provenance | Source trustworthiness, collection methodology, chain of custody |
Automated quality gates¶
Build data quality checks into your pipeline, not as a manual review step.
```python
# Example: basic data quality gate before training
class DataQualityError(Exception):
    """Raised when training data fails a pre-training quality check."""

def validate_training_data(df, config):
    """Run before training begins; df is a pandas DataFrame.

    Fail the pipeline if data quality is below threshold.
    """
    checks = {
        "null_rate": df.isnull().mean().max() < config.max_null_rate,
        "min_rows": len(df) >= config.min_training_rows,
        "label_balance": df[config.label_col].value_counts(normalize=True).min()
        > config.min_class_ratio,
        "schema_match": set(df.columns) == set(config.expected_columns),
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise DataQualityError(f"Data quality checks failed: {failed}")
```
Regulatory considerations¶
ML training creates new regulatory obligations that traditional data governance may not cover.
Right to erasure and model retention¶
Under GDPR and similar regulations, individuals can request deletion of their personal data. If that data was used to train a model, the model may retain information derived from the deleted data in its weights. This creates a tension between data deletion obligations and model retention.
Practical approaches:
- Track which datasets contain personal data and which models were trained on them
- When data is deleted, assess whether affected models need retraining
- For high-sensitivity models, consider differential privacy techniques during training to limit memorisation
- Document your approach and the rationale for your retention decisions
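The first two steps above combine into a simple impact assessment: given which datasets contain the data subject's records, and which models were trained on which datasets, list the models that may retain derived information. A minimal sketch; the mapping structures are illustrative:

```python
def models_affected_by_erasure(subject_datasets, training_lineage):
    """Identify models that may need retraining after a deletion request.

    subject_datasets: iterable of dataset IDs containing the subject's records.
    training_lineage: mapping of model ID -> set of dataset IDs it was trained on.
    Returns a sorted list of model IDs that may retain derived information.
    """
    affected = set(subject_datasets)
    return sorted(
        model_id
        for model_id, datasets in training_lineage.items()
        if affected & set(datasets)
    )
```

Whether an affected model is actually retrained is then a documented risk decision, but the assessment itself should be automatic.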
Cross-border data flows¶
Training data may originate from multiple jurisdictions. The model itself may be deployed across borders. Data governance must account for:
- Where training data is stored and processed
- Whether data transfer agreements are in place
- Whether the model's deployment locations are consistent with data origin restrictions
- Local regulations on AI training (the EU AI Act imposes specific obligations on training data documentation)
Training data documentation¶
Regulatory frameworks increasingly require documentation of training data. Model cards and datasheets provide structured formats for this.
What to document:
- Data sources and their provenance
- Collection methodology and timeframe
- Preprocessing and filtering steps applied
- Known limitations, biases, or gaps in the data
- Licence and consent information
- PII handling and anonymisation methods used
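One way to keep this documentation enforceable is to treat the datasheet as structured data and gate model registration on its completeness. A minimal sketch; the field names are illustrative, loosely mirroring the list above rather than any formal datasheet schema:

```python
REQUIRED_DATASHEET_FIELDS = {
    "sources", "collection_method", "collection_timeframe",
    "preprocessing_steps", "known_limitations", "licences",
    "consent_basis", "pii_handling",
}

def validate_datasheet(datasheet: dict) -> list:
    """Return the required documentation fields that are missing or empty."""
    return sorted(
        field for field in REQUIRED_DATASHEET_FIELDS
        if not datasheet.get(field)
    )
```

A registration pipeline would refuse any model whose datasheet returns a non-empty list here, turning documentation from a best effort into a hard gate.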
**The EU AI Act and training data**
The EU AI Act requires providers of high-risk AI systems to document their training data, including data governance measures, data preparation steps, and any assumptions about the data. Organisations operating in or serving EU markets should treat training data documentation as a compliance requirement, not optional good practice.
Connecting data governance to MLOps¶
Data governance is not a separate activity. It integrates directly into the ML pipeline.
| Pipeline stage | Data governance control |
|---|---|
| Data collection | Source approval, provenance recording, classification |
| Data preparation | Transformation logging, quality validation, bias assessment |
| Training | Dataset versioning, lineage linking, access control enforcement |
| Evaluation | Test data independence, evaluation data governance |
| Registration | Training data metadata stored with model, data lineage linked |
| Monitoring | Input data drift detection, data quality monitoring in production |
The model registry should link every model to its training data version. The experiment tracker should record which data was used for each experiment. The pipeline orchestrator should enforce data quality gates before training begins. When these connections exist, data governance becomes auditable, traceable, and enforceable rather than aspirational.