Data Governance for MLOps¶
Every model is a product of its data. If you do not govern the data flowing through your ML pipeline, you cannot trust the models that come out of it. Data governance for MLOps goes beyond traditional data governance. It must account for the unique ways ML systems consume, transform, and depend on data across training, evaluation, and production.
Why ML data governance is different¶
Traditional data governance focuses on storage, access, and retention. ML data governance adds several dimensions that traditional frameworks were not designed for.
| Traditional concern | ML-specific extension |
|---|---|
| Who can access data? | Who can use data for training? Access and training rights are not the same |
| Where is data stored? | Where was data stored when the model was trained? The data may have changed since |
| Is data accurate? | Is data accurate enough to train on? Statistical quality, not just correctness |
| Data retention | Model retention depends on data retention: retiring data may require retiring models |
| Compliance | Training on personal data creates new obligations (right to erasure affects trained models) |
Data lineage¶
Data lineage tracks the origin, movement, and transformation of data throughout the ML pipeline. It answers: where did this data come from, what happened to it, and which models were trained on it?
What to track¶
| Lineage element | Why it matters |
|---|---|
| Source | Where the data originated (database, API, vendor, scrape, synthetic generator) |
| Collection timestamp | When the data was acquired, critical for detecting temporal poisoning |
| Transformations | Every preprocessing step: cleaning, normalisation, augmentation, feature engineering |
| Schema | Column names, types, and constraints at each stage |
| Who modified it | Human edits, automated pipelines, or third-party processing |
| Which models used it | Forward traceability from data to every model trained on it |
Implementing lineage tracking¶
Lineage tracking does not require a dedicated platform on day one. Start with what you can implement now.
Minimum viable lineage:
- Hash every dataset version used for training
- Record the dataset hash in the training attestation (see Secure ML Pipelines)
- Store a manifest listing data sources, collection dates, and row counts
- Log all transformations applied to the data before training
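The first two steps above need nothing beyond the standard library. A minimal sketch, assuming the dataset is a single file; the manifest filename and field names are illustrative, not a standard format:

```python
import hashlib
import json
from pathlib import Path

def hash_dataset(path: str) -> str:
    """Compute a SHA-256 digest of a dataset file, streamed in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_manifest(dataset_path: str, source: str, collected: str, rows: int) -> dict:
    """Record minimum viable lineage for one dataset version next to the data."""
    manifest = {
        "dataset": dataset_path,
        "sha256": hash_dataset(dataset_path),
        "source": source,        # e.g. "vendor_feed" (illustrative value)
        "collected": collected,  # ISO date the data was acquired
        "row_count": rows,
    }
    Path(dataset_path).with_suffix(".manifest.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return manifest
```

The recorded `sha256` is the value to copy into the training attestation, so the attestation and the manifest agree on exactly which bytes were trained on.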
Mature lineage:
- Use a data versioning tool (DVC, LakeFS, Pachyderm) to version datasets alongside code
- Automate lineage capture in your pipeline orchestrator
- Build forward and backward traceability: from any model, find its data; from any data, find all models
- Integrate lineage with your model registry so every registered model links to its exact training data
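Forward and backward traceability amounts to maintaining a bidirectional index between dataset versions and models. A minimal in-memory sketch; a real deployment would back this with the model registry rather than a Python object:

```python
from collections import defaultdict

class LineageIndex:
    """Bidirectional mapping between dataset versions and trained models."""

    def __init__(self):
        self._models_by_dataset = defaultdict(set)  # dataset hash -> model IDs
        self._datasets_by_model = defaultdict(set)  # model ID -> dataset hashes

    def record_training(self, model_id: str, dataset_hashes: list) -> None:
        """Call once per training run, with every dataset version it consumed."""
        for h in dataset_hashes:
            self._models_by_dataset[h].add(model_id)
            self._datasets_by_model[model_id].add(h)

    def datasets_for(self, model_id: str) -> set:
        """Backward traceability: from any model, find its data."""
        return set(self._datasets_by_model[model_id])

    def models_for(self, dataset_hash: str) -> set:
        """Forward traceability: from any data, find all models trained on it."""
        return set(self._models_by_dataset[dataset_hash])
```

If a dataset version is later found to be compromised, `models_for` returns every model that needs remediation, which is exactly the targeted-response capability described below.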
**Lineage is a security control**
When a production model behaves unexpectedly, lineage is how you investigate. Without it, incident response becomes guesswork. Lineage also enables targeted remediation: if a data source is compromised, you can identify every model that was trained on it.
Data versioning¶
Data changes over time. Models trained on different versions of the same dataset can behave differently. Without versioning, you lose reproducibility and auditability.
Versioning approaches¶
| Approach | Trade-offs |
|---|---|
| File-based (DVC) | Versions data files alongside code in Git. Lightweight. Requires discipline in tagging |
| Snapshot-based (LakeFS) | Git-like branching for data lakes. Good for large datasets. Adds infrastructure |
| Pipeline-integrated (Pachyderm) | Versioning built into the pipeline. Automatic. Platform-dependent |
| Manual hashing | Compute and store hashes of datasets. Minimal tooling. No rollback capability |
Whichever approach you choose, the principle is the same: every training run must reference a specific, immutable version of its data, and that version must be retrievable.
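That principle can be enforced even with the manual-hashing approach: recompute the digest at retrieval time and fail fast on any mismatch. A minimal sketch; the exception name is illustrative:

```python
import hashlib

class DatasetVersionMismatch(Exception):
    """Raised when retrieved data does not match its recorded version hash."""

def load_verified(path: str, expected_sha256: str) -> bytes:
    """Retrieve a dataset and refuse to proceed unless it is the exact recorded version."""
    with open(path, "rb") as f:
        data = f.read()
    actual = hashlib.sha256(data).hexdigest()
    if actual != expected_sha256:
        raise DatasetVersionMismatch(
            f"expected {expected_sha256[:12]}..., got {actual[:12]}..."
        )
    return data
```

A versioning tool such as DVC or LakeFS performs an equivalent integrity check for you; the guard above is the fallback when no such tooling is in place.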
Data classification for ML¶
Not all data is equally sensitive, and not all data is suitable for training. Data classification for ML adds training-specific considerations to your existing classification scheme.
Classification dimensions¶
Sensitivity:
- Public: Open datasets, published benchmarks. Low risk, but verify provenance
- Internal: Organisation-specific data without personal or regulated content
- Confidential: Contains personal data, trade secrets, or regulated content
- Restricted: Highly regulated data (health records, financial data, classified information)
Training suitability:
- Approved for training: Data has been reviewed and cleared for use in ML training
- Approved with conditions: Can be used for training only with specific controls (anonymisation, aggregation, geographic restrictions)
- Not approved for training: Data exists in the organisation but must not be used for training (legal restrictions, licensing, consent limitations)
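A suitability classification only has teeth if the pipeline checks it before training starts. A minimal sketch, assuming each dataset carries a `training_status` label and, for conditional approvals, a list of required controls; all names here are illustrative:

```python
from enum import Enum

class TrainingStatus(Enum):
    APPROVED = "approved"
    APPROVED_WITH_CONDITIONS = "approved_with_conditions"
    NOT_APPROVED = "not_approved"

def authorize_training(dataset_meta: dict, applied_controls: set) -> None:
    """Block training unless the dataset's classification permits it."""
    status = TrainingStatus(dataset_meta["training_status"])
    if status is TrainingStatus.NOT_APPROVED:
        raise PermissionError(f"{dataset_meta['name']}: not approved for training")
    if status is TrainingStatus.APPROVED_WITH_CONDITIONS:
        # e.g. required_controls = ["anonymisation"]; all must be applied
        missing = set(dataset_meta.get("required_controls", [])) - applied_controls
        if missing:
            raise PermissionError(
                f"{dataset_meta['name']}: missing required controls {sorted(missing)}"
            )
```

Running this check in the orchestrator, rather than trusting each training job, is what separates training rights from mere data access.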
**Access does not equal training rights**
A data analyst may have access to customer data for reporting purposes. That does not mean the same data can be used to train a model. Training creates a derivative work that embeds information from the data in the model's weights. Separate authorisation for training use is essential.
Practical classification checklist¶
- Does the data contain personal information (PII, PHI)?
- What licences or terms of service apply to the data?
- Was consent obtained for ML training use specifically?
- Does the data cross jurisdictional boundaries (GDPR, POPIA, CCPA)?
- Is the data from a trusted, verified source?
- Has the data been reviewed for bias or representational issues?
- Is there a retention policy that affects how long models trained on this data can remain in use?
Data quality for ML¶
Data quality in ML is statistical, not just structural. A dataset can be perfectly formatted and still produce a harmful model if it contains bias, is unrepresentative, or has been subtly poisoned.
Quality dimensions¶
| Dimension | What to check |
|---|---|
| Completeness | Missing values, gaps in coverage, underrepresented categories |
| Consistency | Contradictory labels, conflicting records, schema drift between versions |
| Accuracy | Label correctness, measurement precision, source reliability |
| Timeliness | Data age relative to the prediction task, temporal distribution |
| Distribution | Class balance, feature distributions, outlier presence |
| Provenance | Source trustworthiness, collection methodology, chain of custody |
Automated quality gates¶
Build data quality checks into your pipeline, not as a manual review step.
```python
# Example: basic data quality gate before training
class DataQualityError(Exception):
    """Raised when training data fails a pre-training quality check."""

def validate_training_data(df, config):
    """Run before training begins; df is a pandas DataFrame.

    Fail the pipeline if data quality is below threshold.
    """
    checks = {
        "null_rate": df.isnull().mean().max() < config.max_null_rate,
        "min_rows": len(df) >= config.min_training_rows,
        "label_balance": df[config.label_col].value_counts(normalize=True).min()
        > config.min_class_ratio,
        "schema_match": set(df.columns) == set(config.expected_columns),
    }
    failed = [name for name, passed in checks.items() if not passed]
    if failed:
        raise DataQualityError(f"Data quality checks failed: {failed}")
```
Regulatory considerations¶
ML training creates new regulatory obligations that traditional data governance may not cover.
Right to erasure and model retention¶
Under GDPR and similar regulations, individuals can request deletion of their personal data. If that data was used to train a model, the model may retain information derived from the deleted data in its weights. This creates a tension between data deletion obligations and model retention.
Practical approaches:
- Track which datasets contain personal data and which models were trained on them
- When data is deleted, assess whether affected models need retraining
- For high-sensitivity models, consider differential privacy techniques during training to limit memorisation
- Document your approach and the rationale for your retention decisions
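The first two steps above combine into a simple impact assessment: given which datasets contain the data subject's records, and which models were trained on which datasets, list the models that may retain derived information. A minimal sketch; the mapping structures are illustrative:

```python
def models_affected_by_erasure(subject_datasets, training_lineage):
    """Identify models that may need retraining after a deletion request.

    subject_datasets: iterable of dataset IDs containing the subject's records.
    training_lineage: mapping of model ID -> set of dataset IDs it was trained on.
    Returns a sorted list of model IDs that may retain derived information.
    """
    affected = set(subject_datasets)
    return sorted(
        model_id
        for model_id, datasets in training_lineage.items()
        if affected & set(datasets)
    )
```

Whether an affected model is actually retrained is then a documented risk decision, but the assessment itself should be automatic.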
Cross-border data flows¶
Training data may originate from multiple jurisdictions. The model itself may be deployed across borders. Data governance must account for:
- Where training data is stored and processed
- Whether data transfer agreements are in place
- Whether the model's deployment locations are consistent with data origin restrictions
- Local regulations on AI training (the EU AI Act imposes specific obligations on training data documentation)
Training data documentation¶
Regulatory frameworks increasingly require documentation of training data. Model cards and datasheets provide structured formats for this.
What to document:
- Data sources and their provenance
- Collection methodology and timeframe
- Preprocessing and filtering steps applied
- Known limitations, biases, or gaps in the data
- Licence and consent information
- PII handling and anonymisation methods used
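One way to keep this documentation enforceable is to treat the datasheet as structured data and gate model registration on its completeness. A minimal sketch; the field names are illustrative, loosely mirroring the list above rather than any formal datasheet schema:

```python
REQUIRED_DATASHEET_FIELDS = {
    "sources", "collection_method", "collection_timeframe",
    "preprocessing_steps", "known_limitations", "licences",
    "consent_basis", "pii_handling",
}

def validate_datasheet(datasheet: dict) -> list:
    """Return the required documentation fields that are missing or empty."""
    return sorted(
        field for field in REQUIRED_DATASHEET_FIELDS
        if not datasheet.get(field)
    )
```

A registration pipeline would refuse any model whose datasheet returns a non-empty list here, turning documentation from a best effort into a hard gate.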
**The EU AI Act and training data**
The EU AI Act requires providers of high-risk AI systems to document their training data, including data governance measures, data preparation steps, and any assumptions about the data. Organisations operating in or serving EU markets should treat training data documentation as a compliance requirement, not optional good practice.
Connecting data governance to MLOps¶
Data governance is not a separate activity. It integrates directly into the ML pipeline.
| Pipeline stage | Data governance control |
|---|---|
| Data collection | Source approval, provenance recording, classification |
| Data preparation | Transformation logging, quality validation, bias assessment |
| Training | Dataset versioning, lineage linking, access control enforcement |
| Evaluation | Test data independence, evaluation data governance |
| Registration | Training data metadata stored with model, data lineage linked |
| Monitoring | Input data drift detection, data quality monitoring in production |
The model registry should link every model to its training data version. The experiment tracker should record which data was used for each experiment. The pipeline orchestrator should enforce data quality gates before training begins. When these connections exist, data governance becomes auditable, traceable, and enforceable rather than aspirational.