Secure ML Pipelines¶

An ML pipeline moves data through training, evaluation, and deployment. Each stage transforms inputs into outputs that the next stage depends on. If any stage is compromised, everything downstream is affected. Pipeline security means ensuring the integrity and trustworthiness of this entire chain.

Pipeline threat model¶

Before securing a pipeline, understand what can go wrong:

Threat	Attack surface	Impact
Data poisoning	Training data ingestion	Model learns attacker-chosen behaviour
Pipeline tampering	Pipeline definition, build environment	Arbitrary code runs during training
Model substitution	Model storage, registry	Wrong or malicious model is deployed
Credential theft	Pipeline secrets, environment variables	Attacker accesses data, models, or infrastructure
Configuration manipulation	Hyperparameters, serving config	Model behaviour altered without changing weights
Supply chain compromise	Framework dependencies, base images	Malicious code executes during training

Training integrity¶

Training is where the model is created. Compromising training compromises the model.

Data integrity¶

Verify data sources. Training data should come from approved, verified sources. Maintain an inventory of data sources with their classification and approval status.
Hash datasets. Compute and store hashes of training datasets. If the data changes unexpectedly, the hash mismatch signals a problem.
Access control on training data. Not everyone who can run a training job should have access to all training data. Apply least privilege to data access.
Data validation. Before training begins, validate that the data matches expected schema, distributions, and quality thresholds. Automated data validation catches poisoning attempts and data quality issues.

Compute integrity¶

Ephemeral training environments. Training should run on freshly provisioned infrastructure. Persistent environments accumulate risk.
Locked dependencies. Pin all framework versions, library versions, and system packages. Verify hashes of downloaded dependencies.
Network isolation. Training environments should have minimal network access. They need access to training data and model storage, not the internet.
Resource monitoring. Monitor training compute for anomalous behaviour: unexpected network traffic, unusual disk I/O, or resource usage patterns that do not match expected training workloads.

Reproducibility¶

Reproducibility is a security control, not just a research convenience. If you cannot reproduce a training run, you cannot verify that it produced the model you think it did.

Log everything. Random seeds, hyperparameters, data versions, framework versions, hardware configuration, environment variables.
Version data alongside code. Use tools like DVC to version training data alongside the code that processes it.
Deterministic where possible. Set random seeds, use deterministic algorithms where available, and document where non-determinism is unavoidable.
Verify reproducibility. Periodically re-run training and compare results. Significant divergence is a signal that something has changed.

Pipeline attestation¶

Pipeline attestation creates a verifiable record of what happened during a pipeline run. It answers: who ran this pipeline, with what inputs, using what code, and what did it produce?

What to attest¶

Pipeline definition. The exact pipeline code that was executed (commit hash, not just branch name).
Inputs. Data versions, model versions (for fine-tuning), configuration files.
Environment. Framework versions, dependency versions, hardware configuration.
Outputs. Model artefact hashes, evaluation metrics, logs.
Identity. Who or what triggered the pipeline run. Service accounts should be traceable to responsible humans.
Timing. When the pipeline ran, how long each stage took.

Attestation formats¶

Follow SLSA (Supply-chain Levels for Software Artifacts) principles adapted for ML:

SLSA Level	ML equivalent
Level 1	Pipeline definition is version-controlled, builds are logged
Level 2	Pipeline runs on a hosted, authenticated build service
Level 3	Pipeline definition is from a trusted source, builds are isolated
Level 4	All inputs are verified, builds are fully reproducible

Most organisations should target Level 2 as a minimum, with Level 3 for production-bound models.

Pipeline access control¶

Who can do what¶

Action	Who should be allowed
Define the pipeline	ML engineers, reviewed by security
Trigger a training run	ML engineers with project access
Access training data	Training jobs (service accounts), not humans directly
Publish a model to the registry	Pipeline automation only, not individuals
Promote a model to production	Requires explicit approval (see Model Lifecycle)
Modify pipeline configuration	Reviewed changes through version control

Separation of duties¶

No single person should be able to:

Modify training data AND approve the resulting model
Change the pipeline definition AND trigger a production deployment
Access production credentials AND modify model serving configuration

Separation of duties makes it harder for a single compromised account or malicious insider to impact the entire pipeline.

References