CI/CD for AI

Continuous integration and continuous delivery for AI systems extends traditional CI/CD with new artefact types, new validation steps, and new integrity requirements. The pipeline is the last line of defence before deployment. If it is compromised, every model it touches is suspect.

How AI CI/CD differs

Traditional CI/CD pipelines build, test, and deploy code. AI CI/CD pipelines must also handle:

  • Model artefacts that are orders of magnitude larger than code
  • Data dependencies that change independently of code
  • Non-deterministic tests that require statistical validation rather than binary pass/fail
  • Longer build times when training or fine-tuning is part of the pipeline
  • Multiple artefact types (code, model weights, configuration, prompt templates, evaluation datasets)
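As one illustration of statistical validation, a pipeline can gate on a confidence bound for an evaluation pass rate rather than on a single pass/fail run. The sketch below uses the Wilson score interval. The function names, the 95% confidence level (z = 1.96), and the thresholds are assumptions for illustration only.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for an observed pass rate."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def gate_passes(passes: int, trials: int, min_rate: float) -> bool:
    """Pass only if we are reasonably confident the true rate meets min_rate."""
    return wilson_lower_bound(passes, trials) >= min_rate
```

With this gate, 19 passes out of 20 does not clear a 0.95 threshold even though the observed rate is exactly 0.95, because 20 samples leave too much uncertainty. Running more evaluation samples tightens the bound.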

Pipeline integrity

A compromised pipeline is a compromised deployment. Integrity controls for AI pipelines include:

Build environment security

  • Ephemeral build environments. Build agents should be provisioned fresh for each run. Persistent build environments accumulate state and risk.
  • Locked dependencies. Pin all dependencies including ML framework versions, CUDA versions, and system libraries. Use lock files and verify hashes.
  • Minimal base images. Start with the smallest base image that works. Every additional package is attack surface.
  • No internet access during build. Where practical, restrict build environments from accessing the internet. Pull dependencies from an internal mirror or cache.
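The locked-dependencies control can be checked automatically. The following is a minimal sketch, assuming a pip-style requirements file in which every entry should carry an exact `==` pin and at least one `--hash=sha256:` value. The function name `unpinned_requirements` is hypothetical, and the parsing is deliberately simplified (it ignores environment markers and URL requirements).

```python
import re

# Matches "name==version" at the start of a requirement entry.
PINNED = re.compile(r"^[A-Za-z0-9._\[\]-]+==[^\s;]+")


def unpinned_requirements(text: str) -> list[str]:
    """Return entries that lack an exact version pin or a sha256 hash."""
    # Join backslash-continued lines so each requirement is on one line.
    joined = text.replace("\\\n", " ")
    problems = []
    for raw in joined.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments (simplified)
        if not line or line.startswith("-"):
            continue  # skip blanks and option lines such as --index-url
        if not PINNED.match(line) or "--hash=sha256:" not in line:
            problems.append(line.split()[0])
    return problems
```

A pipeline step could fail the build whenever this list is non-empty. pip's own `--require-hashes` mode enforces hashes at install time, so a check like this is a complement that fails earlier, not a replacement.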

Pipeline-as-code

  • Version control your pipeline. The pipeline definition lives in the same repository as the code. Changes to the pipeline go through the same review process as code changes.
  • Signed commits. Require signed commits for pipeline definition changes.
  • Branch protection. Protect the branch that triggers production deployments. Require reviews and status checks.
  • Audit trail. Every pipeline run should produce a complete log of what was built, what was tested, what was deployed, and who triggered it.
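An audit trail is easier to query when each run emits one structured record. The sketch below writes a single JSON line per pipeline run. The field names and the `audit_record` helper are assumptions for illustration; a real pipeline would take these values from its CI system.

```python
import json
from datetime import datetime, timezone


def audit_record(run_id, triggered_by, commit, artefacts, tests, deployed_to,
                 timestamp=None) -> str:
    """Serialise one pipeline run as a single JSON line."""
    when = timestamp or datetime.now(timezone.utc)
    record = {
        "run_id": run_id,
        "triggered_by": triggered_by,  # who started the run
        "commit": commit,              # what source was built
        "built": artefacts,            # artefact name -> digest
        "tested": tests,               # gate name -> outcome
        "deployed_to": deployed_to,    # target environment, or None
        "timestamp": when.isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Appending these lines to write-once storage keeps the log itself from being silently edited after the fact.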

Artefact integrity

  • Sign model artefacts. Every model artefact produced by the pipeline should be signed. Verify signatures before deployment.
  • Hash everything. Compute and store cryptographic hashes for all artefacts (models, data snapshots, configuration files).
  • Immutable artefact storage. Once an artefact is published, it should not be overwritable. Use immutable storage or versioned registries.
  • Provenance records. Generate and store provenance records (SLSA-style) that document how each artefact was produced.
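The hashing and signing controls above can be sketched with the Python standard library. This example hashes each artefact with SHA-256 and authenticates the resulting manifest with an HMAC. The HMAC is a symmetric stand-in chosen to keep the example self-contained; a production pipeline would more likely use asymmetric signatures from a dedicated signing tool. The function names are hypothetical, and the manifest assumes artefact file names are unique.

```python
import hashlib
import hmac
import json
from pathlib import Path


def sha256_file(path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large model files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def _sign(entries: dict, key: bytes) -> str:
    payload = json.dumps(entries, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def build_manifest(paths, key: bytes) -> dict:
    """Record a digest per artefact and an HMAC over the whole manifest."""
    entries = {Path(p).name: sha256_file(p) for p in paths}
    return {"artefacts": entries, "signature": _sign(entries, key)}


def verify_manifest(manifest: dict, paths, key: bytes) -> bool:
    """Check the manifest signature first, then every artefact digest."""
    expected = _sign(manifest["artefacts"], key)
    if not hmac.compare_digest(expected, manifest["signature"]):
        return False
    return all(
        manifest["artefacts"].get(Path(p).name) == sha256_file(p) for p in paths
    )
```

The deployment stage would call `verify_manifest` before loading any model and refuse to proceed on `False`.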

Validation gates

AI pipelines need validation gates that go beyond unit tests and linting.

Pre-merge gates

Before code or configuration changes merge:

  • Code review completed (human review, not just automated checks)
  • Unit tests pass
  • Linting and formatting checks pass
  • Security scanning (dependency vulnerabilities, secrets detection)
  • Prompt template validation (if prompt changes are included)

Pre-deployment gates

Before a model deploys to production:

  • Model provenance verified (hash, signature, source)
  • Model format validated (reject unsafe serialisation formats)
  • Performance benchmarks meet minimum thresholds
  • Safety evaluations pass (bias, toxicity, refusal rates)
  • Adversarial test suite passes (prompt injection, jailbreak resistance)
  • Regression tests pass (no degradation from previous version)
  • Resource requirements validated (fits within allocated compute)
  • Approval recorded (who approved this deployment, when, based on what evidence)
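Two of the gates above, format validation and regression testing, can be sketched as small checks whose results are aggregated into one deploy/no-deploy decision. The allowlist of suffixes, the tolerance, and all names here are assumptions for illustration. Pickle-based formats are left off the allowlist because loading them can execute arbitrary code.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Assumption: an illustrative allowlist of non-executable formats.
SAFE_SUFFIXES = {".safetensors", ".onnx"}


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def check_format(filename: str) -> GateResult:
    """Reject any artefact whose suffix is not on the allowlist."""
    suffix = PurePath(filename).suffix.lower()
    ok = suffix in SAFE_SUFFIXES
    return GateResult("model_format", ok, f"suffix {suffix or '(none)'}")


def check_regression(metric: str, candidate: float, baseline: float,
                     tolerance: float) -> GateResult:
    """Higher is better: fail if the candidate drops more than tolerance."""
    ok = candidate >= baseline - tolerance
    detail = f"candidate={candidate} baseline={baseline}"
    return GateResult(f"regression:{metric}", ok, detail)


def evaluate_gates(results: list[GateResult]) -> tuple[bool, list[str]]:
    """Deploy only if every gate passed; return the failures as evidence."""
    failures = [f"{r.name}: {r.detail}" for r in results if not r.passed]
    return (not failures, failures)
```

A suffix check is only a first filter, since a file extension does not prove the file's contents. The list of failures returned by `evaluate_gates` can be attached to the approval record as the evidence the decision was based on.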

Post-deployment verification

After deployment but before full traffic:

  • Canary deployment healthy (no errors, latency within bounds)
  • Output quality sampling passes spot checks
  • Monitoring and alerting confirmed active
  • Rollback tested and ready
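The canary check can be expressed as a comparison between the canary's traffic window and the current production baseline. This is a minimal sketch: the `WindowStats` type, the default thresholds, and the choice of p95 latency are all assumptions, and a real system would read these figures from its monitoring stack.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    requests: int
    errors: int
    p95_latency_ms: float


def canary_healthy(canary: WindowStats, baseline: WindowStats, *,
                   min_requests: int = 500,
                   max_error_rate_delta: float = 0.005,
                   max_latency_ratio: float = 1.2) -> tuple[bool, str]:
    """Compare canary traffic against the baseline; return (ok, reason)."""
    if canary.requests < min_requests:
        return False, "not enough canary traffic to judge"
    canary_rate = canary.errors / canary.requests
    baseline_rate = baseline.errors / baseline.requests if baseline.requests else 0.0
    if canary_rate > baseline_rate + max_error_rate_delta:
        return False, "error rate above baseline"
    if canary.p95_latency_ms > baseline.p95_latency_ms * max_latency_ratio:
        return False, "p95 latency above bound"
    return True, "healthy"
```

Treating too little traffic as a failure keeps the rollout from promoting a canary that has simply not been exercised yet.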

Managing large artefacts

Model files are usually too large to store directly in Git. Standard approaches:

Solution | Approach | Tradeoffs
Git LFS | Large files tracked by Git, stored externally | Simple, but versioning large files is expensive
DVC | Data and model versioning alongside code | Purpose-built for ML, integrates with Git
Model registry | Dedicated registry (MLflow, Weights & Biases, etc.) | Best for model lifecycle management
Object storage | S3, GCS, Azure Blob with naming conventions | Simple, flexible, requires discipline

Whichever approach you use, apply the same integrity controls: hash verification, access control, and immutability for production artefacts.
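For plain object storage, one way to build that discipline into the naming convention is to embed the artefact's digest in its key, so a changed file can never reuse an existing key. The key layout and the `artefact_key` helper below are hypothetical; they are one possible convention, not a standard.

```python
import re

_SAFE_PART = re.compile(r"[A-Za-z0-9._-]+")
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def artefact_key(model: str, version: str, sha256_hex: str, filename: str) -> str:
    """Build an object key of the form models/<model>/<version>/<digest prefix>/<file>."""
    for part in (model, version, filename):
        # Reject separators, empty parts, and dot-prefixed parts such as "..".
        if not _SAFE_PART.fullmatch(part) or part.startswith("."):
            raise ValueError(f"unsafe key component: {part!r}")
    if not _SHA256_HEX.fullmatch(sha256_hex):
        raise ValueError("expected a lowercase hex SHA-256 digest")
    return f"models/{model}/{version}/{sha256_hex[:16]}/{filename}"
```

The key only records the digest. The consumer still needs to re-hash the downloaded file and compare it against the full digest stored in the manifest or registry.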

Pipeline security anti-patterns

Shared credentials across environments. Development, staging, and production should each use their own credentials. Never share them, so that a leak in one environment does not expose the others.

Model files committed to Git. Large binary files in Git cause repository bloat and make history management difficult. Use a model registry or object storage.

No validation between training and deployment. A model that finishes training is not a model that is ready for production. Always validate.

Manual deployment steps. Any manual step is a step that can be skipped, done wrong, or done by the wrong person. Automate every deployment step, and limit human involvement to recorded review and approval gates.

Tests that always pass. If your AI tests have never failed, your thresholds are probably too lenient or your test cases too easy. Calibrate evaluation thresholds based on actual requirements, not convenience.