Infrastructure as Code¶

Infrastructure as Code (IaC) for AI workloads follows the same principles as traditional IaC: define infrastructure declaratively, version control it, review changes, and apply them through automated pipelines. AI workloads add specific resources and configurations that must be captured.

What AI adds to IaC¶

Traditional IaC manages compute, networking, storage, and services. AI workloads add:

Resource type	What to define	Security relevance
GPU instances	Instance types, driver versions, CUDA versions	Driver vulnerabilities, resource isolation
Model storage	Buckets, registries, access policies	Data protection, integrity, access control
Inference endpoints	Endpoint configuration, scaling policies, authentication	Exposure surface, authentication, rate limiting
Vector databases	Cluster configuration, access policies, encryption	RAG data protection, query access control
Experiment tracking	Server configuration, access policies	Experiment data protection, IP protection
Training infrastructure	Cluster configuration, spot instance policies	Data access during training, compute security

Principles for AI IaC¶

Everything in code, no exceptions¶

Every piece of AI infrastructure should be defined in code:

GPU cluster configuration
Model serving endpoint configuration
Network policies for AI workloads
IAM roles and policies for model access
Encryption configuration for model storage
Logging and monitoring configuration

Manual configuration is configuration that cannot be audited, reproduced, or reviewed.

Environment parity¶

Development, staging, and production environments should be as similar as possible. For AI workloads, this includes:

Same model serving framework version
Same inference configuration (not just the model, but the serving parameters)
Same network policies and access controls
Same logging and monitoring setup

Where environments must differ (smaller GPU instances in development, for example), document the differences explicitly and assess their impact on testing validity.

Least privilege for AI resources¶

AI workloads often accumulate broad permissions during development that persist into production.

Common over-permissions to watch for:

Training jobs with read access to all S3 buckets (should be restricted to specific training data buckets)
Inference endpoints with write access to model storage (inference should be read-only)
Notebooks with admin-level cloud permissions (should have only what is needed for the current task)
CI/CD pipelines with broad IAM permissions (scope to specific actions on specific resources)

Secrets as configuration, not in configuration¶

IaC files should reference secrets, not contain them. This means:

API keys stored in a secrets manager, referenced by ARN or path
Database credentials injected at runtime, not baked into templates
Model endpoint tokens pulled from a vault, not committed to the repository

See Secrets Management for detailed guidance.

Common IaC patterns for AI¶

Model serving endpoint¶

Define the complete serving configuration:

# Example: model serving endpoint definition (tool-agnostic)
model_endpoint:
  name: "product-classifier-v2"
  model_source: "s3://models/product-classifier/v2.1.0/model.safetensors"
  model_hash: "sha256:abc123..."
  instance_type: "ml.g5.xlarge"
  min_instances: 2
  max_instances: 10
  authentication: "iam"
  network:
    vpc_endpoint: true
    public_access: false
  logging:
    request_logging: true
    response_logging: true
    log_retention_days: 90
  monitoring:
    latency_alarm_threshold_ms: 500
    error_rate_alarm_threshold: 0.01

GPU cluster for training¶

# Example: training cluster definition
training_cluster:
  name: "model-training"
  instance_type: "p4d.24xlarge"
  max_nodes: 4
  spot_enabled: true
  spot_fallback: "on-demand"
  network:
    subnet: "private-training-subnet"
    security_group: "training-sg"
    internet_access: false
  storage:
    training_data: "s3://training-data/approved/"
    model_output: "s3://models/training-output/"
    temp_storage_gb: 500
  iam_role: "training-execution-role"
  encryption:
    ebs_encryption: true
    kms_key: "arn:aws:kms:..."

Drift detection¶

IaC is only effective if actual infrastructure matches the defined state. Drift, where actual configuration diverges from IaC, is a security risk.

Detect drift regularly. Run plan/diff operations on a schedule, not just before changes.

Alert on drift. Unexpected changes to AI infrastructure should trigger alerts. Someone manually changing a model endpoint's configuration bypasses your review process.

Remediate automatically where safe. For non-destructive drift (a security group rule was added manually), consider automatic remediation. For destructive drift (a resource was modified), alert and investigate.

References