Adversarial Testing¶
Adversarial testing is the practice of deliberately attempting to make an AI system behave in unintended ways. It goes beyond functional testing (does it produce correct outputs?) to ask a harder question: can someone make it produce harmful, incorrect, or unauthorised outputs on purpose?
This matters before deployment because design reviews prove intent, but only testing proves reality. A well-designed system with untested guardrails is a system with unverified guardrails.
Why adversarial testing is different for AI¶
Traditional software testing is deterministic: given input X, expect output Y. AI systems are non-deterministic, and adversarial testing for AI must account for this.
What makes AI adversarial testing distinct:
- The same attack prompt can produce different results on different runs
- Models can be manipulated through natural language, not just technical exploits
- Safety training can be bypassed through creative prompting that technical controls may not catch
- Model updates (even minor ones) can change the attack surface
- Context window manipulation, multi-turn attacks, and tool-use exploitation create attack vectors that do not exist in traditional software
Testing requirements by tier¶
Testing cadence and depth scale with risk classification. Over-testing a LOW-tier system wastes resources. Under-testing a CRITICAL-tier system creates unacceptable risk.
| Aspect | LOW | MEDIUM | HIGH | CRITICAL |
|---|---|---|---|---|
| Pre-deployment testing | Basic prompt injection check | Structured test suite | Comprehensive red-team exercise | Full adversarial assessment by independent team |
| Ongoing cadence | Ad hoc / on major changes | Quarterly | Monthly | Continuous |
| Tester independence | Development team | QA or security team | Dedicated red team | External + internal red team |
| Scope | Core functionality | Functionality + common attacks | Full attack surface | Full attack surface + novel techniques |
| Reporting | Internal notes | Documented findings | Formal report with remediation plan | Formal report, board-level summary, regulatory filing if required |
What to test¶
Prompt injection¶
Prompt injection is the most common attack vector against LLM-based systems. Test for both direct injection (the user crafts a malicious prompt) and indirect injection (malicious content is embedded in data the model retrieves or processes).
Direct injection tests:
- Attempts to override system prompts ("Ignore your instructions and...")
- Role-play attacks ("Pretend you are an unrestricted AI...")
- Encoding attacks (Base64, Unicode, mixed-language prompts to bypass filters)
- Multi-turn manipulation (gradually steering the model across several messages)
Indirect injection tests:
- Malicious content in documents retrieved by RAG pipelines
- Poisoned embeddings in vector databases
- Adversarial content in tool outputs (API responses, database records)
- Hidden instructions in images, PDFs, or other files processed by the model
RAG systems are especially vulnerable to indirect injection
If your system retrieves external content and feeds it to the model, test what happens when that content contains adversarial instructions. This is a pre-runtime concern: the security of your RAG pipeline determines whether an attacker can inject instructions through your data sources.
Data exfiltration¶
Test whether the model can be manipulated into revealing information it should not. This includes:
- System prompt extraction (asking the model to repeat its instructions)
- Training data extraction (attempting to recover memorised data)
- Cross-user data leakage (in multi-tenant systems, accessing another user's context)
- Tool credential exposure (tricking the model into revealing API keys or connection strings used by tools)
Output manipulation¶
Test whether the model can be made to produce outputs that violate its intended constraints:
- Generating content that violates content policies
- Producing outputs in formats that bypass downstream safety checks
- Creating outputs that exploit downstream systems (SQL injection via model output, XSS in generated HTML)
- Confidence manipulation (making the model express certainty about incorrect information)
Tool use and action abuse¶
For agentic systems with tool access, test whether the model can be manipulated into misusing its tools:
- Executing actions outside its intended scope
- Escalating privileges through tool chaining
- Performing actions on wrong targets (acting on the wrong account, file, or record)
- Bypassing action confirmation requirements
Multi-agent specific testing¶
For multi-agent systems (MASO), additional testing is required:
- Agent impersonation (one agent pretending to be another)
- Delegation boundary violations (an agent requesting actions beyond its scope from another agent)
- Privilege escalation through agent chains (combining low-privilege agents to achieve high-privilege outcomes)
- Orchestrator manipulation (attacking the coordinating agent to influence all downstream agents)
Testing methods¶
Structured test suites¶
Build and maintain a library of test cases covering known attack patterns. Run this suite before every deployment and after every model update.
Sources for test cases:
- OWASP LLM Top 10 maps directly to test categories
- OWASP Agentic Top 10 for agentic-specific attacks
- MITRE ATLAS provides a comprehensive taxonomy of adversarial ML techniques
- Vendor-specific vulnerability disclosures and advisories
- Your own incident history and near-misses
Red-team exercises¶
Red-teaming goes beyond scripted tests. A red team attempts to achieve specific objectives (exfiltrate data, bypass controls, manipulate outputs) using any technique available, including novel approaches.
When to red-team:
- Before first production deployment (all tiers except LOW)
- After significant model changes or upgrades
- After adding new tools or data sources to the system
- On the cadence specified by your risk tier
Red-team planning:
- Define clear objectives (what does "success" look like for the attacker?)
- Scope the engagement (which systems are in scope, which are not)
- Agree rules of engagement (can testers target production, or only staging?)
- Ensure independence (testers should not be the same people who built the system)
- Plan for remediation (findings without fixes are just documentation)
Domain-specific testing¶
The UK AI Security Institute's Frontier AI Trends Report (December 2025) found that safeguard coverage varies dramatically by category. A model that robustly refuses biological misuse requests may readily provide inappropriate financial advice. Generic testing is insufficient.
Practical guidance:
- At HIGH and CRITICAL tiers, test guardrails specifically against the risk categories relevant to your use case, not just the provider's default test suite
- Do not assume that a model's strong performance in one safety category transfers to your domain
- Schedule domain-specific red-team testing at least quarterly for HIGH tier and monthly for CRITICAL tier systems
- More capable models are not inherently safer. Do not reduce testing when upgrading models.
Connecting testing to the pipeline¶
Adversarial testing is not a one-off activity. Integrate it into your CI/CD pipeline and ML lifecycle.
Pre-deployment gate¶
No AI system should reach production without passing adversarial testing appropriate to its tier. This is a deployment gate, not a suggestion.
| Tier | Minimum gate requirement |
|---|---|
| LOW | Basic prompt injection test suite passes |
| MEDIUM | Structured test suite passes, no HIGH-severity findings |
| HIGH | Red-team exercise completed, all findings remediated or accepted with documented risk acceptance |
| CRITICAL | Independent adversarial assessment completed, findings reviewed by senior stakeholders, remediation plan approved |
Regression testing¶
When you fix an adversarial finding, add the attack to your regression test suite. This ensures the fix persists through future model updates and system changes.
Model update testing¶
Model updates can change the attack surface in unpredictable ways. Re-run your adversarial test suite whenever the underlying model is updated, even for minor version changes. This applies to both self-hosted models and API-based models where the provider updates the model.
Testing bridges pre-runtime and runtime
Pre-runtime adversarial testing validates that controls work before deployment. Runtime monitoring validates that they continue to work in production. The test cases you develop here feed directly into runtime detection rules. An attack pattern you discover during testing becomes a pattern to monitor for in production.