Skip to content

Adversarial Testing

Adversarial testing is the practice of deliberately attempting to make an AI system behave in unintended ways. It goes beyond functional testing (does it produce correct outputs?) to ask a harder question: can someone make it produce harmful, incorrect, or unauthorised outputs on purpose?

This matters before deployment because design reviews prove intent, but only testing proves reality. A well-designed system with untested guardrails is a system with unverified guardrails.

Why adversarial testing is different for AI

Traditional software testing is deterministic: given input X, expect output Y. AI systems are non-deterministic, and adversarial testing for AI must account for this.

What makes AI adversarial testing distinct:

  • The same attack prompt can produce different results on different runs
  • Models can be manipulated through natural language, not just technical exploits
  • Safety training can be bypassed through creative prompting that technical controls may not catch
  • Model updates (even minor ones) can change the attack surface
  • Context window manipulation, multi-turn attacks, and tool-use exploitation create attack vectors that do not exist in traditional software

Testing requirements by tier

Testing cadence and depth scale with risk classification. Over-testing a LOW-tier system wastes resources. Under-testing a CRITICAL-tier system creates unacceptable risk.

Aspect LOW MEDIUM HIGH CRITICAL
Pre-deployment testing Basic prompt injection check Structured test suite Comprehensive red-team exercise Full adversarial assessment by independent team
Ongoing cadence Ad hoc / on major changes Quarterly Monthly Continuous
Tester independence Development team QA or security team Dedicated red team External + internal red team
Scope Core functionality Functionality + common attacks Full attack surface Full attack surface + novel techniques
Reporting Internal notes Documented findings Formal report with remediation plan Formal report, board-level summary, regulatory filing if required

What to test

Prompt injection

Prompt injection is the most common attack vector against LLM-based systems. Test for both direct injection (the user crafts a malicious prompt) and indirect injection (malicious content is embedded in data the model retrieves or processes).

Direct injection tests:

  • Attempts to override system prompts ("Ignore your instructions and...")
  • Role-play attacks ("Pretend you are an unrestricted AI...")
  • Encoding attacks (Base64, Unicode, mixed-language prompts to bypass filters)
  • Multi-turn manipulation (gradually steering the model across several messages)

Indirect injection tests:

  • Malicious content in documents retrieved by RAG pipelines
  • Poisoned embeddings in vector databases
  • Adversarial content in tool outputs (API responses, database records)
  • Hidden instructions in images, PDFs, or other files processed by the model

RAG systems are especially vulnerable to indirect injection

If your system retrieves external content and feeds it to the model, test what happens when that content contains adversarial instructions. This is a pre-runtime concern: the security of your RAG pipeline determines whether an attacker can inject instructions through your data sources.

Data exfiltration

Test whether the model can be manipulated into revealing information it should not. This includes:

  • System prompt extraction (asking the model to repeat its instructions)
  • Training data extraction (attempting to recover memorised data)
  • Cross-user data leakage (in multi-tenant systems, accessing another user's context)
  • Tool credential exposure (tricking the model into revealing API keys or connection strings used by tools)

Output manipulation

Test whether the model can be made to produce outputs that violate its intended constraints:

  • Generating content that violates content policies
  • Producing outputs in formats that bypass downstream safety checks
  • Creating outputs that exploit downstream systems (SQL injection via model output, XSS in generated HTML)
  • Confidence manipulation (making the model express certainty about incorrect information)

Tool use and action abuse

For agentic systems with tool access, test whether the model can be manipulated into misusing its tools:

  • Executing actions outside its intended scope
  • Escalating privileges through tool chaining
  • Performing actions on wrong targets (acting on the wrong account, file, or record)
  • Bypassing action confirmation requirements

Multi-agent specific testing

For multi-agent systems (MASO), additional testing is required:

  • Agent impersonation (one agent pretending to be another)
  • Delegation boundary violations (an agent requesting actions beyond its scope from another agent)
  • Privilege escalation through agent chains (combining low-privilege agents to achieve high-privilege outcomes)
  • Orchestrator manipulation (attacking the coordinating agent to influence all downstream agents)

Testing methods

Structured test suites

Build and maintain a library of test cases covering known attack patterns. Run this suite before every deployment and after every model update.

Sources for test cases:

  • OWASP LLM Top 10 maps directly to test categories
  • OWASP Agentic Top 10 for agentic-specific attacks
  • MITRE ATLAS provides a comprehensive taxonomy of adversarial ML techniques
  • Vendor-specific vulnerability disclosures and advisories
  • Your own incident history and near-misses

Red-team exercises

Red-teaming goes beyond scripted tests. A red team attempts to achieve specific objectives (exfiltrate data, bypass controls, manipulate outputs) using any technique available, including novel approaches.

When to red-team:

  • Before first production deployment (all tiers except LOW)
  • After significant model changes or upgrades
  • After adding new tools or data sources to the system
  • On the cadence specified by your risk tier

Red-team planning:

  • Define clear objectives (what does "success" look like for the attacker?)
  • Scope the engagement (which systems are in scope, which are not)
  • Agree rules of engagement (can testers target production, or only staging?)
  • Ensure independence (testers should not be the same people who built the system)
  • Plan for remediation (findings without fixes are just documentation)

Domain-specific testing

The UK AI Security Institute's Frontier AI Trends Report (December 2025) found that safeguard coverage varies dramatically by category. A model that robustly refuses biological misuse requests may readily provide inappropriate financial advice. Generic testing is insufficient.

Practical guidance:

  • At HIGH and CRITICAL tiers, test guardrails specifically against the risk categories relevant to your use case, not just the provider's default test suite
  • Do not assume that a model's strong performance in one safety category transfers to your domain
  • Schedule domain-specific red-team testing at least quarterly for HIGH tier and monthly for CRITICAL tier systems
  • More capable models are not inherently safer. Do not reduce testing when upgrading models.

Connecting testing to the pipeline

Adversarial testing is not a one-off activity. Integrate it into your CI/CD pipeline and ML lifecycle.

Pre-deployment gate

No AI system should reach production without passing adversarial testing appropriate to its tier. This is a deployment gate, not a suggestion.

Tier Minimum gate requirement
LOW Basic prompt injection test suite passes
MEDIUM Structured test suite passes, no HIGH-severity findings
HIGH Red-team exercise completed, all findings remediated or accepted with documented risk acceptance
CRITICAL Independent adversarial assessment completed, findings reviewed by senior stakeholders, remediation plan approved

Regression testing

When you fix an adversarial finding, add the attack to your regression test suite. This ensures the fix persists through future model updates and system changes.

Model update testing

Model updates can change the attack surface in unpredictable ways. Re-run your adversarial test suite whenever the underlying model is updated, even for minor version changes. This applies to both self-hosted models and API-based models where the provider updates the model.

Testing bridges pre-runtime and runtime

Pre-runtime adversarial testing validates that controls work before deployment. Runtime monitoring validates that they continue to work in production. The test cases you develop here feed directly into runtime detection rules. An attack pattern you discover during testing becomes a pattern to monitor for in production.