Controls: Guardrails, Judge, and Human Oversight¶
1. Guardrails¶
Real-time controls that block known-bad inputs and outputs.
Input Guardrails¶
| Control | What It Catches |
|---|---|
| Injection detection | Attempts to override system prompt |
| Encoding detection | Obfuscated attacks (Base64, hex, Unicode) |
| PII detection | Personal data in prompts |
| Content policy | Prohibited request types |
| Rate limiting | Abuse, enumeration |
| Length limits | Context stuffing |
Processing flow:
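The input checks above can be sketched as a small pipeline. The pattern lists, length limit, and function names below are illustrative assumptions for a sketch, not a production ruleset:

```python
import base64
import re

# Illustrative injection / PII patterns (assumptions, not a real ruleset).
INJECTION_PATTERNS = [
    r"ignore (all|previous|prior) instructions",
    r"you are now",
    r"system prompt",
]
PII_PATTERNS = [
    r"\b\d{3}-\d{2}-\d{4}\b",          # US SSN-style number
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b",    # email address
]
MAX_LENGTH = 8_000  # context-stuffing limit (illustrative)

def looks_base64(text: str) -> bool:
    """Crude encoding check: flag long base64-decodable runs."""
    for token in re.findall(r"[A-Za-z0-9+/=]{24,}", text):
        try:
            base64.b64decode(token, validate=True)
            return True
        except Exception:
            continue
    return False

def check_input(prompt: str) -> list[str]:
    """Return the list of guardrail findings; empty means pass."""
    findings = []
    if len(prompt) > MAX_LENGTH:
        findings.append("length_limit")
    lowered = prompt.lower()
    if any(re.search(p, lowered) for p in INJECTION_PATTERNS):
        findings.append("injection")
    if looks_base64(prompt):
        findings.append("encoding")
    if any(re.search(p, prompt) for p in PII_PATTERNS):
        findings.append("pii")
    return findings
```

Each check is independent, so a blocked request can report every rule it tripped rather than only the first.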
Output Guardrails¶
| Control | What It Catches |
|---|---|
| Content filtering | Harmful/inappropriate content |
| PII detection | Personal data leakage |
| Grounding check | Hallucination |
| Format validation | Malformed responses |
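As one example, the grounding check can be approximated by lexical overlap between the answer and the retrieved context. The threshold and word-level tokenization below are illustrative assumptions; production grounding checks typically use NLI or judge models instead:

```python
# Sketch of an output-side grounding check: flag answers whose content
# words barely overlap the retrieved context. Threshold is an assumption.

def grounding_score(answer: str, context: str) -> float:
    """Fraction of answer content words that also appear in the context."""
    answer_words = {w for w in answer.lower().split() if len(w) > 3}
    context_words = set(context.lower().split())
    if not answer_words:
        return 1.0
    return len(answer_words & context_words) / len(answer_words)

def check_output(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    findings = []
    if grounding_score(answer, context) < threshold:
        findings.append("grounding")
    return findings
```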
Limitations¶
Guardrails catch known patterns. They miss:

- Novel techniques
- Semantic variations
- Context-dependent violations
- Subtle policy violations
This is why the Judge provides the second layer.
See also
For practical implementation guidance, including international PII detection, RAG ingestion filtering, secrets scanning, alerting design, and guardrail exception governance, see Practical Guardrails.
2. Model-as-Judge¶
Evaluation of interactions for quality and policy compliance. The Judge can be a large LLM (for async assurance and complex reasoning) or a distilled SLM (for inline, real-time action screening). Both approaches can be combined: an SLM screens every action in under 50ms, while a large LLM audits a sample asynchronously.
See also
For model selection guidance, see Judge Model Selection.
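The combined pattern described above, a fast inline screen on every action plus an asynchronously audited sample, can be sketched as follows. The `fast_judge` callable, the `audit_queue`, and the 5% sample rate are illustrative assumptions:

```python
import random

AUDIT_SAMPLE_RATE = 0.05  # fraction of actions sent to the deep LLM audit (assumption)

def screen_action(action: str, fast_judge, audit_queue: list, rng=random) -> bool:
    """Inline path: block on the fast SLM judge; sample into the async audit queue."""
    allowed = fast_judge(action)          # must return within the latency budget
    if rng.random() < AUDIT_SAMPLE_RATE:  # deep audit happens off the hot path
        audit_queue.append(action)
    return allowed
```

The inline decision never waits on the large model; the audit queue is drained by a separate asynchronous process.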
What the Judge Does¶
| Function | Description |
|---|---|
| Policy compliance | Did the AI follow guidelines? |
| Quality assessment | Accurate, helpful, appropriate? |
| Anomaly detection | Unusual patterns? |
| Risk flagging | What needs human review? |
What the Judge Does NOT Do¶
- Block transactions in real-time
- Make final decisions
- Replace human judgment
The Judge surfaces findings. Humans decide actions.
Architecture¶
Evaluation Criteria¶
| Criterion | Scoring |
|---|---|
| Policy adherence | Pass / Minor / Major violation |
| Accuracy | Verified / Unverified / Incorrect |
| Appropriateness | Appropriate / Borderline / Inappropriate |
| Safety | Safe / Uncertain / Concerning |
Output: PASS / REVIEW / ESCALATE
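Mapping the four criterion scores to a single verdict can be sketched as below. The precedence rules (any major finding escalates; anything short of a clean slate goes to review) are illustrative assumptions, not a prescribed policy:

```python
# Sketch: collapse per-criterion scores into PASS / REVIEW / ESCALATE.

def verdict(policy: str, accuracy: str, appropriateness: str, safety: str) -> str:
    scores = (policy, accuracy, appropriateness, safety)
    # Any severe finding escalates immediately (assumed precedence).
    if "Major violation" in scores or "Incorrect" in scores or "Concerning" in scores:
        return "ESCALATE"
    # Anything short of a fully clean result goes to human review.
    if scores != ("Pass", "Verified", "Appropriate", "Safe"):
        return "REVIEW"
    return "PASS"
```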
Deployment Phases¶
| Phase | Action on Findings |
|---|---|
| Shadow | Log only, measure accuracy |
| Advisory | Surface to humans, learn from feedback |
| Operational | Findings drive workflows |
Start in shadow mode. Validate accuracy before acting.
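Phase-gating what the system does with a Judge finding keeps the rollout reversible. The effect names below are illustrative assumptions; the phase names follow the table above:

```python
# Sketch: side effects applied to a Judge finding in each deployment phase.

def finding_effects(phase: str) -> list[str]:
    effects = ["log"]                       # every phase logs, for accuracy measurement
    if phase in ("advisory", "operational"):
        effects.append("surface_to_human")  # advisory adds human visibility
    if phase == "operational":
        effects.append("drive_workflow")    # operational findings drive workflows
    return effects
```

Moving between phases is then a configuration change, not a code change.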
Accuracy¶
The Judge will make mistakes.
| Error | Impact | Mitigation |
|---|---|---|
| False positive | Unnecessary review | Tune prompts |
| False negative | Missed violations | Human sampling |
Target: >90% agreement with human reviewers.
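Measuring that agreement is straightforward: compare Judge verdicts with human verdicts on the same sampled interactions. The verdict labels and the 0.90 threshold below follow the text; the function names are assumptions:

```python
# Sketch: Judge-vs-human agreement rate against the >90% target.

def agreement_rate(judge_verdicts: list[str], human_verdicts: list[str]) -> float:
    if len(judge_verdicts) != len(human_verdicts) or not judge_verdicts:
        raise ValueError("need equal-length, non-empty verdict lists")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    return matches / len(judge_verdicts)

def meets_target(rate: float, target: float = 0.90) -> bool:
    return rate > target
```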
3. Human Oversight (HITL)¶
Humans review findings, make decisions, remain accountable.
Triggers¶
| Trigger | Response |
|---|---|
| Judge flag | Review interaction |
| Guardrail block | Review if legitimate |
| User escalation | Human takes over |
| Sampling | Quality assurance |
| Threshold breach | Investigate pattern |
Queue Design¶
| Queue | SLA | Reviewer |
|---|---|---|
| Critical | 1h | Senior + expert |
| High | 4h | Domain expert |
| Standard | 24h | Trained reviewer |
| Sampling | 72h | QA team |
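Routing a finding into a queue with an SLA deadline can be sketched directly from the table above. The severity keys and the shape of the returned record are illustrative assumptions:

```python
from datetime import datetime, timedelta

# SLAs from the queue table above.
QUEUES = {
    "critical": timedelta(hours=1),
    "high": timedelta(hours=4),
    "standard": timedelta(hours=24),
    "sampling": timedelta(hours=72),
}

def enqueue(severity: str, flagged_at: datetime) -> dict:
    """Assign a finding to a queue and compute its SLA deadline."""
    return {"queue": severity, "due_by": flagged_at + QUEUES[severity]}
```

Anything past its `due_by` can then feed a threshold-breach trigger from the table in the previous section.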
Actions¶
| Action | When |
|---|---|
| Approve | Interaction appropriate |
| Correct | Minor issue, fixable |
| Escalate | Needs senior review |
| Block user | Abuse detected |
| Tune | False positive |
Prevent Rubber-Stamping¶
| Control | Purpose |
|---|---|
| Canary cases | Verify reviewers catch known-bad |
| Time tracking | Flag too-fast reviews |
| Volume limits | Prevent fatigue |
| Inter-rater checks | Measure consistency |
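Two of these controls, time tracking and canary cases, reduce to simple per-reviewer checks. The thresholds below are illustrative assumptions:

```python
# Sketch: flag reviewers who review too fast or miss known-bad canaries.

MIN_REVIEW_SECONDS = 20     # reviews faster than this look like rubber-stamping (assumption)
MIN_CANARY_ACCURACY = 0.95  # required catch rate on known-bad canary cases (assumption)

def reviewer_flags(review_times: list[float], canary_results: list[bool]) -> list[str]:
    """Return quality flags for one reviewer's recent activity."""
    flags = []
    if review_times and min(review_times) < MIN_REVIEW_SECONDS:
        flags.append("too_fast")
    if canary_results and sum(canary_results) / len(canary_results) < MIN_CANARY_ACCURACY:
        flags.append("missed_canaries")
    return flags
```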
Going Deeper¶
| Topic | Document |
|---|---|
| What these controls cost in production | Cost & Latency - latency budgets, sampling strategies, tiered evaluation cascade |
| Judge accuracy, drift, and adversarial failure | Judge Assurance · When the Judge Can Be Fooled |
| Practical guardrail configurations | Practical Guardrails - what to turn on first, encoding detection, international PII |
| When HITL doesn't scale | Humans in the Business Process - using existing business process checkpoints as a detection layer |
| Controls for multi-agent systems | MASO Framework - 128 controls across 7 domains for agent orchestration |
| Controls for reasoning models (o1, etc.) | Reasoning Model Controls - trace scanning, instruction adherence, consistency checks |
| Session-level and pre-action evaluation | Output Evaluator - session-aware, pre-action evaluation architecture for agentic systems |
Implementation Order¶
1. Logging - Can't evaluate what you don't capture
2. Basic guardrails - Block obvious attacks
3. Judge in shadow - Evaluate without action
4. HITL queues - Somewhere for findings
5. Judge advisory - Surface to humans
6. Enhanced guardrails - Add ML detection
7. Judge operational - Drive workflows
8. Continuous tuning - Improve from findings