Executive summary

  • Static evaluations are great at catching known failure modes and preventing regressions.
  • Agentic AI risk is contextual and multi-step—new risk can appear mid-execution (tools, data, destinations, retries).
  • Runtime enforcement is where governance actually works: decision boundaries with deterministic outcomes (allow/deny/redact/approve/degrade).
  • The correct posture is eval → policy → runtime decision → audit → refine. Evals inform policy; enforcement contains impact.

One-screen diagram: static eval vs runtime enforcement

Offline evaluation (before deployment)
  inputs: test prompts + golden corpus + red-team cases + benchmark suites
  outputs: scores, regressions, risk signals

  answers: "Could this go wrong?"  (probability reduction)

────────────────────────────────────────────────────────────────────────────

Runtime enforcement (during execution)
  decision boundaries:
    [B1] before tool call     [B2] before output egress     [B3] before retries/escalation

  outcomes (deterministic):
    allow | deny | redact | require-approval | degrade (cheaper model / fewer tools / smaller context)

  answers: "Can this happen right now?"  (impact reduction)

Audit:
  reason-coded event emitted after each boundary decision

Most teams start their AI safety journey with evaluations. That makes sense: evals are measurable, repeatable, and feel like engineering.

But once you deploy agentic systems—LLMs that plan, call tools, ingest results, and replan—evaluation-first thinking hits a hard limit.

Passing evals does not imply safe runtime behavior.

This is not an argument against evaluations. It’s an argument for using evaluations in the right role—and adding the control plane that evals cannot be: runtime enforcement.


The industry’s comfort zone: static evaluation

Static evaluation includes:

  • prompt/unit tests (“does the model refuse this?”)
  • red-team test cases
  • golden datasets / golden corpora
  • offline scoring (toxicity, jailbreak success rate, PII leakage)
  • regression checks across model/prompt changes

Where evals shine:

  • preventing regressions when prompts change
  • comparing models (A vs B)
  • validating that a mitigation improves known cases
  • creating a shared language for risk (“this class of prompts fails 3% of the time”)

Evals are necessary. They just aren’t sufficient.


The hard limit: evaluations freeze context in time

Evaluations are performed in a controlled environment. The context is fixed:

  • the toolset is assumed
  • the destination is assumed
  • the data sensitivity is assumed
  • retries and time are assumed
  • the user’s intent is simplified into test prompts

Real systems are not fixed. They evolve per request and per step.

What static evals cannot reliably represent

  • Tool availability changes: new connectors, new APIs, new permissions.
  • User intent drift: “help me” turns into “do the irreversible thing.”
  • Data sensitivity discovered mid-execution: a tool response contains PII unexpectedly.
  • Retry/loop dynamics: flakey tools trigger repeated calls and context growth.
  • Cost escalation: token usage explodes with long context and repeated retries.

Static evals answer: “Could something go wrong?”

But production requires: “Can this action happen right now, given this context?”


Why agentic systems break evaluation-first thinking

Agentic systems are not one-shot. They are multi-step decision processes:

  1. plan
  2. choose tools
  3. call tools
  4. ingest results
  5. replan
  6. generate output
  7. (sometimes) loop, retry, escalate

Risk compounds across steps. A step that is safe in isolation can become unsafe when chained:

  • tool A returns a secret
  • tool B is then called with that secret
  • output egress includes the secret

In other words:

Most incidents emerge between steps, not at entry.

That’s why “we passed our eval suite” is not a strong production safety claim.


Runtime-only failure modes you will not catch consistently offline

1) Tool misuse despite “safe prompts”

A prompt can say “only use tools when appropriate,” and an agent will still call the wrong tool. Static evals might catch the obvious cases. They won’t cover the combinatorics of real workflows.

Runtime answer: gate tool calls before they happen.


2) Data egress triggered by intermediate context

A model can generate an answer that appears safe—until intermediate tool output injects sensitive content. The final output now contains PII, secrets, or restricted data.

Runtime answer: inspect and enforce before output egress (redact/block/approve).


3) Budget overruns caused by retries and escalation

Offline evals usually run on clean infrastructure. Production tooling fails. Retries happen. Context grows. Models escalate. Spend accelerates.

Runtime answer: budget gates on retries, steps, tool fan-out, and model tier.


4) Privilege creep through tool chaining

Even if each tool is “safe,” chaining creates privilege amplification:

  • search → retrieve → summarize → send → write

Runtime answer: least privilege per step, not per agent.


Runtime enforcement: the missing control plane

Runtime enforcement is not a dashboard. It is a decision layer.

The discipline is simple:

  • identify decision boundaries
  • evaluate policy at those boundaries using live context
  • apply deterministic outcomes
  • emit reason-coded audit events

The three boundaries that matter most

  1. Before tool invocation (side effects begin here)
  2. Before output egress (leaks happen here)
  3. Before retries/escalation (runaway behavior happens here)

Deterministic outcomes (not “model behavior”)

At each boundary, the system must produce one of:

  • allow
  • deny
  • redact
  • require approval
  • degrade (cheaper model, reduced tools, smaller context)

This is where governance becomes real.


Use evaluations correctly: turn findings into enforceable policies

The correct relationship is:

  • Evals generate signals (what could go wrong)
  • Policies encode decisions (what is allowed)
  • Enforcement applies decisions (what happens at runtime)

A practical loop:

  1. Run evals / red-team cases
  2. Identify recurring failure modes
  3. Translate failures into policy rules:
    • deny conditions
    • thresholds
    • approval gates
    • redaction patterns
  4. Deploy policies in log-only mode
  5. Review reason-coded decisions
  6. Move high-risk boundaries to enforced mode
  7. Repeat

This gives you the best of both worlds:

  • evals reduce uncertainty
  • enforcement reduces impact

A simple mental model

Offline evals answer: "Could this go wrong?" (reduce probability)
Runtime enforcement answers: "Can this happen right now?" (reduce impact)

You need both. But only one can stop an incident mid-flight.


Adoption path (practical, low-drama)

If you want to adopt this without slowing teams down:

  1. Keep evals for regressions and known failures
  2. Add runtime logging to observe decisions and context
  3. Introduce policy modes:
    • log-only → warn → soft deny → hard deny
  4. Start enforcement at high-risk boundaries:
    • irreversible tools
    • external network calls
    • sensitive data egress
    • premium model escalation
  5. Use audit events to refine policies safely

This avoids “big bang governance” and converges faster.


Common anti-patterns to avoid

  • Treating eval pass rates as “safety scores”
  • Blocking releases solely on offline metrics
  • Assuming prompt safety generalizes across contexts
  • Relying on dashboards instead of gates
  • Ignoring cost/time as part of the risk surface

If a control cannot stop the action, it is not a control.


  • Validation and verification: https://docs.tealtiger.ai/governance/validation-and-verification
  • Policy modes: https://docs.tealtiger.ai/concepts/policy-modes
  • Audit and redaction: https://docs.tealtiger.ai/concepts/audit-and-redaction
  • Logging behavior: https://docs.tealtiger.ai/audit/logging-behavior
  • Security vs governance: https://docs.tealtiger.ai/concepts/security-vs-governance

Closing thesis

Static evaluation reduces uncertainty.

Runtime enforcement reduces impact.

In agentic AI, only one of those can stop an incident.