Part 2 of the Milan series. If you missed Part 1, start with The Governance Gap in Agentic AI. — AIreal 🦎


The Night Everything Went Sideways

Kevin’s son Romeo was 14. Curious, smart, the kind of kid who takes things apart to see how they work. One evening, he got access to an AI agent: a capable one, with tool access and no meaningful guardrails.

What happened next is a masterclass in why governance matters.

The agent ran. And ran. And ran. Through the night, it executed tasks, called APIs, spawned processes, consumed tokens at a staggering rate. 350 million tokens in a single overnight session. Nobody was watching. No alerts fired. No budget limits existed.

Kevin woke up to a €1,500 charge on his credit card.

Why It Happened

Let’s be clear: this wasn’t a malicious attack. It wasn’t a jailbreak. It wasn’t even a bug in the traditional sense. It was a perfectly functioning agent doing exactly what it was told to do — without any constraints on how much it could do.

The failure points:

1. No Spending Limits

There was no per-session token budget. No daily cap. No cost threshold that would trigger a pause or alert. The agent had an effectively unlimited credit line.

2. No Time Boundaries

The agent had no concept of “this has been running for 8 hours and maybe something is wrong.” Long-running sessions weren’t flagged or automatically terminated.

3. No Supervision Model

Nobody was monitoring the session. There was no dashboard showing active sessions and their resource consumption. The agent operated in complete darkness from an oversight perspective.

4. Access Without Context

The agent had broad capabilities and no scoping. It was like handing a contractor the keys to the building without specifying which rooms they can enter, how long they can stay, or what the electric bill should look like.

The Real Lesson: Autonomy Without Governance Is a Liability

Here’s the thing that makes this story resonate in enterprise boardrooms: swap “Romeo” for “intern” and “AI agent” for “enterprise automation platform,” and this scenario plays out in companies every day.

  • A marketing automation that sends 100,000 emails instead of 1,000 because a loop condition was wrong
  • A cloud autoscaler that spins up $50,000 in GPU instances because a model training job didn’t have resource limits
  • A data pipeline that overwrites production tables because the staging flag wasn’t set

The pattern is always the same: capable system + broad access + no guardrails = expensive surprise.

What makes agentic AI worse is the combination of:

  • Variable behavior — unlike a script, an agent can decide to do something different each run
  • Cascading tool use — one tool call leads to another leads to another
  • Plausible reasoning — the agent has a “good reason” for each individual action, even as the aggregate becomes absurd

What Changed After Romeo

Kevin didn’t just pay the bill and move on. The Romeo incident became the foundation of his governance philosophy. Here’s what he implemented and now advocates:

Hard Budget Limits Per Session

Every agent session has a maximum token budget. Hit the limit? The session pauses and alerts a human. No exceptions. No “but I was almost done.”

Implementation: Many agent frameworks and model APIs report token usage per call, which makes session-level tracking straightforward. The key is enforcing the limit as a hard stop in the code that runs the agent, not as a soft warning that the agent can reason around.
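
To make the hard-stop idea concrete, here is a minimal sketch in Python. The names `SessionBudget`, `BudgetExceeded`, and `charge` are hypothetical and not taken from any particular framework. It assumes the code driving the agent learns the token count after each model call and reports it here.

```python
class BudgetExceeded(Exception):
    """Raised when a session reaches its hard token limit."""


class SessionBudget:
    """Tracks token spend for one agent session (hypothetical helper)."""

    def __init__(self, max_tokens: int) -> None:
        if max_tokens <= 0:
            raise ValueError("max_tokens must be positive")
        self.max_tokens = max_tokens
        self.used_tokens = 0

    def charge(self, tokens: int) -> None:
        """Record tokens already spent, then stop the session if the cap is hit.

        The spend is recorded first because a completed model call
        cannot be un-spent; the exception prevents the next one.
        """
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        self.used_tokens += tokens
        if self.used_tokens >= self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used_tokens} of {self.max_tokens} tokens"
            )
```

The caller, not the agent, catches `BudgetExceeded`, pauses the session, and alerts a human. Because the check lives outside the model's prompt, there is nothing for the agent to argue with.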

Time-Based Circuit Breakers

Sessions that run longer than a configurable threshold get flagged. Sessions that run dramatically longer get terminated.

The insight: An agent that’s been running for 6 hours is almost certainly in a loop or has lost the thread of its objective. Healthy agent sessions for most tasks are measured in minutes, not hours.
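
A two-stage breaker like this could be sketched as follows. The class name, the state names, and the thresholds are illustrative assumptions. The clock is injectable so the behavior can be checked without waiting for hours.

```python
import time
from enum import Enum


class SessionState(Enum):
    OK = "ok"
    FLAGGED = "flagged"      # running long: alert a human
    TERMINATE = "terminate"  # running dramatically long: stop it


class TimeCircuitBreaker:
    """Classifies a session by elapsed wall-clock time (hypothetical helper)."""

    def __init__(self, flag_after_s: float, kill_after_s: float,
                 clock=time.monotonic) -> None:
        if not 0 < flag_after_s < kill_after_s:
            raise ValueError("need 0 < flag_after_s < kill_after_s")
        self.flag_after_s = flag_after_s
        self.kill_after_s = kill_after_s
        self._clock = clock
        self._started = clock()

    def check(self) -> SessionState:
        """Return the state implied by how long the session has run."""
        elapsed = self._clock() - self._started
        if elapsed >= self.kill_after_s:
            return SessionState.TERMINATE
        if elapsed >= self.flag_after_s:
            return SessionState.FLAGGED
        return SessionState.OK
```

The loop that runs the agent would call `check()` between steps and act on the result. With, say, `TimeCircuitBreaker(flag_after_s=600, kill_after_s=3600)`, a session is flagged after ten minutes and terminated after an hour; the right numbers depend on the task.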

Cost Observability Dashboards

Real-time visibility into what every active session is consuming. Not just tokens — API calls, external service costs, compute time. If you can’t see it, you can’t manage it.
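
As a rough illustration of the data behind such a dashboard, the sketch below keeps one usage record per session and ranks sessions by estimated cost. The field names and both price parameters are assumptions; real prices depend on the provider and model.

```python
from dataclasses import dataclass


@dataclass
class SessionUsage:
    """Per-session counters a dashboard could display (hypothetical schema)."""
    session_id: str
    tokens: int = 0
    api_calls: int = 0
    external_cost_eur: float = 0.0
    compute_seconds: float = 0.0

    def estimated_cost_eur(self, eur_per_million_tokens: float,
                           eur_per_compute_hour: float) -> float:
        """Combine token, compute, and external service costs."""
        token_cost = self.tokens / 1_000_000 * eur_per_million_tokens
        compute_cost = self.compute_seconds / 3600 * eur_per_compute_hour
        return token_cost + compute_cost + self.external_cost_eur


def top_spenders(sessions, eur_per_million_tokens, eur_per_compute_hour,
                 limit=5):
    """Return (session_id, cost) pairs, most expensive first."""
    costs = [
        (s.session_id,
         s.estimated_cost_eur(eur_per_million_tokens, eur_per_compute_hour))
        for s in sessions
    ]
    costs.sort(key=lambda pair: pair[1], reverse=True)
    return [(sid, round(cost, 2)) for sid, cost in costs[:limit]]
```

Refreshing `top_spenders` every few seconds over the active sessions is enough to surface a runaway long before the invoice does.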

The “3 AM Test”

Kevin now applies what he calls the “3 AM Test” to every agent deployment:

If this agent runs unsupervised at 3 AM, what’s the worst it can do? Is that acceptable?

If the answer is “it could spend thousands of euros” or “it could send emails to customers” or “it could modify production data” — then it needs tighter constraints or human-in-the-loop checkpoints.

Principle: Earn Trust Incrementally

New agents start with minimal permissions and low budgets. As they prove reliable, limits expand. This mirrors how you’d onboard a new employee — you don’t give them admin access on day one.
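
One way to encode incremental trust is a small table of tiers. Everything in this sketch is a made-up example: the tier names, budgets, tool names, and promotion thresholds. It also assumes that any open incident sends an agent back to the lowest tier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrustTier:
    name: str
    max_tokens_per_session: int
    allowed_tools: frozenset
    min_clean_sessions: int  # clean runs required to reach this tier


# Ordered from least to most trusted; all values are illustrative.
TIERS = (
    TrustTier("sandbox", 50_000, frozenset({"read_files"}), 0),
    TrustTier("standard", 500_000,
              frozenset({"read_files", "web_search"}), 20),
    TrustTier("trusted", 2_000_000,
              frozenset({"read_files", "web_search", "send_email"}), 100),
)


def tier_for(clean_sessions: int, open_incidents: int) -> TrustTier:
    """Pick the highest tier earned; any open incident resets to sandbox."""
    if clean_sessions < 0 or open_incidents < 0:
        raise ValueError("counts must be non-negative")
    if open_incidents > 0:
        return TIERS[0]
    earned = [t for t in TIERS if clean_sessions >= t.min_clean_sessions]
    return earned[-1]
```

A brand-new agent gets `tier_for(0, 0)`, the sandbox tier, and moves up only as its record of clean sessions grows.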

For Enterprise: Calculating Your Exposure

When Kevin presents this to enterprise audiences, he uses a simple framework:

Factor | Question
Agent count | How many agents are running?
Average cost per session | What does a typical run cost?
Max cost per session | What’s the worst-case single session?
Sessions per day | How many sessions run daily?
Oversight latency | How long before someone notices a problem?

Your daily exposure = (Agent count × Max session cost × Sessions per day) × (Oversight latency / 24)

For Kevin’s one-agent setup, it was: 1 × €1,500 × 1 × (8/24) = €500 in effective overnight exposure.

For an enterprise with 50 agents, each capable of $200/session, running 10 sessions/day with 4-hour oversight lag: 50 × $200 × 10 × (4/24) ≈ $16,667/day in potential uncontrolled spend.

That gets CFO attention fast.
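
The exposure formula is simple enough to keep as a small function and rerun whenever the inputs change. This is a direct transcription of the formula above; the function and parameter names are just illustrative.

```python
def daily_exposure(agent_count: int, max_session_cost: float,
                   sessions_per_day: float,
                   oversight_latency_hours: float) -> float:
    """Worst-case spend per day, scaled by the fraction of the day
    that passes before someone notices a problem."""
    if not 0 <= oversight_latency_hours <= 24:
        raise ValueError("oversight_latency_hours must be between 0 and 24")
    worst_case_per_day = agent_count * max_session_cost * sessions_per_day
    return worst_case_per_day * oversight_latency_hours / 24
```

Calling `daily_exposure(1, 1500, 1, 8)` reproduces Kevin’s €500, and `daily_exposure(50, 200, 10, 4)` gives roughly 16,667 for the enterprise example. Cutting the oversight latency from 4 hours to 1 reduces that second figure to about a quarter.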

The Human Element

Something Kevin always emphasizes: Romeo wasn’t doing anything wrong. He was curious. He was exploring. The failure wasn’t the human — it was the system that allowed unlimited exploration without limits.

This reframes the governance conversation from blame to design:

  • Don’t blame the user who triggered the runaway → design systems that can’t run away
  • Don’t blame the agent for being too capable → design constraints that match the capability
  • Don’t blame the ops team for not watching → design alerts that watch for you

Good governance removes the need for constant vigilance. It makes the safe path the easy path.

Takeaways for Your Organization

  1. Every agent needs a budget. Tokens, API calls, wall-clock time. All of them.
  2. Alerts must fire before limits are hit: at 50%, at 80%, with a hard stop at 100%.
  3. Overnight runs need extra scrutiny. If no human is watching, constraints should be tighter.
  4. Test your worst case. The 3 AM Test isn’t paranoia — it’s engineering.
  5. Start small, expand trust as it is earned. New agents get sandbox permissions until they prove reliable.
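
The 50% / 80% / 100% ladder from takeaway 2 could be sketched like this. The function name and the event labels are hypothetical. Comparing two consecutive usage readings means each alert fires once, when its threshold is crossed, rather than on every later check.

```python
def crossed_thresholds(previous_used: float, current_used: float,
                       limit: float, warn_fractions=(0.5, 0.8)):
    """Return the alerts triggered between two usage readings."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    events = []
    for fraction in sorted(warn_fractions):
        mark = fraction * limit
        # Fire only when this reading moved usage across the mark.
        if previous_used < mark <= current_used:
            events.append(f"warn:{round(fraction * 100)}%")
    if previous_used < limit <= current_used:
        events.append("hard_stop")
    return events
```

For a 1,000-token budget, going from 400 to 900 tokens returns both warnings at once, and going from 900 to 1,000 returns only the hard stop. The same function works for any budgeted resource: tokens, API calls, or wall-clock seconds.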


The Romeo incident cost €1,500. The lesson it taught is worth considerably more. — AIreal 🦎