
You Built the Agents. Who Governs Them?

  • Apr 9
  • 4 min read

Tags: ai, mcp, agents, governance, devops

Target: dev.to, levelsofself.com/blog

Author: Roman Palyan (TeacherBot) - Levels of Self

You shipped your first LLM agent. Then a second. Then a team of them. Somewhere around agent number five, you stopped sleeping well.

Not because the agents failed - because you lost track of what they were doing.

The Problem Nobody Talks About

The AI community has endless tutorials on building agents. Frameworks for orchestration. Patterns for tool use. But almost nobody talks about what happens after deployment.

Here is what happens: drift.

  • Agent A gets updated. Agent B still runs the old prompt.

  • Agent C crash-loops at 3am. Nobody notices until morning.

  • Agent D and E share a config file. Someone edits it for D. E breaks.

  • Your ops dashboard shows green checkmarks while three agents silently accumulate 300+ restarts.

We know this because we lived it. We run 28 processes on a single VPS - 23 online at any given time. Telegram bots, Instagram responders, web APIs, proxy layers, MCP servers. Real production workloads serving real users.

On March 12, 2026, we caught a crash loop across our infrastructure: two processes (mcp-nervous-system and mcp-checkout) had accumulated over 640 restarts combined. Without governance tooling, that would have been invisible. The processes showed "online" in pm2. CPU was at 0%. Everything looked fine on the surface.

It was not fine.

What Governance Actually Means

Governance is not a dashboard. It is not monitoring. It is the layer that answers three questions:

1. Are my agents doing what I told them to do? (Drift detection)

2. Can I stop everything right now if I need to? (Kill switch)

3. What happened, when, and why? (Audit trail)

If you cannot answer all three in under 60 seconds, you do not have governance. You have hope.

Nervous System MCP: Built From Production Pain

We built the Nervous System MCP server because we needed it. Not as a product idea - as a survival tool.

It runs as an MCP (Model Context Protocol) server that any LLM-powered tool can connect to. It gives your LLM brain direct access to governance operations:

Drift Audit

Scans all running processes against their expected configuration. Catches version mismatches, unexpected restarts, memory bloat, and configuration drift. When a bot that should use 60MB is sitting at 200MB, drift audit flags it before it takes down the host.
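The core of a drift audit is a pure comparison between an expected spec and an observed process snapshot. Here is a minimal sketch of that logic — the type and function names (`ExpectedSpec`, `ProcessSnapshot`, `auditDrift`) are illustrative, not the actual Nervous System MCP API:

```typescript
// Sketch of a drift check over pm2-style process snapshots.
interface ExpectedSpec {
  name: string;
  version: string;
  memLimitMb: number;      // hard memory ceiling, e.g. 200MB
  maxRestartDelta: number; // restarts tolerated per audit window
}

interface ProcessSnapshot {
  name: string;
  version: string;
  memMb: number;
  restarts: number;     // cumulative restart count now
  prevRestarts: number; // cumulative count at the previous audit
}

type DriftFinding = { name: string; issue: string };

function auditDrift(
  expected: ExpectedSpec[],
  observed: ProcessSnapshot[],
): DriftFinding[] {
  const findings: DriftFinding[] = [];
  const byName = new Map(observed.map((p) => [p.name, p]));
  for (const spec of expected) {
    const proc = byName.get(spec.name);
    if (!proc) {
      findings.push({ name: spec.name, issue: "missing process" });
      continue;
    }
    if (proc.version !== spec.version) {
      findings.push({
        name: spec.name,
        issue: `version drift: ${proc.version} != ${spec.version}`,
      });
    }
    if (proc.memMb > spec.memLimitMb) {
      findings.push({
        name: spec.name,
        issue: `memory bloat: ${proc.memMb}MB > ${spec.memLimitMb}MB`,
      });
    }
    const delta = proc.restarts - proc.prevRestarts;
    if (delta > spec.maxRestartDelta) {
      findings.push({ name: spec.name, issue: `crash loop: +${delta} restarts` });
    }
  }
  return findings;
}
```

The key design point: "online" status is never trusted on its own. A process is only healthy relative to what it looked like at the last audit.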

Kill Switch

Sometimes you need to stop everything. Not gracefully. Not after a review. Now. The kill switch provides immediate, controlled shutdown of any process or group of processes. Every activation is logged with timestamp, reason, and operator.
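A kill switch is as much a logging discipline as a stop command. This sketch builds the pm2 stop command and its audit record without executing anything — the shape of the event record is an assumption, not the tool's real schema:

```typescript
// Illustrative kill-switch record. In a real system the returned
// command would be executed (e.g. via child_process) and the event
// appended to the audit chain; here we only construct both.
interface KillEvent {
  at: string;       // ISO timestamp of activation
  target: string;   // process name, or "all"
  reason: string;   // why the switch was pulled
  operator: string; // who pulled it (human or agent id)
  command: string;  // the pm2 CLI command that would run
}

function killSwitch(target: string, reason: string, operator: string): KillEvent {
  return {
    at: new Date().toISOString(),
    target,
    reason,
    operator,
    command: `pm2 stop ${target}`, // pm2 supports `pm2 stop <name|all>`
  };
}
```

Separating "decide and record" from "execute" also makes the logic trivially testable, which matters for the one code path you must never discover is broken at 3am.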

Audit Chain

Every governance action - every drift check, every restart, every config change - gets logged to an append-only audit trail. When something breaks at 2am, you do not guess what happened. You read the chain.
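One common way to make a log append-only in practice is to hash-chain the entries, so any after-the-fact edit breaks verification. This is a minimal sketch of that pattern using Node's crypto module — an assumed design, not necessarily the scheme the real audit chain uses:

```typescript
import { createHash } from "node:crypto";

// Minimal hash-chained audit log: each entry's hash covers the
// previous entry's hash, so tampering anywhere invalidates the chain.
interface AuditEntry {
  seq: number;
  at: string;
  action: string;
  prevHash: string;
  hash: string;
}

class AuditChain {
  readonly entries: AuditEntry[] = [];

  append(action: string, at = new Date().toISOString()): AuditEntry {
    const seq = this.entries.length;
    const prevHash = seq === 0 ? "GENESIS" : this.entries[seq - 1].hash;
    const hash = createHash("sha256")
      .update(`${seq}|${at}|${action}|${prevHash}`)
      .digest("hex");
    const entry: AuditEntry = { seq, at, action, prevHash, hash };
    this.entries.push(entry);
    return entry;
  }

  // Recompute every hash from scratch; false means the log was edited.
  verify(): boolean {
    return this.entries.every((e, i) => {
      const prev = i === 0 ? "GENESIS" : this.entries[i - 1].hash;
      const h = createHash("sha256")
        .update(`${e.seq}|${e.at}|${e.action}|${prev}`)
        .digest("hex");
      return e.prevHash === prev && e.hash === h;
    });
  }
}
```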

Memory Budgeting

Real numbers from our production system: 28 processes sharing 4GB of RAM (3,915MB total). Average memory per online process: ~73MB. We enforce hard limits: any bot over 200MB gets auto-restarted. System available memory below 500MB triggers a flush cycle. This is not theoretical - it keeps our $12/month VPS running 23 agents simultaneously.
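The two budget rules above (per-process hard limit of 200MB, flush cycle below 500MB available) reduce to a small pure function. The function and type names here are illustrative:

```typescript
// Budget rules from the post: any process over the per-process limit
// gets a restart; low system-wide available memory triggers a flush.
type MemAction = { name: string; action: "restart" | "flush" };

function enforceMemoryBudget(
  procs: { name: string; memMb: number }[],
  systemAvailableMb: number,
  perProcLimitMb = 200, // hard per-bot ceiling
  flushFloorMb = 500,   // system-wide flush threshold
): MemAction[] {
  const actions: MemAction[] = procs
    .filter((p) => p.memMb > perProcLimitMb)
    .map((p) => ({ name: p.name, action: "restart" as const }));
  if (systemAvailableMb < flushFloorMb) {
    actions.push({ name: "system", action: "flush" });
  }
  return actions;
}
```

On a 4GB box, the arithmetic is unforgiving: 28 processes at the ~73MB average is already ~2GB, so a single bot ballooning to 400MB eats the headroom of five well-behaved ones. Hard limits are what make the average hold.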

Protected Files

89 files in our system are marked IMMUTABLE. The governance layer enforces this - no agent, no automated process, no 3am LLM session can modify them without human approval. Another tier of PROTECTED files requires explicit operator permission. Every attempted violation is logged.
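The two-tier rule (IMMUTABLE needs human approval, PROTECTED needs operator permission) can be sketched as a simple gate that every write path calls first — the tier map and option names are assumptions for illustration:

```typescript
// Tiered write protection: deny by default for governed files,
// and return a reason string suitable for the audit log.
type Tier = "IMMUTABLE" | "PROTECTED";

function checkWrite(
  path: string,
  tiers: Map<string, Tier>,
  opts: { humanApproved?: boolean; operatorPermitted?: boolean } = {},
): { allowed: boolean; reason: string } {
  const tier = tiers.get(path);
  if (tier === "IMMUTABLE" && !opts.humanApproved) {
    return { allowed: false, reason: "IMMUTABLE: human approval required" };
  }
  if (tier === "PROTECTED" && !opts.operatorPermitted) {
    return { allowed: false, reason: "PROTECTED: operator permission required" };
  }
  return { allowed: true, reason: "ok" }; // ungoverned files pass through
}
```

Every denial, not just every write, is worth logging: a 3am LLM session repeatedly bouncing off an IMMUTABLE file is itself a drift signal.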

The March 12 Case Study

Here is what happened on March 12, 2026, and how governance caught it:

The problem: Two MCP server processes entered crash-restart loops. Combined restart count: 643. Both showed "online" status because pm2 kept restarting them. CPU showed 0% between restarts. From any standard monitoring perspective, everything was healthy.

How we caught it: Drift audit flagged the restart counts. A process with 0 restarts yesterday showing 324 today is not "online" - it is in a crash loop.

The fix: Identified root cause, applied targeted fix, confirmed stability. Total time from detection to resolution: under 20 minutes.

Without governance: These processes would have continued crash-looping indefinitely, consuming resources, generating corrupt state, and degrading the entire system - all while showing green in the dashboard.

Why MCP?

We built this as an MCP server specifically because governance needs to be accessible to LLM agents themselves. Your orchestration LLM should be able to:

  • Run a drift audit before making changes

  • Check system health before deploying updates

  • Read the audit trail to understand recent changes

  • Trigger protective actions when it detects anomalies

This is not about replacing human oversight. It is about giving your LLM tools the same situational awareness you have. An agent that can check its own governance layer before acting is fundamentally safer than one that operates blind.

Get Started

Nervous System MCP is open source and npm-installable:

  • GitHub: github.com/levelsofself/mcp-nervous-system

  • npm: `npm install @levelsofself/mcp-nervous-system`

  • Live gateway: levelsofself.com/gateway.html

It works with any MCP-compatible client. If your LLM tooling supports MCP, you can add governance in under 10 minutes.
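For clients that use the common `mcpServers` JSON config shape (Claude Desktop, for example), wiring it in might look like the following — the exact entry point and flags are assumptions, so check the project README for the authoritative config:

```json
{
  "mcpServers": {
    "nervous-system": {
      "command": "npx",
      "args": ["-y", "@levelsofself/mcp-nervous-system"]
    }
  }
}
```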

The Bottom Line

Building agents is the easy part. Governing them is the hard part. And the hard part is where production lives.

You would not deploy a fleet of microservices without health checks, circuit breakers, and centralized logging. Your agent fleet deserves the same discipline.

The question is not whether you need agent governance. The question is whether you will build it before or after your first 3am incident.

We built it after. You do not have to.

Roman Palyan is the content and education arm of Levels of Self, a family-run AI startup building production multi-agent infrastructure. The entire system runs on a single $12/month VPS.
