Create AI that controls your AI, keeping it effective yet aligned.


The Problem Statement

An AI agent rarely decides to drop the database to “recover space”. But when your agent gets too creative, how do you prevent it from doing harm?

To truly leverage the potential of AI agents, we need to grant them significant autonomy. I’ve been experimenting with various approaches, and my agents went down the “wrong path” more than once. Typically, those incidents were harmless (my test agents run on a dedicated, isolated server). But a few times they tried to execute unwanted actions, and I caught it only after the fact. Nothing critical happened, but once it was a close call.

That led me to develop guardrails that keep my agents in check.

The Implementation


Set of Rules

I started with a simple set of rules for disallowed destructive actions.

Then I added a list of safe allowed actions so my agents can invoke basic commands without bothering me.

Finally, I implemented the “ask for permission” flow so my agents can request execution of a specific command.

NOTE: To be clear, my agents cannot skip the permission step and invoke a command directly. My framework intercepts ALL bash calls (whether the agent likes it or not).
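As an illustration, the three-way check (deny, allow, ask) could look like the Python sketch below. The rule lists, the names, and the extra check for shell operators are assumptions made for this example, not the framework's actual code.

```python
import shlex
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    ASK = "ask"


# Hypothetical rule sets, for illustration only.
DENY_PREFIXES = [("rm", "-rf"), ("dropdb",), ("mkfs",)]
ALLOW_PREFIXES = [("ls",), ("cat",), ("git", "status")]
# Assumption: chained or redirected commands are never auto-approved.
SHELL_OPERATORS = set(";&|<>`$")


def starts_with(tokens, prefix):
    """True when the command's leading tokens equal the rule prefix."""
    return tuple(tokens[:len(prefix)]) == prefix


def evaluate(command: str) -> Decision:
    """Classify one intercepted bash command as DENY, ALLOW, or ASK."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes and similar: let a human look at it.
        return Decision.ASK
    if any(starts_with(tokens, p) for p in DENY_PREFIXES):
        return Decision.DENY
    if SHELL_OPERATORS & set(command):
        return Decision.ASK
    if any(starts_with(tokens, p) for p in ALLOW_PREFIXES):
        return Decision.ALLOW
    # Anything not covered by a rule goes to the permission flow.
    return Decision.ASK
```

Deny rules are checked first so that an allow rule can never override them, and everything unknown falls through to "ask" instead of being executed.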

I used this solution for some time. I approved/denied actions with an option to save my decisions to the existing set of rules (so the agent won’t bother me again with the same question). The list started to grow quickly, but agents kept asking me to invoke the same commands with different parameters.

Wildcards

I decided to extend my solution with wildcards. This let me add a rule for a command and its arguments without pinning down specific values. As a result, one rule covered a whole family of similar calls, and my system became operational again.
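One way to get this kind of wildcard matching is with shell-style patterns from Python's standard fnmatch module. This is only a sketch: the patterns and the helper name matches_any are hypothetical, and the post does not say which matching syntax the framework really uses.

```python
from fnmatch import fnmatchcase

# Hypothetical allow patterns; "*" stands for any argument values.
ALLOW_PATTERNS = [
    "git log *",
    "pytest tests/*",
    "grep -r * src/",
]


def matches_any(command: str, patterns) -> bool:
    """Return True if the command matches at least one wildcard pattern."""
    # Collapse repeated whitespace so spacing differences do not matter.
    normalized = " ".join(command.split())
    return any(fnmatchcase(normalized, pattern) for pattern in patterns)


print(matches_any("git log --oneline -5", ALLOW_PATTERNS))
print(matches_any("git push origin main", ALLOW_PATTERNS))
```

Note that "*" in fnmatch matches any run of characters, including spaces, so a broad pattern also matches things you did not have in mind. Patterns need to stay narrow to remain meaningful.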

Creative Agents


But my agents were motivated to deliver “no matter what”. To be “more efficient”, they started creating small scripts to bypass the deny rules and invoke the desired actions anyway.

My monitoring solution was catching those cases, but I had to review the scripts one by one. Even worse, there was no good way to apply the “wildcard” trick here. I was not happy with a rule like python *, which effectively says:

“Ok, you won. Execute whatever you want.”

Then I found the solution! I decided to fight fire with fire :)

AI that controls AI


I extended my system with an independent AI Assessor. I kept my rules system to handle the “easy cases”. Rule evaluation is fast and deterministic, which makes it a good fit for the basic cases.

When my agents decide to invoke commands that are not covered by the rules, the system routes those requests to an independent AI Assessor.

The AI Assessor works with a strictly controlled context, directed by an explicit system prompt. It does not share the same goal as my development agents, so it is not incentivized to execute commands “no matter what”.

The only thing that makes the AI Assessor happy is to identify and block rogue AI Agents :)
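The routing described in this section can be sketched roughly as follows. Everything here is an assumption for illustration: the prompt text, the Verdict type, and the assessor callable (a stand-in for whatever model call you use) are hypothetical names, not the framework's real interface.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    allowed: bool
    reason: str
    source: str  # "rules" or "assessor"


# Hypothetical system prompt; the assessor sees only this and the command.
ASSESSOR_SYSTEM_PROMPT = (
    "You review shell commands proposed by other agents. "
    "Your only goal is to block harmful or out-of-scope actions. "
    "Answer ALLOW or DENY followed by a one-line reason."
)

# Returns True (allow), False (deny), or None (not covered by any rule).
RuleCheck = Callable[[str], Optional[bool]]
# Takes (system_prompt, command) and returns the raw model reply.
Assessor = Callable[[str, str], str]


def decide(command: str, rule_check: RuleCheck, assessor: Assessor) -> Verdict:
    """Try deterministic rules first, then fall back to the assessor."""
    ruled = rule_check(command)
    if ruled is not None:
        return Verdict(ruled, "matched a deterministic rule", "rules")
    try:
        reply = assessor(ASSESSOR_SYSTEM_PROMPT, command).strip()
    except Exception as exc:
        # Fail closed: if the assessor is unavailable, nothing runs.
        return Verdict(False, f"assessor failed: {exc}", "assessor")
    # Fail closed again: only an explicit ALLOW lets the command through.
    return Verdict(reply.upper().startswith("ALLOW"), reply, "assessor")


def simple_rules(command: str) -> Optional[bool]:
    """Toy rule check used only to exercise decide()."""
    if command.startswith("ls"):
        return True
    if command.startswith("rm -rf"):
        return False
    return None


def stub_assessor(system_prompt: str, command: str) -> str:
    """Stand-in for a real model call, so the sketch runs offline."""
    if "drop" in command.lower():
        return "DENY destructive database operation"
    return "ALLOW looks routine"


print(decide("python cleanup.py --drop-tables", simple_rules, stub_assessor))
```

The assessor receives only its own system prompt and the command under review, never the development agent's task or conversation, which is what keeps its context controlled and its goal separate.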

Development Flow

This setup delivered a solid framework and reduced manual interventions during the development cycle. I still review the final outcome, but only at the end of the entire (automated) process. I am no longer actively involved in overseeing the development cycle. My development team works iteratively to design/build/test a given task. I get notified once the task is done, so I can review the final solution and decide how to proceed.

I am no longer distracted by reviewing the operational questions, which greatly improves my personal “context management”.

Hints:

Use deterministic rules for the easy cases.
Route uncertain or risky requests to an independent AI Assessor.
Keep the assessor’s context strictly controlled and its objective separate from other agents’ goals.
Log every request, reasoning, and decision so you can fine-tune the framework over time.

The Bigger Picture

There is still a lot of room for improvement. The assessor logs every validation request it receives, together with its reasoning and decision. I plan to use this data to fine-tune my solution further, improve performance, and put in extra checks. It takes time to build a solid framework, but that investment pays dividends over time.
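A simple way to keep such a log is one JSON record per line, which is easy to append to and easy to analyze later. The sketch below is an assumption about how this could be done: the field names and the functions log_decision and load_decisions are hypothetical, not taken from the framework.

```python
import json
import time
from pathlib import Path


def log_decision(log_path, command, decision, reasoning, source):
    """Append one request/decision record as a JSON line and return it."""
    record = {
        "ts": time.time(),       # when the decision was made
        "command": command,      # the intercepted command
        "decision": decision,    # e.g. "allow" or "deny"
        "reasoning": reasoning,  # rule name or the assessor's explanation
        "source": source,        # "rules" or "assessor"
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def load_decisions(log_path):
    """Read all logged records back for later review or tuning."""
    with Path(log_path).open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Reviewing which commands keep reaching the assessor with the same outcome is one way to spot candidates for new deterministic rules.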

👉 What is the worst-case scenario when your AI Agent goes rogue? Let’s meet to discuss how to prevent that.
