Building Blocks for Agentic Solutions
The building blocks I use when designing agentic solutions
Disclaimer: This post is for educational purposes and authorized security testing only. Only use these techniques against systems you own or have explicit permission to assess, and do not violate laws, contracts, or terms of service.
Table of Contents
- Introduction
- What Problem Are You Trying to Solve
- Decompose Into Components
- Deterministic Where You Can
- The Orchestrator
- Tools
- Observability
- Evaluations and Benchmarks
- Output Validation and Success Criteria
- Real-Time Feedback
- Artifacts as Seeds
- Goals Over Being Prescriptive
- Closing Notes
Introduction
TL;DR After building agentic tools, mostly though not exclusively in offensive security, I have landed on a set of building blocks that I keep returning to. Problem framing first, then a clear split between deterministic and non-deterministic work, observability so you actually know what the agents are doing, and steering by goals and verification criteria rather than step-by-step scripts.
Across the previous posts in this blog I have built a Ralph loop for pentesting, an ODYSSEUS platform that wrapped it in proper orchestration, and Cassian for differential security review. There are also a few projects I have not yet written up, including Argus, an agentic threat modeling system that I will be dropping soon. Each iteration taught me something, and most of what I learned was about the shape of the system around the agent rather than the agent itself.
This post is a distillation of the approach I take now when I sit down to build an agentic solution. It is opinionated and shaped by the specific failure modes I have hit along the way. Treat it as a working reference rather than a blueprint.
What Problem Are You Trying to Solve
Every agentic tool I have built that went sideways started with a fuzzy problem statement. A broad prompt like “find bugs in this repo” can produce results, and people have shown that, but it is a weak foundation for a system you want to rely on or scale.
The better starting point is a clearly defined problem, and then working backwards from it. What specific outcome am I trying to produce? Who consumes the output? What separates a good result from a bad one? If you cannot state the problem in plain terms, do you really know what you are trying to solve?
Decompose Into Components
Once you have captured the problem, break it down into its constituent parts. I have had the most success when each part has a narrow responsibility and well-defined inputs and outputs. This is the same discipline you would apply to any other system design exercise, and it applies just as much when the workers happen to be language models.
Decomposition also makes the system legible. When something misbehaves you can tell which component failed, rerun just that component, and iterate without paying for the whole pipeline again. And I cannot tell you the number of tokens I have burned re-executing a full run for what was really a single-step failure. Without that separation, pinpointing where things went wrong becomes a lot harder than it needs to be.
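A minimal sketch of what that separation can look like, with hypothetical component names: each step declares what it reads and writes, and a `start_at` parameter lets you rerun from the step that failed instead of paying for the whole pipeline again.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch: each component has a narrow responsibility and
# merges its output keys into shared state, so a failed step can be
# rerun in isolation instead of re-executing the full pipeline.
@dataclass
class Component:
    name: str
    run: Callable[[dict], dict]  # takes prior state, returns new keys

def run_pipeline(components, state, start_at=0):
    """Run components in order; `start_at` resumes from a failed step."""
    for comp in components[start_at:]:
        state = {**state, **comp.run(state)}
    return state

pipeline = [
    Component("intake", lambda s: {"files": ["a.py", "b.txt"]}),
    Component("triage", lambda s: {"candidates": [f for f in s["files"]
                                                  if f.endswith(".py")]}),
]
full = run_pipeline(pipeline, {})
# Rerun only triage with patched state, skipping the intake step.
partial = run_pipeline(pipeline, {"files": ["c.py", "d.txt"]}, start_at=1)
```

The point is not the helper itself but the legibility: when `triage` misbehaves, you rerun `triage`, not the world.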
Deterministic Where You Can
When you design the workflow, be explicit about which parts are deterministic and which are not. In offensive tooling I have found that the more you can push toward deterministic plumbing around agent reasoning, the easier it becomes to reproduce consistent results. The agents interpret and decide, the code around them handles the mechanics.
I have run pipelines that, if executed a hundred times, would produce a hundred different results. That is fine for personal tooling where variance is cheap and sometimes even useful for exploration. It is a problem for a production environment where someone is relying on the output and expects the same inputs to yield the same answer.
Concretely, this means intake, routing, triage, deduplication, artifact handling, tool invocation, and pattern searches should be deterministic code paths, and this list is not exhaustive. Agent reasoning should be reserved for the parts that actually require judgment. This split also happens to be cheaper because you are not paying a model to do what a function can do for free.
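To make the split concrete, here is a hedged sketch (function names are my own, not from any specific project): deduplication and pattern search are plain deterministic code, and only the genuinely ambiguous call is flagged for model judgment.

```python
import hashlib
import re

# Deterministic plumbing: same inputs always give the same answer,
# and no tokens are spent on what a function can do for free.
def dedupe(findings):
    """Drop findings whose evidence hashes to an already-seen digest."""
    seen, out = set(), []
    for f in findings:
        key = hashlib.sha256(f["evidence"].encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(f)
    return out

def grep_candidates(source, pattern=r"eval\("):
    """Return 1-based line numbers matching a pattern of interest."""
    return [i for i, line in enumerate(source.splitlines(), 1)
            if re.search(pattern, line)]

# Only the judgment call is routed to a model (routing predicate only;
# the actual model call is out of scope here).
def needs_model_review(finding):
    return finding.get("severity") == "unclear"
```

Everything above the routing predicate is reproducible and free; the model only sees what survives the plumbing.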
The Orchestrator
Once you have multiple agents in play you need something that orchestrates them. The orchestrator spins agents up as they are needed, keeps track of what each is doing, manages the tools they have access to, and enforces their token budgets. Without it, coordination falls on ad hoc scripts and small faults like a single rate limit response have a habit of taking the whole pipeline down with them.
The orchestrator is also where you encode resilience. Retries with backoff, provider rotation, step limits, stall detection, and concurrency caps all live at this layer. I have written about this in the Cassian post where the harness does exactly this job, and the more I build the more convinced I am that this is where the meaningful engineering lives. My harness is now called Cerbero and serves as the brain for my agentic systems. You register an agent, declare its tools and permissions, and hand it a task. The orchestrator handles the rest.
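The resilience layer can be sketched in a few lines. This is a simplified stand-in, not Cerbero's actual implementation: jittered exponential backoff, a hard attempt limit, and provider rotation so one rate-limited provider does not take the pipeline down.

```python
import random
import time

# Simplified orchestrator-level resilience: retry with jittered
# exponential backoff, rotating providers between attempts.
def call_with_resilience(providers, task, max_attempts=4, base_delay=0.01):
    last_err = None
    for attempt in range(max_attempts):
        provider = providers[attempt % len(providers)]  # rotate on failure
        try:
            return provider(task)
        except Exception as err:  # rate limits, timeouts, transient faults
            last_err = err
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
    raise RuntimeError(f"all attempts failed: {last_err}")

# Usage: a rate-limited primary falls over to a healthy secondary.
calls = {"n": 0}
def flaky(task):
    calls["n"] += 1
    raise TimeoutError("rate limited")
def healthy(task):
    return f"done: {task}"

result = call_with_resilience([flaky, healthy], "triage run")
```

In a real harness the same layer would also enforce step limits, stall detection, and concurrency caps; this sketch only shows the retry-and-rotate core.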
Tools
Speaking of tools, I initially relied on MCP servers, but they bloated the context window, since tool definitions get loaded up front whether the agent uses them or not. Newer patterns like progressive disclosure and Code Execution with MCP help by loading only the tools an agent needs and by keeping large intermediate results out of the model context, which significantly reduces the problem.
Even with those improvements, MCP is no longer my default. I give agents a core set of tools, use CLIs for many custom integrations, and increasingly rely on skills for domain workflows, since those only load when relevant.
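The load-only-what-is-relevant idea is simple to sketch. The registry and tags below are hypothetical, but the shape is the point: the task declares what domain it is in, and only matching tool definitions ever reach the context window.

```python
# Hypothetical tool registry: each definition is tagged by domain,
# and only definitions relevant to the task are handed to the agent,
# instead of loading every definition up front.
TOOL_REGISTRY = {
    "recon": {"name": "recon", "description": "enumerate hosts", "tags": {"network"}},
    "grep":  {"name": "grep",  "description": "search source",   "tags": {"code"}},
    "diff":  {"name": "diff",  "description": "compare revisions", "tags": {"code"}},
}

def tools_for_task(task_tags):
    """Return only tool definitions whose tags overlap the task's tags."""
    return [t for t in TOOL_REGISTRY.values() if t["tags"] & task_tags]

code_tools = tools_for_task({"code"})
```

Skills take this further by attaching whole workflows, but the selection mechanics are the same: relevance decides what loads.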
Observability
An overlooked area is observability. I ignored this when starting out. At its core, observability gives you visibility into every action an agent takes and every decision it makes. You need structured logs, event streams, and ideally a UI that shows you the state of each agent in something close to real time. Without it, you are flying blind on a system that is already hard enough to reason about.
It is also not optional when you are debugging why an agent made a particular decision. You want to see the prompt, the tool calls, the intermediate outputs, the token usage, and the final result, all stitched together per run. When the pipeline produces a surprising result, that trace is the only way to understand what happened. The models will flat out lie and observability is one way I have used to catch them.
This layer also doubles as your audit trail. In offensive work where you are running tooling against live targets with explicit permission boundaries, being able to reconstruct exactly what an agent did and when is a compliance requirement as much as a debugging aid.
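A minimal version of that per-run trace, sketched with structured JSON events (class and field names are illustrative): every prompt, tool call, and result is appended with a timestamp and run id so the whole run can be stitched together and replayed later.

```python
import json
import time

# Minimal per-run trace: prompts, tool calls, and results are appended
# as structured events, so a surprising output can be reconstructed
# event by event instead of guessed at.
class RunTrace:
    def __init__(self, run_id):
        self.run_id = run_id
        self.events = []

    def log(self, kind, **fields):
        self.events.append({"run_id": self.run_id, "ts": time.time(),
                            "kind": kind, **fields})

    def dump(self):
        """One JSON object per line, ready for any log pipeline."""
        return "\n".join(json.dumps(e) for e in self.events)

trace = RunTrace("run-001")
trace.log("prompt", agent="triage", tokens=812)
trace.log("tool_call", tool="grep", args={"pattern": "eval("})
trace.log("result", verdict="candidate", tokens=1204)
```

The same event stream doubles as the audit trail: the events are append-only and timestamped, so "what did the agent do and when" has a literal answer.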
Evaluations and Benchmarks
You need your own custom evaluation framework or benchmark suite. Whether you believe in benchmarks is a whole other discussion. Model providers push changes constantly, and good output yesterday does not mean good output today. Just check social media any given week for the latest round of posts about a model being nerfed. Not naming names…
A few carefully chosen cases that exercise the specific capabilities you care about are often enough to catch regressions before they reach real work. You want positive cases, negative cases, and the hard ambiguous ones that surface regressions when a model that used to handle them starts slipping.
Run the evals on every model update, on every significant prompt change, and ideally on a schedule regardless of whether anything has changed. The point is to have a signal you can trust when the output of your pipeline suddenly looks different. Because the providers WILL say “nothing has changed…”
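A tiny regression suite can be as simple as this sketch. The classifier here is a toy stand-in for your actual pipeline, and the cases are illustrative: a clear positive, a clear negative, and a hard encoded variant that a naive checker misses, which is exactly the kind of case that surfaces slippage.

```python
# Toy stand-in for a model call; in practice this wraps your pipeline.
def classify(payload):
    return "'" in payload  # naive SQLi heuristic, for illustration only

# Positive, negative, and hard ambiguous cases, per the text above.
CASES = [
    {"name": "pos_sqli",     "input": "id=1' OR '1'='1",    "expect": True},
    {"name": "neg_benign",   "input": "id=42",              "expect": False},
    {"name": "hard_encoded", "input": "id=1%27%20OR%201=1", "expect": True},
]

def run_evals(model):
    """Return the names of cases where the model's answer regressed."""
    return [c["name"] for c in CASES if model(c["input"]) != c["expect"]]

failures = run_evals(classify)
```

Here the toy classifier fails the URL-encoded case, which is the signal you want: the suite tells you which capability slipped, independent of what the provider claims.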
Output Validation and Success Criteria
Agents need a way to know what success looks like. I have gotten the best results when the success criteria are explicit and checkable, with concrete pass and fail conditions the agent can reason against. Tell the agent what a valid output must contain, what it must not contain, and how a downstream checker will validate it.
On top of that, the system itself should validate agent output before trusting it. Schema checks, evidence chains linking claims back to premises, and sanity checks on values all belong in the pipeline. If the validation fails, route the output back to the agent with the specific failure reason and let it try again.
I cannot count the number of times an agent has falsely claimed there was a bug, even going so far as to provide a code trace. So do not skip this step.
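The validate-and-route-back loop sketched below shows the shape I mean, with hypothetical field names and a stub agent standing in for a real model call: the checker returns a specific failure reason, and that reason goes back to the agent on the next attempt.

```python
# Hypothetical schema for a finding report; adjust to your pipeline.
REQUIRED_KEYS = {"title", "severity", "evidence"}

def validate(report):
    """Return a specific failure reason, or None if the report passes."""
    missing = REQUIRED_KEYS - report.keys()
    if missing:
        return f"missing fields: {sorted(missing)}"
    if report["severity"] not in {"low", "medium", "high"}:
        return f"invalid severity: {report['severity']}"
    if not report["evidence"]:
        return "claim has no supporting evidence"
    return None

def validated_call(agent, task, max_tries=3):
    """Route validation failures back to the agent with the reason."""
    feedback = None
    for _ in range(max_tries):
        report = agent(task, feedback)
        feedback = validate(report)
        if feedback is None:
            return report
    raise ValueError(f"agent never produced valid output: {feedback}")

# Stub agent: first attempt omits evidence, then repairs on feedback.
def stub_agent(task, feedback):
    report = {"title": task, "severity": "high", "evidence": ""}
    if feedback:
        report["evidence"] = "request/response pair attached"
    return report

report = validated_call(stub_agent, "idor in /api/users")
```

The key design choice is that the feedback is the specific failure reason, not a generic "try again", which is what lets the agent actually fix the right thing.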
Real-Time Feedback
Along the same lines, give agents real-time feedback while they are working. If a tool call fails, surface the failure with enough context for the agent to adjust. If an intermediate check fails, hand that result back as part of the next turn. The faster the feedback loop, the more productive the agent becomes.
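For tool calls specifically, the cheapest version of this is to never swallow a failure: wrap the invocation and hand back a structured result, including a hint the agent can reason about on its next turn. A minimal sketch:

```python
import subprocess
import sys

# Instead of raising or silently dropping a failed tool call, return a
# structured result the agent can adjust to on the next turn.
def run_tool(cmd):
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        "hint": None if proc.returncode == 0
                else f"command exited {proc.returncode}; see stderr before retrying",
    }

# Usage: one call that succeeds and one that fails, both fully visible.
ok = run_tool([sys.executable, "-c", "print('reachable')"])
failed = run_tool([sys.executable, "-c", "import sys; sys.exit(3)"])
```

Both outcomes come back in the same shape, so the agent's next turn always has the exit code, the stderr, and a hint to work with.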
Artifacts as Seeds
I typically seed agents with prior CVEs, past reports, previous exploit payloads, and other domain context relevant to the target. This gives the agent something concrete to anchor on instead of reasoning from first principles every time. I have seen others have success without taking this route though.
Artifacts also act as memory across runs. If a prior run surfaced a pattern in one part of the codebase, a later run should be able to reference that finding when analyzing related code. Without persistent artifacts you are asking the agent to rediscover context on every invocation, which is slow and expensive.
Anthropic’s 0-days post hits on a similar idea, where past fixes and security-relevant commits become seeds for uncovering adjacent bugs. The specific artifacts vary with the task, but the principle generalizes.
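A minimal persistent artifact store makes the cross-run memory concrete. The class and the "auth" example below are hypothetical: one run records a finding, and a later run, constructed fresh from the same file, can pull related notes in as seeds.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical artifact store: findings persist to a JSON file so a
# later run can seed itself with prior context instead of rediscovering it.
class ArtifactStore:
    def __init__(self, path):
        self.path = Path(path)
        self.items = (json.loads(self.path.read_text())
                      if self.path.exists() else [])

    def add(self, component, note):
        self.items.append({"component": component, "note": note})
        self.path.write_text(json.dumps(self.items))

    def related(self, component):
        """Prior notes about this component, for seeding a new run."""
        return [i["note"] for i in self.items if i["component"] == component]

store_path = Path(tempfile.mkdtemp()) / "artifacts.json"
ArtifactStore(store_path).add("auth", "session tokens lack rotation")

later_run = ArtifactStore(store_path)  # fresh instance, same seeds
notes = later_run.related("auth")
```

In practice the store would hold CVEs, reports, and payloads rather than one-line notes, but the mechanic is identical: persist once, seed forever.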
Goals Over Being Prescriptive
I no longer treat agents like humans on a procedural checklist. Being prescriptive by saying “first do X, then do Y, then do Z” has consistently given me worse results than describing the goal, the verification criteria, and the tools available to accomplish it.
Goal-driven prompting works best when the rest of the system provides structure. Observability, validation, scoped tools, and a feedback loop are what make an unscripted agent reliable. Without them, the agent has far less signal to correct course.
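The difference between prescriptive and goal-driven can be seen in how the task prompt is assembled. A hedged sketch, with illustrative goal, criteria, and tool names: state the goal and the checkable criteria, list the tools, and leave the ordering of steps to the agent.

```python
# Goal-driven task spec: the goal, verification criteria, and available
# tools are stated; the sequence of steps is deliberately not.
def build_task_prompt(goal, criteria, tools):
    lines = [f"Goal: {goal}", "Success criteria:"]
    lines += [f"- {c}" for c in criteria]
    lines.append("Available tools: " + ", ".join(tools))
    lines.append("Choose your own approach; output will be checked "
                 "against the criteria above.")
    return "\n".join(lines)

prompt = build_task_prompt(
    "Find authentication bypasses in the login flow",
    ["every claim links to a file and line",
     "no finding without reproducible evidence"],
    ["grep", "read_file", "http_probe"],
)
```

Notice the criteria are the same ones a downstream validator can check, which is what closes the loop between goal-driven prompting and output validation.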
Closing Notes
If I had to compress everything above into a single checklist, here is what it would look like. Frame the problem, decompose it, make the plumbing deterministic, orchestrate the agents, observe everything, evaluate constantly, validate outputs, close the feedback loop, seed with artifacts, and steer by goals rather than scripts. That list is not exhaustive and it is not a recipe. It is the checklist I run through before writing any code, and more often than not it saves me from building something I would have to throw away.
If you are building in this space I would love to hear what principles you have converged on and where they diverge from mine.