Building CentralAgent: A Production Multi-Agent Framework

Most agentic systems fail for the same reason: they use an LLM to both make decisions and execute business logic without distinguishing between the two. When something breaks, you cannot tell whether the model reasoned incorrectly, a tool returned bad data, or the control flow was just wrong. CentralAgent is my attempt to fix that by enforcing a hard boundary. Claude only makes decisions. Python only executes them.

The starting point for every task in the system is a directive. Directives are plain Markdown files, and a real one looks roughly like this: a one-sentence goal statement, an inputs section describing what data the task receives (file paths, API responses, prior output from another directive), a tools section listing which skills are available, an output format specification, and an edge-case section that covers what to do when something unexpected happens. The directive for competitive monitoring says, explicitly, that if a vendor page returns a 403 it should mark that vendor as access-blocked and continue rather than halting. It says that if the pricing table structure has changed in a way the browser skill cannot parse, it should escalate to a human checkpoint rather than silently returning empty data. Those two sentences took me three broken runs to write.

The discipline of writing a good directive took longer to develop than any other part of the system. Directives that are too vague fail in a specific way: Claude fills the gaps with reasonable assumptions that may not match your intent, and those assumptions are invisible unless you are reading the audit log. The first version of the research briefing directive said "research this company and produce a competitor analysis." It produced output, but that output varied across runs because Claude was making different decomposition choices each time: sometimes leading with product features, sometimes with funding history. None of those choices were wrong, but none were consistent. Directives that are too prescriptive fail differently: they over-specify the execution path and leave Claude no room to handle inputs that deviate from the expected shape. The right balance is specifying what needs to happen, why it matters, and when to stop and escalate. The how belongs to the skills.

A directive that is hard to write usually means the task needs decomposing. If I cannot describe what needs to happen in a page, the task is not ready to automate.

The orchestration layer is where Claude reads a directive and figures out the path forward. For a task like the contract extraction pipeline, Claude receives a directive covering the full job: split the uploaded PDF into sections, route each section to the right extraction template based on clause type, run the extractions in parallel, collect and validate the results, and either write to the database or prepare a human review packet. The parallel fan-out is literal. Claude dispatches one skill invocation per section simultaneously, collects results as they return, and only then runs the validation pass over the complete set. A contract with thirty-five sections takes roughly the same wall-clock time as one with ten because the bottleneck is the number of concurrent workers, not the number of sections.

Human-in-the-loop checkpoints interrupt execution at defined points. Some are explicit in the directive, for steps where a mistake is difficult to undo. Others are triggered by Claude when it encounters a situation the directive does not cover. A checkpoint presents as a structured message: what Claude was about to do, why it stopped, and the specific data that caused the pause. For the contract pipeline this usually means I see a flagged extracted value, the surrounding source text, and the validation rule it failed. The message is not a generic confirmation prompt; it is a context-specific decision request. I read it, make a judgment, and the task continues from where it stopped.

Skills are thin by design. A skill that fetches competitor pricing from a vendor page is typically 40 to 60 lines of Python: construct the request, parse the response, handle specific HTTP error codes, return a typed dict. The boundary rule is strict: if the logic requires knowing anything about the current task's context or intent, it does not belong in a skill. Skills are stateless. They receive inputs, return outputs, and raise typed exceptions on failure. Error recovery is Claude's responsibility, determined by what the directive says to do: retry with adjusted parameters, proceed without that data point and note the gap, or stop and escalate. The skill itself never makes that decision.

MCP integration changed what the system can do more than any other single decision. Giving Claude access to a real browser through Chrome DevTools means browser tasks can be expressed in a directive without writing a custom scraper for every website. What surprised me about how Claude uses browser access is how conservative it is. I assumed Claude would try multiple navigation strategies when a first attempt failed. In practice it makes one targeted extraction attempt and escalates if the result does not match the expected structure. That turned out to be the right behavior: browser automation that silently adapts to unexpected page structures is how you end up with bad data that looks like good data. The failure mode specific to MCP-mediated browser tools is concurrency against the same domain. When the competitive monitoring directive fans out browser skills in parallel against five vendor pages, hitting the same server within seconds triggers rate limiting. Early on this produced intermittent failures that looked like page structure errors rather than throttling. I now specify per-domain concurrency caps in the directive itself.

Competitive monitoring is a two-directive setup. The first fans out a browser skill against each vendor in parallel and collects raw pricing tables. The second diffs those tables against yesterday's snapshot, applies business rules (flag changes above a threshold, distinguish discontinued products from parse failures, note structural changes separately from pricing changes), and writes the delta report. The edge case that required the most directive iteration was the distinction between a product disappearing from a page and the page layout changing in a way that broke extraction. Both look like missing data. The directive now handles each case explicitly.

Contract extraction uses a three-directive chain. A coordinator directive routes each PDF section to one of four extraction templates based on clause classification. Extraction skills run in parallel against all sections. A validation directive collects the results, checks them against expected value ranges and required field lists, and either writes to the database or produces a review packet with anomalies highlighted alongside the source text. The output format is fixed: the downstream database schema never has to account for variation in how the task ran.

Research briefings are the most compositional. A fan-out directive dispatches four skills simultaneously: web search, LinkedIn company data, product changelog, and recent news. A synthesis directive receives all four, resolves contradictions between sources, and produces the final document. The synthesis directive does not know how its inputs were gathered, only what schema they conform to. That boundary means I can replace the LinkedIn skill with a different data source without touching the synthesis logic.

The failure mode that taught me the most happened about two months into production. The research briefing task started producing output that was inconsistent in structure: sometimes it included a technology stack section, sometimes it did not. I read the audit log for three consecutive runs and found the divergence point. In one run, Claude had decided the product changelog was unavailable and skipped the technology inference step. In another, it had treated an outdated changelog as current data. The directive said to "use available product information to infer the technology stack," which described the goal but not what "available" meant when the changelog was stale or unreachable. Two sentences added to the edge-case section fixed it. Finding the problem took about twenty minutes. The fix took five. Without the per-decision audit trail, I would have been comparing output documents with no way to trace back to the cause.

If I were designing CentralAgent today, I would make the directive format more formal. Right now directives are free-form Markdown, which is easy to read and write but means Claude has to infer their structure rather than parse it. A typed directive schema, with required fields validated before execution begins, would catch authoring errors early and produce clearer error messages when a directive is malformed. Free-form felt like the right tradeoff when there were three directives. At twenty-plus it introduces a class of ambiguity that a small upfront investment in schema design would have eliminated.

The other thing I would reconsider is the audit log granularity. The current log records every decision point at the same level of detail, which is thorough for debugging but produces noise for tasks that complete cleanly. A tiered approach, verbose logging for runs that hit checkpoints or errors, summary logging for clean completions, would make it much faster to spot anomalies across a day's worth of runs without having to filter through routine execution traces. The information is all there; it is just not organized for the way I actually use it.

The architecture is not novel in concept, but most frameworks blur the line between decision and execution in the name of convenience, and that blurring is where reliability problems start. CentralAgent does not blend those concerns, and that is the only reason I trust it with work that matters.

If you want to talk through the architecture or see what this looks like applied to a specific problem, you can reach me at erikrasin.io.