The research page (https://statewright.ai/research) mentions a patent, and a "core engine";
> Provisional patent application filed: #64/054,240 (April 30, 2026). 35 claims covering state machine guardrail enforcement for LLM agent tool access. The core engine remains Apache 2.0 open source.
I'm not sure I understand what the "core engine" is if it's not the "state machine guardrail runtime", which is what the patent covers. Which parts are open source, exactly?
I find the idea really interesting and was nodding along as I read what you wrote; it makes sense both for the human and the agent, and seems like a really nice idea that'd help. But the patent kind of makes me want to run away and not look into it too deeply.
azurewraith 22 hours ago [-]
Thanks for digging deeper and I'm happy to clarify all three aspects:
Re: Reproducing the results: the engine, agent crate and demo TUI are all in the repo. If you have Ollama running with a 13B+ model, `task run:bugfix` reproduces the simple bugfix result end to end. What isn't published yet is the SWE-bench experiment harness (task selection, patch scoring, control runs). I need to get that out; I prioritized the end-to-end simple Claude Code plugin for the launch. The demo crate (crates/demo) contains a demo TUI which calls Ollama and runs the bugfix state machine interactively with code.
Re: Engine: The core engine (crates/engine/) is the pure Rust state machine evaluator. It's what Statewright is running on the backend. JSON in => transition decisions out. Agent (crates/agent/) builds on top of it to make it useful for LLMs. All of that is Apache 2.0 with no restrictions.
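To make "JSON in => transition decisions out" concrete, here's a minimal sketch of that shape. This is illustrative only -- the types and method names (`Machine`, `Decision`, `evaluate`) are assumptions, not the actual crates/engine API:

```rust
use std::collections::HashMap;

// Hypothetical decision type: either the transition is allowed (with the
// resulting state) or denied with a machine-readable reason.
#[derive(Debug, PartialEq)]
pub enum Decision {
    Allow { next_state: String },
    Deny { reason: String },
}

/// Transitions keyed by (current state, event name) -> next state.
pub struct Machine {
    transitions: HashMap<(String, String), String>,
}

impl Machine {
    /// Build a machine from (from, event, to) edges.
    pub fn new(edges: &[(&str, &str, &str)]) -> Self {
        let transitions = edges
            .iter()
            .map(|(from, event, to)| {
                ((from.to_string(), event.to_string()), to.to_string())
            })
            .collect();
        Machine { transitions }
    }

    /// Pure evaluation: a transition request in, a decision out.
    /// No side effects, so it's trivially testable and auditable.
    pub fn evaluate(&self, state: &str, event: &str) -> Decision {
        match self.transitions.get(&(state.to_string(), event.to_string())) {
            Some(next) => Decision::Allow { next_state: next.clone() },
            None => Decision::Deny {
                reason: format!("no transition for event '{}' in state '{}'", event, state),
            },
        }
    }
}
```

The point of the pure-function shape is that the same evaluator can back a managed service, a CLI plugin, or an embedded library without changes.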
Re: the Patent: The patent covers the method of using state machines to constrain LLM agent tool access at the protocol layer. It's defensive; it helps protect the managed service and the idea from being scooped by a larger company with more personnel and resources. It's not targeted against solo developers, self-hosters or researchers.
You'll find that the portions that I've released under FSL 1.1 have explicit grants which do not restrict solo developers or single-team self-hosting. The code released this way becomes Apache 2 in exactly 3 years. This is not unlike what Sentry and MariaDB did. I am planning on releasing more portions as FSL 1.1; I just hadn't crossed that bridge, and honestly this thing seems to have gotten popular at the moment, so I thought I'd set the record straight a bit.
embedding-shape 8 hours ago [-]
It sounds to me like the interesting parts then are under the patents, and the non-essential parts are effectively what you've open sourced here.
I understand the concern and why you'd want to do something like this, but I hope you also understand from the other side that it makes it a no-go to even continue reading about it.
> It's not targeted against solo developers, self-hosters or researchers.
If this is so, you might want to add an actual exclusion to the patents/licenses for those groups of people, if that's how you feel about it. Right now, they're kind of "empty words", and I probably couldn't defend myself in court if you sue me with "But they said on HN they wouldn't target me", but I'm not a lawyer, nor am I interested in paying a lawyer to figure this out either.
> The code released this way becomes Apache 2 in exactly 3 years.
I suppose I'll put a reminder to look into this project again in exactly 3 years! :) Regardless, I do wish you luck, the idea still seems solid in theory, so eagerly awaiting the future open source release.
azurewraith 7 hours ago [-]
> you might want to add an actual exclusion to the patents/licenses for those groups of people
Done :) https://github.com/statewright/statewright/blob/main/PATENTS...
Thank you for calling my attention to this. I think the world would be a better place if everyone just left the lawyers out of everything.
striking 21 hours ago [-]
The not-quite-Apache-2 "Fair Source License, Version 1.1, ALv2 Future License" (https://github.com/getsentry/fsl.software/blob/main/FSL-1.1-...) includes the Apache 2 patent grant. That grants you conditional permission to use the software in ways that would, without the grant, infringe upon their patent. One of the conditions is that you may not make a claim against any party that the software infringes upon any patent, or else your patent grant is terminated.
Unfortunately, the license actually in the repo is not even a not-quite-Apache-2 license. It doesn't appear to be FSL-1.1-ALv2 at all: https://github.com/statewright/statewright/blob/main/plugins.... This notably does not include the patent grant, which makes it unclear whether use of the software would infringe upon the patent.
azurewraith 21 hours ago [-]
You're right, and I have just corrected this. The license in the repo now uses the canonical FSL-1.1-ALv2 based on the template from fsl.software and now includes the patent grant clause.
The omission wasn't intentional -- the patent grant wasn't on my radar when the original license text was committed. FSL licensing is very new territory for me and I duffed it slightly, now corrected.
The Cargo.toml covers the built Rust crates (engine, agent); the plugins/ directory has its own LICENSE.md with the FSL terms. It's a split license: the engine is completely open source, the plugins are FSL with a 3-year clock. I should make this clearer in the workspace config. I am planning on releasing more of the crates; they will likely be FSL, and each of those crates will have a LICENSE.md override. I think this is the canonical pattern, but anyone please correct me if I'm wrong.
azurewraith 20 hours ago [-]
I also just updated the https://statewright.ai/research page to accurately reflect the intent and mention the patent grant afforded under FSL-1.1-ALv2. Thanks again for calling my attention to this.
addaon 4 hours ago [-]
I’ve been using a pattern similar to this with near-frontier models to solve problems harder than coding. Structurally things are even more extreme — no tool calling allowed. Each state gives structured output that the harness then uses to derive the next state and context. So a context in one state may say “you have these lemmas with definition visible, and these by name in other files”; the agent from a certain state can consume the visible lemmas, but can also modify includes to get visibility into and ability to use other lemmas after iteration. So far, seems sane, but haven’t benchmarked on this problem against more free-form solutions.
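The pattern described above (no tool calling; the harness, not the model, derives the next state and context from structured output) can be sketched roughly like this. All names here (`Verdict`, `Context`, the lemma-visibility fields) are illustrative assumptions about the commenter's setup, not anything from Statewright:

```rust
// The model emits a structured verdict; the harness alone performs the
// state update and assembles the next context (e.g. which lemmas are
// visible). The model never mutates state directly.
pub enum Verdict {
    ProofFound,
    NeedsLemma(String), // model asks for visibility into a named lemma
    Stuck,
}

pub struct Context {
    pub state: String,
    pub visible_lemmas: Vec<String>,
}

/// One harness step: consume the verdict, produce the next context.
pub fn step(ctx: Context, verdict: Verdict) -> Context {
    match verdict {
        Verdict::ProofFound => Context { state: "done".into(), ..ctx },
        Verdict::NeedsLemma(name) => {
            let mut lemmas = ctx.visible_lemmas;
            lemmas.push(name); // widen visibility for the next iteration
            Context { state: "proving".into(), visible_lemmas: lemmas }
        }
        Verdict::Stuck => Context { state: "replanning".into(), ..ctx },
    }
}
```

Because the state transition lives entirely in harness code, the model can request broader visibility but can never grant it to itself.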
azurewraith 2 hours ago [-]
[dead]
tecoholic 8 hours ago [-]
Very cool idea. I had something vaguely similar in my mind. It's nice to see someone go ahead and implement it. All the Claude Code animations and not knowing what's happening, how long it will take and what will come out is really frustrating me. On top of that, there is no way to actually limit the scope of things. Opencode's Plan mode and Build mode help a bit.
If a state machine can improve a local LLM to produce better results, it's a welcome addition for tinkerers and solo devs.
azurewraith 7 hours ago [-]
I feel you on the Claude pulsing thing. Running (or trying to run) Opencode with any model I could throw at it to perform useful work like the frontier/proprietary models do (a tall order, I know) is where I started. Everyone makes the problem bigger (massive contexts, massive numbers of parameters, more transformers (MoE)), but I started with: how can we make small/local LLMs perform better (do more with less)?
Opencode's plan/build is decent, like it is in Claude Code... state machines are the next evolution. model agnostic, tooling agnostic (where feasible)
DeathArrow 3 hours ago [-]
> state machines are the next evolution
Yes, but this solves just part of the problem: it stops the agent from doing something. What would be more useful is forcing the agent to do something. To make up an example, let's say you want the agent to change a status in Jira after it completes a task. With this framework you can deny the transition until the model changes the status in Jira, but that doesn't mean the agent will do it.
azurewraith 3 hours ago [-]
[dead]
giancarlostoro 1 days ago [-]
Interesting. I built a ticketing system similar to Beads which has yielded more predictable results with Claude and other models, and I'm currently building a custom harness. I'm able to use offline models, though my GPU RAM bandwidth is much lower, but I'm also planning on doing something similar to what you've built, namely the editing tools and whatnot. I hate how long it takes for Claude to look for files; it feels wasteful. I'm still astounded that everyone else has figured out ways to speed up harnesses, but Claude Code is still slow like a slug. I don't even care if I am waiting on the LLM in terms of slowness, but running local tools slowly bothers the living crap out of me. Stop using grep, RIPGREP IS FASTER!
In any case, I'll have to check out Statewright after work ;)
azurewraith 1 days ago [-]
I feel you on how sluggish Claude Code can be, you just never know what those pulsing prompts are doing in the background...
Given Statewright plugs into Claude Code, there is a little added overhead while managing the state machine logic, but for complicated workflows if it saves you a few debug loops, mass edit reversions or death spirals I think the case can be pretty solid for including it
giancarlostoro 23 hours ago [-]
I think this will be the next frontier for these models, improving the desktop tooling. I am surprised I've yet to see them go all in on hiring desktop app developers to overhaul Claude Code / Codex / Antigravity / etc because there's so many things they could do to reduce the footprint and issues drastically.
azurewraith 22 hours ago [-]
agreed... the tooling layer (desktop and console) is where the leverage is right now. the models are good enough, the harness operating them (even us humans) is what's holding things back. that's the basic gist behind this whole project
redhale 20 hours ago [-]
I feel like caching should be mentioned in tradeoffs, right? If you change the tool list frequently, that's a cache bust. In long sessions that seems like it could significantly affect costs.
azurewraith 20 hours ago [-]
Great question... and there are two answers depending on what you were originally referring to:
re: Claude Code... we actually don't filter or modify the tool list, so all tools stay visible -- disallowed calls get blocked at execution time with an error message. No cache busts on transitions; the model sees the full tool set. The cost there is prompt caching dollars, not latency, I suppose.
re: The research (Rust agent + Ollama): the model only receives tool schemas for the current state's allowed tools. Ollama does have a KV cache reuse facility, so changing the tool list busts that cache. Depending on your workflow, this can happen as many times as you expect your states to transition until completion; for simple workflows this is 3-5x. Within each state the tool list is stable and the cache operates normally. Presenting fewer tools instead of dozens on every agent processing step reduces input tokens and decision complexity, which is where the measurable gains come from.
Both enforce the same constraints, depending on the execution interface. The schema-level filtering in the research is the S-tier approach. Adding tools/list filtering to the MCP gateway would be beneficial if possible (it looks like we could only filter MCP tools, not core ones), which could provide tangible benefit. I've added this evaluation to the roadmap.
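The schema-level filtering described above reduces to something like the following sketch (the `ToolSchema` type and function name are assumptions for illustration, not the repo's actual types):

```rust
// A tool schema as presented to the model; only name/description shown here.
pub struct ToolSchema {
    pub name: String,
    pub description: String,
}

/// Return only the schemas allowed in the current state. Within a state the
/// returned list is stable, so the KV cache is only invalidated when a
/// transition changes the allowed set.
pub fn tools_for_state<'a>(
    all: &'a [ToolSchema],
    allowed: &[&str],
) -> Vec<&'a ToolSchema> {
    all.iter()
        .filter(|t| allowed.contains(&t.name.as_str()))
        .collect()
}
```

A planning state might pass `&["Read", "Grep", "Glob"]` here, so the model never sees an `Edit` schema at all, which is the "smaller tool space -> better reasoning" effect discussed elsewhere in the thread.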
redhale 18 hours ago [-]
Nice, thanks for the detailed answer!
dataworth 4 hours ago [-]
Visualizing agentic problem solving is a really cool concept. Feels like something I’ve seen on TV or something before. I like it.
What's the difference between a "transition" (purple line, not shown in the workflow) and happy path / failure?
azurewraith 8 hours ago [-]
the colors are based on the event name.... green for happy path events (READY, DONE, PASS), red for failure events (FAIL, ERROR) and purple for everything else
they don't indicate whether the transition is guarded or not... that is shown in the sidebar when you click a state.
nextaccountic 3 hours ago [-]
Suggestion, change the label to "other transitions"
My confusion was that happy path and failure are also transitions
azurewraith 3 hours ago [-]
shipped ^_^
2001zhaozhao 20 hours ago [-]
Interesting.
In your Github, the JSON format shown for defining custom workflows is very simple. I wonder if that limits the detail in the state-related instructions and error messages you can send to a model.
For example, in state transitions, does your tool just tell the model something like "you are in 'act' mode and no longer in 'plan' mode, here are your new available tools"? Seems difficult to give it any more informative messages given how simple the workflow definitions are. Likewise when the model attempts to do something that's not supported for tools in the given phase.
azurewraith 20 hours ago [-]
The workflow definition is intentionally simple... the enforcement layer handles the mechanics; however, the model gets more context than just "you're in <xyz> mode now"
Each state has an `instructions` field for phase-specific guidance, and when an agent's action (tool call) gets rejected, the error message lets the model know what went wrong and what's available to move forward:
    Tool 'Edit' is not available in the 'planning' phase.
    Allowed Tools: Read, Grep, Glob
    To advance, call statewright_transition with READY -> implementing
Models (even simple ones) tend to reason through these error messages, adjusting their approaches as opposed to retrying the blocked call. Additionally, on transitions the model is required to include a rationale explaining why it's transitioning (`data.rationale`) which creates an audit trail of the agent's reasoning at each phase boundary. That ends up being one of the most useful parts of the run history viewable on statewright.ai
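For reference, an error message like the one quoted above could be assembled along these lines (a sketch only; the exact wording and formatting in Statewright may differ):

```rust
/// Build a rejection message that tells the model what went wrong, what is
/// allowed in the current phase, and which transition unblocks it.
pub fn rejection_message(
    tool: &str,
    phase: &str,
    allowed: &[&str],
    event: &str,
    next: &str,
) -> String {
    format!(
        "Tool '{}' is not available in the '{}' phase.\n\
         Allowed Tools: {}\n\
         To advance, call statewright_transition with {} -> {}",
        tool,
        phase,
        allowed.join(", "),
        event,
        next
    )
}
```

The design choice worth noting is that every rejection is actionable: it names the allowed tools and the exact transition call, which is why models tend to adjust rather than blindly retry.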
tim-projects 14 hours ago [-]
I'm fully convinced that state machines are the key to getting low powered llm models to produce good quality code.
azurewraith 8 hours ago [-]
bingo! and that's where this journey began, it's been fun proving it out ^_^
miki_tyler 17 hours ago [-]
Very nice project!
Is the editor/composer separate from the runtime?
If I build a workflow in the visual editor, can I use that same flow inside my own app just by using the runtime/engine? Or is it mainly tied to the Statewright platform and Claude Code plugin?
I’m wondering if the runtime can be used as a standalone piece to power apps I build.
azurewraith 16 hours ago [-]
Yes, the engine handles the full workflow schema including guards. There are some aspects of runtime enforcement (env vars/command filtering, etc. exposed via the UI) that currently only live in the plugin layer but the engine parses and exposes everything. All you would have to do is wire up enforcement on your end in your app the same way the plugin does.
password4321 1 days ago [-]
Does it make sense to ship an MCP code mode API? I'm surprised you're recommending MCP as-is when concerned about context usage optimization. I don't have a lot of hands-on experience either way yet so I'm curious what's best and/or most popular... I understand MCP is less effort and still affordable at VC-subsidised prices.
azurewraith 1 days ago [-]
for the integration piece that ties into Claude Code and other places where AI is used most frequently? yes I think it does... we're not fighting context in Opus/Sonnet as much as we are in smaller models and we're only adding about 6 tools here which is a smaller footprint than other MCP exposures. Smaller models have a more direct/tight interface that doesn't bloat the tool space in my experimentation (using the core directly)
prunrCloud 9 hours ago [-]
Really interesting approach. My only concern would be how much flexibility gets lost when workflows become too rigid. Curious how it performs on tasks that require more creative exploration.
azurewraith 8 hours ago [-]
[dead]
davidkpiano 1 days ago [-]
Pretty cool. Looks like stately.ai but catered towards agentic state machine workflows. Really interesting!
azurewraith 1 days ago [-]
Stately (and XState ^_^) is pretty neat, I hadn't come across it yet... (edit:) neat to see visual XState being used for application logic as well
I see constant posts on Reddit/HN about the ways that AI is amazing and at the same time is fudging it (literally). Nobody can make reliability guarantees on something that's non-deterministic and non-idempotent. Nobody's AI workflow suite of tools can claim this. Prompting gets you closer to the mark but is still non-deterministic. Breaking down the problem into chunks with valid transition criteria so that even tiny models can step through them, I believe, gets us closer to where we want to be semantically.
azurewraith 6 hours ago [-]
Hey it's me again. Some things that didn't fit in the README or the original post -- less about features, more about where this goes.
The plan/implement/test workflow is very basic and represents the most common agentic use case. But the state machine pattern applies to any multi-step work where agents are useful but susceptible to death spirals, hallucinations, or other non-deterministic quirkiness. This also enables Claude Desktop and other non-coding agents to perform useful constrained work.
I've been building a content pipeline for tabletop publishing and tested it a bit earlier yesterday. A research phase gathers lore and game details from a compendium, a drafting phase generates structured content including schema-specific JSON validation (so my Lua+LaTeX templates work without iterating). A review gate has me editing content directly (tmux+neovim dialog is great for this). The agent shapes the content, makes sure it conforms to JSON validation and content requirements, then I write it. Before I adapted the state machine to it, the agent tried to do everything all at once — calling multiple agents is sometimes effective but details get lost and you definitely lose visibility in the summarization. The state machine runs everyone serially (for now) but chaining and parallelization are on the roadmap.
While working with statewright on a different workflow over the weekend, Claude (as Claude does) attempted to write an intricate bash script to work around a guardrail... and statewright blocked it! I think that was when I knew there was some real power behind what's been built here. Enforcement has to be structural, not advisory.
Also, being generally useful for things besides coding you can start to think about things like SOC 2 change management. Every change needs a plan, a human review gate, audited implementation, pull request, review, human approval, and then finally a human to approve a production deployment. Today teams enforce this with checklists and hope. An agent constrained by a workflow that won't let it deploy without all the prerequisite pieces is enterprise delivery with an auditable paper trail and humans injected for approvals where they need to be - not managing each change's lifecycle.
The piece I'm most excited about is agent-generated workflows. You solve a problem once and maintain your context, then point the agent at the JSON schema and it creates and uploads a new workflow to statewright automatically that you can use immediately. No fine-tuning, no exhaustive prompt engineering, no dozens of agents... best-fit lightweight guardrails that agents help build themselves, compiling your intent into structure the models can't weasel their way out of. This is a fundamentally different reality than what the current state of the art is practicing. I think that's a big deal.
chris_st 21 hours ago [-]
Please add support for the Windsurf editor as well. Thanks!
azurewraith 21 hours ago [-]
Mr. Claude's Opus says that this is a very feasible thing. It has better support for hooks than Cursor and full MCP support so protocol-layer blocking (like Claude) is possible. Adding to the roadmap...
brainless 20 hours ago [-]
I have to check how you are using state machines but I have also been focused on small models for a while now.
nocodo is one of my product experiments, currently using 120B model but I have tested a few agents inside it with 20B models.
I create a bunch of agents, each with very specific goals. Like Project Manager, Backend Engineer, etc.
Each agent gets a very compact list of tools and access to only certain parts of the filesystem or commands.
azurewraith [-]
Nice project... the per-agent tool restriction is the same core insight (smaller tool space -> better reasoning)
The main difference with Statewright is that tool access changes over time within a single agent. Planning phase gets read-only tools, edit capability unlocks after the agent proves it has adequate understanding... test tools unlock after the fix. State machines handle the phase transitions, guards and retry loops.
Your multi-agent approach decomposes by role instead of by phase/state. Both are valid. Since you're already in Rust, the engine crate (crates/engine) is a pure library with no deps. It might be interesting to see if putting a state machine around your orchestration layer improves your observed performance
brainless 13 hours ago [-]
I will give it a shot. I am very happy to see other projects where people are trying to build with small models.
veunes 7 hours ago [-]
[flagged]
DeathArrow 14 hours ago [-]
First thought: But why do we need statewright.ai external api? Why can't we do everything locally?
Second thought: enforcing tools is useful and I built myself a Pi extension to deny access to particular tools in some workflows.
But we need somehow to force agents to obey the rules.
For example, I have rules when using Pi to ask the main agent to dispatch implementer agents in parallel using git worktrees. Sometimes it uses git worktrees, sometimes not.
The thoughts are like this: "the user asked me to use git worktrees so let me start using git worktrees. But wait, the task is simple so maybe I don't need git worktrees..."
If I ask why it didn't follow the rules, it says something like: "The user is right, I should have followed the rules..."
azurewraith 8 hours ago [-]
you're hitting the nail on the head... rules in prompts are suggestions the model can rationalize away.
"the task is so simple that maybe I don't need worktrees" is the model overriding your intent with its own judgement and that's a pattern I'm seeing more and more as these models mature. statewright provides the guardrails... strong suggestions up front on what it can do in X state via injection and if it still wants to try and outsmart that, it gets hit in the post hook and the model gets the message "oh, you're right I shouldn't do it that way" ... instead of you course correcting, it's the state machine
to your first question: the engine is Apache 2.0 and runs locally. the managed service adds the visual editor, run history and plugin install. the enforcement itself doesn't require the cloud, I run the exact same engine on the backend
the MCP server is just the way to get statewright in the hands of a wide array of existing use cases, claude code included. not all agentic clients are created equal and Pi is actually the experience I want to hone next (also the most extensible)
esperent 13 hours ago [-]
> example I have rules when using Pi to ask main agent to dispatch implementer agents in parallel using git worktrees. Some time it uses git worktrees, sometimes no
I've taken the approach that whenever this happens, it's my fault. The instructions were not clear enough, not direct enough, or more often, there's just too many of them.
I'm now at the point where my Pi system prompt + agents + skills + tools starts out at just 7k context. It's all very clear and concise. I almost never have ambiguous responses like this, at least not near the start of a session.
Combined with instructions to keep the main session as a coordinator and use subagents for all non trivial work, I can get a lot of work done before hitting 100k context and basically never go over 150k.
It's a stark contrast with Claude code where I was starting at about 35k context even after trimming my stuff down. It's hardly surprising if an agent doesn't know what to do if you dump 30k+ of context with all kinds of rules and workflows, most of them unrelated to the current tasks, before you even do anything.
azurewraith 8 hours ago [-]
you're not wrong, and trimming context is legitimately the first thing that everyone should do. even with context trimming and a tight prompt, the model still makes judgement calls about which tools to use and when to stop.
that's fine 90% of the time... the state machine is for the other 10%, where the model's judgement call costs you an hour of debugging later (confidently fixed wrong, or overzealously) or stops a mostly automated thing because it got stuck on the wrong path.
esafak 24 hours ago [-]
I just have a smart model write a testable phased plan, have a cheaper model implement them, and yet another model to review each phase. I don't see the value of adding a Rust state engine. Algorithmically verifiable things can be tests, and more nebulous things (like pattern compliance) need an LLM to do the heavy lifting and can make mistakes, so what does the state engine buy you?
azurewraith 23 hours ago [-]
the state engine is the part that can't hallucinate. even with simple steps/prompting the review model can miss things... it's still an LLM making a judgement call at the end of the day.
the state engine doesn't judge, it enforces... with code and not transformers ^_^
if a tool (or any other guardrail) isn't valid at a given state the model call gets rejected before the model sees the result. that's the gap between "a model said this is okay" vs. "the system structurally prevents this"
esafak 23 hours ago [-]
I don't understand. Let's say my state is whether we are in conformance with repo patterns. Walk me through how you don't/can't hallucinate, given that you need an LLM to determine the state. For state variables that don't need LLMs, you can simply use tests and commit hooks, no?
azurewraith 23 hours ago [-]
the LLM doesn't determine the state... it requests a transition to change the state. the engine evaluates guards (data carried along the way) to decide if the transition is valid.
it (the LLM) can't skip from implementation to deploy if the guard says the tests haven't passed. the model will receive feedback that what it's tried to do is invalid, with the reasons why; it can't be skipped. it then tries to resolve that new information to make the state transition... almost like it would when responding to a human in the chair denying a step.
the model can't merge if it hasn't gone through your review state, even if it wants to (it'll try though)
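A guard of the kind described (implementation -> deploy blocked until tests pass) amounts to a check over the data carried with the run, not over anything the model claims. A minimal sketch, with the field name `tests_passed` as an assumed example:

```rust
use std::collections::HashMap;

/// Guard for the implementation -> deploy transition: the carried run data
/// must record that tests passed. The model's request is only an input; the
/// decision is made entirely by this code.
pub fn can_deploy(run_data: &HashMap<String, bool>) -> Result<(), String> {
    if run_data.get("tests_passed").copied().unwrap_or(false) {
        Ok(())
    } else {
        Err("transition to 'deploy' denied: guard 'tests_passed' is not satisfied".to_string())
    }
}
```

Because the guard reads recorded facts rather than model output, there is nothing for the LLM to hallucinate past; the worst it can do is receive the denial and try to satisfy the guard.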
If a state machine can improve a local LLM to produce better results, it's welcome addition to tinkerers and solo devs.
Opencode's plan/build is decent, like it is in Claude Code... state machines are the next evolution. model agnostic, tooling agnostic (where feasible)
Yes but this solve just some part of a problem, it stops the agent from doing something. What would be more useful is forcing the agent to do something. To make up an example, let's say you want the agent to change a status in jira after it completes a task. With this framework you can deny the transition until the models changes the status in jira, but that doesn't mean the agent will do it.
In any case, I'll have to check out Statewright after work ;)
Given Statewright plugs into Claude Code, there is a little added overhead while managing the state machine logic, but for complicated workflows if it saves you a few debug loops, mass edit reversions or death spirals I think the case can be pretty solid for including it
re: Claude Code... we actually don't filter or modify the tool list so all tools stay visible -- disallowed calls get blocked at execution time with an error message. No cache busts on transitions, the model sees the full tool sets. The cost there is prompt caching dollars not latency I suppose
re: The research (Rust agent + Ollama): the model only receives tool schemas for the current state's allowed tools. Ollama does have a KV cache reuse facility, so changing the tool list busts that cache. Depending on your workflow, this can happen as many times as you expect your states to transition before completion; for simple workflows that's 3-5 times. Within each state the tool list is stable and the cache operates normally. Presenting a few tools instead of dozens on every agent processing step reduces input tokens and decision complexity, which is where the measurable gains come from.
Both enforce the same constraints; the difference is the execution interface. The schema-level filtering in the research is the S-tier approach. Adding tools/list filtering to the MCP gateway would be beneficial if possible (it looks like we could only filter MCP tools, not core ones), and could still provide tangible benefit. I've added this evaluation to the roadmap.
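To make that concrete, the schema-level filtering amounts to something like this (illustrative Rust only, not the actual crates/engine API -- names and types here are my own sketch):

```rust
use std::collections::HashMap;

// Illustrative: per-state whitelist applied to the full tool list before
// the schemas are sent to the model. Within one state the result is stable,
// so the KV cache only busts at state transitions.
fn filter_tools(
    allowed: &HashMap<&'static str, Vec<&'static str>>, // state -> allowed tool names
    state: &str,
    all_tools: &[&'static str],
) -> Vec<&'static str> {
    let whitelist = match allowed.get(state) {
        Some(w) => w,
        None => return Vec::new(), // unknown state: expose nothing
    };
    all_tools
        .iter()
        .filter(|t| whitelist.contains(t))
        .copied()
        .collect()
}

fn main() {
    let mut allowed: HashMap<&'static str, Vec<&'static str>> = HashMap::new();
    allowed.insert("planning", vec!["Read", "Grep", "Glob"]);
    allowed.insert("implementing", vec!["Read", "Edit", "Bash"]);

    let all_tools = ["Read", "Grep", "Glob", "Edit", "Bash", "WebFetch"];

    // In the planning state, only the read-only tools reach the model.
    let visible = filter_tools(&allowed, "planning", &all_tools);
    assert_eq!(visible, vec!["Read", "Grep", "Glob"]);
    println!("planning sees: {:?}", visible);
}
```

The MCP-gateway variant would do the same thing at the tools/list boundary instead of at schema-assembly time.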
what's the difference between a "transition" (purple line, not shown in the workflow) and the happy path / failure ones?
they don't indicate whether the transition is guarded or not... that is shown in the sidebar when you click a state.
My confusion was that happy path and failure are also transitions
In your GitHub repo, the JSON format shown for defining custom workflows is very simple. I wonder if that limits the detail in the state-related instructions and error messages you can send to a model.
For example, in state transitions, does your tool just tell the model something like "you are in 'act' mode and no longer in 'plan' mode, here are your new available tools"? Seems difficult to give it any more informative messages given how simple the workflow definitions are. Likewise when the model attempts to do something that's not supported for tools in the given phase.
Each state has an `instructions` field for phase specific guidance and when an agent's action (tool call) gets rejected the error message lets the model know what went wrong, and what's available to move forward
Tool 'Edit' is not available in the 'planning' phase. Allowed Tools: Read, Grep, Glob. To advance, call statewright_transition with READY -> implementing
Models (even simple ones) tend to reason through these error messages, adjusting their approaches as opposed to retrying the blocked call. Additionally, on transitions the model is required to include a rationale explaining why it's transitioning (`data.rationale`) which creates an audit trail of the agent's reasoning at each phase boundary. That ends up being one of the most useful parts of the run history viewable on statewright.ai
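For a concrete picture, a state definition might look something like this (field names here are illustrative, not the exact Statewright schema -- check the repo's JSON schema for the real shape):

```json
{
  "states": {
    "planning": {
      "instructions": "Read the code and produce a plan. Do not modify files.",
      "allowed_tools": ["Read", "Grep", "Glob"],
      "transitions": [
        { "event": "READY", "to": "implementing" }
      ]
    },
    "implementing": {
      "instructions": "Apply the plan. Run the tests before requesting review.",
      "allowed_tools": ["Read", "Edit", "Bash"]
    }
  }
}
```

The per-state `instructions` field plus the generated rejection messages is where the extra detail lives, so the simplicity of the format doesn't cap what the model is told.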
Is the editor/composer separate from the runtime?
If I build a workflow in the visual editor, can I use that same flow inside my own app just by using the runtime/engine? Or is it mainly tied to the Statewright platform and Claude Code plugin?
I’m wondering if the runtime can be used as a standalone piece to power apps I build.
I see constant posts on Reddit/HN about the ways that AI is amazing and at the same time is fudging it (literally). Nobody can make reliability guarantees on something that's non-deterministic and non-idempotent. Nobody's AI workflow suite of tools can claim this. Prompting gets you closer to the mark but is still non-deterministic. Breaking down the problem into chunks with valid transition criteria so that even tiny models can step through them, I believe, gets us closer to where we want to be semantically.
The plan/implement/test workflow is very basic and represents the most common agentic use case. But the state machine pattern applies to any multi-step work where agents are useful but susceptible to death spirals, hallucinations, or other non-deterministic quirkiness. This also enables Claude Desktop and other non-coding agents to perform useful constrained work.
I've been building a content pipeline for tabletop publishing and tested it a bit earlier yesterday. A research phase gathers lore and game details from a compendium, a drafting phase generates structured content including schema-specific JSON validation (so my Lua+LaTeX templates work without iterating). A review gate has me editing content directly (tmux+neovim dialog is great for this). The agent shapes the content, makes sure it conforms to JSON validation and content requirements, then I write it. Before I adapted the state machine to it, the agent tried to do everything all at once — calling multiple agents is sometimes effective but details get lost and you definitely lose visibility in the summarization. The state machine runs everyone serially (for now) but chaining and parallelization are on the roadmap.
While working with statewright on a different workflow over the weekend, Claude (as Claude does) attempted to write an intricate bash script to work around a guardrail... and statewright blocked it! I think that was when I knew there was some real power behind what's been built here. Enforcement has to be structural, not advisory.
Also, being generally useful for things besides coding you can start to think about things like SOC 2 change management. Every change needs a plan, a human review gate, audited implementation, pull request, review, human approval, and then finally a human to approve a production deployment. Today teams enforce this with checklists and hope. An agent constrained by a workflow that won't let it deploy without all the prerequisite pieces is enterprise delivery with an auditable paper trail and humans injected for approvals where they need to be - not managing each change's lifecycle.
The piece I'm most excited about is agent-generated workflows. You solve a problem once and maintain your context, then point the agent at the JSON schema and it creates and uploads a new workflow to statewright automatically that you can use immediately. No fine-tuning, no exhaustive prompt engineering, no dozens of agents... best-fit lightweight guardrails that agents help build themselves, compiling your intent into structure the models can't weasel their way out of. This is a fundamentally different reality than what the current state of the art is practicing. I think that's a big deal.
nocodo is one of my product experiments, currently using a 120B model, but I have tested a few agents inside it with 20B models.
I create a bunch of agents, each with very specific goals. Like Project Manager, Backend Engineer, etc.
Each agent gets a very compact list of tools and access to only certain parts of the filesystem or commands.
https://github.com/brainless/nocodo/tree/main/agents/src
The main difference with Statewright is that tool access changes over time within a single agent. Planning phase gets read-only tools, edit capability unlocks after the agent proves it has adequate understanding... test tools unlock after the fix. State machines handle the phase transitions, guards and retry loops.
Your multi-agent approach decomposes by role instead of by phase/state. Both are valid. Since you're already in Rust, the engine crate (crates/engine) is a pure library with no deps. It might be interesting to see if putting a state machine around your orchestration layer improves your observed performance.
Second thought: enforcing tools is useful and I built myself a Pi extension to deny access to particular tools in some workflows.
But we somehow need to force agents to obey the rules.
For example, I have rules when using Pi to ask the main agent to dispatch implementer agents in parallel using git worktrees. Sometimes it uses git worktrees, sometimes not.
The thoughts are like this: "the user asked me to use git worktrees so let me start using git worktrees. But wait, the task is simple so maybe I don't need git worktrees..."
If I ask why it didn't follow the rules, it says something like: "The user is right, I should have followed the rules..."
"the task is so simple that maybe I don't need worktrees" is the model overriding your intent with its own judgement, and that's a pattern I'm seeing more and more as these models mature. statewright provides the guardrails: strong suggestions up front on what it can do in X state via injection, and if it still wants to try to outsmart that, it gets hit in the post hook and the model gets the message "oh, you're right, I shouldn't do it that way"... instead of you course correcting, it's the state machine
to your first question: the engine is Apache 2.0 and runs locally. the managed service adds the visual editor, run history and plugin install. the enforcement itself doesn't require the cloud, I run the exact same engine on the backend
the MCP server is just the way to get statewright in the hands of a wide array of existing use cases, claude code included. not all agentic clients are created equal and Pi is actually the experience I want to hone next (also the most extensible)
I've taken the approach that whenever this happens, it's my fault. The instructions were not clear enough, not direct enough, or more often, there's just too many of them.
I'm now at the point where my pi system prompt + agents + skills + tools starts out at just 7k context. It's all very clear and concise. I almost never have ambiguous responses like this, at least not near the start of a session.
Combined with instructions to keep the main session as a coordinator and use subagents for all non-trivial work, I can get a lot of work done before hitting 100k context and basically never go over 150k.
It's a stark contrast with Claude Code, where I was starting at about 35k context even after trimming my stuff down. It's hardly surprising if an agent doesn't know what to do when you dump 30k+ of context with all kinds of rules and workflows, most of them unrelated to the current task, before you even do anything.
that's fine 90% of the time... the state machine is for the other 10%, where the model's judgement call costs you an hour of debugging later (confidently fixed wrong, or overzealously) or stalls a mostly automated run because it got stuck on the wrong path.
the state engine doesn't judge, it enforces... with code and not transformers ^_^
if a tool call (or any other guarded action) isn't valid in a given state, it gets rejected before it executes. that's the gap between "a model said this is okay" vs. "the system structurally prevents this"
it (the LLM) can't skip from implementation to deploy if the guard says the tests haven't passed. the model receives feedback that what it tried to do is invalid, along with the reasons why. it can't be skipped. it then works that new information into making the state transition... almost like it would when responding to a human in the chair denying a step.
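A minimal sketch of that guard check, assuming a simple boolean fact store (the names and types here are illustrative, not the engine's actual API):

```rust
use std::collections::HashMap;

// Illustrative: a transition is only valid when every named guard fact is true.
// On denial, the model gets back the unmet guards as an error message, not a
// silent failure.
struct Transition {
    from: &'static str,
    to: &'static str,
    guards: Vec<&'static str>, // fact names that must all be true
}

fn evaluate(t: &Transition, facts: &HashMap<&'static str, bool>) -> Result<(), String> {
    let unmet: Vec<&str> = t
        .guards
        .iter()
        .filter(|g| !facts.get(*g).copied().unwrap_or(false))
        .copied()
        .collect();
    if unmet.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "Transition {} -> {} denied; unmet guards: {}",
            t.from,
            t.to,
            unmet.join(", ")
        ))
    }
}

fn main() {
    let t = Transition {
        from: "implementing",
        to: "deploy",
        guards: vec!["tests_passed"],
    };
    let mut facts = HashMap::new();
    facts.insert("tests_passed", false);

    // Denied: the error text feeds back to the model as its "why".
    let denied = evaluate(&t, &facts);
    assert!(denied.is_err());
    println!("{}", denied.unwrap_err());

    facts.insert("tests_passed", true);
    assert!(evaluate(&t, &facts).is_ok());
}
```

The point is that the check is plain code evaluating recorded facts, so there's no judgement call for the model to argue its way past.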
the model can't merge if it hasn't gone through your review state, even if it wants to (it'll try though)