LangChain RubricMiddleware: AI Agents That Grade Themselves

LangChain released RubricMiddleware, a component for its Deep Agents framework that makes an AI agent check its own output against a list of pass-or-fail criteria and redo the work if it falls short. It landed in deepagents 0.6.5 at the end of May, and it is still labeled beta, which means the API can change under you.

The pitch is familiar to anyone who has watched an agent confidently declare a task done. The agent writes the report, skips a required section, and hands it back as if nothing happened. Or the code mostly works, but two tests fail. RubricMiddleware is built for exactly that gap: tasks where "done" has a clear, checkable definition the model won't reliably hit on the first pass.

How the loop actually runs

You declare a rubric. Plain text, one criterion per line. The example in the docs asks an agent to write a paragraph about the ocean without using the letter "e" (a lipogram, if you want the word for it), with at least two sensory details and a 3-to-4-sentence length. When the working agent thinks it is finished, a separate grader sub-agent reads the whole transcript and scores it against each line.

If the grader returns needs_revision, its per-criterion feedback gets injected back into the conversation and the agent tries again. The cycle ends on one of four terminal states: every criterion satisfied, the iteration cap reached, the rubric judged impossible to evaluate, or the grader itself throwing an exception. Counting needs_revision, that makes five verdicts in total. The default cap is three iterations, and you cannot push it past twenty.
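To make the control flow concrete, here is a minimal sketch of that revise loop in plain Python. It does not use the deepagents API, and it is not how LangChain implements the feature. Every name in it (Verdict, Grade, parse_rubric, run_with_rubric, the toy worker and grader) is hypothetical, and of the five verdict labels only needs_revision comes from the docs; the other four are placeholders for the terminal states described above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable

MAX_ITERATIONS_LIMIT = 20  # the cap cannot be pushed past twenty


class Verdict(Enum):
    # Only "needs_revision" is a documented label; the rest are made up here.
    SATISFIED = "satisfied"
    NEEDS_REVISION = "needs_revision"
    MAX_ITERATIONS = "max_iterations"
    UNEVALUABLE = "unevaluable"
    GRADER_ERROR = "grader_error"


@dataclass
class Grade:
    verdict: Verdict
    feedback: list[str]  # per-criterion notes for the working agent


Worker = Callable[[list[str]], str]
Grader = Callable[[list[str], list[str]], Grade]


def parse_rubric(text: str) -> list[str]:
    """Split a plain-text rubric into criteria, one per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def run_with_rubric(worker: Worker, grader: Grader, rubric_text: str,
                    max_iterations: int = 3) -> tuple[Verdict, list[str]]:
    """Draft, grade, and revise until a terminal verdict is reached."""
    if not 1 <= max_iterations <= MAX_ITERATIONS_LIMIT:
        raise ValueError("max_iterations must be between 1 and 20")
    criteria = parse_rubric(rubric_text)
    transcript: list[str] = []
    for _ in range(max_iterations):
        transcript.append(worker(transcript))
        try:
            # The grader sees the whole transcript, not just the last draft.
            grade = grader(transcript, criteria)
        except Exception:
            return Verdict.GRADER_ERROR, transcript
        if grade.verdict is not Verdict.NEEDS_REVISION:
            return grade.verdict, transcript
        # Inject the feedback so the next draft can react to it.
        transcript.extend(f"feedback: {note}" for note in grade.feedback)
    return Verdict.MAX_ITERATIONS, transcript


RUBRIC = """
The paragraph does not use the letter "e".
The paragraph includes at least two sensory details.
The paragraph is 3 to 4 sentences long.
"""


def toy_worker(transcript: list[str]) -> str:
    # Stand-in for the agent: it only fixes the draft after seeing feedback.
    saw_feedback = any(t.startswith("feedback:") for t in transcript)
    return "A calm bay." if saw_feedback else "The sea."


def toy_grader(transcript: list[str], criteria: list[str]) -> Grade:
    # Stand-in for the grader: it checks only the first criterion.
    draft = [t for t in transcript if not t.startswith("feedback:")][-1]
    if "e" in draft.lower():
        return Grade(Verdict.NEEDS_REVISION, ["remove the letter e"])
    return Grade(Verdict.SATISFIED, [])


verdict, transcript = run_with_rubric(toy_worker, toy_grader, RUBRIC)
print(verdict.value, len(transcript))  # satisfied 3
```

The toy run takes two drafts: the first fails, the feedback goes into the transcript, and the second passes. A real worker and grader would be model calls, which is where the cost of each extra iteration comes from.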

Here is the part I find more interesting than the loop itself. The grader can be handed its own tools. In LangChain's example, it gets a Python function that actually counts the forbidden letters rather than eyeballing the prose. So the grader isn't only reasoning about whether the text reads well. It runs a check and gets a number back. For a code task, that same hook is where you would wire in the test suite.
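For a sense of how small such a tool can be, here is a sketch of a letter counter of the kind described. It is not LangChain's example code: the function names are hypothetical, the sentence counter is my own addition for the length criterion, and how a function gets registered with the grader is left out because that depends on the beta API.

```python
import re


def count_letter(text: str, letter: str) -> int:
    """Count case-insensitive occurrences of a single letter in text."""
    if len(letter) != 1:
        raise ValueError("letter must be exactly one character")
    return text.lower().count(letter.lower())


def count_sentences(text: str) -> int:
    """Rough sentence count: runs of text that end in '.', '!' or '?'."""
    return len(re.findall(r"[^.!?]+[.!?]+", text))


draft = "Salt air stings. Foam drifts in. Gulls cry."
print(count_letter(draft, "e"))   # 0
print(count_sentences(draft))     # 3
```

The point of the design is that the grader gets a number back instead of an impression. A count of zero is a pass on the lipogram criterion no matter how the prose reads. Note that this simple version treats accented characters such as "é" as different letters.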

Is this just LLM-as-a-judge with extra steps?

Sort of, and LangChain says as much. The company frames it as the same LLM-as-a-judge pattern its LangSmith product already uses for offline evaluation, except moved to runtime so it drives revision instead of just scoring a batch after the fact. That is an honest framing, and it is also worth being a little skeptical about. A model grading another model's output inherits all the usual problems of a model grading anything. The tool-calling escape hatch helps when the criterion is mechanical, like counting letters or running tests. It helps a lot less when the rubric says something like "includes vivid sensory details," which is exactly the kind of fuzzy judgment LLM graders are shakiest on.

The source material I started from compared this to a /goal feature in Claude Code and Codex. I couldn't verify that those tools ship anything by that name, so take the specific comparison with caution. What LangChain does claim is that a dedicated grader sub-agent, one that sees the entire execution trace rather than just the final answer, gives it more flexibility than alternatives. Reasonable enough as an architectural argument. Whether it produces better outputs in practice is not something a docs page can settle.

The contradiction nobody's mentioning

Deep Agents follows what its maintainers openly call a "trust the LLM" model. The project README spells it out: enforce boundaries at the tool and sandbox level, not by expecting the model to self-police. And then here comes a feature whose entire job is the model policing itself.

The reconciliation, I think, is that RubricMiddleware doesn't trust the model to self-police, exactly. It trusts the grader's tools. When a criterion can be checked by running actual code, the self-evaluation has teeth. When it can't, you're back to one language model's opinion of another's, dressed up in a verdict enum.

What you can do with it now

The component is provider-agnostic, like the rest of Deep Agents, so it runs on any model that supports tool calling. You can stream each grader pass as it fires, register a callback to log every evaluation, and persist a rubric across follow-up invocations if you attach a checkpointer and reuse the thread ID. The docs include a worked example and a documented trace if you want to see the iteration history before committing to it.

Beta means the surface will move. If you build against the current verdict schema or the streaming events, budget for some churn before this stabilizes. Whether self-grading agents turn out to be a durable pattern or a stopgap until models stop forgetting required sections on their own, the next few releases of deepagents will tell you more than the announcement does.


Andrés Martínez
