Formalizing Agent State and the Rise of Decision-Centric Evaluation

Today’s batch centers on moving agent research from ‘prompt engineering’ toward rigorous systems engineering. We see a clear shift toward formalizing agent state, optimizing long-horizon memory via rate-distortion theory, and grounding evaluation in realistic physical and industrial constraints.

Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace

Simon Yu et al. · [abs] [pdf]

Shepherd introduces a functional substrate that treats meta-agent interactions as typed events, recording execution traces in a Git-like structure. By replacing standard containerization with a specialized process-forking mechanism, it achieves 5x faster state capture and enables supervisor intervention that boosted pair-coding pass rates from 28.8% to 54.5%.

↳ This is a meaningful step toward making agentic workflows deterministic, debuggable, and safe for production environments.

agents · systems · formal methods
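
To make the "typed events in a Git-like structure" idea concrete, here is a minimal sketch, not Shepherd's actual API. Every name in it (`Event`, `Trace`, `commit`, `checkout`, the event kinds) is hypothetical. The only assumption is that each event is content-addressed and points at its parent, the way a Git commit does, so a supervisor can rewind to an earlier event and branch from there.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """One typed step in an agent run (hypothetical schema)."""
    kind: str               # e.g. "tool_call", "supervisor_intervention"
    payload: dict           # JSON-serializable details of the step
    parent: Optional[str]   # digest of the previous event, None for the root

    def digest(self) -> str:
        # Content address: hash of the event body, including the parent link.
        body = json.dumps(
            {"kind": self.kind, "payload": self.payload, "parent": self.parent},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Trace:
    """Append-only event store with a movable head, loosely like a Git branch."""

    def __init__(self) -> None:
        self.objects: Dict[str, Event] = {}
        self.head: Optional[str] = None

    def commit(self, kind: str, payload: dict) -> str:
        event = Event(kind, payload, self.head)
        key = event.digest()
        self.objects[key] = event
        self.head = key
        return key

    def checkout(self, key: str) -> None:
        # Rewind (or jump) to a recorded event; later commits branch from it.
        if key not in self.objects:
            raise KeyError(f"unknown event {key}")
        self.head = key

    def history(self) -> List[Event]:
        # Walk parent links from the head back to the root, oldest first.
        events = []
        cursor = self.head
        while cursor is not None:
            event = self.objects[cursor]
            events.append(event)
            cursor = event.parent
        return list(reversed(events))


# A supervisor rewinds past a failed step and branches with a hint.
trace = Trace()
first = trace.commit("tool_call", {"cmd": "pytest"})
failed = trace.commit("tool_result", {"passed": False})
trace.checkout(first)
trace.commit("supervisor_intervention", {"hint": "fix the import"})
```

After the rewind, `trace.history()` contains only the first call and the intervention, while the failed result stays in `trace.objects` for later inspection. That is the property that makes a trace like this debuggable.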

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

Mingxi Zou et al. · [abs] [pdf]

The authors reframe agent memory as a rate-distortion problem, arguing that effective compression should prioritize decision boundaries over generic relevance scores. By minimizing the loss in achievable decision quality, they provide a principled way to manage memory budgets in long-horizon tasks.

↳ It provides a much-needed theoretical foundation for memory management, moving away from heuristic-based RAG approaches.

memory · decision theory · LLMs
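
A toy calculation shows why "remember the decision" can beat "remember the description" at the same memory budget. This is a sketch under stated assumptions, not the paper's formulation: distortion is taken to be the drop in best achievable expected utility when the agent sees only a compressed memory (a partition of states) instead of the full state, and rate is the entropy of the partition. The states, utilities, and function names are all invented for illustration.

```python
from math import log2

# Hypothetical world: four equally likely states, two actions.
PRIOR = {"red_hot": 0.25, "blue_hot": 0.25, "red_cold": 0.25, "blue_cold": 0.25}
# UTILITY[state] = (utility of action 0 "wait", utility of action 1 "cool")
UTILITY = {
    "red_hot": (0.0, 1.0),
    "blue_hot": (0.0, 1.0),
    "red_cold": (1.0, 0.0),
    "blue_cold": (1.0, 0.0),
}


def best_value(cell, prior, utility):
    """Probability-weighted utility of the best single action for a set of states."""
    n_actions = len(next(iter(utility.values())))
    return max(
        sum(prior[s] * utility[s][a] for s in cell) for a in range(n_actions)
    )


def decision_distortion(partition, prior, utility):
    """Expected utility lost by acting on the partition instead of the full state."""
    full = sum(best_value([s], prior, utility) for s in prior)
    compressed = sum(best_value(cell, prior, utility) for cell in partition)
    return full - compressed


def rate_bits(partition, prior):
    """Entropy of the memory symbol (which cell the state falls in), in bits."""
    masses = [sum(prior[s] for s in cell) for cell in partition]
    return -sum(p * log2(p) for p in masses if p > 0)


# Two 1-bit memories: one keeps the decision boundary, one keeps a description.
BY_TEMPERATURE = [["red_hot", "blue_hot"], ["red_cold", "blue_cold"]]
BY_COLOR = [["red_hot", "red_cold"], ["blue_hot", "blue_cold"]]

for name, partition in [("temperature", BY_TEMPERATURE), ("color", BY_COLOR)]:
    print(name, rate_bits(partition, PRIOR), decision_distortion(partition, PRIOR, UTILITY))
```

Both memories cost one bit, but only the temperature split preserves the optimal action in every state. The color split is just as "descriptive" and loses half the achievable utility in this toy setup.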

The Generalized Turing Test: A Foundation for Comparing Intelligence

Daniel Mitropolsky et al. · [abs] [pdf]

The Generalized Turing Test (GTT) proposes a formal, agent-agnostic comparator where agent A is ‘smarter’ than B if B cannot distinguish between A-imitating-B and B-itself. This provides a mathematical ordering over intelligence based on indistinguishability rather than benchmark accuracy.

↳ A bold attempt to unify capability evaluation, though likely difficult to compute in practice for high-dimensional latent models.

theory · evaluation

BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD

Haozhe Zhang et al. · [abs] [pdf]

BenchCAD evaluates MLLMs on their ability to generate executable parametric CAD code for industrial design. Unlike simple shape-recognition tasks, this requires reasoning about physical feasibility, engineering parameters, and manufacturing logic.

↳ This shifts focus from ‘visual’ understanding to ‘functional’ understanding, which is critical for real-world robotics and manufacturing.

CAD · multimodal · benchmarking

MaD Physics: Evaluating information seeking under constraints in physical environments

Moksh Jain et al. · [abs] [pdf]

MaD Physics evaluates scientific discovery agents on their ability to plan measurements under strict physical and cost constraints. It highlights the failure of existing models to balance experimental discovery with resource limitations.

↳ The focus on constrained experimental design is a prerequisite for moving AI agents into real laboratory settings.

scientific discovery · robotics · evaluation

From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

Pedro Conde et al. · [abs] [pdf]

The paper critiques existing AI pentesting benchmarks for relying on sterile CTF-style environments that fail to capture real-world strategic complexity. It proposes a framework that better aligns evaluation with the open-ended nature of offensive security.

↳ A necessary reality check for the security agent hype; real-world pentesting is fundamentally different from solving narrow exploit puzzles.

security · agents · evaluation

Back to the grind. Some of these evaluation frameworks are worth keeping an eye on for your next architecture review.