Today’s batch centers on moving agent research from ‘prompt engineering’ toward rigorous systems engineering. We see a clear shift toward formalizing agent state, optimizing long-horizon memory via rate-distortion theory, and grounding evaluation in realistic physical and industrial constraints.
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd introduces a functional substrate that treats meta-agent interactions as typed events, recording execution traces in a Git-like structure. By replacing standard containerization with a specialized process-forking mechanism, it achieves 5x faster state capture and enables supervisor intervention that boosted pair-coding pass rates from 28.8% to 54.5%.
↳ This is a meaningful step toward making agentic workflows deterministic, debuggable, and safe for production environments.
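To make the "typed events in a Git-like structure" idea concrete, here is a minimal sketch of a content-addressed, append-only trace with branches. It is an illustration only and not Shepherd's actual API: the names `Event`, `Trace`, `commit`, and `fork`, and the event kinds used below, are all assumptions made for this example. It also does not model the process-forking state capture the paper describes, only the trace bookkeeping.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One typed event in the trace (hypothetical schema)."""
    kind: str               # e.g. "model_output", "tool_call", "supervisor_edit"
    payload: dict
    parent: Optional[str]   # id of the previous event, None for the root

    @property
    def id(self) -> str:
        # Content address: hash of the event body plus its parent id,
        # similar in spirit to how a Git commit hash covers its parent.
        body = json.dumps(
            {"kind": self.kind, "payload": self.payload, "parent": self.parent},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode()).hexdigest()


class Trace:
    """Append-only event store with named branches."""

    def __init__(self):
        self.events = {}  # event id -> Event
        self.heads = {}   # branch name -> id of the latest event

    def commit(self, branch: str, kind: str, payload: dict) -> str:
        event = Event(kind, payload, self.heads.get(branch))
        self.events[event.id] = event
        self.heads[branch] = event.id
        return event.id

    def fork(self, source: str, new_branch: str, at: Optional[str] = None) -> None:
        # Start a new branch at the source head, or at an earlier event id.
        self.heads[new_branch] = at if at is not None else self.heads[source]

    def history(self, branch: str) -> list:
        # Walk parent links from the branch head back to the root.
        out, cur = [], self.heads.get(branch)
        while cur is not None:
            event = self.events[cur]
            out.append(event)
            cur = event.parent
        return list(reversed(out))


trace = Trace()
first = trace.commit("main", "model_output", {"text": "draft patch"})
trace.commit("main", "tool_call", {"cmd": "run_tests", "passed": False})

# A supervisor rewinds to the step before the failure and tries again.
trace.fork("main", "retry", at=first)
trace.commit("retry", "supervisor_edit", {"hint": "handle empty input"})
```

In this sketch a supervisor intervention never rewrites history: the failed attempt stays on `main`, and the corrected attempt lives on `retry` while sharing the same first event.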
Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
The authors reframe agent memory as a rate-distortion problem, arguing that effective compression should prioritize decision boundaries over generic relevance scores. By minimizing the loss in achievable decision quality, they provide a principled way to manage memory budgets in long-horizon tasks.
↳ It provides a much-needed theoretical foundation for memory management, moving away from heuristic-based RAG approaches.
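A toy sketch of the contrast between relevance-ranked and decision-aware memory selection follows. It is not the paper's algorithm: it assumes each memory item carries a token cost (the rate) and a known, additive `decision_loss` (the distortion incurred if the item is forgotten), and the items and numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    name: str
    tokens: int           # rate: cost of keeping the item in context
    relevance: float      # generic similarity-style score
    decision_loss: float  # distortion: drop in decision quality if forgotten


def select_by_relevance(items, budget):
    """Baseline: keep the most 'relevant' items that still fit the budget."""
    kept, used = [], 0
    for item in sorted(items, key=lambda m: m.relevance, reverse=True):
        if used + item.tokens <= budget:
            kept.append(item)
            used += item.tokens
    return kept


def select_by_distortion(items, budget):
    """0/1 knapsack: maximize the decision loss avoided within the budget."""
    # best[b] = (avoided loss, kept items) using at most b tokens
    best = [(0.0, [])] * (budget + 1)
    for item in items:
        for b in range(budget, item.tokens - 1, -1):
            value = best[b - item.tokens][0] + item.decision_loss
            if value > best[b][0]:
                best[b] = (value, best[b - item.tokens][1] + [item])
    return best[budget][1]


def distortion(items, kept):
    """Total decision loss from everything that was dropped."""
    kept_names = {m.name for m in kept}
    return sum(m.decision_loss for m in items if m.name not in kept_names)


ITEMS = [
    MemoryItem("user prefers dark mode", 40, 0.9, 0.05),
    MemoryItem("old endpoint now returns 401", 60, 0.4, 0.60),
    MemoryItem("description of repo layout", 60, 0.8, 0.10),
    MemoryItem("tests need a special flag", 30, 0.3, 0.30),
]

by_relevance = select_by_relevance(ITEMS, budget=100)
by_distortion = select_by_distortion(ITEMS, budget=100)
```

With these made-up numbers, relevance ranking keeps the two descriptive items and drops both facts that change what the agent should do next, while the distortion-aware selection keeps exactly those two facts. The additive-loss assumption is a simplification; real decision loss will generally depend on which items are kept together.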
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test (GTT) proposes a formal, agent-agnostic comparator where agent A is ‘smarter’ than B if B cannot distinguish between A-imitating-B and B-itself. This provides a mathematical ordering over intelligence based on indistinguishability rather than benchmark accuracy.
↳ A bold attempt to unify capability evaluation, though likely difficult to compute in practice for high-dimensional latent models.
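The comparator can be illustrated with a deliberately simple, deterministic toy. This is not the paper's formal definition, which rests on indistinguishability rather than exact matching. Here an agent is modeled as a set of question types it answers correctly, and the imitator is assumed to know exactly where its target fails. The `Agent` class and every name in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    skills: frozenset  # question types this agent answers correctly

    def answer(self, question: str) -> str:
        return "correct" if question in self.skills else "wrong"

    def imitate(self, target: "Agent", question: str) -> str:
        # Faking a failure is always possible; reproducing a success
        # requires actually being able to answer the question.
        if target.answer(question) == "correct":
            return self.answer(question)
        return "wrong"

    def can_tell_apart(self, imitator: "Agent", questions) -> bool:
        # Acting as judge: does the imitator ever deviate from my own behavior?
        return any(imitator.imitate(self, q) != self.answer(q) for q in questions)


def at_least_as_smart(a: Agent, b: Agent, questions) -> bool:
    """Toy GTT: a >= b if b cannot distinguish a-imitating-b from b itself."""
    return not b.can_tell_apart(a, questions)


QUESTIONS = ["arithmetic", "algebra", "calculus"]
strong = Agent("strong", frozenset(QUESTIONS))
weak = Agent("weak", frozenset({"arithmetic"}))

strong_covers_weak = at_least_as_smart(strong, weak, QUESTIONS)
weak_covers_strong = at_least_as_smart(weak, strong, QUESTIONS)
```

In this toy the relation reduces to skill-set inclusion, so the ordering is easy to compute. With real models the judge would have to estimate a statistical distinguishing advantage over many interactions, which is where the computational difficulty noted above comes in.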
BenchCAD: A Comprehensive, Industry-Standard Benchmark for Programmatic CAD
BenchCAD evaluates MLLMs on their ability to generate executable parametric CAD code for industrial design. Unlike simple shape-recognition tasks, this requires reasoning about physical feasibility, engineering parameters, and manufacturing logic.
↳ This shifts focus from ‘visual’ understanding to ‘functional’ understanding, which is critical for real-world robotics and manufacturing.
MaD Physics: Evaluating information seeking under constraints in physical environments
MaD Physics evaluates scientific discovery agents on their ability to plan measurements under strict physical and cost constraints. It highlights the failure of existing models to balance experimental discovery with resource limitations.
↳ The focus on constrained experimental design is a prerequisite for moving AI agents into real laboratory settings.
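As a rough picture of what planning measurements under a cost budget can look like, the sketch below greedily picks the measurement with the best expected information gain per unit cost. It is not the benchmark's task format or scoring: the uniform prior, the noiseless measurements, the greedy rule, and the spring-constant example are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def expected_entropy(hypotheses, chosen):
    """Expected remaining entropy in bits after running `chosen` (uniform prior)."""
    groups = defaultdict(int)
    for h in hypotheses:
        outcome = tuple(m["predict"](h) for m in chosen)
        groups[outcome] += 1
    total = len(hypotheses)
    return sum(n / total * math.log2(n) for n in groups.values())


def plan_measurements(hypotheses, measurements, budget):
    """Greedy non-adaptive plan: best expected information gain per unit cost."""
    chosen, spent = [], 0.0
    current = expected_entropy(hypotheses, chosen)
    while current > 0:
        best, best_score, best_entropy = None, 0.0, current
        for m in measurements:
            if m in chosen or spent + m["cost"] > budget:
                continue  # already planned, or unaffordable
            after = expected_entropy(hypotheses, chosen + [m])
            score = (current - after) / m["cost"]
            if score > best_score:
                best, best_score, best_entropy = m, score, after
        if best is None:
            break  # nothing affordable reduces uncertainty further
        chosen.append(best)
        spent += best["cost"]
        current = best_entropy
    return [m["name"] for m in chosen], spent, current


# Hypothetical setup: four candidate spring constants, three instruments.
HYPOTHESES = [1, 2, 3, 4]
MEASUREMENTS = [
    {"name": "coarse", "cost": 1, "predict": lambda k: k > 2},
    {"name": "parity", "cost": 1, "predict": lambda k: k % 2},
    {"name": "fine", "cost": 5, "predict": lambda k: k},
]

names, spent, remaining = plan_measurements(HYPOTHESES, MEASUREMENTS, budget=3)
```

Here the precise but expensive instrument does not fit the budget, so the planner combines the two cheap ones, which together identify the hypothesis. Greedy selection is only a heuristic and can be suboptimal in general; the point is that cost enters the objective rather than being checked after the fact.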
From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World
The paper critiques existing AI pentesting benchmarks for relying on sterile CTF-style environments that fail to capture real-world strategic complexity. It proposes a framework that better aligns evaluation with the open-ended nature of offensive security.
↳ A necessary reality check for the security agent hype; real-world pentesting is fundamentally different from solving narrow exploit puzzles.
📈 Patterns
The field is finally tiring of ‘vibe-based’ evaluation. The focus is clearly shifting toward system-level stability, decision-theoretic memory, and task-specific constraints that mimic real-world manufacturing and engineering.
Back to the grind. Several of these evaluation frameworks are worth keeping an eye on for your next architecture review.
