Moving beyond prompt engineering: The shift toward agentic systems, formal verification, and structural memory.

Today’s batch highlights a clear maturation in the agent ecosystem. We are seeing a transition from simple sequential reasoning to structured frameworks that integrate formal verification, complex memory management, and specialized infrastructure for sparse operations.

Goedel-Architect: Streamlining Formal Theorem Proving with Blueprint Generation and Refinement

Chung et al. · [abs] [pdf]

This framework moves away from monolithic proof generation by employing a blueprint-first strategy in Lean 4. By decomposing proofs into dependency graphs and iteratively refining lemmas, the system achieves higher success rates in formal verification tasks compared to flat, end-to-end prompting.

↳ It successfully applies software engineering modularity to the inherently messy process of LLM-driven formal verification.

Formal Methods Lean 4 Reasoning

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

Chen et al. · [abs] [pdf]

Vortex introduces a domain-specific language for sparse attention, allowing developers to define custom attention patterns that map efficiently to underlying GPU kernels. By abstracting the complexity of hardware-level optimization, it enables faster prototyping and deployment of long-context sparse models.

↳ Essential for practitioners dealing with long-horizon agents where dense attention becomes a primary bottleneck in both latency and VRAM.

Systems Attention Optimization

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Omri et al. · [abs] [pdf]

This work provides a formal taxonomy of agent memory systems, ranging from naive retrieval to stateful update flows. The authors analyze how different memory architectures impact performance in long-horizon tasks, identifying the specific trade-offs between update overhead and recall accuracy.

↳ This is a necessary step toward standardizing ‘statefulness’ in agent design, moving beyond the current ‘anything goes’ approach to memory.

Agentic Systems Memory

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Du et al. · [abs] [pdf]

MLEvolve addresses the limitations of isolated search branches in automated discovery by using a ‘Progressive MCGS’ (Monte Carlo Graph Search) structure. This allows agents to share knowledge and findings across disparate search paths, resulting in more robust machine learning algorithm discovery.

↳ It replaces memoryless search with a persistent stateful graph, which is arguably the correct way to handle multi-step scientific discovery.

AutoML Agentic Discovery

Benchmark Everything Everywhere All at Once

Xiong et al. · [abs] [pdf]

The authors present a system for autonomous benchmark creation, aiming to mitigate the data leakage and saturation issues seen in manual benchmarks. The system orchestrates the pipeline from data generation to evaluation criteria definition without human-in-the-loop intervention.

↳ While automated benchmarking is prone to its own biases, it is likely the only way to keep pace with model evaluation requirements given current release velocities.

Evaluation Benchmarks

Humans’ ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Chen et al. · [abs] [pdf]

This paper focuses on the ‘mental model’ gap in human-agent collaboration. The authors provide a dataset of human interactions with fine-grained annotations tracking intent and goal alignment, providing a much-needed benchmark for agent social reasoning.

↳ Evaluation of ‘collaboration’ has been purely anecdotal; this dataset forces us to define it quantitatively at an action level.

Human-AI Interaction Collaboration

Keep your benchmarks tight and your memory hierarchies efficient. See you tomorrow.