Moving beyond prompt engineering: Autonomous scientific discovery and the rise of formal agent evaluation

Today’s batch highlights a clear shift in AI research, away from ‘just prompting’ and toward rigorous, autonomous system design. Agentic workflows are maturing across the board, from autonomous disease forecasting to auditable clinical pipelines and formal methods for safety monitoring.

Prospective multi-pathogen disease forecasting using autonomous LLM-guided tree search

Martinson et al. · [abs] [pdf]

This work replaces manual epidemiological modeling with an LLM-driven tree search that autonomously generates and refines executable forecasting code. Validated in real time during the 2025-2026 respiratory season, it shows that LLMs can navigate complex hypothesis spaces and produce models competitive with human-curated stacks.

↳ This is a blueprint for scaling scientific modeling workflows without constant human intervention.

Autonomous Science · Forecasting · LLM-driven Discovery
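
To make the search loop concrete, here is a minimal sketch of a best-first tree search over candidate forecasting models. It is not the paper's implementation: `propose_children` is a hypothetical stand-in for the LLM call that rewrites forecasting code (here it only perturbs a smoothing weight so the example runs offline), and `SERIES` is made-up toy data.

```python
import heapq
from dataclasses import dataclass, field

# Toy weekly case counts; illustrative only, not real surveillance data.
SERIES = [10, 12, 15, 19, 24, 30, 37, 45]


@dataclass(order=True)
class Node:
    score: float                                  # lower is better (mean absolute error)
    alpha: float = field(compare=False)           # the "model" is just a smoothing weight
    depth: int = field(compare=False, default=0)


def evaluate(alpha: float) -> float:
    """Mean absolute one-step-ahead error of exponential smoothing with weight alpha."""
    level = SERIES[0]
    errors = []
    for y in SERIES[1:]:
        errors.append(abs(y - level))             # forecast for this week is the current level
        level = alpha * y + (1 - alpha) * level
    return sum(errors) / len(errors)


def propose_children(node: Node) -> list[float]:
    """Hypothetical stand-in for the LLM refinement step: perturb the parent model."""
    step = 0.2 / (node.depth + 1)
    return [a for a in (node.alpha - step, node.alpha + step) if 0.0 <= a <= 1.0]


def tree_search(budget: int = 20) -> Node:
    """Repeatedly expand the best-scoring candidate until the budget is spent."""
    root = Node(evaluate(0.5), 0.5)
    frontier = [root]
    best = root
    for _ in range(budget):
        if not frontier:
            break
        node = heapq.heappop(frontier)            # most promising candidate so far
        for alpha in propose_children(node):
            child = Node(evaluate(alpha), alpha, node.depth + 1)
            heapq.heappush(frontier, child)
            if child.score < best.score:
                best = child
    return best
```

Calling `tree_search()` returns the best node found within the expansion budget. In a real system the node would hold generated code and `evaluate` would run a backtest, but the expand-score-select structure stays the same.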

Fully Open Meditron: An Auditable Pipeline for Clinical LLMs

Theimer-Lienhard et al. · [abs] [pdf]

The authors introduce a fully auditable clinical LLM pipeline, releasing not just weights, but the entire data provenance, curation stack, and training logic. This pushes back against the ‘black box’ status quo in medical AI, providing a standard for reproducible clinical decision support systems.

↳ Essential for any developer working in regulated industries where model explainability is a legal and ethical requirement.

Healthcare · Open Source · Auditable AI
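
One common building block for this kind of auditability is a content-hash manifest, sketched below. This is an assumption about how provenance could be checked, not a description of the Meditron release. The artifact names in the usage note are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(artifacts: dict[str, bytes]) -> dict[str, str]:
    """Map each artifact name to the digest of its contents."""
    return {name: sha256_hex(blob) for name, blob in sorted(artifacts.items())}


def verify(artifacts: dict[str, bytes], manifest: dict[str, str]) -> list[str]:
    """Return the names whose current contents no longer match the manifest."""
    current = build_manifest(artifacts)
    return [name for name in manifest if current.get(name) != manifest[name]]
```

For example, building a manifest over `{"train.jsonl": ..., "curation.py": ...}` at release time and calling `verify` later would list any file that has been changed or removed since.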

Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems

Alamdari et al. · [abs] [pdf]

This paper integrates formal methods with LLMs to provide runtime monitoring and safety guarantees for AI behavior. By defining temporally extended constraints, the authors demonstrate how to maintain compliance in high-stakes environments throughout the development lifecycle.

↳ Formal methods are among the most mature ways to provide verifiable safety guarantees for complex agents, which makes them a strong route to production-ready enterprise AI.

Formal Methods · Governance · AI Safety
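
As an illustration of a temporally extended constraint, the sketch below monitors an event trace at runtime with a small state machine. The rule and the event names (`change`, `approve`, `deploy`) are hypothetical examples chosen for this digest and do not come from the paper.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    OK = "ok"
    VIOLATION = "violation"


class ApprovalMonitor:
    """Rule: every 'deploy' needs an earlier 'approve' with no 'change' in between."""

    def __init__(self) -> None:
        self.approved = False

    def step(self, event: str) -> Verdict:
        if event == "change":
            self.approved = False          # a new change invalidates earlier approval
        elif event == "approve":
            self.approved = True
        elif event == "deploy" and not self.approved:
            return Verdict.VIOLATION       # the point where a system could block or escalate
        return Verdict.OK


def first_violation(trace: list[str]) -> Optional[int]:
    """Index of the first violating event, or None if the trace complies."""
    monitor = ApprovalMonitor()
    for i, event in enumerate(trace):
        if monitor.step(event) is Verdict.VIOLATION:
            return i
    return None
```

The constraint depends on the history of events rather than on any single action, which is what distinguishes it from a per-step filter. The same monitor can audit a logged trace offline or gate actions as they happen.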

Confirming Correct, Missing the Rest: LLM Tutoring Agents Struggle Where Feedback Matters Most

Yasir et al. · [abs] [pdf]

A large-scale analysis of seven LLM feedback agents reveals they perform well on perfect solutions but consistently fail to distinguish between valid-but-suboptimal and incorrect steps. The study highlights a critical diagnostic ‘blind spot’ in current conversational tutors.

↳ If you are building an educational agent, do not rely on zero-shot LLM evaluation for feedback; your agent is likely hallucinating pedagogical value.

Evaluation · EdTech · Reliability
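
A practical consequence is that a single accuracy number can hide this blind spot. The sketch below breaks results out per step type instead. The three labels and the toy data are assumptions made for illustration and are not results from the study.

```python
from collections import Counter

# Assumed label set for a student step; the study's own taxonomy may differ.
LABELS = ("optimal", "suboptimal", "incorrect")


def accuracy(pairs: list[tuple[str, str]]) -> float:
    """Fraction of (gold, predicted) pairs where the agent's judgment matches."""
    return sum(gold == pred for gold, pred in pairs) / len(pairs)


def per_class_recall(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Recall for each gold label that appears at least once."""
    totals = Counter(gold for gold, _ in pairs)
    hits = Counter(gold for gold, pred in pairs if gold == pred)
    return {label: hits[label] / totals[label] for label in LABELS if totals[label]}


# Made-up judgments: (gold label, label implied by the agent's feedback).
TOY_PAIRS = (
    [("optimal", "optimal")] * 4
    + [("suboptimal", "suboptimal"), ("suboptimal", "incorrect")]
    + [("incorrect", "incorrect"), ("incorrect", "suboptimal")]
)
```

On `TOY_PAIRS`, overall accuracy is 0.75 while recall is 1.0 on optimal steps and only 0.5 on each of the other two classes, so the aggregate figure looks healthier than the cases that matter for feedback.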

FORGE: Self-Evolving Agent Memory With No Weight Updates via Population Broadcast

Bogdanov et al. · [abs] [pdf]

FORGE evolves agent memory by using a population-based protocol to convert failed trajectories into textual heuristics and few-shot examples without updating model weights. It effectively turns the LLM into a self-improving system by accumulating knowledge as external memory artifacts.

↳ A practical, compute-efficient way to make agents smarter over time without the overhead of SFT or RL fine-tuning.

Agents · Memory · In-Context Learning
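
The general idea of improving through external memory rather than weights can be sketched as follows. This is a simplified illustration and not FORGE's actual protocol: `distill` is a hypothetical stand-in for an LLM call that summarises a failure, and the broadcast rule (share every new heuristic with every agent) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    task: str
    actions: list[str]
    success: bool


@dataclass
class Agent:
    name: str
    memory: list[str] = field(default_factory=list)   # textual heuristics, no weights

    def build_prompt(self, task: str) -> str:
        """Prepend accumulated lessons to the task; the model itself is unchanged."""
        lessons = "\n".join(f"- {h}" for h in self.memory)
        return f"Lessons learned:\n{lessons}\n\nTask: {task}"


def distill(traj: Trajectory) -> str:
    """Hypothetical stand-in for an LLM that turns a failure into a heuristic."""
    return f"On '{traj.task}', avoid ending with '{traj.actions[-1]}'."


def broadcast(population: list[Agent], trajectories: list[Trajectory]) -> None:
    """Convert failed trajectories into heuristics and share them with every agent."""
    new_rules = [distill(t) for t in trajectories if not t.success and t.actions]
    for agent in population:
        for rule in new_rules:
            if rule not in agent.memory:               # skip duplicates
                agent.memory.append(rule)
```

After each round, every agent's next prompt carries the lessons from the whole population's failures, so behaviour can change over time even though no parameters are updated.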

Look Before You Leap: Autonomous Exploration for LLM Agents

Ye et al. · [abs] [pdf]

The authors identify ‘premature exploitation’ as a key failure mode, in which agents act on their priors instead of exploring the environment. They propose ‘Exploration Checkpoint Coverage’, a metric for how well an agent maps an environment's affordances before committing to task completion.

↳ If your agent keeps failing in new environments, the cause is likely a lack of exploration rather than a lack of reasoning. This metric helps you measure that gap.

RL · Agents · Exploration
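
The summary does not give the metric's exact definition, so the sketch below assumes the simplest reading: the fraction of predefined checkpoints an agent reached during its exploration phase. The function name and the checkpoint names in the usage note are hypothetical.

```python
from typing import Iterable


def checkpoint_coverage(visited: Iterable[str], checkpoints: Iterable[str]) -> float:
    """Assumed definition: share of required checkpoints the agent actually visited."""
    required = set(checkpoints)
    if not required:
        raise ValueError("at least one checkpoint is required")
    # Repeat visits and visits to non-checkpoint locations do not change the score.
    return len(required & set(visited)) / len(required)
```

For instance, an agent that inspects `door` and `drawer` (even repeatedly) in an environment whose checkpoints are `door`, `drawer`, `switch`, and `shelf` scores 0.5. Logging this alongside task success makes it easier to tell exploration failures apart from reasoning failures.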

Keep your prompts sharp and your evaluations even sharper. See you tomorrow.