Notes from this morning's panel on agentic AI with Stanford
Three insights on agentic deployments at Stanford, and two more from across the field.
One more thanks to Kameron Black and Timothy Keyes for a panel that named the limits as well as the wins. Key highlights:
1. Eligibility screening is no longer a research problem; it’s a buildable one. Stanford’s surgical co-management agent, SCM Navigator, has been in production for six months, triaging incoming surgical patients for hospitalist consult. Across 6,193 patients, it runs at 83.6% accuracy in deployment, drives ~90% of the SCM team’s caseload, and has been associated with zero patient safety events. For HaH programs filtering hundreds of patients for candidacy each day against well-established criteria, this is the same problem, and the solution isn’t theoretical anymore (a minimal sketch of the screening pattern follows this list).
2. That said, agents work, just not for every job. On long-horizon tasks in PhysicianBench, GPT-5.5 performed best but got it right on the first try only 46% of the time, and the bulk of failures came from clinical reasoning itself, not from data lookup or EHR actions. On bounded tasks, though, they ship. Timothy summed up Stanford's deployment record: "We've had a lot of success with deployments, even those where the level of autonomy is actually relatively low." The opening for HaH leaders building today: scope deployments to what agents are actually good at (screening, drafting, retrieval, surfacing decompensation signals) and keep the clinical judgment with your team. That's where your program's edge is.
3. The model is the easy part; your local context is the moat. Stanford’s MedAgentBench V2 showed that better tool design, hospital-specific knowledge fed in as worked examples, and access to a memory of prior cases raised performance from 70% to 91% on the same benchmark. The win there isn't model selection; it's the context your institution supplies. Kameron framed where this leads: "Memory will enable specialty preferences and institutional protocols to be baked in." If you're building, this is the part your team actually owns: how your program runs, the cases you've seen, the workflow you've refined (see the second sketch after this list). The frontier model is a commodity, but your institutional know-how is not.
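To make point 1 concrete, here's a minimal sketch of the screening pattern, purely illustrative and not how SCM Navigator actually works: hard exclusions are decided automatically, and anything borderline is routed to a clinician rather than auto-decided. The criteria, thresholds, and field names are all hypothetical.

```python
# Hypothetical eligibility screen for HaH candidacy. Criteria and field names
# are illustrative only; a real program would encode its own protocol.
from dataclasses import dataclass

@dataclass
class Referral:
    patient_id: str
    age: int
    lives_within_service_area: bool
    requires_icu_level_care: bool
    has_caregiver_at_home: bool

def screen(referral: Referral) -> tuple[str, list[str]]:
    """Return ('accept' | 'reject' | 'review') plus the reasons."""
    reasons: list[str] = []
    # Hard exclusions: decided automatically.
    if referral.requires_icu_level_care:
        reasons.append("requires ICU-level care")
    if not referral.lives_within_service_area:
        reasons.append("outside service area")
    if reasons:
        return "reject", reasons
    # Borderline cases: route to a clinician rather than auto-decide.
    if not referral.has_caregiver_at_home or referral.age >= 85:
        return "review", ["borderline case, needs hospitalist judgment"]
    return "accept", ["meets screening criteria"]

print(screen(Referral("pt-001", 72, True, False, True)))
```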
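And a sketch of the point-3 pattern, assuming a simple prompt-assembly approach: the institution's worked examples plus a memory of prior cases get packed in front of every model call, so the local context keeps improving while the underlying model stays swappable. The example cases, the keyword-overlap recall, and the prompt format are hypothetical stand-ins, not Stanford's implementation.

```python
# Hypothetical prompt assembly: institution-specific worked examples plus a
# memory of prior cases, packed in front of every call to whatever model is current.

WORKED_EXAMPLES = [  # encodes local protocol, not general medicine
    {"case": "Post-op hip replacement, stable vitals, lives alone",
     "decision": "review", "why": "our post-op ortho pathway requires in-home support"},
    {"case": "Cellulitis on IV antibiotics, caregiver present, in service area",
     "decision": "accept", "why": "matches our standing HaH cellulitis pathway"},
]

CASE_MEMORY: list[dict] = []  # grows as the program runs; persist it in practice

def recall(query: str, k: int = 2) -> list[dict]:
    """Naive keyword-overlap retrieval over prior cases (stand-in for real retrieval)."""
    terms = set(query.lower().split())
    ranked = sorted(CASE_MEMORY,
                    key=lambda c: len(terms & set(c["case"].lower().split())),
                    reverse=True)
    return ranked[:k]

def build_prompt(new_case: str) -> str:
    """Combine worked examples + recalled prior cases + the new referral."""
    lines = ["You screen referrals for our Hospital-at-Home program."]
    lines += [f"Example: {ex['case']} -> {ex['decision']} ({ex['why']})"
              for ex in WORKED_EXAMPLES]
    lines += [f"Prior case: {c['case']} -> {c['decision']}" for c in recall(new_case)]
    lines.append(f"New referral: {new_case}")
    lines.append("Decision and rationale:")
    return "\n".join(lines)

print(build_prompt("Pneumonia, on oral antibiotics, caregiver present"))
```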
Two more from across the field
Anthropic: assemble agent infrastructure like an operating system, not one all-in-one app. Their April engineering essay walks through what they got from splitting the parts that *think*, *act*, and *remember* into independent, replaceable pieces — each free to evolve as the underlying models improve. They saw a 90% drop in their slowest response times after the restructure.
For HaH leaders thinking about how to build: modular wins. The constant pace of model improvement from frontier labs is the largest free source of gains, so design your agent system to capitalize on it, not compete with it (a rough sketch of the interface split is below).
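Here is what that split can look like in code, with the pieces that think, act, and remember behind small interfaces so each can be swapped independently as models and tools improve. The interface names and toy implementations are mine, not Anthropic's.

```python
# Hypothetical "OS-style" agent layout: think / act / remember behind small
# interfaces, so each piece can be upgraded without touching the others.
from typing import Protocol

class Thinker(Protocol):
    def plan(self, goal: str, context: str) -> str: ...

class Actor(Protocol):
    def execute(self, step: str) -> str: ...

class Memory(Protocol):
    def recall(self, query: str) -> str: ...
    def store(self, item: str) -> None: ...

class Agent:
    """Orchestrates the three pieces without knowing their implementations."""
    def __init__(self, thinker: Thinker, actor: Actor, memory: Memory) -> None:
        self.thinker, self.actor, self.memory = thinker, actor, memory

    def run(self, goal: str) -> str:
        context = self.memory.recall(goal)       # remember
        step = self.thinker.plan(goal, context)  # think
        result = self.actor.execute(step)        # act
        self.memory.store(f"{goal} -> {result}")
        return result

# Toy implementations just to show the wiring; swap any one of them
# (a newer planning model, a different EHR adapter, a real memory store)
# without changing Agent.run.
class StubThinker:
    def plan(self, goal: str, context: str) -> str:
        return f"one step toward '{goal}' (recalled: {context or 'nothing'})"

class StubActor:
    def execute(self, step: str) -> str:
        return f"did: {step}"

class ListMemory:
    def __init__(self) -> None:
        self._items: list[str] = []
    def recall(self, query: str) -> str:
        return "; ".join(self._items[-2:])
    def store(self, item: str) -> None:
        self._items.append(item)

agent = Agent(StubThinker(), StubActor(), ListMemory())
print(agent.run("draft the overnight vitals summary for clinician review"))
```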
MGB is betting expansion on eligibility automation. Mass General Brigham’s HaH program announced it is entering “growth mode”: now five hospitals, 70-bed capacity, and expansion into oncology, postoperative care, behavioral health, and dementia. Capacity growth was explicitly tied to ML-driven eligibility screening and a Philips partnership on an “autonomous digital workforce.”
The largest HaH program in the country is treating eligibility automation as a build, not a side experiment, and tying it directly to how it grows.
