Orchestrating the H&P

Nick Mokey at VentureBeat has a good discussion of an Oxford-authored paper that further exposes the gaps in current LLM systems for healthcare. The authors of the paper (Bean et al.) found


that while LLMs could correctly identify relevant conditions 94.9% of the time when directly presented with test scenarios, human participants using LLMs to diagnose the same scenarios identified the correct conditions less than 34.5% of the time.

Perhaps even more notably, patients using LLMs performed even worse than a control group that was merely instructed to diagnose themselves using “any methods they would typically employ at home.” The group left to their own devices was 76% more likely to identify the correct conditions than the group assisted by LLMs.

The Oxford study raises questions about the suitability of LLMs for medical advice and the benchmarks we use to evaluate chatbot deployments for various applications.

Mokey notes,

When we say an LLM can pass a medical licensing test, real estate licensing exam, or a state bar exam, we’re probing the depths of its knowledge base using tools designed to evaluate humans. However, these measures tell us very little about how successfully these chatbots will interact with humans.

“The prompts were textbook (as validated by the source and medical community), but life and people are not textbook,” explains Dr. Volkheimer….

Real customers use vague terms, express frustration, or describe problems in unexpected ways. The LLM, benchmarked only on clear-cut questions, gets confused and provides incorrect or unhelpful answers…

This study serves as a critical reminder for AI engineers and orchestration specialists: if an LLM is designed to interact with humans, relying solely on non-interactive benchmarks can create a dangerous false sense of security about its real-world capabilities. If you’re designing an LLM to interact with humans, you need to test it with humans – not tests for humans.

Medical students are taught the art of performing a history and physical. It integrates eliciting information, pattern matching on the fly, listening, and developing an inner empathy balanced with learned skepticism, all while building trust through bedside manner. Their final two years are spent building this critical skill. Some third- and fourth-year medical students struggle because, while they have always been the best at multiple-choice question (MCQ) assessments in every class they have ever taken, they are now being evaluated by peers on how they interact with and treat other people.

In residency, this is taken further: in the fast pace of internship and beyond, one must learn to manage time well. What are the key facts this patient in front of me is conveying? How can I quickly assess their needs to make sure I can treat them effectively?

We do not have the benchmarks needed to assess real-world AI performance in a patient setting. Healthcare benchmarks to date are like the tests a second-year med student has aced: MCQ-based knowledge assessments that indicate little about the most important elements of patient care.

Additionally, we need to think about how we orchestrate effective experiences. How do we break down the components of a patient interaction and connect those elements? “AI” needs to be viewed as a system that orchestrates the multiple tasks the caregiver manages in the encounter: eliciting a history as a trusted partner, pattern matching on the fly with guided follow-up questions, performing a targeted physical exam, and ordering the necessary and sufficient diagnostics for a workup.
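To make the idea of orchestration a little more concrete, here is a minimal sketch in Python of an encounter treated as a sequence of connected stages rather than a single model call. Everything in it is an assumption made for illustration: the names `Encounter`, `make_stage`, `PIPELINE`, and `run_encounter` are hypothetical, and each stage is a placeholder that only records that it ran. In a real system a stage might be a model, a tool, or a clinician.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Encounter:
    """Shared state handed from one stage of the encounter to the next."""
    notes: List[str] = field(default_factory=list)


# A stage is any callable that reads and updates the encounter state.
Stage = Callable[[Encounter], Encounter]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran (hypothetical)."""
    def stage(enc: Encounter) -> Encounter:
        enc.notes.append(name)
        return enc
    return stage


# The four tasks named in the text, in order. All are stand-ins.
PIPELINE: List[Stage] = [
    make_stage("elicit_history"),
    make_stage("follow_up_questions"),
    make_stage("targeted_exam"),
    make_stage("order_diagnostics"),
]


def run_encounter(stages: List[Stage], enc: Optional[Encounter] = None) -> Encounter:
    """Run each stage in order, passing the shared state along."""
    if enc is None:
        enc = Encounter()
    for stage in stages:
        enc = stage(enc)
    return enc


if __name__ == "__main__":
    print(run_encounter(PIPELINE).notes)
```

The point of the sketch is the shape, not the contents: each stage can be tested, replaced, or handed to a human on its own, and the shared state is what connects them.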

This is why the paper from Apple last week matters for healthcare. The complexity of the healthcare challenge is like the Tower of Hanoi: seemingly just an extension of better pattern matching, but actually a problem of insight. It is not a scaling problem but a systems problem. It won’t need bigger LLMs or more compute, but better architectures, which perhaps agents will address.
