I’m linking to a new paper (The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity – Apple Machine Learning Research) by Apple ML researchers that tests the limits of reasoning in large language models through the lens of “problem complexity,” along with commentary (A knockout blow for LLMs? – by Gary Marcus – Marcus on AI) by Gary Marcus that puts it into context. Marcus has been a (correct) skeptic about what current AI will achieve and cautious about the hype around AGI.
The key points of the paper are in its abstract:
Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
Gary Marcus puts this all into context: in this paper, LLMs could not, on their own, generate the insight of an algorithm once it advanced beyond a certain complexity. And this was not due to limits of compute or token length (i.e., it was not a scaling problem) but appeared to be structural to the approach itself (predicting the next token, plus long, verbose reasoning steps).
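To make “problem complexity” concrete: one of the puzzles the Apple paper uses is Tower of Hanoi, where the solving algorithm is short and well known, but the length of a correct solution grows exponentially with the number of disks. Here is a minimal sketch (my own illustration in Python, not the paper’s code) of why the complexity knob the authors turn gets steep so quickly:

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Return the optimal move sequence for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)   # move n-1 disks out of the way
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, dst, src, moves)   # stack the n-1 disks back on top
    return moves

# The algorithm is three lines of insight, but the trace the model must
# reproduce doubles with each added disk:
for n in range(1, 9):
    assert len(hanoi(n)) == 2**n - 1
print(len(hanoi(10)))  # → 1023 moves for just 10 disks
```

A model that has genuinely internalized the algorithm should scale to any disk count; a model that is pattern-matching over seen traces will, as the paper reports, collapse once the required trace length passes some threshold.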
Arizona State University computer scientist Subbarao (Rao) Kambhampati has also been criticizing these types of models. Marcus mentions a bias we have introduced in how we describe what the newer LLM approaches are doing:
Rao, as everyone calls him, has been having none of it, writing a clever series of papers that show, among other things that the chains of thoughts that LLMs produce don’t always correspond to what they actually do. Recently, for example, he observed that people tend to overanthromorphize the reasoning traces of LLMs, calling it “thinking” when it perhaps doesn’t deserve that name. Another of his recent papers showed that even when reasoning traces appear to be correct, final answers sometimes aren’t. Rao was also perhaps the first to show that a “reasoning model”, namely o1, had the kind of problem that Apple documents, ultimately publishing his initial work online here, with followup work here.
Here’s where I think this connects to healthcare. For automation of well-patterned tasks, like listening to a conversation and converting it to a well-formed clinical note, or summarizing clinical text and formatting it for a claim rebuttal or a prior authorization, we are likely near the beginning of a cycle of product-market fit, product maturity, and transformation of the way we use natural language in healthcare. When you consider that over 80% of knowledge of the natural history of disease and patient clinical course is in free text, there is a long road to run just with chat interfaces, text summarization, and templated outputs.
But if this paper is to be believed, clinical diagnostics and diagnostic reasoning are farther off. We can get pattern matching out of the current models. But the failure modes of these models are still poorly understood, and they may not even be capable of diagnostic leaps of insight.