I’m linking to a new paper (The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity – Apple Machine Learning Research) by Apple ML researchers that tests the limits of reasoning in large language models through the lens of “problem complexity,” along with commentary (A knockout blow for LLMs? – by Gary Marcus – Marcus on AI) by Gary Marcus that puts it into context. Marcus has been a (correct) skeptic about what current AI will achieve and cautious about the hype around AGI.
The key points of the paper are in its abstract:
Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse.
Gary Marcus puts this all into context: in this paper, LLMs could not, on their own, generate the insight of an algorithm once it advanced beyond a certain complexity. And this was not due to limits of compute or token length (i.e., it was not a scaling problem) but appeared to be structural to the approach itself (predicting the next token, plus long, verbose reasoning steps).
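To make “problem complexity” concrete: one of the puzzles the Apple paper uses is Tower of Hanoi, where the solving algorithm is short and well known, but the length of a correct solution grows exponentially with the number of disks. Here is a minimal sketch (my own illustration in Python, not the paper’s code) of why the complexity knob the authors turn gets steep so quickly:

```python
def hanoi(n, src="A", dst="C", aux="B", moves=None):
    """Return the optimal move sequence for n disks (2**n - 1 moves)."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, src, aux, dst, moves)   # move n-1 disks out of the way
    moves.append((src, dst))             # move the largest disk
    hanoi(n - 1, aux, dst, src, moves)   # stack the n-1 disks back on top
    return moves

# The algorithm is three lines of insight, but the trace the model must
# reproduce doubles with each added disk:
for n in range(1, 9):
    assert len(hanoi(n)) == 2**n - 1
print(len(hanoi(10)))  # → 1023 moves for just 10 disks
```

A model that has genuinely internalized the algorithm should scale to any disk count; a model that is pattern-matching over seen traces will, as the paper reports, collapse once the required trace length passes some threshold.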
Arizona State University computer scientist Subbarao (Rao) Kambhampati has also been criticizing these types of models. Marcus mentions a bias we have introduced in how we describe what the newer LLM approaches are doing:
Rao, as everyone calls him, has been having none of it, writing a clever series of papers that show, among other things that the chains of thoughts that LLMs produce don’t always correspond to what they actually do. Recently, for example, he observed that people tend to overanthromorphize the reasoning traces of LLMs, calling it “thinking” when it perhaps doesn’t deserve that name. Another of his recent papers showed that even when reasoning traces appear to be correct, final answers sometimes aren’t. Rao was also perhaps the first to show that a “reasoning model”, namely o1, had the kind of problem that Apple documents, ultimately publishing his initial work online here, with followup work here.
Here’s where I think this connects to healthcare. For automation of well-patterned tasks, like listening to a conversation and converting it to a well-formed clinical note, or summarizing clinical text and formatting it for a claim rebuttal or a prior authorization, we are likely near the beginning of a cycle of product-market fit, product maturity, and transformation of the way we use natural language in healthcare. When you consider that over 80% of knowledge of the natural history of disease and patient clinical course is in free text, there is a long road to run just with chat interfaces, text summarization, and templated outputs.
But if this paper is to be believed, clinical diagnostics and diagnostic reasoning are farther off. We can get pattern matching out of the current models. But the failure modes of these models are still poorly understood, and they may not even be capable of diagnostic leaps of insight.