Put Large Reasoning Models under pressure and they stop making sense, say boffins

Opinion Among the forever wars in geekdom, defining the difference between science fiction and fantasy is a hot potato destined to outlive the heat death of the universe.

There is no right answer and it doesn't matter, hence the abiding popularity of the question, but attempting to make that delineation can still be useful when analyzing IT industry hype. Is a promise technically feasible, or are dragon-riding pixies happening first? Yes, AI, we're talking about you again.

Look at the suggestion that IT staff should make agentic digital twins of themselves to, ahem, reduce the amount of burdensome work they have to personally do. That's a room with enough elephants to restock Africa, if it worked. If your twin mucks up, who carries the can? What's the difference between "burdensome work" and "job?" Who owns the twin when you leave? Have none of these people seen the Sorcerer's Apprentice segment of Fantasia? Fortunately, a better question leads on from all that: is the idea science fiction or fantasy? As with all good speculative fiction, there's both history and logic to help us decide.

History first. The proposal isn't new; it's a reprise of a spectacular AI failure from the mid-'80s: expert systems. The idea was to combine the then-hotness of Lisp, a language designed to work with huge lists of conceptual data to reach correct conclusions, with training acquired by analyzing how domain experts did their work. Exciting stuff, and the dollars flowed in. At last, real AI was here! Real AI was not here, sadly, and the whole field quietly died for the highly technical reason that it just didn't work.

It wasn't so much that '80s technology wasn't up to the job - there were promising early results; Moore's Law was in its exponential pomp; and there was an avalanche of money. Besides, we're now in the impossibly puissant digital world of 2025 and could run Lisp at superluminal speed if we wanted to. Nobody wants to.

The problem was, and is, that it isn't clear how humans make expert decisions. We aren't built from arrays and flow charts, and decades of experience cannot be siphoned out of the brains that own and use it. That's why new graduates come out of 15-plus years of full-time education by expert humans and still aren't very good at their first job. AI can't fix that.

Even if it could break the brain bottleneck, AI is a long way from being good enough to become a digital twin of anyone, no matter how inexpert. In a science fiction scenario, it could plausibly become so over time as machines and techniques improve; in fantasy, you can't get there from here without Gandalf as team lead. There are many signs that we'll need to shop for pointy hats soon. AI isn't living up to its hype even now, and attempts to push it further aren't going well.

We know this because the actual results of AI in our daily lives, such as search, show things it can't do that aren't getting better - perhaps the opposite. AI model collapse from bad training isn't cured by bigger models. You in particular know this, because professional IT humans are right at the heart of the AI experiment and you know just how well, and how badly, AI coding goes. Finding and stitching together known constructs and components? Useful, when it's not tripping its bits off. Functional analysis and creating novel solutions to novel problems? Not so much.

This experiential, anecdotal suspicion that not all is roses in the AI garden is backed up by actual analysis. Apple researchers have published a paper [PDF] that looks at how well frontier large language models (LLMs) with enhanced reasoning - large reasoning models (LRMs) such as OpenAI's o1/o3, DeepSeek-R1, and so on - stack up in problem solving, by feeding them tasks differentiated by complexity. Some are classic reasoning tests, like the Tower of Hanoi disc-stacking conundrum, or ferrying foxes and chickens across a river without ending up with a fat fox and no chickens.

The least complex problems saw plain LLMs often outperform the LRMs, while LRMs did better on tasks of medium complexity. The most complex problems defeated everything: even the LRMs hit barriers, produced basically useless results, and sometimes gave up altogether. This persisted even when the researchers handed the LRMs the exact algorithms they needed to solve the puzzles.
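To see why these puzzles make good test beds, consider the Tower of Hanoi: its optimal solution is a three-line recursion, and its difficulty can be dialed up one disc at a time. Below is a minimal sketch of that classic algorithm in Python - purely illustrative, not the researchers' actual harness or prompts - standing in for the kind of exact, known-correct procedure the paper says the LRMs were given and still failed to follow at scale.

```python
# Minimal sketch: the classic recursive Tower of Hanoi solution.
# Illustrative only - a stand-in for the kind of exact, known-correct
# algorithm the Apple researchers supplied to the LRMs; the paper's own
# test harness and prompts are not reproduced here.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the moves that transfer n discs from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # clear the way
    moves.append((source, target))              # move the largest disc
    hanoi(n - 1, spare, target, source, moves)  # restack on top of it

if __name__ == "__main__":
    # Complexity is trivially tunable: n discs always need 2**n - 1 moves,
    # which is what makes the puzzle suited to testing reasoning at
    # smoothly increasing complexity.
    for n in (3, 7, 10):
        moves = []
        hanoi(n, "A", "C", "B", moves)
        print(f"{n} discs: {len(moves)} moves (expected {2**n - 1})")
```

The point of the sketch is the growth rate: because n discs always take 2**n - 1 moves, experimenters can raise complexity one notch at a time and watch exactly where a model's reasoning falls over.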

Put simply, past a certain complexity the models collapsed. As the researchers conclude, "Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent compute scaling limit in LRMs." Add to that the wildly different performance with different problems, the researchers say, and the assumption that LRMs can become generalized reasoning machines does not currently look justified.

Of course, this reflects the state of the art now and the approach chosen by the researchers. Chase the many citations in the paper, though, and these concerns aren't unique; rather, they're part of a consistent and wide-ranging set of findings about frontier AI. In particular, it looks as if the self-reflection that underpins LRMs has limits that are not understood, and that task-based testing is much better than benchmarking for characterizing how well AI works. Neither of these things is reflected in AI marketing, naturally enough. Both are true, as is model collapse through data poisoning, as is persistent hallucination.

These are open questions that directly challenge the projected trajectory of AI as a trustworthy tool that can only get better. That trajectory is an illusion, as much as AI itself gives the illusion of thinking, and both illusions carry great dangers. Anthropomorphization sells. It also kills.

The upside for the IT industry is that in the coalmine of AI, devs are the anthropomorphized and strangely dressed canaries. Not all industries have the tightly integrated function and quality testing regimes of production code generation.

It's a moral duty to report how well things are working, to show how the caveats uncovered by researchers are panning out in the real world. The global geek army knows better than most when real life turns into cosplay and science fiction becomes fantasy. As both genres demand: use these powers for good. There's a world to save. ®
