Top AI models fail spectacularly when faced with slightly altered medical questions
cross-posted from: https://programming.dev/post/36289727
Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities. When forced to reason beyond familiar answer patterns, all models demonstrate declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment.
A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results indicate that these systems are more brittle than their benchmark scores suggest.
Finally someone who gets it. They're just best-next-word guessers that drank a literature smoothie and burp up something resembling a correct answer. They can only interpolate between word chunks that already exist in the dataset. If the dataset is grounded in things that have already been said, could it possibly interpolate an answer to a question that was never answered? No. But it could spill out some kind of convincing yet nonsensical answer that passes the CEO's sniff test.
Well, it doesn't matter what they were designed for; that is what they are being marketed for. That's what you have to test.
That's... not actually accurate, but it's an accurate-sounding confabulation you can put out, one that collapses the energy you would need to keep interpreting the problem.
Which IS what LLMs are doing. The failure comes from the incentive structure and the style of intelligence. You're very right that we shouldn't blindly trust the responses, though.
The criticism of "just probability" falls flat as soon as you recognize that the current expert consensus is that human minds are... predictive processors, built on scale-free principles that lead to layered Bayesian predictive models.
Where LLMs struggle is in adapting to things outside of distribution (not in the training data): they have no way to actively update their weights and biases as they contextualize a growing, novel context.
Also, novel context is basically inevitable when interacting with real life, because our environments and preferences are also growing, so they lack something very important for correcting the weak confabulations that collapsed the predictive process into action. There's also the weird softmax/AI 'reasoning' fuzziness helping to emulate some of the malleability of our more active, ruminative, and very, very social models.
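For anyone curious, here's a toy sketch of the softmax fuzziness I mean (the logits are made up, not from any real model): temperature decides how sharply the model commits to its top guess versus spreading probability across alternatives.

```python
# Illustrative only: temperature-scaled softmax over made-up next-token scores.
# Higher temperature spreads probability over more candidates instead of
# committing hard to one answer -- that's the "fuzziness" in sampling.
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution; temperature controls spread."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                              # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.2, 1.0, 0.5]                    # hypothetical scores for four candidate tokens
for t in (0.2, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax(logits, t)])
# Low temperature ~ near-deterministic pick; high temperature ~ fuzzier, more exploratory output.
```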
I usually get downvoted for going against the narrative, but remember that we normalize to the tribal opinions around us to socially collapse our group predictive model, because nuance takes energy. But if you can't communicate, you can't learn, and you display the same lack of intelligence as confident LLM confabulations do.
I wish I heard people talking about this outside of strictly academic spaces, but communication channels need to be open.
Keep your eyes out for AI that is diverse but good at communication/translation/mediation and that actively grows.
Although you might see more things like the Genie 3 stuff, which deals with intermodal dissonance from within a monolithic model perspective, and that means always confabulating, without other models to actively balance against and grow with.
Well, attempts are being made to make up for that, but you can see how RLHF leads to sycophantic models that will confirm your own confabulations, so that you and the model can confirm each other into delusion without other systems for grounding.
I agree, but being able to reconstruct content like that does reflect some intelligence.
That being said, when you have no way of telling what is in-sample vs out-of-sample, and what might be correct or convincing gibberish, you should never rely on these tools.
The only time I really find them useful is with tools and RAG, where they can filter piles of content and then route me to useful parts faster.
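To give a rough idea of what I mean, here's a toy sketch of that retrieve-then-read pattern. The corpus, the bag-of-words scoring, and the `llm_answer()` stub are all made up for illustration, not any particular library's API.

```python
# Minimal retrieve-then-read (RAG) sketch: rank chunks against the question,
# then hand only the top matches to the model so you can check its sources.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Crude bag-of-words 'embedding' -- stands in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank document chunks by similarity to the query and keep the top k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def llm_answer(prompt: str) -> str:
    """Placeholder for an actual model call; in practice this would hit an LLM API."""
    return f"[model response grounded in:\n{prompt}]"

chunks = [
    "Section 4.2: disclosure obligations apply to all regulated advisers.",
    "Appendix B: fee schedules for 2024.",
    "Section 7.1: exemptions for intra-group transactions.",
]
question = "Which disclosure obligations apply to advisers?"
context = "\n".join(retrieve(question, chunks))
print(llm_answer(f"Answer using only this context:\n{context}\n\nQuestion: {question}"))
```

The point is that the model is routing me to the relevant chunks rather than answering from memory, so I can go read the source myself.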
I think we mostly agree, but saying "anything related to intelligence, skills, and knowledge" is too broad.
It's not a popular idea here on Lemmy, but IMO gen AI can save some time in some circumstances.
Problems arise when one relies on them for reasoning, and it can be very difficult to know when you're doing that.
For example, I'm a consultant and work in a specialised area of law, heavily regulated. I've been doing this for 20 years.
Gen AI can convert my verbal notes into a pretty impressive written file note ready for presentation to a client or third party.
However, it's critically important to know when it omits something, which is often.
In summary, in situations like this, gen AI can save me several hours of drafting a complex document.
However, a layperson could explain a problem and gen AI could produce a convincing analysis, but the layperson wouldn't know what has been omitted or overlooked.
If I don't know anything about nutrition and ask for a meal plan tailored to my needs, I would have no way to evaluate whether something has been overlooked.