Top AI models fail spectacularly when faced with slightly altered medical questions
Our findings reveal a robustness gap for LLMs in medical reasoning, demonstrating that evaluating these systems requires looking beyond standard accuracy metrics to assess their true reasoning capabilities.[6](https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2837372#zld250161r6) When forced to reason beyond familiar answer patterns, all models demonstrate declines in accuracy, challenging claims of artificial intelligence’s readiness for autonomous clinical deployment.
A system dropping from 80% to 42% accuracy when confronted with a pattern disruption would be unreliable in clinical settings, where novel presentations are common. The results suggest that these systems are more brittle than their benchmark scores imply.
That's a clever way to set up the test.
The LLMs got 9%–38% worse at the task when the correct answer was changed to "none of the others" (i.e., all of the listed answers were wrong).
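Roughly, that manipulation could be sketched like this. This is a hypothetical illustration, not the study's actual code; the question schema (`stem`, `options`, `answer_index`) and the exact wording of the replacement option are assumptions.

```python
def replace_correct_with_nota(question):
    """Return a modified copy of a multiple-choice question in which the
    originally correct option is removed and replaced by a
    'None of the other answers' choice, which becomes the new correct answer.

    `question` is assumed to be a dict like:
        {"stem": str, "options": [str, ...], "answer_index": int}
    """
    options = list(question["options"])
    # Drop the originally correct option so every remaining listed choice is wrong.
    del options[question["answer_index"]]
    # Append the NOTA option; it is now the only correct answer.
    options.append("None of the other answers")
    return {
        "stem": question["stem"],
        "options": options,
        "answer_index": len(options) - 1,
    }

# Toy example (not from the study's dataset):
q = {
    "stem": "Deficiency of which vitamin causes scurvy?",
    "options": ["Vitamin A", "Vitamin C", "Vitamin D", "Vitamin K"],
    "answer_index": 1,  # Vitamin C
}
print(replace_correct_with_nota(q))
```

In practice you'd presumably also shuffle the option order so the NOTA choice doesn't always sit in the same position, which would otherwise be a giveaway pattern of its own.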
I'm curious how humans perform on the original questions versus the modified ones, because humans are the benchmark.
“All models are wrong, but some are useful” is a quote from before LLMs, but it still applies.
Yeah I think that would definitely make humans perform worse too. Hard to say by how much though.