Science Journalists Find ChatGPT Is Bad at Summarizing Scientific Papers

If only scientific papers had some kind of summary. You know, like something at the front of the paper that abstracts the entire thing down to less than a page.
To be fair, abstracts are extremely dense and not really meant to be read by anyone outside that particular field.
Okay, whenever I see an article that says that "LLMs are bad at doing X task", I always comment something to the tune of "not surprised, those clankers are shit at everything", but for this one, I genuinely am surprised, because if there has been one thing that LLMs have seemed to be consistently good at, it is generating summaries for passages of text. They were never perfect at that task, but they seemed to actually be the suitable tool for that task. Apparently not.
As the post says:
The LLM “tended to sacrifice accuracy for simplicity” when writing news briefs.
Which makes sense: they're statistical text prediction engines and have no notion of what is or isn't important in a text, so unlike humans they don't treat different parts differently depending on how important they are in that domain.
In STEM fields, accuracy is paramount and there are things which simply cannot be dropped from the text when summarizing it, but LLMs are totally unable to guarantee that.
It's the same reason why code generated by LLMs almost never works without being reviewed and corrected: the LLM will drop essential elements, so the code doesn't do what it should or won't even compile. But at least with code, the compiler validates some of the accuracy of that text at compile time against a set of fixed rules, while the person reviewing the code knows upfront the intention for that code (i.e. what it was supposed to do) and can use that as a guide for spotting problems in the generated output.
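A minimal sketch of that asymmetry in Python (purely illustrative; the snippets and names are invented, nothing here is from the article):

    # Dropped elements in code fail a mechanical check at parse time,
    # while the same kind of omission in prose passes silently.
    import ast

    good_code = "total = price * (1 + tax_rate)"
    mangled_code = "total = price * (1 + tax_rate"  # "summary" dropped the closing paren

    for src in (good_code, mangled_code):
        try:
            ast.parse(src)
            print(f"parses OK: {src!r}")
        except SyntaxError as e:
            print(f"rejected : {src!r} ({e.msg})")

    # Prose has no equivalent gate: "the drug reduced mortality" and
    # "the drug did not reduce mortality" are both well-formed English.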
One thing is summarizing a management meeting where most of what is said is vague waffle, with things often repeated and where nobody really expects precision or accuracy (sometimes quite the opposite), so a loss of precision is generally fine; a whole different thing is summarizing text where at least some parts must be precise and accurate.
Personally, I fully expect LLMs to fail miserably in areas requiring precision and accuracy, where a statistical text engine with a pretty much uniform error probability in terms of the gravity of the error (i.e. just as likely to make a critical mistake as a minor one) will, when summarizing, mangle or drop elements in critical areas just as easily as in areas which are just padding.
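To make the "uniform error gravity" point concrete, here is a toy next-token sampler in Python. The probabilities and example sentences are invented for illustration; this is not how any production LLM is implemented, only a sketch of the sampling step:

    import random

    # "The drug did ___ reduce mortality" -- this token flips the finding.
    critical_position = {"not": 0.55, "significantly": 0.45}
    # "The study was ___ conducted in 2021" -- this token is padding.
    filler_position = {"originally": 0.55, "first": 0.45}

    def sample(dist):
        """Draw one token from a {token: probability} distribution."""
        r = random.random()
        cumulative = 0.0
        for token, p in dist.items():
            cumulative += p
            if r < cumulative:
                return token
        return token  # guard against floating-point rounding

    random.seed(0)
    for name, dist in [("critical", critical_position), ("filler", filler_position)]:
        draws = [sample(dist) for _ in range(10)]
        print(name, draws)

The sampler faces the same ~45% chance of the less likely token at both positions; nothing in the mechanism knows that one swap inverts the study's conclusion while the other is cosmetic.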
Shocker, LLMs are only built to predict the next most likely word in a sentence and nothing else. Glad we test all the things they're not good for individually, though.
ChatGPT summary: ChatGPT is really good at summarizing scientific papers and these scientists are judgmental meanies.
Hydrologists find water is wet
Four words too many in that title.
Summaries and searches with imprecise terms are actually my main use case for AI at work. My main use case overall, however, is having it generate custom short stories for my 4-year-old about our two cats, a third imaginary one, and a princess having adventures and learning important lessons. That's about it.
As someone who works in communications in a very science-heavy field: in fairness, journalists are also typically terrible at summarizing scientific papers.
Do you recall any examples of science journalists doing it well in your field? I've been increasingly interested in science journalism, and have been on the lookout for examples of it done right.
I don't, really. But my field is also kind of niche (it's not some popular field like genetics or infectious diseases), and there aren't many journalists covering us at all, yet. I work in marketing for an industrial exosuit company (think practical, assistive, biomechanical wearables). Most of the journalists covering us at this point are used to covering news about forklifts or warehouse automation, so they aren't used to reading peer-reviewed scientific publications at all. Their exposure to papers on biomechanics and injury risk factors is rarer still, and those papers might as well be in Latin (well, sometimes they do have a lot of Latin).
But it's also something of a running joke. Back when I was taking journalism classes for my communications and marketing degree, the professors would joke about how journalists covering legal or scientific summaries would say that 1 + 2 = 5 all the time, leaving out important details that were critical to the conclusions because they weren't interesting. I think the scientists put up with it because, as long as the conclusion is correct, they're just happy to have anyone paying attention.