I think the scale of image believability is logarithmic. Going from "believable at a glance" to "believable under scrutiny" requires an exponential increase in performance compared with going from "not believable at a glance" to "believable at a glance". The same principle applies to text generation, facial recognition, sound generation, image enhancement, etc. This is one of the many reasons AI should not be integrated in many of the ways world governments and corporations are trying to integrate it.
It's unclear whether current models can even reach that level, though. They seem to be asymptotically approaching their limit.
Maybe I'm wrong and GPT-5 will be the be-all and end-all, but I don't think generalist models will ever be consistent enough. Specialist generators focused on specific tasks, trained to output well-defined data, seem more likely to be useful than this attempt to make a single catch-all LLM.
"AI" image generation actually takes a lot of human knowledge and understanding of the models to manipulate outcomes. It's a different kind of effort, but the problems with it stem from how humans use the tool.