Technology @beehaw.org shanghaibebop @beehaw.org 2y ago

AI model output quality decreases when trained with AI models

futurism.com AI Loses Its Mind After Being Trained on AI-Generated Data

Training AI models with AI-generated synthetic content causes the quality of the models' outputs to disintegrate, a new paper shows.

"Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease," they added. "We term this condition Model Autophagy Disorder (MAD)."

Interestingly, this could become a bigger problem as more of the content online is produced by generative AI models.

59 comments

Note that humans do not exhibit this property when trained on other humans, so this would seem to prove that “AI” isn't actually intelligent.
- Almost as if current models are fancy token predictors with no reasoning about the input
- Do we even need to prove this? Anyone who has studied even a bit of how generative AI works knows it's not intelligent.
  
  There are enough arguments about that even among highly intelligent people.
- Weren't the echo chambers during the COVID pandemic kind of proof that humans DO exhibit the same property? A good number of people started repeating claims about nanoparticles, or that some black lint in a mask was worms that would control your brain.
  
  That only happened to some humans. Something must be seriously wrong with them.
- Current AI is not actually "intelligent" and, as far as I know, not even its creators directly describe it as that. The programs and models existing at the moment aren't capable of abstract thinking or reasoning, or of the other processes that make an intelligent being or thing intelligent.
  
  The companies involved are certainly eager to create something like a general intelligence. But even when they reach that goal, we don't know yet if such an AGI would even be truly intelligent.
- I don't think LLMs are intelligent, but "does it work the same as humans" is a really bad way to judge something's intelligence
  
  Even if we look at other animals, when they learn by observing other members of their own species, they get more competent rather than less. So AIs are literally the only thing that get worse when trained on their own kind, rather than better. It's hard to argue they're intelligent if the answer to "does it work the same as any other lifeform that we know of?" is "no".
- Humans are not entirely trained on other humans, though. We learn plenty of stuff from our environment and experiences. Note this very important part of the primary conclusion:
  
  without enough fresh real data in each generation
  
  Math for example is something one could argue is purely taught by humans.
If you let the AI feed on its own bullshit long enough it will eventually vote for Donald Trump
AI incest at work. Just look at that Hapsburg jaw.
- Love it. AI incest is the perfect term for it haha.
Good!

Was that petty?

But, you know, good luck completely replacing human artists, musicians, writers, programmers, and everyone else who actually creates new content, if all generative AI models essentially give themselves prion diseases when they feed on each other.
- To my mind this kind of confirms that generative AI is not the same as a human who learns by imitating other artists, and instead is simply a machine for stealing labor if it uses artwork created by humans who are not appropriately compensated.
  
  A human can still create without input from other humans. It might take them longer, but a sufficiently observant person could eventually figure out the fundamentals of art without ever looking at another artist's work. An AI can only create if it is fed a diet of human-created data.
  
  People argue it's like the cotton gin or something and workers/artists just need to adapt, but a cotton gin doesn't require the labor of all those same field workers in order to continue functioning.
  
  That's the major sticking point for me on generative AI -- by all means, create an AI that's sophisticated enough to learn how to create art, but if that can only be accomplished in an illusory way by feeding it other people's hard work I don't think it's good for society.
  
  I absolutely agree! I've seen so many proponents of AI argue that AI learning from artworks scraped from the internet is no different to a human learning by looking at other artists, and while anyone who is actually an artist (or involved in any creative industry at all, including things like coding that require a creative mind) can see the difference, I've always struggled to coherently express why. And I think this is it. Human artists benefit from other human art to look at, as it helps them improve faster, but they don't need it in the same way, and they're more than capable of coming up with new ideas without it. Even a brief look at art history shows plenty of examples of human artists coming up with completely new ideas, artworks that had absolutely no precedent. I really can't imagine AI ever being able to invent, say, Cubism without having seen a human do it first.
  
  I feel like the only people that are in favour of AI artworks are those who don't see the value of art outside of its commercial use. They're the same people who are, presumably, quite happy playing the same same-y games and watching same-y TV and films over and over again. AI just can't replicate the human spark of creativity, and I really can't see it being good for society either economically or culturally to replace artists with algorithms that can only produce derivations of what they've already seen.
I only have a small amount of experience with generating images using AI models, but I have found this to be true. It's like making a photocopy of a photocopy. The results can be unintentionally hilarious though.
You know how when you're on a voice/video call and the audio keeps bouncing between two people and gets all feedback-y and screechy?

That, but with LLMs.
But... isn't unsupervised backfeeding the same as simply overtraining on the same dataset? We already know overtraining causes broken models.

Besides, the next AI models will be fed with the interactions of humans with AI, not just their own content. ChatGPT already works like this: it learns with every interaction, every chat.

And the generative image models will be fed with AI-assisted images where humans will have fixed flaws like anatomy (the famous hands) or other glitches.

So as interesting as this is, as long as humans interact with AI the hybrid output used for training will contain enough new "input" to keep the models on track. There are already refined image generators trained with their own but human-assisted output that are better than their predecessor.
- People in this thread seem really eager to jump to any "aha, AIs aren't intelligent after all" conclusions they can grab hold of. This experiment isn't analogous to anything that we put real people or animals through and seems like a relatively straightforward thing to correct for in future AI training.
That paper makes a bunch of (implicit) assumptions that make it pretty unrealistic: basically they assume that once we already have decently working models, we would still continue to do normal "brain-off" web scraping.
In practice you can use even relatively simple models to start filtering and creating more training data:
Think of the original LLM as a huge trashcan into which you try to compress terabytes of mostly garbage web data.
Then, you use fine-tuning (like the instruction tuning used for the assistant models) to increase the likelihood of deriving non-trash from the model (or to accurately classify trash vs non-trash).
In general this will produce a dataset that is of significantly higher quality simply because you got rid of all the low-quality stuff.

This is not even a theoretical construction: Phi-1 (https://arxiv.org/abs/2306.11644) does exactly that to train a state-of-the-art language model on a tiny amount of high quality data (the model is also tiny: only half a percent the size of GPT-3).
Previously, TinyStories (https://arxiv.org/abs/2305.07759) showed something similar: you can build high quality models with very little data, if you have good data (in the case of TinyStories they generate simple stories to train small language models).

In general LLM people seem to re-discover that good data is actually good and you don't really need these "shotgun approach" web scrape datasets.
- Given the prevalence of bots and attempts to pass off fake data as real though, is there still any way to reliably differentiate good data from bad?
  
  Yes: keep in mind that by "good" nobody means the content of the data, but rather how statistically interesting it is for the model.
  
  Really what machine learning is doing is trying to estimate a probability distribution q from samples x ~ p(x) drawn from the true distribution p.
  The problem with statistical learning is that we only ever see an infinitesimally small part of the true distribution (we only have finite samples from an infinite sample space of images/language/etc.).
  
  So now what we really need to do is pick samples that adequately cover the entire distribution, without being redundant, since redundancy produces both more work (you simply have more things to fit against), and can obscure the true distribution:
  Let's say that we have a uniform probability distribution over [1,2,3] (uniform means everything has the same probability of 1/3).
  
  If we faithfully sample from this we can learn a distribution that will also return [1,2,3] with equal probability.
  But let's say we have some redundancy in there (either direct duplicates, or, in the case of language, close-to duplicates):
  The empirical distribution may look like {1,1,1,2,2,3} which seems to make ones a lot more likely than they are.
  One way to deal with this is to just sample a lot more points: if we sample 6000 points, we are naturally going to get closer to the true distribution (similar to how flipping a coin twice can give you 100% tails probability, even if the coin is actually fair. Once you flip it more often, it will return to the true probability).
  
  Another way is to correct our observations towards what we already know to be true in our distribution (e.g. a direct 1:1 duplicate in language is presumably a copy-paste rather than a true increase in probability for a subsequence).
  
  <continued in next comment>
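
The uniform [1,2,3] example from the comment above can be played with in a few lines of Python. This is only a toy sketch: the helper names `empirical_distribution` and `deduplicate` are made up for illustration, and exact-duplicate removal only "fixes" the sample here because the true distribution is known to be uniform. On real text, deduplication is a heuristic, not a guarantee.

```python
import random
from collections import Counter


def empirical_distribution(samples):
    """Return {value: relative frequency} for a list of samples."""
    counts = Counter(samples)
    total = len(samples)
    return {value: count / total for value, count in sorted(counts.items())}


def deduplicate(samples):
    """Drop exact duplicates, keeping the first occurrence of each value."""
    return list(dict.fromkeys(samples))


if __name__ == "__main__":
    # The redundant sample from the comment: ones look more likely than they are.
    redundant = [1, 1, 1, 2, 2, 3]
    print(empirical_distribution(redundant))

    # Option 1: draw many more points from the true uniform distribution.
    rng = random.Random(0)  # fixed seed so the run is repeatable
    large_sample = [rng.choice([1, 2, 3]) for _ in range(6000)]
    print(empirical_distribution(large_sample))

    # Option 2: treat exact duplicates as copy-paste and remove them.
    print(empirical_distribution(deduplicate(redundant)))
```

With the small redundant sample, half of the probability mass lands on 1. The 6000-point sample and the deduplicated sample both end up close to one third per value.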
Muahahahahahahaha.

Looks like we found a relatively easy way to "poison" an AI dataset silently. Just feed it AI output.

I could see this mechanic being exploited by websites to provide a bottomless amount of junk text that only a bot doing content scraping would see.
It's like making a photocopy of a photocopy.
- needs more .jpg
So we have generation loss instead of AI making better AI. At least for now. That's strangely comforting.
- The summary said:
  
  without enough fresh real data in each generation
  
  So as long as you're mixing enough fresh data in you should be fine.
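
The "enough fresh real data in each generation" condition can be illustrated with a toy loop. This is a loose analogy and not the paper's experiment: the "model" here is just a Gaussian fitted to its training data, the real distribution is assumed to be N(0, 1), and `run_loop` and `fresh_fraction` are made-up names for this sketch.

```python
import random
import statistics


def run_loop(generations, n_samples, fresh_fraction, seed=0):
    """Repeatedly fit a Gaussian to data, then build the next training set.

    fresh_fraction is the share of each new training set drawn from the
    real distribution N(0, 1); the rest is sampled from the fitted model.
    Returns the standard deviation of the final training set.
    """
    rng = random.Random(seed)
    data = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]  # generation 0: all real
    n_fresh = int(n_samples * fresh_fraction)
    for _ in range(generations):
        # "Train" the model: estimate mean and spread from the current data.
        mu = statistics.mean(data)
        sigma = statistics.pstdev(data)
        # Next training set: model output plus (optionally) fresh real data.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_samples - n_fresh)]
        fresh = [rng.gauss(0.0, 1.0) for _ in range(n_fresh)]
        data = synthetic + fresh
    return statistics.pstdev(data)


if __name__ == "__main__":
    print("no fresh data: ", run_loop(1000, 50, fresh_fraction=0.0))
    print("half fresh data:", run_loop(1000, 50, fresh_fraction=0.5))
```

With no fresh data the fitted spread shrinks toward zero over the generations, so diversity is lost. With half of each training set drawn from the real distribution, the spread stays in the neighbourhood of the original.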
MadAI’s disease.

I guess we didn’t learn when we did it with cows.
For the love of God please stop posting the same story about AI model collapse. This paper has been out since May, been discussed multiple times, and the scenario it presents is highly unrealistic.

Training on the whole internet is known to produce shit model output, requiring humans to produce their own high quality datasets to feed to these models to yield high quality results. That is why we have techniques like fine-tuning, LoRAs and RLHF as well as countless datasets to feed to models.

Yes, if a model for some reason was trained on the internet for several iterations, it would collapse and produce garbage. But the current frontier approach for datasets is for LLMs (e.g. GPT-4) to produce high quality datasets and for new LLMs to train on that. This has been shown to work with Phi-1 (really good at writing Python code, trained on high quality textbook level content and GPT-3.5 output) and Orca/OpenOrca (GPT-3.5 level model trained on millions of examples from GPT-4 and GPT-3.5). Additionally, GPT-4 has itself likely been trained on synthetic data and future iterations will train on more and more.

Notably, by selecting a narrow range of outputs, instead of the whole range, we are able to avoid model collapse and in fact produce even better outputs.
- We're all just learning here, but yeah, that's pretty interesting to learn about effective synthetic data used for training.
I forget where I heard this quote:

Pre-LLM web scrapes are like low-background steel
- gonna file this away for five or ten years from now..
- So, the most valuable Reddit archives are the ones that were already made before they set ruinous prices on their APIs. Nice.
It's like when the cows are fed chicken shit, and the chickens are fed cow bones.
I, too, saw Multiplicity.
- I wouldn't base any expectations about real-world artificial intelligence off of a 27-year-old sci-fi comedy romance. With a 6/10 IMDB rating at that, if you really want to use pop culture as a basis for scientific thought.
- hey! i'm so old i saw that in the movie theater!
Wow. How is this going to affect all the projects that fine-tune Meta's Llama model with synthetic training data?
