There have been a number of papers recently ("TinyStories" and "Textbooks are all you need" are two that come to mind) showing that higher quality training data for LLMs leads to significantly better models. So yes, this is technically true.
Does anyone really need to read / publish papers to come to the conclusion that better training data creates better models? Obviously not, right? Am I missing something here?
Yeah. Science is often about rigorously verifying the things that seem obvious, and often going into significantly more detail about them too. TinyStories, for example, not only confirms that better-quality data gives better-quality models, but also shows how to go about getting that data, which model parameters lead to which observed effects, and how all of that can be used in the future. The one-line summary of the conclusion might be "better training data creates better models," but that does nothing to explain what "better quality" actually means, how to go about getting that better-quality data, or why the data used for previous models was worse.
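To make that concrete: the broad recipe in papers like "Textbooks Are All You Need" is to score every candidate training document for quality and keep only the high-scoring ones. Here's a minimal sketch of that idea, with a stand-in heuristic scorer (lexical diversity) — the real papers use a trained classifier or an LLM judge, so the `quality_score` function here is purely illustrative:

```python
def quality_score(text: str) -> float:
    """Toy quality proxy: lexical diversity (unique words / total words).
    A real pipeline would use a trained quality classifier or LLM judge."""
    words = text.lower().split()
    if not words:
        return 0.0
    return len(set(words)) / len(words)

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose quality score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

corpus = [
    "the the the the the the",                      # repetitive, low diversity
    "Gradient descent minimizes a loss function.",  # varied vocabulary
]
kept = filter_corpus(corpus)  # only the second document survives
```

The interesting research questions are exactly the ones this toy glosses over: what the scorer should be, where the threshold goes, and how filtering interacts with model size — which is why the papers are worth reading even when the headline conclusion sounds obvious.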
This does seem to be a big revelation to ML/AI types. I was at an NLP conference last year, and one of the data scientists was reporting back on another conference where they had worked out that you don't get better performance by running your data through a dozen different models, but by running better data through a single model.
I guess @garyyo@lemmy.world is right that you need to confirm these things but it definitely spoke to quite a naive mindset.
I took it the other way: instead of us teaching the machines, we should have the machines teach us, or at least develop the machines to teach us.
What's the point of having all the information in the world at our fingertips if we're just going to ignore it in favor of what we already agree with?
Oh, I guess that's a valid interpretation. Especially since we are doing that too! I haven't been reading up on the intelligent tutoring systems side of AI/pedagogy in a while but I know that we have already attempted to teach math and computer science using them. I am technically working with a research team on creating a creativity support tool/tutoring system for analyst tasks using LLMs right now. Or I would be if I got off Lemmy 😆