I work in bioinformatics, and this is the kind of thing I keep trying to communicate to people in the field. Yes, these AI tools (like AlphaFold) are amazing, but if there's a significant gap in their training data, the AI is going to have that gap too. Most of the structures in the Protein Data Bank were solved via X-ray crystallography, which isn't great for studying highly flexible or disordered proteins.
Yes. My understanding (minimally informed, from a single class) is that it sort of depends on the problem too. Perhaps, in looking at all the data on proteins, the neural network might notice a pattern in protein folding that is applicable to the tweaked problem. Of course, there is no guarantee that such a generally applicable rule exists. And even if it does, it might not be discovered by the net before overtraining occurs.
It sounds like your memory from that class is pretty good, and you're right, it depends on what we're trying to solve. But the problem in this case is protein folding, so if a neural network spotted a pattern, that's what we want. Figuring out the generalisable "rules" (i.e. why proteins fold a certain way) isn't what we're trying to do with these tools (yet). We're just on the pattern-finding side, which is why the developments from AlphaFold are so incredible. They're just limited.
It feels like my job for the next few years is going to be "professional killjoy", because I get people's excitement, but we can't properly use these tools if we don't acknowledge their limitations. If we did acknowledge them, the tools would actually become more powerful, because we could develop new and different tools, or go gather experimental data to validate some of the generated structures (or to round out the training data).
I don't know if this would count as overtraining, because it has so far performed amazingly on structures that are similar to the training data but not in the training data. The problem is we don't have much training data for the tricky parts. That's fine, it just means it won't help us learn much about those areas, but headlines like "AlphaFold predicts the structures of all human proteins" are so misleading.
I agree. If you can't test the structures against proteins that don't crystallize, then you can't really say much about the output.
best experimental protocol evar™ do not steal 👍
If it were me, I would try to use the generated structures to make predictions. Perhaps get the model to tell you how many residues of a certain amino acid are on the protein surface, attach some sort of fluorescent marker to all the surface residues of that type in a physical sample, remove the unattached marker from the solution, and then measure the intensity of the emitted light to estimate how many of those residues are actually on the surface of the structure.
Idk if exactly that would work, but testing whether a protein is consistent with a generated structure seems like an easier question. So perhaps you could tell the biologists that the “ball is in their court” to get them off your back 😃 (if it is biologists of course).
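To make the "get the model to tell you how many residues are on the surface" step a bit more concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: the `Residue` record, the function names, the 0.25 relative solvent accessibility cutoff, and the toy values are all made up, and in practice the accessibility numbers would have to be computed from the predicted structure with a separate tool. It only shows the counting and comparison logic, not a real pipeline.

```python
from dataclasses import dataclass


@dataclass
class Residue:
    name: str   # three-letter amino acid code, e.g. "LYS"
    rsa: float  # relative solvent accessibility, 0.0 (buried) to 1.0 (exposed)


def count_surface_residues(residues, target, rsa_cutoff=0.25):
    """Count residues of type `target` whose RSA is at or above the cutoff.

    The 0.25 cutoff is an assumed, adjustable threshold, not a standard.
    """
    return sum(1 for r in residues if r.name == target and r.rsa >= rsa_cutoff)


def is_consistent(predicted_count, measured_count, tolerance=1):
    """Check whether the experimental estimate is within `tolerance` of the prediction."""
    return abs(predicted_count - measured_count) <= tolerance


# Toy, made-up values standing in for a predicted structure.
toy_structure = [
    Residue("LYS", 0.80),
    Residue("LYS", 0.05),
    Residue("ALA", 0.60),
    Residue("LYS", 0.30),
    Residue("GLY", 0.10),
]

predicted = count_surface_residues(toy_structure, "LYS")
print(f"Predicted surface lysines: {predicted}")
```

The count from the fluorescence measurement (however it ends up being calibrated) would then go into `is_consistent` as `measured_count`. A mismatch would suggest the generated structure is off for that protein; a match only says the structure isn't ruled out by this one check.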
Also, I certainly didn't mean to imply that you could get the pattern the neural net found out of it. Only that it has it in there somewhere if the net is generalizing.