As it turns out, it’s impossible to remove a user’s data from a trained A.I. model. Deleting the model entirely is also difficult—and there’s little regulation to enforce either option.
I'm rather curious to see how the EU's privacy laws are going to handle this.
(Original article is from Fortune, but Yahoo Finance doesn't have a paywall)
I'm not an AI expert, and I wouldn't say it is too hard, but I believe removing a specific piece of data from a model is like trying to remove excess salt from a stew. You can add things to make the stew less salty but you can't really remove the salt.
The alternative, which is a lot of effort but boo-hoo for big tech, is to throw out the model and start over without the data in question. These companies would do well to start with models built on public or royalty free data and then add more risky data on top of that (so you only have to rebake starting from the "public" version).
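To make the "rebake from the public version" idea concrete, here is a minimal sketch, assuming training is split into ordered stages and a checkpoint is saved after each one. The stage names, record IDs, and the `restart_point` helper are all hypothetical, made up for illustration and not taken from any real training pipeline.

```python
from typing import Optional

# Hypothetical training stages, ordered from least to most risky data.
# Assumption: a checkpoint is saved after every stage finishes.
STAGES = [
    ("public_domain", {"pd-1", "pd-2"}),
    ("licensed", {"lic-1"}),
    ("user_submitted", {"user-7", "user-9"}),
]


def restart_point(stages, record_id: str) -> Optional[int]:
    """Return the index of the first stage that used record_id.

    Every checkpoint saved before that stage never saw the record,
    so retraining only has to start from there. Returns None if the
    record was never used at all.
    """
    for index, (_name, record_ids) in enumerate(stages):
        if record_id in record_ids:
            return index
    return None


first_stage_to_redo = restart_point(STAGES, "user-9")
print(first_stage_to_redo)
```

In this toy setup, deleting `user-9` only means redoing the last stage, while deleting something from the first stage means starting over completely. That is the argument for putting the riskiest data last.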
If there's something illegal in your dish, you throw it out. It's not a question. I don't care that you spent a lot of time and money on it. "I spent a lot of time preparing the circumstances leading to this crime" is not an excuse, neither is "if I have to face consequences for committing this crime, I might lose money".
No, especially because it's not the same thing at all. You're talking about the output, we're talking about the input.
The training data was illegally obtained. That's all that matters here. They can train it on fart jokes or Trump propaganda, it doesn't really matter, as long as the Trump propaganda in question was legally obtained by whoever trained the model.
Whether we should then allow chatbots to generate harmful content, and how we will regulate that by limiting acceptable training data, is a much more complex issue that can be discussed separately. To address your specific example, it would make the most sense for the chatbot to be guided towards a viewpoint that aligns with its intended userbase. This just means that certain chatbots might be more or less willing to discuss certain topics. In the same way that an AI for children probably shouldn't be able to discuss certain topics, a chatbot that's made for use in a highly religious area, where homosexuality is very taboo, would most likely not be willing to discuss gay marriage at all, rather than being made intentionally homophobic.
You seem to think the majority of LGBT+ positive material is somehow illegal to obtain. That is not the case. You can feed it as much LGBT+ positive material as you like, as long as you have legally obtained it. What you can't do is train it on LGBT+ positive material that you've stolen from its original authors. Does that make more sense?
Yes I am aware of that. However, I'm not sure how this has anything to do with the fact that it is also illegal to steal data, then continue to use said data to make profits after having been found out. The two are not connected in any logical way, which makes it hard for me to continue to address your concerns in a way that makes sense.
The way I see it, you're either completely missing what we're talking about, or you have some misunderstanding of what the AI language models actually are, and what they can do.
For the record, I'm in no way disagreeing with your views, or your statements that legal and ethical don't always overlap. It is clear to me that you are open minded and well-intended, which I appreciate, and I hope you don't take this the wrong way.
I work in this field a good bit, and you're largely correct. Trying to remove salt from a stew is a great analogy. The only issue with it is that removing the salt is technically still possible: you could distill the stew and recover the salt, even though that would destroy the stew.
Once PII is in the model, it's fully baked. It'd be like trying to get the eggs out of a baked cake. The chemical composition has changed into something else completely.
That's how building a model works today. Like baking a cake.
In order to remove or even identify PII in ML models or LLMs today, we'd need a whole new way of baking a cake that would keep the eggs separate from the cake until just before you tried to take a bite out of it. The tools today don't allow you to do anything like that. They bake you a complete cake.
Something to keep in mind is that yes, they would need to retrain the models from zero, but if they did it with any kind of basic, decent method they should have backups and versions of the data they used to train, and they would need to retrain everything with a subset of the original data. Then, the optimizations they have already applied to the system should be able to be reapplied in the same manner and the product should be somewhat similar. Another option would be to design a de-training process, where you generate an input from the "must be deleted" data that, when trained on, acts as some sort of "negative input", so the model ends up in the same place it would have ended up if it had never been trained on the "must be deleted" data.
I bet you that if governments act harsh enough tech companies will develop some sort of "negative training".
In the end this is a solvable math optimization problem: what input do I need to feed the already trained model for it to become equivalent to the model it would be if trained without the requested data?
We could even create an ML model that computes a "good enough negative input" from several examples, since testing the quality of the results is quite simple, and we can train it with several trained model examples. This model would be fed with a base model, some input data and another base model trained without that data.
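As a toy illustration of the "negative input" idea, here is a minimal sketch, assuming a model so simple that its whole state is two running sums (a one-parameter least-squares fit of y ≈ w * x). The `RunningLeastSquares` class and the data points are hypothetical. For this kind of model, training on a point with weight -1 undoes it exactly. Nothing here shows that the same trick carries over to neural networks, where each update depends on the current weights and there is no such clean inverse.

```python
from dataclasses import dataclass


@dataclass
class RunningLeastSquares:
    """Fits y ≈ w * x while storing only two sums, never the raw data."""
    sum_xy: float = 0.0
    sum_xx: float = 0.0

    def train(self, x: float, y: float, weight: float = 1.0) -> None:
        # Each point adds its contribution to the sufficient statistics.
        self.sum_xy += weight * x * y
        self.sum_xx += weight * x * x

    def untrain(self, x: float, y: float) -> None:
        # The "negative input": the same point with weight -1.
        self.train(x, y, weight=-1.0)

    @property
    def w(self) -> float:
        return self.sum_xy / self.sum_xx


def fit(points) -> RunningLeastSquares:
    model = RunningLeastSquares()
    for x, y in points:
        model.train(x, y)
    return model


data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 20.0)]
to_forget = (4.0, 20.0)

# Option 1: take the already trained model and feed it the negative input.
unlearned = fit(data)
unlearned.untrain(*to_forget)

# Option 2: throw the model away and retrain without the point.
retrained = fit([p for p in data if p != to_forget])

# The two should agree up to floating-point rounding.
print(abs(unlearned.w - retrained.w) < 1e-9)
```

The comparison at the end is also the "testing the quality of the results" step: retraining without the data gives the reference model, and the de-trained model is judged by how close it lands to it.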
All in all, AI companies will tell you that this is very hard because they would essentially be investing hours and development to create a tool that makes their model worse instead of better, so expect a lot of pushback.
It's actually a pretty normal thing in law. Laws are created with common sense in mind and compromises.
Currently EU laws do not cover generative AI. Now the EU needs to decide how to deal with it: either consider it a "lossy compressed database" and try to enforce a variation of GDPR with added fuzziness, or do something else.
It's more like the law is saying you must draw seven red lines, all of them strictly perpendicular, some with green ink and some with transparent ink.
It's not "virtually" impossible, it's literally impossible. If the law requires that it be possible then it's the law that must change. Otherwise it's simply a more complicated way of banning AI entirely, which means that some other jurisdiction will become the world leader in such things.
It's more like the law is saying you must draw seven red lines, all of them strictly perpendicular, some with green ink and some with transparent ink.
No, it's more like the law is saying you have to draw seven red lines and you're saying, "well I can't do that with indigo, because indigo creates purple ink, therefore the law must change!" No, you just can't use indigo. Find a different resource.
It's not "virtually" impossible, it's literally impossible. If the law requires that it be possible then it's the law that must change.
There's nothing that says AI has to exist in a form created from harvesting massive user data in a way that can't be reversed or retracted. It's not technically impossible to do that at all, we just haven't done it because it's inconvenient and more work.
The law sometimes makes things illegal because they should be illegal. It's not like you run around saying we need to change murder laws because you can't kill your annoying neighbor without going to prison.
Otherwise it's simply a more complicated way of banning AI entirely
No it's not, AI is way broader than this. There are tons of forms of AI besides forms that consume raw existing data. And there are ways you could harvest only data you could then "untrain", it's just more work.
Some things, like user privacy, are actually worth protecting.
There’s nothing that says AI has to exist in a form created from harvesting massive user data in a way that can’t be reversed or retracted. It’s not technically impossible to do that at all, we just haven’t done it because it’s inconvenient and more work.
What if you want to create a model that predicts, say, diseases or medical conditions? You have to train that on medical data or you can't train it at all. There's simply no way that such a model could be created without using private data. Are you suggesting that we simply not build models like that? What if they can save lives and massively reduce medical costs? Should we scrap a massively expensive and successful medical AI model just because one person whose data was used in training wants their data removed?
This is an entirely different context. Most of the talk here is about LLMs. Health data is entirely different: health regulations and legalities are entirely different, people don't publicly post their health data to begin with, and health data isn't obtained without consent and already has tons of red tape around it. It would be much easier to obtain "well sourced" medical data than the broad swaths of stuff LLMs are sifting through.
But the point still stands - if you want to train a model on private data, there are different ways to do it.
ok i guess you don't get to use private data in your models too bad so sad
why does the capitalistic urge to become "the world leader" in whatever technology-of-the-month is popular right now supersede a basic human right to privacy?
ok i guess you don’t get to use private data in your models too bad so sad
You seem to have an assumption that all AI models are intended for the sole benefit of corporations. What about medical models that can predict disease more accurately and more quickly than human doctors? Something like that could be hugely beneficial for society as a whole. Do you think we should just not do it because someone doesn't like that their data was used to train the model?
You seem to have an assumption that all AI models are intended for the sole benefit of corporations.
You seem to have the assumption that they're not. And that "helping society" is anything more than a happy accident that results from "making big profits".
What about medical models
A pretty big "what if" when every single model that's been tried for the purpose you suggest so far has either predicted based off the age of a medical imaging scan, or off the doctor's signature in the corner of one.
Are you asking me whether it's a good idea to give up the concept of "Privacy" in return for an image classifier that detects how much film grain there is in a given image?
You seem to have the assumption that they’re not. And that “helping society” is anything more than a happy accident that results from “making big profits”.
It's not an assumption. There's academic researchers at universities working on developing these kinds of models as we speak.
Are you asking me whether it’s a good idea to give up the concept of “Privacy” in return for an image classifier that detects how much film grain there is in a given image?
There’s academic researchers at universities working on developing these kinds of models as we speak.
Where does the funding for these models come from? Why are they willing to fund those models? And in comparison, why does so little funding go towards research into how to make neural networks more privacy-compatible?
I’m not wasting time responding to straw men.
Please learn what a straw man argument is
The technology you're describing doesn't exist, and likely won't for a very long time, so all you're doing is allowing data harvesting en masse in return for nothing. Your hypothetical would have more teeth if it were anywhere close to being anything but a hypothetical.
How is "don't rely on content you have no right to use" literally impossible?
We teach children that there is a Google filter to include only CC images (which they should use for their presentations).
Also it's not like we are talking about small companies here. A new billion-making industry is being born, and it could totally afford contracts with big platforms that would allow it to use their content.
And we're saying that if peeling out knowledge that someone has a right to have forgotten is difficult or impossible, that knowledge should not have been used to begin with. If enforcement means big tech companies have to throw out models because they used personal information without knowledge or consent, boo fucking hoo, let me find a Lilliputian to build a violin for me to play.
Okay I get it but that's a different argument. Starting fresh only gets you so far. Once an LLM exists and is exposed to the public, users can submit any data they like and the LLM has no idea of the source.
You could argue then that these models shouldn't be able to use user-submitted data, but that would be a devastating restriction on the technology, and that starts to become a question of whether we want this tech to exist at all.
If enforcement means big tech companies have to throw out models because they used personal information without knowledge or consent, boo fucking hoo
A) this article isn't about a big tech company, it's about an academic researcher.
B) he had consent to use the data when he trained the model. The participants later revoked their consent to have their data used.
Because the question of what data one has the right to use is a very open legal question right now.
There is absolutely nothing illegal about a person examining publicly accessible artwork or text, learning from it, and attempting to reproduce a similar style. AIs are, in essence, doing basically the same thing. However, the sheer difference in time and scale may warrant a different legal treatment. That has not yet been settled, and it will probably take a fair amount of societal debate and new legislation before we have a definite answer.
And the rest of the data Google has been viewing, cataloging and selling back to everyone for years, because they’re legally allowed to do so… you don’t see the irony in that?
Are they selling back scraped content? I thought it was only user behaviors through the ad network?
As for cataloging, at least it is opt-out through robots.txt 🤷
EDIT: plus, "we are already doing bad" is never a good argument to continue doing bad. If Google were at fault, this could get the traction to slap their ass
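For anyone who hasn't seen the robots.txt opt-out in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The crawler name `ExampleAIBot` and the rules are made up for illustration. Keep in mind that robots.txt is only a request: it works when the crawler chooses to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one (made-up) AI crawler entirely,
# and keep every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A well-behaved crawler asks before fetching each URL.
ai_allowed = parser.can_fetch("ExampleAIBot", "https://example.com/blog/post")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.com/blog/post")
other_private = parser.can_fetch("SomeOtherBot", "https://example.com/private/x")

print(ai_allowed, other_allowed, other_private)
```

The opt-out is per crawler name, so a site owner has to know a crawler exists before they can block it.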
Google crawls the internet, archives entire actual photos, large snippets (at least) from every website it sees, aggregates it into a different form and serves it back to people for profit. It’s the same business model, different results with the processing of the data.
Google doesn't sell the data they collect... They sell ads and use their data to better target people with said ads. Third parties are paying google to target their ads to the right people.
You go to google because of the data they collected from the open internet. Peoples’ photos, articles they’ve written, books, etc. They aggregate it, process it and serve it back to you alongside ads. They also collect data about you and sell that as well. But no one would go to Google if they hadn’t aggregated, processed and repackaged the internet’s data.
They also collect data about you and sell that as well.
No they don't. Why would they sell the data they use to target ads? If other corporations could just buy the data, they wouldn't need to pay google to target the ads, they'd just buy the data and do it themselves, Google isn't a data broker. They keep the data for them, it would be business suicide if they'd just sell all the data they collect.
How is “don’t rely on content you have no right to use” literally impossible?
At the time they used the data, they had a right to use it. The participants later revoked their consent for their data to be used, after the model was already trained at an enormous cost.
I have to admit my comment is not really relevant to the article itself (also, I read only the free part of it).
It was more a reaction to the comment above, which felt more generic. My concern about LLMs is that I could never find an auditable list of websites that were crawled, which would be reasonable to ask for, I think.
All applications of ai & assimilated aren’t nefarious… I’m shopping for a solution to help my company classify its data and do data discovery. I really hope I find a solution - which will likely be based on ai - because the alternative is either we don’t do the activity or the guys that will do it will be miserable. No one should have to spend days looking at very old data stores and wonder what’s in it - and then be accountable for the classification.
Always has been. The laws are there to incentivize good behavior, but when the cost of complying is larger than the projected cost of not complying they will ignore it and deal with the consequences. For us regular folk we generally can't afford to not comply (except for all the low stakes laws that you break on a day to day basis), but when you have money to burn and a lot is at stake, the decision becomes more complicated.
The tech part of that is that we don't really even know if removing data from these sorts of models is possible in the first place. The only way to remove it is to throw away the old one and make a new one (aka retraining the model) without the offending data. This is similar to how you can't get a person to forget something without some really drastic measures. Even then, how do you know they forgot it? That information may still be used to inform their decisions, and they might just not be aware of it or feign ignorance. The only real way to be sure is to scrap the person. Given how insanely costly it can be to retrain a model, the laws start looking like "necessary operating costs" instead of absolute rules.
I just saw an article that said that ISPs are trying to whine their way out of listing the fees they charge because it's too hard. Which is wild because they certainly know what I owe them after I sign the contract, but somehow it's just impossible for them to determine right up until the moment that I'm obligated to pay it.