Sarah Silverman is suing OpenAI and Meta for copyright infringement

Comedian and author Sarah Silverman, as well as authors Christopher Golden and Richard Kadrey, are suing OpenAI and Meta each in a US District Court over dual claims of copyright infringement.
Interested to see how this plays out! Their argument that the only way an LLM could summarize their book is by ingesting the full copyrighted work seems a bit suspect, as it could've ingested plenty of reviews and summaries written by humans and combined that information.
I'm not confident that they'll be able to prove OpenAI or Meta infringed copyright, just as I'm not confident OpenAI or Meta will be able to prove that they didn't violate copyright. I don't know if anyone really knows what these things are trained on.
We got to where we are now with fair use in search and online commentary because of a ton of lawsuits setting precedent, so it's not surprising we'll have to do the same with machine learning.
I think this is where the crux of the case lies since the article mentions these are only available illegally through torrents.
This is starting to touch on the root of why they keep calling this "AI", "training", etc. They aren't doing this strictly for marketing; they are attempting to skew public opinion. These companies know intimately how to do that.
They're going to argue that if torrents are legal for educational purposes (i.e. the loophole that all trackers use), and they're just "training" an "AI", then they're just engaging in education. And an ignorant public might buy it.
These kinds of cases will be viewed as landmark cases in the future and honestly I don't have huge hopes. The history of these companies is engineer first, excuse the lack of ethics later. Or the philosophy of "it's easier to apologize than ask".
Even if they did train the model on the entire text of the book, that's still not necessarily copyright violation. I would think not, since the resulting model doesn't actually have a copy of the book embedded within it.
Do we know that it isn't?
But the server used to calculate the model would have a copy of it. If training an AI model is not fair use, then the mere act of loading a book you don't have a license for onto the server would be copyright infringement. Like, textbook. It's an unauthorized digital copy. It's all very untested legal ground, and it seems like lots of people want to be the first to test it. Not everyone has a great case, but if the courts interpret things a certain way there's gonna be lots of payouts, so maybe best to get in line early?
It’s difficult to tell to what extent books are encoded into the model. The data might be there in some abstract form or another.
During training it is kind of instructed to plagiarize the text it’s given. The instruction is basically “guess the next word of this unfinished excerpt”. It probably won’t memorize all input it’s given, but there’s a nonzero chance it manages to memorize some significant excerpts.
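To make the "guess the next word" point concrete, here's a toy sketch. It is nowhere near a real LLM (those are neural networks trained on enormous corpora, not lookup tables), and the function names and the sample sentence are made up for illustration. It just counts which word followed each two-word context in the training text, then completes a prompt by always picking the most frequent follower.

```python
from collections import Counter, defaultdict


def train(text, context=2):
    """Count which word follows each `context`-word window in the text."""
    words = text.split()
    table = defaultdict(Counter)
    for i in range(len(words) - context):
        key = tuple(words[i:i + context])
        table[key][words[i + context]] += 1
    return table


def complete(table, prompt, context=2, max_words=50):
    """Greedily extend the prompt with the most frequent next word."""
    out = prompt.split()
    for _ in range(max_words):
        key = tuple(out[-context:])
        if key not in table:  # never saw this context during training
            break
        out.append(table[key].most_common(1)[0][0])
    return " ".join(out)


# Hypothetical "training data": a single made-up sentence.
excerpt = ("the lighthouse keeper counted seven gulls "
           "before the storm reached the harbor")
table = train(excerpt)

# Prompting with the first two words regurgitates the whole excerpt.
print(complete(table, "the lighthouse"))
```

With only one sentence to learn from, the best "guess" for every context is whatever came next in that sentence, so the model reproduces it word for word. With lots of overlapping training text the counts blur together and exact reproduction gets less likely, which roughly matches the "nonzero chance it memorizes some excerpts" intuition.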