We shouldn’t be affording companies the ability to profit off other people’s creations without their consent, and despite its intentions, that's basically how current copyright law works.
A long-form response to the concerns, comments, and general principles many people raised in the post about authors suing companies creating LLMs.
If the rumor is true that OpenAI is using libgen to obtain books, then this will be a very interesting fight.
Authors profiteering from arcane copyright laws vs. a sleazy company that hypes up an LLM as if it were HAL from 2001. Who is worse? Who should lose?! I’m on the edge of my seat already!
I get this argument for the film, television, and videogame industries, and other more modern ones out there. But outside of a handful of actual big-name authors, the average writer isn't exactly raking it in.
Also, thanks to being a relic of the past, we still have libraries, which offer books to read for free with a membership. Not only is this common, it's a celebrated thing among most authors and the reading community.
If I read her book and someone asked me to summarize it, would she sue me for copyright infringement too? Do I need her permission to read her book?
It seems to me like a cheap attempt to advertise her book.
US courts have already ruled in the past that human authorship is required for copyright. The logical conclusion is that human authorship would also be required to justify a fair use defence. You providing a summary without any quotations would likely qualify as fair use; note that fair use doesn't mean no copying occurred, it's a defence that excuses the copying. A machine or algorithm that cannot perform the act of creative authorship would thus not be covered by the fair use defence.
To read it in the first place, before you summarize it, you need to obtain it legally, either by buying it or by checking it out from a library (which has bought it).
Or you sit in the library and "read" it. And where exactly do you draw the line on where the library is? Many libraries loan out digital copies. You can also sit in a book store (they exist!) and read a book without purchasing it.
It's going to be difficult to use the "they couldn't possibly have had legit access to all these books" argument in court.
I think this has to do with intent. If I read a book to use it as the basis for a play, that would be illegal. If I read for enjoyment, that is legal. Since AI does not read for enjoyment, but only to use what it reads as the basis for creating something else, that would be illegal.
This isn't how it works at all. I can, and should, and do, read and consume all sorts of media with the intention of stealing from it for my own works. If you ask for writing advice, this is actually probably one of the first things you'll hear: read how other people do it.
So "the intent of the reading" does not work as an argument, because if it did, humans could never generate any new media either.
This is the thing I kept shouting when diffusion models took off. People are effectively saying "make it illegal for neural nets to learn from anything creative or productive anywhere in any way"
Because despite the differences in architecture, I think it is parallel.
If the intent and purpose of the tool was to make copies of the work in a way we would consider theft if done by a human, I would understand.
In the same way there isn't any legal protection against neural nets learning from personal and abstract information to manipulate, predict, or control the public, the intended function of the tool is what should make it illegal.
But people are too self-focused and ignorant to riot en masse about that one.
The dialogue should also be in creating a safety net as more and more people lose value in the face of new technology.
But fuck any of that, what if an a.i. learned from a painting I made ten years ago, like every other artist who may have learned from it? Unforgivable.
I don't believe it's reproducing my art, even if asked to do so, and I don't think I'm entitled to anything.
Also, copyright has been fucked for decades. It hasn't served the people since long before the Mickey Mouse Protection Act.
Your logic is flawed in that works merely inspired by other works are not a violation of copyright. Generally, copyright protects a text or piece of art from being reproduced. Specific characters and settings can be protected by copyright; concepts and themes cannot. People take inspiration from the work of others all the time. Lots of TV shows or whatever are heavily informed by previous works, and that's totally fine.
Copyright protects against the reproduction of other people's work and the reuse of their specific characters. It doesn't protect style, themes, concepts, etc., i.e. the things that an AI is trying to derive. So if you trained your LLM only on Tolkien, such that it always told stories about Gandalf and the hobbits, then that would be a problem.
I don't know what the authors are complaining about. All the AI is doing is trawling through a lexicon of words and rearranging them into an order that will sell books. It's exactly what authors do. This is about money.
In the article I explain that it is not exactly what authors do: reading and writing are inherently human activities, and the consumption and processing of massive amounts of data (far more than a human with a photographic memory could process in a hundred million lifetimes) is a completely different process.
I also point out that I don't have a problem with LLMs as a concept, and I'm actually excited about what they can do, but that they are inherently different from humans and should be treated as such by the law.
My main point is that authors should have the ability to decree that they don't want their work used as training data for megacorporations to profit from without their consent.
So, yes, in a way it is about money, but the money in question is the money OpenAI and Meta are making off the backs of millions of unpaid and often unsuspecting people.
I think it's an interesting topic, thanks for the article.
It does start to raise some interesting questions: if an author doesn't want their book to be ingested by an LLM, then what is acceptable? Should all LLMs now be ignorant of that work? What about summaries or reviews of that work?
What if an LLM could extrapolate what's in the book from a summary of it? Or write a book similar to the original? Does that become a new work, or does it still fall under copyright?
I do fear that copyright laws will muddy the waters and slow down the development of LLMs, with a greater impact than any government standards ever will!
I wish people would stop comparing AI to human beings; an AI using the product of your labor without your consent to emulate the characteristics of your work is not the same as an actual human being studying someone's works for inspiration or to learn.
If you trained an AI to write a Sarah Silverman book, that would be unethical, so I don't understand why it's OK to do the same thing in a more dispersed way. You're still profiting off someone else's work (sometimes their life's work) without any kind of compensation.
This fervor to shove the human being to the side in favor of AI is really dehumanizing and doesn't serve to foster creativity but to stifle and clip it for the profit of some company's bottom line. It's the worst aspect of capitalism made even more efficient, now not just stealing people's physical labor but scraping off the intangible qualities that define an individual to dump them into a machine. It's horrific.
They’re “complaining” about unique qualities of their art being used, without consent, to create new things which ultimately de-value their original art.
It’s a debate to be had, I’m not clearly in favour of either argument here, but it’s quite obvious what they’re upset with.
If it's a debate to be had, then it's something that should have been debated hundreds of years ago when copyright was first invented, because every author or artist re-uses the "unique qualities" of other people's art when making their own new stuff.
There's the famous "good authors copy, great authors steal" quote, but I rather like the related one by C. E. M. Joad mentioned in that article: "the height of originality is skill in concealing origins."
I think the whole thing about megacorps being the problem here is a bit short-sighted; I don't think it will be too much longer before anyone can spin up their own LLM. It doesn't exactly take Google levels of resources. I'm as happy to shit on megacorps as the next person here, but IP law as it is is BS.
More likely than not any changes made will be to benefit large corporations at the expense of individuals and competition. I'm imagining a world where copyright law has made it so that only big corporations can afford to pay for LLM training data. As if individuals had to pay library book prices for a personal book to train their personal LLM. This desire to "cash in" may just play right into the megacorporation's hand.
I agree that cashing in is at least an important part of this. As I understand it, however, past a certain point creating and using LLMs is in fact extremely expensive; that's why GPT-4 limits user interactions, for example. I also think that the more restricted these tools are in general, the better for everyone. It's absolutely possible to use them in positive ways, but as it stands they are mostly just flooding the internet with garbage and killing low-level content jobs.