Make illegally trained LLMs public domain as punishment

Wouldnt that give people who is it for bad things easier access? It should be made illegal to create if they dont legally have access to that data

They don't mean your data, silly. They don't give a fuck about that.

They mean other huge corporations data.

intellectual property doesn't really exist in most of the world. they don't give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore...

it's arbitrary law that is designed to protect corporations and it's generally unenforceable.

I used whisper to create subs of a video and in a section with instrumental relaxing music it filled on repeat with

La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un'ipnosi non verbale

Clearly stolen from this Dr paret YouTube channels where he's selling hypnosis lessons in Italian. Probably in one or multiple videos he had subs stating this over the same relaxing instrumental music that I used and the model assumed the sound corresponded to that text

Although I'm a firm believer that most AI models should be public domain or open source by default, the premise of "illegally trained LLMs" is flawed. Because there really is no assurance that LLMs currently in use are illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact analyzing publicly viewable information is a pretty deep rooted freedom that provides a lot of positives to the world.

The idea of... well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since when the laws on the book around these kinds of things become active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology simply has more than enough good uses that banning it would simply cause it to flourish elsewhere that does not ban it, which means as usual that everyone but the multinational companies lose out. The same would happen with more strict copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose to these companies as it currently stands. By requiring the models to be made available to the public do we ensure that the playing field doesn't tip further into their favor to the point AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

"Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data"

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author's earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

...which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it's technically true - a court didn't order it, but a guy who goes by the name "That One Privacy Guy" while blogging on linkedin did).

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I'm not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

So banks will be public domain when they're bailed out with taxpayer funds, too, right?

Correct

To speak of AI models being "made public domain" is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by "public domain" the author means that they should be required to publish the weights and also that they shouldn't get any trade secret protections related to those weights?

It won't really do anything though. The model itself is whatever. The training tools, data and resulting generations of weights are where the meat is. Unless you can prove they are using unlicensed data from those three pieces, open sourcing it is kind of moot.

What we need is legislation to stop it from happening in perpetuity. Maybe just ONE civil case win to make them think twice about training on unlicensed data, but they'll drag that out for years until people go broke fighting, or stop giving a shit.

They pulled a very public and out in the open data heist and got away with it. Stopping it from continuously happening is the only way to win here.

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

It could also contain non-public domain data, and you can't declare someone else's intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.

Are you threatening me with a good time?

First of all, whether these LLMs are "illegally trained" is still a matter before the courts. When an LLM is trained it doesn't literally copy the training data, so it's unclear whether copyright is even relevant.

Secondly, I don't think that making these models "public domain" would have the negative effects that people angry about AI think it would. When a company is running a closed model internally, like ChatGPT for example, the model is never available for download in the first place. It doesn't matter if it's public domain or not because you can't get a copy of it. When a company releases an open-weight model for public use, on the other hand, they usually encumber them with some sort of license that makes them harder for competitors to monetize or build on. Making those public-domain would greatly increase their utility. It might make future releases less likely, but in the meantime it'll greatly enhance AI development.

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs... probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don't have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to frutition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

I mean, if we really are following the spirit of copyright, since no-one at open AI or other companies developed matrix and vector multiplication (operations existing in the public domain because Platonism is a thing).

Edit: oh my, I guess the consensus is that stealing the work of mathematicians is ok (or more, classifying our constructions as discoveries).

Yes!

Doesn't seem like this helps out all the writers / artists that the LLM stole from.