Nvidia, Apple, and others allegedly trained AI using 173,000 YouTube videos — professional creators frustrated by latest AI training scandal: Report

Some of the world's wealthiest companies, including Apple and Nvidia, allegedly trained their AI models on scraped YouTube videos. The YouTube transcripts were reportedly collected through means that violate YouTube's Terms of Service, and the practice has some creators seeing red. The news was first uncovered in a joint investigation by Proof News and Wired.
While major AI companies and producers often keep their AI training data secret, heavyweights like Apple, Nvidia, and Salesforce have revealed their use of "The Pile", an 800GB training dataset created by EleutherAI, and the YouTube Subtitles dataset within it. The YouTube Subtitles training data is made up of 173,536 YouTube plaintext transcripts scraped from the site, including 12,000+ videos which have been removed since the dataset's creation in 2020.
Affected parties whose work was purportedly scraped for the training data include education channels like Crash Course (1,862 videos taken for training) and Philosophy Tube (146 videos taken), YouTube megastars like MrBeast (two videos) and PewDiePie (337 videos), and TechTubers like Marques Brownlee (seven videos) and Linus Tech Tips (90 videos). Proof News created a tool you can use to survey the entirety of the YouTube videos allegedly used without consent.
Here’s what I don’t understand: these are the wealthiest corporations in the world. They literally have trillions of dollars at their disposal. Since they clearly believed there was value in the videos they stole, why couldn’t they just ask the creators’ permission and, if they consented, pay them a fair fee for access? And if they didn’t consent, why not just hire a creative to make more content for them to use? I mean, Apple owns a massive production studio, for fuck’s sake. Tim Cook farts money; I don’t think a thousand-dollar investment in a real person is going to break the bank. They could even order up a whole new show just to train the model.
Instead, they piss off creatives by stealing their work. Just use your money for once. Invest in content. Everybody would be happier, they’d garner some trust, and nobody’s livelihood would be harmed.
But no, instead they choose the most devious, underhanded, selfishly shitty way to conduct their business. Fuck these evilcorps.
Because the techbros know that licensing is far more expensive than theft.
It'd cost so much money to license the content that the AI model they're trying to shit out needs that it'd literally never be profitable. So they're doing that thing from Fight Club where they assume the number of times they'll get sued and lose will cost less than paying anyone reasonable license fees.
The stupid thing is that, in the US at least, they're not wrong: in a civil suit over this you have to pay your own lawyer fees, and since this would be a federal case, that ends up being pretty expensive.
And even if you win, you're likely to get only statutory damages, since proving real, actual losses is probably impossible. You'd be lucky, after a few years in court, to come out ahead at all, having paid for all the legal and other costs in the meantime. So why would you bother?
It's a pretty shitty situation that's being exploited because the remedies are out of the reach of most people who've had their shit stolen so that OpenAI can suggest you cover your pizza with glue.
Thank you for the thoughtful answer. This is so frustrating, and is very similar to other situations where megacorps decide that paying fines is cheaper than following the law.
Another terrible byproduct of all this is the false incentive structure it sets up. Rather than investing in people who are capable of producing unique and creative products, it incentivizes churning out a greater quantity of shitty content rather than high-quality stuff, and that will ultimately make the eventual consumer product that’s based on shitty stolen work, well, shitty.
Rich people don't become rich or stay rich by spending money they perceive they don't have to.
Pay for what you can steal???
Are you new to capitalism or
It's all about money. To them, it's free content. I hope Google sues them (which will be ironic).
They used "The Pile". Everyone can download it for free: https://pile.eleuther.ai/
They pirated the videos?