The popular open-source VLC video player was demonstrated on the floor of CES 2025 with automatic AI subtitling and translation, generated locally and offline in real time. Parent organization VideoLAN shared a video on Tuesday in which president Jean-Baptiste Kempf shows off the new feature, which uses open-source AI models to generate subtitles for videos in several languages.
This sounds like a great thing for deaf people and just in general, but I don't think AI will ever replace anime fansub makers who have no problem throwing a wall of text on screen for a split second just to explain an obscure untranslatable pun.
I have family members who can't really understand spoken English because it's a bit fast for them, and can't keep up with English subtitles either, for the same reason.
Sometimes you download a movie and all the Estonian subtitles are for an older release, so they're out of sync. Sometimes you can barely even find synchronized English subtitles, so even that doesn't work.
This seems like a godsend, honestly.
Funnily enough, of all the streaming services, I'm again going to have to commend Apple TV+ here. Their stuff has Estonian subtitles. Netflix, Prime, etc., do not. Meaning if I'm watching with a family member who doesn't understand English well, I'll watch Apple TV+ with a subscription, and everything else gets pirated for the subtitles. So I don't bother subscribing anymore. We're a tiny country, but for some reason Apple of all companies has chosen to acknowledge us. Meanwhile, I was setting up an Xbox for someone a few years ago, and Estonia just... straight up doesn't exist. I'm not talking about language support - you literally couldn't pick it as your LOCATION.
Now I want some AR glasses that display subtitles above someone's head when they talk à la Cyberpunk that also auto-translates. Of course, it has to be done entirely locally.
I guess we have most of the ingredients to make this happen. Software-wise we're there; hardware-wise I'm still waiting for AR glasses that can replace my normal glasses (which I wear 24/7 except for sleep). I'd accept carrying a spare in a charging case and swapping them out once a day or something, but other than that I want them close enough to my regular glasses in weight and comfort, just giving me AR: overlaid GPS, notifications, etc. And instant translation with subtitles is indeed a function I could see having a massive impact on civilization, tbh.
I think we're closer with hardware than software. The Xreal/Rokid category of HMDs is comfortable enough to wear all day, and I don't mind a cable running from behind my ear under a clothes layer to a phone or mini PC in my pocket. Unfortunately you still need to BYO cameras to get the overlays appearing at the correct points in space, but cameras are cheap; I suspect these glasses will grow some cameras in the next couple of iterations.
Breaking news: "WW3 starts over an insult due to a mistranslated phrase at the G7 summit. We will be nuked in 37 seconds. Fuck like rabbits, it's all we can do. Now over to Robert with traffic."
It'd be incredible for deaf people to be able to read captions for spoken conversations, and to have the other person's glasses translate from ASL to English.
Honestly I'd be a bit shocked if AI ASL -> English doesn't exist already; there's so much training data available, since the Deaf community loves video for obvious reasons.
I mean, it would. For example, Jellyfin implements it, but it does so by extracting the pictures ahead of time and saving them. It takes days to do this for my library.
Yeah, I do this for Plex as well, and Stash. I think if the file already exists in the directory, VLC should use it; it's up to you to generate them. That's exactly how album cover art on songs worked in VLC for a decade before they added the feature to pull cover art on the fly.
Video decoding is resource intensive. We're used to it, and we have hardware acceleration for some of it, but spewing around 52 million pixels every second (1920 × 1080 × 25 fps ≈ 52 million) from a highly compressed data source is not cheap. I'm not sure how the two compare, but small LLMs are not that costly to run if you don't factor their creation in.
All they'd need to do is generate thumbnails at a set interval on video load. Make that interval adjustable. Might take a few extra seconds to load a video. Make it off by default if they're worried about the performance hit.
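For what it's worth, that kind of interval thumbnail generation is easy to sketch with ffmpeg. Here's a minimal Python example; the use of the `ffmpeg` binary, the 10-second interval, and the output naming are my assumptions for illustration, not anything VLC or Jellyfin actually does:

```python
import subprocess
from pathlib import Path

def generate_thumbnails(video: str, out_dir: str, interval_s: int = 10, width: int = 320) -> None:
    """Dump one scaled JPEG every `interval_s` seconds using ffmpeg.

    Hypothetical sketch: a player could run this on load, cache the images
    next to the file, and reuse them for scrub previews.
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-hide_banner", "-loglevel", "error",
            "-i", video,
            # fps=1/interval keeps one frame per interval; scale keeps thumbnails small
            "-vf", f"fps=1/{interval_s},scale={width}:-1",
            str(Path(out_dir) / "thumb_%05d.jpg"),
        ],
        check=True,
    )

# Example: one 320px-wide thumbnail every 10 seconds
# generate_thumbnails("movie.mkv", "movie_thumbs", interval_s=10)
```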
There are other desktop video players that make this work.
I prefer watching Mexican football announcers, and it would be nice to know what they're saying. Though that might actually detract from the experience.
The technology is nowhere near good enough yet, though. On synthetic tests, on the data it was trained and tweaked on, maybe; I don't know.
I co-run an event where we invite speakers from all over the world, and we've tried every way to generate subtitles; all of them are roughly on the level of YouTube's autogenerated ones. It's better than nothing, but you can't really rely on it.
When you do live streaming there's no time for a backup; it either works or it doesn't. Better than nothing, that's for sure, but also maybe only marginally better than whatever we had 10 years ago.
Relax, they didn't write a new way of doing magic, they integrated a solution from the market.
I don't know what the new BMW they're introducing this year is capable of, but I know for a fact it can't fly.
Haven't watched the video yet, but it makes a lot of sense that you could train an AI using already subtitled movies and their audio. There are times when official subtitles paraphrase the speech to make it easier to read quickly, so I wonder how that would work. There's also just a lot of voice recognition everywhere nowadays, so maybe that's all they need?
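We don't know from the video exactly which models VLC wired in, but as an illustration of the "voice recognition is everywhere" point, a local open-source speech-to-text model can already turn a soundtrack into an SRT file fully offline. A rough sketch using the openai-whisper package; the model size, file names, and the timestamp helper are my own choices for the example:

```python
import whisper  # pip install openai-whisper; runs offline after the model download

def to_srt_time(seconds: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

# "small" is an arbitrary size choice; larger models are slower but more accurate.
model = whisper.load_model("small")
# task="translate" would output English regardless of the spoken language.
result = model.transcribe("movie_audio.mp3", task="transcribe")

# Write each recognized segment as a numbered SRT cue.
with open("movie_audio.srt", "w", encoding="utf-8") as srt:
    for i, seg in enumerate(result["segments"], start=1):
        srt.write(f"{i}\n")
        srt.write(f"{to_srt_time(seg['start'])} --> {to_srt_time(seg['end'])}\n")
        srt.write(f"{seg['text'].strip()}\n\n")
```

Offline transcription like this won't paraphrase for readability the way official subtitles do; it just transcribes what was said, which is part of why trained-on-subtitles output and raw speech-to-text output read differently.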