Google DeepMind creates AI model that can add sound to silent videos

Charlie Chaplin in The Gold Rush (1925).
Credit: Alamy

Fresh off animating memes over the past few days, AI has turned its attention to silent videos, specifically bringing audio to AI-generated clips.

Google’s DeepMind research arm has built a powerful new AI model, called V2A (video-to-audio), that can add audio to videos without sound, dubbing over the top with sound effects and music.

What is most impressive about the new research is the model's ability to accurately follow the visuals. In one clip they show a close-up of guitar playing, and the generated music closely matches the actual notes being played.

In some ways, it’s the other side of the coin from ElevenLabs' demonstration last month of generating music from a visual prompt. It also brings plenty of potential for restoring old media that no longer has an audio component, and Charlie Chaplin may be about to get a new voice if the technology progresses further.

While the Google DeepMind model isn't available to use yet, there is a similar tool from ElevenLabs that you can try today. If you want to create a video to test it with, check out our list of the 5 best AI video generators.

Google's new audio generation is off to a solid start

In a thread of posts on X, Google’s DeepMind account starts things off with a character walking through an eerily lit tunnel.

Light choir music plays over dramatic percussion, and the character’s footsteps can be heard as they move through the scene.

The second, with audio generated from the prompt “Wolf howling at the moon,” ties in nicely with the animation and even offers a chorus of howls in the distance.

The harmonica example sounds a little too “uncanny valley” in the way its pitch shifts, but the backing underneath it is solid, while the jellyfish clip sounds like, well, jellyfish. Notably, that one uses some extra prompts, including “marine life” and “ocean”.

The video with the prompt “A drummer on a stage at a concert surrounded by flashing lights and a cheering crowd” is a little off, though. For one, the beats don’t quite match the rhythm in the video once it gets going, and the sticks appear to be focused on the snare and maybe a floor tom, while the audio sounds a tad more complex, with other drums involved.

Still, it’s an impressive start to a project that’s only likely to grow with time.

Limitations of the DeepMind model

Like many projects from Google, this hasn't been released yet; it's just a research preview. Google says there are limitations and safety issues to address first.

For example: "Since the quality of the audio output is dependent on the quality of the video input, artefacts or distortions in the video, which are outside the model’s training distribution, can lead to a noticeable drop in audio quality."

The team is also working on lip syncing for videos with speech: the model currently attempts this, but it isn't always accurate and can create an uncanny valley effect.

ElevenLabs is working on a similar project

Not to be outdone, ElevenLabs this week revealed its new Text to Sound Effects API, which generates sound effects from a text description of the audio you want.

Unlike Google's V2A model, the API from ElevenLabs is already accessible and, in our experiments, works surprisingly well.

In the example above, a video of a bottle smashing gets a few different options to choose from, while the DiCaprio laughing meme gets additional audio from other people in the room.

The company 'bootstrapped' a quick app to demonstrate what is possible with the API, allowing you to upload a video and have it add the sound. This is free to use and open source, and you can try it right now.

ElevenLabs told Tom's Guide the real aim is to have other companies and developers build things with the API themselves, such as integrating it into generative video tools.
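For developers who want to experiment before building that kind of integration, a call to the sound effects API looks roughly like the sketch below. This is a minimal example assuming ElevenLabs' publicly documented /v1/sound-generation endpoint and its JSON fields; the exact parameters may change, so check the current API docs before relying on it.

```python
# Minimal sketch of a Text to Sound Effects request. Assumes the publicly
# documented /v1/sound-generation endpoint; field names may differ from the
# current ElevenLabs docs, so verify before use.
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder, not a real key

response = requests.post(
    "https://api.elevenlabs.io/v1/sound-generation",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json={
        "text": "Glass bottle smashing on a concrete floor",  # describe the sound
        "duration_seconds": 3,       # optional: clip length
        "prompt_influence": 0.5,     # optional: how literally to follow the prompt
    },
    timeout=60,
)
response.raise_for_status()

# The endpoint returns the generated audio as MP3 bytes.
with open("bottle_smash.mp3", "wb") as f:
    f.write(response.content)
```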
