Google's artificial intelligence research unit has developed a neural network that can pick out distinct voices in a crowd, provided it can see the speakers' faces, according to a Google Research Blog post.
By training the neural network on videos with varying amounts of background noise, the research team has been able to separate out an audio track for each distinct person the network can hear. The system can then highlight and amplify the person the audience wants to hear, whether that is someone in a busy bar, a participant in a video conference held in a noisy room, or a presenter on stage.
The system works by having the neural network look at each person's face and correlate that video track with the audio track, separating that speaker's voice from all other sound. The team trained the network on over 2,000 hours of video, each clip showing one person speaking with no background noise. They then combined these videos, creating "synthetic cocktail parties" with multiple speakers and the corresponding audio, along with background noise. The network learns separate encodings for the video and audio tracks, then fuses them so that each speaker's voice can be separated from the rest.
A visual diagram of how the neural network separates out voices. (Image: Google Research)
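To make the "synthetic cocktail party" idea more concrete, here is a minimal Python sketch of how clean single-speaker audio might be combined into a noisy training mixture. It is only an illustration, not Google's actual pipeline: the function names (`sine_track`, `noise_track`, `mix_tracks`), the sample rate, and the noise gain are all assumptions, and pure tones stand in for real speech recordings.

```python
import math
import random


def sine_track(freq_hz, seconds, sample_rate=16000, amplitude=0.5):
    """Hypothetical stand-in for a clean single-speaker clip (a pure tone)."""
    n = int(seconds * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


def noise_track(seconds, sample_rate=16000, amplitude=0.1, seed=0):
    """Hypothetical stand-in for background noise (uniform random samples)."""
    rng = random.Random(seed)
    n = int(seconds * sample_rate)
    return [rng.uniform(-amplitude, amplitude) for _ in range(n)]


def mix_tracks(clean_tracks, noise, noise_gain=0.3):
    """Sum the clean tracks and the scaled noise, sample by sample.

    Returns (mixture, targets). The mixture plays the role of the noisy
    input, and the untouched clean tracks are what a separation model
    would be trained to recover. The gain value is an assumed example.
    """
    length = min(len(t) for t in clean_tracks + [noise])
    targets = [t[:length] for t in clean_tracks]
    mixture = [
        sum(t[i] for t in targets) + noise_gain * noise[i]
        for i in range(length)
    ]
    return mixture, targets


# Two "speakers" plus background noise, one second each.
speaker_a = sine_track(220.0, seconds=1.0)
speaker_b = sine_track(330.0, seconds=1.0)
background = noise_track(seconds=1.0)

mixture, targets = mix_tracks([speaker_a, speaker_b], background)
print(len(mixture), len(targets))
```

Because the mixture is built from known clean tracks, every training example comes with its own ground truth, which is presumably why starting from clean, single-speaker video is useful.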
So how could this be used in a real-life situation?
Well, for one thing, it could lead to better hearing aids that make the person the user is focusing on louder and/or clearer. Another application is closed-captioning: computers often struggle with automated captioning, especially when more than one person is speaking (take a look at videos on YouTube with automatically generated closed captions for an example of this).
Finally, it could be used in video conferencing to amplify the person who is speaking. However, there are potential privacy issues that need to be sorted out: namely, how do you stop the AI from picking out a voice if the person speaking does not want to be overheard?
Check out the blog post for more technical information on how the Google Research team achieved this, or the paper for an in-depth look.