from Hacker News

A voice separation model that distinguishes multiple speakers simultaneously

by venmul on 7/11/20, 8:40 AM with 25 comments

by wenc on 7/11/20, 1:37 PM
This is known as Blind Source Separation [1], and it's been a field of study for decades. The specific problem here seems to be the "cocktail party problem", where you want to isolate a single speaker (or in this case 5?) in a room full of conversations.
When I was in grad school, I knew an EE research group in the building next to mine working on this problem using ICA (independent component analysis) -- this was ca. 2004, before the resurgence of deep learning. Even with ICA, useful results could be obtained.
The results of the FB work [2] with RNNs are pretty impressive (audio samples).
[1] https://en.wikipedia.org/wiki/Signal_separation
[2] https://enk100.github.io/speaker_separation/
by boublepop on 7/11/20, 6:13 PM
I feel that they are underplaying just how big a deal this would be in hearing aids. It’s not just a case of “slightly better noise filtering”; for some, it is the difference between being able to go to social events or not. For a large group of people using hearing aids, the cocktail party effect means they can’t hear anything at all in social settings. They avoid those settings completely because of the negative effects that come from everyone assuming you’re able to follow group conversations when you’re in fact sitting in your own little bubble, only able to pick up what’s going on when someone semi-yells directly at you.
In any case, the box you’d be selling them this product in wouldn’t say “better sound”; it would say: “Get back your ability to attend and enjoy parties, enjoy group conversations and socialize unencumbered”. That’s a huge quality of life improvement.
You still have the issue of how to figure out which voices to boost and which to reduce, but I’d expect that to be a simpler issue of using multiple receivers and directional detection.
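As a rough illustration of what "multiple receivers and directional detection" could look like, here is a minimal sketch that estimates a bearing from the arrival-time difference between two microphones. Everything in it is an assumption made for illustration (the function names, the 16 kHz sample rate, the 0.15 m spacing, the far-field geometry); it is not how any particular hearing aid works.

```python
import math
import random

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature


def best_lag(left, right, max_lag):
    """Lag in samples at which `right` best lines up with `left`.

    A positive result means the sound reached the left microphone first.
    Brute-force cross-correlation; fine for short illustrative signals.
    """
    n = len(left)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def bearing_degrees(lag, sample_rate, mic_spacing):
    """Angle off the straight-ahead direction, assuming a distant source."""
    path_difference = (lag / sample_rate) * SPEED_OF_SOUND
    ratio = max(-1.0, min(1.0, path_difference / mic_spacing))
    return math.degrees(math.asin(ratio))


def delayed_copy(signal, delay):
    """Hypothetical helper: the same signal arriving `delay` samples later."""
    return [0.0] * delay + signal[:len(signal) - delay]


# Synthetic test signal: seeded noise heard 3 samples later on the right.
rng = random.Random(0)
left_channel = [rng.gauss(0.0, 1.0) for _ in range(400)]
right_channel = delayed_copy(left_channel, 3)
lag = best_lag(left_channel, right_channel, max_lag=10)
angle = bearing_degrees(lag, sample_rate=16000, mic_spacing=0.15)
```

With a bearing per talker, a device could in principle boost whatever lies in the direction the wearer is facing. Real rooms add reverberation and overlapping talkers, which this sketch ignores.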
by yodon on 7/11/20, 2:16 PM
Facebook's work on separating multiple sources in an audio stream is fundamentally different from prior ICA-based methods of Blind Source Separation [0] in ways that are both interesting and seem to be part of a broader trend at FB Research.
ICA-based BSS requires at least n microphones to separate n sources of sound. This work does the separation with one microphone.
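To make the n-microphones-for-n-sources case concrete, here is a minimal sketch of two-channel ICA on synthetic data: whiten the two recordings, then search for the rotation that makes the outputs as non-Gaussian as possible. The sources, the mixing weights, and all function names are assumptions chosen for illustration; this is a textbook-style toy, not the method from the FB work, and it needs both mixtures to function at all.

```python
import math
import random


def center(x):
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def rotate(a, b, angle):
    c, s = math.cos(angle), math.sin(angle)
    return ([c * u + s * v for u, v in zip(a, b)],
            [-s * u + c * v for u, v in zip(a, b)])


def whiten(x1, x2):
    """Decorrelate two channels and scale each to unit variance."""
    x1, x2 = center(x1), center(x2)
    n = len(x1)
    var1 = sum(u * u for u in x1) / n
    var2 = sum(v * v for v in x2) / n
    cov = sum(u * v for u, v in zip(x1, x2)) / n
    # Rotation that diagonalises the 2x2 covariance matrix.
    y1, y2 = rotate(x1, x2, 0.5 * math.atan2(2 * cov, var1 - var2))
    s1 = math.sqrt(sum(u * u for u in y1) / n)
    s2 = math.sqrt(sum(v * v for v in y2) / n)
    return [u / s1 for u in y1], [v / s2 for v in y2]


def excess_kurtosis(z):
    # Valid for zero-mean, unit-variance input, which whitening provides.
    return sum(v ** 4 for v in z) / len(z) - 3.0


def ica_two_channels(x1, x2, steps=90):
    """Grid-search the rotation maximising summed squared kurtosis."""
    y1, y2 = whiten(x1, x2)
    best, best_score = None, -1.0
    for k in range(steps):
        z1, z2 = rotate(y1, y2, (math.pi / 2) * k / steps)
        score = excess_kurtosis(z1) ** 2 + excess_kurtosis(z2) ** 2
        if score > best_score:
            best, best_score = (z1, z2), score
    return best


def abs_corr(a, b):
    a, b = center(a), center(b)
    num = sum(u * v for u, v in zip(a, b))
    den = math.sqrt(sum(u * u for u in a) * sum(v * v for v in b))
    return abs(num / den)


def demo(n=2000, seed=0):
    """Mix two synthetic sources into two 'microphones', then unmix."""
    rng = random.Random(seed)
    s1 = [rng.uniform(-1, 1) for _ in range(n)]
    s2 = [rng.expovariate(1.0) * rng.choice((-1, 1)) for _ in range(n)]
    x1 = [0.8 * a + 0.3 * b for a, b in zip(s1, s2)]  # assumed mixing weights
    x2 = [0.4 * a + 0.9 * b for a, b in zip(s1, s2)]
    z1, z2 = ica_two_channels(x1, x2)
    # ICA recovers sources only up to order and sign, hence abs and max.
    return [max(abs_corr(s, z1), abs_corr(s, z2)) for s in (s1, s2)]
```

Calling `demo()` returns, for each true source, its best absolute correlation with a recovered channel. With only one mixture there is no rotation to search over, which is why the single-microphone setting calls for a different kind of model.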
What makes this more broadly interesting is FB Research has separately developed the capability to reconstruct full 3D models from single image photos[1].
Both of these reconstruct-from-single-sensor problems are MUCH harder than their associated reconstruct-from-multiple-sensors variants (ICA in the case of audio, stereo separation or photogrammetry in the case of video) so they aren't efforts one undertakes casually.
The obvious motivation for this single-sensor approach is augmenting existing video and audio clips, most of which are single camera, single microphone (or very closely spaced stereo microphones with minimal separation), and which people have already uploaded to Facebook in massive numbers.
The more interesting motivation could be that FB (Oculus) is widely believed to be developing next generation AR or VR glasses. Most of the discussion around AR/VR headsets focuses on the displays, but if you wanted to keep both your physical size and hardware parts cost to an absolute minimum, one of the things you'd want to minimize is your sensor count.
FB Research seems to have a strong interest in things that reduce the number of sensors required to provide high grade AR/VR experiences and that make it possible to explore pre-existing conventional media in spatialized 3D contexts.
[0] https://en.m.wikipedia.org/wiki/Independent_component_analys...
[1] https://ai.facebook.com/blog/facebook-research-at-cvpr-2020/
by thaumasiotes on 7/11/20, 11:43 AM
This is a really interesting problem to work on. A couple obvious points:
1. This is a task that humans must do all the time. It's very important in all kinds of different circumstances.
2. This is also a task that humans find very difficult. It's not like recognizing someone by their face, where humans do it effortlessly but struggle to describe how. We frequently fail at this.
Combining (1) and (2), and the assumption that this task has been just as important historically as it still is now, we might conclude that this is a really hard problem and AI is unlikely to reach the level of performance we might hope for.
And if AI quickly jumps to superhuman levels of performance, that too would have many interesting implications.
by ComputerGuru on 7/11/20, 5:21 PM
Not an expert in this domain but I'm not sure this can be done (well) without a physical component.
Recent studies have shown that we can consciously and subconsciously physically manipulate the position and directionality of our outer ear and some of the mechanics in the inner ear to "zero in" on noises and affect the frequency response of the ear. Our ears move imperceptibly when we look from side to side to synchronize what we hear with what we see. Try listening to what one person in a busy room is saying, then try doing the same while looking somewhere else.
There is hardware actively filtering out interfering sounds based on location and frequency, then there's the wetware that further processes the incoming signals and attempts to strip unwanted noise. I don't believe the second can be effectively done without a feedback loop to the first.
by Yhippa on 7/11/20, 4:13 PM
The "Why it matters" section is interesting. Cynically I'm trying to think of commercial uses of this for FB. I'm thinking if you built a device that you could put into public places, restaurants, or stores:
* People could order food from their table without summoning a server. I guess some restaurants have tablets or other devices at their table but it seems to break immersion if you're enjoying your company.
* In a big box store, someone could come help you where you are, instead of workers roaming the store while you hope to run into one.
* Fingerprint people in public or private for targeted advertising.
by iandanforth on 7/11/20, 4:52 PM
The assumption that this is possible comes from our ability to isolate voices in a crowd by paying attention to one or more of them. However, our ability to do so rests on two important factors that don't exist in these datasets: 1. We have two ears to allow for sound localization, and 2. The sounds we distinguish are colocated in space, allowing us to use ambient information for disambiguation.
This means that the problem being solved here is harder than the natural problem we have evolved and learned to solve.
This is both impressive and possibly problematic. Some feature of training in a goal-directed fashion in naturalistic environments could be essential for higher quality speaker isolation, or it might not matter at all. The multiplicity of models phenomenon tells us there are likely many solutions to this problem.
by fredmonroe on 7/12/20, 3:04 AM
i'm excited that FB develops and shares their research and simultaneously terrified of what they will do with it given past behavior
it's very disconcerting - i feel this way every time i use pytorch - which i love
by atum47 on 7/11/20, 6:09 PM
Nice, now Facebook can spy on several people at once.