by venmul on 7/11/20, 8:40 AM with 25 comments
by wenc on 7/11/20, 1:37 PM
When I was in grad school, I knew an EE research group in the building next to mine working on this problem using ICA (independent components analysis) -- this was ca 2004, before the resurgence of deep learning. Even with ICA useful results could be obtained.
The results of the FB work [2] with RNNs are pretty impressive (audio samples).
by boublepop on 7/11/20, 6:13 PM
In any case, the box you’d be selling them this product in wouldn’t say “better sound”; it would say: “Get back your ability to attend and enjoy parties, enjoy group conversations and socialize unencumbered”. That’s a huge quality of life improvement.
You still have the issue of how to figure out which voices to boost and which to reduce, but I’d expect that to be a simpler issue of using multiple receivers and directional detection.
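For what it's worth, here is a rough sketch of what "directional detection" with two receivers could look like: estimate the delay between the two microphones by cross-correlation, then turn that delay into an arrival angle. Everything concrete here is an assumption for illustration (the function names, the 16 kHz sample rate, the 15 cm spacing, a far-field source, 343 m/s for the speed of sound); it is not how any particular hearing aid or the FB work does it.

```python
import math

def best_lag(left, right, max_lag):
    """Return the shift (in samples) at which `right` best lines up with `left`."""
    def score(lag):
        # Cross-correlation of the two signals at this candidate lag.
        return sum(
            left[t] * right[t + lag]
            for t in range(len(left))
            if 0 <= t + lag < len(right)
        )
    return max(range(-max_lag, max_lag + 1), key=score)

def arrival_angle_deg(lag, sample_rate, mic_spacing_m, speed_of_sound=343.0):
    """Far-field approximation: sin(angle) = speed_of_sound * delay / spacing."""
    delay_s = lag / sample_rate
    ratio = speed_of_sound * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp against rounding and noise
    return math.degrees(math.asin(ratio))

# Toy data: the same short pulse reaches the right microphone 3 samples later.
pulse = [0.0, 1.0, 2.0, 1.0, 0.0]
left = pulse + [0.0] * 10
right = [0.0] * 3 + pulse + [0.0] * 7

lag = best_lag(left, right, max_lag=5)
angle = arrival_angle_deg(lag, sample_rate=16000, mic_spacing_m=0.15)
print(lag, round(angle, 1))
```

With several talkers at once the correlation has several peaks, so this only covers the "where is the sound coming from" half; deciding which direction the wearer actually cares about is still the open question.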
by yodon on 7/11/20, 2:16 PM
ICA-based BSS requires at least n microphones to separate n sources of sound. This work does the separation with one microphone.
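A toy way to see why the microphone count matters, assuming the simplest possible model (instantaneous linear mixing, no echoes, no noise). This is not ICA itself: ICA has to estimate the mixing matrix blind, whereas the sketch below just assumes it is known. The signals, the matrix values and the helper names are all made up for illustration.

```python
def mix(matrix, sources):
    """Each output row is a weighted sum of the source signals."""
    n_samples = len(sources[0])
    return [
        [sum(w * src[t] for w, src in zip(row, sources)) for t in range(n_samples)]
        for row in matrix
    ]

def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("mixing matrix is singular; sources cannot be separated")
    return [[d / det, -b / det], [-c / det, a / det]]

# Two hypothetical sources and a hypothetical 2-mic mixing matrix.
s1 = [1.0, 0.0, -1.0, 0.0]
s2 = [0.5, 0.5, -0.5, -0.5]
A = [[1.0, 0.5],
     [0.5, 0.5]]

# Two mics, two sources: two equations per sample, so unmixing is a matrix inverse.
mics = mix(A, [s1, s2])
recovered = mix(invert_2x2(A), mics)

# One mic, two sources: one equation per sample, two unknowns.
# A different pair of sources produces exactly the same recording.
one_mic = mix([A[0]], [s1, s2])[0]
alt_s1 = [x + 0.5 for x in s1]
alt_s2 = [x - 1.0 for x in s2]
same_recording = mix([A[0]], [alt_s1, alt_s2])[0]
```

In the one-microphone case the linear algebra alone cannot pick between the two explanations, so any single-microphone method has to bring in extra knowledge about what the sources sound like.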
What makes this more broadly interesting is FB Research has separately developed the capability to reconstruct full 3D models from single image photos[1].
Both of these reconstruct-from-single-sensor problems are MUCH harder than their associated reconstruct-from-multiple-sensors variants (ICA in the case of audio, stereo separation or photogrammetry in the case of video) so they aren't efforts one undertakes casually.
The obvious motivation for this single-sensor approach is augmenting existing video and audio clips, most of which are single camera, single microphone (or very closely spaced stereo microphones with minimal separation), and which people have already uploaded to Facebook in massive numbers.
The more interesting motivation could be that FB (Oculus) is widely believed to be developing next generation AR or VR glasses. Most of the discussion around AR/VR headsets focuses on the displays, but if you wanted to keep both your physical size and hardware parts cost to an absolute minimum, one of the things you'd want to minimize is your sensor count.
FB Research seems to have a strong interest in things that reduce the number of sensors required to provide high grade AR/VR experiences and that make it possible to explore pre-existing conventional media in spatialized 3D contexts.
[0] https://en.m.wikipedia.org/wiki/Independent_component_analys...
[1] https://ai.facebook.com/blog/facebook-research-at-cvpr-2020/
by thaumasiotes on 7/11/20, 11:43 AM
1. This is a task that humans must do all the time. It's very important in all kinds of different circumstances.
2. This is also a task that humans find very difficult. It's not like recognizing someone by their face, where humans do it effortlessly but struggle to describe how. We frequently fail at this.
Combining (1) and (2), and the assumption that this task has been just as important historically as it still is now, we might conclude that this is a really hard problem and AI is unlikely to reach the level of performance we might hope for.
And if AI quickly jumps to superhuman levels of performance, that too would have many interesting implications.
by ComputerGuru on 7/11/20, 5:21 PM
Recent studies have shown that we can consciously and subconsciously physically manipulate the position and directionality of our outer ear and some of the mechanics in the inner ear to "zero in" on noises and affect the frequency response of the ear. Our ears move imperceptibly when we look from side to side to synchronize what we hear with what we see. Try listening to what one person in a busy room is saying, then try doing the same while looking somewhere else.
There is hardware actively filtering out interfering sounds based on location and frequency, then there's the wetware that further processes the incoming signals and attempts to strip unwanted noise. I don't believe the second can be effectively done without a feedback loop to the first.
by Yhippa on 7/11/20, 4:13 PM
* People could order food from their table without summoning a server. I guess some restaurants have tablets or other devices at their table but it seems to break immersion if you're enjoying your company.
* In a big box store, someone could come help you where you are, instead of workers roaming the store while you hope to run into one of them.
* Fingerprint people in public or private for targeted advertising.
by iandanforth on 7/11/20, 4:52 PM
This means that the problem being solved here is harder than the natural problem we have evolved and learned to solve.
This is both impressive and possibly problematic. Some feature of training in a goal directed fashion in naturalistic environments could be essential for higher quality speaker isolation, or it might not matter at all. The multiplicity of models phenomenon tells us there are likely many solutions to this problem.
by fredmonroe on 7/12/20, 3:04 AM
it's very disconcerting - i feel this way every time i use pytorch - which i love
by atum47 on 7/11/20, 6:09 PM