On this page I have sounds and videos of various projects I've worked on in the past. You can find the technical papers describing all this work on my publications page. One would hope this page will get updated regularly :)
Audio Interfaces
Audio editors are pretty lousy: a graphical interface doesn't let you do anything useful when editing sound mixtures. In this demo we present an audio-driven interface that allows a user to vocalize the sound they want to select, and an automatic process matches that input to the most appropriate sound. Once the selection is made, we can manipulate each sound independently and then throw it back into the mix. This ties in with a lot of the work on audio separation shown in a later section.
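As a rough illustration of the matching step, here's a minimal sketch that compares a vocalized query against a set of candidate sounds using average MFCC features and cosine similarity. The file names and the feature choice are assumptions for the example; the actual system's matching is more involved.

```python
# Hypothetical sketch: match a vocal imitation to the closest candidate sound.
# Uses average MFCCs + cosine similarity; the real matching is more involved.
import numpy as np
import librosa

def mfcc_signature(path, sr=16000, n_mfcc=20):
    """Summarize a sound as its mean MFCC vector."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def best_match(query_path, candidate_paths):
    """Return the candidate whose signature is most similar to the query."""
    q = mfcc_signature(query_path)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(candidate_paths, key=lambda p: cos(q, mfcc_signature(p)))

# Example usage (paths are placeholders):
# selected = best_match("hummed_query.wav", ["drums.wav", "bass.wav", "vocals.wav"])
```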
We've been working on technologies for hearable devices, earbuds which will be smart enough to augment your audio surroundings. Here is a short demo of them modifying the audio stream of a scene to remove one speaker and HVAC noise, while maintaining all other audio signals. This was done in the context of AV object analysis, so as a bonus the visuals are also properly adjusted!
Demo videos of augmented audio. On the left is the input video, which includes multiple overlapping sounds. On the right is the processed output which has removed one speaker and the background noise while keeping the rest intact.
Very-very Large Crowdsourced Microphone Arrays
In our CrowdMic project we explored technologies that would allow us to take thousands of user-provided recordings and stitch them into a single coherent AV object. The key to doing this efficiently was to match their audio signals using some hashing magic that sped up computations by many orders of magnitude. This feature is commercially available in Adobe's video software for multicam recording, but it can also be used to align and enhance crowdsourced recordings (e.g. concerts, protests, etc.).
In the left video we see a subset of 700+ recordings from a concert that were scraped off social media posts. They were all recorded at different times and are obviously not synchronized. Our system analyzed and aligned all the videos in less than 30 seconds on a consumer laptop (in 2011), and set them on a common timeline as shown in the right video.
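The details of the hashing are in the papers, but the flavor is similar to landmark-style audio fingerprinting. Here's a toy sketch (a simplification, not the shipping code) that hashes pairs of spectrogram peaks and estimates the relative offset of two recordings from the most common hash time difference.

```python
# Toy landmark-fingerprint alignment: hash pairs of spectrogram peaks, then vote
# on the time offset between two recordings. A simplified stand-in for the real system.
import numpy as np
from collections import defaultdict, Counter

def peak_landmarks(x, nfft=1024, hop=512, peaks_per_frame=3):
    """Return (frame, freq_bin) pairs for the strongest bins of each STFT frame."""
    win = np.hanning(nfft)
    marks = []
    for t, start in enumerate(range(0, len(x) - nfft, hop)):
        mag = np.abs(np.fft.rfft(win * x[start:start + nfft]))
        for f in np.argsort(mag)[-peaks_per_frame:]:
            marks.append((t, int(f)))
    return marks

def hashes(marks, fan_out=5, max_dt=30):
    """Hash nearby peak pairs as (f1, f2, dt) -> list of anchor frames."""
    table = defaultdict(list)
    marks = sorted(marks)
    for i, (t1, f1) in enumerate(marks):
        for t2, f2 in marks[i + 1:i + 1 + fan_out]:
            if 0 < t2 - t1 <= max_dt:
                table[(f1, f2, t2 - t1)].append(t1)
    return table

def estimate_offset(x1, x2):
    """Most-voted offset (in STFT frames) of x2 relative to x1."""
    h1, h2 = hashes(peak_landmarks(x1)), hashes(peak_landmarks(x2))
    votes = Counter(t1 - t2 for k in h1.keys() & h2.keys()
                    for t1 in h1[k] for t2 in h2[k])
    return votes.most_common(1)[0][0] if votes else None
```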
Source Separation
One of my pet projects is source separation. In all honesty I can't figure out why one would want to separate a sound, since it is perfectly possible to use the same reasoning to perform many classification and processing operations directly on mixtures. But, hey, who am I to judge ... That said, it is a fun thing to try, and it has certainly resulted in a lot of neat research and drawn a lot of outside interest to audio. I've worked on this subject, on and off and often tangentially, for more than 10 years now. Here are some highlights in reverse chronological order:
Neural Net approaches (c. 2013-today)
A while back we wrote one of the first papers on using RNNs for source separation, an approach which has by now become the standard for most commercial audio systems. Although first applied to speech denoising, it has proven to be a powerful model for all sorts of signal restoration. Probably the most challenging (and rewarding) case was consulting on "The Beatles: Get Back" documentary to help develop an audio restoration system that would extract speech from really noisy recordings. Here's an example:
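For flavor, here's a minimal PyTorch sketch of the general recipe behind these denoisers: a recurrent net estimates a time-frequency mask from the noisy magnitude spectrogram. The architecture and sizes below are illustrative, not the model used in any of the systems mentioned above.

```python
# Minimal sketch of RNN-based speech denoising: a recurrent net predicts a
# time-frequency mask from the noisy magnitude spectrogram. Illustrative only.
import torch
import torch.nn as nn

class MaskRNN(nn.Module):
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, time, freq)
        h, _ = self.rnn(noisy_mag)
        return torch.sigmoid(self.out(h))  # mask in [0, 1]

model = MaskRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on dummy spectrograms (replace with real STFT magnitudes).
noisy = torch.rand(8, 100, 513)
clean = torch.rand(8, 100, 513)
opt.zero_grad()
mask = model(noisy)
loss = nn.functional.mse_loss(mask * noisy, clean)  # masked noisy input vs. clean target
loss.backward()
opt.step()
```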
Preceding the neural networks is a family of related statistical models which decompose time-frequency distributions into low-rank representations. They are similar to NMF approaches, but they are easy to combine with fancier machine learning to construct some really neat models. We've made convolutive forms, Markovian models, LDA versions, sparse coders, hierarchical structures, etc. As far as I know they produce state-of-the-art results for separation of monophonic mixtures (20+ dB SIR on 0 dB mixtures).
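The low-rank idea is easy to demonstrate with plain NMF: factor the magnitude spectrogram into spectral bases and activations, then resynthesize a subset of the components with a Wiener-style mask. Here's a toy sketch (plain NMF only, none of the fancier variants listed above); deciding which components belong to which source is the part the real models are smarter about.

```python
# Toy low-rank decomposition of a magnitude spectrogram with plain NMF,
# followed by a Wiener-style reconstruction of a subset of components.
import numpy as np
from sklearn.decomposition import NMF

def separate_components(mag, n_components=10, keep=(0, 1, 2, 3, 4)):
    """mag: (freq, time) magnitude spectrogram. Returns the magnitude of the kept components."""
    model = NMF(n_components=n_components, init="random", max_iter=400, random_state=0)
    W = model.fit_transform(mag)             # (freq, components) spectral bases
    H = model.components_                    # (components, time) activations
    keep = list(keep)
    full = W @ H + 1e-9
    part = W[:, keep] @ H[keep, :]
    return mag * (part / full)               # soft (Wiener-style) mask applied to the input
```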
Here's an example of removing a soprano from a recording. The removed voice was pitch-shifted and mixed back in to form a (slightly "off") duet:
Original recording
Extracted soprano
Remix
Here's a similar example using a traditional folk recording. In the remix the extracted singer is pitch-shifted down so that neighboring animals are not alarmed by the high-pitched tones :)
Original recording
Extracted singing
Remix
Here's an example recorded from an AIBO robot as it was
walking on a wooden floor (a speech recognition nightmare):
Original recording
Denoised speech
Same thing with the AIBO moving its head around (resulting in motor and ear-flapping noise)
Original recording
Denoised speech
Here's another "denoising" example. Note how the "noise" source is somewhat correlated with the music (although not as much as it should be!):
Original recording
Noise source
Denoised output
Convolutive NMF (c. 2004)
I developed the idea of convolutive non-negative matrix factorization and applied it to speech mixtures. One can train on a set of speakers and then, when provided with new input, decompose that input into the sets of components that best fit each speaker. Here's an example mixture and the extracted speakers:
Original recording
Extracted male voice
Extracted female voice
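For flavor, here's a non-convolutive sketch of that supervised recipe: learn a dictionary of spectral bases per speaker, explain the mixture spectrogram with the concatenated dictionaries, and resynthesize each speaker from their own bases. The convolutive model replaces these single-frame bases with short spectro-temporal patches, and the denoising example below is the semi-supervised version of the same idea (one dictionary known, one learned on the fly).

```python
# Sketch of supervised NMF separation: per-speaker dictionaries are learned from
# training spectrograms, then a mixture is explained with the stacked dictionaries.
# A simplification of the convolutive model described above.
import numpy as np
from sklearn.decomposition import NMF, non_negative_factorization

def learn_dictionary(train_mag, n_bases=30):
    """Learn spectral bases (freq, n_bases) from one speaker's training spectrogram."""
    return NMF(n_components=n_bases, max_iter=400, random_state=0).fit_transform(train_mag)

def separate(mix_mag, D_a, D_b):
    """Split a (freq, time) mixture magnitude spectrogram between two fixed dictionaries."""
    D = np.hstack([D_a, D_b])                        # (freq, bases_a + bases_b)
    # Fit activations with the dictionary held fixed (transpose so the fixed factor is H).
    act, _, _ = non_negative_factorization(
        mix_mag.T, H=D.T, n_components=D.shape[1], update_H=False, max_iter=400)
    H = act.T                                         # (bases, time) activations
    full = D @ H + 1e-9
    a = mix_mag * (D_a @ H[:D_a.shape[1]] / full)     # Wiener-style masks per speaker
    b = mix_mag * (D_b @ H[D_a.shape[1]:] / full)
    return a, b
```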
Here is an example of denoising (a special case of multi-source separation). In this case the speaker is known, but the background interference is not, and neither is the speaker's utterance in the mixture:
Original recording
Extracted speech
Frequency domain ICA (c. 1995)
My earliest claim to fame came with my master's thesis. I applied ICA in the time-frequency domain in order to solve convolutive mixing problems quickly. It worked out fine and also spawned a lot of work on the dreaded bin permutation problem! Here's a (simple and contrived) example which just sounds neat. It is played in "slow motion" so that you can hear each frequency band separate at its own pace.
Original recording
Extracted speech
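The basic mechanics are easy to sketch: take the STFT of each microphone signal and unmix every frequency bin independently, since the convolutive mixture becomes an (approximately) instantaneous complex mixture per bin. Below is a generic complex natural-gradient ICA sketch (not the thesis code); the per-bin outputs still have to be consistently reordered, which is exactly the bin permutation problem mentioned above.

```python
# Sketch of frequency-domain ICA: unmix each STFT bin independently with a small
# complex-valued natural-gradient ICA. The per-bin outputs still need consistent
# permutation across bins -- the "bin permutation problem".
import numpy as np

def stft(x, nfft=1024, hop=256):
    win = np.hanning(nfft)
    frames = [np.fft.rfft(win * x[i:i + nfft]) for i in range(0, len(x) - nfft, hop)]
    return np.array(frames).T                        # (freq, frames)

def unmix_bin(X, iters=300, mu=0.01):
    """X: (2, frames) complex mixture coefficients for one frequency bin."""
    W = np.eye(2, dtype=complex)
    for _ in range(iters):
        Y = W @ X
        G = Y / (np.abs(Y) + 1e-9)                   # sign nonlinearity (super-Gaussian prior)
        W = W + mu * (np.eye(2) - (G @ Y.conj().T) / Y.shape[1]) @ W  # natural-gradient step
    return W @ X

def separate(x1, x2):
    X1, X2 = stft(x1), stft(x2)
    Y = np.stack([unmix_bin(np.vstack([X1[f], X2[f]])) for f in range(X1.shape[0])])
    return Y   # (freq, 2, frames); fix permutations/scaling per bin, then inverse STFT
```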
Sound Recognition for Content Analysis (c. 2003)
First Commercial step
With Ajay Divakaran, Bhiksha Raj and Regu Radhakrishnan we used sound recognition for video content analysis. The first real-world application was sports highlights detection, which is an extraordinarily hard task in the visual domain (hey, if all goals looked the same the game wouldn't be worth it!), but a trivial one in the audio domain. By recognizing key sounds like crowds going wild, clapping, ball hits, speech, music, etc., we can deduce the state of excitement in the video stream. The resulting system works well on a variety of sports. It was initially released running on the Mitsubishi DVR-HE50W personal video recorder and has since been extended to find highlights in all sorts of sports (soccer, basketball, baseball, sumo, etc.).
Video
demonstrating the use of audio cues to detect sports highlights
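A rough sketch of the audio side of that pipeline: train a simple model per sound class, score incoming frames, and treat a smoothed crowd-excitement likelihood as the highlight score. The classes, features, and classifier below are placeholders, not the shipped system.

```python
# Illustrative highlight detection from audio: per-frame class likelihoods from a
# simple per-class model, with a smoothed "excitement" score thresholded into highlights.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["cheering", "applause", "speech", "music"]   # placeholder class list

def train_models(features_per_class, n_mix=8):
    """features_per_class: {class_name: (n_frames, n_dims) array of labeled audio features}."""
    return {c: GaussianMixture(n_components=n_mix, random_state=0).fit(X)
            for c, X in features_per_class.items()}

def excitement_curve(models, stream_features, win=50):
    """Smoothed log-likelihood of excited-crowd sounds over a stream of feature frames."""
    excite = models["cheering"].score_samples(stream_features) \
           + models["applause"].score_samples(stream_features)
    kernel = np.ones(win) / win
    return np.convolve(excite, kernel, mode="same")

def highlights(curve, thresh):
    """Frame indices whose smoothed excitement exceeds the threshold."""
    return np.flatnonzero(curve > thresh)
```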
Surveillance Systems
Just as in the video content analysis project, we can detect not just highlights but also emergencies. This is a demo video of a simulated elevator mugging. As you can see, there is not much to see! The elevator is dark, the contrast is lousy, and trying to figure out from the visual information when someone is being mugged is a very hard task. On the other hand, in such cases people scream, move around hitting things, and their tone of voice is distressed; there is plenty of audio commotion that can be detected reliably. On our data set of a few hundred videos we get almost 100% accuracy in detecting muggings.
Video demonstrating how
audio cues can help easily identify emergencies
The same idea as in the above projects, applied to traffic monitoring. The videos here are from an intersection in Louisville, KY. There are two cameras pointing at a troublesome intersection. Having the cameras on 24 hours a day means that some poor soul has to watch the footage and find the interesting sections, which can help improve the design and safety of the intersection. Instead, we can turn on the cameras only when specific sounds are detected. The cameras keep a recording buffer of a few seconds; once we recognize sounds like impacts, tire squealing, car horns, etc., we save that buffer and record the next few seconds. This provides us with a before-and-after glimpse of traffic "highlights". The videos in this section show some of the extracted scenes. Two are real accidents, one is a near accident (these are very useful in determining how to improve signage), and one is just one of those out-of-the-ordinary events. Just as in the previous examples, recognition rates are well into the 90% range. Additionally, we can detect when sirens are around and adjust the traffic lights appropriately.
Video demonstrating the
use of audio cues to detect traffic incidents
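The trigger-and-buffer mechanism itself is simple; here's a sketch of the before/after recording logic with the sound detector left as a stub (all names and sizes are illustrative).

```python
# Sketch of pre/post trigger buffering: keep the last few seconds of frames in a
# ring buffer, and when a trigger sound is detected save that buffer plus the next
# few seconds. The detector itself is a stub here.
from collections import deque

def is_trigger(frame):
    """Stub: replace with a real detector for impacts, tire squeals, horns, etc."""
    return False

def monitor(frames, pre_frames=150, post_frames=150):
    """frames: an iterator of audio/video frames. Yields saved clips around triggers."""
    ring = deque(maxlen=pre_frames)        # rolling "before" buffer
    post_left, clip = 0, None
    for frame in frames:
        if post_left:                      # currently recording the "after" part
            clip.append(frame)
            post_left -= 1
            if post_left == 0:
                yield clip
                clip = None
        elif is_trigger(frame):
            clip = list(ring) + [frame]    # start a clip from the pre-trigger buffer
            post_left = post_frames
        ring.append(frame)
    if clip:
        yield clip
```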
Analysis of Movies (c. 2005)
The same ideas can be applied to content analysis of movies using the audio track. With this approach you can search for scenes by their representative sounds: sections with guns firing, cars skidding, people talking, dogs barking, or whatever else makes sound. Just as in the sports system, these cues are very reliable and relatively easy to track compared to their visual counterparts. With the metadata from this kind of analysis we can divide a movie into sections, cluster scenes or entire movies, and automatically tag movie databases efficiently. In this demo video the bars at the bottom display the likelihood of each detected audio class. These likelihoods can be used to search for various events in a movie. Note that unlike generic sound recognition methods, this one works even when the sounds are mixed together.
Video demonstrating
concurrent sound recognition using various movie clips
Missing Spectral Data and Bandwidth Expansion
I'm also very interested in missing data theory. In the following examples I automatically fill in time-frequency gaps using a latent variable model. Here are two examples: one with a single large gap, and one with many distributed gaps. In both cases the reconstruction was performed using only the data available in the input.
Corrupted input
Reconstructed output
Corrupted input
Reconstructed output
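The gist can be shown with a mask-aware NMF: fit the low-rank model on the observed time-frequency cells only, and use the model's reconstruction wherever data is missing. This is a simplification of the latent variable model used for the actual examples.

```python
# Mask-aware NMF gap filling: fit W, H on the observed time-frequency cells only,
# then use W @ H to fill in the missing ones.
import numpy as np

def fill_gaps(V, M, rank=20, iters=300, eps=1e-9):
    """V: (freq, time) magnitude spectrogram; M: same shape, 1 = observed, 0 = missing."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        R = W @ H
        # Multiplicative updates weighted by the observation mask.
        H *= (W.T @ (M * V)) / (W.T @ (M * R) + eps)
        R = W @ H
        W *= ((M * V) @ H.T) / ((M * R) @ H.T + eps)
    return M * V + (1 - M) * (W @ H)       # keep observed data, fill gaps from the model
```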
We can also use the same ideas to perform bandwidth expansion with pre-trained models. In the following example we start with a band-limited Latin jazz recording with no low or high frequency content. We then train a model of Latin jazz sounds by recording gibberish from a synthesizer. The model itself holds enough audio information to help fill in the missing frequencies and provide a reasonable expansion.
Original Band-limited Input
Training example
Recovered Wideband Output
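Bandwidth expansion follows the same masked-fit idea, except the bases come from the wideband training audio and the "missing" cells are the out-of-band frequencies. A compact sketch under those assumptions:

```python
# Sketch of model-based bandwidth expansion: learn wideband spectral bases from
# training audio, fit their activations using only the in-band rows of a
# band-limited spectrogram, then reconstruct the full band from the model.
import numpy as np
from sklearn.decomposition import NMF

def expand_bandwidth(V_narrow, band_mask, V_train, rank=40, iters=300, eps=1e-9):
    """V_narrow: (freq, time) band-limited spectrogram; band_mask: (freq,) bool, True = observed band.
    V_train: (freq, time) wideband training spectrogram (e.g. synthesizer gibberish)."""
    W = NMF(n_components=rank, max_iter=400, random_state=0).fit_transform(V_train)
    H = np.random.default_rng(0).random((rank, V_narrow.shape[1]))
    Wb, Vb = W[band_mask], V_narrow[band_mask]     # in-band rows only
    for _ in range(iters):                         # fit activations on the observed band
        H *= (Wb.T @ Vb) / (Wb.T @ (Wb @ H) + eps)
    V_wide = W @ H                                 # model-based full-band reconstruction
    V_wide[band_mask] = V_narrow[band_mask]        # keep the original in-band content
    return V_wide
```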
Source Localization (c. 2002)
Here are two videos demonstrating localization. The first shows a system running on a PTZ camera, which automatically turns towards the most interesting sounds. The camera performed both localization and recognition of sounds in order to decide where to look. It was designed so that upon recognizing a sound it would turn to its most likely direction (e.g. it would look towards the elevators when it heard the elevator bell). You'll see me and Jay Thornton compete for the camera's attention. In this video the camera only tracks voices.
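Just to show the localization half in code, here's a minimal two-microphone direction estimate using GCC-PHAT, a generic textbook time-difference-of-arrival approach; it's shown for flavor and is not necessarily the exact method used in the demo.

```python
# Minimal two-microphone direction-of-arrival estimate using GCC-PHAT.
# A generic textbook approach, not necessarily what the PTZ camera demo used.
import numpy as np

def gcc_phat_delay(x1, x2, sr, max_tau=None):
    """Estimate the delay (seconds) of x2 relative to x1 via phase-transform cross-correlation."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else int(sr * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / sr

def doa_degrees(x1, x2, sr, mic_distance=0.2, c=343.0):
    """Convert the inter-mic delay into a bearing angle (0 degrees = broadside)."""
    tau = gcc_phat_delay(x1, x2, sr, max_tau=mic_distance / c)
    return float(np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0))))
```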