On this page I have sounds and videos of various projects I've worked on in the past. You can find the technical papers describing all this work on my publications page. One would hope this page will get updated regularly :)
Audio Interfaces
Audio editors are pretty lousy: a graphical interface doesn't let you do anything useful when editing sound mixtures. In this demo we present an audio-driven interface that allows a user to vocalize the sound they want to select, and an automatic process matches that input to the most appropriate sound. Once the selection is made, we can manipulate each sound independently and then throw it back into the mix. This ties in with a lot of the work on audio separation shown in a later section.
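As a rough illustration of the matching step, here's a minimal sketch that compares a vocalized query against a set of candidate sounds using average MFCC features and cosine similarity. The file names and the feature choice are assumptions for the example; the actual system's matching is more involved.

```python
# Hypothetical sketch: match a vocal imitation to the closest candidate sound.
# Uses average MFCCs + cosine similarity; the real matching is more involved.
import numpy as np
import librosa

def mfcc_signature(path, sr=16000, n_mfcc=20):
    """Summarize a sound as its mean MFCC vector."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)

def best_match(query_path, candidate_paths):
    """Return the candidate whose signature is most similar to the query."""
    q = mfcc_signature(query_path)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return max(candidate_paths, key=lambda p: cos(q, mfcc_signature(p)))

# Example usage (paths are placeholders):
# selected = best_match("hummed_query.wav", ["drums.wav", "bass.wav", "vocals.wav"])
```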
We've been working on technologies for hearable devices, earbuds which will be smart enough to augment your audio surroundings. Here is a short demo of them modifying the audio stream of a scene to remove one speaker and HVAC noise, while maintaining all other audio signals. This was done in the context of AV object analysis, so as a bonus the visuals are also properly adjusted!
Demo videos of augmented audio. On the left is the input video, which includes multiple overlapping sounds. On the right is the processed output which has removed one speaker and the background noise while keeping the rest intact.
Very-very Large Crowdsourced Microphone Arrays
In our CrowdMic project we explored technologies that would allow us to take thousands of user-provided recordings and stitch them into a single coherent AV object. The key to doing this efficiently was to match their audio signals using some hashing magic that sped up computations by many orders of magnitude. This feature is commercially available in Adobe's video software for multicam recording, but it can also be used to align and enhance crowdsourced recordings (e.g. concerts, protests, etc.).
In the left video we see a subset of 700+ recordings from a concert that were scraped off social media posts. They were all recorded at different times and are obviously not synchronized. Our system analyzed and aligned all the videos in less than 30 seconds on a consumer laptop (in 2011), and set them on a common timeline as shown in the right video.
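The details of the hashing are in the papers, but the flavor is similar to landmark-style audio fingerprinting. Here's a toy sketch (a simplification, not the shipping code) that hashes pairs of spectrogram peaks and estimates the relative offset of two recordings from the most common hash time difference.

```python
# Toy landmark-fingerprint alignment: hash pairs of spectrogram peaks, then vote
# on the time offset between two recordings. A simplified stand-in for the real system.
import numpy as np
from collections import defaultdict, Counter

def peak_landmarks(x, nfft=1024, hop=512, peaks_per_frame=3):
    """Return (frame, freq_bin) pairs for the strongest bins of each STFT frame."""
    win = np.hanning(nfft)
    marks = []
    for t, start in enumerate(range(0, len(x) - nfft, hop)):
        mag = np.abs(np.fft.rfft(win * x[start:start + nfft]))
        for f in np.argsort(mag)[-peaks_per_frame:]:
            marks.append((t, int(f)))
    return marks

def hashes(marks, fan_out=5, max_dt=30):
    """Hash nearby peak pairs as (f1, f2, dt) -> list of anchor frames."""
    table = defaultdict(list)
    marks = sorted(marks)
    for i, (t1, f1) in enumerate(marks):
        for t2, f2 in marks[i + 1:i + 1 + fan_out]:
            if 0 < t2 - t1 <= max_dt:
                table[(f1, f2, t2 - t1)].append(t1)
    return table

def estimate_offset(x1, x2):
    """Most-voted offset (in STFT frames) of x2 relative to x1."""
    h1, h2 = hashes(peak_landmarks(x1)), hashes(peak_landmarks(x2))
    votes = Counter(t1 - t2 for k in h1.keys() & h2.keys()
                    for t1 in h1[k] for t2 in h2[k])
    return votes.most_common(1)[0][0] if votes else None
```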
Source Separation
One of my pet projects is source separation. In all honesty I can't figure out why one would want to separate a sound, since it is perfectly possible to use the same reasoning to perform many classification and processing operations directly on mixtures. But, hey, who am I to judge ... That said, it is a fun thing to try, and it has certainly resulted in a lot of neat research and drawn a lot of outside interest to audio. I've worked on this subject, on and off and often tangentially, for more than 10 years now. Here are some highlights in reverse chronological order:
Neural Net approaches (c. 2013-today)
A while back we wrote one of the first papers on using RNNs for source separation, an approach which has by now become the standard for most commercial audio systems. Although first applied to speech denoising, it has proven to be a powerful model for all sorts of signal restoration. Probably the most challenging (and rewarding) case was consulting on "The Beatles: Get Back" documentary to help develop an audio restoration system that would extract speech from really noisy recordings. Here's an example:
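For flavor, here's a minimal PyTorch sketch of the general recipe behind these denoisers: a recurrent net estimates a time-frequency mask from the noisy magnitude spectrogram. The architecture and sizes below are illustrative, not the model used in any of the systems mentioned above.

```python
# Minimal sketch of RNN-based speech denoising: a recurrent net predicts a
# time-frequency mask from the noisy magnitude spectrogram. Illustrative only.
import torch
import torch.nn as nn

class MaskRNN(nn.Module):
    def __init__(self, n_freq=513, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, noisy_mag):          # (batch, time, freq)
        h, _ = self.rnn(noisy_mag)
        return torch.sigmoid(self.out(h))  # mask in [0, 1]

model = MaskRNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on dummy spectrograms (replace with real STFT magnitudes).
noisy = torch.rand(8, 100, 513)
clean = torch.rand(8, 100, 513)
opt.zero_grad()
mask = model(noisy)
loss = nn.functional.mse_loss(mask * noisy, clean)  # masked noisy input vs. clean target
loss.backward()
opt.step()
```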
Preceding the neural networks is a family of related statistical models which decompose time-frequency distributions into low-rank representations. They are similar to NMF approaches, but they are easy to combine with fancier machine learning to construct some really neat models. We've made convolutive forms, Markovian models, LDA versions, sparse coders, hierarchical structures, etc. As far as I know they produce state-of-the-art results for separation of monophonic mixtures (20+ dB SIR on 0 dB mixtures).
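The low-rank idea is easy to demonstrate with plain NMF: factor the magnitude spectrogram into spectral bases and activations, then resynthesize a subset of the components with a Wiener-style mask. Here's a toy sketch (plain NMF only, none of the fancier variants listed above); deciding which components belong to which source is the part the real models are smarter about.

```python
# Toy low-rank decomposition of a magnitude spectrogram with plain NMF,
# followed by a Wiener-style reconstruction of a subset of components.
import numpy as np
from sklearn.decomposition import NMF

def separate_components(mag, n_components=10, keep=(0, 1, 2, 3, 4)):
    """mag: (freq, time) magnitude spectrogram. Returns the magnitude of the kept components."""
    model = NMF(n_components=n_components, init="random", max_iter=400, random_state=0)
    W = model.fit_transform(mag)             # (freq, components) spectral bases
    H = model.components_                    # (components, time) activations
    keep = list(keep)
    full = W @ H + 1e-9
    part = W[:, keep] @ H[keep, :]
    return mag * (part / full)               # soft (Wiener-style) mask applied to the input
```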
Here's an example of removing a soprano from a recording. The removed voice was pitch-shifted and mixed back in to form a (slightly "off") duet:
Original recording
Extracted soprano
Remix
Here's a similar example using a traditional folk recording. In the remix the extracted singer is pitch-shifted down so that neighboring animals are not alarmed by the high-pitched tones :)
Original recording
Extracted singing
Remix
Here's an example recorded from an AIBO robot as it was
walking on a wooden floor (a speech recognition nightmare):
Original recording
Denoised speech
Same thing with the AIBO moving its head around (resulting in motor and ear-flapping noise)
Original recording
Denoised speech
Here's another "denoising" example. Note how the "noise" source is somewhat correlated with the music (although not as much as it should be!):
Original recording
Noise source
Denoised output
Convolutive NMF (c. 2004)
I developed the idea of convolutive non-negative matrix factorization and applied it to speech mixtures. One can train on a set of speakers and then, when provided with new input, decompose that input into the sets of components that best fit each speaker. Here's an example mixture and the extracted speakers:
Original recording
Extracted male voice
Extracted female voice
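For flavor, here's a non-convolutive sketch of that supervised recipe: learn a dictionary of spectral bases per speaker, explain the mixture spectrogram with the concatenated dictionaries, and resynthesize each speaker from their own bases. The convolutive model replaces these single-frame bases with short spectro-temporal patches, and the denoising example below is the semi-supervised version of the same idea (one dictionary known, one learned on the fly).

```python
# Sketch of supervised NMF separation: per-speaker dictionaries are learned from
# training spectrograms, then a mixture is explained with the stacked dictionaries.
# A simplification of the convolutive model described above.
import numpy as np
from sklearn.decomposition import NMF, non_negative_factorization

def learn_dictionary(train_mag, n_bases=30):
    """Learn spectral bases (freq, n_bases) from one speaker's training spectrogram."""
    return NMF(n_components=n_bases, max_iter=400, random_state=0).fit_transform(train_mag)

def separate(mix_mag, D_a, D_b):
    """Split a (freq, time) mixture magnitude spectrogram between two fixed dictionaries."""
    D = np.hstack([D_a, D_b])                        # (freq, bases_a + bases_b)
    # Fit activations with the dictionary held fixed (transpose so the fixed factor is H).
    act, _, _ = non_negative_factorization(
        mix_mag.T, H=D.T, n_components=D.shape[1], update_H=False, max_iter=400)
    H = act.T                                         # (bases, time) activations
    full = D @ H + 1e-9
    a = mix_mag * (D_a @ H[:D_a.shape[1]] / full)     # Wiener-style masks per speaker
    b = mix_mag * (D_b @ H[D_a.shape[1]:] / full)
    return a, b
```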
Here is an example of denoising (a special case of multi-source separation). In this case the speaker is known, but the background interference is not, and neither is the speaker's utterance in the mixture:
Original recording
Extracted speech
Frequency domain ICA (c. 1995)
My earliest claim to fame came with my master's thesis. I applied ICA in the time-frequency domain in order to solve convolutive mixing problems quickly. It worked out fine and also spawned a lot of work on the dreaded bin permutation problem! Here's a (simple and contrived) example which just sounds neat. It is played in "slow motion" so that you can hear each frequency band separate at its own pace.
Original recording
Extracted speech
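The basic mechanics are easy to sketch: take the STFT of each microphone signal and unmix every frequency bin independently, since the convolutive mixture becomes an (approximately) instantaneous complex mixture per bin. Below is a generic complex natural-gradient ICA sketch (not the thesis code); the per-bin outputs still have to be consistently reordered, which is exactly the bin permutation problem mentioned above.

```python
# Sketch of frequency-domain ICA: unmix each STFT bin independently with a small
# complex-valued natural-gradient ICA. The per-bin outputs still need consistent
# permutation across bins -- the "bin permutation problem".
import numpy as np

def stft(x, nfft=1024, hop=256):
    win = np.hanning(nfft)
    frames = [np.fft.rfft(win * x[i:i + nfft]) for i in range(0, len(x) - nfft, hop)]
    return np.array(frames).T                        # (freq, frames)

def unmix_bin(X, iters=300, mu=0.01):
    """X: (2, frames) complex mixture coefficients for one frequency bin."""
    W = np.eye(2, dtype=complex)
    for _ in range(iters):
        Y = W @ X
        G = Y / (np.abs(Y) + 1e-9)                   # sign nonlinearity (super-Gaussian prior)
        W = W + mu * (np.eye(2) - (G @ Y.conj().T) / Y.shape[1]) @ W  # natural-gradient step
    return W @ X

def separate(x1, x2):
    X1, X2 = stft(x1), stft(x2)
    Y = np.stack([unmix_bin(np.vstack([X1[f], X2[f]])) for f in range(X1.shape[0])])
    return Y   # (freq, 2, frames); fix permutations/scaling per bin, then inverse STFT
```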
Sound Recognition for Content Analysis (c. 2003)
First Commercial step
With Ajay Divakaran, Bhiksha Raj and Regu Radhakrishnan we used sound recognition for video content analysis. The first real-world application was sports highlights detection, which is an extraordinarily hard task in the visual domain (hey, if all goals looked the same the game wouldn't be worth it!), but a trivial one in the audio domain. By recognizing key sounds like crowds going wild, clapping, ball hits, speech, music, etc., we can deduce the state of excitement in the video stream. The resulting system works well on a variety of sports. It was initially released running on the Mitsubishi DVR-HE50W personal video recorder and has since been extended to find highlights in all sorts of sports (soccer, basketball, baseball, sumo, etc.).
Video
demonstrating the use of audio cues to detect sports highlights
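A rough sketch of the audio side of that pipeline: train a simple model per sound class, score incoming frames, and treat a smoothed crowd-excitement likelihood as the highlight score. The classes, features, and classifier below are placeholders, not the shipped system.

```python
# Illustrative highlight detection from audio: per-frame class likelihoods from a
# simple per-class model, with a smoothed "excitement" score thresholded into highlights.
import numpy as np
from sklearn.mixture import GaussianMixture

CLASSES = ["cheering", "applause", "speech", "music"]   # placeholder class list

def train_models(features_per_class, n_mix=8):
    """features_per_class: {class_name: (n_frames, n_dims) array of labeled audio features}."""
    return {c: GaussianMixture(n_components=n_mix, random_state=0).fit(X)
            for c, X in features_per_class.items()}

def excitement_curve(models, stream_features, win=50):
    """Smoothed log-likelihood of excited-crowd sounds over a stream of feature frames."""
    excite = models["cheering"].score_samples(stream_features) \
           + models["applause"].score_samples(stream_features)
    kernel = np.ones(win) / win
    return np.convolve(excite, kernel, mode="same")

def highlights(curve, thresh):
    """Frame indices whose smoothed excitement exceeds the threshold."""
    return np.flatnonzero(curve > thresh)
```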
Surveillance Systems
Just as in the video content analysis project, we can detect not just highlights but also emergencies. This is a demo video of a simulated elevator mugging. As you can see, there is not much to see! The elevator is dark, the contrast is lousy, and trying to figure out from the visual information when someone is being mugged is a very hard task. On the other hand, in such cases people scream, move around hitting things, and their tone of voice is distressed; there is plenty of audio commotion that can be detected reliably. On our data set of a few hundred videos we get almost 100% accuracy in detecting muggings.
Video demonstrating how
audio cues can help easily identify emergencies
The same idea as in the above projects, applied to traffic monitoring. The videos here are from an intersection in Louisville, KY. There are two cameras pointing at a troublesome intersection. Having the cameras on 24 hours a day means that some poor soul has to watch the footage and find the interesting sections, which can help improve the design and safety of the intersection. Instead, we can turn on the cameras only when specific sounds are detected. The cameras keep a recording buffer of a few seconds; once we recognize sounds like impacts, tire squealing, car horns, etc., we save that buffer and record the next few seconds. This provides us with a before-and-after glimpse of traffic "highlights". The videos in this section show some of the extracted scenes. Two are real accidents, one is a near accident (these are very useful in determining how to improve signage), and one is just one of those out-of-the-ordinary events. Just as in the previous examples, recognition rates are well into the 90% range. Additionally, we can detect when sirens are around and adjust the traffic lights appropriately.
Video demonstrating the
use of audio cues to detect traffic incidents
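The trigger-and-buffer mechanism itself is simple; here's a sketch of the before/after recording logic with the sound detector left as a stub (all names and sizes are illustrative).

```python
# Sketch of pre/post trigger buffering: keep the last few seconds of frames in a
# ring buffer, and when a trigger sound is detected save that buffer plus the next
# few seconds. The detector itself is a stub here.
from collections import deque

def is_trigger(frame):
    """Stub: replace with a real detector for impacts, tire squeals, horns, etc."""
    return False

def monitor(frames, pre_frames=150, post_frames=150):
    """frames: an iterator of audio/video frames. Yields saved clips around triggers."""
    ring = deque(maxlen=pre_frames)        # rolling "before" buffer
    post_left, clip = 0, None
    for frame in frames:
        if post_left:                      # currently recording the "after" part
            clip.append(frame)
            post_left -= 1
            if post_left == 0:
                yield clip
                clip = None
        elif is_trigger(frame):
            clip = list(ring) + [frame]    # start a clip from the pre-trigger buffer
            post_left = post_frames
        ring.append(frame)
    if clip:
        yield clip
```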
Analysis of Movies (c. 2005)
The same ideas can be applied to content analysis of movies using the audio track. With this approach you can search for scenes by their representative sounds: sections with guns firing, cars skidding, people talking, dogs barking, or whatever else makes sound. Just as in the sports system, these cues are very reliable and relatively easy to track compared to their visual counterparts. With the metadata from this kind of analysis we can divide a movie into sections, cluster scenes or entire movies, and automatically tag movie databases efficiently. In this demo video the bars at the bottom display the likelihood of each detected audio class. These likelihoods can be used to search for various events in a movie. Note that unlike generic sound recognition methods, this one works even when the sounds are mixed together.
Video demonstrating
concurrent sound recognition using various movie clips
Missing Spectral Data and Bandwidth Expansion
I'm also very interested in missing data theory. In the following examples I automatically fill in time-frequency gaps using a latent variable model. Here are two examples: one with a single large gap, and one with many distributed gaps. In both cases the reconstruction was performed using only the data available in the input.
Corrupted input
Reconstructed output
Corrupted input
Reconstructed output
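The gist can be shown with a mask-aware NMF: fit the low-rank model on the observed time-frequency cells only, and use the model's reconstruction wherever data is missing. This is a simplification of the latent variable model used for the actual examples.

```python
# Mask-aware NMF gap filling: fit W, H on the observed time-frequency cells only,
# then use W @ H to fill in the missing ones.
import numpy as np

def fill_gaps(V, M, rank=20, iters=300, eps=1e-9):
    """V: (freq, time) magnitude spectrogram; M: same shape, 1 = observed, 0 = missing."""
    rng = np.random.default_rng(0)
    W = rng.random((V.shape[0], rank))
    H = rng.random((rank, V.shape[1]))
    for _ in range(iters):
        R = W @ H
        # Multiplicative updates weighted by the observation mask.
        H *= (W.T @ (M * V)) / (W.T @ (M * R) + eps)
        R = W @ H
        W *= ((M * V) @ H.T) / ((M * R) @ H.T + eps)
    return M * V + (1 - M) * (W @ H)       # keep observed data, fill gaps from the model
```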
We can also use the same ideas to perform bandwidth expansion with pre-trained models. In the following example we start with a band-limited Latin jazz recording with no low or high frequency content. We then train a model of Latin jazz sounds by recording gibberish from a synthesizer. The model itself holds enough audio information to help fill in the missing frequencies and provide a reasonable expansion.
Original Band-limited Input
Training example
Recovered Wideband Output
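Bandwidth expansion follows the same masked-fit idea, except the bases come from the wideband training audio and the "missing" cells are the out-of-band frequencies. A compact sketch under those assumptions:

```python
# Sketch of model-based bandwidth expansion: learn wideband spectral bases from
# training audio, fit their activations using only the in-band rows of a
# band-limited spectrogram, then reconstruct the full band from the model.
import numpy as np
from sklearn.decomposition import NMF

def expand_bandwidth(V_narrow, band_mask, V_train, rank=40, iters=300, eps=1e-9):
    """V_narrow: (freq, time) band-limited spectrogram; band_mask: (freq,) bool, True = observed band.
    V_train: (freq, time) wideband training spectrogram (e.g. synthesizer gibberish)."""
    W = NMF(n_components=rank, max_iter=400, random_state=0).fit_transform(V_train)
    H = np.random.default_rng(0).random((rank, V_narrow.shape[1]))
    Wb, Vb = W[band_mask], V_narrow[band_mask]     # in-band rows only
    for _ in range(iters):                         # fit activations on the observed band
        H *= (Wb.T @ Vb) / (Wb.T @ (Wb @ H) + eps)
    V_wide = W @ H                                 # model-based full-band reconstruction
    V_wide[band_mask] = V_narrow[band_mask]        # keep the original in-band content
    return V_wide
```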
Source Localization (c. 2002)
Here are two videos demonstrating localization. The first shows a system running on a PTZ camera, which automatically turns towards the most interesting sounds. The camera performed both localization and recognition of sounds in order to decide where to look. It was designed so that upon recognizing a sound it would turn to its most likely direction (e.g. it would look towards the elevators when it heard the elevator bell). You'll see me and Jay Thornton compete for the camera's attention. In this video the camera only tracks voices.
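Just to show the localization half in code, here's a minimal two-microphone direction estimate using GCC-PHAT, a generic textbook time-difference-of-arrival approach; it's shown for flavor and is not necessarily the exact method used in the demo.

```python
# Minimal two-microphone direction-of-arrival estimate using GCC-PHAT.
# A generic textbook approach, not necessarily what the PTZ camera demo used.
import numpy as np

def gcc_phat_delay(x1, x2, sr, max_tau=None):
    """Estimate the delay (seconds) of x2 relative to x1 via phase-transform cross-correlation."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    R = X1 * np.conj(X2)
    cc = np.fft.irfft(R / (np.abs(R) + 1e-12), n)
    max_shift = n // 2 if max_tau is None else int(sr * max_tau)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return (np.argmax(np.abs(cc)) - max_shift) / sr

def doa_degrees(x1, x2, sr, mic_distance=0.2, c=343.0):
    """Convert the inter-mic delay into a bearing angle (0 degrees = broadside)."""
    tau = gcc_phat_delay(x1, x2, sr, max_tau=mic_distance / c)
    return float(np.degrees(np.arcsin(np.clip(tau * c / mic_distance, -1.0, 1.0))))
```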