CrowdMic: Enabling CrowdSourced Recordings


Enhancing Recordings with a Single Point of Interest


In this case we aim to improve the quality of a recording from an audio scene with a single point of interest, e.g. music performances, talks, or lectures. We do so by extracting only the desired parts from each of the multiple low-quality user-created recordings. This can be seen as a challenging microphone array setting with channels that are not synchronized, are defective in unique ways (e.g. varying coding schemes and interference), and have different sampling rates.

This material is based upon work supported by the National Science Foundation under Award #1319708, "III: Small: MicSynth: Enhancing and Reconstructing Sound Scenes from Crowdsourced Recordings."

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.


We achieve the separation by using an extended probabilistic topic model that enables sharing of topics (sources) across the recordings. More specifically, we factorize each recording's spectrogram, but constrain some of the sources to be identical (with different per-recording weights) across the simultaneous factorizations of all the recordings.
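As a rough illustration of this idea (not the exact PLCS updates from the paper), the sketch below runs a KL-NMF-style joint factorization in NumPy in which one set of spectral bases is shared by every recording while each recording also keeps a few private bases; the function name, component counts, and update schedule are illustrative choices of ours.

```python
import numpy as np

def shared_factorization(spectrograms, n_shared=20, n_private=10,
                         n_iter=200, eps=1e-9):
    """Toy KL-NMF with spectral bases shared across several spectrograms.

    spectrograms : list of (F, T_r) nonnegative magnitude spectrograms,
                   all with the same number of frequency bins F.
    Returns the shared bases plus, per recording, the private bases and
    the activation matrices.
    """
    F = spectrograms[0].shape[0]
    rng = np.random.default_rng(0)

    W_shared = rng.random((F, n_shared)) + eps          # bases common to all recordings
    W_priv = [rng.random((F, n_private)) + eps for _ in spectrograms]
    H = [rng.random((n_shared + n_private, X.shape[1])) + eps for X in spectrograms]

    for _ in range(n_iter):
        num_s = np.zeros_like(W_shared)                 # accumulators for the shared bases
        den_s = np.zeros_like(W_shared)
        for r, X in enumerate(spectrograms):
            W = np.hstack([W_shared, W_priv[r]])
            V = W @ H[r] + eps                          # current reconstruction
            R = X / V                                   # KL-divergence ratio term
            # activation update (standard multiplicative rule)
            H[r] *= (W.T @ R) / (W.T @ np.ones_like(X) + eps)
            V = W @ H[r] + eps
            R = X / V
            # private bases are updated per recording
            Hp = H[r][n_shared:]
            W_priv[r] *= (R @ Hp.T) / (np.ones_like(X) @ Hp.T + eps)
            # shared bases accumulate statistics over *all* recordings
            Hs = H[r][:n_shared]
            num_s += R @ Hs.T
            den_s += np.ones_like(X) @ Hs.T
        W_shared *= num_s / (den_s + eps)
        W_shared /= W_shared.sum(axis=0, keepdims=True) + eps   # keep shared bases normalized

    return W_shared, W_priv, H
```

The actual PLCS model works with probabilistic latent components and priors rather than plain NMF bases, but the key idea is the same: the statistics for the shared components are pooled over every recording before they are updated.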


Check out our paper about this project: "Collaborative Audio Enhancement Using Probabilistic Latent Component Sharing" (ICASSP 2013)


And some audio clips:

     •     Input#1: low-pass filtered recording (8kHz) with a speech interference (wav)

     •     Input#2: high-pass filtered recording (500Hz) with another speech interference (wav)

     •     Input#3: low-pass filtered (11.5kHz) and high-pass filtered (500Hz) recording with clipping artifacts (wav)

     •     Enhanced audio using PLCS plus both priors (wav)


Aligning Multiple Recordings from Noisy Measurements


In this demo we show how we can combine multiple very noisy and uncalibrated recordings from a real-life situation. The input files consisted of about 700 user-supplied recordings of a Taylor Swift concert, as uploaded to YouTube. Our task was to align all of these recordings efficiently, despite the highly heterogeneous sensors and the noisy nature of the recordings. Some of the input videos are shown below, unsynchronized.

Using a landmark-based approach we can automatically align the signals in a few seconds; some of the aligned videos are shown below. This means that we can now efficiently relate and combine videos using their audio signals, despite poor recording conditions.
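As a sketch of how such landmark-based alignment can work (the fingerprint design and parameters of our actual system may differ), the code below picks spectral peaks, hashes peak pairs into (f1, f2, dt) landmarks, and votes on the time offset supported by the most matching landmarks between two recordings; the STFT settings, peak-picking threshold, and fan-out are assumptions made for this example.

```python
import numpy as np
from scipy.signal import stft
from scipy.ndimage import maximum_filter
from collections import Counter, defaultdict

def landmarks(x, fs, fan_out=5, max_dt=60):
    """Spectral-peak landmarks: hash = (f1, f2, dt), anchored at frame t1."""
    _, _, Z = stft(x, fs=fs, nperseg=1024, noverlap=512)
    S = np.abs(Z)
    # crude peak picking: local maxima that also exceed the mean magnitude
    peaks = (S == maximum_filter(S, size=(15, 15))) & (S > S.mean())
    fbins, frames = np.nonzero(peaks)
    order = np.argsort(frames)
    fbins, frames = fbins[order], frames[order]
    out = []
    for i in range(len(frames)):
        for j in range(i + 1, min(i + 1 + fan_out, len(frames))):
            dt = frames[j] - frames[i]
            if 0 < dt <= max_dt:              # pair each peak with a few later peaks
                out.append(((fbins[i], fbins[j], dt), frames[i]))
    return out

def estimate_offset(x_ref, x_other, fs):
    """Relative offset (in seconds) that best aligns x_other to x_ref."""
    index = defaultdict(list)
    for h, t in landmarks(x_ref, fs):
        index[h].append(t)
    votes = Counter()
    for h, t in landmarks(x_other, fs):
        for t_ref in index.get(h, ()):
            votes[t_ref - t] += 1             # each matching landmark votes for an offset
    if not votes:
        return None                           # no landmarks in common
    offset_frames, _ = votes.most_common(1)[0]
    hop = 1024 - 512                          # STFT hop used in landmarks()
    return offset_frames * hop / fs
```

Because matching is done on hashed landmark pairs rather than raw waveforms, the voting step stays cheap even with hundreds of recordings and heavy background noise.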

Combining Multiple Recordings for Audio Enhancement


In this demo we stress-test the proposed approach with a highly challenging example in which we want to extract the voice of a target speaker from a highly dense mixture of interfering sounds. The setup of this particular problem is shown in the figure below. The target sound is the red diamond in the center, the blue dots indicate measurements from local sensors (e.g. cell phones of nearby users), and the empty diamonds represent competing sound sources.

If this job were to be done manually, someone would have to listen to all 80 recordings and pick out the best one based on perceptual quality assessment, which is a tedious, difficult, and expensive job. That manual selection would yield results ranging from the worst recording to the best one. Note that in the worst recording, someone else's voice was captured far more loudly than the dominant source of interest.

     •     Dominant source (wav)

     •     The worst recording (wav)

     •     The best recording (wav)

We ideally want to achieve the enhancement by using an extended probabilistic topic model that shares some topics (sources) across the recordings. However, this procedure calls for a lot of computation on all 80 recordings at every EM update. First, it is not very efficient, particularly when there are many recordings to deal with (in an ad-hoc microphone array the potential number of sensors is as large as the number of people in the crowd). Second, most of the recordings that are far from the dominant source of interest, or that suffer from severe artifacts, do not really contribute to the reconstruction of that source. Therefore, spending computation on the distant or low-quality recordings is wasteful.

To get around this computational complexity issue, we focus on the nearest neighbors at every EM iteration rather than on the whole set of recordings. In the second figure above, we can see that topic modeling finds similar convex hulls (the green polytope wrapping the data points) whether or not the non-neighboring data samples are included in the process. In fact, we believe that focusing on the neighbors of the current topics (the corners of the polytope) gives us not only a speed-up but also better results, because otherwise the M-step spends a lot of effort extracting a small amount of contribution from the non-neighboring observations.
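A minimal sketch of this neighbor selection step, assuming the recordings are already aligned and trimmed to a common length (the function name and the exact scoring shown here are illustrative, not our actual implementation):

```python
import numpy as np

def nearest_recordings(spectrograms, source_estimate, K, eps=1e-12):
    """Return indices of the K recordings closest to the current source estimate.

    spectrograms    : list of (F, T) magnitude spectrograms, one per recording
    source_estimate : (F, T) magnitude spectrogram rebuilt from the shared topics
    Closeness is measured by cross entropy between the normalized spectrograms.
    """
    p = source_estimate / (source_estimate.sum() + eps)    # normalize to a distribution
    scores = []
    for X in spectrograms:
        q = X / (X.sum() + eps)
        scores.append(-(p * np.log(q + eps)).sum())         # cross entropy H(p, q)
    return np.argsort(scores)[:K]
```

Only the recordings returned here would then contribute to the shared-component updates at that EM iteration.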


The nearest-neighbor search itself can take too much time, though. If we do the search exhaustively, using a proper distance metric such as the cross entropy between each recording's normalized magnitude spectrogram and that of the reconstructed source, the overhead introduced by the search diminishes the speed-up. Instead, we do the search based on the Hamming distance between hash codes of those spectrograms, because Hamming distances can be computed cheaply with bitwise operations. If we first find 3K pseudo-neighbors with respect to Hamming distance, and then perform the exhaustive search only on those 3K candidates rather than on the whole set, we can construct the K-neighbor set quickly.
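The sketch below shows one way such a two-stage search could look, using sign-of-random-projection codes for the cheap Hamming stage; the bit length, the hashing scheme, and the helper names are assumptions made for this example, not the exact codes used in our system.

```python
import numpy as np

def hash_code(X, projections, eps=1e-12):
    """Sign-of-random-projection bits for a normalized magnitude spectrogram."""
    v = (X / (X.sum() + eps)).ravel()
    bits = (projections @ v > 0).astype(np.uint8)
    return np.packbits(bits)                        # packed so XOR + popcount stays cheap

def hamming(a, b):
    return int(np.unpackbits(a ^ b).sum())

def neighbor_set(spectrograms, source_estimate, K, n_bits=64, seed=0, eps=1e-12):
    """Find 3K pseudo-neighbors by Hamming distance, then refine with cross entropy.

    Assumes every spectrogram has the same shape as source_estimate.
    """
    rng = np.random.default_rng(seed)
    P = rng.standard_normal((n_bits, source_estimate.size))   # shared random projections
    code_src = hash_code(source_estimate, P)
    codes = [hash_code(X, P) for X in spectrograms]

    # cheap pre-filter: 3K candidates under Hamming distance
    hd = np.array([hamming(c, code_src) for c in codes])
    candidates = np.argsort(hd)[:3 * K]

    # exact refinement (cross entropy) only on those candidates
    p = source_estimate / (source_estimate.sum() + eps)
    def xent(X):
        q = X / (X.sum() + eps)
        return -(p * np.log(q + eps)).sum()
    refined = sorted(candidates, key=lambda r: xent(spectrograms[r]))
    return refined[:K]
```

The expensive cross-entropy computation thus touches only 3K spectrograms per iteration instead of all of them, while the bitwise Hamming stage scales easily to very large ad-hoc arrays.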

Indeed, when we do the EM updates only on those K-neighbors with an adequate choice of K (10 to 20% of the total number of recordings, red and orange lines in the picture), we can get better separation performance than the full component sharing topic model (thick green line in the picture).

     •     The results from the proposed method (wav)

     •     The results from the full and slower PLCS method (wav)