IAFPA 2011 - 'No Thank You For the Music'


Our presentation at the International Association of Forensic Phonetics and Acoustics (IAFPA) Conference 2011 in Vienna, on the application of acoustic fingerprinting and echo cancellation to the reduction or removal of music for forensic audio enhancement.


  1. 1. ‘No Thank You For The Music’<br />AN APPLICATION OF AUDIO FINGERPRINTING AND AUTOMATIC MUSIC SIGNAL CANCELLATION FOR FORENSIC AUDIO ENHANCEMENT <br />27 July 2011<br />International Association of Forensic Phonetics and Acoustics Conference 2011,<br />Vienna, Austria<br />Anil Alexander & Oscar Forth<br />Oxford Wave Research Ltd<br />United Kingdom<br />
  2. 2. Introduction<br />In surveillance audio recordings, it is common to come across:<br />Interfering music or a television playing in the background in locations like pubs, cafes, cars, etc. <br />Target speakers turn on their music players or their televisions, as they begin to speak, especially when they suspect they are being monitored, in order to mask their speech. <br />The loud music drowns out the words or makes the speech of the speakers hard to decipher and transcribe. <br />Research Question: Is it possible to reduce or remove interfering music and to bring the voice of the speaker to the forefront?<br />
  3. 3. Scenario (1,2): Car or Hotel Room<br />Hotel Room<br />Noise sources: radio, television, music player<br />In a Car<br />Noise sources: road noise, car radio, other passengers<br />
  4. 4. Scenario (3): Pub/Hall with Music<br />Noise sources: television, jukebox, radio, bar noise, other speakers<br />
  5. 5. Why is this difficult?<br />“Is it possible to reduce or remove interfering music and to bring the voice of the speaker to the forefront?”<br />Straightforward subtraction of the reference audio will not remove the music, because it does not account for the acoustic effects of the room.<br />Cancellation is sensitive to clipping and compression.<br />It often has to be applied to a single channel of audio (without simultaneous reference recordings).<br />The exact song that is playing has to be identified and perfectly time-aligned. This is time- and labour-intensive.<br />
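The first difficulty on the slide above (subtraction ignores the room) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: white noise stands in for both the music and the speech, and the three-tap `room` response is hypothetical rather than measured.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000                                  # assumed sample rate (Hz)
music = rng.standard_normal(fs)            # stand-in for the clean reference track
speech = 0.1 * rng.standard_normal(fs)     # stand-in for the quieter target speech

# Hypothetical room response: a direct path plus two reflections.
room = np.zeros(200)
room[0], room[60], room[150] = 0.8, 0.4, -0.2

# What the microphone picks up: room-filtered music plus speech.
music_in_room = np.convolve(music, room)[:len(music)]
recording = music_in_room + speech

# Naive approach: subtract the clean reference and ignore the room.
naive = recording - music
residual_music = naive - speech            # music still present after subtraction


def power(x):
    """Mean squared amplitude of a signal."""
    return float(np.mean(x ** 2))


print("speech power:                       ", power(speech))
print("music power in the recording:       ", power(music_in_room))
print("music power after naive subtraction:", power(residual_music))
```

Even with perfect time alignment, the leftover music in this toy setup remains far stronger than the speech, which is why an estimate of the room response is needed before subtracting.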
  6. 6. Reducing Background Music<br />The tasks involved are:<br />Identifying the music/song being played<br />Aligning the tracks to the exact moment in time, within the file being analysed, that the song or music begins<br />Applying a noise- and distortion-robust echo cancellation algorithm to remove or reduce the music while mostly leaving the target speech intact. <br />
  7. 7. Automatic Music Identification<br />Commercial applications of acoustic fingerprinting include identifying tunes, songs, videos, advertisements and radio broadcasts, as well as anti-piracy initiatives.<br />Music identification systems such as Shazam™ have proliferated recently.<br />A short segment of audio (noisy, distorted or otherwise poor) is sent to an internet-based recognition server for identification.<br />The server compares features of this recording to a pre-indexed database of songs.<br />It selects the most probable candidate(s) for the song.<br />
  8. 8. Experimental Data<br />Sample Music Database<br />Popular genres (pop, rock and instrumental)<br />40 songs, CD quality (44,100 Hz, 16-bit, Microsoft WAV files)<br />Test Recordings<br />Made in a moving (diesel) vehicle with music playing, with windows both closed and open<br />Made in an office with music playing<br />
  9. 9. Schema<br />Audio File (Speech + Music) -> Landmark Feature Extraction -> Audio Fingerprint Comparison (against the Music Fingerprint Database) -> Identified music track -> LMS-based music canceller -> Enhanced Audio File (Speech)<br />
  10. 10. Noise-Robust Audio Fingerprinting<br />Attributes of a ‘fingerprint’, from Wang (2003):<br />Temporally localized<br />Translation invariant<br />Robust<br />Sufficiently entropic<br />Spectral peak pairs are thus temporally localized and robust to noise and transmission distortions<br />
  11. 11. Landmark-based Audio Fingerprinting Algorithm (1)<br />Spectrogram peaks are chosen (higher energy than neighbours)<br />The spectrogram is reduced to a ‘constellation map’<br />Peaks are paired into hashes<br />Hash pairs from the test audio are time-aligned (offset ∆t) until the best match is found<br />
  12. 12. Landmark-based Audio Fingerprinting Algorithm (2)<br />Using Ellis (2009) Robust Landmark-Based Audio Fingerprinting. <br />
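The landmark steps on the two slides above (peak picking, pairing peaks into hashes, and voting on a time offset) can be sketched in Python roughly as follows. This is a simplified illustration and not the Ellis (2009) implementation used in the presentation: peaks are local maxima along frequency only, the songs are synthetic three-tone sequences, and all function names and parameters (`n_fft`, `hop`, `per_frame`, `fan_out`, `max_dt`) are assumptions.

```python
import numpy as np
from collections import Counter


def spectrogram(x, n_fft=512, hop=256):
    """Magnitude STFT with a Hann window; one row per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def find_peaks(spec, per_frame=3):
    """Keep the strongest local maxima (along frequency) in each frame."""
    peaks = []
    for t, frame in enumerate(spec):
        is_max = (frame[1:-1] > frame[:-2]) & (frame[1:-1] > frame[2:])
        bins = np.nonzero(is_max)[0] + 1
        strongest = bins[np.argsort(frame[bins])[::-1][:per_frame]]
        peaks.extend((t, int(f)) for f in sorted(strongest))
    return peaks


def landmarks(peaks, fan_out=5, max_dt=32):
    """Pair each peak with a few later peaks: hash (f1, f2, dt) anchored at t1."""
    out = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt >= 1:
                out.append(((f1, f2, dt), t1))
                paired += 1
                if paired == fan_out:
                    break
    return out


def build_index(songs):
    """Map every hash to the (song, frame) positions where it occurs."""
    index = {}
    for name, audio in songs.items():
        for h, t in landmarks(find_peaks(spectrogram(audio))):
            index.setdefault(h, []).append((name, t))
    return index


def identify(index, query, hop=256):
    """Vote on (song, frame offset); return the best song, offset in samples, votes."""
    votes = Counter()
    for h, tq in landmarks(find_peaks(spectrogram(query))):
        for name, tr in index.get(h, []):
            votes[(name, tr - tq)] += 1
    if not votes:
        return None
    (name, frame_offset), count = votes.most_common(1)[0]
    return name, frame_offset * hop, count


def make_song(rng, seconds=5, fs=8000, note_len=1024):
    """Synthetic stand-in for a song: a sequence of random three-tone notes."""
    t = np.arange(note_len) / fs
    notes = []
    for _ in range(seconds * fs // note_len):
        freqs = rng.uniform(300, 3500, size=3)
        notes.append(sum(np.sin(2 * np.pi * f * t) for f in freqs))
    return np.concatenate(notes)


rng = np.random.default_rng(1)
songs = {name: make_song(rng) for name in ("A", "B", "C")}
index = build_index(songs)

start = 40 * 256                                   # true offset of the excerpt, in samples
excerpt = songs["B"][start:start + 16000]
query = excerpt + 0.3 * rng.standard_normal(len(excerpt))  # noise stands in for speech
result = identify(index, query)
print(result)                                      # (song name, offset in samples, votes)
```

The offset returned by `identify` is only resolved to one frame hop, so a practical system would likely refine the alignment further before cancellation.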
  13. 13. Echo Cancellation<br />Echo cancellation addresses a similar problem:<br />playback from the speakers and simultaneous recording from the microphones<br />the playback should not ‘seep in’ to the recording made by the microphone<br />An acoustic echo canceller could provide a good solution to the problem<br />Echo cancellation algorithms are generally LMS-based (Least Mean Squares); either time-domain or frequency-domain approaches can be used<br />In this application we use an echo canceller software module (compliant with the ITU-T G.167 and G.168 specifications) using the Intel Performance Primitives (IPP) library.<br />
  14. 14. Echo Cancellation<br />Block diagram labels:<br />Input (+): (S+N) Speech + music<br />Input (-): Electronic response estimate applied to (N’) Identified time-aligned music<br />Output: Speech + residual music (S+N’ –N”)<br />(N”) Residual<br />
  15. 15. Echo Cancellation (Noise Reduction)<br />Block diagram labels:<br />Input (+): (S+N) Speech + music<br />Input (-): Electronic Room Response estimate applied to (N’) Identified time-aligned music<br />Output after noise reduction: Speech + residual music NR(S+N’ –N”)<br />(N”) Residual<br />
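The LMS-based cancellation described on the slides above can be sketched with a textbook normalised LMS (NLMS) filter in Python. This is not the ITU-T G.167/G.168 compliant IPP module used in the presentation; it is a generic time-domain sketch in which the function name `nlms_cancel`, the step size `mu`, the filter length `n_taps`, the white-noise signals, and the 64-tap `room` response are all assumptions.

```python
import numpy as np


def nlms_cancel(recording, reference, n_taps=64, mu=0.5, eps=1e-6):
    """Subtract an adaptively filtered copy of `reference` from `recording`.

    Returns the error signal (speech plus residual music) and the final
    filter weights, which act as an estimate of the room response.
    """
    w = np.zeros(n_taps)
    buf = np.zeros(n_taps)                # recent reference samples, newest first
    out = np.zeros(len(recording))
    for n in range(len(recording)):
        buf = np.roll(buf, 1)
        buf[0] = reference[n]
        estimate = w @ buf                # reference music as shaped by the current filter
        e = recording[n] - estimate       # recording minus estimated music
        w += mu * e * buf / (buf @ buf + eps)   # normalised LMS weight update
        out[n] = e
    return out, w


rng = np.random.default_rng(0)
n = 20000
music = rng.standard_normal(n)                        # stand-in for the identified track
speech = 0.1 * rng.standard_normal(n)                 # stand-in for the target speech
room = rng.standard_normal(64) * np.exp(-np.arange(64) / 10.0)  # hypothetical room response
music_in_room = np.convolve(music, room)[:n]
recording = music_in_room + speech

cleaned, w_est = nlms_cancel(recording, music)
half = n // 2                                         # skip the convergence period
residual_music = cleaned[half:] - speech[half:]
print("music power before:", np.mean(music_in_room[half:] ** 2))
print("music power after: ", np.mean(residual_music ** 2))
```

The sketch assumes the reference is already time-aligned and that the room response fits within `n_taps` samples. Real speech is not white noise, so the adaptation would disturb it more than this toy example suggests.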
  16. 16. Result Example - Time Domain <br />Marked reduction in the noise floor<br />
  17. 17. Result Example - Frequency Domain<br />
  18. 18. Limitations<br />This method is not applicable to:<br />Badly clipped recordings<br />Compressed recordings<br />Recordings where there is a ‘drift’ or stretch in the playback time of the music (more applicable to analogue recordings)<br />Note: What is extracted may still not be of sufficient quality for forensic voice comparison<br />
  19. 19. Extensions and Applications<br />This approach can be extended to enhancement where the interference is not music or television<br />This uses two microphones in the same recording environment<br />Timing does not need to be synchronous (audio landmarking can be applied in some cases)<br />
  20. 20. Conclusions<br />A combination of audio fingerprinting and echo cancellation can be used to reduce the effect of interfering radio and television noise.<br />A significant improvement in intelligibility is obtained, which could benefit forensic audio enhancement and transcription.<br />This approach could be extended to non-music sources, such as speech, by using two independent recordings made in the same recording environment.<br />
  21. 21. References<br />A. Wang (2003) ‘An Industrial-Strength Audio Search Algorithm’, Proc. 2003 ISMIR International Symposium on Music Information Retrieval, Baltimore, MD, Oct. 2003.<br />J. Benesty, D. Morgan and M. Sondhi (1997) ‘A better understanding and an improved solution to the problems of stereophonic acoustic echo cancellation’, Proc. ICASSP 97, 303.<br />D. P. W. Ellis (2009) ‘Robust Landmark-Based Audio Fingerprinting’, http://labrosa.ee.columbia.edu/matlab/fingerprint/<br />