1. Department of Electronics & Computers, IIT Roorkee. A Bachelor Thesis Project Presentation (First Evaluation) on Audio Fingerprinting for Song Identification, under the guidance of Dr. Padam Kumar. Team: Rishabh Sood, B.Tech. CSE IV Yr., 070820; Santosh Kumar, B.Tech. CSE IV Yr., 070824; Vikesh Khanna, B.Tech. CSE IV Yr., 070829
3. Problem Statement To build a robust audio fingerprinting system which can be used to identify songs efficiently from a large database with limited computing resources and input.
4. Motivation There is immense scope for robust audio fingerprinting applications in industry.
P2P Filtering: filtering copyrighted material from P2P networks, even if filenames and metadata have been tampered with.
Language Translation: identifying audio content in foreign languages, which is not possible by textual search.
Broadcast Monitoring: automating royalty collection by monitoring broadcast channels.
Media Plugins: plugins for playlist generation and for identifying similar tracks.
5. Audio Fingerprint definition An audio fingerprint is essentially a hash function that maps an audio object consisting of a large number of bits to a ‘fingerprint’ of only a limited number of bits. The audio object can be uniquely identified from this bit string. (Figure: F maps a 5 MB audio object to a 100 KB fingerprint.)
6. Audio Fingerprint v/s Cryptographic hash functions
Mathematical equivalence v/s perceptual similarity: Assume X and Y are two objects that are mapped to H(X) and H(Y) by a cryptographic hash function H. Strict mathematical equality of H(X) and H(Y) implies equality of X and Y with a very low probability of error. In the case of audio, we are not interested in strict mathematical equivalence but in perceptual similarity.
Transitivity property: If two sound tracks X and Y are perceptually similar, and Y and Z are perceptually similar, it does NOT follow that X and Z are perceptually similar. Transitivity, by contrast, essentially holds for all mathematical hash functions. Therefore, instead of mathematical equivalence, we use threshold comparisons with a threshold T:
|F(X) - F(Y)| ≤ T implies X and Y are similar
|F(X) - F(Y)| > T implies X and Y are not similar
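The threshold comparison can be sketched as follows. This is a minimal illustration, not the project's implementation: it assumes fingerprints are equal-length byte strings and takes the distance |F(X) - F(Y)| to be the Hamming distance (number of differing bits). The names `hamming_distance` and `is_similar` are hypothetical.

```python
def hamming_distance(fp_x: bytes, fp_y: bytes) -> int:
    """Count the bits that differ between two equal-length fingerprints."""
    if len(fp_x) != len(fp_y):
        raise ValueError("fingerprints must have the same length")
    return sum(bin(a ^ b).count("1") for a, b in zip(fp_x, fp_y))


def is_similar(fp_x: bytes, fp_y: bytes, threshold: int) -> bool:
    """Threshold comparison: similar if at most `threshold` bits differ."""
    return hamming_distance(fp_x, fp_y) <= threshold
```

With a threshold of 2 bits, the fingerprints 0x00 and 0x03 are similar, 0x03 and 0x0f are similar, but 0x00 and 0x0f are not, which shows why similarity under a threshold is not transitive.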
7. System Parameters
Robustness: low false negative rate.
Reliability: low false positive rate.
Fingerprint Size: how many bits per song?
Granularity: what is the minimum input size?
Search Speed: how fast is the search for a particular database size?
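9. (Figure: system architecture. Client: audio input, Codec Layer, samples in unsigned char format, Fingerprint Layer, fingerprint, Protocol Layer with XML parser, yielding album, artist and lyrics. The client sends an HTTP POST request to the server. Server: database (search algorithm), fingerprint and metadata, XML generator, returning XML data.)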
10. Codec Layer (layer stack: Protocol Layer, Fingerprint Layer, Codec Layer) An audio codec is a computer program that compresses and decompresses digital audio data according to a given audio file format, for storage or playback. Examples: AAC, MP3, WMA.
11. Codec Layer: AudioData (WAV)
i) Samples (unsigned char* samples): a buffer of the actual data samples (2 bytes, i.e. 16 bits, per sample).
ii) Byte order (int byteOrder): the byte order of the samples; either CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN.
iii) Number of samples (long size): number of samples read.
iv) Sample rate (int sRate): number of samples per second of audio (samples/sec).
v) Stereo (bool stereo): Boolean value indicating whether the audio is stereo.
vi) Duration: duration of the original audio, regardless of the number of samples.
vii) Format: format of the original audio, expressed as a file extension (.mp3, .wav etc.).
12. Codec Layer: Uncompressed PCM (WAV format) [4]
The “RIFF” chunk descriptor: the format “WAVE” requires two subchunks, “fmt” and “data”.
The “fmt” subchunk describes the format of the data in the “data” subchunk.
The “data” subchunk indicates the size of the sound information and contains the raw sound data.
Field offset (bytes)   Field size (bytes)   Field name       Endian
0                      4                    ChunkID          big
4                      4                    ChunkSize        little
8                      4                    Format           big
12                     4                    Subchunk1 ID     big
16                     4                    Subchunk1 Size   little
20                     2                    Audio Format     little
22                     2                    Num channels     little
24                     4                    Sample rate      little
28                     4                    Byte Rate        little
32                     2                    Block Align      little
34                     2                    BitsPerSample    little
36                     4                    Subchunk2 ID     big
40                     4                    Subchunk2 Size   little
44                     Subchunk2 size       Data             little
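Reading this header can be sketched with Python's standard `struct` module. The sketch assumes the canonical 44-byte PCM header with no extra chunks between “fmt” and “data”; real files may contain additional chunks that this code does not handle. The function name `parse_wav_header` and the dictionary keys are hypothetical.

```python
import struct


def parse_wav_header(header: bytes) -> dict:
    """Parse a canonical 44-byte PCM WAV header into a dictionary."""
    if len(header) < 44:
        raise ValueError("a canonical WAV header is 44 bytes long")
    # ID fields are read as raw ASCII bytes; numeric fields are little-endian.
    chunk_id, chunk_size, fmt = struct.unpack("<4sI4s", header[0:12])
    (sub1_id, sub1_size, audio_format, num_channels, sample_rate,
     byte_rate, block_align, bits_per_sample) = struct.unpack(
        "<4sIHHIIHH", header[12:36])
    sub2_id, sub2_size = struct.unpack("<4sI", header[36:44])
    if chunk_id != b"RIFF" or fmt != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    if sub1_id != b"fmt " or sub2_id != b"data":
        raise ValueError("expected 'fmt ' and 'data' subchunks")
    return {
        "chunk_size": chunk_size,
        "audio_format": audio_format,  # 1 means uncompressed PCM
        "num_channels": num_channels,
        "sample_rate": sample_rate,
        "byte_rate": byte_rate,
        "block_align": block_align,
        "bits_per_sample": bits_per_sample,
        "data_size": sub2_size,
    }
```

The first 44 bytes of a file opened in binary mode can be passed directly, for example `parse_wav_header(f.read(44))`.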
13. Fingerprint Layer The fingerprint layer carries out the core mathematical analysis of the audio, thereby converting a 5 MB audio file into a 100 KB fingerprint (bit string). (Figure: WAV (5 MB) to fea690b1-b11dce98-a… (100 KB))
14. Fingerprint Layer Fingerprint extraction scheme [1]:
Framing: divide the audio file into equally sized frames.
Sub-fingerprinting: for each frame, degradation-invariant features are calculated. Well-known audio features include Fourier coefficients, Mel Frequency Cepstral Coefficients (MFCC), spectral flatness, sharpness and Linear Predictive Coding (LPC). These features are mapped into a more compact representation using classification algorithms such as Hidden Markov Models (HMM), or quantization.
Generating a fingerprint block: one sub-fingerprint is not sufficient to identify an audio clip. The basic unit that is sufficient to identify an audio clip is called a fingerprint block.
15. Fingerprint Layer
$$F(n,m) = \begin{cases} 1 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) > 0 \\ 0 & \text{if } E(n,m) - E(n,m+1) - \bigl(E(n-1,m) - E(n-1,m+1)\bigr) \le 0 \end{cases}$$
where E(n,m) = energy of band m of frame n, and F(n,m) = m-th bit of the sub-fingerprint of frame n.
(Figure: sub-fingerprint extraction pipeline with blocks labelled Framing, F, ABS, Band Division, Energy Computation (∑ x²), T and > 0, producing the bits F(n,0), F(n,1), …, F(n,30), F(n,31).)
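The bit derivation can be sketched in a few lines. The sketch follows the band-difference rule of [1], in which bit m compares the energy difference between adjacent bands m and m+1 in frame n with the same difference in frame n-1. It assumes the band energies have already been computed; 33 bands yield a 32-bit sub-fingerprint. The function name `subfingerprint` and the choice to store F(n,0) as the most significant bit are assumptions for illustration.

```python
def subfingerprint(prev_energies, cur_energies):
    """Derive the sub-fingerprint of frame n as an integer.

    prev_energies: band energies E(n-1, m) of the previous frame
    cur_energies:  band energies E(n, m) of the current frame
    Both sequences must have the same length (bands = bits + 1).
    """
    if len(prev_energies) != len(cur_energies):
        raise ValueError("both frames need the same number of bands")
    bits = 0
    for m in range(len(cur_energies) - 1):
        diff = ((cur_energies[m] - cur_energies[m + 1])
                - (prev_energies[m] - prev_energies[m + 1]))
        # Assumed bit order: F(n,0) ends up as the most significant bit.
        bits = (bits << 1) | (1 if diff > 0 else 0)
    return bits
```

For a three-band toy example, `subfingerprint([0, 0, 0], [3, 1, 2])` gives the two bits 1 and 0.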
16. Protocol Layer The protocol layer accepts the fingerprint from the fingerprint layer and makes an HTTP POST request to the server for the relevant metadata. The protocol layer has two major modules:
HTTP module: implements the POST request to the server, with the fingerprint in the request message.
XML parser: the returned metadata is in XML format; the parser module retrieves the required information such as the artist, album and lyrics.
17. Protocol Layer
HTTP POST request:
POST /path/script.cgi HTTP/1.0
From: vikesh@zeppelin.com
User-Agent: HTTPTool/1.0
Content-Type: application/x-www-form-urlencoded
Content-Length: 32

client_id=42&fingerprint=fea690b1b11dce98a…
Database record: Album: The Wall; Song: Comfortably Numb; Artist: Pink Floyd
XML response (input to the XML parser):
<?xml version="1.0" encoding="UTF-8"?>
<metadata fp="fea690b1b11dce98a…" id="42">
<album>The Wall</album>
<song>Comfortably Numb</song>
<artist>Pink Floyd</artist>
</metadata>
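The two protocol-layer modules can be sketched with the Python standard library. This is an illustration under assumptions, not the project's code: the function names `build_request_body` and `parse_metadata` are hypothetical, the fingerprint value `0123abcd` is a made-up placeholder, and no network request is made.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_request_body(client_id: int, fingerprint_hex: str) -> str:
    """Form-encode the POST body carrying the fingerprint."""
    return urlencode({"client_id": client_id, "fingerprint": fingerprint_hex})


def parse_metadata(xml_bytes: bytes) -> dict:
    """Extract the song metadata from the server's XML response."""
    root = ET.fromstring(xml_bytes)
    if root.tag != "metadata":
        raise ValueError("unexpected root element: " + root.tag)
    return {
        "fingerprint": root.get("fp"),
        "id": root.get("id"),
        "album": root.findtext("album"),
        "song": root.findtext("song"),
        "artist": root.findtext("artist"),
    }


# Placeholder response in the shape shown on the slide.
SAMPLE_RESPONSE = (
    b'<?xml version="1.0" encoding="UTF-8"?>'
    b'<metadata fp="0123abcd" id="42">'
    b"<album>The Wall</album>"
    b"<song>Comfortably Numb</song>"
    b"<artist>Pink Floyd</artist>"
    b"</metadata>"
)
```

`findtext` returns `None` for a missing element, so optional fields such as lyrics could be added to the dictionary in the same way.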
19. Database Architecture To understand the search algorithm, it is essential to understand the database architecture first.
21. The list is stored as a binary large object via object serialization. It contains the following fields: i) songId ii) offset
22. Search algorithm
A brute-force matching approach takes O(n) time, which is unacceptable for any commercial deployment with a large database. For example, consider a moderate fingerprint database of 10,000 songs with an average length of 5 minutes. Recall that every 11.6 ms of audio generates a sub-fingerprint:
Number of sub-fingerprints = (5 × 10,000 × 60) / (11.6 × 10^-3) ≈ 258 million
Assuming a rate of 2 × 10^5 fingerprint comparisons per second [1] on a modern PC, an O(n) algorithm takes about 20 minutes to search this database.
Optimized Algorithm
Assumption: at least one sub-fingerprint has an exact match in the correct song.
The positions in the database where a specific 32-bit sub-fingerprint is located are retrieved using the database architecture shown already. The fingerprint database contains a lookup table (LUT) with an entry for every possible 32-bit sub-fingerprint. Every entry points to a list of pointers to the positions in the real fingerprint lists where that 32-bit sub-fingerprint occurs.
Assume the same 10,000-song database with each song approximately 5 minutes long, leading to about 250 million sub-fingerprints. Assuming all sub-fingerprint values are equally likely, the average number of positions in a list is:
Average list size = 250,000,000 / 2^32 ≈ 0.058
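23. Search algorithm (continued)
Average number of comparisons per identification, for a fingerprint block of 256 sub-fingerprints = 0.058 × 256 ≈ 15
Allowing a factor of 20 because sub-fingerprints are not uniformly distributed in practice [1], this becomes about 15 × 20 = 300 comparisons.
Average time for the algorithm = 300 / (2 × 10^5) s = 1.5 ms
Improvement over brute force = (20 × 60 s) / (1.5 × 10^-3 s) = 800,000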
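The lookup-table search can be sketched as follows. This is a simplified in-memory illustration under assumptions: the LUT is a Python dictionary rather than a table with all 2^32 entries, each list entry holds a (songId, offset) pair, and candidates are scored by total Hamming distance over the query block. The names `build_lut`, `hamming` and `search` are hypothetical.

```python
from collections import defaultdict


def build_lut(database):
    """Map each sub-fingerprint value to the (song_id, offset) positions
    where it occurs. `database` maps song_id -> list of sub-fingerprints."""
    lut = defaultdict(list)
    for song_id, subfps in database.items():
        for offset, value in enumerate(subfps):
            lut[value].append((song_id, offset))
    return lut


def hamming(a, b):
    """Number of differing bits between two sub-fingerprints."""
    return bin(a ^ b).count("1")


def search(query, database, lut, max_bit_errors):
    """Return (song_id, start_offset, bit_errors) of the best match, or None.

    Relies on the stated assumption that at least one query
    sub-fingerprint matches exactly at the correct position.
    """
    best = None
    for i, value in enumerate(query):
        for song_id, offset in lut.get(value, ()):
            start = offset - i  # align the whole query block with the song
            track = database[song_id]
            if start < 0 or start + len(query) > len(track):
                continue
            errors = sum(hamming(q, t)
                         for q, t in zip(query, track[start:start + len(query)]))
            if errors <= max_bit_errors and (best is None or errors < best[2]):
                best = (song_id, start, errors)
    return best
```

Only positions reached through an exact sub-fingerprint hit are compared in full, which is what reduces the work from a scan of the entire database to a handful of candidate comparisons.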
27. References
[1] Jaap Haitsma and Ton Kalker, “A highly robust audio fingerprinting system”, Philips Research, Eindhoven, The Netherlands, October 2001.
[2] Music IP corporation, Available HTTP: musicip.com
[3] Neuschmied H., Mayer H. and Batlle E., “Identification of Audio Titles on the Internet”, Proceedings of International Conference on Web Delivering of Music 2001, Florence, Italy, November 2001.
[4] Microsoft-IBM Wave file format, Available HTTP: ccrma.stanford.edu/courses/422/projects/WaveFormat/
[5] Haitsma J., Kalker T. and Oostveen J., “Robust Audio Hashing for Content Identification”, Content Based Multimedia Indexing 2001, Brescia, Italy, September 2001.