Query By Humming - Music Retrieval Technique


Veermata Jijabai Technological Institute
132011005
QUERY BY HUMMING
Seminar Report
Shital Katkar
SEMINARS OF SEMESTER II [YEAR 2013-2014]
NAME: SHITAL KATKAR
TOPIC: Query By Humming
SIGNATURE: ________________
INDEX
1 Introduction
  1.1 Query By Humming
2 Basic Architecture
  2.1 Extraction
  2.2 Transcription
  2.3 Comparison
3 Applications
  3.1 Shazam
  3.2 SoundHound
  3.3 Midomi
  3.4 Musipedia
4 The Art of Singing
  4.1 Challenges
5 File Formats
  5.1 WAV File Format
  5.2 MIDI File Format
6 System Architecture
  6.1 WAV to MIDI Conversion
7 Parsons Code Algorithm
  7.1 Rules
  7.2 Advantages
8 Benchmarking MIR Systems
  8.1 Online MIR Systems
    8.1.1 CatFind
    8.1.2 MelDex
    8.1.3 MelodyHound
    8.1.4 ThemeFinder
    8.1.5 Music Retrieval Demo
  8.2 Comparison of MIR Systems
  8.3 Evaluation Issues
  8.4 Subjective and Objective Testing
9 Conclusion
1. INTRODUCTION

Many people remember a short tidbit of a song but fail to recall the song's name. If you can remember lyrics from the song you are trying to recall, finding it is as easy as performing a text query on a web search engine. A query by humming system allows a user to find a song even if he merely knows part of the melody.

"I don't know the name. I don't know who does it. But I can't get this song out of my head." Well, why not just hum it?

Query by humming is a music retrieval technology in which users hum or sing a melody to retrieve the song. The user simply sings or hums the tune into a computer microphone, and the system searches a database of songs for melodies containing the tune and returns a ranked list of search results. The user can then find the desired song by listening to the results.
A Query by Humming (QBH) system enables a user to hum a melody into a microphone connected to a computer in order to retrieve a list of possible song titles that match the query melody. The system analyzes the melodic and rhythmic information of the input signal, and the extracted data set is used as a database query. The result is presented as a list of, for example, the ten best matches. A QBH system is one kind of Music Information Retrieval (MIR) system: an MIR system provides several means of music retrieval, which may include a hummed audio signal, but also music genre classification or text information about the artist or title.
2. BASIC ARCHITECTURE

Fig: Basic System Architecture

The basic architecture of the system is depicted in the figure above. A microphone captures the hummed input and sends it as a PCM signal to the extraction block. The extracted information is passed to the transcription block, which forms a melody contour to be compared with all contours residing in the database. A result list is finally presented to the user.

Extraction
The extraction block is also referred to as the acoustic front end. After the signal is recorded with a computer sound card, it is band-pass filtered to reduce environmental noise and distortion. In this system a sampling rate of 8000 Hz is used. The signal is band-limited to 80-800 Hz, which is sufficient for sung input; this frequency range corresponds to the musical note range D2-G5.

Transcription
The transcription block transcribes the extracted information into the representation needed for comparison. The main task is to segment the input stream into single notes. This can be done using the Parsons code algorithm.
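The band-limiting step can be illustrated with a minimal NumPy sketch. This is a brick-wall FFT filter standing in for a proper band-pass filter, since the report does not specify the actual filter design; the test signal (a 220 Hz "sung" tone plus out-of-band hiss) is purely illustrative.

```python
import numpy as np

def bandpass_fft(signal, fs, lo=80.0, hi=800.0):
    """Zero out spectral components outside [lo, hi] Hz.

    A brick-wall FFT filter: crude compared with a real band-pass
    design, but it shows the 80-800 Hz limiting described above.
    """
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    mask = (freqs >= lo) & (freqs <= hi)
    return np.fft.irfft(spectrum * mask, n=len(signal))

fs = 8000                                    # sampling rate used by the system
t = np.arange(fs) / fs                       # one second of audio
voice = np.sin(2 * np.pi * 220.0 * t)        # a sung A3, inside the 80-800 Hz band
hiss = 0.5 * np.sin(2 * np.pi * 3000.0 * t)  # out-of-band environmental noise
filtered = bandpass_fft(voice + hiss, fs)    # the hiss is removed, the voice kept
```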
Comparison
The transcription result is used as a database query. Several distance measures can be used to find a similar piece of music. The database contains a collection of already-transcribed melodies formatted according to the MelodyContourType. The result is finally presented to the user.
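One simple distance measure for textual contour representations is the Levenshtein edit distance, which tolerates dropped, added, or wrong notes. A minimal sketch; the song titles and contour strings in the toy database are illustrative, not from the report:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and
    substitutions needed to turn string a into string b."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # delete
                                     dp[j - 1] + 1,    # insert
                                     prev + (ca != cb))  # substitute
    return dp[-1]

# Toy contour database (hypothetical entries for illustration)
database = {"Twinkle Twinkle": "*RURURD", "Ode to Joy": "*RUURDD"}
query = "*RURUD"  # imperfect hum: one note dropped
ranked = sorted(database, key=lambda title: edit_distance(query, database[title]))
```

Songs are ranked by increasing distance, so the closest melody appears first even though the query is not an exact match.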
3. APPLICATIONS

The following are some examples of QBH and related music identification systems.

Shazam
Shazam is a commercial mobile-phone-based music identification service. The company was founded in 1999 by Chris Barton, Philip Inghelbrecht, Avery Wang and Dhiraj Mukherjee. Shazam uses a mobile phone's built-in microphone to gather a brief sample of the music being played. An acoustic fingerprint is created from the sample and compared against a central database for a match. If a match is found, information such as the artist, song title, and album is relayed back to the user. Shazam can identify prerecorded music being broadcast from any source, such as a radio, television, cinema or club, provided that the background noise level is not high enough to prevent an acoustic fingerprint from being taken, and that the song is present in the software's database.
SoundHound
SoundHound (known as Midomi until December 2009) is a mobile service that allows users to identify music by humming, singing or playing a recorded track. The service was launched by Melodis Corporation (now SoundHound Inc.) under Chief Executive Keyvan Mohajer in 2007 and has received funding from Global Catalyst Partners, TransLink Capital and Walden Venture Capital. SoundHound is a music search engine available on the Apple App Store, Google Play, the Windows Phone Store and, since June 5, 2013, the BlackBerry 10 platform. It enables users to identify music by playing, singing or humming a piece; it is also possible to speak or type the name of the artist, composer, song or piece. Unlike its competitor Shazam, SoundHound can recognise tracks from singing, humming, speaking or typing, as well as from a recording. Sound matching is achieved through the company's 'Sound2Sound' technology, which can match even poorly-hummed performances to professional recordings.
Midomi
Midomi is a web-based music search tool: users sing, hum, or whistle to find music and connect with a community that shares their musical interests. On midomi, users can create a profile, sing their favorite songs, share them with friends, and get discovered by other midomi users. Users can also listen to and rate other users' musical performances, see their pictures, send them messages, buy original music, and more. Midomi features an extensive digital music store with a growing collection of more than two million legal music tracks; users can listen to samples of original recordings, buy the full studio versions directly from midomi, and play them on a Windows computer or compatible music player.
Musipedia
Musipedia is a search engine for identifying pieces of music. This can be done by whistling a theme, playing it on a virtual piano keyboard, tapping the rhythm on the computer keyboard, or entering the Parsons code. Anybody can modify the collection of melodies and enter MIDI files, bitmaps of sheet music, lyrics or text about a piece, or melodic contours as Parsons code. Musipedia's search engine works differently from that of a search engine such as Shazam. Shazam can identify short snippets of audio (a few seconds taken from a recording), even when transmitted over a phone connection; it uses audio fingerprinting, a technique for identifying specific recordings. Musipedia, on the other hand, can identify pieces of music that contain a given melody. Shazam finds exactly the recording that contains a given snippet, but no other recordings of the same piece.
4. THE ART OF SINGING

People often have imperfect memories for melodies and may lack any formal singing practice.

1. People sing any part of the melody. A repetitive melodic passage may represent the 'hook line' of a song that 'gets stuck in people's heads'.

2. People sing in the wrong key. People choose a random pitch to start their singing; only for their most favorite songs are people thought to have a latent ability of absolute pitch.

3. People sing at a reasonably correct global tempo. People knew or had a feeling, from previous hearings, what the correct tempo was and were able to approach it reasonably accurately, but it is still not possible to sing at an exactly correct tempo.

4. People sing too many or too few notes. Human memory is too imperfect to recall all pitches in the right order; people sang just the line they remembered. They also added all kinds of ornaments (e.g., grace notes or filler notes) to beautify their singing or to ease the muscular motor processes involved in singing.

5. People sing wrong intervals or confuse some with others. People sang about 59% of the intervals correctly, though there were differences due to singing experience, song familiarity and recent song exposure. Interval confusion seems to be symmetric: interchanging one interval with another was found to be as likely as the other way around. Large intervals (thirds and larger) tend to be more easily interchanged.

6. People sing the contour reasonably accurately. People largely knew when to go up and when to go down in pitch when singing; they did so correctly about 80% of the time.

7. People with singing experience sing better in some respects than people without it. The non-experienced and experienced singers did not differ in singing the contour of a melody accurately; however, experienced singers reproduced proportionally more correct intervals and sang with better timing.

8. People sing familiar melodies better than less familiar ones. Less familiar melodies were reproduced with fewer notes and had proportionally fewer correct intervals than familiar melodies. Also, both experienced and non-experienced singers improved their singing of intervals when they had heard the melody very recently.
4.1 CHALLENGES

Building such a system presents significantly greater challenges than creating a conventional text-based search engine. Unlike lyrical content, there is no intuitively obvious way to represent and store melodic content in a database, and the chosen representation must be indexable for efficient searching. Furthermore, several issues unique to query by humming systems pose significant challenges to creating an efficient and accurate music search system.

1. Users may not make perfect queries. Even if a user has a perfect memory of a particular tune, he may start in the wrong key, or he may hum a few notes off-pitch throughout the course of the tune. Sometimes he may even drop some notes entirely or add notes that did not exist in the original melody. Additionally, no user can be expected to hum at exactly the same tempo as the songs stored in the database. Finally, since none of these errors are mutually exclusive, a humming query may contain any combination of them.

2. Accurately capturing pitches and notes from user hums is difficult, even if the user manages to submit a perfect query. Currently existing software for converting raw audio data into discrete pitch information is mediocre at best and will often introduce a great deal of noise when extracting the pitches from a user's hum.

3. Similarly, accurately capturing melodic information from a pre-recorded music file is difficult. Properly extracting the melody from a given song is a field of study on its own, but it is absolutely critical for an accurate query by humming system, which would be of little use if the database contained inaccurate representations of the target songs.
5. FILE FORMATS

WAV File Format
WAVE or WAV is short for the Waveform Audio File Format (occasionally referred to as Audio for Windows). The WAV format is compatible with Windows, Macintosh and Linux. Although a WAV file can hold compressed audio, the most common use is to store uncompressed audio in linear PCM (LPCM). The standard Audio CD format, for example, is LPCM audio, 2-channel, with a sampling frequency of 44,100 Hz and 16 bits per sample. As a format derived from the Resource Interchange File Format (RIFF), WAV files can carry metadata (tags) in the INFO chunk; in addition, WAV files can contain metadata in the Extensible Metadata Platform (XMP) standard. Uncompressed WAV files are quite large, so as file sharing over the Internet became popular, the WAV format declined in popularity. However, it is still a widely used, relatively "pure", i.e. lossless, file type, suitable for retaining "first generation" archived files of high quality, or for use on a system where high-fidelity sound is required and disk space is not restricted.

MIDI File Format
MIDI stands for Musical Instrument Digital Interface and is essentially a communications protocol for computers and electronic musical instruments. Although MIDI files are not the same as the typical digital audio formats we use to listen to music (such as MP3, AAC or WMA), MIDI files can still be thought of as digital music. Rather than an actual audio recording stored as binary data, a MIDI file in its simplest form is made up of information that describes which musical notes are to be played, along with the types of instruments that are to be used.
MIDI files therefore do not contain any 'real world' recordings such as voice (e.g., audio books) or live performances. However, MIDI files are very small and can be played on a wide range of devices that support the MIDI protocol, including cell phones, smartphones, and computers with the right software. Examples of the MIDI file format are monophonic and polyphonic ringtones.

In the QBH system we chose to create our database from songs in the MIDI file format, because the MIDI representation already discretizes the notes, making it easy to extract the pitch and timing information necessary for song matching. Alternative music file formats such as WAV, MP3 or AIFF would require complicated waveform and signal processing that could introduce many inaccuracies. Each song is also mapped to a set of metadata attributes, such as song name and artist, for eventual display in the GUI result list.
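Because a MIDI file stores note numbers rather than waveforms, pitch extraction reduces to arithmetic. A small sketch of the standard conversions (equal temperament, A4 = MIDI note 69 = 440 Hz, middle C = MIDI note 60 = C4):

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(note):
    """MIDI note number -> scientific pitch name (60 = middle C = C4)."""
    return NOTE_NAMES[note % 12] + str(note // 12 - 1)

def midi_to_freq(note):
    """Equal-temperament frequency in Hz, with A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12.0)
```

This is why MIDI-based databases avoid the signal-processing inaccuracies mentioned above: the discrete pitch is read directly from the file.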
6. SYSTEM ARCHITECTURE

The architecture is illustrated in the figure above. Operation of the system is straightforward: queries are hummed into a microphone, digitized, and fed into a pitch-tracking module. The result, a contour representation of the hummed melody, is fed into the query engine, which produces a ranked list of matching melodies. The database of melodies is acquired by processing public-domain MIDI songs and is stored as a flat-file database. Hummed queries may be recorded in a variety of formats. The query engine uses an approximate pattern matching algorithm in order to tolerate humming errors.

The melody database is essentially an indexed set of soundtracks. The acoustic query, typically a few notes hummed by the user, is processed to detect its melody line, and the database is searched to find the songs that best match the query. While the overall task is one that is easily performed by humans, many challenging problems arise in implementing an automatic system. These include the signal processing needed for extracting the melody from the stored audio and from the acoustic query, and the pattern matching algorithms needed to achieve proper ranked retrieval. Further, a robust system must be able to account for inaccuracies in the user's singing.
6.1 WAV TO MIDI CONVERSION

To create a MIDI file for a song recorded in WAV format, a musician must determine the pitch, velocity and duration of each note being played and record these parameters as a sequence of MIDI events. The MIDI file created represents the basic melody and chords of the recognized music. The difference between the WAV and MIDI formats lies in their representation of sound and music: WAV is a digital recording of any sound (including speech), whereas MIDI is principally a sequence of notes (MIDI events). Here we produce an output file (.mid) from an input file (.wav) that contains musical data, together with a tone file (.wav) that consists of monotone data.

An advantage of this structure is that the query is prepared on the client side of the system, so the query sent is very short. Besides, it is possible to evaluate its quality before sending it to the server: the system provides playback of the recognized melody notes in MIDI format, which allows the user to listen to a query and decide either to send it to the server or to sing it once again.
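The report does not specify how the pitch of each note is determined. One common approach, shown here as a hedged NumPy sketch rather than the system's actual method, estimates the fundamental frequency of a windowed frame by autocorrelation, restricted to the 80-800 Hz singing range used by the front end:

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=80.0, fmax=800.0):
    """Estimate the fundamental frequency of one audio frame (Hz).

    Autocorrelation method: the lag at which the signal best matches
    a shifted copy of itself approximates one pitch period. The lag
    search is limited to the 80-800 Hz range of sung input.
    """
    frame = frame - frame.mean()                      # remove DC offset
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)           # lag bounds for the band
    lag = lo + int(np.argmax(corr[lo:hi]))            # strongest periodicity
    return fs / lag

fs = 8000                                  # sampling rate from the front end
t = np.arange(1024) / fs
frame = np.sin(2 * np.pi * 220.0 * t)      # a synthetic "hummed" A3
pitch = estimate_pitch(frame, fs)          # close to 220 Hz
```

The estimated pitch can then be rounded to the nearest MIDI note number to form the MIDI event sequence described above.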
7. PARSONS CODE ALGORITHM

The Parsons code, formally named the Parsons Code for Melodic Contours, is a simple notation used to identify a piece of music through melodic motion: the motion of the pitch up and down. Denys Parsons developed this system for his 1975 book, The Directory of Tunes and Musical Themes. Representing a melody in this manner makes it easy to index or search for particular pieces. User input to the system (humming) is converted into a sequence of relative pitch transitions. Each note in the input is classified relative to the previous note:

1. U = "up", if the note is higher than the previous note
2. D = "down", if the note is lower than the previous note
3. R = "repeat", if the note is the same pitch as the previous note
4. * = the first note, used as the reference
The first note is C (MIDI note 72); we take it as the reference note and write *. The second note is also C; since it repeats, we write R. The next note is G, which is higher than C, so we write U. For the second G we write R, and so on. This textual pattern is stored in the database for comparison.

Advantages
1. The pattern remains the same even if the user hums the tune in a different scale, or hums some notes off-key.
2. It requires less space, since it is stored as text.
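The rules above fit in a short function. The MIDI note numbers correspond to the slide's C-C-G-G-A-A-G example; note that transposing every note by the same amount leaves the code unchanged, which is the first advantage listed:

```python
def parsons_code(midi_notes):
    """Convert a note sequence to Parsons code:
    * marks the first note; then U = up, D = down, R = repeat."""
    code = "*"
    for prev, note in zip(midi_notes, midi_notes[1:]):
        code += "U" if note > prev else "D" if note < prev else "R"
    return code

# The slide's example: C C G G A A G starting at MIDI note 72
melody = [72, 72, 79, 79, 81, 81, 79]
code = parsons_code(melody)                       # "*RURURD"
transposed = parsons_code([n - 12 for n in melody])  # one octave down: same code
```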
8. BENCHMARKING MUSIC INFORMATION RETRIEVAL SYSTEMS

This section is based on the research paper "Benchmarking Music Information Retrieval Systems" by Josh Reiss (josh.reiss@elec.qmul.ac.uk) and Mark Sandler (mark.sandler@elec.qmul.ac.uk), Department of Electronic Engineering, Queen Mary, University of London, Mile End Road, London E1 4NS, UK.

The goal of this research is to create an accurate and effective benchmarking system for music information retrieval (MIR) systems. This serves the multiple purposes of inspiring the MIR community to add features and speed to existing projects, to measure the performance of their work, and to incorporate the ideas of other works. To date there has been no systematic, rigorous review of the field, and thus there is little knowledge of when an MIR implementation might fail in a real-world setting.

ONLINE MIR SYSTEMS

For the purposes of this work, five online MIR systems were considered. The systems all have certain properties in common: they may all be used online via the World Wide Web, they are all used by entering a query concerning a piece of music, and all may return information about music matching that query. However, these systems differ greatly in their features, goals and implementation. These differences are discussed in detail below.

CatFind
CatFind allows one to search MIDI files using either a musical transcription or a melodic profile based on the Parsons code. It has minimal features and was intended primarily for demonstration. Although it seems unlikely that this system will be extended, it is still useful here as a system for comparison.
MelDex
MelDex allows searching of the New Zealand Digital Library. The MELody inDEX system is designed to retrieve melodies from a database on the basis of a few notes sung into a microphone. It accepts acoustic input from the user, transcribes it into common music notation, then searches a database for tunes that contain the sung pattern, or patterns similar to it. Thus the query is audio, although the retrieved files are in symbolic representation. Retrieval is ranked according to the closeness of the match, and a variety of mechanisms are provided to control the search, depending on the precision of the input.

MelodyHound
This melody recognition system was developed by Rainer Typke in 1997. It was originally known as "Tuneserver" and hosted by the University of Karlsruhe. It searches directly on the Parsons code and was designed initially for query by whistling; that is, it returns the song in the database that most closely matches a whistled query.

ThemeFinder
Themefinder, created by David Huron et al., allows one to identify common themes in Western classical music, folksongs, and Latin motets of the sixteenth century. Themefinder provides a web-based interface to the Humdrum thema command, which in turn allows searching of databases containing musical themes or incipits (opening note sequences). Themes and incipits available through Themefinder are first encoded in the kern music data format, and groups of incipits are assembled into databases. Currently there are three databases: Classical Instrumental Music, European Folksongs, and Latin Motets from the Sixteenth Century. Matched themes are displayed on-screen in graphical notation.

Music Retrieval Demo
The Music Retrieval Demo is notably different from the other MIR systems considered here: it performs similarity searches on raw audio data (WAV files), and no transcription of any kind is applied.
It works by calculating the distance between the selected file and all other files in the database. The other files can then be displayed in a list ranked by their similarity, such that the more similar files are nearer the top. Distances are computed between templates, which are representations of the audio files, not the audio itself. The waveform is Hamming-windowed into overlapping segments; each segment is processed into a spectral representation of Mel-frequency cepstral coefficients (MFCCs). This is a data-reducing transformation that replaces each 20 ms window with 12 cepstral coefficients plus an energy term, yielding a 13-valued vector. The next step is to quantize each vector using a specially-designed quantization tree, which recursively divides the vector space into bins, each corresponding to a leaf of the tree. Any MFCC vector falls into one and only one bin. Given a segment of audio, the distribution of its vectors across the bins characterizes that audio; counting how many vectors fall into each bin yields a histogram template that is used in the distance measure. For this demonstration, the distance between audio files is the simple Euclidean distance between their corresponding templates (or rather 1 minus the distance, so that closer files have larger scores). Once scores have been computed for each audio clip, they are sorted by magnitude to produce a ranked list, like other search engines.

COMPARISON OF MIR SYSTEMS

Table 1 presents a comparison of the features of the various MIR systems under investigation. Note first that each of these systems was designed for a different purpose, and none of them can be considered a finished product. The table gives an overview of the state of the available MIR systems, the features one may wish to include in an MIR system, and the areas where improvement is most necessary. It also highlights the need for a standardized testbed.

Each MIR system uses a different database of files for audio retrieval. Both CatFind and the Music Retrieval Demo have databases with fewer than 500 files, so any benchmarking estimates, such as retrieval times and efficiency, are rendered useless. MelDex, MelodyHound and ThemeFinder have databases containing over 10,000 files, which should be sufficient for estimating search efficiency and scalability.

EVALUATION ISSUES

Table 1 listed and compared the features available in existing online MIR systems. However, this is not sufficient for effective benchmarking and evaluation of the music information retrieval systems that may appear in the near future and be used with large file collections. The question of what features to evaluate is determined by what we can measure that reflects the ability of the system to satisfy the user. In a landmark paper, Cleverdon [21] listed six main measurable quantities; this has become known as the Cranfield model of information retrieval evaluation. Here, those properties are listed and modified as applicable for MIR.

1. The coverage of the collection, that is, the extent to which the system includes relevant matter.

2. The time lag, that is, the average interval between the time the search request is made and the time an answer is given. Consideration should also be made of worst-case or near-worst-case scenarios: certain genres or formats of music, as well as certain types of queries (e.g., query and retrieval of polyphonic transcription-based audio), may require far more time than others.
Furthermore, if the testbed is particularly large, dispersed or unindexed, as in a peer-to-peer network, then bandwidth limitations and scalability concerns may greatly reduce efficiency even as the collection size grows.
3. The form of presentation of the output. For MIR systems this not only means having the option of retrieving various formats, symbolic and audio, but also implies identifying multiple performances of the same composition.

4. The effort involved on the part of the user in obtaining answers to his search requests. So far, MIR research has been dominated by audio engineers, computer scientists, musicologists and librarians; as the field expands to include developers and user-interface experts, this issue will acquire more significance.

5. The recall of the system, that is, the proportion of relevant material actually retrieved in answer to a search request.

6. The precision of the system, that is, the proportion of retrieved material that is actually relevant.
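The last two Cranfield measures, recall and precision, can be computed directly from a query's result set. A minimal sketch with hypothetical song identifiers:

```python
def precision_recall(retrieved, relevant):
    """Cranfield-style precision and recall for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 songs returned, 3 songs actually relevant
p, r = precision_recall(retrieved=["song1", "song2", "song3", "song4"],
                        relevant=["song1", "song3", "song5"])
# p = 0.5 (2 of 4 retrieved are relevant), r = 2/3 (2 of 3 relevant retrieved)
```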
9. CONCLUSION

Music retrieval is becoming more natural, simple and user-friendly with the advancement of QBH, and this technology opens broader application prospects for music retrieval. Using the Parsons code algorithm, it becomes easy to implement a query matching system. In this work we have laid down a framework for benchmarking future MIR systems. At the moment the field is in its infancy: there are only a handful of MIR systems available online, each quite limited in scope. Still, these benchmarking techniques were applied to five online systems, and proposals were made concerning future benchmarking of full online audio retrieval systems. It is hoped that these recommendations will be considered and expanded upon as such systems become available.
10. REFERENCES

1. J. Reiss and M. Sandler, "Benchmarking Music Information Retrieval Systems," Department of Electronic Engineering, Queen Mary, University of London.
2. J.-M. Batke, G. Eisenberg, P. Weishaupt, and T. Sikora, "A Query by Humming System Using MPEG-7 Descriptors," Communication Systems Group, Technical University of Berlin.
3. E. Lau, A. Ding, and C. On, "MusicDB: A Query by Humming System," 6.830 Database Systems Final Project Report, Massachusetts Institute of Technology.