WHAT IS SOUND
Physical - sound as a disturbance in the air
Psychophysical - sound as perceived by the listener
Sound as stimulus (physical event) & sound
as a sensation.
Pressure changes (in the band from 20 Hz to 20 kHz)
ACOUSTICS is the study of sound.
16 February 2016
HOW DO WE HEAR
Ear connected to the brain
left brain: speech
right brain: music
Ear's sensitivity to frequency is logarithmic
Varying frequency response
Dynamic range is about 120 dB (at 3-4 kHz)
Frequency discrimination 2 Hz (at 1 kHz)
Intensity change of 1 dB can be detected.
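The decibel and frequency figures above can be turned into plain ratios. This is a minimal sketch, assuming the usual convention that a sound-pressure level difference in dB is 20 times the base-10 logarithm of the pressure ratio (10 times for intensity); the function names are illustrative only.

```python
import math

def db_to_pressure_ratio(db):
    """Convert a level difference in dB to a sound-pressure (amplitude) ratio."""
    return 10 ** (db / 20)

def db_to_intensity_ratio(db):
    """Convert a level difference in dB to an intensity (power) ratio."""
    return 10 ** (db / 10)

def octaves_between(f_low, f_high):
    """Logarithmic frequency distance: number of doublings from f_low to f_high."""
    return math.log2(f_high / f_low)

dynamic_range = db_to_pressure_ratio(120)      # pressure ratio across a 120 dB range
just_noticeable = db_to_pressure_ratio(1)      # pressure ratio for a 1 dB step
audible_octaves = octaves_between(20, 20_000)  # width of the 20 Hz to 20 kHz band
```

Under these assumptions a 120 dB range corresponds to a pressure ratio of about a million to one, a 1 dB step to a pressure change of roughly 12%, and the audible band spans just under ten octaves.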
FUNDAMENTALS
• Digital audio is sound reproduction using pulse-code modulation and digital signals
• Digital audio systems include analog-to-digital conversion (ADC), digital-to-
analog conversion (DAC), digital storage, processing and transmission
• A primary benefit of digital audio is in its convenience of storage,
transmission and retrieval
• Digital audio is useful in the recording, manipulation, mass-production, and
distribution of sound
• Modern distribution of music across the Internet via on-line stores depends
on digital recording and digital compression algorithms
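The ADC and DAC steps named above can be illustrated with a toy pulse-code modulation round trip. This is only a sketch under stated assumptions: the 8000 Hz sample rate, the 440 Hz test tone, and the function names `pcm_encode` and `pcm_decode` are made up for illustration, and real converters involve filtering and dithering that are not shown.

```python
import math

def pcm_encode(signal, bits=16):
    """Quantize floats in [-1.0, 1.0] to signed integers (a simple ADC model)."""
    max_code = 2 ** (bits - 1) - 1          # 32767 for 16-bit audio
    return [round(max(-1.0, min(1.0, x)) * max_code) for x in signal]

def pcm_decode(codes, bits=16):
    """Map integer codes back to floats (a simple DAC model)."""
    max_code = 2 ** (bits - 1) - 1
    return [c / max_code for c in codes]

sample_rate = 8000                           # samples per second (assumed)
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(80)]
codes = pcm_encode(tone)                     # what would be stored or transmitted
restored = pcm_decode(codes)                 # what the DAC would reconstruct
max_error = max(abs(a - b) for a, b in zip(tone, restored))
```

The quantization error stays below one quantization step, which is why 16-bit storage is usually treated as transparent for playback.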
1 song = 27.2 MB
1 GB Hard Drive
($899 in 1995)
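The figures above make the case for compression with simple arithmetic. A quick sketch, assuming 1 GB is counted as 1000 MB (the slide does not say which convention it uses):

```python
song_mb = 27.2          # size of one song, from the slide
drive_mb = 1000         # 1 GB drive, assuming 1 GB = 1000 MB
drive_cost = 899        # price in 1995, from the slide

songs_per_drive = int(drive_mb // song_mb)      # whole songs that fit
cost_per_song = drive_cost / songs_per_drive    # storage cost per song
```

With these numbers the drive holds 36 songs, at roughly $25 of storage per song.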
• Audio data compression, as distinguished from dynamic range
compression, has the potential to reduce the transmission bandwidth
and storage requirements of audio data.
• Audio compression algorithms are implemented in software as audio codecs.
• Lossy audio compression algorithms provide higher compression at
the cost of fidelity and are used in numerous audio applications.
• Lossless audio compression produces a representation of digital data
that decompresses to an exact digital duplicate of the original audio.
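The lossless property can be checked directly: compress, decompress, and compare bytes. The sketch below uses Python's general-purpose `zlib` module as a stand-in, not an audio-specific codec, and the 440 Hz test tone and 8000 Hz sample rate are assumptions chosen for illustration.

```python
import math
import struct
import zlib

# One second of a 440 Hz tone as 16-bit little-endian PCM (assumed test signal).
sample_rate = 8000
samples = [round(32767 * math.sin(2 * math.pi * 440 * n / sample_rate))
           for n in range(sample_rate)]
pcm = struct.pack(f"<{len(samples)}h", *samples)

compressed = zlib.compress(pcm, 9)     # generic lossless compression
restored = zlib.decompress(compressed)

is_exact_copy = restored == pcm        # lossless: every byte is recovered
ratio = len(pcm) / len(compressed)     # how much smaller the stored form is
```

A lossy codec would trade this exact equality for a higher compression ratio.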
• Audio signal processing, sometimes referred to as audio processing,
is the intentional alteration of auditory signals, or sound, often
through an audio effect or effects unit.
• As audio signals may be electronically represented in either digital or
analog format, signal processing may occur in either domain.
• Analog processors operate directly on the electrical signal, while
digital processors operate mathematically on the digital
representation of that signal.
• Processing methods and application areas include storage, level
compression, data compression, transmission, etc.
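As an example of a digital processor operating mathematically on the sample values, here is a minimal gain stage. It is a sketch only: the function name `apply_gain_db` and the hard-clipping behavior are assumptions, not something the slides specify.

```python
def apply_gain_db(samples, gain_db, bits=16):
    """Scale integer PCM samples by gain_db decibels, clipping to the legal range."""
    factor = 10 ** (gain_db / 20)                 # dB to amplitude ratio
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return [max(lo, min(hi, round(s * factor))) for s in samples]

quiet = [0, 1000, -2000, 30000]
louder = apply_gain_db(quiet, 6.0)   # +6 dB roughly doubles the amplitude
```

The last sample shows why level control matters: once the scaled value exceeds the 16-bit range it is clipped, which distorts the signal.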
AUDIO FINGERPRINT DEFINITION
An audio fingerprint is essentially a hash function that maps an audio object of a
large number of bits to a ‘fingerprint’ of only a limited number of bits. The audio
object can be uniquely identified from this bit string.
AUDIO FINGERPRINTING ARCHITECTURE
i) Samples (unsigned char* samples)
A buffer of the actual data samples (2 bytes or 16 bits per sample)
ii) Byte Order (int byteOrder)
The byte order of the samples. This can be CONST_LITTLE_ENDIAN or CONST_BIG_ENDIAN
iii) Number of samples (long size)
Number of samples read.
iv) Sample rate (int sRate)
The number of samples per second of audio (samples/sec)
v) Stereo (bool stereo)
Boolean value indicating whether the audio is stereo
Duration of the original audio regardless of the number of samples.
Format of the original audio. This will be expressed as a file extension (.mp3, .wav, etc.)
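The parameters above are enough to turn the raw buffer into usable sample values. The sketch below makes several assumptions that the slide does not state: samples are signed, stereo data is interleaved (left, right, left, right), `size` counts samples across all channels, and the numeric values of the two byte-order constants are invented here.

```python
import struct

# The constant names come from the slide; their numeric values are assumed here.
CONST_LITTLE_ENDIAN = 0
CONST_BIG_ENDIAN = 1

def decode_samples(samples, byte_order, size, s_rate, stereo):
    """Unpack a raw byte buffer of 16-bit samples into per-channel integer lists."""
    prefix = "<" if byte_order == CONST_LITTLE_ENDIAN else ">"
    values = struct.unpack(f"{prefix}{size}h", samples[: size * 2])
    if stereo:
        # Assumes interleaved frames: left, right, left, right, ...
        channels = [list(values[0::2]), list(values[1::2])]
    else:
        channels = [list(values)]
    seconds = len(channels[0]) / s_rate   # duration covered by this buffer
    return channels, seconds

raw = bytes([0x01, 0x00, 0xFF, 0xFF, 0x00, 0x80, 0xFF, 0x7F])
channels, seconds = decode_samples(raw, CONST_LITTLE_ENDIAN, 4, 4, stereo=True)
```

Reading the same bytes with the wrong byte order yields completely different values, which is why the byte order has to travel with the buffer.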
The fingerprint layer carries out the core mathematical analysis of the audio,
converting, for example, a 5 MB audio file into a 100 KB fingerprint (bit string)
POST /path/script.cgi HTTP/1.0

<?xml version="1.0" encoding="UTF-8"?>
<metadata fp="fea690b1b11dce98a…" id="42">
<album>Dark Side of the moon</album>
</metadata>
Album: Dark Side of the moon
Song: Comfortably Numb
Artist: Pink Floyd
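A metadata document shaped like the one above can be read with a standard XML parser. This is a sketch under assumptions: the closing `</metadata>` tag and the exact element layout are guessed from the fragment shown, and the fingerprint value stays truncated exactly as it appears in the source.

```python
import xml.etree.ElementTree as ET

# The fingerprint value is truncated in the source and is kept truncated here.
body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<metadata fp="fea690b1b11dce98a…" id="42">'
    '<album>Dark Side of the moon</album>'
    '</metadata>'
)

root = ET.fromstring(body.encode("utf-8"))   # parse the UTF-8 encoded document
record = {
    "id": root.get("id"),                    # attribute on <metadata>
    "fp": root.get("fp"),                    # fingerprint attribute
    "album": root.findtext("album"),         # text of the <album> child
}
```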
HOW SHAZAM WORKS
1. Beforehand, Shazam fingerprints a comprehensive catalog of
music, and stores the fingerprints in a database.
2. A user “tags” a song they hear, which fingerprints a 10-second
sample of audio.
3. The Shazam app uploads the fingerprint to Shazam’s service, which
runs a search for a matching fingerprint in their database.
4. If a match is found, the song info is returned to the user, otherwise
an error is returned.
• You can think of any piece of music as a time-
frequency graph called a spectrogram.
• On one axis is time, on another is frequency,
and on the 3rd is intensity.
• Each point on the graph represents the
intensity of a given frequency at a specific
point in time.
• Assuming time is on the x-axis and frequency
is on the y-axis, a horizontal line would
represent a continuous pure tone and a
vertical line would represent an instantaneous
burst of white noise.
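The "horizontal line" for a pure tone can be seen numerically. The sketch below builds a tiny spectrogram with a naive discrete Fourier transform over non-overlapping frames; the sample rate, frame size, and 128 Hz tone are assumptions chosen so the tone falls exactly on one frequency bin. Production code would normally use an FFT library with windowed, overlapping frames.

```python
import cmath
import math

def spectrogram(samples, frame_size):
    """Return one magnitude spectrum per non-overlapping frame (naive DFT)."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        spectrum = []
        for k in range(frame_size // 2):          # bins below the Nyquist frequency
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames

sample_rate = 1024                                # samples per second (assumed)
frame_size = 64                                   # so each bin spans 16 Hz
tone = [math.sin(2 * math.pi * 128 * n / sample_rate) for n in range(256)]
grid = spectrogram(tone, frame_size)              # grid[time][frequency] = intensity
peak_bins = [row.index(max(row)) for row in grid]
peak_hz = [b * sample_rate / frame_size for b in peak_bins]
```

Every frame peaks in the same bin, so plotting `grid` with time on the x-axis would show a single horizontal line at 128 Hz.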
• The Shazam algorithm fingerprints a
song by generating this 3D graph and
identifying frequencies of “peak
intensity.”
• For each of these peak points it keeps
track of the frequency and the amount
of time from the beginning of the track.
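The peak-picking step can be sketched as follows. This is a deliberate simplification, not Shazam's actual method: it keeps only the single strongest frequency bin per time frame, and the function name `peak_points` and the hand-made intensity grid are illustrative.

```python
def peak_points(grid, sample_rate, frame_size):
    """Return (frequency_hz, time_s) for the strongest bin of each frame."""
    points = []
    for index, spectrum in enumerate(grid):
        peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
        frequency = peak_bin * sample_rate / frame_size    # bin index to Hz
        time_from_start = index * frame_size / sample_rate # frame index to seconds
        points.append((frequency, time_from_start))
    return points

# Hand-made magnitudes: 3 frames, 4 frequency bins each (illustrative values).
grid = [
    [0.1, 0.9, 0.2, 0.0],
    [0.0, 0.3, 0.8, 0.1],
    [0.7, 0.2, 0.1, 0.0],
]
fingerprint = peak_points(grid, sample_rate=8, frame_size=8)
```

The resulting list of (frequency, time) pairs is the kind of compact record the bullets above describe.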
• Shazam builds their fingerprint
catalog out as a hash table,
where the key is the frequency.
• When Shazam receives a
fingerprint like the one above, it
uses the first key (in this case
823.44), and it searches for all
matching songs.
Frequency in Hz    Time in seconds, song
…                  … by Artist 1
…                  34.678, “Song B” by Artist 2
…                  108.65, “Song C” by Artist 3
. . . . . .
…                  34.945, “Song B” by Artist 2
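A catalog keyed by frequency, as described above, maps naturally onto a dictionary of lists. In this sketch only the key 823.44 and the two “Song B” times appear in the source; which frequency goes with which time, and all other values, are made up for illustration, as is the function name `build_catalog`.

```python
from collections import defaultdict

def build_catalog(tracks):
    """Map each peak frequency to the (time, title) pairs where it occurs."""
    catalog = defaultdict(list)
    for title, peaks in tracks.items():
        for frequency, time_s in peaks:
            catalog[frequency].append((time_s, title))
    return catalog

# Pairings of frequency and time below are illustrative, not from the source.
tracks = {
    "Song A": [(823.44, 12.5), (1500.0, 13.0)],
    "Song B": [(823.44, 34.678), (1200.0, 34.945)],
}
catalog = build_catalog(tracks)
hits = catalog[823.44]      # every song containing this frequency, with its time
```

Exact floating-point keys are used here only because the slide shows them; rounding frequencies into bins before hashing would likely be more robust to noise.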
• If a specific song is hit multiple times, it then checks
to see if these frequencies correspond in time.
• They create a 2D plot of frequency hits: one axis is
the time from the beginning of the track at which
those frequencies appear in the song, and the other
is the time at which they appear in the sample.
• If there is a temporal relation between the sets of
points, then the points will align along a diagonal.
• They use another signal processing method to find
this line, and if it exists with some certainty, then
they label the song a match.
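The slides do not name the method used to find the diagonal. One simple way to detect it, shown here as an assumption rather than as Shazam's actual technique, is to note that points on a diagonal share the same difference between track time and sample time, and to histogram those differences. The tolerance, vote threshold, and sample data are all illustrative.

```python
from collections import Counter

def best_offset(pairs, tolerance=0.1):
    """Find the most common (track_time - sample_time) offset among hit pairs."""
    bins = Counter(round((track_t - sample_t) / tolerance)
                   for track_t, sample_t in pairs)
    bin_index, votes = bins.most_common(1)[0]
    return bin_index * tolerance, votes

def is_match(pairs, min_votes=3):
    """Declare a match when enough hits share one offset (threshold is assumed)."""
    if not pairs:
        return False
    _, votes = best_offset(pairs)
    return votes >= min_votes

# (time in track, time in sample) for each frequency hit; values are illustrative.
aligned = [(40.0, 0.0), (41.5, 1.5), (43.0, 3.0), (44.2, 4.2), (10.0, 2.0)]
scattered = [(40.0, 0.0), (12.0, 1.5), (77.0, 3.0), (5.0, 4.2)]

offset, votes = best_offset(aligned)
```

In the aligned example four of the five hits agree on an offset of about 40 seconds, meaning the sample starts roughly 40 seconds into the track; the scattered hits never agree, so no match is reported.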