This document describes the system built by the SPL-IT-UC team from the Signal Processing Lab of Instituto de Telecomunicações (pole of Coimbra) and the University of Coimbra for the Query by Example Search on Speech Task (QUESST) of MediaEval 2015. The submitted system filters considerable background noise by applying spectral subtraction, uses five phonetic recognizers from which posterior probabilities are extracted as features, implements novel modifications of Dynamic Time Warping (DTW) that target complex queries, and uses linear calibration and fusion to optimize results. This year’s task proved especially challenging in terms of acoustic conditions and match cases, and we observed the best results when merging all complex approaches.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
The SPL-IT-UC Query by Example Search on Speech system for MediaEval 2015
Jorge Proença, Luis Castela, Fernando Perdigão
Instituto de Telecomunicações, Pole of Coimbra, Portugal
University of Coimbra – DEEC-FCTUC, Coimbra, Portugal
{jproenca, fp}@co.it.pt

The Task
– The Query by Example Search on Speech Task (QUESST) involves searching within audio content using audio queries.
– Queries may present small changes, filler content, and word reordering, and may originate from spontaneous requests.
– They may also contain significant background or intermittent noise and reverberation.
Our System: applies Spectral Subtraction to filter background noise; fuses 6 special Dynamic Time Warping (DTW) paths obtained from the output of phonetic recognizers for 5 languages.
Conclusions
Main contributions:
– Performing careful Spectral Subtraction to diminish severe background noise, which greatly influences the output of the phonetic recognizers;
– Using the average distance matrix of all languages as a 6th sub-system;
– Considering 6 possible DTW paths to tackle complex match cases;
– Truncating large distances per query, which may lower the impact of critical false negatives;
– Besides side-info, all of these improvements also improve the ATWV metric.
MediaEval 2015 – QUESST | September 14-15, 2015, Wurzen, Germany
2. Phonetic Recognizer
– We used the long temporal context neural network system from Brno University of Technology (BUT).
– 5 sub-systems/languages (for 8 kHz audio): Czech, Hungarian, Russian, Portuguese (trained), English (trained).
– Output: state-level posteriorgrams (3 states per phoneme).
– Silence/Noise frames removed from queries.
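The silence/noise frame removal can be sketched as a posterior-mass filter over the query posteriorgram; the threshold and the indices of the silence/noise states below are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def remove_silence_frames(post, sil_idx, thresh=0.5):
    """Drop query frames dominated by silence/noise posteriors (sketch).

    `post` is a (frames x states) posteriorgram; `sil_idx` lists the
    columns modelling silence/noise (assumed known from the recognizer).
    """
    sil_mass = post[:, sil_idx].sum(axis=1)   # posterior mass on silence states
    return post[sil_mass < thresh]            # keep only speech-dominated frames
```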
3. Dynamic Time Warping (DTW)
– Per-language local distance matrix:
– Dot product of query and audio posterior probability vectors;
– Back-off with λ = 10⁻⁴.
– 6 sub-systems for DTW:
– 5 distance matrices from the 5 languages;
– a 6th one, the average of the 5 distance matrices (ML).
(Improvement: 5-language fusion – 0.7971 Cnxe; ML – 0.8136; 5 languages + ML – 0.7873)
– Basic DTW strategy (A1):
– Smallest distance with identically weighted unitary jumps;
– Output the average local distance of the final path.
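A minimal sketch of the local distance and the A1 strategy, assuming a free start/end on the audio axis (for search within longer audio) and reading "average distance of the final path" as path-length normalization:

```python
import numpy as np

LAMBDA = 1e-4  # back-off weight

def local_distance(Q, X, lam=LAMBDA):
    """D(q, x) = -log(q . x) on backed-off posterior vectors.

    Q: (m x d) query posteriorgram, X: (n x d) audio posteriorgram.
    Back-off mixes each posterior with the uniform vector u:
    q' = (1 - lam) q + lam u, avoiding log(0).
    """
    d = Q.shape[1]
    u = np.full(d, 1.0 / d)
    Qb = (1 - lam) * Q + lam * u
    Xb = (1 - lam) * X + lam * u
    return -np.log(Qb @ Xb.T)

def basic_dtw(D):
    """A1 sketch: cheapest path with identically weighted unitary jumps
    (vertical, horizontal, diagonal), free start/end on the audio axis;
    returns the average local distance along the best path."""
    m, n = D.shape
    cost = np.full((m, n), np.inf)
    steps = np.ones((m, n), dtype=int)
    cost[0] = D[0]                      # the match may start anywhere in the audio
    for i in range(1, m):
        for j in range(n):
            prev = [(cost[i - 1, j], steps[i - 1, j])]
            if j > 0:
                prev.append((cost[i - 1, j - 1], steps[i - 1, j - 1]))
                prev.append((cost[i, j - 1], steps[i, j - 1]))
            c, s = min(prev)
            cost[i, j] = c + D[i, j]
            steps[i, j] = s + 1
    avg = cost[-1] / steps[-1]          # the match may end anywhere in the audio
    return float(avg.min())
```

A lower score means a better match of the query inside the audio.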
1. Noise Filtering
Spectral Subtraction (SS) to counter constant background noise:
1. High-pass filter to remove low-frequency artefacts.
2. Analyze the averaged energy of the signal and determine high and low levels through the median of quartiles.
3. High-SNR signals: no SS applied, to avoid distortions. Others: obtain candidate "noise" segments longer than 100 ms.
4. Subtract the average noise spectrum with classical SS.
(Improvement: from 0.8368 Cnxe → 0.8130 with SS)
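As a sketch, step 4 can be implemented with classical magnitude-domain spectral subtraction and overlap-add resynthesis; the frame size, hop, spectral floor, and function name below are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def spectral_subtraction(signal, noise_seg, frame=256, hop=128):
    """Classical magnitude spectral subtraction (illustrative sketch).

    `noise_seg` is a candidate noise-only segment (> 100 ms) taken from
    the low-energy level of the signal. The average noise magnitude
    spectrum is subtracted from each frame, with a small spectral floor
    to limit musical-noise distortion.
    """
    win = np.hanning(frame)
    # Average magnitude spectrum of the noise-only segment
    noise_frames = [noise_seg[i:i + frame] * win
                    for i in range(0, len(noise_seg) - frame, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for i in range(0, len(signal) - frame, hop):
        seg = signal[i:i + frame] * win
        spec = np.fft.rfft(seg)
        # Subtract the noise magnitude, keeping a 1% spectral floor
        mag = np.maximum(np.abs(spec) - noise_mag, 0.01 * np.abs(spec))
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
        norm[i:i + frame] += win
    return out / np.maximum(norm, 1e-8)   # overlap-add normalization
```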
[Figure: Czech posteriorgram example for one query]
5. Fusion and Calibration
Linear fusion (with the Bosaris Toolkit), calibrating for Cnxe.
– 6 sub-systems × 6 paths = 36 distance vectors of audio-query pairs.
1. Per-query distribution: truncate large distances to the mean of the distribution.
(Improvement: from 0.7939 → 0.7873 Cnxe)
2. Normalize per query: subtract the mean, divide by the standard deviation.
3. Side-info: 7 additional vectors for fusion:
– mean of distances per query before truncation and normalization (from the best approach and sub-system: ML-A2);
– query size in frames and log of query size;
– 4 SNR values: original and post-SS SNRs of the query and of the audio.
4 systems submitted:
1. Linear fusion of all approaches and sub-systems + side-info
2. Harmonic mean of approaches and linear fusion of sub-systems + side-info
3. Same as 1, without side-info
4. Same as 2, without side-info
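Steps 1-2 of the per-query score conditioning can be sketched as follows (`scores` holds one query's distances against all audio files; the epsilon guard is an added assumption):

```python
import numpy as np

def normalize_per_query(scores):
    """Per-query conditioning before fusion (sketch):
    1. truncate distances above the mean down to the mean, limiting the
       weight of extreme values (critical false negatives);
    2. zero-mean, unit-variance normalization of the query's scores.
    """
    s = np.asarray(scores, dtype=float)
    s = np.minimum(s, s.mean())                # step 1: truncate large distances
    return (s - s.mean()) / (s.std() + 1e-12)  # step 2: z-normalization
```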
6. Results
– Side-info is always helpful for the Cnxe metric.
– Fusion of all: best on the Dev set.
– Harmonic mean: best on Eval (the fusion of all may be overfitted to Dev).
– Best Dev per approach – A1: 0.8041, A2: 0.7978, A3: 0.8335, A4: 0.8137, A5: 0.8184, A6: 0.8460
– (A2) is the overall best; it may help in all cases due to co-articulation or intonation at query ends.
– (A6) performs badly: a filler in the query may be an extension rather than a gap.
– Best Eval per type – T1: 0.7107, T2: 0.8147, T3: 0.8115
[Figure: Query vs. audio posterior distance matrix (top) and the best path from A5 (bottom)]
Processing Speed
– Indexing Speed Factor: 2.14
– Searching Speed Factor: 0.0034 per second
– Peak Memory: 120 MB
[Figure: DTW unitary jumps between query and audio frames, all with weight 1]
Back-off of posteriors towards the uniform vector u: q′ = (1 − λ)q + λu
Fusion Systems           Dev (Cnxe, MinCnxe)   Eval (Cnxe, MinCnxe)
1. All + side-info       0.7782, 0.7716        0.7866, 0.7809
2. H.mean + side-info    0.7862, 0.7800        0.7842, 0.7786
3. All, no side-info     0.7873, 0.7816        0.7930, 0.7875
4. H.mean, no side-info  0.7957, 0.7893        0.7914, 0.7865
4. DTW Modifications
– 5 additional approaches:
(A2) Cutting up to 250 ms at the end of the query, keeping the total above 500 ms.
(A3) Cutting up to 250 ms at the beginning of the query, keeping the total above 500 ms.
(A4) Allowing one 'jump' along the audio of up to ½ the query's length, which:
– cannot occur in the initial and final 250 ms of the query;
– is not allowed for queries shorter than 800 ms.
(A5) Accounting for re-ordering of words: find the best path for the beginning of the query, ahead of the end of the first path, with restrictions similar to (A4).
(A6) Allowing one 'jump' along the query, of at most ⅓ of the query length.
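A2 and A3 amount to rerunning a base scorer on trimmed distance matrices; a generic sketch, where `scorer` stands for any matrix-to-score function (e.g. the basic A1 DTW) and the 5 ms frame rate is taken from the posteriorgram figure:

```python
def best_trimmed_score(D, scorer, frame_ms=5, cut_ms=250, min_ms=500, side="end"):
    """A2/A3 sketch: try trimming up to `cut_ms` from one side of the
    query (rows of distance matrix D), keeping at least `min_ms` of it,
    and return the best (lowest) score over all trims.
    `scorer` maps a distance matrix to a path score."""
    cut = cut_ms // frame_ms
    min_len = min_ms // frame_ms
    m = len(D)
    best = scorer(D)                       # untrimmed query as baseline
    for c in range(1, cut + 1):
        if m - c < min_len:                # keep the query above min_ms
            break
        sub = D[:-c] if side == "end" else D[c:]
        best = min(best, scorer(sub))
    return best
```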
Local distance: D(q, x) = −log(q · x)
[Figure: average energy (dB) of a query over frames (5 ms), used to determine high and low energy levels]