This document describes the system built by the SPL-IT-UC team from the Signal Processing Lab of Instituto de Telecomunicações (pole of Coimbra) and the University of Coimbra for the Query by Example Search on Speech Task (QUESST) of MediaEval 2015. The submitted system filters considerable background noise by applying spectral subtraction, uses five phonetic recognizers from which posterior probabilities are extracted as features, implements novel modifications of Dynamic Time Warping (DTW) that target complex queries, and uses linear calibration and fusion to optimize results. This year’s task proved especially challenging in terms of acoustic conditions and match cases, and we observed the best results when merging all complex approaches.
http://ceur-ws.org/Vol-1436/
http://www.multimediaeval.org
The SPL-IT-UC Query by Example Search on Speech system for MediaEval 2015
Jorge Proença, Luis Castela, Fernando Perdigão
Instituto de Telecomunicações, Pole of Coimbra, Portugal
University of Coimbra – DEEC-FCTUC, Coimbra, Portugal
{jproenca, fp}@co.it.pt

The Task
– The Query by Example Search on Speech Task (QUESST) involves searching within audio content using audio queries.
– Queries may present small changes, filler content, and word reordering, and may originate from spontaneous requests.
– They may also contain significant background or intermittent noise and reverberation.
Our System: applies Spectral Subtraction to filter background noise; fuses 6 special Dynamic Time Warping (DTW) paths obtained from the output of phonetic recognizers for 5 languages.
Conclusions
Main contributions:
– Performing careful Spectral Subtraction to diminish severe background noise, which greatly influences the output of the phonetic recognizers;
– Using the average distance matrix of all languages as a 6th sub-system;
– Considering 6 possible DTW paths to tackle complex match cases;
– Truncating large distances per query, which may lower the impact of critical false negatives;
– Besides side-info, all of these improvements also improve the ATWV metric.
MediaEval 2015 – QUESST | September 14-15, 2015, Wurzen, Germany
2. Phonetic Recognizer
– We used the long temporal context neural network system from Brno University of Technology (BUT).
– 5 sub-systems/languages (for 8 kHz audio): Czech, Hungarian, Russian, Portuguese (trained), English (trained).
– Output: state-level posteriorgrams (3 states per phoneme).
– Silence/Noise frames removed from queries.
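The silence/noise frame removal can be sketched as a posterior-mass filter over the query posteriorgram; the threshold and the indices of the silence/noise states below are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def remove_silence_frames(post, sil_idx, thresh=0.5):
    """Drop query frames dominated by silence/noise posteriors (sketch).

    `post` is a (frames x states) posteriorgram; `sil_idx` lists the
    columns modelling silence/noise (assumed known from the recognizer).
    """
    sil_mass = post[:, sil_idx].sum(axis=1)   # posterior mass on silence states
    return post[sil_mass < thresh]            # keep only speech-dominated frames
```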
3. Dynamic Time Warping (DTW)
– Per-language local distance matrix:
– Dot product of query and audio posterior probability vectors;
– Back-off with λ = 10⁻⁴.
– 6 sub-systems for DTW:
– 5 distance matrices from the 5 languages;
– a 6th one, the average of the 5 distance matrices (ML).
(Improvement: 5-language fusion – 0.7971 Cnxe; ML – 0.8136; 5 languages + ML – 0.7873)
– Basic DTW strategy (A1):
– Smallest distance with identically weighted unitary jumps;
– Output the average local distance of the final path.
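A minimal sketch of the local distance and the A1 strategy, assuming a free start/end on the audio axis (for search within longer audio) and reading "average distance of the final path" as path-length normalization:

```python
import numpy as np

LAMBDA = 1e-4  # back-off weight

def local_distance(Q, X, lam=LAMBDA):
    """D(q, x) = -log(q . x) on backed-off posterior vectors.

    Q: (m x d) query posteriorgram, X: (n x d) audio posteriorgram.
    Back-off mixes each posterior with the uniform vector u:
    q' = (1 - lam) q + lam u, avoiding log(0).
    """
    d = Q.shape[1]
    u = np.full(d, 1.0 / d)
    Qb = (1 - lam) * Q + lam * u
    Xb = (1 - lam) * X + lam * u
    return -np.log(Qb @ Xb.T)

def basic_dtw(D):
    """A1 sketch: cheapest path with identically weighted unitary jumps
    (vertical, horizontal, diagonal), free start/end on the audio axis;
    returns the average local distance along the best path."""
    m, n = D.shape
    cost = np.full((m, n), np.inf)
    steps = np.ones((m, n), dtype=int)
    cost[0] = D[0]                      # the match may start anywhere in the audio
    for i in range(1, m):
        for j in range(n):
            prev = [(cost[i - 1, j], steps[i - 1, j])]
            if j > 0:
                prev.append((cost[i - 1, j - 1], steps[i - 1, j - 1]))
                prev.append((cost[i, j - 1], steps[i, j - 1]))
            c, s = min(prev)
            cost[i, j] = c + D[i, j]
            steps[i, j] = s + 1
    avg = cost[-1] / steps[-1]          # the match may end anywhere in the audio
    return float(avg.min())
```

A lower score means a better match of the query inside the audio.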
1. Noise Filtering
Spectral Subtraction (SS) to counter constant background noise:
1. High-pass filter to remove low-frequency artefacts.
2. Analyze the averaged energy of the signal and determine high and low levels through the median of quartiles.
3. High-SNR signals: no SS applied, to avoid distortions. Others: obtain candidate "noise" segments longer than 100 ms.
4. Subtract the average noise spectrum with classical SS.
(Improvement: from 0.8368 Cnxe → 0.8130 with SS)
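As a sketch, step 4 can be implemented with classical magnitude-domain spectral subtraction and overlap-add resynthesis; the frame size, hop, spectral floor, and function name below are illustrative assumptions, not the authors' exact settings:

```python
import numpy as np

def spectral_subtraction(signal, noise_seg, frame=256, hop=128):
    """Classical magnitude spectral subtraction (illustrative sketch).

    `noise_seg` is a candidate noise-only segment (> 100 ms) taken from
    the low-energy level of the signal. The average noise magnitude
    spectrum is subtracted from each frame, with a small spectral floor
    to limit musical-noise distortion.
    """
    win = np.hanning(frame)
    # Average magnitude spectrum of the noise-only segment
    noise_frames = [noise_seg[i:i + frame] * win
                    for i in range(0, len(noise_seg) - frame, hop)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = np.zeros(len(signal))
    norm = np.zeros(len(signal))
    for i in range(0, len(signal) - frame, hop):
        seg = signal[i:i + frame] * win
        spec = np.fft.rfft(seg)
        # Subtract the noise magnitude, keeping a 1% spectral floor
        mag = np.maximum(np.abs(spec) - noise_mag, 0.01 * np.abs(spec))
        out[i:i + frame] += np.fft.irfft(mag * np.exp(1j * np.angle(spec)))
        norm[i:i + frame] += win
    return out / np.maximum(norm, 1e-8)   # overlap-add normalization
```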
[Figure: Czech posteriorgram example for one query]
5. Fusion and Calibration
Linear fusion (with the Bosaris Toolkit), calibrating for Cnxe.
– 6 sub-systems × 6 paths = 36 distance vectors of audio-query pairs.
1. Per-query distribution: truncate large distances to the mean of the distribution.
(Improvement: from 0.7939 → 0.7873 Cnxe)
2. Normalize per query: subtract the mean, divide by the standard deviation.
3. Side-info: 7 additional vectors for fusion:
– mean of distances per query before truncation and normalization (from the best approach and sub-system: ML-A2);
– query size in frames and log of query size;
– 4 SNR values: original and post-SS SNRs of the query and of the audio.
4 systems submitted:
1. Linear fusion of all approaches and sub-systems + side-info
2. Harmonic mean of approaches and linear fusion of sub-systems + side-info
3. Same as 1, without side-info
4. Same as 2, without side-info
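Steps 1-2 of the per-query score conditioning can be sketched as follows (`scores` holds one query's distances against all audio files; the epsilon guard is an added assumption):

```python
import numpy as np

def normalize_per_query(scores):
    """Per-query conditioning before fusion (sketch):
    1. truncate distances above the mean down to the mean, limiting the
       weight of extreme values (critical false negatives);
    2. zero-mean, unit-variance normalization of the query's scores.
    """
    s = np.asarray(scores, dtype=float)
    s = np.minimum(s, s.mean())                # step 1: truncate large distances
    return (s - s.mean()) / (s.std() + 1e-12)  # step 2: z-normalization
```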
6. Results
– Side-info is always helpful for the Cnxe metric.
– Fusion of all: best on the Dev set.
– Harmonic mean: best on Eval (the fusion of all may be overfitted to Dev).
– Best Dev per approach – A1: 0.8041, A2: 0.7978, A3: 0.8335, A4: 0.8137, A5: 0.8184, A6: 0.8460
– (A2) is the overall best; it may help in all cases due to co-articulation or intonation at query ends.
– (A6) performs badly: a filler in the query may be an extension rather than a gap.
– Best Eval per type – T1: 0.7107, T2: 0.8147, T3: 0.8115
[Figure: Query vs. audio posterior distance matrix (top) and the best path from A5 (bottom)]
Processing Speed
– Indexing Speed Factor: 2.14
– Searching Speed Factor: 0.0034 per second
– Peak Memory: 120 MB
[Figure: DTW unitary jumps between query and audio frames, all with weight 1]
Back-off of posteriors towards the uniform vector u: q′ = (1 − λ)q + λu
Fusion Systems           Dev (Cnxe, MinCnxe)   Eval (Cnxe, MinCnxe)
1. All + side-info       0.7782, 0.7716        0.7866, 0.7809
2. H.mean + side-info    0.7862, 0.7800        0.7842, 0.7786
3. All, no side-info     0.7873, 0.7816        0.7930, 0.7875
4. H.mean, no side-info  0.7957, 0.7893        0.7914, 0.7865
4. DTW Modifications
– 5 additional approaches:
(A2) Cutting up to 250 ms at the end of the query, keeping the total above 500 ms.
(A3) Cutting up to 250 ms at the beginning of the query, keeping the total above 500 ms.
(A4) Allowing one 'jump' along the audio of up to ½ the query's length, which:
– cannot occur in the initial and final 250 ms of the query;
– is not allowed for queries shorter than 800 ms.
(A5) Accounting for re-ordering of words: find the best path for the beginning of the query, ahead of the end of the first path, with restrictions similar to (A4).
(A6) Allowing one 'jump' along the query, of at most ⅓ of the query length.
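A2 and A3 amount to rerunning a base scorer on trimmed distance matrices; a generic sketch, where `scorer` stands for any matrix-to-score function (e.g. the basic A1 DTW) and the 5 ms frame rate is taken from the posteriorgram figure:

```python
def best_trimmed_score(D, scorer, frame_ms=5, cut_ms=250, min_ms=500, side="end"):
    """A2/A3 sketch: try trimming up to `cut_ms` from one side of the
    query (rows of distance matrix D), keeping at least `min_ms` of it,
    and return the best (lowest) score over all trims.
    `scorer` maps a distance matrix to a path score."""
    cut = cut_ms // frame_ms
    min_len = min_ms // frame_ms
    m = len(D)
    best = scorer(D)                       # untrimmed query as baseline
    for c in range(1, cut + 1):
        if m - c < min_len:                # keep the query above min_ms
            break
        sub = D[:-c] if side == "end" else D[c:]
        best = min(best, scorer(sub))
    return best
```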
Local distance: D(q, x) = −log(q · x)
[Figure: average energy (dB) of a query over frames (5 ms), used to determine high and low energy levels]