5. Spoken Web Search Task Motivation
Any speech problem can be solved with enough:
money, time, constraints, data
What if we have just one of these?
Don’t know what language/dialect is being used? Don’t have much data!
But we don’t have to do Large Vocabulary Speech Recognition,
“only” content retrieval
What can be done?
Port outside resources (i.e. run a language-independent/-portable recognizer)
Build a “zero knowledge” approach (i.e. try to directly identify similar words)
6. Primary Data Source: “African Data”
“Lwazi” Corpus
Lwazi means “knowledge”
Lwazi project aims to develop a telephony-based, speech-driven information system
11 South African languages, 3h-6h of speech per language
Phone sets, dictionaries, read & spontaneous speech, …
3200 utterances used, from 4 languages
Data obtained during a targeted effort, meant as a resource for speech research, so no “found” data, as with the “Indian Data”
E. Barnard, M. Davel, and C. van Heerden, "ASR corpus design for resource-scarce languages," in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
7. Evaluation Paradigm:
Spoken Term Detection (STD)
Do not attempt to convert speech to text (full recognition, ASR)
Attempt to detect the occurrence (or absence) of “keywords”
STD is not necessarily easier than doing ASR,
but it requires fewer resources: in particular, no strong language model
Evaluation metrics:
(Spoken) Document Retrieval (SDR), when relaxing time constraints
Actual Term Weighted Value (ATWV, MTWV – defined by NIST)
8. Evaluation Idea – 4 Conditions
Test development terms on (known) development data
Test (unknown) evaluation terms on (unknown) evaluation data
Test development terms on evaluation data
Test evaluation terms on development data
Terms provided as audio examples taken from collections
Systems could be developed with or without using external resources (i.e.
other speech data); it is important to document which ones were used
(“restricted” vs. “open”)
9. NIST Scoring Tools
Developed for 2006 Spoken Term Detection
Generates “Actual” and “Maximum Term Weighted Value” (ATWV, MTWV)
Generates DET curves
Adapted by us
ECF = “Experiment Control File” (controls which sections to process)
RTTM = “Rich Transcription Time Mark” (defines references)
TLIST = “Term List” Files (links term IDs and the word dictionary)
A few parameters to choose
Different for 2011 and 2012, to better represent characteristics of SWS task
(thanks, Xavi)
Best ATWV value is 1; values below 0 are possible
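The Term Weighted Value behind these numbers can be sketched in a few lines. This follows the NIST STD 2006 definition; the C/V and P_term constants below are the NIST 2006 defaults, not necessarily the settings chosen for SWS 2011/2012.

```python
# Sketch of the NIST Term Weighted Value: TWV = 1 - (P_miss + beta * P_FA).
# cost_over_value and p_term are the NIST STD 2006 defaults (C/V = 0.1,
# P_term = 1e-4), not necessarily the SWS 2012 settings.

def twv(p_miss, p_fa, cost_over_value=0.1, p_term=1e-4):
    beta = cost_over_value * (1.0 / p_term - 1.0)  # beta = C/V * (1/P_term - 1)
    return 1.0 - (p_miss + beta * p_fa)

print(twv(0.0, 0.0))    # a perfect system scores exactly 1.0
print(twv(0.2, 0.001))  # false alarms are weighted heavily: TWV drops below 0
```

With these defaults beta is 999.9, which is why even a tiny false-alarm probability pushes the value below zero.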
10. How to Interpret DET Plots
Most useful plot
Shows P(Miss) over P(FA) for all decision scores
If done right, will give you a “marker” at the actual decision
If computed using the score, this marker will be on the curve
Combined DET plot used for evaluation (with score.occ.txt)
[DET plot: Miss probability (in %) vs. False Alarm probability (in %), with a “Random Performance” diagonal; legend: “Term Wtd. float-primary-test: CTS Subset / ALL Data, Max Val=0.173 Scr=1.276”]
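As a rough illustration of what goes into such a plot, the operating points come from sweeping a threshold over the system’s scores. This is a minimal sketch with toy in-memory data; the real NIST tools work from the RTTM references and the submitted score lists.

```python
# Sketch: computing the (P_miss, P_FA) operating points of a DET curve
# from detection scores. Toy data; the real tools read RTTM references
# and submitted score files.
from statistics import NormalDist

def det_points(target_scores, nontarget_scores):
    """At threshold t: P_miss = fraction of true-hit scores below t,
    P_FA = fraction of false-hit scores at or above t."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [(sum(s < t for s in target_scores) / len(target_scores),
             sum(s >= t for s in nontarget_scores) / len(nontarget_scores))
            for t in thresholds]

def probit(p, eps=1e-6):
    """DET plots use a normal-deviate (probit) scale on both axes,
    which is why good systems trace near-straight lines."""
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))

points = det_points([0.9, 0.8, 0.6], [0.4, 0.1])
```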
11. 2012 Spoken Web Search Participants
Authors / Title:
Haipeng Wang and Tan Lee: “CUHK System for the Spoken Web Search task at Mediaeval 2012”
Cyril Joder, Felix Weninger, Martin Wöllmer and Björn Schuller: “The TUM Cumulative DTW Approach for the Mediaeval 2012 Spoken Web Search Task”
Andi Buzo, Horia Cucu, Mihai Safta, Bogdan Ionescu, and Corneliu Burileanu: “ARF @ MediaEval 2012: A Romanian ASR-based Approach to Spoken Term Detection”
Alberto Abad and Ramón F. Astudillo: “The L2F Spoken Web Search system for Mediaeval 2012”
Jozef Vavrek, Matus Pleva and Jozef Juhar: “TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVM”
Amparo Varona, Mikel Penagarikano, Luis Javier Rodriguez-Fuentes, German Bordel, and Mireia Diez: “GTTS System for the Spoken Web Search Task at MediaEval 2012”
Igor Szoke, Michal Fapšo, and Karel Veselý: “BUT 2012 Approaches for Spoken Web Search - MediaEval 2012”
Aren Jansen, Benjamin Van Durme, and Pascal Clark: “The JHU-HLTCOE Spoken Web Search System for MediaEval 2012”
Xavier Anguera (TID): “Telefonica Research System for the Spoken Web Search task at Mediaeval 2012”
12. Summary of (Primary) Results
Team        System                                Type        Dev     Eval
CUHK        cuhk_phnrecgmmasm_p-fusionprf_1       open        0.7824  0.7430
CUHK        cuhk_spch_p-gmmasmprf_1               restricted  0.6776  0.6350
L2F         l2f_12_spch_p-phonetic4_fusion_mv_1   open        0.5313  0.5195
BUT         BUT_spch_p-akws-devterms_1            open        0.4884  0.4918
BUT         BUT_spch_g-DTW-devterms_1             open        0.4426  0.4477
JHU-HLTCOE  jhu_all_spch_p-rails_1                restricted  0.3811  0.3688
TID         sws2012_IRDTW                         restricted  0.3866  0.3301
TUM         tum_spch_p-cdtw_1                     restricted  0.2628  0.2895
ARF         arf_spch_p-asrDTWAlign_w15_a08_b04    open        0.4109  0.2448
GTTS        gtts_spch_p-phone_lattice_1           open        0.0978  0.0809
TUKE        tuke_spch_p-dtwsvm                    restricted  0.0000  0.0000
13. Development data, development terms
[DET plot: Development Data, Development Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.471, 0.491, 0.253, 0.487; BUT 0.468, 0.493; CUHK 0.735, 0.751, 0.787, 0.631, 0.680; JHU-HLTCOE 0.382; L2F 0.531; TUKE 0.000; TUM 0.354, 0.337, 0.270; TID 0.390, 0.375; GTTS 0.098, 0.105]
14. Development data, evaluation terms
[DET plot: Development Data, Evaluation Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.443, 0.475, 0.016, 0.224, 0.466; BUT 0.481, 0.629; CUHK 0.769, 0.772, 0.805, 0.687, 0.686; JHU-HLTCOE 0.440; L2F 0.633; TUKE 0.000, 0.257; TUM 0.201, 0.396; TID 0.498, 0.300; GTTS 0.083, 0.109]
15. Evaluation data, development terms
[DET plot: Evaluation Data, Development Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.317, 0.339, 0.000, 0.167, 0.333; BUT 0.383, 0.429; CUHK 0.707, 0.715, 0.752, 0.561, 0.620; JHU-HLTCOE 0.336; L2F 0.486; TUKE 0.000; TUM 0.236, 0.291, 0.174; TID 0.314, 0.472; GTTS 0.070, 0.081]
16. Evaluation data, evaluation terms
[DET plot: Evaluation Data, Evaluation Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.268, 0.310, 0.001, 0.120, 0.306; BUT 0.488, 0.530; CUHK 0.724, 0.742, 0.762, 0.589, 0.643; JHU-HLTCOE 0.384; L2F 0.523; TUKE 0.000; TUM 0.187, 0.164, 0.296; TID 0.342, 0.311; GTTS 0.070, 0.081]
17. Spoken Web Search Task
Summary 1
Second time around
Last year’s participants (mostly) became organizers
Grew from 5 to ca. 10 participants!
Europe, America, Asia, Africa (where are Australia and Antarctica?)
Interesting differences in performance
Thank you to all participants! It was fun & interesting.
Are the evaluation criteria useful and correct?
18. Spoken Web Search Task
Summary 2
Could talk a bit about JHU-HLTCOE’s “RAILS” system
Next steps?
Do more joint analysis (hopefully everybody’s results agree with ours)
Shared Publications? ICASSP? Journal?
Develop task further for next year?
“Speech Kitchen” idea will be presented later …
20. How to Interpret *.occ.txt File
The file lists:
Coefficients C, V
Weighting of correct vs. incorrect detections
Probability of a Term
Expectation of terms
Average and Maximum TWV
P(FA) and P(Miss)
Optimal decision score
Caveats:
Values used for padding and multi-term detections are missing
In some rare cases it lists different values for the total and only a sub-class
Was expecting more questions
21. Parameters used
The tools assume you use a “decision score”
Submit “candidates” with score lower than the cutoff
Submit “detections” with score higher than the cutoff
Enables plotting of DET curves
Can be confusing
Used different parameters for African and Indian data sets to reflect different use cases
KoefV/KoefC are debatable: what’s the cost of wrong detections, and the benefit of correct ones?
-P Probability-of-Term: how frequent are terms expected to be?
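A toy illustration of this candidates/detections convention, with hypothetical (term, time, score) tuples; the real tools read the submitted term lists, not in-memory data:

```python
# Sketch of the "decision score" convention: entries at or above the cutoff
# are hard "detections" (counted in the Actual TWV); entries below it are
# soft "candidates", still used when sweeping thresholds for the DET curve
# and the Maximum TWV. Hypothetical (term, time, score) tuples.
hyps = [("term1", 12.3, 0.91), ("term1", 40.0, 0.42), ("term2", 7.7, 0.65)]
cutoff = 0.5

detections = [h for h in hyps if h[2] >= cutoff]
candidates = [h for h in hyps if h[2] < cutoff]
print(len(detections), len(candidates))  # 2 detections, 1 candidate
```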
22. How to Interpret score.det.thresh.pdf
Can be used to analyze decision score behavior
[Threshold plot “Term Wtd. Threshold Plot for float-primary-test : ALL Data”: P(FA) (false alarms), P(Miss) (missed detections), and the resulting TWV plotted against the decision score; MaxValue 0.173 @ 1.276]
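What that plot computes can be mimicked with a small threshold sweep over toy scores; beta is simplified to a single constant here (the NIST 2006 default), not the actual SWS setting.

```python
# Sketch of what the threshold plot shows: sweep the decision score,
# track the resulting TWV, and its peak gives the Maximum TWV together
# with the optimal decision score. Toy scores, simplified beta.

def sweep(target_scores, nontarget_scores, beta=999.9):
    best = (float("-inf"), None)  # (TWV, decision score)
    for t in sorted(set(target_scores + nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = max(best, (1.0 - (p_miss + beta * p_fa), t))
    return best

mtwv, theta = sweep([0.9, 0.8], [0.1], beta=1.0)
print(mtwv, theta)  # 1.0 0.8
```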