5. Spoken Web Search Task Motivation
Any speech problem can be solved with enough:
money, time, constraints, data
What if we have just one of these?
Don’t know what language/dialect is being used? Don’t have much data!
But we don’t have to do Large Vocabulary Speech Recognition,
“only” content retrieval
What can be done?
Port outside resources (i.e. run a language-independent/-portable recognizer)
Build a “zero knowledge” approach (i.e. try to directly identify similar words)
6. Primary Data Source: “African Data”
“Lwazi” Corpus
Lwazi means “knowledge”
Lwazi project aims to develop a telephony-based, speech-driven information system
11 South African languages, 3h-6h of speech per language
Phone sets, dictionaries, read & spontaneous speech, …
3200 utterances used, from 4 languages
Data obtained during a targeted effort, meant as a resource for speech research, so no “found” data, as with the “Indian Data”
E. Barnard, M. Davel, and C. van Heerden, "ASR corpus design for resource-scarce languages," in Proc. INTERSPEECH, Brighton, UK, Sep. 2009, pp. 2847-2850.
7. Evaluation Paradigm:
Spoken Term Detection (STD)
Do not attempt to convert speech to text (full recognition, ASR)
Attempt to detect the occurrence (or absence) of “keywords”
STD is not necessarily easier than doing ASR,
but it requires fewer resources: in particular, no strong language model
Evaluation metrics:
(Spoken) Document Retrieval (SDR), when relaxing time constraints
Actual Term Weighted Value (ATWV, MTWV – defined by NIST)
8. Evaluation Idea – 4 Conditions
Test development terms on (known) development data
Test (unknown) evaluation terms on (unknown) evaluation data
Test development terms on evaluation data
Test evaluation terms on development data
Terms provided as audio examples taken from collections
Systems could be developed with or without using external resources (i.e.
other speech data); it is important to document which ones were used
(“restricted” vs. “open”)
9. NIST Scoring Tools
Developed for 2006 Spoken Term Detection
Generates “Actual” and “Maximum Term Weighted Value” (ATWV, MTWV)
Generates DET curves
Adapted by us
ECF = “Experiment Control File” (controls which sections to process)
RTTM = “Rich Transcription Time Mark” (defines references)
TLIST = “Term List” Files (links term IDs and the word dictionary)
A few parameters to choose
Different for 2011 and 2012, to better represent characteristics of SWS task
(thanks, Xavi)
Best ATWV value is 1; values below 0 are possible
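The Term Weighted Value behind these numbers can be sketched in a few lines. This follows the NIST STD 2006 definition; the C/V and P_term constants below are the NIST 2006 defaults, not necessarily the settings chosen for SWS 2011/2012.

```python
# Sketch of the NIST Term Weighted Value: TWV = 1 - (P_miss + beta * P_FA).
# cost_over_value and p_term are the NIST STD 2006 defaults (C/V = 0.1,
# P_term = 1e-4), not necessarily the SWS 2012 settings.

def twv(p_miss, p_fa, cost_over_value=0.1, p_term=1e-4):
    beta = cost_over_value * (1.0 / p_term - 1.0)  # beta = C/V * (1/P_term - 1)
    return 1.0 - (p_miss + beta * p_fa)

print(twv(0.0, 0.0))    # a perfect system scores exactly 1.0
print(twv(0.2, 0.001))  # false alarms are weighted heavily: TWV drops below 0
```

With these defaults beta is 999.9, which is why even a tiny false-alarm probability pushes the value below zero.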
10. How to Interpret DET Plots
Most useful plot
Shows P(Miss) over P(FA) for all decision scores
If done right, will give you a “marker” at the actual decision
If computed using the score, this marker will be on the curve
Combined DET plot used for evaluation (with score.occ.txt)
[DET plot: Miss probability (in %) vs. False Alarm probability (in %), with a “Random Performance” diagonal; legend: “Term Wtd. float-primary-test: CTS Subset / ALL Data, Max Val=0.173 Scr=1.276”]
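As a rough illustration of what goes into such a plot, the operating points come from sweeping a threshold over the system’s scores. This is a minimal sketch with toy in-memory data; the real NIST tools work from the RTTM references and the submitted score lists.

```python
# Sketch: computing the (P_miss, P_FA) operating points of a DET curve
# from detection scores. Toy data; the real tools read RTTM references
# and submitted score files.
from statistics import NormalDist

def det_points(target_scores, nontarget_scores):
    """At threshold t: P_miss = fraction of true-hit scores below t,
    P_FA = fraction of false-hit scores at or above t."""
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    return [(sum(s < t for s in target_scores) / len(target_scores),
             sum(s >= t for s in nontarget_scores) / len(nontarget_scores))
            for t in thresholds]

def probit(p, eps=1e-6):
    """DET plots use a normal-deviate (probit) scale on both axes,
    which is why good systems trace near-straight lines."""
    return NormalDist().inv_cdf(min(max(p, eps), 1.0 - eps))

points = det_points([0.9, 0.8, 0.6], [0.4, 0.1])
```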
11. 2012 Spoken Web Search Participants
Authors / Title:
Haipeng Wang and Tan Lee: “CUHK System for the Spoken Web Search task at Mediaeval 2012”
Cyril Joder, Felix Weninger, Martin Wöllmer and Björn Schuller: “The TUM Cumulative DTW Approach for the Mediaeval 2012 Spoken Web Search Task”
Andi Buzo, Horia Cucu, Mihai Safta, Bogdan Ionescu, and Corneliu Burileanu: “ARF @ MediaEval 2012: A Romanian ASR-based Approach to Spoken Term Detection”
Alberto Abad and Ramón F. Astudillo: “The L2F Spoken Web Search system for Mediaeval 2012”
Jozef Vavrek, Matus Pleva and Jozef Juhar: “TUKE MediaEval 2012: Spoken Web Search using DTW and Unsupervised SVM”
Amparo Varona, Mikel Penagarikano, Luis Javier Rodriguez-Fuentes, German Bordel, and Mireia Diez: “GTTS System for the Spoken Web Search Task at MediaEval 2012”
Igor Szoke, Michal Fapšo, and Karel Veselý: “BUT 2012 Approaches for Spoken Web Search - MediaEval 2012”
Aren Jansen, Benjamin Van Durme, and Pascal Clark: “The JHU-HLTCOE Spoken Web Search System for MediaEval 2012”
Xavier Anguera (TID): “Telefonica Research System for the Spoken Web Search task at Mediaeval 2012”
12. Summary of (Primary) Results
Team        System                                Type        Dev     Eval
CUHK        cuhk_phnrecgmmasm_p-fusionprf_1       open        0.7824  0.7430
CUHK        cuhk_spch_p-gmmasmprf_1               restricted  0.6776  0.6350
L2F         l2f_12_spch_p-phonetic4_fusion_mv_1   open        0.5313  0.5195
BUT         BUT_spch_p-akws-devterms_1            open        0.4884  0.4918
BUT         BUT_spch_g-DTW-devterms_1             open        0.4426  0.4477
JHU-HLTCOE  jhu_all_spch_p-rails_1                restricted  0.3811  0.3688
TID         sws2012_IRDTW                         restricted  0.3866  0.3301
TUM         tum_spch_p-cdtw_1                     restricted  0.2628  0.2895
ARF         arf_spch_p-asrDTWAlign_w15_a08_b04    open        0.4109  0.2448
GTTS        gtts_spch_p-phone_lattice_1           open        0.0978  0.0809
TUKE        tuke_spch_p-dtwsvm                    restricted  0.0000  0.0000
13. Development data, development terms
[DET plot: Development Data, Development Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.471, 0.491, 0.253, 0.487; BUT 0.468, 0.493; CUHK 0.735, 0.751, 0.787, 0.631, 0.680; JHU-HLTCOE 0.382; L2F 0.531; TUKE 0.000; TUM 0.354, 0.337, 0.270; TID 0.390, 0.375; GTTS 0.098, 0.105]
14. Development data, evaluation terms
[DET plot: Development Data, Evaluation Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.443, 0.475, 0.016, 0.224, 0.466; BUT 0.481, 0.629; CUHK 0.769, 0.772, 0.805, 0.687, 0.686; JHU-HLTCOE 0.440; L2F 0.633; TUKE 0.000, 0.257; TUM 0.201, 0.396; TID 0.498, 0.300; GTTS 0.083, 0.109]
15. Evaluation data, development terms
[DET plot: Evaluation Data, Development Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.317, 0.339, 0.000, 0.167, 0.333; BUT 0.383, 0.429; CUHK 0.707, 0.715, 0.752, 0.561, 0.620; JHU-HLTCOE 0.336; L2F 0.486; TUKE 0.000; TUM 0.236, 0.291, 0.174; TID 0.314, 0.472; GTTS 0.070, 0.081]
16. Evaluation data, evaluation terms
[DET plot: Evaluation Data, Evaluation Terms. Miss probability (in %) vs. False Alarm probability (in %), with a Random Performance diagonal. MTWV per submission:
ARF 0.268, 0.310, 0.001, 0.120, 0.306; BUT 0.488, 0.530; CUHK 0.724, 0.742, 0.762, 0.589, 0.643; JHU-HLTCOE 0.384; L2F 0.523; TUKE 0.000; TUM 0.187, 0.164, 0.296; TID 0.342, 0.311; GTTS 0.070, 0.081]
17. Spoken Web Search Task
Summary 1
Second time around
Last year’s participants (mostly) became organizers
Grew from 5 to ca. 10 participants!
Europe, America, Asia, Africa (where are Australia and Antarctica?)
Interesting differences in performance
Thank you to all participants! It was fun & interesting.
Are the evaluation criteria useful and correct?
18. Spoken Web Search Task
Summary 2
Could talk a bit about JHU-HLTCOE’s “RAILS” system
Next steps?
Do more joint analysis (hopefully everybody’s results agree with ours)
Shared Publications? ICASSP? Journal?
Develop task further for next year?
“Speech Kitchen” idea will be presented later …
20. How to Interpret *.occ.txt File
The file lists:
Coefficients C, V
Weighting of correct vs. incorrect detections
Probability of a Term
Expectation of terms
Average and Maximum TWV
P(FA) and P(Miss)
Optimal decision score
Caveats:
Values used for padding and multi-term detections are missing
In some rare cases it lists different values for the total and only a sub-class
Was expecting more questions
21. Parameters used
The tools assume you use a “decision score”
Submit “candidates” with score lower than the cutoff
Submit “detections” with score higher than the cutoff
Enables plotting of DET curves
Can be confusing
Used different parameters for African and Indian data sets to reflect different use cases
KoefV/KoefC are debatable: what’s the cost of wrong detections, and the benefit of correct ones?
-P Probability-of-Term: how frequent are terms expected to be?
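A toy illustration of this candidates/detections convention, with hypothetical (term, time, score) tuples; the real tools read the submitted term lists, not in-memory data:

```python
# Sketch of the "decision score" convention: entries at or above the cutoff
# are hard "detections" (counted in the Actual TWV); entries below it are
# soft "candidates", still used when sweeping thresholds for the DET curve
# and the Maximum TWV. Hypothetical (term, time, score) tuples.
hyps = [("term1", 12.3, 0.91), ("term1", 40.0, 0.42), ("term2", 7.7, 0.65)]
cutoff = 0.5

detections = [h for h in hyps if h[2] >= cutoff]
candidates = [h for h in hyps if h[2] < cutoff]
print(len(detections), len(candidates))  # 2 detections, 1 candidate
```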
22. How to Interpret score.det.thresh.pdf
Can be used to analyze decision score behavior
[Threshold plot “Term Wtd. Threshold Plot for float-primary-test : ALL Data”: P(FA) (false alarms), P(Miss) (missed detections), and the resulting TWV plotted against the decision score; MaxValue 0.173 @ 1.276]
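What that plot computes can be mimicked with a small threshold sweep over toy scores; beta is simplified to a single constant here (the NIST 2006 default), not the actual SWS setting.

```python
# Sketch of what the threshold plot shows: sweep the decision score,
# track the resulting TWV, and its peak gives the Maximum TWV together
# with the optimal decision score. Toy scores, simplified beta.

def sweep(target_scores, nontarget_scores, beta=999.9):
    best = (float("-inf"), None)  # (TWV, decision score)
    for t in sorted(set(target_scores + nontarget_scores)):
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        best = max(best, (1.0 - (p_miss + beta * p_fa), t))
    return best

mtwv, theta = sweep([0.9, 0.8], [0.1], beta=1.0)
print(mtwv, theta)  # 1.0 0.8
```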