New Metrics for Meaningful Evaluation of
 Informally Structured Speech Retrieval

Maria Eskevich1 , Walid Magdy2,3 , Gareth J.F. Jones1,2

                    1Centre for Digital Video Processing
                  2 Centre for Next Generation Localisation
                             School of Computing
                    Dublin City University, Dublin, Ireland
       3   Qatar Computing Research Institute - Qatar Foundation
                             Doha, Qatar


                                April 3, 2012
Speech Retrieval   New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval


  Outline
         Speech Retrieval

         Speech Search Evaluation
           Mean Average Precision (MAP)
           Mean Average interpolated Precision (MAiP)
           mean Generalized Average Precision (mGAP)

         New Metrics
           Mean Average Segment Precision (MASP)
           Mean Average Segment Distance-Weighted Precision
           (MASDWP)

         Retrieval Collection

         Experimental Results

         Conclusions
  Speech Documents Diversity


                   Broadcast news:




                   Meetings:


  Speech Retrieval

         [System architecture diagram]

         Speech Collection → Automatic Speech Recognition System →
         Transcript → Segmentation → Segments → Indexing →
         Indexed Segments

         Queries (audio) → Automatic Speech Recognition System →
         Queries (text) → Information Request

         Information Request + Indexed Segments → Retrieval →
         Retrieval Results: textual segments → speech segments




  Related Work in Speech Search Evaluation


         Retrieval Units:
                 Clearly defined documents:
                         TREC SDR: Mean Average Precision (MAP)
                 Passages:
                         INEX: Mean Average interpolated Precision (MAiP)
                 Jump-in points:
                         CLEF CL-SR: mean Generalized Average Precision
                         (mGAP)


  Mean Average interpolated Precision (MAiP)

         Task: passage text retrieval.
         Document relevance is not judged in a binary way.
         Precision at rank r: fraction of the retrieved characters
         that are relevant.

         Average interpolated Precision (AiP): average of interpolated
         precision scores calculated at 101 recall levels
         (0.00, 0.01, ..., 1.00):

                 AiP = (1/101) · Σ_{x = 0.00, 0.01, ..., 1.00} iP[x]

         Shortcomings: averaging over characters in the transcript is
         not suitable for speech tasks.
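The AiP computation can be sketched in Python. The interpolation rule used here (iP[x] is the highest precision observed at any recall ≥ x) is the standard one and is an assumption, since the slide does not spell it out:

```python
def interpolated_precision(points):
    """points: (recall, precision) pairs observed down a ranked list.
    iP[x] = max precision over all points with recall >= x,
    or 0 if the run never reaches recall x."""
    def ip(x):
        vals = [p for r, p in points if r >= x]
        return max(vals) if vals else 0.0
    return ip

def aip(points):
    """AiP = (1/101) * sum of iP[x] for x = 0.00, 0.01, ..., 1.00."""
    ip = interpolated_precision(points)
    return sum(ip(round(0.01 * i, 2)) for i in range(101)) / 101
```

For instance, a run reaching full recall at full precision scores 1.0, while one that stops at recall 0.5 with precision 1.0 scores 51/101 ≈ 0.505, since only the 51 levels up to 0.50 contribute.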


  mean Generalized Average Precision (mGAP)

         Task: retrieval of the jump-in points in time for relevant
         content.

                 GAP = (1/n) · Σ_{r=1}^{N} P[r] · (1 − (Distance / Granularity) · 0.1)

         Shortcomings: does not take into account how much time the
         user needs to spend listening to access the relevant content.
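A minimal sketch of the per-query GAP score under this formula. The granularity value (15 seconds here) and the flooring of the penalty at zero are illustrative assumptions, not taken from the slide:

```python
def gap(scored_ranks, n, granularity=15.0):
    """Per-query Generalized Average Precision, following the slide's
    formula.  scored_ranks: (precision_at_r, distance) pairs for ranks
    pointing near a relevant jump-in point; distance is the offset
    (same unit as granularity) between the suggested and the true start
    point; n is the number of relevant jump-in points.  The precision
    credit drops by 0.1 per granularity unit of distance and is floored
    at 0 (assumption)."""
    total = 0.0
    for p, dist in scored_ranks:
        total += p * max(0.0, 1.0 - (dist / granularity) * 0.1)
    return total / n
```

An exact hit keeps full credit, a 15-second miss keeps 90% of it, and a miss beyond ten granularity units contributes nothing.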




  Time Precision Oriented Metrics



         Motivation:

              Create a metric that measures both the ranking quality and
              the segmentation quality with respect to relevance in a
              single score.

              Reflect how far the user has to listen into the segment at a
              certain rank until the relevant part actually begins.


  Mean Average Segment Precision (MASP)

         Segment Precision (SP[r]) at rank r: fraction of the total
         length of the top r retrieved segments that is relevant,
         measured in time.

         Average Segment Precision:

                 ASP = (1/n) · Σ_{r=1}^{N} SP[r] · rel(s_r)

         rel(s_r) = 1 if relevant content is present in segment s_r,
         otherwise rel(s_r) = 0.

         Difference from other metrics:
               the amount of relevant content is measured over time
               instead of text
               average segment precision (ASP) is calculated at the
               ranks of segments containing relevant content,
               rather than at fixed recall points as in MAiP
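The ASP formula can be sketched as follows; segment lengths are assumed to be given in time units, and SP[r] is computed cumulatively over the top r segments:

```python
def asp(segments, n):
    """Average Segment Precision for one query (slide formula).
    segments: (relevant_len, total_len) per retrieved segment in rank
    order, lengths in time units; n: number of relevant segments.
    SP[r] = cumulative relevant length / cumulative total length over
    the top r segments; only ranks whose segment contains relevant
    content contribute (rel(s_r) = 1)."""
    cum_rel = cum_tot = 0.0
    total = 0.0
    for rel_len, tot_len in segments:
        cum_rel += rel_len
        cum_tot += tot_len
        if rel_len > 0:          # rel(s_r) = 1
            total += cum_rel / cum_tot
    return total / n
```

On the comparative example shown later in the deck, `asp([(2, 3), (0, 5), (3, 4), (6, 6), (0, 2), (5, 10)], 4)` evaluates to ≈ 0.557, matching the slide's MASP figure for that ranked list.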


  Mean Average Segment Distance-Weighted Precision
  (MASDWP)

         Penalizes ASP results in the same way as mGAP:

                 ASDWP = (1/n) · Σ_{r=1}^{N} SP[r] · rel(s_r) · (1 − (Distance / Granularity) · 0.1)
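The same cumulative loop extends to ASDWP by discounting each contributing SP[r]. The granularity value below is an illustrative assumption:

```python
def asdwp(segments, n, granularity=15.0):
    """Average Segment Distance-Weighted Precision sketch.
    Like ASP, but each contributing SP[r] is discounted by 0.1 per
    granularity unit of distance between the segment start and the
    start of its relevant content (granularity value is an assumption).
    segments: (relevant_len, total_len, distance) per retrieved segment
    in rank order; n: number of relevant segments."""
    cum_rel = cum_tot = 0.0
    total = 0.0
    for rel_len, tot_len, dist in segments:
        cum_rel += rel_len
        cum_tot += tot_len
        if rel_len > 0:
            penalty = max(0.0, 1.0 - (dist / granularity) * 0.1)
            total += (cum_rel / cum_tot) * penalty
    return total / n
```

A segment whose relevant content starts exactly at the segment boundary keeps its full SP[r]; one starting 15 seconds in keeps 90% of it.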


  Comparative example of AP, ASP and ASDWP

         Retrieved   Rel Len /
         Segment     Total Len      AP       ASP        ASDWP
             1          2/3          1        2/3      2/3 * 1.0
             2          0/5         1/2       2/8      2/8 * 0.0
             3          3/4         2/3      5/12      5/12 * 0.9
             4          6/6         3/4     11/18     11/18 * 0.0
             5          0/2         3/5     11/20     11/20 * 0.0
             6         5/10         4/6     16/30     16/30 * 0.0

         MAP = 0.771        MASP = 0.557        MASDWP = 0.260
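The table's summary numbers can be checked with a short script: binary AP is the mean precision at the relevant ranks, and the ASDWP distance weights are taken directly from the table's last column:

```python
# Worked check of the comparative example (lengths from the slide table).
# (relevant_len, total_len) per retrieved segment, in rank order:
segs = [(2, 3), (0, 5), (3, 4), (6, 6), (0, 2), (5, 10)]
# Distance weights (1 - Distance/Granularity * 0.1) per relevant rank,
# keyed by 0-based rank index, as given in the ASDWP column:
weights = {0: 1.0, 2: 0.9, 3: 0.0, 5: 0.0}

n = sum(1 for rel, _ in segs if rel > 0)          # 4 relevant segments

# AP: precision at the ranks of relevant segments (binary relevance)
hits, ap_terms = 0, []
for r, (rel, _) in enumerate(segs, start=1):
    if rel > 0:
        hits += 1
        ap_terms.append(hits / r)
ap = sum(ap_terms) / n

# ASP / ASDWP: cumulative relevant/total length at relevant ranks
cum_rel = cum_tot = 0
asp_terms, asdwp_terms = [], []
for i, (rel, tot) in enumerate(segs):
    cum_rel += rel
    cum_tot += tot
    if rel > 0:
        sp = cum_rel / cum_tot
        asp_terms.append(sp)
        asdwp_terms.append(sp * weights[i])
asp = sum(asp_terms) / n
asdwp = sum(asdwp_terms) / n
```

Running this reproduces the slide's per-query scores: ap ≈ 0.771, asp ≈ 0.557, asdwp ≈ 0.260.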




  Test Collection


         Speech collection: AMI Corpus
                 Ca. 100 hours of data (80 hours of speech)
                 160 meetings:
                     average length – 30 minutes
                 Transcripts:
                     Manual
                     Automatic Speech Recognition (ASR), WER ≈ 30%
         Retrieval test set:
                 25 queries with text taken from PowerPoint slides provided
                 with the AMI Corpus (average length > 10 content words)
                 Manual relevance assessment


  Segmentation Methods and Retrieval Runs



                   Segmentation*:
                       Lexical cohesion based algorithms: TextTiling, C99
                       Time- and length-based algorithms:
                       time length = 60, 120, 150, 180 seconds;
                       number of words per segment = 300, 400
                       Extreme case: No segmentation
                   Retrieval system:
                       SMART extended to use language modeling

          * Manual boundaries for both types of transcript




  Score Results for 1000 Retrieved Documents (asr man)

                          Run        MAP     MAiP    MASP    MASDWP
                          c99        0.438   0.275   0.218   0.177
                          tt         0.421   0.275   0.221   0.173
                          len 300    0.416   0.287   0.248   0.181
                          len 400    0.463   0.286   0.237   0.147
                          time 120   0.428   0.296   0.256   0.196
                          time 150   0.448   0.283   0.243   0.171
                          time 180   0.473   0.300   0.246   0.163
                          time 60    0.333   0.259   0.238   0.220
                          one doc    0.686   0.109   0.085   0.009
                  one doc run: highest MAP score, but lowest score on all
                  other metrics → contradicts the user experience
                  time 60: highest MASDWP rank → the shorter average segment
                  length makes it easier to return a segment that starts
                  close to the jump-in point
Experimental Results    New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval


  Capturing Difference Between Segmentations

                        Rank   c99              time 180        time 60
                        3      —                179/179 (–)     60/60 (–)
                        4      243/243 (–)      179/179 (–)     59/59 (1)
                        5      —                180/180 (-69)   60/60 (–)
                        6      105/125 (20)     —               59/59 (-10)
                        7      157/204 (47)     179/179 (0)     59/59 (–)
                        8      107/107 (-45)    59/179          60/60 (–)
                        9      350/429 (47)     162/180 (-4)    60/60 (21)
                        10     122/122 (-11)    143/181 (–)     —

                          AP:    one doc > time 180 > c99 > time 60
                          AiP:   c99 > time 180 > time 60 > one doc
                          ASP:   time 180 > c99 > time 60 > one doc
                          ASDWP: c99 > time 180 > time 60 > one doc
Experimental Results   New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval


  Impact of Averaging Techniques
                              AiP: man < asr man;  ASP: man > asr man
         (relevant content moves down from higher ranks)
Conclusions       New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval


  Conclusions

         MAP and MAiP do not reflect the user experience of informally
         structured speech documents:
              MAP is appropriate for clearly defined documents
              MAiP works with transcript characters
         Introduced MASP and MASDWP:

              MASP: captures the amount of relevant content that
              appears at different ranks

               MASDWP: rewards runs where the segmentation algorithm
               places boundaries closer to the relevant content and these
               segments appear higher in the ranked list
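The MASDWP idea can be illustrated with a small sketch (an illustrative approximation, not the exact formula from the paper; the reciprocal distance penalty, the `scale` parameter, and the data layout are all assumptions):

```python
# Illustrative sketch of a distance-weighted precision score in the spirit
# of MASDWP. NOT the exact metric definition; the penalty shape is assumed.

def distance_weight(seg_start, jump_in, scale=30.0):
    """Weight in (0, 1]: 1.0 when the retrieved segment starts exactly at
    the jump-in point, decaying as the boundary moves away from it."""
    return 1.0 / (1.0 + abs(seg_start - jump_in) / scale)

def avg_distance_weighted_precision(ranked, relevant):
    """ranked:   list of (doc_id, segment_start_seconds) in rank order.
    relevant: dict doc_id -> jump-in point (seconds) of relevant content.
    Returns a precision-like score discounted by boundary distance."""
    score, hits = 0.0, 0
    for rank, (doc, seg_start) in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            # precision at this rank, discounted by how far the segment
            # boundary falls from the ideal jump-in point
            score += (hits / rank) * distance_weight(seg_start, relevant[doc])
    return score / len(relevant) if relevant else 0.0
```

Under a score of this shape, shorter segments (e.g. time 60) tend to start closer to the jump-in point and so are penalized less, which is consistent with the time 60 run ranking highest on MASDWP in the results table.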




         Thank you for your attention!

New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval (ECIR 2012)

  • 1. New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Maria Eskevich1 , Walid Magdy2,3 , Gareth J.F. Jones1,2 1Centre for Digital Video Processing 2 Centre for Next Generation Localisation School of Computing Dublin City University, Dublin, Ireland 3 Qatar Computing Research Institute - Qatar Foundation Doha, Qatar April, 3, 2012
  • 2. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Outline Speech Retrieval Speech Search Evaluation Mean Average Precision (MAP) Mean Average interpolated Precision (MAiP) mean Generalized Average Precision (mGAP) New Metrics Mean Average Segment Precision (MASP) Mean Average Segment Distance-Weighted Precision (MASDWP) Retrieval Collection Experimental Results Conclusions
  • 3. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Speech Documents Diversity
  • 4. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Speech Documents Diversity Broadcast news:
  • 5. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Speech Documents Diversity Broadcast news:
  • 6. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Speech Documents Diversity Broadcast news: Meetings:
  • 7. Speech Retrieval New Metrics for Meaningful Evaluation of Informally Structured Speech Retrieval Speech Documents Diversity Broadcast news: Meetings:
• 17. Speech Retrieval
Pipeline (shown as a diagram): the speech collection (audio) is passed through an Automatic Speech Recognition system to produce a transcript (text); the transcript is segmented, and the segments are indexed. A text query forms the information request; retrieval over the indexed segments returns textual segments, which map back to the speech segments presented to the user.
• 23. Related Work in Speech Search Evaluation
Retrieval units and the metrics used to evaluate them:
- Clearly defined documents (TREC SDR): Mean Average Precision (MAP)
- Passages (INEX): Mean Average interpolated Precision (MAiP)
- Jump-in points (CLEF CL-SR): mean Generalized Average Precision (mGAP)
• 26. Mean Average interpolated Precision (MAiP)
Task: passage text retrieval; document relevance is not counted in a binary way.
Precision at rank r: the fraction of retrieved characters that are relevant.
Average interpolated Precision (AiP): the average of the interpolated precision scores calculated at 101 recall levels (0.00, 0.01, ..., 1.00):
    AiP = (1/101) · Σ_{x = 0.00, 0.01, ..., 1.00} iP[x]
Shortcoming: averaging over characters in the transcript is not suitable for speech tasks.
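As a sketch of how AiP is computed from a precision-recall curve (the interpolation rule, taking the maximum precision at any recall level at or above x, follows the standard TREC/INEX convention; the function and parameter names are ours):

```python
def average_interpolated_precision(precision, recall):
    """AiP: mean of interpolated precision at 101 recall levels 0.00..1.00.

    `precision[i]` / `recall[i]` describe the precision-recall curve of one
    query's ranked list (recall non-decreasing). Interpolated precision at
    recall x is the maximum precision at any recall level >= x.
    """
    levels = [i / 100 for i in range(101)]
    ip = []
    for x in levels:
        candidates = [p for p, r in zip(precision, recall) if r >= x]
        ip.append(max(candidates) if candidates else 0.0)
    return sum(ip) / 101
```

For example, a run with precision 1.0 at recall 0.5 and 0.5 at recall 1.0 scores (51 · 1.0 + 50 · 0.5) / 101 ≈ 0.752.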
• 31. mean Generalized Average Precision (mGAP)
Task: retrieval of jump-in points in time for relevant content.
    GAP = (1/n) · Σ_{r=1}^{N} P[r] · (1 − 0.1 · Distance/Granularity)
Shortcoming: does not take into account how much time the user needs to spend listening before reaching the relevant content.
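A minimal sketch of the GAP idea, assuming each returned jump-in point may match at most one true relevant point and that the penalty term is floored at zero (the matching policy and all names are our assumptions, not the CLEF CL-SR specification):

```python
def generalized_average_precision(results, relevant_points, granularity=60.0):
    """GAP sketch: like AP, but each match is discounted by how far the
    returned jump-in point is from the true start of the relevant content.

    `results`: ranked list of returned start times (seconds).
    `relevant_points`: true jump-in points; each may be matched once.
    Penalty: 0.1 per `granularity` seconds of distance, floored at 0
    (our reading of the slide's  1 - 0.1 * Distance/Granularity  term).
    """
    n = len(relevant_points)
    remaining = list(relevant_points)
    hits, score = 0, 0.0
    for rank, start in enumerate(results, 1):
        if not remaining:
            break
        nearest = min(remaining, key=lambda t: abs(t - start))
        penalty = max(0.0, 1.0 - 0.1 * abs(nearest - start) / granularity)
        if penalty > 0.0:            # close enough to count as a match
            remaining.remove(nearest)
            hits += 1
            score += (hits / rank) * penalty
    return score / n if n else 0.0
```

An exact hit at rank 1 scores 1.0; a jump-in point ten granularity units away contributes nothing.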
• 33. Time Precision Oriented Metrics
Motivation:
- create a metric that measures both the ranking quality and the segmentation quality with respect to relevance in a single score
- reflect how far into the segment at a given rank the user has to listen before the relevant part actually begins
• 39. Mean Average Segment Precision (MASP)
Segment Precision SP[r] at rank r: the fraction of the total playback time of the top r segments that is relevant.
Average Segment Precision:
    ASP = (1/n) · Σ_{r=1}^{N} SP[r] · rel(s_r)
where rel(s_r) = 1 if relevant content is present in segment s_r and rel(s_r) = 0 otherwise, and n is the number of relevant segments.
Differences from the other metrics:
- the amount of relevant content is measured over time instead of text
- ASP is calculated at the ranks of segments containing relevant content rather than at fixed recall points as in MAiP
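As a sketch of the ASP computation, assuming SP[r] is the cumulative relevant playback time divided by the cumulative total playback time of the top r segments, consistent with the deck's worked example (function and argument names are ours):

```python
def average_segment_precision(segments):
    """ASP sketch over a ranked list of (relevant_seconds, total_seconds).

    SP[r] = cumulative relevant time in the top r segments
            / cumulative total time of the top r segments.
    ASP averages SP[r] over the ranks of segments that contain relevant
    content; n here is the number of such segments (in a full evaluation
    n would be the total number of relevant segments for the query).
    """
    rel_time = tot_time = 0.0
    n = sum(1 for rel, _ in segments if rel > 0)
    asp = 0.0
    for rel, tot in segments:
        rel_time += rel
        tot_time += tot
        if rel > 0:                  # rel(s_r) = 1
            asp += rel_time / tot_time
    return asp / n if n else 0.0
```

On the deck's six-segment example this reproduces the MASP value of roughly 0.557.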
• 40. Mean Average Segment Distance-Weighted Precision (MASDWP)
Penalize ASP with an mGAP-style distance term:
    ASDWP = (1/n) · Σ_{r=1}^{N} SP[r] · rel(s_r) · (1 − 0.1 · Distance/Granularity)
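A corresponding sketch for ASDWP, under the same reading of SP[r] plus the slide's distance penalty 1 − 0.1 · Distance/Granularity; the flooring at zero and the per-segment representation of the distance are our assumptions:

```python
def average_segment_distance_weighted_precision(segments, granularity=60.0):
    """ASDWP sketch: ASP with an mGAP-style distance penalty per segment.

    `segments`: ranked list of (relevant_seconds, total_seconds, distance),
    where distance is from the segment start to the start of the relevant
    content (None for non-relevant segments).
    """
    rel_time = tot_time = 0.0
    n = sum(1 for rel, _, _ in segments if rel > 0)
    score = 0.0
    for rel, tot, dist in segments:
        rel_time += rel
        tot_time += tot
        if rel > 0:
            penalty = max(0.0, 1.0 - 0.1 * dist / granularity)
            score += (rel_time / tot_time) * penalty
    return score / n if n else 0.0
```

A segment whose relevant content starts exactly at the segment boundary keeps its full SP[r]; one that starts ten granularity units in contributes nothing.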
• 51. Comparative example of AP, ASP and ASDWP
Six retrieved segments, four of which contain relevant content; the AP column gives precision at rank, the ASP column gives SP[r], and the averages are taken over the ranks of the relevant segments.

Rank  Rel Len / Total Len   AP    ASP     ASDWP
1     2/3                   1     2/3     2/3 × 1.0
2     0/5                   1/2   2/8     2/8 × 0.0
3     3/4                   2/3   5/12    5/12 × 0.9
4     6/6                   3/4   11/18   11/18 × 0.0
5     0/2                   3/5   11/20   11/20 × 0.0
6     5/10                  4/6   16/30   16/30 × 0.0

MAP = 0.771, MASP = 0.557, MASDWP = 0.260
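The example can be checked numerically; this short script recomputes the three scores from the table above using exact fractions (variable names are ours):

```python
from fractions import Fraction as F

# The six retrieved segments of the example:
# (relevant length, total length, distance penalty for the jump-in point)
segs = [(2, 3, F(1)), (0, 5, None), (3, 4, F(9, 10)),
        (6, 6, F(0)), (0, 2, None), (5, 10, F(0))]

n_rel = sum(1 for rel, _, _ in segs if rel > 0)  # 4 relevant segments
rel_cum = tot_cum = hits = 0
ap = asp = asdwp = F(0)
for rank, (rel, tot, pen) in enumerate(segs, 1):
    rel_cum += rel
    tot_cum += tot
    if rel > 0:
        hits += 1
        ap += F(hits, rank)          # document-level precision at rank
        sp = F(rel_cum, tot_cum)     # segment precision SP[r]
        asp += sp
        asdwp += sp * pen

print(f"MAP={float(ap / n_rel):.3f} "
      f"MASP={float(asp / n_rel):.3f} "
      f"MASDWP={float(asdwp / n_rel):.3f}")
# prints MAP=0.771 MASP=0.557 MASDWP=0.260
```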
• 53. Test Collection
Speech collection: AMI Corpus
- ca. 100 hours of data (80 hours of speech)
- 160 meetings; average length 30 minutes
Transcripts:
- manual
- Automatic Speech Recognition (ASR), WER ≈ 30%
Retrieval test set:
- 25 queries, with text taken from the PowerPoint slides provided with the AMI Corpus (average length > 10 content words)
- manual relevance assessment
• 54. Segmentation Methods and Retrieval Runs
Segmentation*:
- lexical-cohesion-based algorithms: TextTiling, C99
- time- and length-based algorithms: segment duration = 60, 120, 150, 180 seconds; words per segment = 300, 400
- extreme case: no segmentation
Retrieval system: SMART, extended to use language modeling
* Manual boundaries for both types of transcript
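The time-based runs can be illustrated with a minimal fixed-window segmenter over a time-stamped transcript; this is a sketch under our own assumptions about the input format, not the actual experimental code (the TextTiling/C99 and length-based runs used different boundary rules):

```python
def time_segments(words, window=60.0):
    """Cut a time-stamped transcript into fixed-duration windows.

    `words` is a list of (start_time_in_seconds, token) pairs in time
    order. Mirrors the deck's time-based segmentation with window
    durations of 60/120/150/180 seconds.
    """
    segments, current, boundary = [], [], window
    for start, token in words:
        if start >= boundary:
            if current:
                segments.append(current)
                current = []
            while start >= boundary:   # skip over silent windows
                boundary += window
        current.append(token)
    if current:
        segments.append(current)
    return segments
```

For example, tokens starting at 0 s, 30 s, 61 s, and 125 s fall into three 60-second segments.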
• 60. Scores: Results for 1000 retrieved documents (asr man transcript)

Run       MAP    MAiP   MASP   MASDWP
c99       0.438  0.275  0.218  0.177
tt        0.421  0.275  0.221  0.173
len 300   0.416  0.287  0.248  0.181
len 400   0.463  0.286  0.237  0.147
time 120  0.428  0.296  0.256  0.196
time 150  0.448  0.283  0.243  0.171
time 180  0.473  0.300  0.246  0.163
time 60   0.333  0.259  0.238  0.220
one doc   0.686  0.109  0.085  0.009

Observations:
- one doc: the highest MAP, but the lowest score on every other metric; this contradicts the user experience
- time 60: the highest MASDWP; a shorter average segment length makes it easier to place the segment start close to the jump-in point
• 63. Capturing Difference Between Segmentations
Relevant length / segment length for the segments retrieved at ranks 3-10, with the distance to the jump-in point in parentheses:

Rank  c99            time 180       time 60
3     –              179/179 (–)    60/60 (–)
4     243/243 (–)    179/179 (–)    59/59 (1)
5     –              180/180 (-69)  60/60 (–)
6     105/125 (20)   –              59/59 (-10)
7     157/204 (47)   179/179 (0)    59/59 (–)
8     107/107 (-45)  59/179         60/60 (–)
9     350/429 (47)   162/180 (-4)   60/60 (21)
10    122/122 (-11)  143/181 (–)    –

Resulting metric orderings:
- AP: one doc > time 180 > c99 > time 60
- AiP: c99 > time 180 > time 60 > one doc
- ASP: time 180 > c99 > time 60 > one doc
- ASDWP: c99 > time 180 > time 60 > one doc
• 67. Impact of Averaging Techniques
AiP: man < asr man; ASP: man > asr man (relevant content moves down from the higher ranks)
• 71. Conclusions
MAP and MAiP do not reflect the user experience of informally structured speech documents:
- MAP is appropriate for clearly defined documents
- MAiP works with transcript characters rather than playback time
Introduced MASP and MASDWP:
- MASP captures the amount of relevant content that appears at different ranks, measured over time
- MASDWP additionally rewards runs where the segmentation places boundaries closer to the relevant content and ranks those segments higher
• 72. Thank you for your attention!