The L2F Spoken Web Search system for Mediaeval 2012
1. The L2F Spoken Web Search system for Mediaeval 2012
Alberto Abad and Ramón F. Astudillo
L2F - Spoken Language Systems Lab, INESC-ID Lisboa, Portugal
alberto@l2f.inesc-id.pt
Mediaeval 2012 Workshop, Pisa, October 4, 2012
The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012 1
2. Introduction
In this first L2F/INESC-ID participation in SWS, our main objectives were:
• To learn
• To have fun
• To build a reasonable system (in an unreasonably short amount of time)
The submitted L2F SWS system exploits hybrid ANN/HMM connectionist methods for both query tokenization and acoustic keyword search
• Composed by the fusion of 4 phonetic-based SWS sub-systems
  o Based on our in-house ASR system, named AUDIMUS
• Each sub-system uses different language-dependent acoustic models:
  o European Portuguese (pt), Brazilian Portuguese (br), European Spanish (es) and American English (en)
• Different detection score normalization and fusion methods were investigated
  o The submitted system applies per-query score normalization (Q-norm) and majority voting (MV) fusion
3. The baseline speech recognizer
AUDIMUS is our in-house hybrid HMM/MLP speech recognizer
• Feature extraction: multi-stream, with 26 PLP, 26 log-RASTA-PLP, 28 MSG and 39 ETSI (only 8 kHz) features
• MLP: several context input frames (13-15), 2 hidden layers (500 units) and 1 output layer
• Output layer size: 39 for pt, 40 for br, 30 for es, 41 for en
• Data: pt 115 hours (57 BN + 58 telephone); br 13 hours of BN data; es 57 hours (36 BN + 21 telephone); en 142 hours of BN data (HUB-4 96 & 97)
• HMM topology: single-state phonemes (3 frames minimum duration)
• Decoder: uses a Weighted Finite-State Transducer (WFST) approach
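The "several context input frames" item can be illustrated with a minimal sketch of how a window of neighbouring frames might be concatenated into one MLP input vector. The function name `stack_context`, the edge handling (repeating the first and last frame) and the toy data are assumptions made for illustration, not details of AUDIMUS.

```python
def stack_context(frames, context=13):
    """Concatenate each frame with its neighbours (edge frames are repeated)."""
    if context % 2 == 0:
        raise ValueError("this sketch assumes an odd, centred context window")
    half = context // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        window = []
        for offset in range(-half, half + 1):
            # Clamp the index so the first/last frame pads the utterance edges.
            idx = min(max(t + offset, 0), last)
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy example: 5 frames of 26-dimensional features (the size of the PLP stream).
frames = [[float(t)] * 26 for t in range(5)]
mlp_input = stack_context(frames, context=13)
print(len(mlp_input), len(mlp_input[0]))  # 5 338
```

With 13 context frames and a 26-dimensional stream, each input vector has 13 x 26 = 338 values.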
4. Spoken Query Tokenization
For each sub-system, obtain a phonetic tokenization of the queries
• Similar to LID parallel phonotactic approaches
Use a phone-loop grammar with a phoneme minimum duration of 3 frames to obtain the 1-best phoneme chain
• Alternative n-best hypotheses for characterizing each query were explored, with unsatisfactory results
  o Other possibilities not explored (yet): lattice, CN, etc.
• Influence of the word insertion penalty (wip) parameter on the tokenization result
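As a rough illustration of phone-loop decoding with a minimum duration, the sketch below runs a Viterbi search over per-frame log scores and returns the 1-best phone chain. It is not the WFST decoder used in AUDIMUS: the function name, the input layout, the rule that a phone cannot directly follow itself, and the way the insertion penalty is added on every phone entry are all assumptions.

```python
NEG_INF = float("-inf")


def phone_loop_decode(log_scores, phones, min_dur=3, insertion_penalty=0.0):
    """Return the 1-best phone chain; log_scores[t][p] scores phone p at frame t."""
    n_frames, n_phones, last = len(log_scores), len(phones), min_dur - 1
    if n_frames == 0:
        return []
    # score[p][d]: best path ending in phone p after d + 1 frames (capped at min_dur).
    score = [[NEG_INF] * min_dur for _ in range(n_phones)]
    for p in range(n_phones):
        score[p][0] = log_scores[0][p] + insertion_penalty
    backptrs = []
    for t in range(1, n_frames):
        # Only phones that already lasted min_dur frames may be left.
        ranked = sorted(range(n_phones), key=lambda q: score[q][last], reverse=True)[:2]
        new = [[NEG_INF] * min_dur for _ in range(n_phones)]
        bp = [[None] * min_dur for _ in range(n_phones)]
        for p in range(n_phones):
            for d in range(min_dur):
                cands = []  # (score, previous phone, previous sub-state, new phone?)
                if d == last:  # stay in a phone that reached its minimum duration
                    cands.append((score[p][d], p, d, False))
                if d > 0:  # advance inside the minimum-duration chain
                    cands.append((score[p][d - 1], p, d - 1, False))
                if d == 0:  # enter phone p from the best different phone
                    src = next((q for q in ranked if q != p), None)
                    if src is not None:
                        cands.append((score[src][last] + insertion_penalty, src, last, True))
                if not cands:
                    continue
                best = max(cands, key=lambda c: c[0])
                new[p][d] = best[0] + log_scores[t][p]
                bp[p][d] = best[1:]
        score = new
        backptrs.append(bp)
    p = max(range(n_phones), key=lambda q: score[q][last])
    if score[p][last] == NEG_INF:
        return []  # utterance shorter than min_dur frames
    d, chain = last, []
    for bp in reversed(backptrs):
        prev_p, prev_d, entered = bp[p][d]
        if entered:
            chain.append(phones[p])
        p, d = prev_p, prev_d
    chain.append(phones[p])  # phone entered at the first frame
    return chain[::-1]


# Toy query: three frames favouring "a", then three favouring "b".
scores = [[0.0, -5.0]] * 3 + [[-5.0, 0.0]] * 3
print(phone_loop_decode(scores, ["a", "b"]))  # ['a', 'b']
```

In this sketch a more negative `insertion_penalty` yields shorter phone chains, which is one way the wip parameter can influence the tokenization result.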
5. Spoken Query Search
Spoken query search based on AKWS with our hybrid speech recognizer. Search window of 5 seconds (2.5 seconds time shift)
• Converts the problem into a verification task (originally for with forced alignment)
• Convenient for fusion purposes
Equally-likely 1-gram LM with target Query and Background word:
• Query word
  o Described by the phonetic units obtained in the previous tokenization stage
• Background word
  o Described by the special phonetic class background/filler
  o Minimum duration set to 250 msec.
Detection score of detected candidates computed as the average of the phonetic log-likelihood ratios of the query term
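The windowing and scoring described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes that each phone of a detected query contributes one log-likelihood ratio, taken here as the query-phone log-likelihood minus the background log-likelihood over the same frames; the slides do not spell out that detail.

```python
def search_windows(duration, window=5.0, shift=2.5):
    """Return (start, end) times, in seconds, of the overlapping search windows."""
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # the last window already reaches the end of the file
        start += shift
    return windows


def detection_score(query_loglik, background_loglik):
    """Average per-phone log-likelihood ratio between query and background."""
    ratios = [q - b for q, b in zip(query_loglik, background_loglik)]
    return sum(ratios) / len(ratios)


print(search_windows(12.0))
# [(0.0, 5.0), (2.5, 7.5), (5.0, 10.0), (7.5, 12.0)]
print(detection_score([-1.0, -2.0, -3.0], [-2.0, -2.0, -5.0]))  # 1.0
```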
6. Spoken Query Search
BG modelling with HMM/MLP
Possible approaches
• Re-train MLP ✖
• Compute posterior probability of the background class depending on the other classes ✔
  o Mean probability of the top-N most likely outputs
For the SWS system,
• We compute the average in the likelihood domain (top-6)
  o The decoder operates in the likelihood domain, so there is no need for an ad-hoc estimation of the BG class prior
• We use a background scale term β (exponential in the likelihood domain) to control the weight of the BG model vs. Query
• This β scale, together with the wip term, strongly affects search results
  o Adjusted following a non-exhaustive greedy search
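A minimal sketch of the background model described above, assuming the per-frame phone likelihoods are already available as a plain list. The function name and the toy values are illustrative; the β exponent is applied as a multiplication in the log domain.

```python
import math


def background_log_likelihood(frame_likelihoods, top_n=6, beta=1.0):
    """Log-likelihood of the background class for one frame."""
    # Mean of the top-N most likely phone outputs, in the likelihood domain.
    top = sorted(frame_likelihoods, reverse=True)[:top_n]
    mean_top = sum(top) / len(top)
    # Raising the likelihood to the power beta scales its logarithm by beta.
    return beta * math.log(mean_top)


# Toy frame with 8 phone likelihoods (a real frame would have 30-41 outputs).
frame = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
bg = background_log_likelihood(frame, top_n=6, beta=0.5)
```

For likelihoods below 1 (negative log values), a larger β lowers the background score and so favours the query word, while a smaller β favours the background; how this interacts with the wip term has to be tuned jointly.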
7. Score normalization, fusion and calibration
Score normalization schemes explored:
• Q-norm: assume that the scores depend on the query and apply a by-query normalization
• F-norm: assume that the scores depend on the data file (of the collection) and apply a by-file normalization
• Combinations (QF-norm, FQ-norm)
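The slides do not state which statistic the by-query and by-file normalizations use. The sketch below assumes a zero-mean, unit-variance normalization within each group, which is one common choice; the function name and the detection record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize_by(detections, key):
    """Normalize scores to zero mean and unit variance within each group.

    key="query" gives a Q-norm-like scheme, key="file" an F-norm-like one.
    """
    groups = defaultdict(list)
    for det in detections:
        groups[det[key]].append(det["score"])
    stats = {}
    for name, scores in groups.items():
        std = pstdev(scores)
        # Guard against groups with a single detection or identical scores.
        stats[name] = (mean(scores), std if std > 0 else 1.0)
    return [
        dict(det, score=(det["score"] - stats[det[key]][0]) / stats[det[key]][1])
        for det in detections
    ]


detections = [
    {"query": "q1", "file": "f1", "score": 1.0},
    {"query": "q1", "file": "f2", "score": 3.0},
    {"query": "q2", "file": "f1", "score": 5.0},
]
q_normed = normalize_by(detections, "query")
print([d["score"] for d in q_normed])  # [-1.0, 1.0, 0.0]
```

Under the same assumption, the combined schemes can be obtained by chaining two calls, one with `"query"` and one with `"file"`.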
Fusion schemes explored:
• Candidate detections from the 4 parallel sub-systems are kept (or rejected) according to simple combination rules:
  o AND (all), OR (at least 1 sys) and MV (at least 2 sys)
• Final detection score is the mean score
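The three combination rules differ only in the number of sub-systems that must agree, so they can be sketched with one function. The candidate key (query, file, window index) and the choice of averaging only over the sub-systems that produced the candidate are assumptions; the slides do not say how candidates are aligned across sub-systems.

```python
from collections import defaultdict


def fuse(system_detections, min_votes=2):
    """Keep candidates found by at least min_votes sub-systems; score = mean.

    system_detections: one dict per sub-system, mapping candidate key -> score.
    min_votes=1 is OR, min_votes=2 is MV, min_votes=len(system_detections) is AND.
    """
    votes = defaultdict(list)
    for detections in system_detections:
        for key, score in detections.items():
            votes[key].append(score)
    return {
        key: sum(scores) / len(scores)
        for key, scores in votes.items()
        if len(scores) >= min_votes
    }


# Hypothetical candidates keyed by (query, file, window index).
k1, k2, k3 = ("q1", "f1", 0), ("q1", "f2", 3), ("q2", "f1", 1)
pt = {k1: 1.0, k2: 0.5}
br = {k1: 3.0}
es = {k3: 2.0}
en = {k1: 2.0, k3: 4.0}
print(fuse([pt, br, es, en], min_votes=2))
# {('q1', 'f1', 0): 2.0, ('q2', 'f1', 1): 3.0}
```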
Decision threshold set according to maxATWV in dev-dev
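Choosing the threshold that maximizes TWV on development data can be sketched as a sweep over the observed scores. The sketch uses the term-weighted value as defined for the NIST STD 2006 evaluation, TWV = 1 - mean over queries of (P_miss + β·P_FA). The default β = 999.9 is the NIST STD 2006 value and the single `n_nontarget` count shared by all queries is a simplification; the settings used in the SWS 2012 evaluation may differ.

```python
def twv(detections, n_true, n_nontarget, threshold, beta=999.9):
    """Term-weighted value at a given threshold.

    detections: list of (query, score, is_hit); n_true: dict query -> reference count.
    """
    hits = {q: 0 for q in n_true}
    false_alarms = {q: 0 for q in n_true}
    for query, score, is_hit in detections:
        if query in n_true and score >= threshold:
            if is_hit:
                hits[query] += 1
            else:
                false_alarms[query] += 1
    losses = []
    for query, n in n_true.items():
        if n == 0:
            continue  # queries without reference occurrences are not scored
        p_miss = 1 - hits[query] / n
        p_fa = false_alarms[query] / n_nontarget
        losses.append(p_miss + beta * p_fa)
    return 1 - sum(losses) / len(losses)


def best_threshold(detections, n_true, n_nontarget, beta=999.9):
    """Return (max TWV, threshold) over all observed detection scores."""
    candidates = sorted({score for _, score, _ in detections})
    return max((twv(detections, n_true, n_nontarget, t, beta), t) for t in candidates)


dev = [("q1", 0.9, True), ("q1", 0.2, False), ("q2", 0.8, True), ("q2", 0.1, True)]
print(best_threshold(dev, {"q1": 1, "q2": 2}, n_nontarget=1000))  # (0.75, 0.8)
```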
10. Development experiments
Fusion strategies (after Q-norm)
11. Official L2F SWS2012 results
[Figure: three combined DET plots for the p-phonetic4_fusion_mv system (ALL Data and CTS Subset curves, plus random performance); x-axis: False Alarm probability (in %), y-axis: Miss probability (in %). Panels, left to right: dev-eval (Max Val=0.486, Scr=-0.135), eval-dev (Max Val=0.633, Scr=-0.435), eval-eval (Max Val=0.523, Scr=-0.362).]
12. Conclusions
The L2F Spoken Web Search system fully exploits hybrid ANN/HMM speech recognition:
• Query tokenization based on 1-best phonetic decoding
• Query search based on AKWS (with no need for AM re-training)
The submitted system is formed by the fusion of four language-dependent sub-systems:
• Q-norm score normalization is applied to each individual sub-system
• Fusion is done following a majority voting strategy
The system achieves an actual ATWV score of 0.5195 on eval-eval
• Promising given the simplicity of the proposed system
• Robust to query and collection sets
  o Best performance is achieved in a mismatched condition!
• Reasonably well-calibrated
Future work: focused on improved query tokenization and fusion with other types of approaches (DTW)
13. L2F - Spoken Language Systems Laboratory
Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa
technology from seed