The L2F Spoken Web Search system for Mediaeval 2012

Alberto Abad and Ramón F. Astudillo

L2F - Spoken Language Systems Lab
INESC-ID Lisboa, Portugal

alberto@l2f.inesc-id.pt

Mediaeval 2012 Workshop
Pisa, October 4, 2012

Introduction

  In this first L2F/INESC-ID participation in SWS, our main objectives were:
•  To learn
•  To have fun
•  To build a reasonable system (in an unreasonably short amount of time)

  The submitted L2F SWS system exploits hybrid ANN/HMM connectionist methods for both query tokenization and acoustic keyword search
•  Composed of the fusion of 4 phonetic-based SWS sub-systems
o  Based on our in-house ASR system named AUDIMUS
•  Each sub-system uses different language-dependent acoustic models:
o  European Portuguese (pt), Brazilian Portuguese (br), European Spanish (es) and American English (en)
•  Different detection score normalization and fusion methods were investigated
o  The submitted system applies per-query score normalization (Q-norm) and majority voting (MV) fusion

The baseline speech recognizer

  AUDIMUS is our in-house hybrid HMM/MLP speech recognizer
•  Feature extraction: Multi-stream: 26 PLP, 26 logRASTA-PLP, 28 MSG and 39 ETSI (only 8 kHz)
•  MLP: Several context input frames (13-15), 2 hidden layers (500 units) and 1 output layer.
•  Output layer size: 39 for pt, 40 for br, 30 for es, 41 for en.
•  Data: pt 115 hours (57 BN + 58 telephone); br 13 hours of BN data; es 57 hours (36 BN + 21 telephone); en 142 hours of BN data (HUB-4 96 & 97)
•  HMM topology: Single-state phonemes (3 frames minimum duration)
•  Decoder: Uses Weighted Finite-State Transducer (WFST) approach
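The "several context input frames" item can be illustrated with a minimal sketch of how an MLP input vector might be assembled from neighbouring feature frames. The function name `stack_context` and the edge handling (repeating the first and last frame) are assumptions made for illustration; the slides do not describe how AUDIMUS builds its inputs.

```python
def stack_context(frames, context=13):
    """Concatenate `context` neighbouring frames around each frame.

    frames: list of feature vectors (lists of floats), one per time step.
    Edge frames are repeated at the utterance boundaries (an assumption).
    """
    if context % 2 == 0:
        raise ValueError("context must be odd so the window is centred")
    half = context // 2
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index at the edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Toy example: 5 frames of 26 PLP coefficients, 13 frames of context
frames = [[float(t)] * 26 for t in range(5)]
mlp_inputs = stack_context(frames, context=13)
print(len(mlp_inputs), len(mlp_inputs[0]))  # prints: 5 338
```

With 26 coefficients and 13 context frames, each MLP input has 26 x 13 = 338 values.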


Spoken Query Tokenization

  For each sub-system, obtain a phonetic tokenization of the queries
•  Similar to parallel phonotactic approaches in language identification (LID)
  Use a phone-loop grammar with a phoneme minimum duration of 3 frames to obtain the 1-best phoneme chain
•  Alternative n-best hypotheses for characterizing each query were explored with unsatisfactory results
o  Other possibilities not explored (yet): lattice, CN, etc.
•  Influence of the word insertion penalty (wip) parameter on the tokenization result
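The phone-loop decoding with a minimum duration can be sketched as a small Viterbi search. This is an illustrative toy, not the WFST decoder used in AUDIMUS: the function name `phone_loop_decode`, the input layout (`log_scores[t][p]` is the log-score of phone `p` at frame `t`) and the way `wip` is applied (added once each time a phone is entered) are all assumptions.

```python
NEG_INF = float("-inf")


def phone_loop_decode(log_scores, phones, min_dur=3, wip=0.0):
    """Return the 1-best phone chain under a free phone loop.

    Each phone is expanded into `min_dur` chained stages sharing the same
    emission score; only the last stage may loop on itself or be exited.
    """
    n_frames, n_ph, last = len(log_scores), len(phones), min_dur - 1
    score = [[NEG_INF] * min_dur for _ in range(n_ph)]
    for p in range(n_ph):
        score[p][0] = wip + log_scores[0][p]
    # back[t][p][k] = (previous phone, previous stage, entered a new phone?)
    back = [[[None] * min_dur for _ in range(n_ph)]]
    for t in range(1, n_frames):
        best_exit = max(range(n_ph), key=lambda q: score[q][last])
        new = [[NEG_INF] * min_dur for _ in range(n_ph)]
        bp = [[None] * min_dur for _ in range(n_ph)]
        for p in range(n_ph):
            for k in range(min_dur):
                cands = []
                if k == 0:  # enter phone p after the best finished phone
                    cands.append((score[best_exit][last] + wip, (best_exit, last, True)))
                else:       # advance to the next stage of the same phone
                    cands.append((score[p][k - 1], (p, k - 1, False)))
                if k == last:  # stay in the last stage
                    cands.append((score[p][last], (p, last, False)))
                best, prev = max(cands, key=lambda c: c[0])
                if best > NEG_INF:
                    new[p][k] = best + log_scores[t][p]
                    bp[p][k] = prev
        score = new
        back.append(bp)
    p = max(range(n_ph), key=lambda q: score[q][last])
    if score[p][last] == NEG_INF:
        return []  # utterance shorter than the minimum duration
    k, chain = last, []
    for t in range(n_frames - 1, -1, -1):
        prev = back[t][p][k]
        if prev is None:  # reached the first frame of the path
            chain.append(phones[p])
            break
        if prev[2]:
            chain.append(phones[p])
        p, k = prev[0], prev[1]
    chain.reverse()
    return chain


# Toy input: three frames favouring "a", then three favouring "b"
fav_a, fav_b = [-0.1, -2.3], [-2.3, -0.1]
print(phone_loop_decode([fav_a] * 3 + [fav_b] * 3, ["a", "b"]))
```

A more negative `wip` makes entering a new phone more expensive and therefore yields shorter chains, which is one way the penalty can influence the tokenization.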


Spoken Query Search

  Spoken query search based on AKWS with our hybrid speech recognizer.
  Search window of 5 seconds (2.5 seconds time shift)
•  Converts the problem into a verification task (originally for with forced alignment)
•  Convenient for fusion purposes
  Equally-likely 1-gram LM with target Query and Background word:
•  Query word
o  Described by the phonetic units obtained in the previous tokenization stage
•  Background word
o  Described by the special phonetic class background/filler
o  Minimum duration set to 250 msec.
  Detection score of detected candidates computed as the average of the phonetic log-likelihood ratios of the query term
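The windowing and the detection score above can be sketched as follows. The function names are hypothetical, and the slides do not state what the log-likelihood ratio is taken against; the sketch assumes it is the background model score over the same frames as each query phone.

```python
def search_windows(duration, length=5.0, shift=2.5):
    """Return (start, end) times in seconds of overlapping search windows."""
    windows = []
    start = 0.0
    while True:
        end = min(start + length, duration)  # last window is truncated
        windows.append((start, end))
        if end >= duration:
            break
        start += shift
    return windows


def detection_score(query_loglik, background_loglik):
    """Average per-phone log-likelihood ratio (query vs. background).

    Both lists hold one log-likelihood per phone of the query term
    (assumed layout).
    """
    ratios = [q - b for q, b in zip(query_loglik, background_loglik)]
    return sum(ratios) / len(ratios)


print(search_windows(12.0))
# prints: [(0.0, 5.0), (2.5, 7.5), (5.0, 10.0), (7.5, 12.0)]
print(detection_score([-1.0, -2.0, -3.0], [-2.0, -2.5, -4.5]))  # prints: 1.0
```

Because every sub-system scores the same fixed windows, candidates from different sub-systems line up naturally, which is presumably what makes this setup convenient for fusion.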


Spoken Query Search
BG modelling with HMM/MLP

  Possible approaches
•  Re-train MLP ✖
•  Compute posterior probability of the background class depending on the other classes ✔
o  Mean probability of the top-N most likely outputs
  For the SWS system,
•  We compute the average in the likelihood domain (top-6)
o  The decoder operates in the likelihood domain, so there is no need for an ad-hoc estimation of the BG class prior
•  We use a background scale term β (exponential in the likelihood domain) to control the weight of the BG model vs. Query
•  This β scale together with the wip term strongly affects search results
o  Adjusted following a non-exhaustive greedy search
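A minimal sketch of the background score described above, for a single frame. The function name and the input (a list of per-phone likelihoods for that frame, however the decoder derives them) are assumptions. An exponent β in the likelihood domain becomes a multiplicative factor in the log domain.

```python
import math


def background_log_likelihood(frame_likelihoods, top_n=6, beta=1.0):
    """Log-score of the background/filler class for one frame.

    Averages the `top_n` largest phone likelihoods and raises the result
    to the power `beta`, i.e. beta * log(mean) in the log domain.
    """
    top = sorted(frame_likelihoods, reverse=True)[:top_n]
    mean_top = sum(top) / len(top)
    return beta * math.log(mean_top)


likelihoods = [0.5, 0.2, 0.1, 0.1, 0.05, 0.03, 0.02]
print(background_log_likelihood(likelihoods))            # beta = 1
print(background_log_likelihood(likelihoods, beta=2.0))  # stronger penalty on BG
```

Since the mean of the top-N likelihoods is below 1 in this toy input, its log is negative, so a larger β lowers the background score and makes query detections more likely.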


Score normalization, fusion and calibration

  Score normalization schemes explored:
•  Q-norm: Assume that the scores depend on the query and apply a per-query normalization
•  F-norm: Assume that the scores depend on the data file (of the collection) and apply a per-file normalization
•  Combinations (QF-norm, FQ-norm)
  Fusion schemes explored:
•  Candidate detections from the 4 parallel sub-systems are kept (or rejected) according to simple combination rules:
o  AND (all), OR (at least 1 sys) and MV (at least 2 sys)
•  Final detection score is the mean score
  Decision threshold set according to maxATWV in dev-dev
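The normalization and fusion rules can be sketched as below. Several details are assumptions, since the slides do not give them: Q-norm is taken to be zero-mean, unit-variance scaling per query; a candidate is identified by a hypothetical key `(query, file_id, window)`; and the fused score is the mean over the sub-systems that detected the candidate.

```python
from collections import defaultdict
from statistics import mean, pstdev


def q_norm(detections):
    """Normalize scores per query. detections: {(query, file_id, window): score}."""
    by_query = defaultdict(list)
    for (query, _, _), score in detections.items():
        by_query[query].append(score)
    stats = {q: (mean(s), pstdev(s)) for q, s in by_query.items()}
    normed = {}
    for key, score in detections.items():
        mu, sigma = stats[key[0]]
        normed[key] = (score - mu) / sigma if sigma > 0 else 0.0
    return normed


def fuse(systems, min_votes=2):
    """Keep candidates found by at least `min_votes` sub-systems.

    min_votes=1 gives OR, min_votes=len(systems) gives AND and
    min_votes=2 gives the MV rule listed above.
    """
    votes = defaultdict(list)
    for detections in systems:
        for key, score in detections.items():
            votes[key].append(score)
    return {k: mean(s) for k, s in votes.items() if len(s) >= min_votes}


k1, k2, k3 = ("q1", "f1", 0), ("q1", "f1", 1), ("q1", "f2", 4)
pt, br, es, en = {k1: 1.0, k2: 0.5}, {k1: 3.0}, {k3: 0.2}, {}
print(fuse([pt, br, es, en], min_votes=2))  # prints: {('q1', 'f1', 0): 2.0}
```

In the submitted configuration each sub-system's scores would first go through `q_norm`, and the normalized dictionaries would then be passed to `fuse`.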


Development experiments
Sub-systems performance


Development experiments
Score normalization strategies


Development experiments
Fusion strategies (after Q-norm)


Official L2F SWS2012 results

[Figure: three combined DET plots for the submitted system p-phonetic4_fusion_mv, term-weighted curves over all data. Axes: miss probability (in %) vs. false alarm probability (in %).]
•  dev-eval: Max Val=0.486, Scr=-0.135
•  eval-dev: Max Val=0.633, Scr=-0.435
•  eval-eval: Max Val=0.523, Scr=-0.362
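For readers unfamiliar with the metric behind these values, the following is a rough sketch of a term-weighted value (TWV) computation and of the threshold sweep that yields its maximum. It follows the usual NIST-style definition (one minus the query-averaged sum of miss probability and weighted false alarm probability). The data layout, the `n_trials` argument and the `fa_weight` value are assumptions; the slides do not state the parameters used in the evaluation.

```python
def twv(detections, n_true, n_trials, threshold, fa_weight):
    """Term-weighted value at a given decision threshold.

    detections: {query: [(score, is_hit), ...]}
    n_true:     {query: number of true occurrences}
    n_trials:   total number of trials per query (assumed constant)
    """
    losses = []
    for query, total_true in n_true.items():
        if total_true == 0:
            continue  # queries without true occurrences are skipped
        kept = [hit for score, hit in detections.get(query, []) if score >= threshold]
        n_correct = sum(1 for hit in kept if hit)
        n_false = len(kept) - n_correct
        p_miss = 1.0 - n_correct / total_true
        p_fa = n_false / (n_trials - total_true)
        losses.append(p_miss + fa_weight * p_fa)
    if not losses:
        raise ValueError("no query has true occurrences")
    return 1.0 - sum(losses) / len(losses)


def max_twv(detections, n_true, n_trials, fa_weight):
    """Sweep all observed scores as thresholds; return (best TWV, threshold)."""
    thresholds = sorted({s for dets in detections.values() for s, _ in dets})
    return max((twv(detections, n_true, n_trials, th, fa_weight), th) for th in thresholds)


toy = {"q1": [(0.9, True), (0.4, False), (0.2, True)]}
best_value, best_threshold = max_twv(toy, {"q1": 2}, n_trials=102, fa_weight=10.0)
print(best_value, best_threshold)
```

The actual ATWV is the same quantity evaluated at the threshold fixed beforehand on development data, so the gap between it and the maximum value indicates how well the scores are calibrated.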


Conclusions

  The L2F Spoken Web Search system fully exploits hybrid ANN/HMM speech recognition:
•  Query tokenization based on 1-best phonetic decoding
•  Query search based on AKWS (with no need for AM re-training)
  The submitted system is formed by the fusion of four language-dependent sub-systems:
•  Q-norm score normalization is applied to each individual sub-system
•  Fusion is done following a majority voting strategy
  The system achieves an actual ATWV score of 0.5195 in the eval-eval condition
•  Promising given the simplicity of the proposed system
•  Robust to query and collection sets
o  Best performance is achieved in a mismatched condition!
•  Reasonably well-calibrated
  Future work: focused on improved query tokenization and fusion with other types of approaches (DTW)

