MediaEval 2015 - The NNI Query-by-Example System for MediaEval 2015

NNI QbE system, MediaEval 2015 Workshop, Wurzen, Germany
The NNI Query-by-Example System for MediaEval 2015
Jingyong Hou¹, Van Tung Pham², Cheung-Chi Leung³, Lei Wang³, Haihua Xu², Hang Lv¹, Lei Xie¹,
Zhonghua Fu¹, Chongjia Ni³, Xiong Xiao², Hongjie Chen¹, Shaofei Zhang¹, Sining Sun¹, Yougen Yuan¹,
Pengcheng Li¹, Tin Lay Nwe³, Sunil Sivadas³, Bin Ma³, Eng Siong Chng², Haizhou Li²,³
¹ Northwestern Polytechnical University (NWPU), Xi’an, China
² Nanyang Technological University (NTU), Singapore
³ Institute for Infocomm Research (I2R), A*STAR, Singapore
Presented by Cheung-Chi Leung
Institute for Infocomm Research (I2R), A*STAR, Singapore

System Diagram

•  Score-level fusion of 66 systems
from our 3 groups:
–  15 DTW systems from NWPU
–  39 DTW systems from I2R
–  8 DTW systems and 4 SS systems
from NTU
•  Our submitted system involves:
–  DTW mainly on bottleneck features/stacked bottleneck features
–  Symbolic search (SS) using phoneme tokenizers and weighted finite-state transducers (WFSTs)
•  Highlight of this year’s system:
–  Noise robustness techniques to deal with this year's noisy data

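[Figure: system diagram. The query audio and the search audio each pass through multiple tokenizers; the tokenizer outputs feed the DTW and SS systems, whose scores are combined by intra-group and inter-group fusion to produce the results.]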
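The slide does not say how the scores of the individual systems are normalized or weighted before fusion. As a minimal sketch, assuming per-system z-normalization followed by a weighted sum (both are assumptions, and the function names `z_normalize` and `fuse` are hypothetical), score-level fusion could look like this:

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse(system_scores, weights=None):
    """Weighted sum of normalized scores.

    system_scores[i][j] is the score of system i for trial j
    (one trial = one query/utterance pair). Equal weights are
    an assumption; a real system would tune them on dev data.
    """
    n_systems = len(system_scores)
    if weights is None:
        weights = [1.0 / n_systems] * n_systems
    normalized = [z_normalize(s) for s in system_scores]
    n_trials = len(normalized[0])
    return [
        sum(w * sys_scores[j] for w, sys_scores in zip(weights, normalized))
        for j in range(n_trials)
    ]


# Two toy systems that score the same three trials on different scales.
fused = fuse([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
```

Normalizing first keeps a system with a large score range from dominating the sum. The same function can be applied twice, once within a group and once across groups, to mimic the intra-group and inter-group stages.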

Training Resources for Tokenizers
•  Tokenizers are used to convert the audio signal into:
–  bottleneck features (BNF)/stacked bottleneck features (SBNF)/posteriorgrams for DTW systems
–  phone sequences/lattices for SS systems

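Training corpora or phoneme recognizers used by each group (slide columns: NWPU, I2R, NTU; check marks are listed in the order they appear on the slide):
•  Switchboard (English): √ √ √√
•  Development languages in OpenKWS:
–  Cantonese: √ √ √
–  Pashto: √ √ √
–  Tagalog: √ √ √
–  Tamil: √ √
–  Turkish: √ √ √
–  Vietnamese: √ √
•  Fisher Spanish: √
•  HKUST Mandarin: √
•  CallHome Egyptian Arabic: √
•  SEAME (mixed Mandarin-English): √
•  MASS (Malay): √
•  BUT phoneme recognizers (Czech, Hungarian and Russian): √
Legend: the slide uses two check-mark styles, one for resources used in SS system(s) and one for resources used in DTW system(s).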

DTW Systems
•  Exact matching systems: conventional subsequence DTW; good for type 1 queries
•  Approximate matching systems to deal with type 2 and 3 queries
–  Use partial feature segments of the query for matching
–  1) Fixed-window based¹: segments of 70-90 frames shifted by 5-10 frames
–  2) Phoneme-sequence based²: segments formed by 8 consecutive phonemes (phoneme boundaries derived from phoneme recognizers)
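The matching schemes above can be sketched in a few lines. This is an illustration under stated assumptions, not the submitted implementation: the cosine distance, the step pattern, the normalization by query length, and all function names are hypothetical choices. Only the window length, shift, and the 8-phoneme span come from the slide.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two feature frames."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def subsequence_dtw(query, utterance):
    """Best match of the whole query anywhere inside the utterance.

    Returns (cost normalized by query length, end frame in utterance).
    The query may start and end at any utterance frame.
    """
    n, m = len(query), len(utterance)
    inf = float("inf")
    cost = [[inf] * m for _ in range(n)]
    for j in range(m):  # free starting point: no accumulated cost in row 0
        cost[0][j] = cosine_distance(query[0], utterance[j])
    for i in range(1, n):
        for j in range(m):
            best_prev = cost[i - 1][j]
            if j > 0:
                best_prev = min(best_prev, cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = cosine_distance(query[i], utterance[j]) + best_prev
    end = min(range(m), key=lambda j: cost[n - 1][j])  # free end point
    return cost[n - 1][end] / n, end


def fixed_windows(features, length=80, shift=5):
    """Fixed-window partial queries (slide: 70-90 frames, shift 5-10)."""
    if len(features) <= length:
        return [features]
    return [features[s:s + length]
            for s in range(0, len(features) - length + 1, shift)]


def phoneme_segments(boundaries, n_phones=8):
    """Frame spans covering n_phones consecutive phonemes.

    boundaries is a list of (start_frame, end_frame) per phoneme,
    assumed to come from a phoneme recognizer.
    """
    if len(boundaries) <= n_phones:
        return [(boundaries[0][0], boundaries[-1][1])]
    return [(boundaries[i][0], boundaries[i + n_phones - 1][1])
            for i in range(len(boundaries) - n_phones + 1)]


# Toy example: a 2-frame query that occurs at frames 1-2 of the utterance.
query = [[1.0, 0.0], [0.0, 1.0]]
utterance = [[1.0, 1.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cost, end = subsequence_dtw(query, utterance)
```

For approximate matching, each partial segment would be searched with `subsequence_dtw` in place of the full query, and the best segment score kept for the query/utterance pair.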
¹ P. Yang et al., “The NNI query-by-example system for MediaEval 2014,” in Proc. MediaEval 2014 Workshop, pp. 16-17.
² J. Hou et al., “Spoken term detection technology based on DTW,” Journal of Tsinghua University (Sci and Tech), 2015 (to be published).

Exact Matching and Approximate Matching DTW Systems
•  Fused results of 13 exact matching and 13 approximate matching
(fixed-window based) DTW systems (from the 13 SBNF/BNF
tokenizers)
minCnxe (maxTWV) on dev   Exact matching DTW   Approx. matching DTW   Exact+Approx. matching DTW
Type 1 queries            0.700 (0.293)        0.711 (0.312)          0.685 (0.314)
Type 2 queries            0.893 (0.083)        0.853 (0.112)          0.852 (0.122)
Type 3 queries            0.874 (0.124)        0.867 (0.120)          0.856 (0.135)
All queries               0.844 (0.166)        0.828 (0.179)          0.817 (0.190)

Adding Noise to Training Data for Tokenizers
•  Precautions:
–  The signal-to-noise ratio (SNR) distribution of the noise-added training data should be similar to that of the development data
–  Noise is added to only a portion (~50%) of the training data (as not all utterances in this year's data are highly noisy)
[Figure: noise segments are extracted from the QUESST dev data and added to the training data of a tokenizer; the tokenizer is then obtained by model training on this data.]
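The two precautions can be illustrated with a small sketch. It assumes that noise segments and a list of target SNR values (for example, measured on the dev data) are already available; the function names, the way an SNR is sampled, and the random selection of utterances are all hypothetical and not taken from the slides.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def add_noise(speech, noise, snr_db):
    """Mix a noise segment into speech so the result has the given SNR."""
    reps = len(speech) // len(noise) + 1
    tiled = (noise * reps)[:len(speech)]  # repeat noise to cover the utterance
    p_speech, p_noise = power(speech), power(tiled)
    if p_noise == 0:
        return list(speech)
    # Scale noise so that p_speech / (gain^2 * p_noise) = 10^(snr_db / 10).
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, tiled)]


def corrupt_corpus(utterances, noise_segments, snr_values_db,
                   fraction=0.5, seed=0):
    """Add noise to roughly `fraction` of the utterances.

    SNRs are drawn from snr_values_db, which is assumed to reflect
    the SNR distribution of the development data.
    """
    rng = random.Random(seed)
    corrupted = []
    for utt in utterances:
        if rng.random() < fraction:
            noise = rng.choice(noise_segments)
            snr_db = rng.choice(snr_values_db)
            corrupted.append(add_noise(utt, noise, snr_db))
        else:
            corrupted.append(list(utt))
    return corrupted


# Toy signals: a sine wave as "speech" and uniform random samples as "noise".
speech = [math.sin(0.1 * t) for t in range(1000)]
noise_rng = random.Random(1)
noise = [noise_rng.uniform(-1.0, 1.0) for _ in range(300)]
noisy = add_noise(speech, noise, snr_db=10.0)
```

Drawing the SNR from values observed on the dev data is one simple way to keep the two SNR distributions similar; a histogram-matching scheme would serve the same purpose.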

Adding Noise to Training Data for
Tokenizers
•  Results of an exact matching DTW system using
SBNF (tokenizer trained using Switchboard corpus)
minCnxe (maxTWV) on dev   Baseline (orig. Switchboard data)   baseline+noise1   baseline+noise2
Type 1 queries            0.762 (0.227)                       0.733 (0.258)     0.735 (0.270)

Speech Enhancement
•  A Wiener filter is used to reduce noise in the utterances¹
•  Initial results show that this leads to better DTW search performance for some tokenizers
•  Further investigation will be conducted
minCnxe (maxTWV) of exact matching DTW systems on type 1 dev queries:

Tokenizer                    baseline        w/ speech enhancement
Switchboard monophone SBNF   0.894 (0.097)   0.870 (0.110)
BUT-CZ posteriorgrams        0.931 (0.018)   0.872 (0.103)
BUT-HU posteriorgrams        0.909 (0.070)   0.857 (0.114)

¹ J. Chen, J. Benesty, Y. Huang, and T. Gaensler, "On single-channel noise reduction in the time domain," in Proc. ICASSP, 2011, pp. 277-280.

Symbolic Search Systems
•  A symbolic search system with approximate phoneme sequence matching¹ is used to handle type 2 and 3 queries
•  Key steps:
–  Represent the search audio by phone lattices and index it in WFST format
–  Represent the query audio by N-best phone sequences
–  Extract partial phone sequences from the queries
–  Search by composing the query and search WFSTs
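The idea behind partial matching can be shown without any WFST machinery. The sketch below is a plain-Python stand-in, not the method of the cited paper: it replaces lattices with a single phone sequence per utterance, replaces WFST composition with substring search, and uses a made-up score (matched length divided by hypothesis length). The names `partial_sequences`, `contains`, `search`, and the `min_len` parameter are hypothetical.

```python
def partial_sequences(phones, min_len=3):
    """All contiguous sub-sequences with at least min_len phones."""
    n = len(phones)
    return {tuple(phones[i:j])
            for i in range(n)
            for j in range(i + min_len, n + 1)}


def contains(sequence, pattern):
    """True if pattern occurs as a contiguous run inside sequence."""
    k = len(pattern)
    return any(tuple(sequence[i:i + k]) == pattern
               for i in range(len(sequence) - k + 1))


def search(nbest, utterances, min_len=3):
    """Score each utterance by the longest matched partial query sequence.

    nbest: list of phone sequences (N-best hypotheses for one query).
    utterances: dict mapping utterance id to its phone sequence.
    """
    scores = {}
    for utt_id, phones in utterances.items():
        best = 0.0
        for hyp in nbest:
            for part in partial_sequences(hyp, min_len):
                if contains(phones, part):
                    best = max(best, len(part) / len(hyp))
        scores[utt_id] = best
    return scores


# Toy query with two hypotheses and three toy search utterances.
nbest = [["k", "ae", "t", "s"], ["k", "eh", "t", "s"]]
utterances = {
    "u1": ["dh", "ax", "k", "ae", "t", "s", "ae", "t"],
    "u2": ["d", "ao", "g"],
    "u3": ["b", "ae", "t", "s"],
}
scores = search(nbest, utterances)
```

Here `u1` contains a full hypothesis, `u3` only shares the partial sequence `ae t s`, and `u2` shares nothing, which is the kind of graded evidence partial matching is meant to provide for type 2 and 3 queries.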
¹ H. Xu et al., “Language independent query-by-example spoken term detection using n-best phone sequences and partial matching,” in Proc. ICASSP, 2015, pp. 5191-5195.

Symbolic Search Systems

•  Further improvement by fusing 4 SS systems and 8 DTW systems (4 exact matching and 4 fixed-window approximate matching)
–  Different types of systems use the same 4 tokenizers
minCnxe (maxTWV) on dev   DTW (including exact+approx.)   SS              DTW + SS        Relative improvement
Type 1 queries            0.683 (0.321)                   0.871 (0.150)   0.680 (0.331)   0.4% (3.1%)
Type 2 queries            0.878 (0.098)                   0.902 (0.068)   0.831 (0.168)   5.4% (71.4%)
Type 3 queries            0.878 (0.113)                   0.934 (0.072)   0.854 (0.174)   2.7% (54.0%)
All queries               0.836 (0.177)                   0.910 (0.094)   0.809 (0.224)   3.2% (26.5%)
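The relative improvement figures are consistent with comparing the DTW + SS fusion against the DTW-only system: minCnxe is an error-style measure (lower is better) and maxTWV is a value-style measure (higher is better). A short sketch of the arithmetic, using the type 2 figures from the table (the function names are hypothetical):

```python
def relative_reduction(baseline, fused):
    """Relative improvement in percent for a lower-is-better metric."""
    return (baseline - fused) / baseline * 100


def relative_gain(baseline, fused):
    """Relative improvement in percent for a higher-is-better metric."""
    return (fused - baseline) / baseline * 100


# Type 2 queries: DTW alone versus DTW + SS.
cnxe_improvement = relative_reduction(0.878, 0.831)  # minCnxe
twv_improvement = relative_gain(0.098, 0.168)        # maxTWV
print(f"{cnxe_improvement:.1f}% ({twv_improvement:.1f}%)")
```

The large relative gains in maxTWV on type 2 and 3 queries come from small absolute baselines (0.098 and 0.113), so they should be read together with the absolute values.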

Results

•  Each group gained performance by:
–  fusing exact-matching and approximate-matching systems
–  fusing systems that use different speech preprocessing techniques (e.g. noise extraction, speech enhancement or VAD)
–  fusing systems with different tokenizers
•  Further performance gain by inter-group fusion
•  Compared with our single best exact matching DTW system, system fusion brings around 13.5% relative improvement in minCnxe (115% in maxTWV) on all query types in dev

Conclusion
•  We have described the NNI system for QUESST 2015
•  Noise robustness techniques are used to deal with the noisy condition of the data, and lead to better search performance
•  The same observations are made as last year:
–  DTW and SS systems are complementary
–  Exact matching and approximate matching systems are complementary
•  Further investigation will be conducted into speech enhancement techniques and the gain provided by BNF and SBNF

Thanks!