Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information

Simultaneous Estimation of
Self-position and Word from Noisy
Utterances and Sensory Information
Akira Taniguchi *, Tadahiro Taniguchi *,
Tetsunari Inamura **
* Ritsumeikan University, Japan
(e-mail: a.taniguchi@em.ci.ritsumei.ac.jp).
** National Institute of Informatics/The Graduate
University for Advanced Studies, Japan.
The 13th IFAC/IFIP/IFORS/IEA Symposium on Analysis, Design, and Evaluation of
Human-Machine Systems (IFAC HMS 2016 in Kyoto)

Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information
Outline
1. Background and purpose of our research
2. Research outline
3. Proposed method:
Simultaneous estimation of self-positions and words
4. Experiment1:
Learning spatial concepts
5. Experiment2:
Evaluation for word clustering
6. Experiment3:
Self-localization using learned spatial concepts
7. Conclusion
2

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
3

Background of our research
• Lexical acquisition in robots
– The robot does not have word knowledge in advance.
– The robot learns words from human speech signals.
– Estimating the identity of the speech recognition results with errors is
a very difficult problem.
• Self-localization in robots
– The robot estimates self-position using sensor information and an
environment map.
– If the robot uses local sensor only, the sequential global localization is
a very difficult problem.
4

Mutually effective utilization
• Uncertain position information
• It is estimated by Monte-Carlo
Localization (MCL).
Self-localization
• Uncertain verbal information
• It is obtained from human speech
• Phoneme/Syllable recognition
Purpose of our research
5
Lexical acquisition
Lexical acquisition
related to places
Monte-Carlo Localization
/afroqtabutibe/
/bigbuqkkais/
/waitosherfu/
“The front of TV.”
/zafroqtabutibe/ ??
/zafrontobutibi/ ??

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
6

Research outline (1/4)
7
Place of learning target

8
Teaching
“white shelf”
“a front of TV”
Teachings of multiple
/afurantotibi/ ?
/waitosherufu/ ?
Ambiguous speech recognition

9
/afurantobutibi/
/bigusheufu/
/waitosherufu/

10
“a front of TV”
After
Modification of self-localization
/furontabutebi/ ?
“Where is .
this place?”
The robot can narrow down the hypothesis of self-position.
Before

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
11

Simultaneous estimation of
Self-positions and words
12
Probabilistic self-localization algorithm
based on particle filter
Monte Carlo localization（MCL）
• Place name
• Spatial area (position distribution)
Spatial concept
W1：/sherufu/
W4：/terebi/
W3：/teebou/
W2：
/torashuboqs/
Ex) spatial concepts
• The proposed method is based on a Bayesian probability approach.
• The robot can learn place names from the user’s speech signals without
previous word knowledge
• The robot can reduce the estimation error of self-position by using learned
word information.

The posterior distribution is calculated as follows:
Self-localization
),,|(
),|()|()|(
),,|(
1:11:11:11:0
1
:1:1:1:0




tttt
ttttttt
tttt
Ouzxp
uxxpxOpxzp
Ouzxp
1tx

xt

xt1
1tz 1tztz
1tu tu 1tu
tC
tO
W
Σμ,
State of
spatial concept
Simultaneous estimation of
Self-positions and words
13
L
ΣμxNWO
CpCxpCOp
xOp
tt
t
t
t
CCt
C
Ct
ttt
C
tt
tt
1
),|()),LD(exp(
)(),,|(),|(
)|(





ΣμW
β：the parameter of the effect of LD
L ：the number of spatial concepts
Robot’s
position
Control data
sensor data
Speech
recognition result
Place names
Position distribution
Parameter estimation by Gibbs sampling
Levenshtein distance Gaussian distribution

We set the initial position distribution.
The mean vector 𝝁 is the uniform random number in the given range,
i.e., the range in which the robot can move on the map.
The covariance matrix 𝚺 is set as diag( 𝜎initial; 𝜎initial).
1. Initialization 𝝁, 𝚺
2. Sampling of 𝐶𝑡
without W
3. Sampling of 𝝁, 𝚺
4. Sampling of W
5. Sampling of 𝐶𝑡 with W
6. Multiple iterations of
3 - 5
Learning algorithm for spatial concept using Gibbs sampling
14

without W
4. Sampling of W
3 - 5
The conditional posterior distribution of Ct samples the state of spatial concept
Ct at each time (data) t.
In this step, the robot does not estimate the place names W.
Therefore, the sampling of Ct is performed as follows:
),|(~ CtCttt xpC 
1
2
3
15

• 位置分布の初期値を決定








initial
initial
c
c



0
0
範囲内に一様乱数
without W
4. Sampling of W
3 - 5
• どのデータがどの場所概念を表すかを推定する
• 発話された場所の座標データを使って位置分布から確率的に選択する
The conditional posterior distribution of 𝝁, 𝚺 samples the position distribution
for each state of the spatial concept c.
The sampling of 𝝁, 𝚺 is performed as follows:
),|Wishert(),|(~, 1
NNccNNccc μ  V 
mN
NNNN  ,,, Vm ：the hyperparameters of the posterior distribution
1
2
3
16

• 位置分布の初期値を決定








initial
initial
c
c



0
0
without W
4. Sampling of W
3 - 5
• どのデータがどの場所概念を表すかを推定する
• 発話された場所の座標データを使って位置分布から確率的に選択する
• 場所概念ごとにデータから位置分布を更新
• 事前分布にガウス‐ウィシャート分布を用いてガウス分布に対するベイズ
推論を行う
),|Wishert()(,|(~, 111
NNccNNccc μ  V
 mN
NNNN  ,,, Vm ：ガウスウィシャート分布のハイパーパラメータ
The conditional posterior distribution of Wc samples the place name Wc for
each state of the spatial concept c.
The sampling of Wc is performed as follows:
  )()|(~ c
Tt
cC
Ctc WpWOpW
o
t
t


1
2
3
word
wordword
p(Wc) = 1/N
17

without W
4. Sampling of W
3 - 5
• 場所概念ごとにデータから位置分布を更新
• 事前分布にガウス‐ウィシャート分布を用いてガウス分布に対するベイズ
推論を行う
),|Wishert()(,|(~, 111
 mN
：ガウスウィシャート分布のハイパーパラメータ
The conditional posterior distribution of Ct samples the state of spatial concept
Ct for each time t.
The sampling of Ct is shown as follows:
)|(),|(~ CttCtCttt WOpxpC 
1
2
3
1
3
2
word
wordword
18









initial
initial
c
c



0
0
without W
4. Sampling of W
3 - 5
),|Wishert()(,|(~, 111
 mN
Multiple iterations were performed for the process described in steps (3) to (5)
above.
1
2
3
1
3
2
word
wordword
word
wordword
word
wordword
1
3
2
4.
5.
3.
19

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
20

Experiment 1 :
21
/gomibako/
(trash bin)/siroitana/
(white shelf)
Four places were selected as learning
targets.
The user decides whether the robot
arrived at a learning target.
The teaching utterances are repeated
40 times.
The number of spatial concepts：𝐿 = 4
The number of iterations：10
Speech recognition system : Julius
(using Japanese syllables dictionary)
Microphone : SHURE PG27-USB
Conditions
/tsukue/
(desk)
500cm
500cm
The environment on SIGVerse[Inamura et al. (2010)]
Inamura, T., et al. (2010). Simulator platform that enables social interaction simulation -SIGVerse:
SocioIntelliGenesis simulator-. In IEEE/SICE International Symposium on System Integration, 212-217.
/terebimae/
(in front of the TV)

W1：
/sinitana/
W4：
/terimae/
W3：
/tsukune/
W2：
/gumiwako/
Learning result of spatial concepts
22

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
23

Experiment 2 :
24
datateachingallofnumberThe
datacorrectestimatedofnumberThe
ΕΑR 
We compared the matching rate for the estimated states
of the spatial concept Ct of teaching data and the
classification results of ground truth by a human.
Classification by human
(Ground truth)
Ct = 2
Ct = 3
Ct = 1
Ct = 4Conditions
The EAR (estimation accuracy rate) is calculated as follows:
The white boxes represent the classification of the teaching data
according to the state of the spatial concept.

25
The proposed method
EAR: 0.96
The number of estimated Ct
Figures show estimated results of the state of spatial concept in 10 trials.
Word 𝑂𝑡 only clustering
EAR: 0.81
As a result, we showed that it is possible
to improve the lexical acquisition accuracy
by performing clustering that considered
position and word information.

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
26

While the robot performs self-localization, the estimation error for each time step
is calculated as follows:
Experiment 3 : Self-localization using learned
spatial concepts
• Conventional MCL （without word information）
• The proposed method （with word information）
27
22
)()( 
 ttttt yyxxe
𝑒 𝑚： the average of 𝑒𝑡
𝑒 𝑟 ： the minimum value of the estimation error 𝛾
 
M
i
i
t
i
tt
M
i
i
t
i
tt ywyxwx )()()()(
,
：the robot’s true position

tt yx ,
Comparison
We performed an experiment for the evaluation of self-localization using
learned spatial concepts.
𝛾 is determined by the interval 0, 𝛾 , which includes values greater than or equal to 95% of 𝑒𝑡 in
all teaching times.
Two evaluation indicators

Experiment 3 : Self-localization using learned
spatial concepts
• Landmark based MCL
– If the robot looks at the landmark,
the robot gets the distance and
angle of the landmark.
– View angle of the camera：45
degree
– The number of particles：1000
– The number of landmarks：4
– The robot moves for 400 steps
28
Conditions
W1：sinitana
W4：terimae
W3：tsukune
W2：
gumiwako
When the robot arrives at the learning target,
the user says the place name of the target
place for the robot.

Situation of the experiment
29
(Video clip)

Results
The estimation errors of the proposed method were smaller than that
of MCL for two evaluation indicators 𝑒 𝑚, 𝑒 𝑟.
30
As a result, we showed that the spatial concepts enable the modification
of the global self-localization error.

Outline
2. Research outline
3. Proposed method:
4. Experiment1:
5. Experiment2:
6. Experiment3:
7. Conclusion
31

• The robot could acquire spatial concepts by integrating noisy
information from sensors and speech.
• It is possible to improve the lexical acquisition accuracy by
performing clustering that considered uncertain position
information and uncertain word information.
Lexical acquisition related to places
• The robot was able to estimate self-position by employing spatial
concepts with lower error than when using MCL.
Self-localization using spatial concepts
Conclusion
32THANK YOU FOR YOUR KIND ATTENTION.

THANK YOU FOR YOUR KIND
ATTENTION.
33

Teaching word Recognized words
/shiroitana/
(white shelf)
/shiroitano/ /shinitana/ /shiroitaga/ /shinotaga/ /shinutana/
/chinitano/ /tsutana/ /shinonitana/ /shinotana/ /shiruitana/
/tsukue/
(desk)
/tsukube/ /tsukune/ /tsukuu/ /tsukue/ /tsukume/
/tsukune/ /tsukuke/ /tsutsune/ /tsukune/ /tsukume/
/gomibako/
(trash bin)
/komiwako/ /gubiwako/ /gumiwako/ /gubiwako/ /kumiwako/
/komiwako/ /gumiwako/ /gumiwako/ /gumibako/ /gumibaku/
/terebimae/
(in front of the TV)
/terimae/ /tebimae/ /tegikunae/ /terenae/ /terimae/
/terenae/ /tezurimae/ /terimae/ /teteginae/ /teribinae/
The results of speech recognition using the
Japanese syllable dictionary
34

Smoothing
Estimation of self-positions 𝑥1:𝑇
• Monte Carlo localization : on-line algorithm
• Learning spatial concepts : off-line algorithm
35
In general, the smoothing method can provide a more accurate estimation
than the MCL of online estimation.
Self-positions 𝑥1:𝑇 are sampled by using a Monte Carlo fixed-lag smoother in
the learning phase.

x0

x1

xt 1

xt

xt 1
1z 1tz tz 1tz
 Tx
Tz


xt
robot’s true position
time
Ltx 
：the trajectories of particles ：the smoothed trajectory
Monte Carlo fixed-lag smoothing
(Particle smoother)
In general, the smoothing method can provide a more accurate estimation
than the MCL of online estimation.

Speech recognition system: Julius
• Word dictionary
– Japanese syllables only
37
To recognize speech signals as syllable sequences.
Japanese syllable dictionary
The set of 43 Japanese phonemes defined by
Acoustical Society of Japan (ASJ)’s speech database
committee is adopted by Julius.
The Julius system uses a word dictionary containing
115 Japanese syllables.
Julius dictation-kit-v4.2, http://julius.sourceforge.jp/
We assume that the robot can recognize user speech at the unit of syllables, i.e., the
robot does not have word knowledge in advance.

The clustering method using word data only Ot
（without position data 𝑥 𝑡)
2 )(~ tt CpC
3 ),(~, cccc p  
4
)()|(~ c
Tt
Cc
Ctc WpWOpW
o
t
t













5 )()|(~ tCtt CpWOpC t
38
without W
4. Sampling of W
3 - 5

Nonparametric Bayesian Spatial Concept
Acquisition method (SpCoA)
39
• This model can learn spatial concepts from continuous speech signals
• This model can learn an appropriate number of spatial concepts, depending
on the data (using nonparametric Beysian approach)
• This model can relate several places to several names.
(many-to-many correspondences between names and places)
/emajentoshitemurabo/,
/taniguchizurabo/
“Here is
Emergent
System Lab.”
/konekutiNgukorida/
(connecting corridor)
Akira Taniguchi, Tadahiro Taniguchi, and Tetsunari Inamura, "Spatial Concept Acquisition for a Mobile Robot that Integrates Self-Localization
and Unsupervised Word Discovery from Spoken Sentences", IEEE Transactions on Cognitive and Developmental Systems, 2016.

Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information

Recommended

Recommended

More Related Content

Similar to Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information

Similar to Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information (20)

More from Akira Taniguchi

More from Akira Taniguchi (7)

Recently uploaded

Recently uploaded (20)

Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information