These are the slides for an invited talk at the Workshop on Machine Learning Methods for
High-Level Cognitive Capabilities in Robotics 2016 (ML-HLCR2016), held at IROS 2016, Korea.
Nonparametric Bayesian Word Discovery for Symbol Emergence in Robotics
1. Nonparametric Bayesian Word Discovery
for Symbol Emergence in Robotics
Tadahiro Taniguchi
College of Information Science & Engineering
Ritsumeikan University
Invited talk @ Workshop on Machine Learning Methods for High-Level
Cognitive Capabilities in Robotics 2016 (ML-HLCR2016),
IROS 2016, Daejeon, Korea, 13/10/2016
@tanichu
2. Contents
1. Introduction
2. Word segmentation and discovery
3. Nonparametric Bayesian double articulation
analyzer (NPB-DAA)
4. Conclusion
Tadahiro Taniguchi, Takayuki Nagai, Tomoaki Nakamura, Naoto Iwahashi,
Tetsuya Ogata, and Hideki Asoh,
“Symbol Emergence in Robotics: A Survey,”
Advanced Robotics (2016). DOI: 10.1080/01691864.2016.1164622
3. Unsupervised machine learning for language
acquisition by a robot
Without any pre-existing knowledge of phonemes or
vocabulary (like human infants).
[D. Roy 2002]
[N. Iwahashi 2003]
D. K. Roy and A. P. Pentland, “Learning words from sights and sounds: a computational model,”
Cogn. Sci., vol. 26, no. 1, pp. 113–146, 2002.
N. Iwahashi, “Language acquisition through a human–robot interface by combining speech,
visual, and behavioral information,” vol. 156, pp. 109–121, 2003.
4. Tadahiro Taniguchi, Takayuki Nagai, Tomoaki Nakamura, Naoto Iwahashi, Tetsuya Ogata,
and Hideki Asoh, “Symbol Emergence in Robotics: A Survey,”
Advanced Robotics (2016). DOI: 10.1080/01691864.2016.1164622
6. Word discovery for
symbol emergence in robotics
• For a robot to acquire the many words its user
provides through human-robot interaction,
word discovery is a critical task.
[Figure: example words picked out of the user's speech: can, apple, you, this, ...]
7. Contents
1. Introduction
2. Word segmentation and discovery
3. Nonparametric Bayesian double articulation
analyzer (NPB-DAA)
4. Conclusion
8. Word segmentation
in language acquisition
When parents speak to their children, they rarely use
“isolated words,” but use continuous word sequences,
i.e. sentences.
Word segmentation is a primary task of language
acquisition.
The child has to perform word segmentation without
pre-existing knowledge of vocabulary, because children
do not have a word list before they start learning.
THISISANAPPLE
APPLEISSOSWEET
HEYLOOKATTHIS
OHYOUARESOCUTE
?????
9. Unsupervised word segmentation
Word segmentation problem
– Example:
• Thisisanapple(ðɪsɪzənæpl) -> This(ðɪs) is(ɪz) an(ən) apple(æpl)
• WATASHIWATANAKANADESU(わたしはたなかです)
-> WATASHI(わたし) WA(は) TANAKA(たなか) DESU(です)
– The task is to segment sentences into words (morphemes).
– Conventionally, this required a pre-existing language model, i.e., a
dictionary.
Unsupervised word segmentation
– No preexisting dictionaries are used.
– A nonparametric Bayesian framework for word segmentation
[Goldwater+ 09]
– Unsupervised word segmentation method based on the Nested
Pitman–Yor language model (NPYLM)
[Mochihashi+ 09].
S. Goldwater, T. L. Griffiths, and M. Johnson, “A Bayesian framework for word segmentation:
exploring the effects of context,” Cognition, vol. 112, no. 1, pp. 21–54, 2009.
D. Mochihashi, T. Yamada, and N. Ueda, “Bayesian Unsupervised Word Segmentation
with Nested Pitman-Yor Language Modeling,” ACL-IJCNLP 2009, pp. 100–108, 2009.
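For contrast with the unsupervised setting, the following is a minimal sketch of dictionary-based segmentation, i.e., the case where a language model is given in advance. The lexicon `TOY_LEXICON`, its probabilities, and the function name `segment` are illustrative assumptions and do not come from the cited papers.

```python
import math

# Hypothetical toy lexicon with made-up unigram probabilities (illustration only).
TOY_LEXICON = {"this": 0.3, "is": 0.3, "an": 0.2, "apple": 0.2}


def segment(text, lexicon, max_len=8):
    """Most probable split of `text` into lexicon words (Viterbi search)."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)           # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon and best[start] > -math.inf:
                score = best[start] + math.log(lexicon[word])
                if score > best[end]:
                    best[end], back[end] = score, start
    if best[n] == -math.inf:
        return None                # the lexicon cannot cover the input
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


print(segment("thisisanapple", TOY_LEXICON))   # ['this', 'is', 'an', 'apple']
```

Without the lexicon this search has nothing to score against, which is the gap that unsupervised word segmentation addresses.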
10. NPYLM [Mochihashi ‘09]
(Nested Pitman-Yor Language Model)
• Mochihashi et al. proposed NPYLM for unsupervised word
segmentation.
• NPYLM has a word n-gram model and a letter n-gram model,
each of which is a hierarchical Pitman-Yor language model.
• Bayesian nonparametrics.
• Efficient blocked Gibbs sampler.
D. Mochihashi, T. Yamada, and N. Ueda, “Bayesian Unsupervised Word Segmentation
with Nested Pitman-Yor Language Modeling,” ACL-IJCNLP 2009, pp. 100–108, 2009.
[Figure: iterative loop between word segmentation with the current language model (vocabulary) and updating the language model]
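The Pitman-Yor process underlying NPYLM can be illustrated with its Chinese restaurant representation. The sketch below only simulates table sizes for a single restaurant; it is not the nested, hierarchical model itself. The function name and the parameter values (`discount`, `strength`) are assumptions chosen for illustration.

```python
import random


def pitman_yor_seating(n_customers, discount=0.5, strength=1.0, seed=0):
    """Simulate table sizes under a Pitman-Yor Chinese restaurant process."""
    rng = random.Random(seed)
    tables = []                                  # tables[k] = customers at table k
    for _ in range(n_customers):
        # Unnormalised weights: (count - discount) for each existing table,
        # then (strength + discount * number_of_tables) for a new table.
        weights = [c - discount for c in tables]
        weights.append(strength + discount * len(tables))
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(tables):
            tables.append(1)                     # open a new table
        else:
            tables[k] += 1
    return tables


sizes = pitman_yor_seating(1000)
print(len(sizes), sorted(sizes, reverse=True)[:5])
```

A positive discount keeps opening new tables as data grows, which gives the heavy-tailed size distribution that suits word frequencies.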
12. Problems with Unsupervised Word
Segmentation in Word Discovery Tasks
• NPYLM presumes that the target document
(sentences) is transcribed without errors.
– If there are phoneme recognition errors, its
performance degrades dramatically.
– How to mitigate the effect of phoneme recognition
errors in word discovery is an important issue in
real-world language acquisition.
[Saffran 1996]
13. Problem and approach
Continuous speech signals
→ A single nonparametric Bayesian probabilistic generative model (unsupervised learning)
• Acoustic model: single Gaussian emission distribution with duration distribution
• Language model: word dictionary and bigram language model
14. Contents
1. Introduction
2. Word segmentation and discovery
3. Nonparametric Bayesian double articulation
analyzer (NPB-DAA)
4. Conclusion
1. Tadahiro Taniguchi, Shogo Nagasaka, Ryo Nakashima,
“Nonparametric Bayesian Double Articulation Analyzer for Direct Language
Acquisition from Continuous Speech Signals,”
IEEE Transactions on Cognitive and Developmental Systems (2016).
2. Tadahiro Taniguchi, Ryo Nakashima, Hailong Liu, and Shogo Nagasaka,
“Double Articulation Analyzer with Deep Sparse Autoencoder for Unsupervised
Word Discovery from Speech Signals,”
Advanced Robotics, vol. 30, no. 11-12, pp. 770–783 (2016).
15. Double articulation
structure in semiotic data
• Semiotic time-series data often has double
articulation
– Speech signal is a continuous and high-dimensional time-series.
– Spoken sentence is considered a sequence of phonemes.
– Phonemes are grouped into words, and people give them meanings.
Word (semantic, meaningful): How much is this?
[h a u] [m ʌ́ tʃ] [i z] [ð í s]
Phoneme (meaningless): h a u m ʌ́ tʃ i z ð í s
Speech signal (unsegmented)
Does the human brain have a special capability to analyze
double articulation structures embedded in time-series data?
16. Double Articulation Structure in Human Behavior
Working hypothesis
W H A T I S T H I S T H I S I S A P E N
[WHAT] [IS] [THIS] [THIS] [IS] [A] [PEN]
Examples: speech, motion, driving
2016/10/16
17. Double Articulation Analyzer (DAA) and its
application to non-speech time series data
Double Articulation Analyzer = sticky HDP-HMM + NPYLM
sticky HDP-HMM = nonparametric Bayesian HMM
NPYLM = nonparametric Bayesian language model for unsupervised
morphological analysis
HDP-HMM [Fox ’07]
NPYLM [Mochihashi ’09]
Tadahiro Taniguchi, Shogo Nagasaka, “Double Articulation Analyzer for Unsegmented Human Motion using
Pitman-Yor Language Model and Infinite Hidden Markov Model,” 2011 IEEE/SICE SII (2011).
Applications to human motion and driving behavior:
– Imitation learning
– Motion segmentation [Taniguchi ’11]
– Extracting driving chunks [Nagasaka ’12]
– Detecting intentional changing points [Takenaka ’12]
– Prediction [Taniguchi ’12]
– Video summarization [Takenaka ’12]
– Topic modeling [Bando ’13]
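The two-stage structure of the DAA (sticky HDP-HMM followed by NPYLM) can be pictured as follows: the frame-wise hidden states are collapsed into a sequence of "letters", and that letter sequence is then chunked into "words". The sketch below covers only the collapsing step and assumes the state labels are already available; `compress_states` and the example labels are hypothetical.

```python
from itertools import groupby


def compress_states(frame_states):
    """Collapse a frame-wise state sequence into (letter, duration) pairs."""
    return [(state, len(list(run))) for state, run in groupby(frame_states)]


# Hypothetical frame-wise labels, standing in for the output of a sticky HDP-HMM.
frames = [3, 3, 3, 1, 1, 4, 4, 4, 4, 3, 3]
segments = compress_states(frames)
letters = [state for state, _ in segments]
print(segments)   # [(3, 3), (1, 2), (4, 4), (3, 2)]
print(letters)    # [3, 1, 4, 3]
```

In this picture the `letters` list plays the role of the unsegmented character string that an unsupervised word segmenter receives.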
18. Simultaneous acquisition of
phoneme and language models
[Nakamura+ 2014] used a pre-existing phoneme model and
did not make a robot learn a phoneme model.
There are still few studies about unsupervised simultaneous
learning of phoneme and language models from speech
signals [Kamper+ 15, Lee+ 15].
Does the analysis of double articulation structure
embedded in speech signals enable a robot to obtain
phoneme and language models simultaneously?
[Figure: cues for word discovery: prosodic cues, distributional cues, co-occurrence cues]
H. Kamper, A. Jansen, and S. Goldwater, “Fully Unsupervised Small-Vocabulary Speech
Recognition Using a Segmental Bayesian Model,” in INTERSPEECH 2015, 2015.
C.-y. Lee, T. J. O’Donnell, and J. Glass, “Unsupervised Lexicon Discovery from Acoustic Input,”
Transactions of the Association for Computational Linguistics, vol. 3, pp. 389–403, 2015.
Making full use of the distributional cues directly from speech signals
19. Hierarchical Dirichlet process hidden language model
(HDP-HLM) [Taniguchi+ 16]
Tadahiro Taniguchi, Shogo Nagasaka, Ryo Nakashima, “Nonparametric Bayesian Double Articulation Analyzer for
Direct Language Acquisition from Continuous Speech Signals,” IEEE Transactions on Cognitive and Developmental
Systems (2016).
[Figure: graphical model of the HDP-HLM.
Language model (word bigram): γ^LM, α^LM, β^LM, π_i^LM (i = 1, …, ∞).
Word model (letter bigram): γ^WM, α^WM, β^WM, π_j^WM (j = 1, …, ∞), words w_i (i = 1, …, ∞).
Latent words (super-state sequence): z_1, …, z_s, …, z_S.
Latent letters: l_s1, …, l_sk, …, l_sL, with durations D_s1, …, D_sk, …, D_sL.
Acoustic model: ω_j, θ_j (j = 1, …, ∞), G, H.
Frame-wise latent variables x_1, …, x_T and observations y_1, …, y_T.
Annotations: language model (word bigram model with letter bigram model), acoustic model (phoneme model), word sequence, phoneme sequence.]
A probabilistic generative model for time-series data
having double articulation structure
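The generative story for data with double articulation structure can be sketched with a small finite model: words follow a bigram model, each word expands into letters (phonemes), each letter lasts for a sampled duration, and each frame emits an observation. This is a toy stand-in, not the HDP-HLM: the vocabulary is fixed and finite, and all parameter values, the Poisson duration, and the one-dimensional Gaussian emission are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# All values below are hypothetical toy settings, not learned parameters.
WORDS = {0: [0, 1], 1: [2, 0, 1], 2: [1, 2]}        # word -> letter (phoneme) sequence
WORD_BIGRAM = np.array([[0.1, 0.6, 0.3],             # P(next word | current word)
                        [0.5, 0.1, 0.4],
                        [0.4, 0.5, 0.1]])
LETTER_MEAN = np.array([-2.0, 0.0, 2.0])             # Gaussian emission mean per letter
LETTER_DURATION = np.array([4.0, 6.0, 5.0])          # Poisson duration rate per letter


def generate(n_words):
    """Sample words -> letters -> durations -> frame-wise observations."""
    word = int(rng.integers(len(WORDS)))
    words, letters, obs = [], [], []
    for _ in range(n_words):
        words.append(word)
        for letter in WORDS[word]:
            duration = 1 + int(rng.poisson(LETTER_DURATION[letter]))  # at least one frame
            letters.extend([letter] * duration)
            obs.extend(rng.normal(LETTER_MEAN[letter], 0.3, size=duration))
        word = int(rng.choice(len(WORDS), p=WORD_BIGRAM[word]))
    return words, letters, np.array(obs)


words, letters, y = generate(3)
print(words, len(letters), y.shape)
```

Inference runs this story in reverse: given only `y`, recover the letter boundaries, the word boundaries, and the parameters of both levels.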
20. HDP-HLM as an extension of HDP-HSMM
HDP-HLM can be regarded as an extension of HDP-
HSMM [Johnson’13]
This property helps us derive an efficient inference
procedure.
Matthew J Johnson and Alan S Willsky. Bayesian nonparametric hidden semi-markov models.
The Journal of Machine Learning Research, Vol. 14, No. 1, pp. 673–701, 2013.
[Figure: correspondence between the HDP-HLM and the HDP-HSMM (hierarchical Dirichlet process hidden semi-Markov model)]
21. Inference (Blocked Gibbs sampler)
A blocked Gibbs sampler can be derived by extending the
backward filtering-forward sampling algorithm of the HDP-HMM.
• Backward filtering
• Forward sampling
• Parameter update
The procedure is computationally very heavy.
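As a point of reference, backward filtering-forward sampling for a plain finite HMM can be sketched as below. This is a simplified stand-in: it has no durations, no word level, and a fixed number of states, so it is not the sampler used for the HDP-HLM. The function name and the toy parameters are assumptions.

```python
import numpy as np


def backward_filter_forward_sample(pi0, A, lik, rng):
    """Draw one state path from p(x_{1:T} | y_{1:T}) for a finite HMM.

    pi0: (K,) initial distribution, A: (K, K) transition matrix,
    lik: (T, K) with lik[t, k] = p(y_t | x_t = k).
    """
    T, K = lik.shape
    # Backward filtering: beta[t, k] is proportional to p(y_{t+1:T} | x_t = k).
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (lik[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()              # rescale for numerical stability
    # Forward sampling: draw x_1, then each x_t given the sampled x_{t-1}.
    path = np.empty(T, dtype=int)
    p = pi0 * lik[0] * beta[0]
    path[0] = rng.choice(K, p=p / p.sum())
    for t in range(1, T):
        p = A[path[t - 1]] * lik[t] * beta[t]
        path[t] = rng.choice(K, p=p / p.sum())
    return path


rng = np.random.default_rng(0)
pi0 = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1], [0.1, 0.9]])        # "sticky" toy transitions
lik = np.array([[0.9, 0.1]] * 5 + [[0.1, 0.9]] * 5)
print(backward_filter_forward_sample(pi0, A, lik, rng))
```

Sampling the whole path in one block, rather than one state at a time, is what makes the Gibbs sampler "blocked"; the cost of the backward pass grows quickly once durations and words are added to the state space.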
22. Evaluation experiment using
artificial two- or three-word sentences
built from the five Japanese vowels
Five artificial words {aioi, aue, ao, ie, uo} were prepared by concatenating the five
Japanese vowels.
30 sentences (25 two-word and 5 three-word sentences) were prepared, and
each sentence was recorded twice by four Japanese speakers.
Features: MFCC (frame size = 25 ms, shift = 10 ms, frame rate = 100 Hz)
Example sentence: aioi ao
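The framing arithmetic behind these settings can be checked with a short sketch: a 10 ms shift gives about 100 frames per second, and each frame covers 25 ms of signal. The 16 kHz sample rate is an assumption (the slides do not state it), and `frame_signal` is a hypothetical helper that only cuts frames; it does not compute MFCCs.

```python
import numpy as np

SAMPLE_RATE = 16_000          # Hz; assumed, not stated on the slide
FRAME_MS, SHIFT_MS = 25, 10   # frame size and shift from the slide


def frame_signal(signal, sample_rate=SAMPLE_RATE):
    """Split a 1-D signal into overlapping frames (one row per frame)."""
    frame_len = sample_rate * FRAME_MS // 1000     # 400 samples at 16 kHz
    shift = sample_rate * SHIFT_MS // 1000         # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - frame_len) // shift
    idx = np.arange(frame_len)[None, :] + shift * np.arange(n_frames)[:, None]
    return signal[idx]


one_second = np.zeros(SAMPLE_RATE)
frames = frame_signal(one_second)
print(frames.shape)   # (98, 400)
```

One second of audio yields 98 full frames rather than exactly 100 because the last partial frames are dropped in this sketch.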
* A separate HDP-HLM is trained for each speaker.
23. Sample of results
Compared with the conventional DAA, the NPB-DAA
discovered latent words more accurately.
The inference procedure gradually
estimated the boundaries of words and
phonemes.
Example: ao-ie-ao
24. Unsupervised word discovery
with trained phoneme recognizer
Nonparametric Bayesian Double Articulation Analyzer
(NPB-DAA) based on HDP-HLM [Taniguchi ’16]
The method could estimate language and acoustic/phoneme
models simultaneously.
The methods were compared using the adjusted Rand index
(ARI), treating word discovery as a frame-based
clustering task.
It even outperformed an off-the-shelf speech recognition
system-based method in a word discovery task.
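For readers unfamiliar with the ARI, a minimal from-scratch sketch is given below. It compares two frame-wise labelings and ignores the label names themselves. The example labels are hypothetical and are not the experimental data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two labelings of the same frames (1.0 = identical partitions)."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))          # contingency table
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                               # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


# Hypothetical frame-wise word labels: the estimate uses different label names
# but induces the same partition, so the ARI is 1.0.
truth = ["ao", "ao", "ao", "ie", "ie", "uo", "uo"]
estimate = [2, 2, 2, 0, 0, 1, 1]
print(adjusted_rand_index(truth, estimate))   # 1.0
```

Because the index depends only on which frames are grouped together, it suits unsupervised word discovery, where discovered word IDs have no fixed correspondence to the true words.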
Tadahiro Taniguchi, Shogo Nagasaka, Ryo Nakashima, “Nonparametric Bayesian Double Articulation Analyzer
for Direct Language Acquisition from Continuous Speech Signals,” IEEE Transactions on Cognitive and
Developmental Systems (2016).
Unsupervised word discovery
Speech recognition with
off-the-shelf ASR system
25. Double Articulation Analyzer with Deep Sparse Autoencoder
for Unsupervised Word Discovery from Speech Signals
A deep learning-based feature extraction method, the deep sparse
autoencoder (DSAE), was employed to improve the performance of the
NPB-DAA.
The DSAE can be trained in an unsupervised manner. Therefore, the overall
learning system remains unsupervised.
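The idea of a sparse autoencoder objective can be sketched for a single layer as below; a deep version would stack several such layers. This is not the DSAE configuration of the cited paper: the sigmoid code, the KL sparsity penalty, the layer sizes, and the values of `rho` and `beta` are all assumptions, and only the loss is computed (no training loop).

```python
import numpy as np


def sparse_autoencoder_loss(X, W_enc, b_enc, W_dec, b_dec, rho=0.05, beta=1.0):
    """Reconstruction error plus a KL sparsity penalty for one autoencoder layer."""
    H = 1.0 / (1.0 + np.exp(-(X @ W_enc + b_enc)))     # sigmoid hidden code
    X_hat = H @ W_dec + b_dec                          # linear reconstruction
    recon = np.mean(np.sum((X - X_hat) ** 2, axis=1))
    rho_hat = np.clip(H.mean(axis=0), 1e-8, 1 - 1e-8)  # mean activation per unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl, H


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                 # stand-in for 12-dim acoustic features
W_enc, b_enc = rng.normal(scale=0.1, size=(12, 8)), np.zeros(8)
W_dec, b_dec = rng.normal(scale=0.1, size=(8, 12)), np.zeros(12)
loss, code = sparse_autoencoder_loss(X, W_enc, b_enc, W_dec, b_dec)
print(round(float(loss), 3), code.shape)
```

No labels appear anywhere in the objective, which is why adding such a feature extractor keeps the overall pipeline unsupervised.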
26. Experimental results
The NPB-DAA with DSAE even outperformed
an MFCC-based off-the-shelf speech
recognition system.
Tadahiro Taniguchi, Ryo Nakashima, Hailong Liu, and Shogo Nagasaka, “Double Articulation Analyzer with Deep
Sparse Autoencoder for Unsupervised Word Discovery from Speech Signals,” Advanced Robotics (2016).
27. Contents
1. Introduction
Symbol emergence in robotics
2. Word discovery with multimodal categorization
3. Direct word discovery from speech signals
4. Conclusion
28. Conclusion
Symbol emergence in robotics (SER) was introduced.
SER is a synthetic approach to
developmental mental systems, involving
language acquisition and symbol emergence
systems.
An unsupervised machine learning method for
word discovery by robots, the NPB-DAA, was
introduced. It is based on a nonparametric
Bayesian approach.
29. Current problems and future challenges
• Current Problems
– Computational cost
• Analyzing only 60 sentences requires more than one hour.
– Speaker dependency
• Unsupervised learning from multi-speaker speech signals is
currently difficult because acoustic features differ
from speaker to speaker.
• Future Challenges
– Efficient and Fast Algorithm
• Inventing more efficient inference methods and using more
computational resources are our future directions.
– Unsupervised Speaker Adaptation
• Developing an unsupervised speaker adaptation method for
language acquisition from multi-speaker speech signals.
– Mutual learning of words, phonemes, and objects
• Phoneme acquisition performance is also expected to
improve when phonemes are learned together with objects.
30. Future challenge
Word discovery for symbol emergence in robotics
[Figure: NPB-DAA (phonemes, phonemes & words) connected via probabilistic information with multimodal object categorization, SLAM, motion primitives, affordance learning, and syntax learning]
32. Information
email: taniguchi@ci.ritsumei.ac.jp
Special Thanks
• Ritsumeikan University
• R. Nakashima, S. Nagasaka, A.
Taniguchi, K. Hayashi
• DENSO co.
• T. Bando, K. Takenaka, K. Hitomi
• Okayama Pref. Univ.
• N. Iwahashi
Visit http://www.tanichu.com/
Facebook: please search for me
Twitter: @tanichu
Acknowledgement
[Github] NPB-DAA
https://github.com/EmergentSystemLabStudent/NPB_DAA