Iitdmj 1

Multimedia Information Processing
(1)
Koichi Shinoda
Tokyo Institute of Technology

Outline
• Theory and implementation of statistical
speech recognition
– Hidden Markov models
– Clustering, Bayes estimation, etc.
– Speaker adaptation
• Video information retrieval

Syllabus
1. Introduction: Sound and speech
2. Speech analysis
3. Very simple speech recognition
4. Hidden Markov model (1)
5. Hidden Markov model (2)
6. Continuous speech recognition
7. Language model
8. Speaker adaptation
9. Video Information Retrieval (1)
10. Video Information Retrieval (2)

My CV
1987 Graduated from The University of Tokyo (Physics)
1989 MS from The University of Tokyo (Astronomical physics)
1989 Joined NEC Corporation. Research on speech recognition.
1997 Visiting Scholar at Bell Labs, NJ, USA (-1998)
2001 Dr. Eng. from Tokyo Institute of Technology
2001 Associate Professor at The University of Tokyo
2003 Associate Professor, Dept. of Computer Science,
Tokyo Institute of Technology
Visiting Associate Professor at
The Institute of Statistical Mathematics
2013 Professor, Dept. of Computer Science, Tokyo Tech

Research Area
Statistical Pattern Recognition (Speech, Video)
• Acoustic Modeling for speech recognition
– High speed calculation in pattern matching
– Autonomous model-size control
– Graphical Modeling
– Active learning
• Speaker Adaptation for speech recognition
– Rapid improvement from a small number of user utterances
• Robust speech recognition
– Noise, microphones, channels, ...
• Video Information Retrieval
– Highlight scene extraction from sports broadcasts
– High level feature extraction
– Event detection (Surveillance)
• Multimodal interface
– Interface for simultaneous speech and gesture input
• Social Signal Processing
– Data mining from human-human communication

Speech recognition
• Familiar from science fiction (2001: A Space Odyssey,
Blade Runner, Star Wars, …)
• Now used in car navigation, voice search, call
centers, etc.
Problems:
spontaneous speech, noisy environments,
multimodality, conversation, etc.

A brief history of speech recognition
1952: The first speech recognition system (10 digits, Bell Labs)
1952: Dynamic Programming (DP) was used in Operations Research
1968: The theory of the Hidden Markov Model (Baum)
1976: Research on speech recognition using HMMs (IBM)
1978: Commercial speech recognition system using DP matching (10 digits, N
1983: The development of HMM-based continuous speech recognition (AT&T
1980s∼: Large projects (DARPA)
1990s∼: Software for continuous speech recognition using HMMs
Speech recognition algorithm
Simple pattern matching → DP matching → HMM
Signal Processing – Extraction of good features
⇓ Computational theory, Hardware
Information-Theoretic approach – Data mining from large databases
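The "DP matching" step in the progression above is commonly realized as dynamic time warping: an unknown utterance is aligned against each stored word template by dynamic programming, and the template with the smallest accumulated distance wins. The following is a minimal sketch under stated assumptions, not the lecture's own code: the function names `dtw_distance` and `recognize` are hypothetical, features are plain 1-D numbers rather than real spectral vectors, and the local distance is an absolute difference.

```python
def dtw_distance(a, b):
    """Accumulated DP-matching (dynamic time warping) distance
    between two 1-D feature sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated distance aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Allowed moves: stretch a, stretch b, or advance both.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the template closest to the utterance."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))


if __name__ == "__main__":
    # Toy templates (made-up numbers, for illustration only).
    templates = {"up": [1, 2, 3], "down": [3, 2, 1]}
    print(recognize([1, 2, 2, 3], templates))
```

The time-stretched input `[1, 2, 2, 3]` still matches the `up` template exactly, which is the point of DP matching over simple frame-by-frame pattern matching: it tolerates differences in speaking rate.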

Gartner Hype Cycle for 2011
[Figure: hype cycle curve. Labeled technologies: Video Analysis for Consumer Service, Gesture Recognition, Image Recognition, Biometric Authentication Method, Speech Recognition. Annotations: "Babble", "Crash!"]

History of DARPA speech recognition benchmark tests
[Figure: word error rate (vertical axis, 1% to 100%) versus year (horizontal axis, 1988 to 2003). Speech types: read, spontaneous, conversational, broadcast. Task labels: Resource Management, ATIS, WSJ, NAB, Switchboard. Other labels: 1k, 5k, 20k, varied microphone, noisy, foreign.]
Courtesy NIST 1999 DARPA HUB-4 Report, Pallett et al.
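Word error rate, the vertical axis of the DARPA benchmark chart, is conventionally computed as the word-level edit distance (substitutions + deletions + insertions) between the recognizer output and the reference transcript, divided by the number of reference words. The sketch below illustrates that standard definition; the function name `word_error_rate` and the whitespace tokenization are assumptions for illustration, and it is not the NIST scoring tool.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length.

    Both arguments are strings; words are split on whitespace
    (an assumption made for this sketch).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[len(ref)][len(hyp)] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions count as errors, the value can exceed 100% when the hypothesis is much longer than the reference.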

A speech recognition system

My Research at NEC
• Automatic interpretation system between Japanese and English
(1989–1991)
• Large vocabulary speech recognition hardware (1993-1994)
• Speech recognition software on MS Windows (1994-1995)
• Dictation software (1998-2001)
• Robot with speech recognition function.
• Speech recognition middleware for car navigation system
• Telephone speech recognition
• Japanese-English recognition
• Speech input interface for many applications, such as presentations,
home appliances, and train transfer guides.
• Robust speech recognition with a microphone array.

Speech to speech translation system
(Japanese ⇔ English) 1989-1991
• NEC’s CI
(Computer & Communication)
• Speech recognition + machine
translation + speech synthesis
• Hardware implementation
• Demo at Telecom 91 (Geneva)
• I made English speech
recognition tools.

Large Vocabulary Speech Recognition
Device (1993-1994)
• Name: DS-1000
• Recognizes 1000 isolated words
• 2-3 million yen
• Market: hands-busy, eyes-busy tasks
– Classifying meat by quality
– Wrapping fish, vegetables
• Since CPUs were not fast enough, we designed a special LSI
• I went to the business department for 3 months
• Circuit diagram, Time chart, Simulator, etc.

Dictation Software (1998-2001)
• Smart Voice series
• Large vocabulary
continuous speech
recognition
• Database, Algorithms,
Evaluation,…
• Team leader for
acoustic model
development

What you learn in this lecture
• Even beginners can run speech recognition
– Many tools and software packages: HTK, Sphinx, Julius,
T-cubed decoder
– But they do not know how it works
– They do not know how to solve problems
Speech recognition INSIDE

Textbook
• S. Furui, "Digital Speech Processing, Synthesis,
and Recognition", Second Edition, Marcel
Dekker, 2001.
• C. M. Bishop, "Pattern Recognition and
Machine Learning", Springer, 2006.
