Teaching Machines to Listen:
An Introduction to Automatic Speech Recognition

Outline
● Introduction
● Considerations for Working with Speech Data
● Overview of Automatic Speech Recognition (ASR) Systems
● Modeling Approaches
Introduction
whoami
• Why should you listen to me tell you about ASR?
Expectations

What we will cover
• Main components of speech recognition systems
• Unique challenges working with speech data
• Some common modeling approaches

What we will not cover
• Deep dives into specific model architectures
• Comprehensive coverage of all aspects of working with speech
• Code

Consider this an orientation to the field of speech recognition for those familiar with other types of Data Science!
Introduction
• Speech recognition is the task of recognizing speech within audio and converting it into text
• An active field since the 1950s!
• A very active research field, with substantial recent advances driven largely by novel neural network architectures
(Selected) History
• 1950s and 1960s: Focus on limited use cases; digits, phonemes, single speakers
• 1980s and 1990s: Shift to focus on statistical approaches (HMMs, etc.)
• 2000s and 2010s: Wider availability of speech recognition toolkits
• Late 2010s: End-to-end modeling approaches
Specific Challenges for Speech Recognition
• Data volume
• Data quality & characteristics
• Annotation
Considerations for Working with Speech Data
Preparing Data for Machine Learning
• Machine learning algorithms want to deal with vectors (or tensors) of floating-point numbers
• Different data types have different preprocessing requirements, as well as different vectorization strategies
Common Characteristics of Tabular Data
• Tabular data is often a mix of different field types, likely captured from various systems of reference in a wide variety of ways
• “Messiness” often comes from interpreting business context for individual fields in a given dataset
Preprocessing and Vectorizing Tabular Data
• Preprocessing often includes categorical/numerical standardization and identifying and addressing relationships between columns
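As a minimal sketch of those steps, the snippet below one-hot encodes a categorical field and z-scores a numeric one. The table, field names, and values are invented for illustration; real pipelines would lean on a library like pandas or scikit-learn.

```python
# Hypothetical toy table: one categorical and one numeric column.
rows = [
    {"color": "red", "height_cm": 170.0},
    {"color": "blue", "height_cm": 180.0},
    {"color": "red", "height_cm": 160.0},
]

# Fit simple "preprocessing parameters" from the data itself.
categories = sorted({r["color"] for r in rows})
heights = [r["height_cm"] for r in rows]
mean = sum(heights) / len(heights)
std = (sum((h - mean) ** 2 for h in heights) / len(heights)) ** 0.5

def vectorize(row):
    """One-hot encode the categorical field and z-score the numeric one,
    producing the flat float vector that ML algorithms want."""
    one_hot = [1.0 if row["color"] == c else 0.0 for c in categories]
    return one_hot + [(row["height_cm"] - mean) / std]

print(vectorize(rows[0]))  # [0.0, 1.0, 0.0] — "red", average height
```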
Common Characteristics of Text Data
• Text carries its own set of unique preprocessing challenges and data characteristics
• Characteristics
  • Language
  • Text encoding
  • Typos and word-level errors
Preprocessing and Vectorizing Text Data
• Preprocessing
  • Case and punctuation normalization
  • Frequency analysis
  • Usually map to a fixed lexicon
• Vectorization
  • Tokenization: word / subword / character
  • Vectorization: count-based vs. embedding
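The normalization and fixed-lexicon steps above can be sketched in a few lines (a minimal illustration; the tiny lexicon and the `<unk>` catch-all token here are hypothetical):

```python
import re

# A small, hypothetical fixed lexicon; real systems use tens of thousands of words.
LEXICON = {"automatic", "speech", "recognition", "is", "the", "coolest"}
UNK = "<unk>"

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation, tokenize on whitespace,
    and map out-of-lexicon words to a catch-all <unk> token."""
    text = text.lower()
    text = re.sub(r"[^a-z'\s]", " ", text)  # drop punctuation and digits
    tokens = text.split()
    return [t if t in LEXICON else UNK for t in tokens]

print(normalize("Automatic speach recognition is the COOLEST!!!"))
# ['automatic', '<unk>', 'recognition', 'is', 'the', 'coolest']
```

Note that the typo "speach" falls out of the lexicon and becomes `<unk>` — exactly the kind of word-level error listed under text characteristics.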
Common Characteristics of Image Data
• Technical characteristics
  • Size and resolution
  • Color space
  • Format / compression
Preprocessing and Vectorizing Image Data
• Preprocessing
  • Type conversion
  • Resampling
  • Re-scaling (centering)
  • Cropping
• Vectorization
  • Preprocessing produces matrices and tensors of machine-readable data!
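Two of those steps — re-scaling pixel intensities and cropping — can be sketched without any imaging library at all (the 2×2 "image" is invented; real code would use Pillow or torchvision):

```python
# Toy 2x2 grayscale "image" with 8-bit pixel values.
image = [[0, 64], [128, 255]]

def rescale(img):
    """Convert 8-bit pixel intensities (0-255) to floats in [0, 1] —
    the machine-readable matrix the model actually consumes."""
    return [[px / 255.0 for px in row] for row in img]

def center_crop(img, size):
    """Crop a size x size window from the center of the image."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]

scaled = rescale(image)
print(scaled[1][1])  # 1.0 — the brightest pixel maps to full scale
```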
Common Characteristics of Audio Data
• Technical characteristics
  • Sample rate
  • Bit rate
  • Compression / encoding
  • Audio channels
• Content characteristics
  • Number / demographics of speakers
  • Acoustic environment
  • Accent
  • Continuous vs. discrete speech
  • Language(s)
  • Dialect and vocabulary
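The technical characteristics are easy to inspect programmatically. A minimal sketch using only Python's standard-library `wave` module — the synthesized sine tone stands in for a real recording:

```python
import io
import math
import struct
import wave

SAMPLE_RATE = 16_000  # Hz; a common sample rate for speech

# Synthesize one second of a 440 Hz tone as 16-bit mono PCM.
frames = b"".join(
    struct.pack("<h", int(32767 * 0.5 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE)))
    for n in range(SAMPLE_RATE)
)

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)           # audio channels: mono
    w.setsampwidth(2)           # bit depth: 16-bit samples
    w.setframerate(SAMPLE_RATE)
    w.writeframes(frames)

# Read back the technical characteristics listed above.
buf.seek(0)
with wave.open(buf, "rb") as w:
    channels = w.getnchannels()
    sample_rate = w.getframerate()
    bit_depth = 8 * w.getsampwidth()
    duration = w.getnframes() / sample_rate

print(channels, sample_rate, bit_depth, duration)  # 1 16000 16 1.0
```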
Overview of Automatic Speech Recognition (ASR) Systems
Shape of the Automatic Speech Recognition Task
• Input is a raw audio waveform
• Output is discrete text tokens, e.g. “automatic speech recognition is the coolest area of machine learning”
Framed as a sequence-to-sequence machine learning task
10,000 ft. View of ASR System Components
Preprocessing / Feature Extraction → Acoustic Model → Language Model → “automatic speech recognition is the coolest area of machine learning”
Preprocessing and Feature Extraction
Preprocessing Audio Data
• Resampling
• Normalization
• Splitting / chunking
• Noise detection / cleaning
• Language identification / cleaning
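The first two of those steps can be sketched in pure Python (a deliberately naive illustration; production code would use a proper resampler such as librosa's or torchaudio's, which low-pass filters before decimating):

```python
def peak_normalize(samples, target=0.95):
    """Scale samples so the loudest peak sits at `target` of full scale."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)
    return [s * target / peak for s in samples]

def decimate(samples, in_rate, out_rate):
    """Naive resampling by integer decimation (keep every k-th sample).
    Real resamplers low-pass filter first to avoid aliasing."""
    assert in_rate % out_rate == 0, "sketch handles integer ratios only"
    step = in_rate // out_rate
    return samples[::step]

audio = [0.0, 0.2, -0.4, 0.1, 0.0, -0.2, 0.4, -0.1]  # toy 8-sample clip
resampled = decimate(audio, in_rate=16_000, out_rate=8_000)
normalized = peak_normalize(resampled)
print(resampled)   # [0.0, -0.4, 0.0, 0.4]
print(normalized)  # same shape, peaks scaled up to 0.95
```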
Preprocessing Text Data
“Automatic speach recognition is the coolest area of making learning!!!” → “automatic speech recognition is the coolest area of machine learning”
• The target text data (transcripts) also need to be preprocessed, normalized, etc.
• Many speech recognition systems map to a fixed output vocabulary
• Using this vocabulary, we can define the output lexicon for our downstream modeling tasks
Feature Extraction and Vectorization
• Much of the useful information for downstream tasks isn’t directly accessible from the raw audio signal
• Most modern frameworks leverage features derived from the frequency representation of the audio signal
• The human ear is more sensitive to particular frequency ranges than others; that domain knowledge is leveraged to compute the features most relevant for speech processing
• A common approach for extracting features that incorporates this domain knowledge is the computation of Mel-Frequency Cepstral Coefficients, or MFCCs
• The result is a series of discrete feature vectors which can then be fed as input to the next step in the process
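The first steps of that pipeline — slicing the waveform into short overlapping frames and computing one feature per frame — can be sketched as follows. This computes only a windowed log-energy as a simplified stand-in; real MFCC extraction adds an FFT, a mel filterbank, and a discrete cosine transform, typically via a library such as librosa or torchaudio.

```python
import math

def frame_signal(samples, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames
    (25 ms frames with a 10 ms hop, at a 16 kHz sample rate)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def log_energy(frame):
    """Hamming-windowed log energy: one toy feature per frame.
    An MFCC pipeline would apply FFT, mel filterbank, and DCT here."""
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * n / (len(frame) - 1)))
        for n, s in enumerate(frame)
    ]
    return math.log(sum(s * s for s in windowed) + 1e-10)

# 100 ms of a 300 Hz tone at 16 kHz stands in for real speech.
signal = [math.sin(2 * math.pi * 300 * n / 16_000) for n in range(1600)]
features = [log_energy(f) for f in frame_signal(signal)]
print(len(features))  # 8 frames, each reduced to a feature value
```

The output — a sequence of per-frame feature vectors — is exactly the shape of input the acoustic model expects.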
Acoustic Model
• An acoustic model provides a mapping from the feature space to an intermediate acoustic representation of discrete phonemes
• Given a discrete input of feature vectors, the acoustic model provides probabilities that a given (sequence of) features corresponds to a particular output phoneme
• As we’re ultimately interested in word output and not sequences of phonemes, an acoustic model alone is not enough
• A lexicon is leveraged by several methods to directly map sequences of phonemes to a word token hypothesis
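A toy sketch of that frame-level behavior: turn per-frame scores into a probability distribution over phonemes and pick the winner. The four-symbol phoneme inventory and the logits are invented; a real acoustic model would produce the logits from learned parameters over a full inventory of 40+ phonemes.

```python
import math

# Hypothetical mini phoneme inventory for illustration only.
PHONEMES = ["AH", "K", "S", "T"]

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_frame(logits):
    """An acoustic model emits, per feature frame, a distribution
    over phonemes; here we simply take the most likely one."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return PHONEMES[best], probs[best]

# Pretend logits for a single frame, as an acoustic model might emit.
phoneme, prob = decode_frame([0.2, 3.1, 0.5, 0.9])
print(phoneme)  # 'K' — the highest-scoring phoneme for this frame
```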
Language Model
• Output tokens from an acoustic model are largely unconstrained in the sense of logical ordering
• A language model attempts to introduce probabilistic constraints on the raw output of an acoustic model to provide more likely sequences of words
Modeling Approaches
Acoustic Models as Hidden Markov Models
• Historically, Hidden Markov Models (HMMs) have found great success in use as acoustic models for ASR systems
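As a toy illustration of the HMM machinery, the forward algorithm below computes the likelihood of an observation sequence under a small hand-specified model. The two states, two discrete "acoustic symbols", and all probabilities are invented for illustration; real HMM acoustic models have many states per phoneme and continuous emission densities.

```python
# Toy HMM: two hidden phoneme-like states ("s1", "s2") emitting two
# discrete acoustic symbols ("a", "b"). All numbers are invented.
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

def forward(observations):
    """Forward algorithm: total probability of the observation
    sequence, summed over all possible hidden state paths."""
    # Initialize with start probabilities times first emission.
    alpha = {s: start[s] * emit[s][observations[0]] for s in states}
    # Propagate: sum over predecessors, then emit the next symbol.
    for obs in observations[1:]:
        alpha = {
            s: sum(alpha[p] * trans[p][s] for p in states) * emit[s][obs]
            for s in states
        }
    return sum(alpha.values())

print(forward(["a", "b"]))  # likelihood of observing "a" then "b"
```

Decoding (finding the single best state path, and hence the best phoneme sequence) swaps the sum for a max — the Viterbi algorithm.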
N-Gram Language Models
• N-gram language models model the probability of the next token in a sequence by computing statistical transition probabilities between n-grams from a large representative corpus
• Even complex, modern approaches still rely on n-gram language models to constrain the output text to a given domain!
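A minimal bigram (n = 2) version of that idea, trained on a three-sentence toy corpus invented for illustration — real n-gram models use millions of sentences plus smoothing for unseen pairs:

```python
from collections import Counter

# A tiny training corpus; real n-gram models use millions of sentences.
corpus = [
    "automatic speech recognition is cool",
    "speech recognition is the coolest",
    "automatic speech recognition is the coolest",
]

# Count unigram and bigram occurrences across the corpus.
unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p_next(prev, word):
    """Maximum-likelihood bigram probability P(word | prev).
    Real models add smoothing so unseen pairs aren't zero."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("speech", "recognition"))  # 1.0 — always followed by it
print(p_next("is", "the"))              # 2/3 of the time in this corpus
```

This is exactly the constraint the slide describes: given acoustic-model output, sequences like "speech recognition" get boosted over improbable orderings.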
Time-Delay Neural Networks
• Time-delay neural networks seek to directly incorporate the surrounding temporal context features into the classification of each frame into a corresponding phoneme
End to end approaches
• Much research over the past decade has focused on approaching speech recognition as a single end-to-end task rather than a pipeline of separate components
• Many approaches borrow ideas or architectural designs from the speech and signal processing domain to maximize the useful information signal through the model
• Deep Speech (2014) was a popular architecture that used recurrent layers to directly model temporal relationships in the input signal
• wav2vec and wav2vec 2.0 (2019/2020) leverage product quantization to encode discrete speech representations
• This strategy, along with large-scale unsupervised model pretraining, allowed for markedly performant models when fine-tuned on relatively small datasets
• Conformer (2020) leverages both convolutional layers and attention-based transformer layers to model both localized and longer-term dependencies in input speech data
• More recently, the Whisper (2022) model from OpenAI (of ChatGPT fame) made a splash by providing a very robust model trained on over 680,000 hours of audio from captioned video
• This model uses an encoder-decoder architecture paired with a multitask training framework
10,000 ft. View of ASR System Components
The newest awesome end-to-end model → “automatic speech recognition is the coolest area of machine learning”
