Speech recognition: Art of the possible
Dominik.Lukes@ctl.ox.ac.uk @techczech
Dominik’s journey
1990–1995: Computational linguistics, Cognitive linguistics, Language teaching
1995–2008: Language teacher training, Translation, Metaphor / discourse studies
2009–present: Readability, Learning / Assistive technology, Dyslexia teacher training
Bill Gates in 2011
“The next big thing is definitely speech and voice recognition.”
What do we want to know?
What is the current state of the art?
How did we get here?
Where are we going?
Are we asking the right questions?
Tasks for speech recognition by difficulty
Select word from list
Interpret command
Type dictation
Transcribe presentation
Transcribe conversation
How we think of it vs how it is
Select word from list
Interpret command
Type dictation
Transcribe presentation
Transcribe conversation
Transcribe conversation
Transcribe presentation
Type dictation
Interpret command
Select word from list
Speech recognition approximate timeline
1950s: Select digit
1970s: Select from 1000 words
1980s: Select from large vocabulary
1990s: Dictate word by word
1997: Dictate whole sentences
2012: Transcribe YouTube video
2019: Transcribe conversation
What is the actual job of speech recognition?
What is this word?
[pʰɹɛtsɫ̩]
[pɹɛtsl]
/pretsəl/
<pretzel>
What’s the problem
aspirated /p/ at start of a stressed syllable
devoiced /r/ following /p/
labialised /r/ following /p/
dark /l/
syllabic consonant
glottal stop
It gets worse: find the missing sounds
Course on speech recognition 1993
Faster computers won’t help improve speech recognition. We need a new approach.
Dragon NaturallySpeaking released in 1997. It can recognise whole sentences.
What happened?
How speech recognition does not work
Finding individual sounds (phonemes) in the speech and matching them to letters.
How speech recognition actually works
P(W|C)
What is the likelihood that the next word is X given what came before?
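As a rough illustration of P(W|C), the sketch below estimates the probability of the next word given only the single previous word (a bigram model). The corpus, the function names and the one-word context are assumptions made for this example. Real recognisers use far richer models and combine them with the audio.

```python
from collections import Counter, defaultdict

# Tiny invented corpus; a real system learns from vastly more text.
CORPUS = [
    "i would like a pretzel",
    "i would like a coffee",
    "i would like a pretzel please",
    "she would like a tea",
]

def train_bigrams(sentences):
    """Count how often each word follows each one-word context."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for context, word in zip(words, words[1:]):
            counts[context][word] += 1
    return counts

def next_word_probability(counts, context, word):
    """Estimate P(word | context) as a relative frequency."""
    total = sum(counts[context].values())
    if total == 0:
        return 0.0  # context never seen: no estimate available
    return counts[context][word] / total

counts = train_bigrams(CORPUS)
for candidate in ("pretzel", "coffee", "tea", "bagel"):
    p = next_word_probability(counts, "a", candidate)
    print(f"P({candidate} | a) = {p:.2f}")
```

In this toy corpus "bagel" never follows "a", so it gets a probability of zero however clearly it was spoken.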
Actually, it is quite a bit more complicated (Huang and Deng 2009)
Probabilistic (stochastic) ASR enabled the change. Linguistics took a back seat.
Fred Jelinek (ASR Pioneer - 1988?)
"Every time I fire a linguist, the performance of the speech recognizer goes up"
Consequence of probabilistic approach: Worse on words not predictable from context
Names
Acronyms
Specialist terms
Question in 2011
I recorded a lecture, can I use Dragon to transcribe it?
“Caption fails” in 2014 provided a source of comedy
YouTube Captions today are usable and useful
So what happened between 2014 and 2022?
Ingredients of success
Larger data sets
More computing power
Neural networks
Patrick Winston (2015) MIT Lecture 12a in AI course
It was in 2010, yes, that's right. It was in 2010. We were having our annual discussion about what we would dump from 6.034 in order to make room for some other stuff. And we almost killed off neural nets. That might seem strange because our heads are stuffed with neurons. … But many of us felt that the neural models of the day weren't much in the way of faithful models of what actually goes on inside our heads. And besides that, nobody had ever made a neural net that was worth a darn for doing anything.
2012 – ImageNet showed that neural networks are much better at computing the probabilities for complex data.
OK, we have neural nets, what does that mean?
Things to know about Neural Nets
Everything has a probability
Same input does not produce same output
They have no ‘sanity check’ or ‘common sense’
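A toy sketch of the first two points, under stated assumptions: the candidate transcriptions and their scores are invented, `softmax` is a common way to turn scores into probabilities, and drawing at random from those probabilities is only one possible reason why the same input might give different output. The slides do not say how any particular product behaves.

```python
import math
import random

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    highest = max(scores.values())
    exps = {k: math.exp(v - highest) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

# Invented scores for three candidate transcriptions of one stretch of audio.
scores = {"dua lipa": 2.0, "dualipa": 1.5, "duda lipa": 0.5}
probabilities = softmax(scores)

for text, p in probabilities.items():
    print(f"{text!r}: {p:.2f}")

def sample(probabilities, rng):
    """Pick one candidate at random, weighted by its probability."""
    candidates = list(probabilities)
    weights = [probabilities[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

rng = random.Random()  # unseeded, so repeated runs can differ
print([sample(probabilities, rng) for _ in range(5)])
```

Every candidate keeps a non-zero probability, and nothing in the calculation checks whether the chosen text makes sense.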
What do probabilities look like?
What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models
Allyson Ettinger 2019
https://what-if.xkcd.com/34
Output changes as more information is made available. (Not always for the better)
Examples from today’s captions
Crystal > Chris is
Am > and
experts > experience
AR > a our
Different ways of transcribing Dua Lipa
alipa
dualipa
dua lipa
lipa
duda lipa
Rise and mostly fall of Google’s new spell Czech
Tracking faces at the tips of the shoes
Hallucination is a big problem
Question asked by faculty member in 2021
We correct the transcripts, why doesn’t the system learn the correct spelling?
Adding your own word list just tweaks the probabilities.
Choosing a genre setting tweaks the probabilities.
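A minimal sketch of what "tweaking the probabilities" could look like, assuming a word list simply multiplies the probability of listed candidates before renormalising. The `boost` function, the factor of 3 and all the numbers are hypothetical. Real services do not publish their exact method in these slides.

```python
def boost(probabilities, word_list, factor=3.0):
    """Multiply the probability of listed candidates, then renormalise."""
    weighted = {
        text: p * (factor if text in word_list else 1.0)
        for text, p in probabilities.items()
    }
    total = sum(weighted.values())
    return {text: w / total for text, w in weighted.items()}

def best(probabilities):
    """Return the candidate with the highest probability."""
    return max(probabilities, key=probabilities.get)

word_list = {"experts"}

# Invented case 1: the listed word was a close second, so the boost is enough.
close_call = {"experience": 0.70, "experts": 0.25, "expats": 0.05}
print(best(close_call), "->", best(boost(close_call, word_list)))

# Invented case 2: the listed word was far behind, so it still loses.
long_shot = {"experience": 0.90, "experts": 0.08, "expats": 0.02}
print(best(long_shot), "->", best(boost(long_shot, word_list)))
```

The word list shifts the odds but guarantees nothing, which is one way to read why corrected spellings do not simply stick.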
Another thing to know about NN
Neural nets use very large data sets and can take days or weeks to train.
Consequences of NN size
Speech recognition is often not done on device.
Individual input often cannot adjust the quality (except in pre-training)
Most applications use APIs from the big players
Few open source/free options
Big players in the field
Google
Microsoft (now also Nuance)
Amazon
Interesting smaller companies
Verbit.ai
Carescribe.io (Caption.Ed)
Otter.ai
Rev.ai
Interesting applications
Descript
Microsoft Reading Progress
Microsoft Presentation Coach
What can we expect in the future?
Cautionary tale by SMBC
The Original Roomba (2002) vs Roomba S9+ (2019) - Wow!
What happens in speeches
Fillers
Repetition
What does conversation actually look like?
Possible futures?
Incremental improvement (similar to Roomba in 17 years)
Accurate lecture transcripts
Fluent dictation with pauses
Better meeting transcription
Revolutionary change (similar to change in speech recognition in 6 years)
Informal conversation transcription
Interactive dictation
Multilingual speech transcription
How should we think about accuracy?
We speak 120-180 words per minute
99% accurate = roughly 1 to 2 errors per minute
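The arithmetic behind this can be sketched as follows, assuming accuracy is counted per word and errors are spread evenly. The function name, the 95% comparison and the hour-long lecture are assumptions added for illustration.

```python
def errors_per_minute(words_per_minute, accuracy):
    """Expected number of wrong words per minute at a given word accuracy."""
    return words_per_minute * (1.0 - accuracy)

for wpm in (120, 150, 180):
    for accuracy in (0.95, 0.99):
        rate = errors_per_minute(wpm, accuracy)
        print(f"{wpm} wpm at {accuracy:.0%}: {rate:.1f} errors per minute, "
              f"{rate * 60:.0f} per hour-long lecture")
```

At 99% this works out to between 1.2 and 1.8 errors per minute, which adds up to roughly 70 to 110 errors over an hour of speech.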
From Sept 2014 xkcd.com/1425
Sometimes it is hard to judge how much effort will be needed to solve a seemingly easy problem.
Wishlist (a few hours of coding)
Transcripts indicate level of confidence
Benchmarks for lecture transcripts
Better manual control of transcripts (like Descript)
Dreamlist (5 years and a research team)
Multilingual transcription (identify change in language)
Multimodal transcription (use information from video)
Raw to readable transcript
Welcome to the panel
Kate Knill, Machine Intelligence Lab, University of Cambridge
Richard Cave, MND Association (and formerly Google project Euphonia)
Richard Purcell, Caption.Ed
Irit Opher, Head of Research at Verbit.ai
What is the current state of the art of speech recognition in general and in the transcription of recorded speech in particular?
What are the current quality metrics and how much do they tell us about suitability of models? Do we need better ones?
After the big recent jump in performance, are we seeing a plateau with incremental growth or can we expect another step change in quality?
Where can we see the most innovation? What are the research and development blind spots where more effort is needed?
What are the currently unsolved problems for which we do not have a solution?
What room is there for smaller players to innovate in this space? How much do they have to rely on pre-trained models from big providers? Is there space for open source?
This presentation is licensed under a Creative Commons Attribution license except where otherwise noted.
Icons and stock images from Microsoft Office 365 creative premium. They cannot be distributed separately from this document.

Speech Recognition: Art of the possible - DigiFest 2022

Microservices, Docker deploy and Microservices source code in C#
 
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
JET Technology Labs White Paper for Virtualized Security and Encryption Techn...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 

Speech Recognition: Art of the possible - DigiFest 2022

  • 1. Speech recognition: Art of the possible Dominik.Lukes@ctl.ox.ac.uk @techczech
  • 2. Dominik’s journey Computational linguistics Cognitive linguistics Language teaching 1990–1995 Language teacher training Translation Metaphor / discourse studies 1995–2008 Readability Learning / Assistive technology Dyslexia teacher training 2009 – present
  • 3. Bill Gates in 2011 “The next big thing is definitely speech and voice recognition.”
  • 4. What do we want to know? What is the current state of the art? How did we get here? Where are we going?
  • 5. Are we asking the right questions?
  • 6. Tasks for speech recognition by difficulty Select word from list Interpret command Type dictation Transcribe presentation Transcribe conversation
  • 7. How we think of it vs how it is Select word from list Interpret command Type dictation Transcribe presentation Transcribe conversation Transcribe conversation Transcribe presentation Type dictation Interpret command Select word from list
  • 8. Speech recognition approximate timeline Select digit 1950s Select from 1000 words 1970s Select from large vocabulary 1980s Dictate word by word 1990s Dictate whole sentences 1997 Transcribe YouTube video 2012 Transcribe conversation 2019
  • 9. What is the actual job of speech recognition?
  • 10. What is this word? [pʰɹɛtsɫ̩] [pɹɛtsl] /pretsəl/ <pretzel>
  • 11. What’s the problem aspirated /p/ at start of a stressed syllable devoiced /r/ following /p/ labialised /r/ following /p/ dark /l/ syllabic consonant glottal stop
  • 12. It gets worse: find the missing sounds
  • 13. Course on speech recognition 1993 Faster computers won’t help improve speech recognition. We need a new approach.
  • 14. Dragon Naturally Speaking released in 1997. Can recognise whole sentences. What happened?
  • 15. How does speech recognition not work? Finding individual sounds (phonemes) in the speech and matching them to letters.
  • 16. How does speech recognition actually work? P(W|C) What is the likelihood that the next word is X given what came before?
  • 17. Actually, it is quite a bit more complicated (Huang and Deng 2009)
  • 18. Probabilistic (stochastic) ASR enabled the change. Linguistics took the back seat.
  • 19. Fred Jelinek (ASR Pioneer - 1988?) "Every time I fire a linguist, the performance of the speech recognizer goes up"
  • 20. Consequence of probabilistic approach: Worse on words not predictable from context Names Acronyms Specialist Terms
  • 21. Question in 2011 I recorded a lecture, can I use Dragon to transcribe it?
  • 22. “Caption fails” in 2014 provided source for comedy
  • 23. YouTube Captions today are usable and useful
  • 24. So what happened between 2014 and 2022?
  • 25. Ingredients of success Larger data sets More computing power Neural networks
  • 26. Patrick Winston (2015) MIT Lecture 12a in AI course It was in 2010, yes, that's right. It was in 2010. We were having our annual discussion about what we would dump from 6.034 in order to make room for some other stuff. And we almost killed off neural nets. That might seem strange because our heads are stuffed with neurons. … But many of us felt that the neural models of the day weren't much in the way of faithful models of what actually goes on inside our heads. And besides that, nobody had ever made a neural net that was worth a darn for doing anything.
  • 27. 2012 – ImageNet showed that Neural Networks are much better at computing the probabilities for complex data.
  • 28. Ok, we have neural nets, what does that mean?
  • 29. Things to know about Neural Nets Everything has a probability Same input does not produce same output They have no ‘sanity check’ or ‘common sense’
  • 30. What do probabilities look like?
  • 31. What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models Allyson Ettinger 2019
  • 33. Output changes as more information is made available. (Not always for the better)
  • 34. Examples from today’s captions Crystal > Chris is Am > and experts > experience AR > a our
  • 35. Different ways of transcribing Dua Lipa alipa dualipa dua lipa lipa duda lipa
  • 36. Rise and mostly fall of Google’s new spell Czech
  • 37. Tracking faces at the tips of the shoes
  • 38. Hallucination is a big problem
  • 39. Question asked by faculty member in 2021 We correct the transcripts, why doesn’t the system learn the correct spelling?
  • 40. Adding your own word list just tweaks the probabilities.
  • 41. Setting a genre setting tweaks the probabilities.
  • 42. Another thing to know about NN Neural Nets use very large data sets and can take days or weeks to train.
  • 43. Consequences of NN size Speech recognition is often not done on device. Individual input often cannot adjust the quality (except in pre-training) Most applications use APIs from the big players Few open source/free options
  • 44. Big players in the field Google Microsoft (now also Nuance) Amazon
  • 46. Interesting applications Descript Microsoft Reading Progress Microsoft Presentation Coach
  • 47. What can we expect in the future
  • 50. The Original Roomba (2002) vs Roomba S9+ (2019) - Wow!
  • 51. What happens in speeches Fillers Repetition
  • 52. What does conversation actually look like?
  • 53. Possible futures? Incremental improvement similar to Roomba in 17 years: accurate lecture transcripts, fluent dictation with pauses, better meeting transcription. Revolutionary change similar to change in speech recognition in 6 years: informal conversation transcription, interactive dictation, multilingual speech transcription.
  • 54. How should we think about accuracy? We speak 120-180 words per minute. 99% accurate = roughly 1 to 2 errors per minute
  • 55. From Sept 2014 xkcd.com/1425 Sometimes it is hard to judge how much effort will be needed to solve a seemingly easy problem.
  • 56. Wishlist (a few hours of coding) Transcripts indicate level of confidence Benchmarks for lecture transcripts Better manual control of transcripts (like Descript)
  • 57. Dreamlist (5 years and a research team) Multilingual transcription (identify change in language) Multimodal transcription (use information from video) Raw to readable transcript
  • 59. Kate Knill Machine Intelligence Lab, University of Cambridge Richard Cave MND Association (and formerly Google project Euphonia) Richard Purcell Caption.Ed Irit Opher Head of Research at Verbit.ai
  • 60. What is the current state of the art of speech recognition in general and in the transcription of recorded speech in particular? What are the current quality metrics and how much do they tell us about suitability of models? Do we need better ones? After the big recent jump in performance, are we seeing a plateau with incremental growth or can we expect another step change in quality? Where can we see the most innovation? What are the research and development blind spots where more effort is needed? What are the currently unsolved problems for which we do not have a solution? What is the space for smaller players to innovate in this space? How much do they have to rely on pre-trained models from big providers? Is there space for open source?
  • 61. This presentation is licensed under Creative Commons By Attribution license except where otherwise noted. Icons and stock images from Microsoft Office 365 creative premium. They cannot be distributed separately from this document.
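
Slide 16 describes recognition as estimating P(W|C), the likelihood of the next word given what came before. A minimal sketch of that idea is a bigram model that uses only the single preceding word as context. The corpus, function names, and the one-word context are all assumptions made for illustration; this is not how any particular recogniser is built, and real systems combine such language-model scores with acoustic evidence.

```python
from collections import Counter, defaultdict

# Toy corpus, invented for illustration only.
CORPUS = [
    "i ate a salted pretzel",
    "i ate a salted peanut",
    "i ate a salted pretzel today",
    "she ate a pretzel",
]


def train_bigrams(sentences):
    """Count how often each word follows each preceding word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probability(counts, context_word, candidate):
    """Estimate P(candidate | context_word) by relative frequency."""
    total = sum(counts[context_word].values())
    if total == 0:
        return 0.0  # context never seen in training
    return counts[context_word][candidate] / total


counts = train_bigrams(CORPUS)
p_pretzel = next_word_probability(counts, "salted", "pretzel")
p_peanut = next_word_probability(counts, "salted", "peanut")
# A name that never follows "salted" in the corpus gets probability zero.
p_unseen = next_word_probability(counts, "salted", "lukes")

print(f"P(pretzel | salted) = {p_pretzel:.2f}")
print(f"P(peanut  | salted) = {p_peanut:.2f}")
print(f"P(lukes   | salted) = {p_unseen:.2f}")
```

The zero for the unseen word illustrates the consequence noted on slide 20: names, acronyms, and specialist terms that are not predictable from context are where a purely probabilistic approach does worst.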

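The accuracy arithmetic on slide 54 can be checked with a few lines. This sketch assumes "accuracy" means the fraction of spoken words transcribed correctly, which is a simplification: the usual word error rate metric also counts inserted words. The function name and the one-hour talk length are hypothetical, chosen only for illustration.

```python
def errors_per_minute(words_per_minute, accuracy):
    """Expected wrong words per minute at a given speaking rate.

    Assumes accuracy is the fraction of words transcribed correctly.
    """
    return words_per_minute * (1.0 - accuracy)


# Speaking rates taken from the slide: 120 to 180 words per minute.
slow = errors_per_minute(120, 0.99)
fast = errors_per_minute(180, 0.99)

# Hypothetical one-hour talk at the faster rate.
errors_per_hour_fast = fast * 60

print(f"120 wpm at 99%: {slow:.1f} errors per minute")
print(f"180 wpm at 99%: {fast:.1f} errors per minute")
print(f"180 wpm at 99%: about {errors_per_hour_fast:.0f} errors per hour")
```

Even at 99% accuracy, an hour of fast speech leaves on the order of a hundred words to correct, which is why a headline accuracy figure says little about how usable a lecture transcript is.
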
Editor's Notes

  1. Bill Gates big on digital reading, voice recognition, ubiquitous screens – GeekWire https://www.geekwire.com/2011/bill-gates-big-digital-reading-voice-recognition-ubiquitous-screens/ Microsoft's Bill Gates: A rare and remarkable interview with the world's second richest man | Daily Mail Online https://www.dailymail.co.uk/home/moslive/article-2001697/Microsofts-Bill-Gates-A-rare-remarkable-interview-worlds-second-richest-man.html This Photo by Unknown Author is licensed under CC BY
  2. Language Log » First novels (upenn.edu) https://languagelog.ldc.upenn.edu/nll/?p=53940&utm_source=rss&utm_medium=rss&utm_campaign=first-novels
  3. An Overview of Modern Speech Recognition - Microsoft Research https://www.microsoft.com/en-us/research/publication/an-overview-of-modern-speech-recognition/
  4. https://en.wikipedia.org/wiki/Frederick_Jelinek
  5. CAPTION FAIL: Mr. Cuddles (w/ Toby Turner) – YouTube https://www.youtube.com/watch?v=7lTUXVfTVOg
  6. https://www.youtube.com/watch?v=HnA1QmZvSNs
  7. https://www.youtube.com/watch?v=uXt8qF2Zzfo Transcribed from YouTube
  8. http://cs231n.github.io/convolutional-networks
  9. Impressive on English, falls down on Czech
  10. Vanden Stock was a Belgian football player https://ai.googleblog.com/2021/01/totto-controlled-table-to-text.html
  11. Language Log » How AI Reporting Works (upenn.edu)
  12. https://www.youtube.com/watch?v=YLWSXVS71Js
  13. Transcribing Talk-in-Interaction - SAGE Research Methods (sagepub.com) https://methods.sagepub.com/book/doing-conversation-analysis/n6.xml
  14. Comic by XKCD licensed under CC BY NC
  15. What am I allowed to use premium creative content for? (microsoft.com) https://support.microsoft.com/en-us/topic/what-am-i-allowed-to-use-premium-creative-content-for-0de69c76-ff2b-473e-b715-4d245e39e895 Creative Commons — Attribution 4.0 International — CC BY 4.0 https://creativecommons.org/licenses/by/4.0/