© Audio Analytic Ltd, 2017
Sound is not Speech
Sacha Krstulović
Director of AALabs – Audio Analytic Ltd.
SANE Workshop, New York - October 19th 2017
The missing piece of the AI puzzle
Speech AI:
• Speech recognition / synthesis:
natural speech interaction, dialogue
• Biometric voice recognition:
identity, personalisation
• Machine translation
Image AI:
• Face recognition:
identity, security, personalisation
• Video processing:
activity, security, presence
Music AI:
• Fingerprinting:
entertainment, information
• Query by humming:
entertainment, information
Sound AI?
• Sound recognition:
context, attention, presence,
security, entertainment
• Scene recognition:
context, activity
For the first time, devices can
intelligently respond to sound.
Audio Analytic:
• We do sound recognition software and algorithms
• Founded in 2010
• Based in Cambridge, UK and Palo Alto, USA
• Over 40 people
• Experts…
  • in machine listening
  • in sound recognition
  • in software engineering
• Venture-backed company
“An AI start-up like no other… like a
Shazam for real-world sounds”
Bloomberg
Active market focus
• Smart speakers and smart home devices
• Support the wellbeing of family, loved ones and
possessions by recognising and responding to:
  • Window glass break
  • Smoke and CO alarm
  • Baby cry
  • Dog bark
  • Anomaly
  • Voice presence
Expanding market focus
• All technology will become context aware and
intelligent
• New opportunities to enhance wellness,
entertainment and social interaction
• Hearables, wearables, VR, gaming, automotive,
smart home, buildings, mobile and more…
Sound is not speech.
Is there a language of sounds?
• Speech is bounded by language.
• Language is predictable and enumerable:
English language: 171,476 words in current use,
47,156 obsolete words, 9,500 derivative words
Wall Street Journal
“Hello [???]”
• e.g., end-to-end speech recognition:
CTC networks address the direct prediction
of labels from audio
$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$
(Alex Graves et al., 2006)
• What about environmental sounds?
Is their occurrence predictable?
“What sound will I make now?”
• Nonetheless, sounds tell us something:
• Intentionally communicative
e.g., smoke alarm, alert sounds
• Incidental cues
e.g., glass break, vacuum cleaner
• Environmental
e.g., aircon, babble; wind, rain
• And the notion of generative model remains
somewhat valid.
𝑝( smoke alarm | beep )
• Oh, wait a minute:
𝑝( smoke alarm | beep and lots of garbage )
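The CTC formula on this slide sums the probability of every frame-level path π that the collapsing map B sends to the label sequence l. A minimal brute-force sketch of that sum follows, assuming the standard CTC factorisation where the path probability is the product of per-frame symbol probabilities. The toy alphabet, the blank symbol "-" and the frame probabilities are illustrative assumptions, and real systems use a dynamic-programming recursion rather than enumeration.

```python
from itertools import product

BLANK = "-"  # assumed blank symbol for this toy example


def collapse(path):
    """The CTC map B: merge repeated symbols, then drop blanks."""
    merged = []
    prev = None
    for symbol in path:
        if symbol != prev:
            merged.append(symbol)
        prev = symbol
    return "".join(s for s in merged if s != BLANK)


def ctc_label_probability(frame_probs, label):
    """Sum p(pi|x) over every path pi with collapse(pi) == label.

    frame_probs is a list with one dict per frame, mapping each
    symbol (including the blank) to its probability at that frame.
    """
    symbols = list(frame_probs[0].keys())
    total = 0.0
    for path in product(symbols, repeat=len(frame_probs)):
        if collapse(path) == label:
            p = 1.0
            for t, symbol in enumerate(path):
                p *= frame_probs[t][symbol]
            total += p
    return total


# Hypothetical two-frame posteriorgram over the alphabet {"a", blank}.
frame_probs = [
    {"a": 0.6, BLANK: 0.4},
    {"a": 0.5, BLANK: 0.5},
]
p_a = ctc_label_probability(frame_probs, "a")
p_empty = ctc_label_probability(frame_probs, "")
```

Enumeration grows exponentially with the number of frames, so this is only useful for checking intuition on very short inputs.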
The variety of production processes…
… leads to a variety of acoustic features.
[Figure panels: Beeps; Harmonic Sounds; Crash/Bangs; Shaped noise]
Should acoustic features be hand-crafted or learned?
• Most standard acoustic features were invented by looking at spectrogram images.
• If image recognition can infer features, then why not… => Convolutional Neural Networks.
Are acoustic features enough?
• Temporal modelling might help.
DCASE 2017 data challenge - Task 2 results
• DCASE: comparative evaluation.
Task 2: detection of rare events.
• Baby cry, glass break, gunshots
in background noise
• Convolutional Recurrent
Neural Networks (CRNNs)
stole the show!
• (But has anybody actually
tried anything else?)
DCASE 2017 T2: H. Lim, J. Park and Y. Han; E. Cakir and T. Virtanen
Interrupted sequences
• T3 pattern:
[Figure: T3 pattern timeline; segment labels: 0.5s beep, 0.5s of silence, 1s of silence]
Interrupted sequences
• T3 pattern:
[Figure: T3 pattern timeline; segment labels: 0.5s beep, 0.5s of ANYTHING!, 1s of ANYTHING!]
Out of the 4 second sequence, 62.5% of the acoustics
do not predict anything about the smoke alarm!
Perhaps need to model some attention mechanism?
“Attention and Localization Based on a Deep Convolutional
Recurrent Model for Weakly Supervised Audio Tagging”,
Y. Xu, Q. Kong, Q. Huang, W. Wang and M. D. Plumbley,
Interspeech 2017
DCASE 2016 Task4
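The 62.5% figure can be checked with a few lines of arithmetic. The sketch below assumes the usual T3 layout of three 0.5s beeps separated by 0.5s gaps and followed by a 1.5s gap (the slide only gives the segment durations and the 4 second total), so the exact ordering in T3_CYCLE is an assumption.

```python
# One T3 cycle as (label, duration in seconds).
# Assumed layout: three 0.5s beeps, two 0.5s gaps, one final 1.5s gap.
T3_CYCLE = [
    ("beep", 0.5), ("gap", 0.5),
    ("beep", 0.5), ("gap", 0.5),
    ("beep", 0.5), ("gap", 1.5),
]


def uninformative_fraction(cycle):
    """Fraction of the cycle during which anything could be heard."""
    total = sum(duration for _, duration in cycle)
    gaps = sum(duration for label, duration in cycle if label == "gap")
    return gaps / total


fraction = uninformative_fraction(T3_CYCLE)
print(f"{fraction:.1%}")  # prints 62.5%
```

Only 1.5 of the 4 seconds carry the beep itself, which is why a model that weights frames unequally, such as an attention mechanism, is a plausible fit.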
24/7 sound recognition: perhaps an open-set problem?
• The variability of the non-target set is very large.
• Data balance: by nature, the non-target set is much larger than the target set.
• Open set: model a ball around the target data and some measure of “outlierism”?
Early results in vision and forensics: “The open set recognition problem”, A. Rocha and W. Scheirer, ICIP 2016
[Figure: confusion matrix with columns Target and Non-target; labels 1 and ∞]
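One very simple reading of "a ball around the target data" is a centroid plus a radius, with the distance to the centroid serving as the measure of outlierism. The sketch below is only an illustration of that idea under stated assumptions: the 2-D feature vectors, the margin value and all function names are hypothetical, and it is not the method of the cited paper.

```python
import math


def centroid(points):
    """Mean of the target feature vectors, dimension by dimension."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def fit_ball(target_points, margin=1.2):
    """Fit a ball using target data only; no non-target data is needed.

    The radius is the largest training distance to the centroid,
    inflated by an assumed safety margin.
    """
    c = centroid(target_points)
    radius = margin * max(math.dist(p, c) for p in target_points)
    return c, radius


def outlier_score(x, ball):
    """Distance to the centroid in units of the radius (above 1 is outside)."""
    c, radius = ball
    return math.dist(x, c) / radius


def is_target(x, ball):
    return outlier_score(x, ball) <= 1.0


# Hypothetical 2-D features for the target sound.
targets = [(1.0, 1.0), (1.0, 3.0), (3.0, 1.0), (3.0, 3.0)]
ball = fit_ball(targets)
inside = is_target((2.0, 2.5), ball)
outside = is_target((10.0, 10.0), ball)
```

The design point is that everything outside the ball is rejected without ever being modelled, which sidesteps the size and variability of the non-target set.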
Research is evolving: paradigm shift
• From the explicit modelling of phenomena
• MFCCs related to audition, generative models,
Markov chains, factorization, sparsity etc.
• To heavily data-driven DNN models
• Bottleneck features, posteriors from FFDNNs
• To higher level functions achieved in principle by DNNs
• Feature extraction => CNNs, Temporal modelling => RNNs,
Attention networks, CTC etc.
• Largely evaluation driven
• “Assuming that the network architecture does X,
evaluation shows that it improves the rates by Y%.”
• Data is a parameter.
• Some attempts at interpretation.
But this could all be a massive horse!
The “real world”…
Channel variability
Running on the edge
• Speech may use Cloud Computing.
What about sound recognition
for consumer products?
• Running on the edge has lots of value
• Distributed computing
• Privacy concerns
• Reliability
• Real time
• etc.
• => Running on embedded systems requires
optimising the computational cost.
(Image: Pubnub.com)
Which machine yields the best bang for MIPS?
Sigtia et al., “Automatic Environmental Sound Recognition: Performance versus
Computational Cost”, IEEE/ACM Trans. ASLP, Vol. 24, Issue 11, Nov. 2016, pp. 2096-2107
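To make the cost question concrete, a rough first estimate for a feed-forward network is the number of multiply-accumulate operations per second. The sketch below counts only the dense-layer multiplications and ignores biases, activations and feature extraction. The layer sizes and the frame rate are hypothetical, and this is not the costing procedure used in the cited paper.

```python
def dense_macs_per_frame(layer_sizes):
    """Multiply-accumulates for one forward pass of a dense network.

    layer_sizes lists the width of each layer, input first, output last.
    """
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


def macs_per_second(layer_sizes, frames_per_second):
    """Sustained cost when the network runs once per audio frame."""
    return dense_macs_per_frame(layer_sizes) * frames_per_second


# Hypothetical models: 40 input features, 2 output classes, 100 frames/s.
small = [40, 64, 64, 2]
large = [40, 512, 512, 2]

small_cost = macs_per_second(small, 100)
large_cost = macs_per_second(large, 100)
ratio = large_cost / small_cost
```

Under these assumptions the wider model costs roughly 40 times more per second, so it would need a clear accuracy gain to justify itself on an embedded target.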
Summary: why sound is not speech
It is not exactly the same problem as speech or music:
• Not bounded by language or musical theory.
• Diversity of production processes and acoustic features.
• Temporal structure matters, but involves interruption.
• One against many: open-set recognition.
Additional topics for industrial impact:
• Robustness to channel and room responses is crucial.
• Running on the edge matters: computational cost
is a tangible question.
This is a new type of AI in its own right, and a new research community is forming around it.
We are hiring!
https://www.AudioAnalytic.com/careers/
UK headquarters
2 Quayside
Cambridge
CB5 8AB, UK
info@audioanalytic.com
audioanalytic.com
US office
3505 El Camino Real
Palo Alto
CA 94306, USA
Thank you
Editor's Notes

  1. CTC stands for Connectionist Temporal Classification. “Somewhat generative” because there is no linguistic or musical score, but there is something nonetheless.