Sound is not speech

Sacha Krstulović
Director of AALabs – Audio Analytic Ltd.
SANE Workshop, New York - October 19th 2017
© Audio Analytic Ltd, 2017

The missing piece of the AI puzzle
Speech AI:
• Speech recognition / synthesis:
natural speech interaction, dialogue
• Biometric voice recognition:
identity, personalisation
• Machine translation
Image AI:
• Face recognition:
identity, security, personalisation
• Video processing:
activity, security, presence
Music AI:
• Fingerprinting:
entertainment, information
• Query by humming:
entertainment, information
Sound AI?
• Sound recognition:
context, attention, presence,
security, entertainment
• Scene recognition:
context, activity

Audio Analytic:
For the first time, devices can intelligently respond to sound.
• We develop sound recognition software and algorithms
• Founded in 2010
• Based in Cambridge, UK and Palo Alto, USA
• Over 40 people
• Experts…
• in machine listening
• in sound recognition
• in software engineering
• Venture-backed company

“An AI start-up like no other… like a
Shazam for real-world sounds”
Bloomberg

Active market focus
• Smart speakers and smart home devices
• Support the wellbeing of family, loved ones and possessions by recognising and responding to:
• Window glass break
• Smoke and CO alarm
• Baby cry
• Dog bark
• Anomaly
• Voice presence
Expanding market focus
• All technology will become context aware and intelligent
• New opportunities to enhance wellness, entertainment and social interaction
• Hearables, wearables, VR, gaming, automotive, smart home, buildings, mobile and more…

Sound is not speech.

Is there a language of sounds?
• Speech is bounded by language.
• Language is predictable and enumerable:
English language: 171,476 words in current use,
47,156 obsolete words, 9,500 derivative words
Wall Street Journal
“Hello [???]”
• e.g., end-to-end speech recognition:
CTC networks address the direct prediction
of labels from audio
$p(l \mid x) = \sum_{\pi \in B^{-1}(l)} p(\pi \mid x)$
(Alex Graves et al., 2006)
• What about environmental sounds?
Is their occurrence predictable?
“What sound will I make now?”
• Nonetheless, sounds tell us something:
• Intentionally communicative
e.g., smoke alarm, alert sounds
• Incidental cues
e.g., glass break, vacuum cleaner
• Environmental
e.g., aircon, babble; wind, rain
• And the notion of a generative model remains
somewhat valid.
𝑝( smoke alarm | beep )
• Oh, wait a minute:
𝑝( smoke alarm | beep and lots of garbage )
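To make the CTC formula on this slide concrete, here is a minimal brute-force sketch in Python. It sums the probability of every frame-level path π that collapses to a given label sequence l under the map B (merge repeats, then drop blanks). The function names, the blank index and the toy probabilities are assumptions made for illustration, and real CTC implementations use a forward-backward recursion rather than enumeration.

```python
import itertools
import math

BLANK = 0  # assumed index of the CTC blank symbol


def collapse(path):
    """CTC map B: merge consecutive repeats, then remove blanks."""
    merged = [symbol for symbol, _ in itertools.groupby(path)]
    return tuple(symbol for symbol in merged if symbol != BLANK)


def ctc_label_probability(frame_probs, label):
    """Brute-force p(l|x): sum of p(pi|x) over all paths pi with B(pi) == l.

    frame_probs[t][k] is the per-frame probability of symbol k at frame t.
    Exponential in the number of frames, so only usable on toy inputs.
    """
    num_frames = len(frame_probs)
    num_symbols = len(frame_probs[0])
    target = tuple(label)
    total = 0.0
    for path in itertools.product(range(num_symbols), repeat=num_frames):
        if collapse(path) == target:
            total += math.prod(frame_probs[t][k] for t, k in enumerate(path))
    return total


# Toy input: 2 frames, 2 symbols (blank and symbol 1), uniform posteriors.
uniform = [[0.5, 0.5], [0.5, 0.5]]
p_label = ctc_label_probability(uniform, [1])   # paths (0,1), (1,0), (1,1)
p_empty = ctc_label_probability(uniform, [])    # path (0,0) only
```

With two frames, three of the four equally likely paths collapse to the label [1], which is why the sum needs the inverse map B⁻¹(l) rather than a single alignment.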

The variety of production processes…

… leads to a variety of acoustic features.
[Figure panels: Beeps; Harmonic sounds; Crashes/bangs; Shaped noise]

Should acoustic features be hand-crafted or learned?
• Most standard acoustic features were invented by looking at spectrogram images.
• If image recognition can infer features, then why not… => Convolutional Neural Networks (CNNs).
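As a rough illustration of the "spectrogram image" that both hand-crafted features and CNNs start from, the sketch below computes a log-magnitude spectrogram with NumPy. The frame length, hop size, sample rate and the 3 kHz test tone are arbitrary assumptions chosen for the example, not values from the talk.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128, eps=1e-10):
    """Return a (num_frames, frame_len // 2 + 1) log-magnitude spectrogram."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop: i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitude + eps)  # eps avoids log(0) on silent bins


# Assumed test signal: one second of a 3 kHz tone sampled at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
beep = np.sin(2 * np.pi * 3000 * t)

spec = log_spectrogram(beep)
peak_bin = int(spec.mean(axis=0).argmax())
peak_hz = peak_bin * sample_rate / 256  # bin index -> frequency in Hz
```

The resulting 2-D array is what a CNN would consume as a single-channel image, with time on one axis and frequency on the other.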

Are acoustic features enough?
• Temporal modelling might help.

DCASE 2017 data challenge - Task 2 results
• DCASE: comparative evaluation.
Task 2: detection of rare events.
• Baby cry, glass break, gunshots
in background noise
• Convolutional Recurrent
Neural Networks (CRNNs)
stole the show!
• (But has anybody actually
tried anything else?)
DCASE 2017 T2: H. Lim, J. Park and Y. Han; E. Cakir and T. Virtanen

Interrupted sequences
• T3 pattern:
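[Figure: T3 pattern timeline with segments labelled “0.5s beep”, “0.5s of silence” and “1s of silence”.]

Interrupted sequences
• T3 pattern:
[Figure: T3 pattern timeline with the gaps labelled “0.5s of ANYTHING!” and “1s of ANYTHING!”.]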
Out of the 4-second sequence, 62.5% of the acoustics
do not predict anything about the smoke alarm!
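The 62.5% figure can be checked with a few lines of Python. The segment layout below (three 0.5 s beeps, two 0.5 s gaps and a longer closing gap) is an assumption chosen to match the 4-second total quoted on the slide; only the beep and gap durations matter for the arithmetic.

```python
# Assumed T3 cycle as (duration in seconds, is_beep) segments.
T3_CYCLE = [
    (0.5, True), (0.5, False),
    (0.5, True), (0.5, False),
    (0.5, True), (1.5, False),
]


def non_beep_fraction(cycle):
    """Fraction of the cycle during which no beep is sounding."""
    total = sum(duration for duration, _ in cycle)
    gaps = sum(duration for duration, is_beep in cycle if not is_beep)
    return gaps / total


total_seconds = sum(duration for duration, _ in T3_CYCLE)
fraction = non_beep_fraction(T3_CYCLE)
print(f"{total_seconds:.1f} s cycle, {fraction:.1%} without a beep")
```

In a real room, those gaps are filled with whatever else is happening, so a frame-by-frame model sees target evidence in only 1.5 s of every 4 s.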
Perhaps we need to model some attention mechanism?
Y. Xu, Q. Kong, Q. Huang, W. Wang and M. D. Plumbley,
“Attention and Localization Based on a Deep Convolutional
Recurrent Model for Weakly Supervised Audio Tagging”,
Interspeech 2017
DCASE 2016 Task 4
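One simple way to picture an attention mechanism over time is attention-weighted pooling: each frame gets a relevance score, the scores are turned into weights with a softmax, and the clip-level decision is the weighted average of the frame-level probabilities. The sketch below is a generic illustration with hand-set scores, not the model from the paper cited above, where such scores would be learned by the network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_probs, attention_scores):
    """Clip-level probability as an attention-weighted mean of frame probabilities."""
    weights = softmax(attention_scores)
    return sum(w * p for w, p in zip(weights, frame_probs))


# Hypothetical clip: only the first frame contains the beep.
frame_probs = [0.9, 0.1, 0.1, 0.1]
plain_mean = sum(frame_probs) / len(frame_probs)

# Uniform scores reduce to the plain mean; a high score on frame 0
# lets the informative frame dominate the clip-level decision.
uniform_pooled = attention_pool(frame_probs, [0.0, 0.0, 0.0, 0.0])
focused_pooled = attention_pool(frame_probs, [4.0, 0.0, 0.0, 0.0])
```

Plain averaging dilutes the one informative frame with the uninformative gaps, which is the problem interrupted sequences create.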

24/7 sound recognition: perhaps an open-set problem?
• The variability of the non-target set is very large.
• Data balance: by nature, the non-target set is much larger than the target set.
• Open set: model a ball around the target data and some measure of “outlierism”?
Early results in vision and forensics: “The open set recognition problem”, A. Rocha and W. Scheirer, ICIP 2016
[Figure: confusion matrix, Target (1) vs Non-target (∞)]
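The "ball around the target data" idea can be sketched very simply: fit a centre and radius on target examples only, then score any new sound by how far outside that ball it falls. This is a deliberately naive illustration under assumed names and toy 2-D feature vectors, not the method of the cited tutorial; practical open-set systems use richer boundaries and calibrated scores.

```python
import math


def fit_ball(target_vectors):
    """Centre = mean of the target vectors; radius = farthest target from the centre."""
    dim = len(target_vectors[0])
    count = len(target_vectors)
    centre = [sum(v[i] for v in target_vectors) / count for i in range(dim)]
    radius = max(math.dist(v, centre) for v in target_vectors)
    return centre, radius


def outlier_score(vector, centre, radius):
    """Distance to the centre in units of the radius (> 1 means outside the ball)."""
    return math.dist(vector, centre) / radius


def is_target(vector, centre, radius):
    return outlier_score(vector, centre, radius) <= 1.0


# Hypothetical 2-D embeddings of target sounds; no non-target data is needed.
targets = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
centre, radius = fit_ball(targets)

inside = is_target((1.0, 1.5), centre, radius)     # close to the targets
outside = is_target((10.0, 10.0), centre, radius)  # never seen in training
```

The point of the design is that the non-target class is never modelled explicitly, which sidesteps its unbounded variability.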

Research is evolving: paradigm shift
• From the explicit modelling of phenomena
• MFCCs related to audition, generative models,
Markov chains, factorization, sparsity etc.
• To heavily data-driven DNN models
• Bottleneck features, posteriors from FFDNNs
• To higher level functions achieved in principle by DNNs
• Feature extraction => CNNs, Temporal modelling => RNNs,
Attention networks, CTC etc.
• Largely evaluation driven
• “Assuming that the network architecture does X,
evaluation shows that it improves the rates by Y%.”
• Data is a parameter.
• Some attempts at interpretation.
But this could all be a massive horse!

The “real world”…

Channel variability

Running on the edge
• Speech may use Cloud Computing.
What about sound recognition
for consumer products?
• Running on the edge has lots of value
• Distributed computing
• Privacy concerns
• Reliability
• Real time
• etc.
• => Running on embedded systems requires
optimising the computational cost.
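A back-of-the-envelope way to reason about computational cost on an embedded target is to count multiply-accumulate operations (MACs) per second. The sketch below does this for a fully connected network; the layer sizes and frame rate are hypothetical values picked for illustration, not figures from the talk, and biases and activation functions are ignored.

```python
def dense_macs(layer_sizes):
    """MACs for one forward pass through a fully connected network."""
    return sum(fan_in * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


def macs_per_second(layer_sizes, frames_per_second):
    """MACs needed to keep up with a real-time stream of feature frames."""
    return dense_macs(layer_sizes) * frames_per_second


# Hypothetical model: 40 input features, two hidden layers of 128 units,
# 2 outputs (target / non-target), evaluated on 100 frames per second.
layers = [40, 128, 128, 2]
per_frame = dense_macs(layers)
per_second = macs_per_second(layers, frames_per_second=100)
```

Comparing such counts across model families gives a first estimate of which one fits a given processor budget, before any measurement on the actual device.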
(Image: Pubnub.com)

Which machine yields the best bang for MIPS?
Sigtia et al., “Automatic Environmental Sound Recognition: Performance versus
Computational Cost”, IEEE/ACM Trans. ASLP, Vol. 24, Issue 11, Nov. 2016, pp. 2096-2107

Summary: why sound is not speech
It is not exactly the same problem as speech or music:
• Not bounded by language or musical theory.
• Diversity of production processes and acoustic features.
• Temporal structure matters, but involves interruption.
• One against many: open-set recognition.
Additional topics for industrial impact:
• Robustness to channel and room responses is crucial.
• Running on the edge matters: computational cost
is a tangible question.
This is a new type of AI in its own right, and a new research community is forming around it.

We are hiring!
https://www.AudioAnalytic.com/careers/

UK headquarters
2 Quayside
Cambridge
CB5 8AB, UK
info@audioanalytic.com
audioanalytic.com
US office
3505 El Camino Real
Palo Alto
CA 94306, USA
Thank you
