This document presents a summary of work on automatic language identification (LiD) from speech signals. It discusses how LiD could benefit various industries and outlines the challenges the problem poses. Features explored for LiD include MFCCs, pitch contours, and rhythmic patterns; classification is performed with WEKA using these acoustic features. Results show over 80% accuracy between related languages when data are averaged across files, and over 70% on 5-second segments when comparing all 12 languages. Extensions and improvements to robustness against noisy signals are discussed.
3. The LiD Problem
• The identification of a given spoken language from a speech signal
• 94% of the global population speaks only 6% of the world's languages
4. Real World Situations
• Automation of LiD is desirable
• Offers many benefits to international service industries
• Hotels, airports, global call centres
5. Language Differences
• Languages contain information that makes one discernible from another:
– Phonemes
– Prosody
– Phonotactics
– Syntax
6. Human Abilities
• Bias towards the native language arises in infancy
• Prosodic features are among the first cues to be recognised
• Humans can make a reasonable estimate of the language heard within 3–4 seconds of audition
• Even unfamiliar languages may be plausibly judged
7. Previous Attempts
• Attempts as early as 1974 – USAF work, therefore classified
• Methodological investigations as early as 1977 (House & Neuburg)
• Studies for the most part centre on phonotactic constraints and phoneme modeling
• Raw acoustic waveforms have also been explored (Kwasny, 1993)
8. A Simpler Approach?
• Phonotactic approaches require expert linguistic knowledge
• Phoneme and phonotactic modeling are time-consuming
• Given the speed of human LiD abilities, discrimination is most likely based on acoustic features
10. Feature Extraction
• Spectral information – MFCC vectors
• Pitch contour information
• Handled by the SCMIR library (Collins, 2010)
• Speech rhythm – Normalised Pairwise Variability Index (nPVI); extraction is sketched below

Feature Type      Implemented Measure
Spectral content  MFCC UGen
Pitch contour     Tartini UGen
Speech rhythm     nPVI function
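A minimal sketch of how these features can be read in SuperCollider; this is not the SCMIR pipeline itself, and the file path, FFT size, and polling rate are illustrative assumptions. MFCC is a core machine-listening UGen, Tartini a pitch tracker from sc3-plugins, and the nPVI function follows the standard formulation usually credited to Grabe & Low (2002).

```supercollider
// Minimal sketch (not the authors' SCMIR code): extracting MFCC and
// Tartini pitch features from a mono speech file.
(
s.waitForBoot {
	var buf = Buffer.read(s, "/path/to/speech.wav"); // hypothetical file
	SynthDef(\lidFeatures, { |bufnum|
		var sig = PlayBuf.ar(1, bufnum, BufRateScale.kr(bufnum), doneAction: 2);
		var chain = FFT(LocalBuf(1024), sig);
		var mfcc = MFCC.kr(chain, numcoeff: 13); // 13 cepstral coefficients
		var freq, hasFreq;
		#freq, hasFreq = Tartini.kr(sig);        // pitch contour (sc3-plugins)
		mfcc[0].poll(10, "mfcc0");               // a real system would log
		freq.poll(10, "f0");                     // frames to a file for WEKA
	}).add;
	s.sync;
	Synth(\lidFeatures, [\bufnum, buf]);
};

// nPVI over successive interval durations d[1..m]:
// nPVI = 100/(m-1) * sum_k |d[k] - d[k+1]| / ((d[k] + d[k+1]) / 2)
~npvi = { |durs|
	var terms = durs.slide(2, 1).clump(2).collect { |pair|
		absdif(pair[0], pair[1]) / ((pair[0] + pair[1]) / 2)
	};
	100 * terms.sum / terms.size;
};
~npvi.([0.12, 0.08, 0.15, 0.10]); // example syllable durations in seconds
)
```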
11. Classification
• Handled by the WEKA toolkit
• Built-in Multilayer Perceptron
• Called from the command line through a SuperCollider system command (sketched below)
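As a rough illustration, assuming the features have already been written out in WEKA's ARFF format, the call might look as follows. The jar location and file names are assumptions; -t and -T are WEKA's real training- and test-file flags, and unixCmd runs the command asynchronously, posting WEKA's evaluation summary to the post window.

```supercollider
// Hypothetical command-line call to WEKA's built-in MultilayerPerceptron
// from SuperCollider. The jar location and ARFF file names are assumptions.
(
var wekaJar = "/path/to/weka.jar";
var train = "lid_train.arff", test = "lid_test.arff";
("java -cp % weka.classifiers.functions.MultilayerPerceptron -t % -T %"
	.format(wekaJar, train, test)).unixCmd;
)
```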
12. Comparisons Made
• 66 language pairs from 12 languages (every pairwise combination: C(12,2) = 66)
• A comparison within language families
• A comparison of all 12 languages
• Averaged & segmented data (see the sketch below)
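One plausible reading of "averaged & segmented" (an assumption about the pipeline, not something the slides confirm): per-frame feature vectors are collapsed either to a single mean vector per file, or to one mean vector per 5-second segment. The helper names and frame rate below are hypothetical.

```supercollider
// Illustrative post-processing of per-frame feature vectors; frameRate
// (frames per second) and the helper names are hypothetical.
(
var frames;
~averageFile = { |feats| feats.mean }; // elementwise mean -> one vector per file
~segmentMeans = { |feats, frameRate, segDur = 5|
	feats.clump((frameRate * segDur).asInteger).collect(_.mean)
};
// Example: 100 random 13-dimensional frames at 20 frames per second
frames = Array.fill(100, { Array.rand(13, 0.0, 1.0) });
~averageFile.(frames).size.postln;      // -> 13: one mean vector
~segmentMeans.(frames, 20).size.postln; // -> 1: a 5-second file is one segment
)
```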
13. Results Within Families
[Chart: classification accuracy (%) within each language family – Germanic, Romance, Slavic, Sino-Altaic – plus the mean, for feature sets 4 MFCC; 13 MFCC + Tartini; 41 MFCC + Tartini; all features. Y-axis spans 20–100%]
*Data averaged across files
14. Results Within Families
[Chart: classification accuracy (%) within each language family – Germanic, Romance, Slavic, Sino-Altaic – plus the mean, for feature sets 4 MFCC; 13 MFCC + Tartini; 41 MFCC + Tartini. Y-axis spans 20–100%]
*Data in 5-second segments
15. Results From All Languages
[Chart: mean classification accuracy (%) over all 12 languages, for averaged data and 5-second segments, across feature sets: 4, 13 and 41 MFCC, each alone, with Tartini, and with Tartini + nPVI. Y-axis spans 10–100%]
16. Extensions
• nPVI function to use vowel onsets
• Phonemic segmentation
• A larger dataset
• Better efficiency
• Real-time operation
17. Robustness
• Real-world signals are very different from processed 'clean' data
• 'Ideal' LiD systems – independent & robust
• A need to analyse only the part of the signal that matters
18. CASA
• A computational modeling of human 'Auditory Scene Analysis' (Bregman, 1990)
• Separation of the signal into component parts and reconstitution into meaningful 'streams'