Developments Swahili ASR resources

Developments of Swahili resources for an ASR system

Hadrien Gelas1,2, Laurent Besacier2, François Pellegrino1
1Laboratoire DDL, CNRS - Université de Lyon, France
2LIG, CNRS - Université Joseph Fourier Grenoble, France

1. Swahili introduction
2. ASR resources
3. System results

Between 40M and 100M speakers
Only 2% are native speakers (between 800k and 5M)
98% are non-native speakers

Spoken in more than 9 countries, across a large area of East Africa
Official language of 5 nations

Swahili
language

Internet penetration rate (%)

Region                       Rate (%)
Africa                       13.5
Asia                         26.2
World Average                32.7
Middle East                  35.6
Latin America / Caribbean    39.5
Europe                       61.3
Oceania / Australia          67.5
North America                78.6

Internet population growth (%), 2000-2011

Region                       Growth (%)
Africa                       2988.4
Asia                         789.6
World Average                528.1
Middle East                  2244.8
Latin America / Caribbean    1205.1
Europe                       376.4
Oceania / Australia          214
North America                152.6

Swahili and IT services

But not yet

Swahili features for ASR

Rich morphology: noun classes, agreement systems, complex verbs
Non-tonal
Roman script

2. ASR resources
Acoustic models + Pronunciation dictionary + Language models -> Text output

ASR resources
Language models: need a text corpus

Text corpus (M words)
Bar chart comparing four corpora: Sawa corpus, [Getao and Miriti], Helsinki corpus, our corpus. Values shown: 2, 5, 12 and 28.
Our corpus (28M words) was collected from 16 news websites.

Rich morphology in Swahili

English   They will not tell you
Swahili   hawatakuambieni
Segm.     ha-wa-ta-ku-ambi-e-ni
Gloss     NEG-SM2-FUT-OM2-tell-FIN-PL
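
The segmentation and gloss lines pair up one morpheme per label. A minimal Python sketch of that pairing follows; the helper name `align_gloss` is hypothetical and only serves this illustration.

```python
def align_gloss(segmentation: str, gloss: str) -> list[tuple[str, str]]:
    """Pair each hyphen-separated morpheme with its gloss label."""
    morphemes = segmentation.split("-")
    labels = gloss.split("-")
    if len(morphemes) != len(labels):
        raise ValueError("segmentation and gloss must have the same number of parts")
    return list(zip(morphemes, labels))


# Example taken from the slide: one Swahili word, seven morphemes.
pairs = align_gloss("ha-wa-ta-ku-ambi-e-ni", "NEG-SM2-FUT-OM2-tell-FIN-PL")
for morpheme, label in pairs:
    print(f"{morpheme:>5}  {label}")
```

A single surface word carries what English spreads over five words, which is why word-level vocabularies cover Swahili text poorly.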

Rich morphology for ASR (Type OOV %)

Vocabulary   Type OOV (%)
Word-65k     19.17
Word-200k    12.46
Word-400k    10.28

Rich morphology: high OOV rates


To reach a larger lexical coverage, we used an unsupervised approach (Morfessor) to segment words into sub-word units.


Rich morphology: sub-word units (Type OOV %)

Vocabulary   Type OOV (%)
Morf-65k     11.36
Morf-200k    1.61
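
A type-level OOV rate like the ones reported here can be computed as in the following Python sketch. It assumes the vocabulary is the N most frequent units of the training text and that the rate is the share of distinct test units missing from it; the slides do not spell out the exact protocol, and the function name and toy data are invented for illustration. For sub-word units, the same function would be applied to token lists that have already been segmented.

```python
from collections import Counter


def type_oov_rate(train_tokens, test_tokens, vocab_size):
    """Percentage of distinct test units missing from the top-`vocab_size` training units."""
    vocab = {unit for unit, _ in Counter(train_tokens).most_common(vocab_size)}
    test_types = set(test_tokens)
    if not test_types:
        return 0.0
    missing = sum(1 for unit in test_types if unit not in vocab)
    return 100.0 * missing / len(test_types)


# Toy data, invented for illustration only.
train = "a a a b b c d".split()
test = "a c e f".split()

small_vocab_rate = type_oov_rate(train, test, vocab_size=2)  # vocabulary: a, b
large_vocab_rate = type_oov_rate(train, test, vocab_size=4)  # vocabulary: a, b, c, d
print(small_vocab_rate, large_vocab_rate)
```

Growing the vocabulary lowers the rate, but only as far as the training text contains the test units at all.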

ASR resources
Pronunciation dictionary: needs a pronunciation for each unit

Pronunciation dictionary
65k most frequent units (words or sub-words)
+
Grapheme-to-phoneme script that takes advantage of the regularity of Swahili spelling

BUT…
Issue with English words, proper names and acronyms!
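
A grapheme-to-phoneme script of this kind can be sketched in Python as a greedy longest-match over the spelling. The list of multi-letter graphemes below is an assumption made for illustration (the authors' actual phone inventory is not given on the slides), and phones are simply named after their graphemes.

```python
# Hypothetical multi-letter graphemes, longest first so that "ng'" wins over shorter matches.
MULTI_LETTER = ("ng'", "ch", "dh", "gh", "ny", "sh", "th")


def g2p(word: str) -> list[str]:
    """Greedy longest-match grapheme-to-phoneme conversion."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for grapheme in MULTI_LETTER:
            if word.startswith(grapheme, i):
                phones.append(grapheme)
                i += len(grapheme)
                break
        else:
            # Single letter: one phone per letter; other characters are skipped.
            if word[i].isalpha():
                phones.append(word[i])
            i += 1
    return phones


print(" ".join(g2p("chakula")))  # a regular Swahili spelling
print(" ".join(g2p("games")))    # an English word is read letter by letter
```

The second call shows the issue raised on the slide: a spelling-based rule treats an English word as if it were Swahili.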

Pronunciation
dictionary

Nearly 9% of the units in the 65k lexicon are also found in the CMU English dictionary

Pronunciation dictionary

Words in 65k dictionary    Words in CMU
…                          …
games  g a m e s           games  G EY M Z
…                          …

Pronunciation dictionary

1. Identical word found in both dictionaries
2. Mapping to Swahili phones
3. Add as a variant: games(2) g e y m z
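
The three steps can be sketched in Python as follows. The phone mapping is hypothetical: it lists only the four entries needed for the "games" example and is inferred from that example, not taken from the authors' mapping table. CMU phones are assumed to have their stress digits already stripped, as on the slide.

```python
# Hypothetical CMU-to-Swahili phone mapping, inferred from the "games" example only.
CMU_TO_SWAHILI = {"G": ["g"], "EY": ["e", "y"], "M": ["m"], "Z": ["z"]}


def add_cmu_variants(swahili_lexicon, cmu_lexicon, phone_map=CMU_TO_SWAHILI):
    """Return a copy of the lexicon with mapped CMU pronunciations added as variants."""
    result = {word: list(prons) for word, prons in swahili_lexicon.items()}
    for word, prons in result.items():
        cmu_phones = cmu_lexicon.get(word)  # step 1: identical word
        if cmu_phones is None:
            continue
        try:
            # step 2: map each CMU phone to one or more Swahili phones
            mapped = [p for phone in cmu_phones for p in phone_map[phone]]
        except KeyError:
            continue  # a phone without a mapping: leave the entry untouched
        if mapped not in prons:
            prons.append(mapped)  # step 3: add as a variant
    return result


lexicon = {"games": [["g", "a", "m", "e", "s"]], "habari": [["h", "a", "b", "a", "r", "i"]]}
cmu = {"games": ["G", "EY", "M", "Z"]}
merged = add_cmu_variants(lexicon, cmu)

# Print in the usual "word(n)" variant notation.
for word, prons in merged.items():
    for n, pron in enumerate(prons, start=1):
        label = word if n == 1 else f"{word}({n})"
        print(label, " ".join(pron))
```

Keeping the spelling-based pronunciation as the first entry and the English-derived one as a variant lets the recognizer choose whichever matches the audio.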

ASR resources
Acoustic models: need audio data and matching transcriptions

Audio corpus

Main constraint for us!
Collecting it is a time-consuming and expensive task.

Read speech corpus
(1st solution)

Transcriptions are directly available and the
task is easy to prepare
BUT…
Speech may not be natural enough, and speakers willing to record must be found

3h30 collected this way

Crowdsourcing transcriptions (2nd solution)

Amazon’s Mechanical Turk:
Tasks can be posted online and anyone can be
paid to do them.

Good enough quality for acoustic models
Possibility to find transcribers
BUT…
Completion rate lower than for English
Ethical issues

Only a test: 1h30 of the read speech corpus transcribed this way

Collaborative transcriptions (3rd solution)

Corpus to transcribe: web broadcast news
(available online with good enough quality)

Collaboration with a Kenyan institute: the Taji Institute

Collaborative transcriptions: preparation of each 2-hour set

1. A 1st acoustic model (AM) is trained using the read speech corpus (1st set AM).
2. A 2-hour set is automatically segmented and filtered.
3. The 2-hour set is transcribed using our 1st set AM.
4. The transcribed 2-hour set is sent to the Taji Institute for correction.
5. After correction, the data are added to the training corpus and a new AM is trained (2nd set AM).
6. By the 6th set AM, 12 hours had been transcribed.
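
The loop described in these slides can be sketched in Python as below. All four callables (`train_am`, `transcribe`, `correct`) and the data shapes are hypothetical placeholders: the real system trains acoustic models and relies on human correctors, which the stubs here only imitate.

```python
def bootstrap_transcriptions(read_speech, audio_sets, train_am, transcribe, correct):
    """Grow the training corpus set by set, retraining the AM after each corrected set."""
    training_corpus = list(read_speech)
    am = train_am(training_corpus)           # 1st set AM: read speech only
    for audio_set in audio_sets:             # each set: already segmented and filtered
        draft = transcribe(am, audio_set)    # automatic transcription with the current AM
        corrected = correct(draft)           # manual correction by the partner institute
        training_corpus.extend(corrected)
        am = train_am(training_corpus)       # AM for the next set
    return am, training_corpus


# Stub components, for illustration only.
def train_am(corpus):
    return {"trained_on": len(corpus)}


def transcribe(am, audio_set):
    return [(utterance, "draft text") for utterance in audio_set]


def correct(draft):
    return [(utterance, "corrected text") for utterance, _ in draft]


final_am, corpus = bootstrap_transcriptions(
    read_speech=[("read_001", "reference text")],
    audio_sets=[["news_001", "news_002"], ["news_003"]],
    train_am=train_am,
    transcribe=transcribe,
    correct=correct,
)
print(final_am, len(corpus))
```

The point of the design is that each retrained model produces better drafts, so later sets should need less correction effort.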

Collaborative transcriptions
[Figure: time spent (hours) plotted against character accuracy rate (%) for the 1st to 6th sets]
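
The slides do not define how the character accuracy rate was computed. A common definition, assumed here, is 100 × (1 − edit distance / reference length), measured between the corrected transcription (reference) and the automatic one (hypothesis). A minimal Python sketch under that assumption:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def character_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Character accuracy in percent, relative to the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * (1 - edit_distance(reference, hypothesis) / len(reference))


print(round(character_accuracy_rate("habari", "habri"), 2))
```

Because `edit_distance` works on any sequence, passing lists of words instead of strings gives the error count used for a word-level measure.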

3. System results (WER)

Asante! (Thank you!)

hadrien.gelas@univ-lyon2.fr

laurent.besacier@imag.fr

françois.pellegrino@univ-lyon2.fr
