Contemporary Models of Natural Language Processing
Developments Swahili ASR resources
1. Developments of Swahili resources for an ASR system
Hadrien Gelas1,2, Laurent Besacier2, François Pellegrino1
1Laboratoire DDL, CNRS - Université de Lyon, France
2LIG, CNRS - Université Joseph Fourier Grenoble, France
2. Swahili introduction (1), ASR resources (2), System results (3)
8. Internet penetration rate (%)
Africa: 13.5
Asia: 26.2
World Average: 32.7
Middle East: 35.6
Latin America / Caribbean: 39.5
Europe: 61.3
Oceania / Australia: 67.5
North America: 78.6
9. Internet population growth (%) 2000-2011
Africa: 2988.4
Asia: 789.6
World Average: 528.1
Middle East: 2244.8
Latin America / Caribbean: 1205.1
Europe: 376.4
Oceania / Australia: 214
North America: 152.6
20. Rich morphology for ASR (Type OOV %)
Word-65k: 19.17
Word-200k: 12.46
Word-400k: 10.28
High OOV rates
21. Rich morphology for ASR (Type OOV %)
Word-65k: 19.17
Word-200k: 12.46
Word-400k: 10.28
To reach a larger lexical coverage, we used an unsupervised approach (Morfessor) to segment words into sub-word units
22. Rich morphology for ASR (Type OOV %)
Word-65k: 19.17
Word-200k: 12.46
Word-400k: 10.28
Morf-65k: 11.36
Morf-200k: 1.61
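The type OOV rate reported on these slides can be illustrated with a small computation. The sketch below assumes the usual definition (the share of distinct test types missing from a vocabulary built from the N most frequent training units); the function names and the toy Swahili sentences are illustrative and do not come from the slides.

```python
from collections import Counter


def build_vocab(train_tokens, size):
    """Keep the `size` most frequent units (words or sub-words) of the training text."""
    return {unit for unit, _ in Counter(train_tokens).most_common(size)}


def type_oov_rate(vocab, test_tokens):
    """Percentage of distinct test types that are not covered by the vocabulary."""
    test_types = set(test_tokens)
    if not test_types:
        return 0.0
    missing = test_types - vocab
    return 100.0 * len(missing) / len(test_types)


# Toy data (assumption): a tiny training text and a tiny test text.
train = "ninapenda kusoma kitabu ninapenda kusoma ninapenda".split()
test = "ninapenda kitabu anapenda vitabu".split()

vocab = build_vocab(train, size=2)   # {"ninapenda", "kusoma"}
rate = type_oov_rate(vocab, test)    # 3 of the 4 test types are missing
print(f"Type OOV rate: {rate:.2f}%")
```

The same function applies unchanged to sub-word units: segmenting both texts first and passing the resulting tokens measures the coverage of a Morf-style vocabulary.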
23. ASR resources
Acoustic models + Pronunciation dictionary + Language models -> Text output
Pronunciation dictionary: needs unit pronunciation
25. Pronunciation dictionary
65k most frequent units (words or sub-words)
+
Grapheme-to-phoneme script that takes advantage of the regularity of Swahili spelling
BUT…
Issue with English words, proper names and acronyms!
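A rule-based grapheme-to-phoneme script of the kind mentioned here could look like the minimal sketch below. It is not the authors' script: the list of multi-letter graphemes and the phone labels are assumptions chosen for illustration, and every remaining letter is simply mapped to itself.

```python
# Assumed multi-letter graphemes of Swahili spelling, each treated as one phone.
# The phone labels (identical to the spelling here) are illustrative only.
MULTI = {
    "ng'": "ng'",
    "ch": "ch",
    "sh": "sh",
    "ny": "ny",
    "dh": "dh",
    "th": "th",
    "gh": "gh",
}
# Try the longest graphemes first so that "ng'" wins over shorter matches.
MULTI_SORTED = sorted(MULTI, key=len, reverse=True)


def g2p(word):
    """Convert a word to a list of phone labels by greedy longest match."""
    word = word.lower()
    phones = []
    i = 0
    while i < len(word):
        for grapheme in MULTI_SORTED:
            if word.startswith(grapheme, i):
                phones.append(MULTI[grapheme])
                i += len(grapheme)
                break
        else:
            # Single letters map to themselves; other characters are skipped.
            if word[i].isalpha():
                phones.append(word[i])
            i += 1
    return phones


print(g2p("shule"))   # a regular Swahili word
print(g2p("games"))   # an English word gets a letter-by-letter pronunciation
```

The second call shows the issue raised on the slide: spelling-based rules give English words, proper names and acronyms a pronunciation that follows the letters, not the way the word is actually said.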
30. Pronunciation dictionary
1 Identical word in the 65k dictionary and in CMU
Words in 65k dictionary: games g a m e s
Words in CMU: games G EY M Z
2 Mapping to Swahili phones
3 Add as a variant
games(2) g e y m z
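The three steps above can be sketched as follows. The phone mapping table is an assumption that covers only the "games" example from the slide, and the function names are hypothetical; a real table would need one entry per CMU phone.

```python
# Assumed mapping from CMU phones to Swahili phone labels (example entries only).
CMU_TO_SWAHILI = {"G": ["g"], "EY": ["e", "y"], "M": ["m"], "Z": ["z"]}


def add_cmu_variants(lexicon, cmu_dict, phone_map):
    """Add a mapped CMU pronunciation as a variant for words found in both lists.

    lexicon:  dict word -> list of pronunciations (each a list of phones)
    cmu_dict: dict word -> list of CMU phones, possibly with stress digits
    """
    for word, prons in lexicon.items():
        cmu_pron = cmu_dict.get(word)           # step 1: identical word
        if cmu_pron is None:
            continue
        bare = [p.rstrip("012") for p in cmu_pron]  # drop stress markers
        if any(p not in phone_map for p in bare):
            continue                            # no mapping available, keep G2P only
        mapped = [s for p in bare for s in phone_map[p]]  # step 2: map phones
        if mapped not in prons:
            prons.append(mapped)                # step 3: add as a variant
    return lexicon


def format_entries(lexicon):
    """Render entries as 'word phones', numbering variants as word(2), word(3)."""
    lines = []
    for word in sorted(lexicon):
        for n, pron in enumerate(lexicon[word], start=1):
            label = word if n == 1 else f"{word}({n})"
            lines.append(f"{label} {' '.join(pron)}")
    return lines


lexicon = {"games": [["g", "a", "m", "e", "s"]],
           "habari": [["h", "a", "b", "a", "r", "i"]]}
cmu = {"games": ["G", "EY1", "M", "Z"]}
entries = format_entries(add_cmu_variants(lexicon, cmu, CMU_TO_SWAHILI))
print("\n".join(entries))
```

Keeping the spelling-based pronunciation as the first entry and adding the mapped one as a variant lets the recogniser accept either way of saying the word.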
31. ASR resources
Acoustic models + Pronunciation dictionary + Language models -> Text output
Acoustic models: need audio data and matching transcriptions
33. Read speech corpus (1st solution)
Transcriptions are directly available and the task is easy to prepare
BUT…
May not be natural enough, need to find speakers willing to record
3h30 collected this way
34. Crowdsourcing transcriptions (2nd solution)
Amazon’s Mechanical Turk: tasks can be posted online and anyone can be paid to do them.
Pros:
Good enough quality for acoustic models
Possibility to find transcribers
Cons:
Completion rate lower than for English
Ethical issues
Only a test, 1h30 of read speech corpus transcribed this way
35. Collaborative transcriptions (3rd solution)
Corpus to transcribe: web broadcast news
(available online with good enough quality)
Collaboration with a Kenyan institute :
36. Collaborative transcriptions (3rd solution)
A 1st acoustic model (AM) is trained using the read speech corpus
Diagram: 1st set AM
37. Collaborative transcriptions (3rd solution)
A 2hrs set is automatically segmented and filtered
Diagram: 2hrs set preparation, 1st set AM
38. Collaborative transcriptions (3rd solution)
The 2hrs set is transcribed using our 1st set AM
Diagram: 2hrs set preparation -> 1st set AM -> 2hrs set transcribed
39. Collaborative transcriptions (3rd solution)
The 2hrs set is sent to the Taji Institute for correction
Diagram: 2hrs set preparation -> 1st set AM -> 2hrs set transcribed -> 2hrs set corrected
40. Collaborative transcriptions (3rd solution)
After correction, the data are added to the training corpus and a new acoustic model (2nd set AM) is trained
Diagram: 2hrs set preparation -> 2nd set AM -> 2hrs set transcribed -> 2hrs set corrected
41. Collaborative transcriptions (3rd solution)
12 hours were transcribed
Diagram: 2hrs set preparation -> 6th set AM -> 2hrs set transcribed -> 2hrs set corrected
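The iterative procedure built up over these slides can be summarised as a loop. The sketch below is only an outline under stated assumptions: `train_am`, `transcribe` and `correct` are hypothetical callables standing in for acoustic model training, automatic transcription and human correction, and retraining after every corrected set (including the last one) is an assumption.

```python
def bootstrap_transcription(read_speech, audio_sets, train_am, transcribe, correct):
    """Grow a training corpus by transcribing, correcting and retraining in rounds.

    read_speech: initial (utterance, text) pairs with known transcriptions
    audio_sets:  list of untranscribed sets (for example six 2-hour sets)
    """
    corpus = list(read_speech)
    am = train_am(corpus)                  # 1st AM, trained on read speech only
    models_trained = 1
    for audio_set in audio_sets:
        draft = transcribe(am, audio_set)  # automatic transcription with current AM
        corrected = correct(draft)         # human correction of the draft
        corpus.extend(corrected)           # corrected data join the training corpus
        am = train_am(corpus)              # a new AM is trained on the larger corpus
        models_trained += 1
    return am, corpus, models_trained


# Toy stand-ins (assumptions) so that the loop can be run end to end.
def toy_train(corpus):
    return {"trained_on": len(corpus)}


def toy_transcribe(am, audio_set):
    return [(utt, f"draft from AM trained on {am['trained_on']}") for utt in audio_set]


def toy_correct(draft):
    return [(utt, "corrected text") for utt, _ in draft]


read_speech = [("read1", "text"), ("read2", "text")]
audio_sets = [[f"set{k}_utt{i}" for i in range(3)] for k in range(1, 7)]
final_am, corpus, n_models = bootstrap_transcription(
    read_speech, audio_sets, toy_train, toy_transcribe, toy_correct
)
print(len(corpus), n_models)
```

Each round starts from a better model than the previous one, which is the motivation for correcting automatic drafts in small batches instead of transcribing everything from scratch.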
42. Collaborative transcriptions
[Figure: Time Spent (hours) plotted against Character Accuracy Rate (%) for the 1st to 6th sets]
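The slides do not define the character accuracy rate. A common definition, assumed here, is 100 × (1 − edit distance / reference length), computed at character level between a reference text and a hypothesis. The sketch below follows that assumption; the function names and the example strings are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference symbol
                cur[j - 1] + 1,          # insertion of a hypothesis symbol
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def character_accuracy(ref, hyp):
    """Character accuracy rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * (1 - edit_distance(ref, hyp) / len(ref))


reference = "habari gani"    # corrected transcription (toy example)
hypothesis = "habari gami"   # automatic transcription with one wrong character
print(f"{character_accuracy(reference, hypothesis):.2f}%")
```

Passing lists of words instead of strings to `edit_distance` gives the word-level count on which a word error rate is based.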
43. System results (WER)
Acoustic models + Pronunciation dictionary + Language models -> Text output