Collecting and evaluating speech
recognition corpora for nine Southern Bantu
                 languages

  Jaco Badenhorst, Charl van Heerden, Marelie Davel and
                    Etienne Barnard


                     March 31, 2009
Introduction         ASR corpus design   Project Lwazi   Computational analysis   Conclusion


                            Outline



               Introduction
               Background:
                   ASR corpus design
                   The Lwazi ASR corpus

               Computational analysis
                   Approach
                   Analysis of phoneme variability

               Conclusion


                           Introduction


               Information flow in developing countries
                   Availability of alternative information sources is low in
                   developing countries
                   Telephone networks (cellular) are spreading rapidly

               Spoken dialog systems (SDSs)
                   Widespread belief that impact can be significant
                   Speech-based access can empower semi-literate people

               Applications of SDSs
                   Education (Speech-enabled learning)
                   Agriculture
                   Health care
                   Government services


                           Introduction



               To implement SDSs: ASR and TTS systems are needed
               Main linguistic resources needed for telephone-based ASR
               systems:
                   Electronic pronunciation dictionaries
                   Annotated audio corpora
                   Recognition grammars

               Challenges:
                   ASR only available for a handful of African languages
                   Lack of linguistic resources for African languages
                   Lack of relevant audio for specific application (language used,
                   profile of speakers, speaking style, etc.)


                           ASR audio corpus



               Resource intensive process
               Factors that add to complexity:
                   Recordings of multiple speakers
                   Matching channel and style
                   Careful orthographic transcription
                   Markers required to indicate important events (e.g. non-speech)

               Size of corpora:
                   Corpora of resource-scarce languages tend to be very small
                   (1-10 hours of audio)
                   Contrasts with speech corpora used to build commercial
                   systems (hundreds to thousands of hours)


                           Project Lwazi




               Three-year (2006-2009) project commissioned by the South
               African Department of Arts and Culture
               Development of core speech technology resources and
               components (ASR, TTS, SDS, etc.)
               National pilot demonstrating potential impact of speech-based
               systems in South Africa
               All 11 official languages of South Africa


                           Project Lwazi: Languages


               Distribution of home languages for South African population:
                   9 Southern Bantu languages, 2 Germanic languages


                          Project Lwazi



               ASR corpus:
                   Approximately 200 speakers per language
                   Speaker population selected to provide a balanced profile with
                   regard to age, gender and type of telephone
                   (cellphone/landline)
                   Read and elicited speech recorded over telephone channel
                   30 utterances per speaker:
                        16 randomly selected from a phonetically balanced corpus
                        14 short words and phrases
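
The per-speaker prompt split above can be sketched as follows. This is a minimal illustration only: the pool contents, the function name `build_prompt_sheet`, and the choice to sample the short items at random are assumptions, since the slides only give the 16 + 14 split.

```python
import random

def build_prompt_sheet(balanced_pool, short_items, rng,
                       n_balanced=16, n_short=14):
    """Pick one speaker's prompts: n_balanced from the phonetically
    balanced pool plus n_short short words/phrases (30 in total)."""
    balanced = rng.sample(balanced_pool, n_balanced)
    # Assumption: short items are also sampled; the slides do not say
    # whether every speaker read the same 14 items.
    short = rng.sample(short_items, n_short)
    return balanced + short

# Hypothetical placeholder pools, not real Lwazi prompts.
balanced_pool = [f"balanced sentence {i}" for i in range(500)]
short_items = [f"short phrase {i}" for i in range(40)]

rng = random.Random(2009)
sheet = build_prompt_sheet(balanced_pool, short_items, rng)

# With approximately 200 speakers, this gives roughly 200 * 30
# utterances per language.
approx_utterances_per_language = 200 * len(sheet)
```

Drawing the balanced prompts independently per speaker spreads coverage of the balanced pool across the speaker population rather than repeating one fixed list.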


                           Project Lwazi: Southern Bantu languages

               [Figure: two bar charts.
                   Left: distinct phonemes per language (y-axis 0-60; x-axis
                   languages tsn, ven, ssw, sot, nso, zul, nbl, xho, tso)
                   Right: speech minutes per language (y-axis 0-500; x-axis
                   languages ssw, nbl, zul, xho, tso, sot, nso, tsn, ven)]

     Figure: Amount of data within Lwazi ASR corpus


                             Computational analysis




               Goal:
                   Understand data requirements to develop a minimal system
                   that is practically usable
                   Use as seed ASR system to collect additional resources
                   Implications of additional speakers and utterances
                   Develop tools:
                           Provide indication of data sufficiency
                           Potential for cross-language sharing


                           Computational analysis




               Approach:
                   Measure acoustic variance in terms of the separability between
                   probability densities by modelling specific phonemes
                   Statistical measure provides an indication of the effect that
                   additional training data will have on recognition accuracy
                   Utilise the same measure as indication of acoustic similarity
                   across languages
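
The separability measure plotted in the later figures is labelled "Bhattacharyya bound". A minimal sketch of that bound for two Gaussian densities is given here, assuming diagonal covariances and equal priors (neither is stated on the slides); the function names are illustrative. With equal priors the bound is 0.5 for identical densities and falls towards 0 as the densities separate.

```python
import math

def bhattacharyya_distance(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians,
    each given as per-dimension means and variances."""
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)                               # averaged variance
        dist += 0.125 * (m1 - m2) ** 2 / v                # mean-separation term
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))    # variance-mismatch term
    return dist

def bhattacharyya_bound(mean1, var1, mean2, var2, prior1=0.5):
    """Upper bound on the Bayes error: sqrt(P1 * P2) * exp(-distance)."""
    d = bhattacharyya_distance(mean1, var1, mean2, var2)
    return math.sqrt(prior1 * (1.0 - prior1)) * math.exp(-d)

# Identical densities are inseparable, so the bound is at its maximum.
same = bhattacharyya_bound([0.0, 1.0], [1.0, 2.0], [0.0, 1.0], [1.0, 2.0])
# Densities whose means differ give a lower bound.
apart = bhattacharyya_bound([0.0], [1.0], [2.0], [1.0])
```

A bound close to 0.5 between two estimates of the same phoneme therefore suggests the estimates have converged, which is how the bound can serve as a similarity measure.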


                           Computational analysis


               Mainly focus on four languages here:
                   isiNdebele (nbl)
                   siSwati (ssw)
                   isiZulu (zul)
                   Tshivenda (ven)

               We report only on single-mixture context-independent models
               (similar trends observed for more complex models)
               Report on examples from several broad categories of
               phonemes (SAMPA) which occur most in target languages:
                   /a/ (vowels)
                   /m/ (nasals)
                   /b/ and /g/ (voiced plosives)
                   /s/ (unvoiced fricatives)


                          Analysis of phoneme variability




               [Figure: three-dimensional surface plot. Axes: Number of Speakers
               (5-30), Phone Observations per Speaker (5-50), Bhattacharyya
               bound.]

     Figure: Speaker-and-utterance three-dimensional plot for the siSwati nasal
     /m/
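
One way a speaker-by-observation grid of this kind could be computed is sketched below. The slides do not say which two densities are compared at each grid point; this sketch assumes single-Gaussian models of the same phoneme trained on two disjoint speaker groups, and it uses synthetic feature vectors (not Lwazi data) purely so that it runs. All names are illustrative.

```python
import math
import random

def fit_diag_gaussian(frames):
    """Per-dimension mean and variance of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
           for d in range(dim)]
    return mean, var

def bound(model_a, model_b):
    """Bhattacharyya bound (equal priors) between two diagonal Gaussians."""
    (mean_a, var_a), (mean_b, var_b) = model_a, model_b
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)
        dist += 0.125 * (ma - mb) ** 2 / v
        dist += 0.5 * math.log(v / math.sqrt(va * vb))
    return 0.5 * math.exp(-dist)

def synthetic_speaker(rng, n_obs, dim=3):
    """Synthetic stand-in for one speaker's observations of a phoneme:
    a speaker-specific offset plus per-observation noise."""
    offset = [rng.gauss(0.0, 0.5) for _ in range(dim)]
    return [[offset[d] + rng.gauss(0.0, 1.0) for d in range(dim)]
            for _ in range(n_obs)]

def stability(rng, n_speakers, n_obs):
    """Bound between models trained on two disjoint groups of speakers."""
    groups = []
    for _ in range(2):
        frames = [f for _ in range(n_speakers)
                  for f in synthetic_speaker(rng, n_obs)]
        groups.append(fit_diag_gaussian(frames))
    return bound(groups[0], groups[1])

rng = random.Random(0)
grid = {(s, o): stability(rng, s, o)
        for s in (5, 10, 20, 30) for o in (5, 20, 50)}
```

Each entry of `grid` corresponds to one point on a surface like the one in the figure; values near 0.5 would indicate that adding speakers or observations no longer changes the estimated density much.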


                           Number of phoneme utterances

               [Figure: four panels plotting Bhattacharyya bound (0.4-0.5)
               against Phone Observations per Speaker (0-50):
                   /m/ for zul, nbl, ven
                   /s/ for zul, nbl, ssw
                   /a/ for zul, nbl, ven
                   /b/ for ssw; /g/ for zul, nbl]

     Figure: Effect of number of phoneme utterances per speaker on similarity
     measure for different phoneme groups using data from 30 speakers


                           Number of speakers

               [Figure: four panels plotting Bhattacharyya bound (0.3-0.5)
               against Number of Speakers (0-35):
                   /m/ for zul, nbl, ven
                   /s/ for zul, nbl, ssw
                   /a/ for zul, nbl, ven
                   /b/ for ssw; /g/ for zul, nbl]

     Figure: Effect of number of speakers on similarity measure for different
     phoneme groups using 20 utterances per speaker


                           Initial ASR Accuracy

               Developed initial ASR systems for all of the Bantu languages
               Test sets: 30 speakers per language
               ASR system is phoneme recogniser, with flat language model
               A rough benchmark of acceptable phoneme accuracy: N-TIMIT

               [Figure: bar chart "Accuracy of phoneme recognisers". y-axis:
               Accuracy (%), 0-100; x-axis: Languages, with N-TIMIT (TIM)
               shown as benchmark.]


                           Impact of data reduction

               Division factor of 8:
                    Approximately 20 training speakers
                    Correlate well with the stable phoneme similarity values
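
The "approximately 20" figure can be checked with simple arithmetic. This sketch assumes the training set is the roughly 200 speakers per language minus the 30 test speakers mentioned earlier; the slides do not state the training-set size explicitly, so the 170 is an inference.

```python
def reduced_speakers(total_speakers, test_speakers, division_factor):
    """Training speakers left after dividing the training set by a factor."""
    return (total_speakers - test_speakers) / division_factor

# Assumed: ~200 speakers per language, 30 held out for testing.
# The factors other than 8 are illustrative.
table = {factor: reduced_speakers(200, 30, factor) for factor in (1, 2, 4, 8)}
# table[8] is 21.25, i.e. approximately 20 training speakers.
```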




     Figure: Reducing the number of speakers has (approximately) the same effect
     as reducing the amount of speech per speaker


                           Distances between phonemes

               Based upon proven stability of our phoneme models:
                   Similarity between phonemes across languages




     Figure: Effective distances for isiNdebele phonemes /a/ and /n/ and their
     closest matches.
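
A closest-match search of this kind can be sketched as a nearest-neighbour lookup under the Bhattacharyya distance. The model parameters below are made-up toy values, not measured Lwazi models, and `closest_matches` is a hypothetical helper; diagonal-covariance single Gaussians are assumed.

```python
import math

def bhattacharyya_distance(model_a, model_b):
    """Distance between two diagonal Gaussians given as (means, variances)."""
    (mean_a, var_a), (mean_b, var_b) = model_a, model_b
    dist = 0.0
    for ma, va, mb, vb in zip(mean_a, var_a, mean_b, var_b):
        v = 0.5 * (va + vb)
        dist += 0.125 * (ma - mb) ** 2 / v
        dist += 0.5 * math.log(v / math.sqrt(va * vb))
    return dist

def closest_matches(target_key, models, k=3):
    """Rank phonemes from other languages by distance to the target."""
    target = models[target_key]
    scored = [(bhattacharyya_distance(target, model), key)
              for key, model in models.items()
              if key[0] != target_key[0]]      # skip the target's own language
    return sorted(scored)[:k]

# Toy models keyed by (language code, SAMPA phoneme); values are illustrative.
toy_models = {
    ("nbl", "a"): ([0.0, 0.0], [1.0, 1.0]),
    ("zul", "a"): ([0.1, 0.0], [1.0, 1.1]),
    ("zul", "n"): ([2.0, -1.0], [0.8, 1.2]),
    ("ssw", "a"): ([0.3, 0.2], [1.2, 0.9]),
    ("ssw", "n"): ([2.2, -0.8], [0.9, 1.0]),
}

matches = closest_matches(("nbl", "a"), toy_models, k=2)
```

With these toy values the two nearest neighbours of isiNdebele /a/ are the /a/ models of the other two languages, which is the kind of result that would point to potential cross-language data sharing.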


                           Conclusion




               New method to determine data sufficiency
               Confirmed that different phoneme classes have different data
               requirements
               Our results suggest that similar phoneme accuracies may be
               achievable by using more speech from fewer speakers
               Based upon proven model stability we performed successful
               measurements of distances between phonemes of different
               languages


                           Conclusion




               Project Lwazi website:
                   http://www.meraka.org.za/lwazi
                   More info
                   Download corpora (ASR, TTS)
                   Download tools
                   Contact details

Collecting and Evaluating Speech Recognition Corpora for Nine Southern Bantu Languages

  • 1. Collecting and evaluating speech recognition corpora for nine Southern Bantu languages. Jaco Badenhorst, Charl van Heerden, Marelie Davel and Etienne Barnard. March 31, 2009.
  • 2. Introduction | ASR corpus design | Project Lwazi | Computational analysis | Conclusion. Outline: Introduction; Background (ASR corpus design, The Lwazi ASR corpus); Computational analysis (Approach, Analysis of phoneme variability); Conclusion.
  • 3. Introduction. Information flow in developing countries: availability of alternative information sources is low; cellular telephone networks are spreading rapidly. Spoken dialog systems (SDSs): there is a widespread belief that their impact can be significant; speech-based access can empower semi-literate people. Applications of SDSs: education (speech-enabled learning), agriculture, health care, government services.
  • 4. Introduction. Implementing SDSs requires ASR and TTS systems. Main linguistic resources needed for telephone-based ASR systems: electronic pronunciation dictionaries, annotated audio corpora, recognition grammars. Challenges: ASR is available for only a handful of African languages; linguistic resources for African languages are lacking; relevant audio for a specific application (language used, profile of speakers, speaking style, etc.) is lacking.
  • 5. ASR audio corpus. Building one is a resource-intensive process. Factors that add to complexity: recordings of multiple speakers; matching channel and style; careful orthographic transcription; markers required to indicate important events (e.g. non-speech). Size of corpora: corpora of resource-scarce languages tend to be very small (1-10 hours of audio), in contrast to the speech corpora used to build commercial systems (hundreds to thousands of hours).
  • 6. Project Lwazi. Three-year (2006-2009) project commissioned by the South African Department of Arts and Culture. Development of core speech technology resources and components (ASR, TTS, SDS, etc.). National pilot demonstrating the potential impact of speech-based systems in South Africa. Covers all 11 official languages of South Africa.
  • 7. Project Lwazi: Languages. Distribution of home languages for the South African population: 9 Southern Bantu languages, 2 Germanic languages.
  • 8. Project Lwazi ASR corpus. Approximately 200 speakers per language. Speaker population selected to provide a balanced profile with regard to age, gender and type of telephone (cellphone/landline). Read and elicited speech recorded over a telephone channel. 30 utterances per speaker: 16 randomly selected from a phonetically balanced corpus, and 14 short words and phrases.
  • 9. Project Lwazi: Southern Bantu languages. Figure: Amount of data within the Lwazi ASR corpus. Two bar charts: distinct phonemes per language (axis 0 to 60) and speech minutes per language (axis 0 to 500), for tsn, ven, ssw, sot, nso, zul, nbl, xho and tso.
  • 10. Computational analysis. Goal: understand the data requirements for developing a minimal system that is practically usable; use it as a seed ASR system to collect additional resources; understand the implications of additional speakers and utterances. Develop tools that provide an indication of data sufficiency and of the potential for cross-language sharing.
  • 11. Computational analysis. Approach: measure acoustic variance in terms of the separability between probability densities, by modelling specific phonemes. The statistical measure provides an indication of the effect that additional training data will have on recognition accuracy. The same measure is used as an indication of acoustic similarity across languages.
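
The separability measure named on the following slides is the Bhattacharyya bound. As a minimal sketch, assuming each phoneme is modelled by a single Gaussian with diagonal covariance and that both classes have equal priors (neither assumption is stated in the slides), the standard bound on the two-class Bayes error can be computed as follows. The function and parameter names are illustrative only.

```python
import math
from typing import Sequence


def bhattacharyya_distance(
    mean1: Sequence[float],
    var1: Sequence[float],
    mean2: Sequence[float],
    var2: Sequence[float],
) -> float:
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Each argument holds one value per feature dimension; variances must be > 0.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v_avg = 0.5 * (v1 + v2)
        # Term driven by the difference between the means.
        dist += 0.125 * (m1 - m2) ** 2 / v_avg
        # Term driven by the difference between the variances.
        dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))
    return dist


def bhattacharyya_bound(
    mean1: Sequence[float],
    var1: Sequence[float],
    mean2: Sequence[float],
    var2: Sequence[float],
    prior1: float = 0.5,  # assumption: equal class priors by default
) -> float:
    """Upper bound on the Bayes error between the two Gaussian classes."""
    prior2 = 1.0 - prior1
    d = bhattacharyya_distance(mean1, var1, mean2, var2)
    return math.sqrt(prior1 * prior2) * math.exp(-d)
```

With equal priors, two identical densities give the maximum value of 0.5, which is also the upper end of the vertical axes in the figures; well-separated densities give values close to 0.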
  • 12. Computational analysis. Focus here is mainly on four languages: isiNdebele (nbl), siSwati (ssw), isiZulu (zul) and Tshivenda (ven). We report only on single-mixture context-independent models (similar trends were observed for more complex models). We report on examples from several broad categories of phonemes (SAMPA) that occur most in the target languages: /a/ (vowels), /m/ (nasals), /b/ and /g/ (voiced plosives), /s/ (unvoiced fricatives).
  • 13. Analysis of phoneme variability. Figure: Speaker-and-utterance three-dimensional plot for the siSwati nasal /m/. Axes: number of speakers, phone observations per speaker, Bhattacharyya bound.
  • 14. Number of phoneme utterances. Figure: Effect of number of phoneme utterances per speaker on similarity measure for different phoneme groups using data from 30 speakers. Four panels plot the Bhattacharyya bound against phone observations per speaker (0 to 50): /m/ (zul, nbl, ven); /s/ (zul, nbl, ssw); /a/ (zul, nbl, ven); /b/ (ssw) and /g/ (zul, nbl).
  • 15. Number of speakers. Figure: Effect of number of speakers on similarity measure for different phoneme groups using 20 utterances per speaker. Four panels plot the Bhattacharyya bound against number of speakers (0 to 35): /m/ (zul, nbl, ven); /s/ (zul, nbl, ssw); /a/ (zul, nbl, ven); /b/ (ssw) and /g/ (zul, nbl).
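
The slides do not spell out how the curves in these figures were produced. One way to obtain curves of this kind, sketched here under stated assumptions, is to estimate the same phoneme model from two disjoint groups of speakers and compute the Bhattacharyya bound between the two estimates, repeating this for increasing numbers of speakers or observations per speaker. The toy corpus, the one-dimensional feature and all function names below are hypothetical and are not taken from the Lwazi tools.

```python
import math
import random
from statistics import fmean, pvariance


def bound_1d(m1: float, v1: float, m2: float, v2: float) -> float:
    """Bhattacharyya bound (equal priors) between two 1-D Gaussians."""
    v_avg = 0.5 * (v1 + v2)
    dist = 0.125 * (m1 - m2) ** 2 / v_avg
    dist += 0.5 * math.log(v_avg / math.sqrt(v1 * v2))
    return 0.5 * math.exp(-dist)


def stability_bound(data, n_speakers, n_obs, rng):
    """Bound between two models of one phoneme from disjoint speaker groups.

    data maps a speaker id to a list of scalar feature values (toy stand-in
    for real acoustic features). Needs 2 * n_speakers speakers in data.
    """
    speakers = rng.sample(sorted(data), 2 * n_speakers)
    groups = (speakers[:n_speakers], speakers[n_speakers:])
    models = []
    for group in groups:
        # Pool n_obs randomly chosen observations from every speaker in the group.
        pooled = [x for spk in group for x in rng.sample(data[spk], n_obs)]
        models.append((fmean(pooled), pvariance(pooled)))
    (m1, v1), (m2, v2) = models
    return bound_1d(m1, v1, m2, v2)


def make_toy_corpus(n_speakers, n_obs, rng):
    """Hypothetical data: each speaker has a random offset plus noise."""
    corpus = {}
    for i in range(n_speakers):
        offset = rng.gauss(0.0, 1.0)
        corpus[f"spk{i:03d}"] = [rng.gauss(offset, 1.0) for _ in range(n_obs)]
    return corpus


rng = random.Random(0)
corpus = make_toy_corpus(60, 50, rng)
for n in (2, 5, 10, 20, 30):
    # 20 observations per speaker, as in the caption of the speaker figure.
    print(n, round(stability_bound(corpus, n, 20, rng), 3))
```

Under this reading, a value that approaches 0.5 and stops changing as data is added indicates that the two estimates have become nearly indistinguishable, so further data of the same kind would change the model little.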
  • 16. Initial ASR accuracy. Developed initial ASR systems for all of the Bantu languages. Test sets: 30 speakers per language. The ASR system is a phoneme recogniser with a flat language model. A rough benchmark of acceptable phoneme accuracy: N-TIMIT. Figure: Accuracy of phoneme recognisers (accuracy in %, 0 to 100, by language).
  • 17. Impact of data reduction. Division factor of 8: approximately 20 training speakers. Results correlate well with the stable phoneme similarity values. Figure: Reducing the number of speakers has (approximately) the same effect as reducing the amount of speech per speaker.
  • 18. Distances between phonemes. Based upon the proven stability of our phoneme models: phoneme similarity between phonemes across languages. Figure: Effective distances for isiNdebele phonemes /a/ and /n/ and their closest matches.
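
Finding the "closest matches" for a phoneme amounts to ranking the phoneme models of other languages by their distance to the target model. The sketch below assumes one-dimensional Gaussian models and the Bhattacharyya distance; the model values and labels are invented purely for illustration and are not results from the Lwazi corpus.

```python
import math
from typing import Dict, List, Tuple

Model = Tuple[float, float]  # (mean, variance) of a toy 1-D Gaussian


def distance(a: Model, b: Model) -> float:
    """Bhattacharyya distance between two 1-D Gaussian models."""
    (m1, v1), (m2, v2) = a, b
    v_avg = 0.5 * (v1 + v2)
    return 0.125 * (m1 - m2) ** 2 / v_avg + 0.5 * math.log(v_avg / math.sqrt(v1 * v2))


def closest_matches(target: Model, candidates: Dict[str, Model], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k candidate labels with the smallest distance to target."""
    scored = sorted((distance(target, model), label) for label, model in candidates.items())
    return [(label, dist) for dist, label in scored[:k]]


# Hypothetical models, keyed as "language:/phoneme/".
toy_models = {
    "zul:/a/": (0.1, 1.0),
    "ssw:/a/": (0.5, 1.0),
    "zul:/n/": (3.0, 1.0),
}
nbl_a = (0.0, 1.0)  # hypothetical isiNdebele /a/ model
for label, dist in closest_matches(nbl_a, toy_models, k=2):
    print(label, round(dist, 4))
```

A ranking like this is only meaningful once the models themselves are stable, which is why the slide ties the cross-language comparison to the stability analysis.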
  • 19. Conclusion. New method to determine data sufficiency. Confirmed that different phoneme classes have different data requirements. Our results suggest that similar phoneme accuracies may be achievable by using more speech from fewer speakers. Based upon proven model stability, we performed successful measurements of distances between phonemes of different languages.
  • 20. Conclusion. Project Lwazi website (more info, corpus downloads for ASR and TTS, tool downloads, contact details): http://www.meraka.org.za/lwazi