Improving orthographic transcriptions using sentence similarities

NL Oosthuizen, MJ Puttkammer & M Schlemmer
Centre for Text Technology (CTexT®)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK)
South Africa
E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za




18 May 2010; AfLaT 2010; Valletta, Malta



Introduction: Background
 • Lwazi (knowledge) project
        – 200 mother-tongue speakers per language
        – 30 phrases – 14 open-ended and 16 phoneme-rich sentences
        – 350 phoneme-rich sentences from various corpora, each recorded 6-10 times, totalling 3200 phoneme-rich sentences
        – A relatively small ASR corpus meant that extremely accurate transcriptions were required




Introduction: Problem Statement
 • Lwazi project – issues
        – 2-year running time
        – 4-6 transcribers were employed per language
        – The various quality-control phases were unsuccessful
        – Another solution was needed to improve quality






Identified Differences: Confusables
 • English examples:
        – has it been tried on <too> small a scale
        – has it been tried on <to> small a scale

 • isiXhosa examples:
        – andingomntu <othanda> kufunda
              • (I’m not a person <who loves> to read)
        – andingomntu <uthanda> kufunda
              • (I’m not a person <you love> to read)

 • Setswana examples:
        – bosa bo <jang> ko engelane ka nako e
        – bosa bo <yang> ko engelane ka nako e
              • (How is the weather in England at this time?)
              • <yang> in the second example is slang for “how”





Identified Differences: Splits
 • English examples:
        – there’s <nowhere> else for it to go
        – there’s <no_where> else for it to go

 • isiXhosa examples:
        – alwela phi na <loo_madabi>
        – alwelwa phi na <loomadabi>
              • (Where is it taking place, these challenges)

 • Setswana examples:
        – le fa e le <gone> re ratanang tota
        – le fa e le <go_ne> re ratanang tota
              • (Even though we have started dating)







Identified Differences: Insertions
 • English examples:
        – so we took our way toward the palace
        – so we <we_>took our way toward the palace

 • isiXhosa examples:
        – ibiyini ukuba unga mbambi wakumbona
              • (Why didn’t <you catch> him or her when you saw him or her)
        – ibiyini <na_>ukuba unga mbambi wakumbona
              • (Why didn’t <you caught> him or her when you saw him or her)

 • Setswana examples:
        – ba mmatlela mapai ba mo alela a robala
        – ba mmatlela mapai <li-> ba mo alela a robala
              • (They have looked for blankets and made a bed for themselves to sleep)






Identified Differences: Deletions
 • English examples:
        – as <to_>the first the answer is simple
        – as the first the answer is simple

 • isiXhosa examples:
        – yagaleleka impi <ke_>xa kuthi qheke ukusa
        – yagaleleka impi xa kuthi qheke ukusa
              • (It started the battle at the beginning of the morning)

 • Setswana examples:
        – ke eng gape se seng<we> se o se lemogang
        – ke eng gape se seng se o se lemogang
              • (What else have you noticed?)







Identified Differences: Non-words
 • English examples:
        – there is no <arbitrator> except a legislature fifteen thousand miles off
        – there is no <abritator> except a legislature fifteen thousand miles off

 • isiXhosa examples:
        – yile <venkile> yayikhethwe ngabathembu le
        – yile <venkeli> yayikhethwe ngabathembu le
              • (It is this shop that was selected by the Bathembu)
              • <venkeli> in the second example is a spelling mistake.

 • Setswana examples:
        – lefapha la dimenerale le <eneji>
        – lefapha la dimenerale le <energy>
              • (Department of minerals and energy)
              • <energy> in the second example is a spelling mistake.





Methodology: Flowchart
 • Original sentences & transcriptions
        – An average of 350 sentences recorded 6-10 times, transcribed by 4-6 people per language
 • Cleanup
        – Original sentences: remove punctuation; convert to lower case
        – Transcriptions: remove punctuation, noise markers & partials; convert to lower case
 • Map transcriptions to originals
        – Compute the Levenshtein distance and map each transcription to the closest original sentence
        – e.g. original: slighter faults of substance are numerous
               slighter fault substance are numerous (90.20%)
               slighter faults of substances are numerous (97.60%)
 • Compare mapped sentences
        – String similarity algorithm by Brad Wood; a “look ahead” window finds the differences







Methodology: Flowchart (continued)
 • Mark-up
        – HTML mark-up to illustrate the differences with colours
        – e.g. slighter faults of substance are numerous  vs  slighter fault substance are numerous
               slighter faults of substance are numerous  vs  slighter faults of substances are numerous
 • Manual verification
        – Verify the differences in context by listening to the recordings
        – e.g. I told him to make the charge at once  vs  I told him to make the change at once
 • Correct errors
        – Replace errors with the correct string
 • Improved transcriptions





Methodology: Cleanup
 • Remove possible differences from the sentences to improve matches (a minimal cleanup sketch follows):
        – Punctuation
              • Any commas, full stops, extra spaces, etc.
        – Noise markers
              • External noises [n] and speaker noises [s]
        – Partials
              • Any incomplete words (indicated by a leading or trailing hyphen)
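
To make the cleanup step concrete, here is a minimal Python sketch written for this summary rather than taken from the project: the noise markers [n]/[s] and the hyphen convention for partials come from the slide above, while the regular expressions and the function name clean() are assumptions.

import re

def clean(sentence, is_transcription=False):
    # Assumed cleanup routine; only the conventions named on the slide are taken as given.
    s = sentence.lower()                                      # convert to lower case
    if is_transcription:
        s = re.sub(r"\[[ns]\]", " ", s)                       # drop noise markers [n] and [s]
        s = re.sub(r"\b\w+-(?=\s|$)|(?<=\s)-\w+\b", " ", s)   # drop partial words
    s = re.sub(r"[^\w\s']", " ", s)                           # drop punctuation
    return re.sub(r"\s+", " ", s).strip()                     # collapse extra spaces

print(clean("Slighter faults, of sub- substance [n] are numerous.", True))
# -> slighter faults of substance are numerous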




Methodology: Transcription Mapping
 • Levenshtein mapping:
        – Link each transcribed sentence (T) to an original sentence (O) using Levenshtein distance
        – If no difference is found (DIFF(O, T) = 0)
              • Do nothing
        – If a difference is found (DIFF(O, T) = 1)
              • Continue to the next step






Methodology: Transcription Mapping
 • Levenshtein example (a minimal mapping sketch follows):
        – Original sentence (O):
              • slighter faults of substance are numerous
        – Transcriptions (T):
              • slighter fault substance are numerous
                     – 90.20%
              • slighter faults of substances are numerous
                     – 97.60%
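
A minimal mapping sketch, assuming a standard Levenshtein distance normalised by the length of the longer string; the exact normalisation used in the project is not stated on the slide, so the scores come out close to, but not exactly equal to, the 90.20% and 97.60% shown above. The function names are illustrative.

def levenshtein(a, b):
    # Standard dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    # Assumed normalisation: 1 - distance / length of the longer string.
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def map_to_closest(transcription, originals):
    # Link a transcription to the original sentence it resembles most.
    return max(originals, key=lambda o: similarity(transcription, o))

original = "slighter faults of substance are numerous"
for t in ("slighter fault substance are numerous",
          "slighter faults of substances are numerous"):
    print(f"{similarity(t, original):.2%}  {t}")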







Methodology: Sentences and Mark-up
 • String comparison algorithm developed by Brad Wood (2008):
        – Based on finding the Longest Common Substring (LCS)
        – A window compares the strings at character level over a maximum search distance
        – Differences found are annotated with HTML
 • Repeat after swapping string 1 and string 2 (an approximate sketch follows)
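
An approximate sketch only: Python's difflib stands in here for Brad Wood's windowed LCS comparison, which is not reproduced, and the <span> colour mark-up merely illustrates the idea of highlighting differences in HTML rather than the project's actual output format.

from difflib import SequenceMatcher

def mark_up(original, transcription):
    # Wrap every non-matching stretch of either string in a coloured span.
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, original, transcription).get_opcodes():
        if op == "equal":
            out.append(original[i1:i2])
            continue
        if i1 != i2:   # present in the original but not in the transcription
            out.append(f'<span style="color:red">{original[i1:i2]}</span>')
        if j1 != j2:   # present in the transcription but not in the original
            out.append(f'<span style="color:green">{transcription[j1:j2]}</span>')
    return "".join(out)

print(mark_up("slighter faults of substance are numerous",
              "slighter fault substance are numerous"))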



Methodology: Manual Verification
 • If DIFF(O, T) = 1:
        – The spoken utterance (U) is compared to the original sentence (O)
        – If DIFF(O, U) = 1 AND DIFF(T, U) = 0, then
              • U = T (no change is needed)
        – If DIFF(O, U) = 1 AND DIFF(T, U) = 1, then
              • The transcription is incorrect and needs to be checked manually (a sketch of this rule follows)
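
A sketch of this decision rule, under two assumptions: the spoken utterance is represented here by a verbatim reference string of what was actually said (in the project this comparison is made by a person listening to the recording), and DIFF() is simple string inequality after cleanup.

def diff(x, y):
    # 1 if the cleaned strings differ, 0 if they are identical.
    return int(x != y)

def needs_manual_check(original, transcription, utterance):
    if diff(original, transcription) == 0:
        return False   # transcription matches the prompt; nothing was flagged
    if diff(original, utterance) == 1 and diff(transcription, utterance) == 0:
        return False   # the speaker deviated from the prompt and the transcriber captured it
    return True        # the transcription disagrees with what was said: check manually

print(needs_manual_check("make the charge", "make the change", "make the change"))  # False
print(needs_manual_check("a heavy word",    "a heavy wood",    "a heavy word"))     # True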






Methodology: Manual Verification
 • Transcribed correctly
        – Original sentence (O):
              • “I told him to make the charge at once.”
        – Spoken utterance (U)
              • “I told him to make the change at once.”
        – Transcriptions (T):
              • “I told him to make the cha<n>ge at once.”







Methodology: Manual Verification
 • Transcribed incorrectly
        – Original sentence (O):
              • “a heavy word intervened, between...”
        – Spoken utterance (U)
              • “a heavy word intervened, between...”
        – Transcriptions (T):
              • “a heavy wo<o>d intervened, between...”







Results

   Language      Differences found   Actual errors
   Afrikaans            776               152
   English             1143               337
   isiNdebele           958               291
   isiXhosa            1484              1081
   isiZulu             1854              1228
   Sepedi              1596               736
   Sesotho              739               261
   Setswana            1479               828
   Siswati             1558               351
   Tshivenda            814               191
   Xitsonga            1586               456






Conclusion: Summary
 • We introduced a method for identifying differences in ASR transcription data
 • The overall quality of the transcriptions was improved
 • The Lwazi project achieved an average transcription accuracy of 98%






Conclusion: Summary
 • Even with inexperienced transcribers, high accuracy is still possible
 • Provides employment opportunities to people with little formal linguistic training but a basic knowledge of their language
 • Empowers people to learn skills that may be invaluable in future projects





Conclusion: Future Work
 • If DIFF(O, T) = 0 AND DIFF(O, U) = 1
        – This indicates that DIFF(T, U) = 1
        – The current system only considered the DIFF(T, U) = 0 condition, as the specifications required
        – Checking this case would make it possible to evaluate the reader's performance
        – Future work will include this check (a minimal sketch follows)
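
A minimal sketch of the proposed check, under the same string-comparison assumption as the verification sketch earlier; the function name is illustrative.

def reader_deviation(original, transcription, utterance):
    # DIFF(O, T) = 0 and DIFF(O, U) = 1 implies DIFF(T, U) = 1: the transcriber
    # wrote the prompt as given, but the speaker read it differently, so the
    # reader's performance (not the transcription) is what should be flagged.
    return original == transcription and original != utterance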





Conclusion: Questions?
Improving orthographic transcriptions using sentence similarities

NL Oosthuizen, MJ Puttkammer & M Schlemmer
Centre for Text Technology (CTexT®)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK)
South Africa
E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za



