1. Improving orthographic transcriptions using sentence similarities
NL Oosthuizen, MJ Puttkammer & M Schlemmer
Centre for Text Technology (CTexT®)
Research Unit: Languages and Literature in the South African Context
North-West University, Potchefstroom Campus (PUK)
South Africa
E-mail: {nico.oosthuizen, martin.puttkammer, martin.schlemmer}@nwu.ac.za
18 May 2010; AfLaT 2010; Valletta, Malta
2. Introduction: Background
• Lwazi (knowledge) project
– 200 mother-tongue speakers per language
– 30 phrases each: 14 open-ended and 16 phoneme-rich sentences
– 350 phoneme-rich sentences from various corpora, each recorded 6-10 times, totalling 3200 phoneme-rich sentences
– The relatively small ASR corpus meant extremely accurate transcriptions were required
3. Introduction: Problem Statement
• Lwazi project – Issues
– 2-year running time
– 4-6 transcribers were employed per language
– Several quality-control phases were unsuccessful
– Another solution was needed to improve quality
4. Identified Differences: Confusables
• English examples:
– has it been tried on <too> small a scale
– has it been tried on <to> small a scale
• isiXhosa examples:
– andingomntu <othanda> kufunda
• (I’m not a person <who loves> to read)
– andingomntu <uthanda> kufunda
• (I’m not a person <you love> to read)
• Setswana examples:
– bosa bo <jang> ko engelane ka nako e
– bosa bo <yang> ko engelane ka nako e
• (How is the weather in England at this time?)
• <yang> in the second example is slang for “how”
5. Identified Differences: Splits
• English examples:
– there’s <nowhere> else for it to go
– there’s <no_where> else for it to go
• isiXhosa examples:
– alwela phi na <loo_madabi>
– alwelwa phi na <loomadabi>
• (Where are they taking place, these challenges?)
• Setswana examples:
– le fa e le <gone> re ratanang tota
– le fa e le <go_ne> re ratanang tota
• (Even though we have started dating)
6. Identified Differences: Insertions
• English examples:
– so we took our way toward the palace
– so we <we_>took our way toward the palace
• isiXhosa examples:
– ibiyini ukuba unga mbambi wakumbona
• (Why didn’t <you catch> him or her when you saw him or her)
– ibiyini <na_>ukuba unga mbambi wakumbona
• (Why didn’t <you caught> him or her when you saw him or her)
• Setswana examples:
– ba mmatlela mapai ba mo alela a robala
– ba mmatlela mapai <li-> ba mo alela a robala
• (They have looked for blankets and made a bed for themselves to sleep)
7. Identified Differences: Deletions
• English examples:
– as <to_>the first the answer is simple
– as the first the answer is simple
• isiXhosa examples:
– yagaleleka impi <ke_>xa kuthi qheke ukusa
– yagaleleka impi xa kuthi qheke ukusa
• (The battle started at the break of morning)
• Setswana examples:
– ke eng gape se seng<we> se o se lemogang
– ke eng gape se seng se o se lemogang
• (What else have you noticed?)
8. Identified Differences: Non-words
• English examples:
– there is no <arbitrator> except a legislature fifteen thousand miles off
– there is no <abritator> except a legislature fifteen thousand miles off
• isiXhosa examples:
– yile <venkile> yayikhethwe ngabathembu le
– yile <venkeli> yayikhethwe ngabathembu le
• (It is this shop that was selected by the Bathembu)
• <venkeli> in the second example is a spelling mistake.
• Setswana examples:
– lefapha la dimenerale le <eneji>
– lefapha la dimenerale le <energy>
• (Department of minerals and energy)
• <energy> in the second example is a spelling mistake.
9. Methodology: Flowchart
• Original sentences & transcriptions: an average of 350 sentences, recorded 6-10 times, transcribed by 4-6 people per language
• Cleanup: for the transcriptions, remove punctuation, noise markers & partials; for the original sentences, remove punctuation; convert both to lower case (LC)
• Map transcription to original: compute the Levenshtein distance and map each transcription to the closest original sentence, e.g. for “slighter faults of substance are numerous”: “slighter fault substance are numerous” (90.20%) and “slighter faults of substances are numerous” (97.60%)
• Compare mapped sentences: string similarity (Brad Wood); a “look ahead” window finds the differences
10. Methodology: Flowchart (continued)
• Mark-up: HTML mark-up to illustrate the differences with colours, e.g. “slighter faults of substance are numerous” vs “slighter fault substance are numerous” and “slighter faults of substances are numerous”
• Manual verification: verify the differences in context by listening to the recordings, e.g. “I told him to make the charge at once” vs “I told him to make the change at once”
• Correct errors: replace errors with the correct string
• Result: improved transcriptions
11. Methodology: Cleanup
• Remove possible differences from the sentences to improve matches (a sketch follows below):
– Punctuation
• Any commas, full stops, extra spaces etc.
– Noise Markers
• External noises [n] and speaker noises [s]
– Partials
• Any incomplete words (indicated by a leading or trailing hyphen)
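A minimal sketch of this cleanup step, assuming the noise markers appear literally as [n] and [s] and partials as hyphen-marked tokens; the function name and details are illustrative, not taken from the Lwazi tools:

```python
import re

def clean_sentence(text: str, is_transcription: bool = True) -> str:
    """Normalise a sentence before matching (illustrative sketch)."""
    if is_transcription:
        # Noise markers: external noises [n] and speaker noises [s].
        text = re.sub(r"\[[ns]\]", " ", text)
        # Partials: incomplete words marked by a leading or trailing hyphen.
        text = " ".join(t for t in text.split()
                        if not (t.startswith("-") or t.endswith("-")))
    # Punctuation: commas, full stops etc. (apostrophes are kept).
    text = re.sub(r"[,\.\?!;:\"()]", "", text)
    # Convert to lower case and collapse extra spaces.
    return " ".join(text.lower().split())

print(clean_sentence("So we [s] took our way to- the palace."))
# -> "so we took our way the palace"
```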
12. Methodology: Transcription Mapping
• Levenshtein mapping:
– Link each transcribed sentence (T) to an original sentence (O) using Levenshtein distance
– If no difference is found (DIFF(O, T) = 0)
• Do nothing
– If a difference is found (DIFF(O, T) = 1)
• Continue to the next step
13. Methodology: Transcription Mapping (continued)
• Levenshtein example (see the sketch below):
– Original sentence (O):
• slighter faults of substance are numerous
– Transcriptions (T):
• slighter fault substance are numerous – 90.20%
• slighter faults of substances are numerous – 97.60%
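A minimal sketch of this mapping, assuming a plain character-level Levenshtein distance; normalising the distance as 1 - d / max(|O|, |T|) approximately reproduces the percentages above (function names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insert/delete/substitute).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def map_to_closest(transcription: str, originals: list[str]) -> tuple[str, float]:
    # Link a transcription (T) to the original sentence (O) with the
    # smallest edit distance, and report a normalised similarity.
    best = min(originals, key=lambda o: levenshtein(transcription, o))
    d = levenshtein(transcription, best)
    return best, 1 - d / max(len(transcription), len(best))

best, sim = map_to_closest("slighter fault substance are numerous",
                           ["slighter faults of substance are numerous"])
print(f"{sim:.2%}")  # ~90.24%, close to the 90.20% shown above
```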
14. Methodology: Sentences and Mark-up
• String comparison algorithm developed by Brad Wood (2008):
– Based on finding the Longest Common String (LCS)
– Windowing compares the strings at character level over a maximum search distance
– Differences found are annotated with HTML
• Repeat after swapping string 1 with string 2 (a comparable sketch follows below)
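Wood's implementation is not reproduced here; as a comparable sketch, Python's difflib can find the common blocks and wrap the remaining differences in HTML:

```python
import difflib

def markup_differences(original: str, transcription: str) -> str:
    # Walk the opcodes from SequenceMatcher and colour the differences:
    # red for text only in the original, green for text only in the
    # transcription (illustrative; not Wood's windowing algorithm).
    out = []
    sm = difflib.SequenceMatcher(None, original, transcription)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "equal":
            out.append(original[i1:i2])
            continue
        if i1 != i2:
            out.append(f'<span style="color:red">{original[i1:i2]}</span>')
        if j1 != j2:
            out.append(f'<span style="color:green">{transcription[j1:j2]}</span>')
    return "".join(out)

print(markup_differences("slighter faults of substance are numerous",
                         "slighter faults of substances are numerous"))
```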
15. Methodology: Manual Verification
• If DIFF(O, T) = 1:
– The spoken utterance (U) is compared to the original sentence (O)
– If DIFF(O, U) = 1 AND DIFF(T, U) = 0, then
• U = T (no change is needed)
– If DIFF(O, U) = 1 AND DIFF(T, U) = 1, then
• The transcription is incorrect and needs to be checked manually (this logic is sketched below)
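A sketch of this decision logic, reading DIFF(x, y) as 1 when x and y differ and 0 when they match; here `heard` stands for what the verifier hears in the recording (U), and all names are illustrative:

```python
def needs_manual_check(original: str, transcription: str, heard: str) -> bool:
    """Return True when a flagged transcription must be corrected manually."""
    if transcription == original:   # DIFF(O, T) = 0: nothing was flagged
        return False
    if heard == transcription:      # DIFF(T, U) = 0: the transcriber captured
        return False                # the deviation correctly, U = T
    return True                     # DIFF(T, U) = 1: check manually

# Reader said "change" for "charge" and it was transcribed as "change":
print(needs_manual_check("i told him to make the charge at once",
                         "i told him to make the change at once",
                         "i told him to make the change at once"))  # False
# Reader said "word" but it was transcribed as "wood":
print(needs_manual_check("a heavy word intervened between",
                         "a heavy wood intervened between",
                         "a heavy word intervened between"))         # True
```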
16. Methodology: Manual Verification (continued)
• Transcribed correctly
– Original sentence (O):
• “I told him to make the charge at once.”
– Spoken utterance (U):
• “I told him to make the change at once.”
– Transcription (T):
• “I told him to make the cha<n>ge at once.”
17. Methodology: Manual Verification (continued)
• Transcribed incorrectly
– Original sentence (O):
• “a heavy word intervened, between...”
– Spoken utterance (U):
• “a heavy word intervened, between...”
– Transcription (T):
• “a heavy wo<o>d intervened, between...”
18. Results
Language Differences found Actual errors
Afrikaans 776 152
English 1143 337
isiNdebele 958 291
isiXhosa 1484 1081
isiZulu 1854 1228
Sepedi 1596 736
Sesotho 739 261
Setswana 1479 828
Siswati 1558 351
Tshivenda 814 191
Xitsonga 1586 456
19. Conclusion: Summary
• We introduced a method for identifying differences in ASR data
• The overall quality of the transcriptions was improved
• The Lwazi project had an average transcription accuracy of 98%
20. Conclusion: Summary (continued)
• Even with inexperienced transcribers, high accuracy is still possible
• The approach provides employment opportunities to people with little formal linguistic training but a basic knowledge of their language
• It empowers people to learn skills that may be invaluable in future projects
21. Conclusion: Future Work
• If DIFF(O, T) = 0 AND DIFF(O, U) = 1:
– This indicates that DIFF(T, U) = 1
– The current system considered only DIFF(T, U) = 0, as the specifications required it
– Adding this case means the reader's performance can also be checked
– Future work will include this condition (sketched below)
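A minimal sketch of the proposed extra check, under the same illustrative conventions as the earlier sketches:

```python
def reader_misread(original: str, transcription: str, heard: str) -> bool:
    # DIFF(O, T) = 0 but DIFF(O, U) = 1 implies DIFF(T, U) = 1: the
    # transcription matches the prompt, so any deviation in the utterance
    # points at the reader's performance rather than the transcriber's.
    return transcription == original and heard != original
```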
22. Conclusion: Questions?