An HLT profile of the official South African languages

© Aditi Sharma Grover, Gerhard B van Huyssteen, Marthinus W. Pretorius

Published in: Technology

  1. An HLT profile of the official South African languages
     Aditi Sharma Grover (1,2), Gerhard B van Huyssteen (1,3) & Marthinus W. Pretorius (2)
     (1) HLT Research Group, CSIR, South Africa
     (2) Graduate School of Technology Management, University of Pretoria, South Africa
     (3) Centre for Text Technology (CTexT), North-West University, South Africa
  2. Overview
     • Background
     • Process
     • Results
     • Conclusion
  3. Background: South African HLT landscape
     • 11 official languages
     • HLT community
       – R&D community (universities & science councils)
       – Very few private sector companies
     • Various government initiatives
       – DST: HLT road-mapping process, NHN
       – DAC: HLT strategy, National Centre for HLT
       – NRF: research funding
  4. Background: Challenge
     • SA has not yet capitalised on opportunities to create a thriving HLT industry
       – Lack of awareness within the local HLT community
         • Perpetuated by perceived fragmentation of South African R&D activities
       – Lack of a unified technological profile of HLT activities across the 11 languages
     • 2009: a technology audit for the South African HLT landscape (SAHLTA)
       – Align R&D activities and stimulate cooperation
       – Similar to Dutch (BLaRK), EuroMap
  5. Process: SAHLTA Process
     [Process diagram: Inventory criteria, Cursory inventory, Audit workshop, Terminology, Questionnaire]
  6. Process: SAHLTA Process, Phase 1
     • Establish lingua franca
     • Consolidate prior knowledge regarding data, modules, applications, and platforms/tools
  7. Process: SAHLTA Process, Phase 2
     • Priorities
  8. Results: Preliminary HLT Priorities, Prioritisation
     • Based on international trends, local needs, and feasibility
     • Priority 1: Basic & robust core HLT technology applications, modules and data
     • Priority 2, 3: LRs that further enhance and complement core LRs (priority 1), and base their development on a strong foundation of core HLT LRs
       – Many advanced HLT applications are priority 2, 3
     • Verification by larger SA HLT community
       – Need to be updated regularly
  9. Results: Preliminary HLT Priorities, Priority 1: Applications
     Text:
     • Proofing tools
     • Information Extraction
     • Information Retrieval
     • Human-aided machine translation
     • Machine-aided human translation
     Speech:
     • Accessibility
     • Telephony applications
     • Computer-assisted language learning
     • Voice search
     • Audio management
  10. Results: Preliminary HLT Priorities, Priority 2: Applications
      Text:
      • OCR/ICR
      • Multilingual comprehension assistants
      • CALL
      • Authorship identification
      Speech:
      • Access control
      • Embedded speech recognition
      • Speaking devices
      • Computer-assisted training
  11. Results: Preliminary HLT Priorities, Priority 3: Applications
      Text:
      • Text generation
      • Document classification
      • Summarisation
      • QA
      • Dialogue systems
      • Reference works
      Speech:
      • Transcription/dictation
      • Multimodal information access
      • Command & Control
      • Announcement systems
      • Audio books
      • S2S translation
  12. Results: Preliminary HLT Priorities, Priority 1: Modules
      Text:
      • G2P
      • Text pre-processing
      • Normalisation
      • Morphological analysis
      • POS tagging
      • Chunking
      • WSD
      • Language ID
      Speech:
      • Complete ASR
      • Non-native ASR
      • Complete TTS
      • Confidence measures
      • Speaker ID
      • Diarisation
      • Language/dialect ID
  13. Results: Preliminary HLT Priorities, Priority 1: Data
      Text:
      • Monolingual corpora
      • Multilingual corpora
      • Test suites and corpora
      • Lexica (incl. named-entity lists)
      • Domain-/Application-specific corpora
      Speech:
      • Annotated monolingual corpora
      • Domain-/Application-specific corpora
      • Test suites and corpora
      • Pronunciation resources (e.g. phone sets, dictionaries, etc.)
  14. Process: SAHLTA Process, Phase 3
      • Indexes
      • Detailed inventory
      • Gap analysis
  15. Process: Response rate
  16. Results: Maturity Index
      • Maturity stages:
        – Under development (UD), Alpha version (AV), Beta version (BV), Released (RV)
      • Maturity Index
        – Measure of the maturity of HLT components in a language
        – Considers the maturity stage of an item against the relative importance of each maturity stage
        – MaturityInd = Σ(1·UD + 2·AV + 4·BV + 8·RV) / Σ(weights of maturity stages)
  17. Results: Maturity Index
      [Bar chart: Maturity Index per language. Y-axis: 0 to 40. X-axis: Afr, SAE, Zul, Xho, Sts, Sep, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  18. Results: Accessibility Index
      • Accessibility stages:
        – Unspecified (UN), Not available (NA) (proprietary or contract R&D), Research and education (RE), Available for commercial purposes (CO), Available for commercial purposes and R&E (CRE)
      • Accessibility Index
        – Measure of the accessibility of HLT components in a language
        – Considers the accessibility stage of an item against the relative importance of each accessibility stage
        – AccessInd = Σ(1·UN + 2·NA + 4·RE + 8·CO + 12·CRE) / Σ(weights of accessibility stages)
  19. Results: Accessibility Index
      [Bar chart: Accessibility Index per language. Y-axis: 0 to 40. X-axis: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  20. Results: HLT Language Index
      • Impressionistic index that relatively ranks languages based on the total quantity of HLT activity per language
      • Considers the stage of maturity and accessibility of all the HLT components
      • HLT Language Index = Maturity Index (per language, all components) + Accessibility Index
  21. Results: HLT Language Index
      [Bar chart: HLT Language Index per language. Y-axis: 0 to 80. X-axis: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  22. Results: HLT Component Indexes
      • Alternative perspective:
        – Quantity of activity taking place within each of the data, modules, and applications at an HLT component grouping level (e.g. pronunciation resources)
  23. Results: HLT Component Indexes: Modules
  24. Results: HLT Detailed Inventory
      Legend:
      • Item exists, is accessible, released & of fairly adequate quality
      • Item may exist but is available for restricted use, or is not released or of limited quality
      • Items do not exist
      • ‘–’: Category is not applicable to the language
  25. Results: Gap Analysis (speech)
      Legend:
      • Item exists, is accessible, released & of fairly adequate quality
      • Item may exist but is available for restricted use, or is not released or of limited quality
      • Items do not exist
      • ‘–’: Category is not applicable to the language
  26. Results: SAHLTA Outcomes
      • A SAHLTA online database of LRs and applications (alpha): www.meraka.org.za/nhnaudit
  27. Results: SAHLTA Outcomes
  28. Conclusion: Summary
      • Few resources available, of basic nature
      • Several factors influence this:
        – HLT expert knowledge and interests
        – Availability of data resources
        – Market needs of a language
        – Relatedness to other world languages
  29. Conclusion: Recommendations
      • Further resource development based on gap analysis
        – Also of more advanced LRs
      • Availability and distribution of existing LRs
        – To enable usage, licensing agreements need to be in place
      • Funding: support by government in formative years
        – Also industry stimulation programmes (e.g. support for R&D consortia)
      • Collaborations: across SA and internationally, also based on gap analysis
      • Human capital development (HCD): scientific & technical, cross silos of academic disciplines, especially for lesser-resourced languages
  30. Conclusion: Acknowledgments
      • DST: project sponsorship
      • Prof Sonja Bosch & Prof Laurette Pretorius: results of the 2008 BLaRK survey
      • Audit mini-workshop contributors
        – Prof. Danie Prinsloo (UP), Prof. Sonja Bosch (UNISA), Mr. Martin Puttkammer (NWU), Prof. Gerhard van Huyssteen (CSIR), Prof. Etienne Barnard (CSIR), Dr. Febe de Wet (US), Dr. Marelie Davel (CSIR)
      • Numerous audit participants
      • Various HLT RG members: guidance and support
      www.meraka.org.za/nhnaudit
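
The Maturity Index, Accessibility Index and HLT Language Index formulas on the Results slides can be illustrated with a minimal sketch. It assumes that UD, AV, BV, RV (and UN, NA, RE, CO, CRE) stand for the number of items a language has at each stage, and that the denominator is the plain sum of the stage weights; the slides do not spell out either point. The function names and the example counts are hypothetical and are not audit data.

```python
# Stage weights as given on the Maturity Index and Accessibility Index slides.
MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}
ACCESSIBILITY_WEIGHTS = {"UN": 1, "NA": 2, "RE": 4, "CO": 8, "CRE": 12}


def weighted_index(stage_counts, weights):
    """Weighted sum of item counts per stage, divided by the sum of the weights.

    Assumption: stage_counts maps a stage code to the number of items
    a language has at that stage.
    """
    unknown = set(stage_counts) - set(weights)
    if unknown:
        raise ValueError(f"unknown stage(s): {sorted(unknown)}")
    weighted_sum = sum(weights[stage] * count for stage, count in stage_counts.items())
    return weighted_sum / sum(weights.values())


def hlt_language_index(maturity_counts, accessibility_counts):
    """Maturity Index plus Accessibility Index for one language."""
    return (weighted_index(maturity_counts, MATURITY_WEIGHTS)
            + weighted_index(accessibility_counts, ACCESSIBILITY_WEIGHTS))


# Hypothetical counts for a single language (illustration only, not audit data).
example_maturity = {"UD": 3, "AV": 2, "BV": 1, "RV": 4}
example_access = {"UN": 1, "NA": 2, "RE": 4, "CO": 1, "CRE": 2}

print(f"Maturity Index:      {weighted_index(example_maturity, MATURITY_WEIGHTS):.2f}")
print(f"Accessibility Index: {weighted_index(example_access, ACCESSIBILITY_WEIGHTS):.2f}")
print(f"HLT Language Index:  {hlt_language_index(example_maturity, example_access):.2f}")
```

Because the weights double from stage to stage (with CRE at 12), a single released or commercially available item counts for more than several items still under development, which matches the slides' description of weighing each stage by its relative importance.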
