An HLT profile of the official South African languages

© Aditi Sharma Grover, Gerhard B van Huyssteen, Marthinus W. Pretorius

An HLT profile of the official South African languages Presentation Transcript

  • 1. An HLT profile of the official South African languages Aditi Sharma Grover (1,2), Gerhard B van Huyssteen (1,3) & Marthinus W. Pretorius (2) (1) HLT Research Group, CSIR, South Africa (2) Graduate School of Technology Management, University of Pretoria, South Africa (3) Centre for Text Technology (CTexT), North-West University, South Africa
  • 2. Overview • Background • Process • Results • Conclusion
  • 3. Background South African HLT landscape • 11 official languages • HLT community – R&D community (universities & science councils) – Very few private sector companies • Various government initiatives – DST: HLT road-mapping process, NHN – DAC: HLT strategy, National Centre for HLT – NRF: research funding
  • 4. Background Challenge • SA has not yet capitalised on opportunities to create a thriving HLT industry – Lack of awareness within the local HLT community • Perpetuated by perceived fragmentation of South African R&D activities – Lack of a unified technological profile of HLT activities across the 11 languages • 2009: a technology audit for the South African HLT landscape (SAHLTA) – Align R&D activities and stimulate cooperation – Similar to Dutch (BLaRK), EuroMap
  • 5. Process SAHLTA Process [Process diagram; box labels: Inventory criteria, Cursory inventory, Audit workshop, Terminology, Questionnaire]
  • 6. Process SAHLTA Process Phase 1 [Process diagram] • Establish lingua franca • Consolidate prior knowledge regarding data, modules, applications, and platforms/tools
  • 7. Process SAHLTA Process Phase 2 [Process diagram] • Priorities
  • 8. Results Preliminary HLT Priorities Prioritisation • Based on international trends, local needs, and feasibility • Priority 1: Basic & robust core HLT technology applications, modules and data • Priority 2, 3: LRs that further enhance and complement core LRs (priority 1), and base their development on a strong foundation of core HLT LRs – Many advanced HLT applications are priority 2, 3 • Verification by larger SA HLT community – Need to be updated regularly
  • 9. Results Preliminary HLT Priorities Priority 1: Applications Speech: • Accessibility • Telephony applications • Computer-assisted language learning • Voice search • Audio management Text: • Proofing tools • Information Extraction • Information Retrieval • Human-aided machine translation • Machine-aided human translation
  • 10. Results Preliminary HLT Priorities Priority 2: Applications Speech: • Access control • Embedded speech recognition • Speaking devices • Computer-assisted training Text: • OCR/ICR • Multilingual comprehension assistants • CALL • Authorship identification
  • 11. Results Preliminary HLT Priorities Priority 3: Applications Speech: • Transcription/dictation • Multimodal information access • Command & Control • Announcement systems • Audio books • S2S translation Text: • Text generation • Document classification • Summarisation • QA • Dialogue systems • Reference works
  • 12. Results Preliminary HLT Priorities Priority 1: Modules Speech: • Complete ASR • Non-native ASR • Complete TTS • Confidence measures • Speaker ID • Diarisation • Language/dialect ID Text: • G2P • Text pre-processing • Normalisation • Morphological analysis • POS tagging • Chunking • WSD • Language ID
  • 13. Results Preliminary HLT Priorities Priority 1: Data Speech: • Annotated monolingual corpora • Domain-/Application-specific corpora • Test suites and corpora • Pronunciation resources (e.g. phone sets, dictionaries, etc.) Text: • Monolingual corpora • Multilingual corpora • Test suites and corpora • Lexica (incl. named-entity lists) • Domain-/Application-specific corpora
  • 14. Process SAHLTA Process Phase 3 [Process diagram] • Indexes • Detailed inventory • Gap analysis
  • 15. Process Response rate
  • 16. Results Maturity Index • Maturity stages: – Under development (UD), Alpha version (AV), Beta version (BV), Released (RV) • Maturity Index – Measure of the maturity of HLT components in a language – Considers the maturity stage of an item against the relative importance of each maturity stage – MaturityInd = Σ(1·UD + 2·AV + 4·BV + 8·RV) / Σ(weights of maturity stages)
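The Maturity Index formula on slide 16 can be illustrated with a minimal Python sketch. It rests on two assumptions that the slide does not spell out: UD, AV, BV and RV are read as the number of components at each stage, and the denominator is taken to be the sum of the four stage weights (1 + 2 + 4 + 8 = 15). The function name and the example counts are hypothetical, not audit data.

```python
# Stage weights as given in the slide's formula.
# Assumption: UD/AV/BV/RV are counts of components at each maturity stage.
MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}


def maturity_index(counts):
    """Weighted sum of component counts divided by the sum of the stage weights."""
    unknown = set(counts) - set(MATURITY_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown maturity stage(s): {sorted(unknown)}")
    weighted = sum(MATURITY_WEIGHTS[stage] * n for stage, n in counts.items())
    return weighted / sum(MATURITY_WEIGHTS.values())


# Hypothetical counts for a single language (illustration only).
example_counts = {"UD": 3, "AV": 1, "BV": 2, "RV": 5}
print(maturity_index(example_counts))
```

Under this reading, released components dominate the score: one released item contributes as much as eight items still under development.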
  • 17. Results Maturity Index [Bar chart: Maturity Index (axis 0 to 40) per language: Afr, SAE, Zul, Xho, Sts, Sep, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  • 18. Results Accessibility Index • Accessibility stages: – Unspecified (UN), Not available (NA) (proprietary or contract R&D), Research and education (RE), Available for commercial purposes (CO), Available for commercial purposes and R&E (CRE) • Accessibility Index – Measure of the accessibility of HLT components in a language – Considers the accessibility stage of an item against the relative importance of each accessibility stage – AccessInd = Σ(1·UN + 2·NA + 4·RE + 8·CO + 12·CRE) / Σ(weights of accessibility stages)
  • 19. Results Accessibility Index [Bar chart: Accessibility Index (axis 0 to 40) per language: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  • 20. Results HLT Language Index • Impressionistic index that relatively ranks languages based on the total quantity of HLT activity per language • Considers the stage of maturity and accessibility of all the HLT components • HLT Language Index = Maturity Index (per language, all components) + Accessibility Index
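As a rough illustration of how the HLT Language Index could be used to rank languages, the sketch below adds a maturity score and an accessibility score per language. The stage weights come from the formulas on slides 16 and 18. Everything else is an assumption: stage labels are treated as component counts, each index is divided by the sum of its stage weights, and the language names and counts are invented placeholders rather than audit results.

```python
# Stage weights from the slides' formulas.
MATURITY_WEIGHTS = {"UD": 1, "AV": 2, "BV": 4, "RV": 8}
ACCESS_WEIGHTS = {"UN": 1, "NA": 2, "RE": 4, "CO": 8, "CRE": 12}


def weighted_index(counts, weights):
    """Weighted sum of per-stage counts divided by the sum of the weights (assumed reading)."""
    weighted = sum(weights[stage] * n for stage, n in counts.items())
    return weighted / sum(weights.values())


def hlt_language_index(maturity_counts, access_counts):
    """HLT Language Index = Maturity Index + Accessibility Index."""
    return (weighted_index(maturity_counts, MATURITY_WEIGHTS)
            + weighted_index(access_counts, ACCESS_WEIGHTS))


# Hypothetical inventory: language -> (maturity counts, accessibility counts).
inventory = {
    "LangA": ({"UD": 2, "AV": 1, "BV": 1, "RV": 6},
              {"UN": 1, "NA": 2, "RE": 4, "CO": 0, "CRE": 3}),
    "LangB": ({"UD": 1, "AV": 0, "BV": 1, "RV": 2},
              {"UN": 2, "NA": 1, "RE": 1, "CO": 0, "CRE": 0}),
}

# Rank languages from most to least HLT activity under this index.
ranking = sorted(inventory,
                 key=lambda lang: hlt_language_index(*inventory[lang]),
                 reverse=True)
print(ranking)
```

Because the index is described as impressionistic, a ranking like this is best read as a relative ordering, not as a measurement on an absolute scale.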
  • 21. Results HLT Language Index [Bar chart: HLT Language Index (axis 0 to 80) per language: Afr, SAE, Zul, Xho, Sep, Sts, Ses, Tsv, Ssw, Ndb, Xit, L.I.]
  • 22. Results HLT Component Indexes • Alternative perspective: quantity of activity taking place within each of the data, modules, and applications categories, at the level of an HLT component grouping (e.g. pronunciation resources)
  • 23. Results HLT Component Indexes: Modules
  • 24. Results HLT Detailed Inventory Legend (the symbols for the first three entries are missing from this transcript): • Item exists, is accessible, released & of fairly adequate quality • Item may exist but is available for restricted use only, or is not released/of limited quality • Item does not exist • ‘–’: Category is not applicable to the language
  • 25. Results Gap Analysis (speech) Legend (the symbols for the first three entries are missing from this transcript): • Item exists, is accessible, released & of fairly adequate quality • Item may exist but is available for restricted use only, or is not released/of limited quality • Item does not exist • ‘–’: Category is not applicable to the language
  • 26. Results SAHLTA Outcomes • A SAHLTA online database of LRs and applications (alpha) www.meraka.org.za/nhnaudit
  • 27. Results SAHLTA Outcomes
  • 28. Conclusion Summary • Few resources available, of basic nature • Several factors influence this: – HLT expert knowledge and interests – Availability of data resources – Market needs of a language – Relatedness to other world languages
  • 29. Conclusion Recommendations • Further resource development based on gap analysis – Also of more advanced LRs • Availability and distribution of existing LRs – To enable usage, licensing agreements need to be in place • Funding: support by government in formative years – Also industry stimulation programmes (e.g. support for R&D consortia) • Collaborations: across SA and internationally, also based on gap analysis • Human capital development (HCD): scientific & technical, across the silos of academic disciplines, especially for lesser-resourced languages
  • 30. Conclusion Acknowledgments • DST – project sponsorship • Prof Sonja Bosch & Prof Laurette Pretorius – results of the 2008 BLaRK survey • Audit mini-workshop contributors – Prof. Danie Prinsloo (UP), Prof. Sonja Bosch (UNISA), Mr. Martin Puttkammer (NWU), Prof. Gerhard van Huyssteen (CSIR), Prof. Etienne Barnard (CSIR), Dr. Febe de Wet (US), Dr. Marelie Davel (CSIR) • Numerous audit participants • Various HLT RG members – guidance and support www.meraka.org.za/nhnaudit