Processing Parallel Text Corpora for Three South African Language Pairs in the Autshumato Project 18 May 2010
© Handré Groenewald & Liza du Plooy


  1. Processing Parallel Text Corpora for Three South African Language Pairs in the Autshumato Project 18 May 2010
  2. Overview
       • Introduction
       • Autshumato Project
       • Text Anonymisation
       • Processing of Data
       • Future Work
  3. Introduction I
       • Diversity in South Africa
           • Culture
           • Religious beliefs
           • Languages (11 official languages)
       • SA government aims to provide access to information in all official languages
           • Large volumes of translation work
           • Difficult task
       • Government translation agencies cannot keep up with the vast quantities of translation work
           • Machine-aided translation tools not available for SA languages
           • CAT tools not widely used (high licensing costs)
  4. Introduction II
       • Consequence is that English often acts as a lingua franca
           • English is the home language of only 8.2% of the South African population
           • Remaining SA languages are further marginalised
           • SA citizens are deprived of their constitutional right of access to information in their language of choice
       • Innovative solutions are required to overcome this problem
           • Human Language Technology (HLT)
       • SA government supports HLT projects
  5. Autshumato Project: Introduction
       • Development of open source machine-aided translation tools and resources for SA languages
           • CAT software
               • Autshumato Integrated Translation Environment (ITE)
               • Terminology Management
           • Machine Translation (MT)
               • English – isiZulu/Afrikaans/Sesotho sa Leboa
  6. Autshumato Project: MT I
       • Hybrid approach
           • Combining Statistical Machine Translation (SMT) with language-specific rules based on linguistic knowledge
       • Performance of SMT depends on the amount and quality of parallel text corpora available
           • Limited parallel corpora available for SA languages
       • Obtaining and processing parallel data to develop MT systems is the central theme of the research presented here
  7. Autshumato Project: MT II
       • Why use SMT?
           • SMT is currently the preferred approach of numerous industrial and academic research laboratories
           • State-of-the-art open source SMT toolkits are readily available
           • Less expert linguistic knowledge is required to create a working baseline system than for rule-based systems
  8. Autshumato Project: Data Providers
       • Amount of parallel corpora for SA languages is limited
       • Limited government sources
       • Unavailability of parallel text corpora
           • CAT software suites are not widely used, with the result that translation memories are not readily available
           • Lack of publications (e.g. books, newspapers, magazines and websites) in the indigenous South African languages
           • Lack of sound document management practices, which makes it difficult to obtain parallel documents from translators
           • Unwillingness of translators and private companies to make their data available for purposes of machine translation research
  9. Text Anonymisation: Introduction
       • Publishers and translators are not eager to make their data available for purposes of MT research and development
           • Reason: confidential information
       • Text anonymisation software is developed to overcome this problem
       • Text anonymisation
           • Subcategory of named-entity recognition
           • Focuses on identifying and hiding confidential information
  10. Text Anonymisation: Method I
       • Entities conveying confidential information cannot merely be removed from parallel corpora
           • They contain syntactic and contextual information that is utilised by SMT
       • Instead, they are replaced by randomly selected entities from the same category
           • Original: Mr. Tito Mboweni was awarded a merit bonus of R500,000 .
           • Anonymised: Mr. Peter Steyn was awarded a merit bonus of R20,000 .
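The replacement strategy on this slide can be illustrated with a minimal Python sketch. It assumes the entities have already been tagged with a category; the category names, the replacement pools and the function name `anonymise` are hypothetical, since the slides do not describe the actual implementation.

```python
import random

# Hypothetical replacement pools; the real system's lists are not given in the slides.
REPLACEMENTS = {
    "first_name": ["Peter", "Anna", "Thabo"],
    "surname": ["Steyn", "Nkosi", "Botha"],
    "amount": ["R20,000", "R35,500", "R1,200"],
}


def anonymise(tokens, rng=None):
    """Swap each tagged token for a random entity of the same category.

    `tokens` is a list of (text, category) pairs. Ordinary words carry the
    category None and are passed through unchanged, so the sentence keeps
    its length and syntactic shape.
    """
    if rng is None:
        rng = random.Random()
    out = []
    for text, category in tokens:
        if category is None:
            out.append(text)
        else:
            out.append(rng.choice(REPLACEMENTS[category]))
    return " ".join(out)


tagged = [
    ("Mr.", None), ("Tito", "first_name"), ("Mboweni", "surname"),
    ("was", None), ("awarded", None), ("a", None), ("merit", None),
    ("bonus", None), ("of", None), ("R500,000", "amount"), (".", None),
]
print(anonymise(tagged))
```

Because every entity is swapped for another of the same type rather than deleted, the token count and the surrounding context stay intact, which is the property the slide says SMT training relies on.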
  11. Text Anonymisation: Method II
       • Rule-based approach is followed
           • Gazetteers
           • Regular expressions
           • Simple context rules
       • We aim to make the anonymiser as language independent as possible
       • Three basic steps
           • Entities with a predictable form are identified with regular expressions
           • All words that appear in the gazetteers are marked
           • Context rules are applied to find entities that do not appear in the gazetteers
  12. Text Anonymisation: Method III
       • Regular expressions
           • Identify entities with a predictable form (e.g. e-mail addresses, URLs and telephone numbers)
           • Several different forms of entities such as dates are recognised
               • 1978-02-16
               • 16/02/1978
               • 16 Feberware 1978
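A rough Python sketch of this step, covering the three date forms shown on the slide plus e-mail addresses. The patterns, the short `MONTHS` list and the function name `find_entities` are illustrative assumptions, not the project's actual expressions; a real system would need full month lists for each language.

```python
import re

# Illustrative subset of month names; a real list would cover every month
# in every language being processed.
MONTHS = ["January", "February", "Februarie", "Feberware"]

DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",                           # 1978-02-16
    r"\b\d{1,2}/\d{1,2}/\d{4}\b",                       # 16/02/1978
    r"\b\d{1,2} (?:" + "|".join(MONTHS) + r") \d{4}\b",  # 16 Feberware 1978
]
DATE_RE = re.compile("|".join(DATE_PATTERNS))

# Deliberately simple e-mail pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_entities(text):
    """Return (category, matched_text) pairs for dates and e-mail addresses."""
    found = [("date", m.group()) for m in DATE_RE.finditer(text)]
    found += [("email", m.group()) for m in EMAIL_RE.finditer(text)]
    return found


sample = "Born 1978-02-16 (16/02/1978, 16 Feberware 1978); mail info@example.org."
for category, value in find_entities(sample):
    print(category, value)
```

Keeping the month names in a separate list is one way to keep the patterns themselves language independent, in line with the aim stated on the previous slide.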
  13. Text Anonymisation: Method IV
       • Gazetteers
           • Compiled from various resources
           • 8,853 first names
           • 81,711 surnames
           • Company, organisation and product names
       • Several of the entries in the gazetteers are also valid words when not used in the “first name or surname sense”
           • Ke na le khumo means “I have wealth”
           • Khumo is also a common first name
       • Entries such as Khumo were removed from the gazetteers by comparing them to lexica of valid lower case words
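The gazetteer clean-up described on this slide amounts to a set difference against a lowercase lexicon. A minimal sketch in Python follows; the function name `clean_gazetteer` and the tiny word lists are hypothetical.

```python
def clean_gazetteer(names, lowercase_lexicon):
    """Drop gazetteer entries that are also ordinary lowercase words."""
    lexicon = {word.lower() for word in lowercase_lexicon}
    return {name for name in names if name.lower() not in lexicon}


first_names = {"Khumo", "Peter", "Tito"}
lexicon = {"ke", "na", "le", "khumo"}  # toy lexicon for illustration only
print(sorted(clean_gazetteer(first_names, lexicon)))
```

The trade-off is that genuine occurrences of a removed name such as Khumo are no longer caught by the gazetteer lookup and must be picked up by other means, such as the context rules.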
  14. Text Anonymisation: Method V
       • Context rules
           • Applied to identify entities that do not appear in the gazetteers
           • E.g. a word starting with a capital letter, following a word that has been tagged as a first name, is considered to be a surname if that word does not appear in the lowercase lexicon
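The example rule on this slide could be sketched in Python as below. The tag labels, the parallel-list representation and the function name `tag_surnames` are assumptions made for illustration; the slides do not show how the rules are encoded.

```python
def tag_surnames(tokens, tags, lowercase_lexicon):
    """Apply the context rule: capitalised word after a first name -> surname.

    `tokens` and `tags` are parallel lists; a tag is "first_name", "surname"
    or None for untagged words. A new tag list is returned.
    """
    out = list(tags)
    for i in range(1, len(tokens)):
        word = tokens[i]
        if (
            out[i] is None                          # not tagged by an earlier step
            and out[i - 1] == "first_name"          # previous word is a first name
            and word[:1].isupper()                  # starts with a capital letter
            and word.lower() not in lowercase_lexicon  # not an ordinary word
        ):
            out[i] = "surname"
    return out


tokens = ["Mr.", "Tito", "Mboweni", "was", "awarded"]
tags = [None, "first_name", None, None, None]
print(tag_surnames(tokens, tags, {"was", "awarded"}))
```

The lexicon check guards against tagging a capitalised ordinary word that happens to follow a first name.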
  15. Text Anonymisation: Results
  16. Text Anonymisation: Post-processing
       • After anonymisation the corpora are
           • Sentencised
               • based on language-specific rules and abbreviation lists
           • Aligned
               • Microsoft’s bilingual sentence aligner (Moore, 2002)
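As a rough idea of what sentencisation with an abbreviation list involves, here is a small Python sketch. It is a simplified stand-in rather than the project's sentenciser: the abbreviation list and the function name `split_sentences` are hypothetical, and real language-specific rules would handle many more cases (quotes, numbers, initials).

```python
# Hypothetical abbreviation list; each language would need its own.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "prof.", "e.g.", "i.e."}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on sentence-final punctuation unless the token is an abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep any trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences


text = "Mr. Steyn was awarded a bonus. He thanked Dr. Nkosi."
for sentence in split_sentences(text):
    print(sentence)
```

The resulting sentence lists for each language would then be passed to the sentence aligner; that step is not sketched here.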
  17. Text Anonymisation
       • Number of aligned text units:
  18. Future Work
       • Improvement of the anonymisation system
           • Expanding gazetteers
           • “Cleaning” the gazetteers by removing ambiguous words
           • Adding more context rules and refining existing rules
           • Implementing machine learning techniques
