Processing Parallel Text Corpora for Three South African Language Pairs in the Autshumato Project
18 May 2010

Overview
  • Introduction
  • Autshumato Project
  • Text Anonymisation
  • Processing of Data
  • Future Work

Introduction I
  • Diversity in South Africa: culture, religious beliefs, languages (11 official languages)
  • SA government aims to provide access to information in all official languages
  • Large volumes of translation work
  • Difficult task
  • Government translation agencies cannot keep up with the vast quantities of translation work
  • Machine-aided translation tools are not available for SA languages
  • CAT tools are not widely used because of high licensing costs

Introduction II
  • The consequence is that English often acts as lingua franca
  • English is the home language of only 8.2% of the South African population
  • The remaining SA languages are further marginalised
  • SA citizens are deprived of their constitutional right of access to information in their language of choice
  • Innovative solutions are required to overcome this problem
  • Human Language Technology (HLT)
  • SA government supports HLT projects

Autshumato Project: Introduction
  • Development of open source machine-aided translation tools and resources for SA languages
  • CAT software: Autshumato Integrated Translation Environment (ITE)
  • Terminology Management
  • Machine Translation (MT): English – isiZulu/Afrikaans/Sesotho sa Leboa

Autshumato Project: MT I
  • Hybrid approach: combining Statistical Machine Translation (SMT) with language-specific rules based on linguistic knowledge
  • Performance of SMT depends on the amount and quality of parallel text corpora available
  • Limited parallel corpora are available for SA languages
  • Obtaining and processing parallel data to develop MT systems is the central theme of the research presented here

Autshumato Project: MT II
Why use SMT?
  • SMT is currently the preferred approach of numerous industrial and academic research laboratories
  • State-of-the-art open source SMT toolkits are readily available
  • Less expert linguistic knowledge is required to create a working baseline system than for rule-based systems

Autshumato Project: Data Providers
  • Amount of parallel corpora for SA languages is limited
  • Limited government sources
  • Unavailability of parallel text corpora
  • CAT software suites are not widely used, with the result that translation memories are not readily available
  • Lack of publications (e.g. books, newspapers, magazines and websites) in the indigenous South African languages
  • Lack of sound document management practices, which makes it difficult to obtain parallel documents from translators
  • Unwillingness of translators and private companies to make their data available for purposes of machine translation research

Text Anonymisation: Introduction
  • Publishers and translators are not eager to make their data available for purposes of MT research and development
  • Reason: confidential information
  • Text anonymisation software is developed to overcome this problem
  • Text anonymisation is a subcategory of named-entity recognition that focuses on identifying and hiding confidential information

Text Anonymisation: Method I
  • Entities conveying confidential information cannot merely be removed from parallel corpora
  • They contain syntactic and contextual information that is utilised by SMT
  • They are replaced by randomly selected entities from the same category
  • Original: Mr. Tito Mboweni was awarded a merit bonus of R500,000.
  • Anonymised: Mr. Peter Steyn was awarded a merit bonus of R20,000.
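
The replacement step on this slide can be sketched as follows. This is only an illustration, not the project's implementation: the category labels, the replacement pools in REPLACEMENTS and the function name anonymise are assumptions, and the entity spans are supplied by hand rather than detected.

```python
import random

# Hypothetical replacement pools; the slides do not list the project's real ones.
REPLACEMENTS = {
    "PERSON": ["Peter Steyn", "Thandi Nkosi", "Anna Botha"],
    "AMOUNT": ["R20,000", "R7,500", "R130,000"],
}

def anonymise(text, entities, rng):
    """Replace each (start, end, category) span with a random entity
    from the same category, leaving the rest of the sentence intact."""
    out = []
    cursor = 0
    for start, end, category in sorted(entities):
        out.append(text[cursor:start])
        original = text[start:end]
        # Never pick the original string itself as its own replacement.
        choices = [c for c in REPLACEMENTS[category] if c != original]
        out.append(rng.choice(choices))
        cursor = end
    out.append(text[cursor:])
    return "".join(out)

sentence = "Mr. Tito Mboweni was awarded a merit bonus of R500,000."
name, amount = "Tito Mboweni", "R500,000"
spans = [
    (sentence.index(name), sentence.index(name) + len(name), "PERSON"),
    (sentence.index(amount), sentence.index(amount) + len(amount), "AMOUNT"),
]
result = anonymise(sentence, spans, random.Random(0))
```

Because the entity is swapped rather than deleted, the sentence keeps its structure, which is the property the slide says SMT relies on.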

Text Anonymisation: Method II
  • A rule-based approach is followed: gazetteers, regular expressions and simple context rules
  • We aim to make the anonymiser as language independent as possible
  • Three basic steps:
    1. Entities with a predictable form are identified with regular expressions
    2. All words that appear in the gazetteers are marked
    3. Context rules are applied to find entities that do not appear in the gazetteers

Text Anonymisation: Method III
Regular expressions
  • Identify entities with a predictable form (e.g. e-mail addresses, URLs, telephone numbers)
  • Several different forms of entities such as dates are recognised:
    1978-02-16
    16/02/1978
    16 Feberware 1978
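
A minimal sketch of the regular-expression step, assuming Python's re module. The patterns, the MONTHS list and the find_entities helper are illustrative assumptions; the slides do not show the project's actual expressions, and a real month list would cover every month in every target language.

```python
import re

# Illustrative month names only, including the form shown on the slide.
MONTHS = ["January", "February", "Feberware"]

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "DATE": re.compile(
        r"\b\d{4}-\d{2}-\d{2}\b"            # 1978-02-16
        r"|\b\d{1,2}/\d{1,2}/\d{4}\b"       # 16/02/1978
        r"|\b\d{1,2} (?:" + "|".join(MONTHS) + r") \d{4}\b"  # 16 Feberware 1978
    ),
}

def find_entities(text):
    """Return (start, end, category, matched_text) tuples in text order."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), category, match.group()))
    return sorted(found)

sample = "Born 1978-02-16, or 16/02/1978, or 16 Feberware 1978; mail a.b@example.org."
dates = [e[3] for e in find_entities(sample) if e[2] == "DATE"]
emails = [e[3] for e in find_entities(sample) if e[2] == "EMAIL"]
```

Keeping the month names in a separate list is one way to keep the patterns themselves language independent.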

Text Anonymisation: Method IV
Gazetteers
  • Compiled from various resources
    • 8,853 first names
    • 81,711 surnames
    • Company, organisation and product names
  • Several of the entries in the gazetteers are also valid words when not used in the “first name or surname sense”
    • Ke na le khumo means “I have wealth”
    • Khumo is also a common first name
  • Entries such as Khumo were removed from the gazetteers by comparing them to lexica of valid lower case words
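
The gazetteer clean-up described on this slide amounts to a set filter. The sketch below assumes the gazetteer and the lexicon are held as plain Python sets; the function name clean_gazetteer and the tiny sample data are hypothetical.

```python
def clean_gazetteer(names, lowercase_lexicon):
    """Drop every gazetteer entry whose lower-cased form is an ordinary word."""
    return {name for name in names if name.lower() not in lowercase_lexicon}

# Toy data for illustration; the real gazetteers hold thousands of entries.
first_names = {"Khumo", "Tito", "Peter"}
lexicon = {"ke", "na", "le", "khumo"}

cleaned = clean_gazetteer(first_names, lexicon)
```

The trade-off is that a genuine occurrence of Khumo as a name is no longer caught by the gazetteer lookup and has to be picked up by other means, such as context rules.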

Text Anonymisation: Method V
Context rules
  • Applied to identify entities that do not appear in the gazetteers
  • E.g. a word starting with a capital letter, following a word that has been tagged as a first name, is considered to be a surname if that word does not appear in the lowercase lexicon
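
The example rule on this slide can be written as a single pass over tagged tokens. This is a sketch under assumptions: the tag names "FIRST_NAME", "SURNAME" and "O" (untagged), the token representation and the function name tag_surnames are all illustrative.

```python
def tag_surnames(tokens, tags, lowercase_lexicon):
    """Tag a capitalised, untagged word as a surname when it directly
    follows a first name and is not an ordinary lower-case word."""
    tags = list(tags)  # work on a copy so the caller's list is untouched
    for i in range(1, len(tokens)):
        word = tokens[i]
        if (
            tags[i - 1] == "FIRST_NAME"
            and tags[i] == "O"
            and word[:1].isupper()
            and word.lower() not in lowercase_lexicon
        ):
            tags[i] = "SURNAME"
    return tags

tokens = ["Mr.", "Tito", "Mboweni", "was", "awarded"]
tags = ["O", "FIRST_NAME", "O", "O", "O"]
lexicon = {"was", "awarded"}

new_tags = tag_surnames(tokens, tags, lexicon)
```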

Text Anonymisation: Results

Text Anonymisation: Post-processing
After anonymisation the corpora are
  • Sentencised, based on language-specific rules and abbreviation lists
  • Aligned with Microsoft’s bilingual sentence aligner (Moore, 2002)
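
To illustrate why abbreviation lists matter for sentencisation, the sketch below splits on sentence-final punctuation unless the token is a known abbreviation. It covers only the abbreviation-list part; the language-specific rules are not described in the slides, and the ABBREVIATIONS set and sentencise function are assumptions.

```python
# Illustrative abbreviation list; a real one would be compiled per language.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g."}

def sentencise(text, abbreviations=ABBREVIATIONS):
    """Split whitespace-separated text into sentences, treating ., ! and ?
    as boundaries except after a listed abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    if current:  # keep trailing text that lacks final punctuation
        sentences.append(" ".join(current))
    return sentences

units = sentencise("Mr. Steyn was paid. Dr. Botha agreed.")
```

Without the abbreviation check, "Mr." and "Dr." would each end a sentence and the resulting fragments would be harder to align across languages.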

Text Anonymisation
Number of aligned text units:

Future Work
Improvement of the anonymisation system
  • Expanding gazetteers
  • “Cleaning” the gazetteers by removing ambiguous words
  • Adding more context rules and refining existing rules
  • Implementing machine learning techniques
