SlideShare a Scribd company logo
1 of 1
Download to read offline
Bigorna
                                                A Toolkit for Orthography Migration Challenges
                                                            Jos´ Jo˜o Almeida, Andr´ Santos, Alberto Sim˜es
                                                               e a                 e                    o


 Abstract                                                                                         Conversion examples
   Languages are born, evolve and, eventually, die. During this evolution their spelling          $ pt2ptao
 rules (and sometimes the syntactic and semantic ones) change, putting old documents              A adop¸ao do acordo implica a actualiza¸ao de ferramentas.
                                                                                                        c~                               c~
 out of use. In Portugal, a pair of political agreements with Brazil forced relevant                            ↓
 changes on the way the Portuguese language is written, the most recent one being the             A ado¸~o do acordo implica a atualiza¸ao de ferramentas.
                                                                                                       ca                              c~
 Portuguese Language Orthographic Agreement (PLOA) signed in 1990.
 Bigorna is a toolkit for the classification of language variants, their comparison and
 the conversion of texts in different language versions. As Bigorna relies on a set                Conversion examples
 of conversion rules we will also discuss how to infer conversion rules from a set of             $ br2brao
 documents (texts with different ages).                                                            Ele fez um v^o rasante sobre a ar´ia.
                                                                                                               o                   e
                                                                                                                ↓
 Contents                                                                                         Ele fez um voo rasante sobre a areia.

 1 Compiling OAKB                                                                          1

 2 Updating the dictionary vocabulary                                                      1
                                                                                                 4      Variant classifier tool
                                                                                                   After creating the language converters became clear the need to have a language classifier,
 3 Language conversion tools                                                               1     capable of detecting the variant of Portuguese in which a given text was written, allowing to
                                                                                                 automatically differentiate texts (possibly to further conversion).
 4 Variant classifier tool                                                                  1
                                                                                                   To build this tool two lists were generated: one with European Portuguese-only words and
 5 Lexical comparison tools                                                                1     another with Brazilian Portuguese ones.
                                                                                                  Calculating the lists
                                                                                                  Function calc_whichpt_lsts(dicpt,dicbr,oakb)
                                                                                                    for ( x ∈ dom(oakb)
                                                                                                       ∧ oakb[x].type = (normal or accent)
                                                                                                       ∧ oakb[x].pt_pt = oakb[x].pt_br
                                                                                                       ∧ oakb[x].pt_pt ∈ dom(dicpt)
                                                                                                       ∧ oakb[x].pt_br ∈ dom(dicbr))
                                                                                                     {
                                                                                                      wpt ← oakb[x].pt_pt
                                                                                                      wbr ← oakb[x].pt_br
                                                                                                      justpt ← justpt ∪ {x ∈ deriv(wpt,dicpt)| x ∈ dicbr}
                                                                                                                                                 /
                                                                                                      justbr ← justbr ∪ {x ∈ deriv(wbr,dicbr)| x ∈ dicpt}
                                                                                                                                                 /
                                                                                                     }

                                                                                                  Language classifier definition
                                                                                                  Function classify_pt(text)
                                                                                                    for ( x ← text )
                                                                                                       if ( x ∈ justpt ) PTcount++
                                                                                                       if ( x ∈ justbr ) BRcount++
1      Compiling OAKB                                                                                compare ( PTcount, BRcount)
  A table containing all the information about the word changes. This table was built based
on previously existing resources and proved to be crucial to the subsequent tasks performed.
 OAKB structure
                                                                                                 5      Lexical comparison tools
                                                                                                   There are other situations where there are no available lists of words, only documents with
 oakb = entry*
                                                                                                 different orthographic versions.
 entry =
                                                                                                   lexdiff, is able to compare two versions of a text with different spelling and detect (lin-
   pt_pt                : word
                                                                                                 guistic) differences. This may be used to help building tools (as the previously mentioned).
   pt_br                : word
   pt_oa                : word *                                                                  Lexdiff example: word level changes
   preferencial_pt      : word                                                                    $ lexdiff -wa AmPerd.ptBR AmPerd.ptPT
   preferencial_br      : word                                                                       32 acad^mico → acad´mico
                                                                                                            e             e
   type    : Capit      | Hyphen | Accent | Normal | Excep                                           14 id´ia
                                                                                                          e        → ideia
                                                                                                     12 redarg¨iu → redarguiu
                                                                                                              u
 aden´ide :: aden´ide :: adenoide :: adenoide :: adenoide :: Accent
     o           o                                                                                    7 g^nio
                                                                                                         e         → g´nio
                                                                                                                       e
 adjec¸ao :: adje¸~o :: adje¸ao :: adje¸ao :: adje¸~o :: Normal
      c~         ca          c~           c~         ca                                               4 refletiu → reflectiu
 Mar¸o
    c     :: mar¸o
                c     :: mar¸o
                            c     :: mar¸o
                                        c     :: mar¸o
                                                    c     :: Capit                                    ...

                                                                                                  Lexdiff example: char level changes
2      Updating the dictionary vocabulary                                                         $ lexdiff -cctx AmPerd.ptBR AmPerd.ptPT
  An existing European Portuguese spellchecker dictionary (jSpell) was updated. This dictio-       changed PT→BR (unchanged)         changed            BR→PT (unchan) Concl
nary was later used to generate lists to both the language conversion tools and the language
classifier.                                                                                        !    36    ect→et      (9)                   36      et →ect     (206)    BR →?PT
  From the 2 600 words in OAKB, just 960 were related directly with a lemma in jSpell’s           !    34    d´m→d^m
                                                                                                              e   e      (1)           !!      34      d^m→d´m
                                                                                                                                                        e   e               BR?↔ PT
dictionary. From these 960 lemmas jSpell generates a total of 11 500 words.                            18    dei→d´i
                                                                                                                  e      (164)         !!      18      d´i→dei
                                                                                                                                                        e                   BR → PT
                                                                                                       17    gui→g¨i
                                                                                                                  u      (88)          !!      17      g¨i→gui
                                                                                                                                                        u                   BR → PT
 Update function
                                                                                                       15    que→q¨e
                                                                                                                  u      (2417)        !!      15      q¨e→que
                                                                                                                                                        u                   BR → PT
 Function newdic(oakdb,dicjs)
                                                                                                  !!   11    g´n→g^n
                                                                                                              e   e                    !       11      g^n→g´n
                                                                                                                                                        e   e      (6)      BR → PT
   for ( x ∈ dom(oakb) ∧ oakb[x].type = normal
                                                                                                  !!   9     m´n→m^n
                                                                                                              o   o                    !!      9       m^n→m´n
                                                                                                                                                        o   o               BR ↔ PT
        ∧ x = oakb[x].preferpt ∧ x ∈ dom(dicjs))
                                                                                                  !    8     act→at      (1)                   8       at →act     (456)    BR ← PT
     {
                                                                                                  !!   7     ec¸→e¸
                                                                                                               c  c                            7       e¸ →ec¸
                                                                                                                                                        c    c     (77)     BR ← PT
       neww ← oakb[x].preferpt
                                                                                                  !!   6     ac¸→a¸
                                                                                                               c  c                            6       a¸ →ac¸
                                                                                                                                                        c    c     (431)    BR ← PT
       dicjs[neww] ← dicjs[x]
                                                                                                  !!   6     t´n→t^n
                                                                                                              o   o                    !!      6       t^n→t´n
                                                                                                                                                        o   o               BR ↔ PT
       delete dicjs[x]
     }
                                                                                                  Confusion matrix pt-br        → pt-pt
                                                                                                     et   → {     et →         206, ect → 36 },
                                                                                                     d´i → { dei →
                                                                                                      e                        18 },
3      Language conversion tools                                                                     g¨i → { gui →
                                                                                                      u                        17 },
  As many texts will need to be updated to the new spelling form, there was the need to create       at   → {     at →         456, act → 8 apt          → 1, apt      → 1},
automated conversion tools. Due to the multiple spelling cases, two versions were created: an        e¸
                                                                                                      c   → {     e¸ →
                                                                                                                   c           77, ec¸ → 7, ea¸
                                                                                                                                      c        c         → 2, ep¸
                                                                                                                                                                c      → 2},
European Portuguese converter and a Brazilian Portuguese one.

                                                                           http://natura.di.uminho.pt/

More Related Content

Viewers also liked (8)

Identifying similar text documents
Identifying similar text documentsIdentifying similar text documents
Identifying similar text documents
 
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de FormatosDetecção e Correcção Parcial de Problemas na Conversão de Formatos
Detecção e Correcção Parcial de Problemas na Conversão de Formatos
 
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
Text::Perfide::BookCleaner, a Perl module to clean and normalize plain text b...
 
My Hero- Cody Schiller
My Hero- Cody SchillerMy Hero- Cody Schiller
My Hero- Cody Schiller
 
Nature and environments
Nature and environmentsNature and environments
Nature and environments
 
高性能网站建设指南
高性能网站建设指南高性能网站建设指南
高性能网站建设指南
 
CAS
CASCAS
CAS
 
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
Virdata: lessons learned from the Internet of Things and M2M Cloud Services @...
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Poster - Bigorna, a toolkit for orthography migration challenges

  • 1. Bigorna A Toolkit for Orthography Migration Challenges Jos´ Jo˜o Almeida, Andr´ Santos, Alberto Sim˜es e a e o Abstract Conversion examples Languages are born, evolve and, eventually, die. During this evolution their spelling $ pt2ptao rules (and sometimes the syntactic and semantic ones) change, putting old documents A adop¸ao do acordo implica a actualiza¸ao de ferramentas. c~ c~ out of use. In Portugal, a pair of political agreements with Brazil forced relevant ↓ changes on the way the Portuguese language is written, the most recent one being the A ado¸~o do acordo implica a atualiza¸ao de ferramentas. ca c~ Portuguese Language Orthographic Agreement (PLOA) signed in 1990. Bigorna is a toolkit for the classification of language variants, their comparison and the conversion of texts in different language versions. As Bigorna relies on a set Conversion examples of conversion rules we will also discuss how to infer conversion rules from a set of $ br2brao documents (texts with different ages). Ele fez um v^o rasante sobre a ar´ia. o e ↓ Contents Ele fez um voo rasante sobre a areia. 1 Compiling OAKB 1 2 Updating the dictionary vocabulary 1 4 Variant classifier tool After creating the language converters became clear the need to have a language classifier, 3 Language conversion tools 1 capable of detecting the variant of Portuguese in which a given text was written, allowing to automatically differentiate texts (possibly to further conversion). 4 Variant classifier tool 1 To build this tool two lists were generated: one with European Portuguese-only words and 5 Lexical comparison tools 1 another with Brazilian Portuguese ones. Calculating the lists Function calc_whichpt_lsts(dicpt,dicbr,oakb) for ( x ∈ dom(oakb) ∧ oakb[x].type = (normal or accent) ∧ oakb[x].pt_pt = oakb[x].pt_br ∧ oakb[x].pt_pt ∈ dom(dicpt) ∧ oakb[x].pt_br ∈ dom(dicbr)) { wpt ← oakb[x].pt_pt wbr ← oakb[x].pt_br justpt ← justpt ∪ {x ∈ deriv(wpt,dicpt)| x ∈ dicbr} / justbr ← justbr ∪ {x ∈ deriv(wbr,dicbr)| x ∈ dicpt} / } Language classifier definition Function classify_pt(text) for ( x ← text ) if ( x ∈ justpt ) PTcount++ if ( x ∈ justbr ) BRcount++ 1 Compiling OAKB compare ( PTcount, BRcount) A table containing all the information about the word changes. This table was built based on previously existing resources and proved to be crucial to the subsequent tasks performed. OAKB structure 5 Lexical comparison tools There are other situations where there are no available lists of words, only documents with oakb = entry* different orthographic versions. entry = lexdiff, is able to compare two versions of a text with different spelling and detect (lin- pt_pt : word guistic) differences. This may be used to help building tools (as the previously mentioned). pt_br : word pt_oa : word * Lexdiff example: word level changes preferencial_pt : word $ lexdiff -wa AmPerd.ptBR AmPerd.ptPT preferencial_br : word 32 acad^mico → acad´mico e e type : Capit | Hyphen | Accent | Normal | Excep 14 id´ia e → ideia 12 redarg¨iu → redarguiu u aden´ide :: aden´ide :: adenoide :: adenoide :: adenoide :: Accent o o 7 g^nio e → g´nio e adjec¸ao :: adje¸~o :: adje¸ao :: adje¸ao :: adje¸~o :: Normal c~ ca c~ c~ ca 4 refletiu → reflectiu Mar¸o c :: mar¸o c :: mar¸o c :: mar¸o c :: mar¸o c :: Capit ... Lexdiff example: char level changes 2 Updating the dictionary vocabulary $ lexdiff -cctx AmPerd.ptBR AmPerd.ptPT An existing European Portuguese spellchecker dictionary (jSpell) was updated. This dictio- changed PT→BR (unchanged) changed BR→PT (unchan) Concl nary was later used to generate lists to both the language conversion tools and the language classifier. ! 36 ect→et (9) 36 et →ect (206) BR →?PT From the 2 600 words in OAKB, just 960 were related directly with a lemma in jSpell’s ! 34 d´m→d^m e e (1) !! 34 d^m→d´m e e BR?↔ PT dictionary. From these 960 lemmas jSpell generates a total of 11 500 words. 18 dei→d´i e (164) !! 18 d´i→dei e BR → PT 17 gui→g¨i u (88) !! 17 g¨i→gui u BR → PT Update function 15 que→q¨e u (2417) !! 15 q¨e→que u BR → PT Function newdic(oakdb,dicjs) !! 11 g´n→g^n e e ! 11 g^n→g´n e e (6) BR → PT for ( x ∈ dom(oakb) ∧ oakb[x].type = normal !! 9 m´n→m^n o o !! 9 m^n→m´n o o BR ↔ PT ∧ x = oakb[x].preferpt ∧ x ∈ dom(dicjs)) ! 8 act→at (1) 8 at →act (456) BR ← PT { !! 7 ec¸→e¸ c c 7 e¸ →ec¸ c c (77) BR ← PT neww ← oakb[x].preferpt !! 6 ac¸→a¸ c c 6 a¸ →ac¸ c c (431) BR ← PT dicjs[neww] ← dicjs[x] !! 6 t´n→t^n o o !! 6 t^n→t´n o o BR ↔ PT delete dicjs[x] } Confusion matrix pt-br → pt-pt et → { et → 206, ect → 36 }, d´i → { dei → e 18 }, 3 Language conversion tools g¨i → { gui → u 17 }, As many texts will need to be updated to the new spelling form, there was the need to create at → { at → 456, act → 8 apt → 1, apt → 1}, automated conversion tools. Due to the multiple spelling cases, two versions were created: an e¸ c → { e¸ → c 77, ec¸ → 7, ea¸ c c → 2, ep¸ c → 2}, European Portuguese converter and a Brazilian Portuguese one. http://natura.di.uminho.pt/