Constraint Grammar and Apertium


Apertium is a free and open source MT platform, where both the linguistic data and engines are under free licences. Constraint Grammar is used for pre-disambiguation in several language pairs.


  1. CG in Apertium. Kevin Brubeck Unhammer, University of Bergen, Norway, 14th May 2009
  2. What is Apertium?
     • An Open Source Machine Translation platform: both source code and data have Free / Open Source licences
     • Modular: stand-alone programs communicate through standard Unix pipes, and particular language pairs need not use all modules
     • Developed by universities, companies and independent (volunteer and paid) developers
  3. History of Apertium
     • Initially developed for closely related languages (Portuguese ↔ Spanish ↔ Catalan) by the Transducens group at the Universitat d’Alacant
     • Later extended to allow more distant language pairs
     • Now also involves various companies in Spain and the universities of Vigo, Reykjavík, Oviedo, Barcelona (Pompeu Fabra), etc.
  4. Language pairs
     • “Stable”: Spanish ↔ Catalan, Spanish ← Romanian, French ↔ Catalan, Occitan ↔ Catalan, English ↔ Galician, Occitan ↔ Spanish, Spanish ↔ Portuguese, English ↔ Catalan, English ↔ Spanish, English → Esperanto, Spanish ↔ Galician, French ↔ Spanish, Esperanto ← Spanish, Welsh → English, Esperanto ← Catalan, Portuguese ↔ Catalan, Portuguese ↔ Galician, Basque → Spanish
     • Other pairs being developed: Spanish ↔ Asturian, Icelandic ↔ English, Swedish ↔ Danish, Nynorsk ↔ Bokmål, ...
  5. [Figure; labels: Marginalised, Few free resources, Copious free resources]
  6. Modules
     • Morphological dictionaries (lttoolbox): XML format, compiles to FSTs; fast (seems to perform 5x faster than SFST); one dictionary gives both analysis and generation
     • CG pre-disambiguation
     • Statistical disambiguation (HMM)
     • Bilingual dictionary for lexical transfer
     • Shallow syntactic transfer rules: local re-ordering (nom adj → adj nom); chunking (adj adj nom → SN[adj adj nom]); insertions, deletions and substitutions of lexical units and chunks
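
The local re-ordering rule listed among the modules (nom adj → adj nom) can be illustrated with a toy sketch. This is not how Apertium's transfer module works internally, and real rules are not written in Python; the function name `reorder_nom_adj`, the tag names `n` and `adj`, and the (lemma, tag) pair representation are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of a word: (lemma, part-of-speech tag).
Word = Tuple[str, str]


def reorder_nom_adj(words: List[Word]) -> List[Word]:
    """Swap every noun that is immediately followed by an adjective."""
    out: List[Word] = []
    i = 0
    while i < len(words):
        is_nom_adj = (
            i + 1 < len(words)
            and words[i][1] == "n"
            and words[i + 1][1] == "adj"
        )
        if is_nom_adj:
            # nom adj -> adj nom
            out.extend([words[i + 1], words[i]])
            i += 2
        else:
            out.append(words[i])
            i += 1
    return out


if __name__ == "__main__":
    # Toy input using lemmas only; no lexical transfer is performed here.
    print(reorder_nom_adj([("el", "det"), ("casa", "n"), ("blanco", "adj")]))
```

The sketch only looks at adjacent pairs, which is the sense in which the re-ordering is "local".
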
  7. A sketch of the architecture
  8. The Apertium Stream Format
     • Simple example from Norwegian Bokmål: “lese en” (‘read a/one’)
     • Morphological analysis gives:
         ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$
     • After CG:
         ^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$
     • Formatting information (like HTML tags) is saved in superblanks, making document and web translation easy:
         original:    Kva er det du <em>seier</em>?
         deformatted: Kva er det du[ <em>]seier[</em>]?
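
To make the stream format concrete, here is a minimal Python sketch that splits the example from this slide into lexical units and their analyses. It is an illustration only, not Apertium's own reader: the names `LexicalUnit` and `parse_stream` are hypothetical, and the sketch ignores escaped characters and assumes superblanks contain no `^` or `$`.

```python
import re
from typing import List, NamedTuple


class LexicalUnit(NamedTuple):
    surface: str         # surface form, e.g. "en"
    readings: List[str]  # analyses, e.g. ["en<num><sg><mf>", ...]


# A lexical unit runs from "^" to "$". Simplification: no escape handling.
_UNIT = re.compile(r"\^([^$]*)\$")


def parse_stream(text: str) -> List[LexicalUnit]:
    """Return the lexical units found in a stream-format string."""
    units = []
    for match in _UNIT.finditer(text):
        # The first "/"-separated field is the surface form, the rest are analyses.
        surface, *readings = [part.strip() for part in match.group(1).split("/")]
        units.append(LexicalUnit(surface, readings))
    return units


BEFORE_CG = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/ene<vblex><imp>/en<det><ind><mf><sg>$"
AFTER_CG = "^lese/lese<vblex><inf>$ ^en/en<num><sg><mf>/en<det><ind><mf><sg>$"

if __name__ == "__main__":
    for label, text in (("before CG", BEFORE_CG), ("after CG", AFTER_CG)):
        for unit in parse_stream(text):
            print(label, unit.surface, len(unit.readings), "reading(s)")
```

Run on the two strings from the slide, the sketch shows "en" going from three analyses to two: CG has removed the imperative reading but left the numeral/determiner ambiguity.
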
  9. Visualising the process helps find errors
  10. The platform provides
      • a language-independent machine translation engine
      • tools to manage the linguistic data necessary to build a machine translation system for a given language pair
      • little programming knowledge required to get started
      • graphical user interfaces that show each step in the translation process
      • many more advanced tools (e.g. for merging or sorting dictionaries)
      • linguistic data for a growing number of language pairs
      • also usable for other NLP purposes (spelling & grammar checking, ...)
  11. CG in Apertium
      • Used after morphological analysis for pre-disambiguation in Nynorsk ↔ Bokmål, Welsh ↔ English, Breton ↔ French, Irish ↔ Scottish Gaelic
      • Apertium’s own statistical disambiguator makes a choice if CG doesn’t completely disambiguate
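
The two-stage arrangement on this slide (CG removes what it can, the statistical tagger decides the rest) can be sketched as follows. Everything here is a toy under stated assumptions: the constraint in `remove_imp_after_inf` is a made-up rule, not taken from any Apertium grammar, and `fallback_pick` simply keeps the first remaining reading as a stand-in for the HMM tagger, which in reality chooses by probability.

```python
from typing import List

# One cohort = the list of readings for one word, as stream-format analyses.
Cohort = List[str]


def remove_imp_after_inf(sentence: List[Cohort]) -> List[Cohort]:
    """Hypothetical constraint: drop imperative readings directly after an
    unambiguous infinitive."""
    result = [list(sentence[0])] if sentence else []
    for prev, cohort in zip(sentence, sentence[1:]):
        kept = cohort
        if len(prev) == 1 and "<inf>" in prev[0]:
            filtered = [r for r in cohort if "<imp>" not in r]
            if filtered:  # never remove the last remaining reading
                kept = filtered
        result.append(list(kept))
    return result


def fallback_pick(sentence: List[Cohort]) -> List[str]:
    """Stand-in for the statistical disambiguator: keep the first reading.
    Assumes every cohort has at least one reading."""
    return [cohort[0] for cohort in sentence]


SENTENCE = [
    ["lese<vblex><inf>"],
    ["en<num><sg><mf>", "ene<vblex><imp>", "en<det><ind><mf><sg>"],
]

if __name__ == "__main__":
    after_cg = remove_imp_after_inf(SENTENCE)
    print(after_cg)                 # still ambiguous for "en"
    print(fallback_pick(after_cg))  # one reading per word
```

The point of the design is that the rule-based step only has to be safe, not complete: whatever ambiguity it leaves is resolved downstream.
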
  12. CG in Apertium
      • Norwegian CG is from the Oslo-Bergen Tagger (GPL)
      • Sámi giellatekno provides Free grammars for Sámi languages and Faroese
      • Irish grammar mostly converted manually from the An Gramadóir project (GPL)
      • Other grammars made solely by Apertium members
  13. Some statistics
      Table: Rule counts for some of the CG grammars in Apertium

               Sections  Rules  Sets  Tags
      Welsh           2     98   141   128
      Breton          4    121   125   154
      Irish           1    285   298   292
  14. Same concepts apply between modules
      Table: Terminology differences

      CG        Apertium/lttoolbox      Apertium stream format
      wordform  surface form            books
      baseform  lemma                   book
      cohort    ambiguous lexical unit  ^books/book<n><pl>/book<vblex><pres><p3><sg>$
      reading   analysis                /book<n><pl>/
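
The terminology mapping can be shown by rendering a stream-format lexical unit in CG's textual cohort layout (wordform line, then one indented reading per line). This is a simplified sketch, not the converter that vislcg3 itself uses: the function name `lexical_unit_to_cohort` is hypothetical, escaping is ignored, and tags are copied over unchanged.

```python
import re

_TAG = re.compile(r"<([^>]+)>")


def lexical_unit_to_cohort(unit: str) -> str:
    """Render ^surface/analysis/...$ as a CG cohort."""
    surface, *analyses = unit.strip().lstrip("^").rstrip("$").split("/")
    lines = ['"<%s>"' % surface]               # surface form -> CG wordform
    for analysis in analyses:                  # analysis     -> CG reading
        lemma = analysis.split("<", 1)[0]      # lemma        -> CG baseform
        tags = " ".join(_TAG.findall(analysis))
        lines.append(('\t"%s" %s' % (lemma, tags)).rstrip())
    return "\n".join(lines)


if __name__ == "__main__":
    print(lexical_unit_to_cohort("^books/book<n><pl>/book<vblex><pres><p3><sg>$"))
```

For the ambiguous lexical unit in the table, the result is a cohort with the wordform `"<books>"` and two readings that share the baseform `"book"`.
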
  15. Same format readable by all modules
      • Both SFST/HFST and vislcg3 read and write the Apertium stream format.
      • Example from the Open Morphology of Finnish, output by the Apertium reader in SFST/HFST:
          ^kaikki/kaikki<noun><7><a><sg><nom>$
          ^ihmiset/ihminen<noun><38><pl><acc>/ihminen<noun><38><pl><nom>$
          ^syntyvät/syntyä<verb><52><j><act><pcpva><pl><acc>/syntyä<verb><52><j><act><pcpva><pl><nom>/syntyä<verb><52><j><act><indv><pres><pl3>$
          ^vapaina/vapaa<noun><17><pl><ess>$
          ^ja/*ja$
          ^tasavertaisina/*tasavertaisina$
          ^arvoltaan/arvo<noun><1><sg><abl><pl3>/arvo<noun><1><sg><abl><sg3>$
          ^ja/*ja$
          ^oikeuksiltaan/oikeus<noun><40><pl><abl><pl3>/oikeus<noun><40><pl><abl><sg3>$
  16. Why Apertium
      • Rule-based MT
          • Most languages of the world have little freely available textual data, let alone parallel corpora for SMT purposes; Apertium is thus suitable for marginalised languages
          • Rule-based systems are linguistically interesting, and provide test beds for linguistic theory
      • Reuse and Interoperability
          • Monolingual dictionaries and constraint grammars are directly reusable for new language pairs
          • apertium-dixtools: generates new language pairs from existing ones
          • vislcg3 reads and outputs the Apertium stream format, as do Stuttgart/Helsinki Finite State Tools
          • Free licences allow other systems to use Apertium data and tools
  17. Why Apertium
      • Open Source + fairly simple learning curve = great potential for contributors
      • E.g. Jacob Nordfalk: entered Apertium last fall, had English → Esperanto pair by March 2009
      • Very helpful and accessible community
  18. Future work: dependency-based reordering in Apertium
      • Currently, CG is only used for disambiguation
      • Many constraint grammars out there give dependency information; this could be integrated into Apertium to provide dependency-based reordering, simplifying the transfer step
  19. Future work: integration with Matxin
      Matxin is a Free Software sister project of Apertium which currently uses FreeLing for dependency analyses:
        <SENTENCE ord='1'>
          <CHUNK ord='2' type='grup-verb' si='top'>
            <NODE ord='4' alloc='19' form='sacude' lem='sacudir' mi='VMIP3S0'> </NODE>
            <CHUNK ord='1' type='sn' si='subj'>
              <NODE ord='3' alloc='10' form='atentado' lem='atentado' mi='NCMS000'>
                <NODE ord='1' alloc='0' form='Un' lem='uno' mi='DI0MS0'> </NODE>
                <NODE ord='2' alloc='3' form='triple' lem='triple' mi='AQ0CS0'> </NODE>
              </NODE>
            </CHUNK>
            <CHUNK ord='3' type='sn' si='obj'>
              <NODE ord='5' alloc='26' form='Bagdad' lem='Bagdad' mi='NP00000'> </NODE>
            </CHUNK>
            <CHUNK ord='4' type='F-term' si='modnomatch'>
              <NODE ord='6' alloc='32' form='.' lem='.' mi='Fp'> </NODE>
            </CHUNK>
          </CHUNK>
        </SENTENCE>
  20. Future work: integration with Matxin
      We would like to get CG dependency information into a Matxin-compatible format. Apertium’s CG would handle analysis while Matxin handles the transfer step. E.g. given the following analysis (Faroese):
        "<Í>"
            "í" Pr @ADVL> #1->3
        "<upphavi>"
            "upphav" N Neu Sg Dat Indef @P< #2->1
        "<skapti>"
            "skapa" V Ind Prt Sg @VMAIN #3->0
        "<Gud>"
            "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
        "<himmal>"
            "himmal" N Msc Sg Acc Indef @<OBJ #5->3
  21. Future work: integration with Matxin
      ...we would like to get this dependency tree structure:
        <SENTENCE ord="1">
          <NODE form='skapti' lem='skapa' ord='3' mi='V.Ind.Prt.Sg' si='VMAIN'>
            <NODE form='Í' lem='Í' ord='1' mi='Pr' si='ADVL'>
              <NODE form='upphavi' lem='upphav' ord='2' mi='N.Neu.Sg.Dat.Indef' si='P'/>
            </NODE>
            <NODE form='Gud' lem='Gud' ord='4' mi='N.Prop.Sg.Nom' si='SUBJ'/>
            <NODE form='himmal' lem='himmal' ord='5' mi='N.Msc.Sg.Acc.Indef' si='OBJ'/>
          </NODE>
        </SENTENCE>
      and let Matxin do reordering and other transfer operations
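
The conversion proposed on slides 20 and 21 can be sketched in Python: read CG output with `#n->m` dependency marks and nest `NODE` elements under their heads. This is a rough illustration under several assumptions, not an existing Apertium or Matxin tool: the name `cg_to_tree` is hypothetical, only the first reading of each cohort is used, no `CHUNK` elements or `alloc` attributes are produced, and the input is assumed to contain no dependency cycles.

```python
import re
import xml.etree.ElementTree as ET

WORDFORM = re.compile(r'^"<(?P<form>.*)>"\s*$')
READING = re.compile(
    r'^"(?P<lem>[^"]*)"\s+(?P<rest>.*?)\s+#(?P<ord>\d+)->(?P<head>\d+)\s*$'
)

SAMPLE = '''
"<Í>"
    "í" Pr @ADVL> #1->3
"<upphavi>"
    "upphav" N Neu Sg Dat Indef @P< #2->1
"<skapti>"
    "skapa" V Ind Prt Sg @VMAIN #3->0
"<Gud>"
    "gudur" N Msc Sg Acc Indef @<SUBJ #4->3
"<himmal>"
    "himmal" N Msc Sg Acc Indef @<OBJ #5->3
'''


def cg_to_tree(cg_text: str) -> ET.Element:
    """Build a SENTENCE element whose NODEs are nested by dependency head."""
    sentence = ET.Element("SENTENCE", ord="1")
    nodes, heads = {}, {}
    form = None
    for raw in cg_text.splitlines():
        line = raw.strip()
        wordform = WORDFORM.match(line)
        if wordform:
            form = wordform.group("form")
            continue
        reading = READING.match(line)
        if reading and form is not None:
            tags = reading.group("rest").split()
            # Morphological tags joined with "."; function tag without @, < and >.
            mi = ".".join(t for t in tags if not t.startswith("@"))
            si = "".join(t.strip("@<>") for t in tags if t.startswith("@"))
            position = int(reading.group("ord"))
            nodes[position] = ET.Element(
                "NODE", form=form, lem=reading.group("lem"),
                ord=str(position), mi=mi, si=si,
            )
            heads[position] = int(reading.group("head"))
            form = None  # ignore any further readings of this cohort
    for position in sorted(nodes):
        # Head 0 (or an unknown head) attaches directly to SENTENCE.
        parent = nodes.get(heads[position], sentence)
        parent.append(nodes[position])
    return sentence


if __name__ == "__main__":
    print(ET.tostring(cg_to_tree(SAMPLE), encoding="unicode"))
```

Because the sketch copies lemmas and tags straight from the CG input, its output differs from the tree on the slide where the slide's values differ from the analysis shown on slide 20 (for example the lemma and tags given for "Gud").
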
  22. Thanks for listening!
  23. Licences: This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences (GNU GPL v. 3.0, GNU FDL v. 1.2, CC-BY-SA v. 3.0).