Reuse of Free Resources in Machine Translation between Norwegian Nynorsk and Bokmål

(PDF: http://tr.im/freennnb Article: http://hdl.handle.net/10045/12025 )
We describe the development of a two-way shallow-transfer machine translation system between Norwegian Nynorsk and Norwegian Bokmål built on the Apertium platform, using the Free and Open Source resources Norsk Ordbank and the Oslo–Bergen Constraint Grammar tagger. We detail the integration of these and other resources in the system along with the construction of the lexical and structural transfer, and evaluate the translation quality in comparison with another system. Finally, some future work is suggested.


Transcript

  • 1. Reuse of Free Resources in Machine Translation between Nynorsk and Bokmål
      Short title (slide header): Reuse of Free Resources in Nynorsk↔Bokmål MT
      Kevin Unhammer (1), Trond Trosterud (2)
      (1) Department of Linguistics, University of Bergen, Bergen, Norway, kun041@student.uib.no
      (2) Department of Linguistics, University of Tromsø, Tromsø, Norway, trond.trosterud@uit.no
      2nd November 2009
  • 2. Outline of talk
      • Introduction
          • Nynorsk and Bokmål
          • Norwegian language resources
      • The Apertium architecture and nn-nb pipeline
          • Constraint Grammar
      • Developing apertium-nn-nb
          • Disambiguation and CG conversion
          • Translation dictionary
          • Structural transfer
      • Evaluation
          • Coverage
          • WER and BLEU
      • Future work
  • 3. The Norwegian language(s)
      • A lot of dialectal variation
      • Two written variants:
          • Bokmål: based on Danish and the Dano-Norwegian koiné of the major cities in the 1800s
          • Nynorsk: based on the spoken dialects of Norway, standardised by linguist Ivar Aasen in the late 1800s
      • Nynorsk used by around 12% of the population
      • “Language-friendly” politics: both standards are officially recognised and both are taught in school from age 12 and up
      • Both Nynorsk and Bokmål allow quite a lot of variation, with some choices being considered more “radical” or “conservative” than others
  • 4. Free, Open Source Norwegian language resources
      • Norsk Ordbank
          • full form dictionaries for Nynorsk and Bokmål; 106,789 and 142,899 lemmas, respectively
      • The Oslo–Bergen tagger
          • Constraint Grammar morphological disambiguation
          • Constraint Grammar syntactic dependency parser
          • Various other modules (compounding, NER, ...)
      • No freely available bilingual dictionary between Nynorsk and Bokmål, until now...
  • 5. The apertium-nn-nb pipeline
      • Morphological analysis
          • lttoolbox: XML format, compiles to very fast FSTs
          • one XML dictionary gives both analysis and generation
      • CG pre-disambiguation
      • Statistical disambiguation (HMM)
      • Bilingual dictionary for lexical transfer
      • Shallow syntactic transfer rules
          • Local re-ordering (det noun → noun det)
          • Insertions, deletions and substitutions of lexical units (and chunks, but we don’t use them yet)
      • Morphological generation (again with lttoolbox)
  • 6. Constraint Grammar
      • Rules work on ambiguous input and may SELECT one analysis over all others, or REMOVE one analysis from the set of analyses, or ADD a new tag, etc.
      • Often thousands of short, hand-written rules
      • Rules apply based on “context conditions”:
          • (-1* noun) means “there must be a word with a noun analysis somewhere to the left”
          • (1C* verb) means “there must be a word disambiguated to a verb somewhere to the right”
          • (1* verb LINK 2 noun) means “there must be a verb-analysis to the right, and a noun-analysis two positions to the right of that”
          • (1* verb BARRIER noun) means “there must be a verb-analysis to the right, and no noun-analyses before that”
      • There are many other possibilities...
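The context conditions on this slide can be mimicked with a few lines of Python. This is only a rough sketch under stated assumptions: the cohort model (a word is a list of readings, each reading a set of tags), the helper names `has_tag`, `at` and `scan`, and the choice to test the target before the barrier are all illustrative and are not taken from the Oslo–Bergen tagger or from any CG compiler.

```python
from typing import Callable, List, Optional

# Toy model (an assumption for illustration): a cohort is one word with all
# of its remaining readings; a reading is a frozenset of tags.
Cohort = List[frozenset]


def has_tag(cohort: Cohort, tag: str, careful: bool = False) -> bool:
    """True if some reading carries `tag`. With careful=True (the C flag),
    every remaining reading must carry it, i.e. the word is disambiguated."""
    if not cohort:
        return False
    check = all if careful else any
    return check(tag in reading for reading in cohort)


def at(sentence: List[Cohort], i: int, tag: str, careful: bool = False) -> bool:
    """Test a single absolute position, returning False outside the sentence."""
    return 0 <= i < len(sentence) and has_tag(sentence[i], tag, careful)


def scan(sentence: List[Cohort], start: int, step: int, tag: str,
         careful: bool = False, barrier: Optional[str] = None,
         link: Optional[Callable[[int], bool]] = None) -> Optional[int]:
    """Scan away from `start` (step=+1 right, step=-1 left) for `tag`.
    Returns the index of the first match, or None. A cohort matching
    `barrier` stops the scan; `link` is an extra test on the matched index."""
    i = start + step
    while 0 <= i < len(sentence):
        if has_tag(sentence[i], tag, careful) and (link is None or link(i)):
            return i
        if barrier is not None and has_tag(sentence[i], barrier):
            return None
        i += step
    return None


def r(*tags: str) -> frozenset:
    return frozenset(tags)


sentence = [
    [r("pron", "3sg")],                               # she
    [r("verb", "3sg", "present"), r("noun", "pl")],   # walks (ambiguous)
    [r("det")],                                       # the
    [r("noun", "sg")],                                # dog
]

# (-1* noun), seen from "dog": a noun reading somewhere to the left
noun_left = scan(sentence, 3, -1, "noun")
# (1C* verb), seen from "she": a word disambiguated to verb to the right
sure_verb_right = scan(sentence, 0, +1, "verb", careful=True)
# (1* verb LINK 2 noun), seen from "she"
linked = scan(sentence, 0, +1, "verb",
              link=lambda i: at(sentence, i + 2, "noun"))
# (-1* pron BARRIER verb), seen from "dog": "walks" blocks the scan
blocked = scan(sentence, 3, -1, "pron", barrier="verb")
```

The careful test fails here because "walks" still has a noun reading, which is the distinction the C flag is meant to capture.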
  • 7. Example of a CG rule
      If input contains the word ‘walks’ analysed as either verb 3sg present or noun pl, the following rule

          SELECT (verb 3sg present) IF
              (-1*C 3sg BARRIER verb)
              (NOT -1 det);

      would choose the verb analysis if there is a disambiguated word, analysed as third singular, somewhere to the left, with no verb between the two; and there is no determiner immediately to the left
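To see what the rule on this slide does to a concrete sentence, here is a minimal Python sketch that hard-codes that one rule. The data layout (a sentence as a list of cohorts, each cohort a list of tag sets) and the function name `select_verb` are assumptions made for illustration; a real CG engine compiles rules from a grammar file rather than hard-coding them like this.

```python
def r(*tags):
    """One reading (analysis) as a frozenset of tags."""
    return frozenset(tags)


TARGET = r("verb", "3sg", "present")


def select_verb(sentence, i):
    """Hand-written equivalent of
        SELECT (verb 3sg present) IF (-1*C 3sg BARRIER verb) (NOT -1 det);
    applied to the cohort at index i. Returns the readings that remain."""
    cohort = sentence[i]
    wanted = [reading for reading in cohort if TARGET <= reading]
    if not wanted or len(wanted) == len(cohort):
        return cohort                      # nothing to select, or already unambiguous
    # (NOT -1 det): no determiner reading immediately to the left
    if i > 0 and any("det" in reading for reading in sentence[i - 1]):
        return cohort
    # (-1*C 3sg BARRIER verb): scan left for a word whose readings are all 3sg
    for j in range(i - 1, -1, -1):
        left = sentence[j]
        if left and all("3sg" in reading for reading in left):
            return wanted                  # condition met: keep only the verb reading
        if any("verb" in reading for reading in left):
            break                          # barrier reached first
    return cohort


walks = [r("verb", "3sg", "present"), r("noun", "pl")]

# "she often walks": the rule fires
she_often_walks = [[r("pron", "3sg")], [r("adv")], walks]
fired = select_verb(she_often_walks, 2)

# "she likes the walks": the determiner blocks the rule
she_likes_the_walks = [[r("pron", "3sg")], [r("verb", "3sg", "present")],
                       [r("det")], walks]
not_fired = select_verb(she_likes_the_walks, 3)
```

In the second sentence both readings of "walks" survive, so a later rule or the statistical tagger would have to pick the noun reading.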
  • 8. Development of apertium-nn-nb
      • Most of the work done within 12 weeks (Google Summer of Code 2009)
      • Helped by high quality free resources
          • Monolingual dictionaries: Norsk Ordbank converted from full form listing to lttoolbox format
          • CG: Oslo–Bergen tagger converted to use Apertium tag scheme
  • 9. Disambiguation and CG conversion
      • Bigram HMMs trained on Wikipedia text (Baum-Welch, 8 iterations)
      • Conversion of CG tag set mostly done within a few days
      • Errors fixed in CG reported back to Oslo–Bergen tagger team, win-win.
      • However: the Oslo–Bergen tagger was designed for corpus annotation and lexicography
          • For the linguist, recall is more important than precision
          • For (our) MT, only one analysis matters
          • So we need to take more chances with our rules
      • Also, we get some MT-specific rules (like CG-based lexical selection)
  • 11. Finding word translations semi-automatically
      • Method 1: Exact matches where the morphology is the same
          • If lemma and morphological possibilities are the same, assume we have a translation
          • ‘snøvle’, verb, pres/pass/imp/pret/inf... exists in both monolingual dictionaries; add it as a translation
          • 36,000 entries (although quite a lot are low-frequency / loan-words)
          • Risk of “radical forms”
      • Method 2: Predictable substring-translations
          • find Bokmål entries without translations
          • run string replacements for typical differences (-hjem-→-heim-, -lig→-leg, ...)
          • check if the altered entries are in the Nynorsk analyser
          • ... and vice versa
          • Main run gave 2500 good entries
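Methods 1 and 2 might be sketched roughly as follows in Python. Everything concrete here is an assumption for illustration: the two miniature dictionaries, the morphology signatures, and the extra requirement in Method 2 that the morphology also matches (the slide only says the altered entry must be in the Nynorsk analyser). Only the two replacement patterns come from the slide.

```python
import re

# Hypothetical miniature monolingual dictionaries: lemma -> (part of speech,
# signature of available forms). Not real Norsk Ordbank data.
nb_dict = {
    "snøvle": ("verb", "pres/pass/imp/pret/inf"),
    "hjemland": ("noun", "sg/pl"),
    "farlig": ("adj", "pos/comp/sup"),
    "kirke": ("noun", "sg/pl"),
}
nn_dict = {
    "snøvle": ("verb", "pres/pass/imp/pret/inf"),
    "heimland": ("noun", "sg/pl"),
    "farleg": ("adj", "pos/comp/sup"),
    "kyrkje": ("noun", "sg/pl"),
}

# Typical nb -> nn differences named on the slide.
REPLACEMENTS = [(r"hjem", "heim"), (r"lig$", "leg")]


def exact_matches(source, target):
    """Method 1: same lemma and same morphology in both dictionaries."""
    return {lemma: lemma for lemma, morph in source.items()
            if target.get(lemma) == morph}


def substring_matches(source, target, already_known):
    """Method 2: rewrite untranslated lemmas with the replacement patterns
    and keep those whose rewritten form exists in the target dictionary."""
    found = {}
    for lemma, morph in source.items():
        if lemma in already_known:
            continue
        candidate = lemma
        for pattern, replacement in REPLACEMENTS:
            candidate = re.sub(pattern, replacement, candidate)
        if candidate != lemma and target.get(candidate) == morph:
            found[lemma] = candidate
    return found


method1 = exact_matches(nb_dict, nn_dict)
method2 = substring_matches(nb_dict, nn_dict, method1)
still_missing = sorted(set(nb_dict) - set(method1) - set(method2))
```

In this toy data "kirke" stays untranslated because kirke→kyrkje is not covered by the listed patterns; entries like that would be left for the alignment-based and user-contributed methods.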
  • 13. Expanding the translation dictionary using alignments
      • Method 3: Automatic word alignments
          • Corpora:
              • KDE4 software translations (400,000 words)
              • government web pages (50,000 words, crawled with bitextor)
          • po-terminology (only on KDE4)
              • gave some hundreds of new terms
          • morphological tagging → Giza++ → ReTraTos
              • about 3500 entries
              • Lots of cleaning needed
      • Method 4: User-contributed entries (via Wikipedia)
  • 14. Structural transfer
      Finite passive verbs
      (1) a. Bevilgning gis oftest ikke  (Bokmål)
             grant.IND give.PRES.PASS usually not
          b. Løyve blir oftast ikkje gjeve  (Nynorsk)
             grant.IND AUX usually not give.PART
             ‘Grants are usually not given’
          c. Om høsten fylles fjorden med sild  (Bokmål)
             In fall.DEF fill.PRES.PASS fjord.DEF with herring
          d. Om hausten blir fjorden fylt med sild  (Nynorsk)
             In fall.DEF AUX fjord.DEF fill.PART with herring
             ‘In fall, the fjord is filled with herring’
  • 15. Structural transfer
      Genitive noun phrases
      (2) a. forfatterens siste utgivelse  (Bokmål)
             author.DEF.GEN last publication.IND
          b. den siste utgjevinga til forfattaren  (Nynorsk)
             the last publication.DEF of author.DEF
             ‘the author’s last publication’
          c. mitt nye luftputefartøy  (Bokmål)
             my new hovercraft.IND
          d. det nye luftputefartøyet mitt  (Nynorsk)
             the new hovercraft.DEF mine
             ‘my new hovercraft’
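The possessive pattern in example (2c–d) can be illustrated with a toy reordering rule in Python. This is not the Apertium transfer-rule format and not the authors' rule: the token layout, the tag names (`poss`, `adj`, `noun`, `ind`, `nt`), and the tiny lookup tables `DEFINITE` and `ARTICLE` are all hypothetical and cover only this one example.

```python
# Hypothetical lookup tables standing in for morphological generation.
DEFINITE = {"luftputefartøy": "luftputefartøyet"}
ARTICLE = {"m": "den", "f": "den", "nt": "det"}


def postpose_possessive(tokens):
    """Rewrite  possessive + adjective + indefinite noun
    as          article + adjective + definite noun + possessive.
    `tokens` is a list of (surface form, set of tags); returns surface forms."""
    out = []
    i = 0
    while i < len(tokens):
        window = tokens[i:i + 3]
        if len(window) == 3:
            (poss, poss_tags), (adj, adj_tags), (noun, noun_tags) = window
            gender = next((g for g in ARTICLE if g in noun_tags), None)
            if ("poss" in poss_tags and "adj" in adj_tags
                    and {"noun", "ind"} <= noun_tags
                    and gender is not None and noun in DEFINITE):
                out += [ARTICLE[gender], adj, DEFINITE[noun], poss]
                i += 3
                continue
        out.append(tokens[i][0])   # no match: copy the word unchanged
        i += 1
    return out


phrase = [
    ("mitt", {"poss", "nt"}),
    ("nye", {"adj"}),
    ("luftputefartøy", {"noun", "nt", "ind"}),
]
reordered = postpose_possessive(phrase)
untouched = postpose_possessive([("eit", {"det"}), ("hus", {"noun", "nt", "ind"})])
```

The sketch works on a fixed three-word window, which is the sense in which such transfer is "shallow": it matches local tag patterns rather than a full parse.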
  • 16. Evaluation
      • Coverage
      • WER
      • BLEU
  • 17. Coverage
      • Naïve coverage on Nynorsk Wikipedia: 89.6%
      • Naïve coverage on Bokmål Wikipedia: 88.2%
      • Coverage seems to be the most important issue:
          • Not only is every 10th word untranslated, but we get disambiguation problems and transfer problems in the rest of the sentence
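As a rough illustration of the coverage figures, the Python sketch below assumes that "naïve coverage" means the share of running tokens for which the analyser returns at least one analysis, whether or not it is the correct one. The function name, the regex tokenisation, and the tiny word list are assumptions; the slides do not say how tokens were counted.

```python
import re


def naive_coverage(text, known_forms):
    """Fraction of word tokens found in `known_forms` (a set of surface
    forms standing in for the morphological analyser)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_forms)
    return known / len(tokens)


# Hypothetical analyser vocabulary and test sentence.
vocabulary = {"eg", "har", "eit", "nytt", "i", "fjorden"}
coverage = naive_coverage("Eg har eit nytt luftputefartøy i fjorden", vocabulary)
```

Here six of seven tokens are known, so one unknown compound already pulls coverage below 90%, which is roughly the situation the slide describes.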
  • 18. WER and BLEU scores in the nb→nn direction
      Word Error Rate, BLEU and Unknown Word Rate on text from government web pages

                    BLEU    WER_O          WER_W          UWR
          Apertium  0.74    32.5 (36.1)    17.7 (50.5)    9.5
          Nyno      0.85    29.1 (34.6)    13.3 (47.3)    0.8

      Table: BLEU score (two reference translations) and WER (for the Original and Wikipedia references). Numbers in parentheses give percentage of unknown words which were free-rides.

      • WER on post-edited Apertium MT output on a Wikipedia article, however, was 10.71% (64.93% free-rides)
      • Coverage seems like the major difference.
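For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between the system output and a reference, divided by the reference length. The Python sketch below shows that standard definition; it is not the evaluation script behind the table, and the reference sentence is invented for the example.

```python
def word_error_rate(hypothesis, reference):
    """Word-level Levenshtein distance divided by the reference length."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1       # reference word missing from output
            insertion = current[j - 1] + 1   # extra word in output
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# MT output from the structural-transfer slide against a made-up reference
# that uses a different auxiliary.
wer = word_error_rate("Om hausten blir fjorden fylt med sild",
                      "Om hausten vert fjorden fylt med sild")
```

Multiplying by 100 gives a percentage comparable in form to the figures in the table.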
  • 21. Future work
      • Compounding
        (3) a. bilkirkegård → bilkyrkjegard
               car.cemetery → car.cemetery
            b. postordrelager → #postordrelagar
               mail.order.storage → mail.order.creator
      • Multi-word expressions
        (4) a. Han anbefalte meg å gå hjem
               he recommended me INF go home
            b. Han rådte meg til å gå heim
               he counseled me to INF go home
               ‘He recommended that I go home’
      • Expanding the Scandinavian language group
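One possible way to approach the compounding problem in example (3) is to split an unknown compound into parts that the bilingual dictionary already knows and translate part by part. The Python sketch below is purely illustrative and is not the approach the authors announce: the dictionary `BIDIX`, the function names, and the longest-first splitting strategy are assumptions, and linking elements such as -s- or -e- are ignored.

```python
# Hypothetical Bokmål -> Nynorsk entries for compound parts.
BIDIX = {
    "bil": "bil",
    "kirkegård": "kyrkjegard",
    "post": "post",
    "ordre": "ordre",
    "lager": "lager",
}


def split_compound(word, lexicon):
    """Return a list of known parts that together spell `word`, trying the
    longest first part first, or None if no full split exists."""
    if word in lexicon:
        return [word]
    for cut in range(len(word) - 1, 0, -1):
        head, tail = word[:cut], word[cut:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None


def translate_compound(word, bidix=BIDIX):
    """Translate each part with the bilingual dictionary and rejoin them."""
    parts = split_compound(word, bidix)
    if parts is None:
        return None
    return "".join(bidix[part] for part in parts)


car_cemetery = translate_compound("bilkirkegård")
mail_order_storage = translate_compound("postordrelager")
unknown = translate_compound("luftputefartøy")
```

Translating the parts through the dictionary keeps "lager" (storage) intact, whereas treating the whole word with a suffix replacement would yield the unwanted #postordrelagar marked on the slide.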
  • 22. Thanks for listening!
  • 23. Licences
      This presentation may be distributed under the terms of the GNU GPL, GNU FDL and CC-BY-SA licences.
      • GNU GPL v. 3.0: http://www.gnu.org/licenses/gpl.html
      • GNU FDL v. 1.2: http://www.gnu.org/licenses/gfdl.html
      • CC-BY-SA v. 3.0: http://creativecommons.org/licenses/by-sa/3.0/