Svetlin Nakov - Improved Word Alignments Using the Web as a Corpus
Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007

Transcript

  • 1. Improved Word Alignments Using the Web as a Corpus <ul><li>Preslav Nakov, University of California, Berkeley </li></ul><ul><li>Svetlin Nakov, Sofia University &quot;St. Kliment Ohridski&quot; </li></ul><ul><li>Elena Paskaleva, Bulgarian Academy of Sciences </li></ul>International Conference RANLP 2007 (Recent Advances in Natural Language Processing)
  • 2. Statistical Machine Translation (SMT) <ul><li>1988 – IBM models 1, 2, 3, 4 and 5 </li></ul><ul><ul><li>Start with bilingual parallel sentence-aligned corpus </li></ul></ul><ul><ul><li>Learn translation probabilities of individual words </li></ul></ul><ul><li>2004 – PHARAOH model </li></ul><ul><ul><li>Learn translation probabilities for phrases </li></ul></ul><ul><ul><li>Alignment template approach – extracts translation phrases from word alignments </li></ul></ul><ul><ul><li>Improved word alignments in sentences improve translation quality! </li></ul></ul>
  • 3. Word Alignments <ul><li>The word alignment problem </li></ul><ul><ul><li>Given a bilingual parallel sentence-aligned corpus, align the words in each sentence with the corresponding words in its translation </li></ul></ul><ul><li>Example English sentence: Try our same day delivery of fresh flowers, roses, and unique gift baskets. </li></ul><ul><li>Example Bulgarian sentence: Опитайте нашите свежи цветя, рози и уникални кошници с подаръци с доставка на същия ден. </li></ul>
  • 4. Word Alignments – Example <ul><li>English: try our same day delivery of fresh flowers roses and unique gift baskets </li></ul><ul><li>Bulgarian: опитайте нашите свежи цветя рози и уникални кошници с подаръци с доставка на същия ден </li></ul>
  • 5. Our Method <ul><li>Use combination of </li></ul><ul><ul><li>Orthographic similarity measure </li></ul></ul><ul><ul><li>Semantic similarity measure </li></ul></ul><ul><ul><li>Competitive linking </li></ul></ul><ul><li>Orthographic similarity measure </li></ul><ul><ul><li>Modified weighted minimum-edit-distance </li></ul></ul><ul><li>Semantic similarity measure </li></ul><ul><ul><li>Analyses the co-occurring words in the local contexts of the target words using the Web as a corpus </li></ul></ul>
  • 6. Orthographic Similarity <ul><li>Minimum Edit Distance Ratio (MEDR) </li></ul><ul><ul><li>MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 to s2 </li></ul></ul><ul><li>Longest Common Subsequence Ratio (LCSR) </li></ul><ul><ul><li>LCS(s1, s2) = the longest common subsequence of s1 and s2 </li></ul></ul>
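The ratio formulas themselves did not survive in the transcript. The sketch below assumes the usual definitions, MEDR = 1 − MED(s1, s2) / max(|s1|, |s2|) and LCSR = |LCS(s1, s2)| / max(|s1|, |s2|); the function names are illustrative and not taken from the slides.

```python
def med(s1, s2):
    """Minimum number of INSERT / REPLACE / DELETE operations turning s1 into s2."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        cur = [i]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # replace a with b (free if equal)
        prev = cur
    return prev[-1]


def lcs_len(s1, s2):
    """Length of the longest common subsequence of s1 and s2."""
    prev = [0] * (len(s2) + 1)
    for a in s1:
        cur = [0]
        for j, b in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if a == b else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def medr(s1, s2):
    # Assumed normalization: divide by the length of the longer string.
    longest = max(len(s1), len(s2))
    return 1.0 - med(s1, s2) / longest if longest else 1.0


def lcsr(s1, s2):
    # Assumed normalization: divide by the length of the longer string.
    longest = max(len(s1), len(s2))
    return lcs_len(s1, s2) / longest if longest else 1.0
```

Under these assumed definitions, the pair първият / первый has MED = 4 and an LCS of length 3 over a maximum length of 7, so both ratios come out at 3/7.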
  • 7. Orthographic Similarity <ul><li>Modified Minimum Edit Distance Ratio (MMEDR) for Bulgarian / Russian </li></ul><ul><ul><li>Normalize the strings </li></ul></ul><ul><ul><li>Assign weights for the edit operations </li></ul></ul><ul><li>Normalizing the strings </li></ul><ul><ul><li>Hand-crafted rules </li></ul></ul><ul><ul><ul><li>Strip the Russian letters &quot;ь&quot; and &quot;ъ&quot; </li></ul></ul></ul><ul><ul><ul><li>Remove the Russian &quot;й&quot; at the endings </li></ul></ul></ul><ul><ul><ul><li>Remove the definite article in Bulgarian (e.g. &quot;ът&quot;, &quot;ят&quot; at the endings) </li></ul></ul></ul>
  • 8. Orthographic Similarity <ul><li>Assigning weights for the edit operations </li></ul><ul><ul><li>0.5-0.9 for vowel-to-vowel substitutions, e.g. 0.5 for е → о </li></ul></ul><ul><ul><li>0.5-0.9 for some consonant-to-consonant substitutions, e.g. с → з </li></ul></ul><ul><ul><li>1.0 for all other edit operations </li></ul></ul><ul><li>Example: Bulgarian първият and Russian первый (first) </li></ul><ul><ul><li>Normalization produces първи and перви, thus MMED = 0.5 (weight 0.5 for ъ → е) </li></ul></ul>
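A minimal sketch of a weighted edit distance in the spirit of MMED follows. The weight table is hypothetical: it contains only a 0.5 cost for substituting between ъ and е, which is the single difference between the normalized forms първи and перви. The full weight table and the normalization rules of the method are not reproduced here.

```python
# Hypothetical weight table: only the one pair needed for the slide's example.
# Every substitution not listed here costs 1.0, as do insertions and deletions.
SUBSTITUTION_WEIGHTS = {frozenset("ъе"): 0.5}


def substitution_cost(a, b):
    """Cost of replacing character a with b (symmetric, 0 for identical characters)."""
    if a == b:
        return 0.0
    return SUBSTITUTION_WEIGHTS.get(frozenset((a, b)), 1.0)


def mmed(s1, s2):
    """Weighted minimum edit distance between two already normalized strings."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [float(i)]
        for j, b in enumerate(s2, 1):
            cur.append(min(prev[j] + 1.0,                           # delete a
                           cur[j - 1] + 1.0,                        # insert b
                           prev[j - 1] + substitution_cost(a, b)))  # weighted replace
        prev = cur
    return prev[-1]


print(mmed("първи", "перви"))
```

With this table, the normalized pair from the slide gets a distance of 0.5, while any pair that differs in an unlisted letter falls back to the ordinary unit cost.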
  • 9. Semantic Similarity <ul><li>What is local context? </li></ul><ul><ul><li>A few words before and after the target word </li></ul></ul><ul><li>The words in the local context of a given word are semantically related to it </li></ul><ul><li>Need to exclude the stop words: prepositions, pronouns, conjunctions, etc. </li></ul><ul><ul><li>Stop words appear in all contexts </li></ul></ul><ul><li>Need a sufficiently large corpus </li></ul><ul><li>Example contexts: Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers. </li></ul>
  • 10. Semantic Similarity <ul><li>Web as a corpus </li></ul><ul><ul><li>The Web can be used as a corpus to extract the local context for a given word </li></ul></ul><ul><ul><ul><li>The Web is the largest possible corpus </li></ul></ul></ul><ul><ul><ul><li>Contains big corpora in any language </li></ul></ul></ul><ul><ul><li>Searching for a word in Google can return up to 1 000 text excerpts </li></ul></ul><ul><ul><ul><li>The target word is given along with its local context: a few words before and after it </li></ul></ul></ul><ul><ul><ul><li>Target language can be specified </li></ul></ul></ul>
  • 11. Semantic Similarity <ul><li>Web as a corpus </li></ul><ul><ul><li>Example: Google query for &quot;flower&quot; </li></ul></ul>Flowers, plants, roses, & gifts. Flowers delivery with fewer ... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again. Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable. Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
  • 12. Semantic Similarity <ul><li>Measuring semantic similarity </li></ul><ul><ul><li>For two given words, their local contexts are extracted from the Web </li></ul></ul><ul><ul><ul><li>A set of words and their frequencies </li></ul></ul></ul><ul><ul><ul><li>Apply lemmatization </li></ul></ul></ul><ul><ul><li>Semantic similarity is measured as the similarity between these local contexts </li></ul></ul><ul><ul><ul><li>Local contexts are represented as frequency vectors over a given set of words </li></ul></ul></ul><ul><ul><ul><li>The cosine between the frequency vectors in the Euclidean space is calculated </li></ul></ul></ul>
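The context extraction step can be sketched as follows. This is an illustration only: the window of three words on each side, the regular-expression tokenizer, and the function name are assumptions, and lemmatization is left out because it needs the lemma dictionaries.

```python
import re
from collections import Counter


def context_counts(excerpts, target, stop_words, window=3):
    """Count the words that occur within `window` tokens of `target` in the excerpts.

    `window=3` is an assumed value; the slides only say "a few words".
    """
    counts = Counter()
    for text in excerpts:
        tokens = re.findall(r"\w+", text.lower())
        for i, token in enumerate(tokens):
            if token != target:
                continue
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            # Stop words appear in all contexts, so they carry no information.
            counts.update(w for w in neighbours
                          if w not in stop_words and w != target)
    return counts


excerpts = ["Same day delivery of fresh flowers, roses, and unique gift baskets."]
print(context_counts(excerpts, "flowers", stop_words={"of", "and"}))
```

In a real run the excerpts would be the text snippets returned by the search engine for the target word, and each token would be mapped to its lemma before counting.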
  • 13. Semantic Similarity <ul><li>Example of context word frequencies (count, word) </li></ul><ul><ul><li>word: flower: 217 fresh, 204 order, 183 rose, 165 delivery, 124 gift, 98 welcome, 87 red, ... </li></ul></ul><ul><ul><li>word: computer: 291 Internet, 286 PC, 252 technology, 185 order, 174 new, 159 Web, 146 site, ... </li></ul></ul>
  • 14. Semantic Similarity <ul><li>Example of frequency vectors (#, word, freq.) </li></ul><ul><ul><li>v1 (flower): 0 alias 3; 1 alligator 2; 2 amateur 0; 3 apple 5; ...; 4999 zap 0; 5000 zoo 6 </li></ul></ul><ul><ul><li>v2 (computer): 0 alias 7; 1 alligator 0; 2 amateur 8; 3 apple 133; ...; 4999 zap 3; 5000 zoo 0 </li></ul></ul><ul><li>Similarity = cosine(v1, v2) </li></ul>
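The cosine between two frequency vectors can be computed as in the sketch below. Storing the vectors as sparse dictionaries and returning 0 for an empty vector are implementation choices made here, and the two example vectors are hypothetical toy data rather than values from the slides.

```python
import math


def cosine(v1, v2):
    """Cosine of the angle between two sparse frequency vectors (word -> count)."""
    dot = sum(freq * v2.get(word, 0) for word, freq in v1.items())
    norm1 = math.sqrt(sum(f * f for f in v1.values()))
    norm2 = math.sqrt(sum(f * f for f in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention: an empty context is similar to nothing
    return dot / (norm1 * norm2)


# Hypothetical toy vectors; only "order" is shared between them.
flower = {"rose": 3, "order": 2}
computer = {"order": 4, "pc": 3}
print(cosine(flower, computer))
```

Because all frequencies are non-negative, the result always lies between 0 (no shared context words) and 1 (identical direction).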
  • 15. Cross-Lingual Semantic Similarity <ul><li>We are given two words in different languages L1 and L2 </li></ul><ul><li>We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2} </li></ul><ul><li>Measuring cross-lingual similarity: </li></ul><ul><ul><li>We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2 </li></ul></ul><ul><ul><li>We translate the context C1 into L2 using the glossary G, obtaining C1* </li></ul></ul><ul><ul><li>We measure the similarity between C1* and C2 </li></ul></ul>
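A sketch of the cross-lingual step, under stated assumptions: the glossary is a plain one-to-one dictionary from L1 to L2, context words missing from the glossary are dropped, and the glossary entries and frequencies in the example are invented for illustration.

```python
import math
from collections import Counter


def translate_context(context, glossary):
    """Map an L1 context vector into L2; words missing from the glossary are dropped."""
    translated = Counter()
    for word, freq in context.items():
        if word in glossary:
            translated[glossary[word]] += freq
    return translated


def cosine(v1, v2):
    dot = sum(freq * v2.get(word, 0) for word, freq in v1.items())
    norm1 = math.sqrt(sum(f * f for f in v1.values()))
    norm2 = math.sqrt(sum(f * f for f in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0


def cross_lingual_similarity(context1, context2, glossary):
    """Similarity between C1 (in L1) and C2 (in L2) via the translated context C1*."""
    return cosine(translate_context(context1, glossary), context2)


# Hypothetical data: Bulgarian context, Russian context, Bulgarian -> Russian glossary.
glossary = {"подарък": "подарок", "доставка": "доставка"}
bg_context = {"подарък": 3, "доставка": 2, "сайт": 5}
ru_context = {"подарок": 4, "доставка": 1}
print(cross_lingual_similarity(bg_context, ru_context, glossary))
```

Dropping out-of-glossary words means the comparison effectively runs over the glossary vocabulary only, so the size of the glossary bounds the dimensionality of the vectors.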
  • 16. Competitive Linking <ul><li>What is competitive linking? </li></ul><ul><ul><li>One-to-one bi-directional word alignment algorithm </li></ul></ul><ul><ul><li>Greedy &quot;best first&quot; approach </li></ul></ul><ul><ul><li>Links the most probable pair first, removes it, and repeats the same for the rest </li></ul></ul>
  • 17. Applying Competitive Linking <ul><li>Make all words lowercase </li></ul><ul><li>Remove punctuation </li></ul><ul><li>Remove the stop words: prepositions, pronouns, conjunctions, etc. </li></ul><ul><ul><li>We don't align them </li></ul></ul><ul><li>Align the most similar pair of words </li></ul><ul><ul><li>Using the orthographic similarity combined with the semantic similarity </li></ul></ul><ul><li>Remove the aligned words </li></ul><ul><li>Align the rest of the sentences </li></ul>
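The procedure above can be sketched as follows. The similarity function is passed in as a parameter, and the `threshold` argument, the tokenizer, and the toy score table are assumptions made for illustration. Sorting all candidate pairs once and skipping words that are already linked gives the same result as repeatedly picking the best remaining pair.

```python
import re


def preprocess(sentence, stop_words):
    """Lowercase, drop punctuation (keeping hyphenated words), remove stop words."""
    tokens = re.findall(r"[\w-]+", sentence.lower())
    return [t for t in tokens if t not in stop_words]


def competitive_linking(src_words, tgt_words, similarity, threshold=0.0):
    """Greedy one-to-one alignment: link the most similar pair first, then the rest."""
    candidates = sorted(
        ((similarity(s, t), i, j)
         for i, s in enumerate(src_words)
         for j, t in enumerate(tgt_words)),
        key=lambda c: -c[0])
    used_src, used_tgt, links = set(), set(), []
    for score, i, j in candidates:
        if score <= threshold:   # assumed cut-off: no link for dissimilar pairs
            break
        if i in used_src or j in used_tgt:
            continue             # one of the two words is already aligned
        links.append((src_words[i], tgt_words[j], score))
        used_src.add(i)
        used_tgt.add(j)
    return links


# Hypothetical similarity scores for a toy example.
scores = {("a", "x"): 0.9, ("a", "y"): 0.8, ("b", "x"): 0.85, ("b", "y"): 0.1}
print(competitive_linking(["a", "b"], ["x", "y"], lambda s, t: scores[(s, t)]))
```

In the toy example the greedy choice links a with x and leaves b with y for a total of 1.0, although linking a with y and b with x would total 1.65. This is the kind of case where maximum weight bipartite matching would produce a different alignment.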
  • 18. Our Method – Example <ul><li>Bulgarian sentence </li></ul><ul><li>Russian sentence </li></ul>Процесът на създаването на такива рефлекси е по-сложен, но същността им е еднаква. Процесс создания таких рефлексов сложнее, но существо то же.
  • 19. Our Method – Example <ul><li>Remove the stop words </li></ul><ul><ul><li>Bulgarian: на, на, такива, е, но, им, е </li></ul></ul><ul><ul><li>Russian: таких, но, то </li></ul></ul><ul><li>Align рефлекси and рефлексов (semantic similarity = 0.989) </li></ul><ul><li>Align по-сложен and сложнее (orthographic similarity = 0.750) </li></ul><ul><li>Align процесът and процесс (orthographic similarity = 0.714) </li></ul><ul><li>Align създаването and создания (orthographic similarity = 0.544) </li></ul><ul><li>Align същността and существо (orthographic similarity = 0.536) </li></ul><ul><li>Not aligned: еднаква </li></ul>
  • 20. Our Method – Example процесът на създаването на такива рефлекси е по-сложен но същността им е еднаква процесс создания таких рефлексов сложнее но существо то же
  • 21. Evaluation <ul><li>We evaluated the following algorithms </li></ul><ul><ul><li>BASELINE: the traditional alignment algorithm (IBM model 4) </li></ul></ul><ul><ul><li>LCSR, MEDR, MMEDR: orthographic similarity algorithms </li></ul></ul><ul><ul><li>WEB-ONLY: semantic similarity algorithm </li></ul></ul><ul><ul><li>WEB-AVG: average of WEB-ONLY and MMEDR </li></ul></ul><ul><ul><li>WEB-MAX: maximum of WEB-ONLY and MMEDR </li></ul></ul><ul><ul><li>WEB-CUT: 1 if MMEDR(s1, s2) >= α (0 < α < 1), or WEB-ONLY(s1, s2) otherwise </li></ul></ul>
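The three combined measures can be written down directly from the definitions above. In this sketch `mmedr` and `web_only` stand for the two similarity scores of a word pair, both assumed to lie in [0, 1]; the default α = 0.62 is the value used in the manual evaluation of WEB-CUT.

```python
def web_avg(mmedr, web_only):
    """WEB-AVG: average of the orthographic and the semantic score."""
    return (mmedr + web_only) / 2


def web_max(mmedr, web_only):
    """WEB-MAX: maximum of the orthographic and the semantic score."""
    return max(mmedr, web_only)


def web_cut(mmedr, web_only, alpha=0.62):
    """WEB-CUT: 1 if MMEDR >= alpha (0 < alpha < 1), otherwise the WEB-ONLY score."""
    return 1.0 if mmedr >= alpha else web_only


print(web_cut(0.70, 0.20), web_cut(0.50, 0.20))
```

WEB-CUT treats strong orthographic evidence as decisive and falls back to the Web-based score only for pairs that do not look alike.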
  • 22. Testing Data and Experiments <ul><li>Testing data set </li></ul><ul><ul><li>A corpus of 5 827 parallel sentences </li></ul></ul><ul><ul><ul><li>Training set: 4 827 sentences </li></ul></ul></ul><ul><ul><ul><li>Tuning set: 500 sentences </li></ul></ul></ul><ul><ul><ul><li>Testing set: 500 sentences </li></ul></ul></ul><ul><li>Experiments </li></ul><ul><ul><li>Manual evaluation of WEB-CUT </li></ul></ul><ul><ul><li>AER for competitive linking </li></ul></ul><ul><ul><li>Translation quality: BLEU / NIST </li></ul></ul>
  • 23. Manual Evaluation of WEB-CUT <ul><li>Aligned the texts of the testing data set </li></ul><ul><ul><li>Used competitive linking and WEB-CUT for α = 0.62 </li></ul></ul><ul><ul><li>Obtained 14,246 distinct word pairs </li></ul></ul><ul><li>Manually evaluated the aligned pairs as: </li></ul><ul><ul><li>Correct </li></ul></ul><ul><ul><li>Rough (considered incorrect) </li></ul></ul><ul><ul><li>Wrong (considered incorrect) </li></ul></ul><ul><li>Calculated precision and recall </li></ul><ul><ul><li>For the case MMEDR < 0.62 </li></ul></ul>
  • 24. Manual Evaluation of WEB-CUT <ul><li>Precision-recall curve </li></ul>
  • 25. Evaluation of Alignment Error Rate <ul><li>Gold standard for alignment </li></ul><ul><ul><li>For the first 100 sentences </li></ul></ul><ul><ul><li>Created manually by a linguist </li></ul></ul><ul><ul><li>Stop words and punctuation were removed </li></ul></ul><ul><li>Evaluated the alignment error rate (AER) for competitive linking </li></ul><ul><ul><li>Evaluated for all the algorithms </li></ul></ul><ul><ul><li>LCSR, MEDR, MMEDR, WEB-ONLY, WEB-AVG, WEB-MAX and WEB-CUT </li></ul></ul>
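The slides do not spell out the AER formula. The sketch below assumes the commonly used definition, AER = 1 − (|A ∩ S| + |A ∩ P|) / (|A| + |S|), where A is the predicted alignment, S the sure gold links and P the possible gold links. When the gold standard makes no sure/possible distinction, P is taken to be S. Links are represented as (source index, target index) pairs, which is a choice made here.

```python
def aer(predicted, sure, possible=None):
    """Alignment error rate of `predicted` links against a gold standard.

    Assumed definition: 1 - (|A & S| + |A & P|) / (|A| + |S|), with S a subset of P.
    """
    a = set(predicted)
    s = set(sure)
    p = s if possible is None else set(possible) | s
    if not a and not s:
        return 0.0  # assumed convention: nothing to align, nothing wrong
    return 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))


# Hypothetical links: two of three predicted links agree with the gold standard.
gold = {(0, 0), (1, 1), (2, 2)}
predicted = {(0, 0), (1, 1), (2, 3)}
print(aer(predicted, gold))
```

Lower values are better: a perfect alignment scores 0 and an alignment with no correct links scores 1.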
  • 26. Evaluation of Alignment Error Rate <ul><li>AER for competitive linking </li></ul>
  • 27. Evaluation of Translation Quality <ul><li>Built a Russian → Bulgarian statistical machine translation (SMT) system </li></ul><ul><ul><li>Extracted from the training set the distinct word pairs aligned with competitive linking </li></ul></ul><ul><ul><li>Added them twice as additional “sentence” pairs to the training corpus </li></ul></ul><ul><ul><li>Trained a log-linear model for SMT with standard feature functions </li></ul></ul><ul><ul><ul><li>Used minimum error rate training on the tuning set </li></ul></ul></ul><ul><ul><li>Evaluated BLEU and NIST scores on the testing set </li></ul></ul>
  • 28. Evaluation of Translation Quality <ul><li>Translation quality: BLEU </li></ul>
  • 29. Evaluation of Translation Quality <ul><li>Translation quality: NIST </li></ul>
  • 30. Resources <ul><li>We used the following resources: </li></ul><ul><ul><li>Bulgarian-Russian parallel corpus: 5 827 sentences </li></ul></ul><ul><ul><li>Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words </li></ul></ul><ul><ul><li>A list of 599 Bulgarian / 508 Russian stop words </li></ul></ul><ul><ul><li>Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata </li></ul></ul><ul><ul><li>Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata </li></ul></ul>
  • 31. Conclusion and Future Work <ul><li>Conclusion </li></ul><ul><ul><li>Semantic similarity extracted from the Web can improve statistical machine translation </li></ul></ul><ul><ul><li>For similar languages like Bulgarian and Russian, orthographic similarity is useful </li></ul></ul><ul><li>Future Work </li></ul><ul><ul><li>Improve MMED with automatically learned rules </li></ul></ul><ul><ul><li>Improve the semantic similarity algorithm </li></ul></ul><ul><ul><ul><li>Filter parasite words like &quot;site&quot;, &quot;click&quot;, etc. </li></ul></ul></ul><ul><ul><li>Replace competitive linking with maximum weight bipartite matching </li></ul></ul>
  • 32. Questions? Improved Word Alignments Using the Web as a Corpus
