• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Svetlin Nakov - Cognate or False Friend? Ask the Web!
 

Svetlin Nakov - Cognate or False Friend? Ask the Web!

on

  • 1,377 views

Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!, Proceedings of the International Workshop "Acquisition and Management of Multilingual Lexicons", part of the International ...

Nakov S., Nakov P., Paskaleva E., Cognate or False Friend? Ask the Web!, Proceedings of the International Workshop "Acquisition and Management of Multilingual Lexicons", part of the International Conference RANLP 2007, pp. 55-62, ISBN 978-954-452-004-5, Borovets, Bulgaria, 30 September 2007

Statistics

Views

Total Views
1,377
Views on SlideShare
1,371
Embed Views
6

Actions

Likes
0
Downloads
10
Comments
0

1 Embed 6

http://www.slideshare.net 6

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Svetlin Nakov - Cognate or False Friend? Ask the Web! Svetlin Nakov - Cognate or False Friend? Ask the Web! Presentation Transcript

    • Cognate or False Friend? Ask the Web!
      • Svetlin Nakov, Sofia University "St. Kliment Ohridski"
      • Preslav Nakov, University of California, Berkeley
      • Elena Paskaleva, Bulgarian Academy of Sciences
      A Workshop on Acquisition and Management of Multilingual Lexicons
    • Introduction
      • Cognates and false friends
        • Cognates are pair of words in different languages that sound similar and are translations of each other
        • F alse friends are pairs of words in two languages that sound similar but differ in their meaning s
      • The problem
        • Design an algorithm that can distinguish between cognates and false friends
    • Cognates and False Friends
      • Examples of cognates
        • ден in Bulgarian = день in Russian (day )
        • idea in English = идея in Bulgarian (idea)
      • Examples of false friends
        • майка in Bulgarian (mother) ≠ майка in Russian ( vest )
        • prost in German (cheers) ≠ прост in Bulgarian (stupid)
        • gift in German (poison) ≠ gift in English (present)
    • The Paper in One Slide
      • Measuring semantic similarity
        • Analyze the words local contexts
        • Use the Web as a corpus
        • Similarities contexts  similar words
        • Context translation  cross-lingual similarity
      • Evaluation
        • 200 pairs of words
          • 100 cognates and 100 false friends
        • 11pt average precision: 95.84%
    • Contextual Web Similarity
      • What is local context ?
        • Few words before and after the target word
      • The words in the local context of given word are semantically related to it
      • Need to exclude the stop words : prepositions, pronouns, conjunctions, etc.
        • Stop words appear in all contexts
      • Need of sufficiently big corpus
      Same day delivery of fresh flowers , roses, and unique gift baskets from our online boutique . Flower delivery online by local florists for birthday flowers .
    • Contextual Web Similarity
      • Web as a corpus
        • The Web can be used as a corpus to extract the local context for given word
          • The Web is the largest possible corpus
          • Contains big corpora in any language
        • Searching some word in Google can return up to 1 000 excerpts of texts
          • The target word is given along with its local context: few words before and after it
          • Target language can be specified
    • Contextual Web Similarity
      • Web as a corpus
        • Example: Google query for " flower "
      Flowers, plants, roses, & gifts. Flower s delivery with fewer ... Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flower s delivery from florists again. Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ... Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS ECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable. Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ... Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
    • Contextual Web Similarity
      • Measuring semantic similarity
        • For given two words their local contexts are extracted from the Web
          • A set of words and their frequencies
        • Semantic similarity is measured as similarity between these local contexts
          • Local contexts are represented as frequency vectors for given set of words
          • Cosine between the frequency vectors in the Euclidean space is calculated
    • Contextual Web Similarity
      • Example of context words frequencies
      word: flower word: computer 183 rose 165 delivery 124 gift 98 welcome 217 fresh 204 order 87 red ... ... count word 252 technology 185 order 174 new 159 Web 291 Internet 286 PC 146 site ... ... count word
    • Contextual Web Similarity
      • Example of frequency vectors
      • Similarity = cosine(v 1 , v 2 )
      v 1 : flower v 2 : computer 5000 4999 ... 3 2 1 0 # 0 amateur 5 apple ... ... 3 alias 2 alligator 0 zap 6 zoo freq. word 5000 4999 ... 3 2 1 0 # 8 amateur 133 apple ... ... 7 alias 0 alligator 3 zap 0 zoo freq. word
    • Cross-Lingual Similarity
      • We are given two words in different languages L 1 and L 2
      • We have a bilingual glossary G of translation pairs {p ∈ L 1 , q ∈ L 2 }
      • Measuring cross-lingual similarity:
        • We extract the local contexts of the target words from the Web: C 1 ∈ L 1 and C 2 ∈ L 2
        • We translate the context
        • We measure distance between C 1 * and C 2
      C 1 * C 1 G
    • Reverse Context Lookup
      • Local context extracted from the Web can contain arbitrary parasite words like " online ", " home ", " search ", " click ", etc.
        • Internet terms appear in any Web page
        • Such words are not likely to be associated with the target word
      • Example (for the word flowers )
        • " send flowers online ", " flowers here " , " order flowers here "
      • Will the word " flowers " appear in the local context of " send ", " online " and " here " ?
    • Reverse Context Lookup
      • If two words are semantically related both should appear in the local contexts of each other
      • Let #{x,y} = number of occurrences of x in the local context of y
      • For any word w and a word from its local context w c , we define their strength of semantic association p(w,w c ) as follows:
        • p(w, w c ) = min{ #(w, w c ), #( w c ,w) }
      • We use p(w,wc) as vector coordinates when measuring semantic similarity
    • Web Similarity Using Seed Words
      • Adaptation of the Fung&Yee'98 algorithm*
        • We have a bilingual glossary G: L 1  L 2 of translation pairs and target words w 1 , w 2
        • We search in Google the co-occurrences of the target words with the glossary entries
        • Compare the co-occurrence vectors
          • for each {p,q} ∈ G compare
          • max (google#("w 1 p") and google#("p w 1 "))
          • with
          • max (google#"w 2 q") and google#("q w 2 "))
      * P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
    • Evaluation Data Set
      • We use 200 Bulgarian/Russian pairs of words:
        • 100 cognates and 100 false friends
      • Manually assembled by a linguist
      • Manually checked in several large monolingual and bilingual dictionaries
      • Limited to nouns only
    • Experiments
      • We tested few modifications of our contextual Web similarity algorithm
        • Use of TF.IDF weighting
        • Preserve the stop words
        • Use of lemmatization of the context words
        • Use different context size (2, 3, 4 and 5)
        • Use small and large bilingual glossary
      • Compared it with the seed words algorithm
      • Compared with traditional orthographic similarity measures: LCSR and MEDR
    • Experiments
      • BASELINE : random
      • MEDR : minimum edit distance ratio
      • LCSR : longest common subsequence ration
      • SEED: the "seed words" algorithm
      • WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting
      • NO-STOP: WEB3 without stop words removal
      • WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
      • LEMMA: WEB3 with lemmatization
      • HUGEDICT: WEB3 with the huge glossary
      • REVERSE: the "reverse context lookup" algorithm
      • COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup
    • Resources
      • We used the following resources:
        • Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words
        • Huge bilingual glossary: 59 583 word pairs
        • A list of 599 Bulgarian stop words
        • A list of 508 Russian stop words
        • Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata
        • Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
    • Evaluation
      • We order the pairs of words from the testing dataset by the calculated similarity
        • False friends are expected to appear on the top and the cognates on the bottom
        • We evaluate the 11pt average precision of the obtained ordering
    • Results (11pt Average Precision) Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms
    • Results (11pt Average Precision) Comparing different context sizes; keeping the stop words
    • Results (11pt Average Precision) Comparing different improvements of the WEB3 algorithm
    • Results (Precision-Recall Graph) Comparing the recall-precision graphs of evaluated algorithms
    • Results: The Ordering for WEB3 100.00% 50.00% yes 0,9684 beauty beauty красота 200 100.00% 50.25% yes 0,9171 flora flora флора 199 100.00% 50.51% yes 0,9028 science science наука 198 100.00% 50.76% yes 0,8916 silver silver сребро / серебро 197 100.00% 51.28% yes 0,8017 finance finance финанси / финансы 19 6 … … … … … … … … 83.00% 82.18% no 0,2130 rubble leg бут 101 82.00% 82.00% no 0,2101 time year година 100 81.00% 81.82% yes 0,2099 volcano volcano вулкан 99 … … … … … … … … 5.00% 100.00% no 0,0182 whip hedge плет / плеть 5 4.00% 100.00% no 0,0175 crud chill мраз / мразь 4 3.00% 100.00% no 0,0143 income livestock добитък / добыток 3 2.00% 100.00% no 0,0130 gaff mottle багрене / багренье 2 1.00% 100.00% no 0,0085 muff gratis муфта 1 R@ r [email_address] Cogn.? Sim. RU Sense BG Sense Candidate r
    • Discussion
      • Our approach is original because:
        • Introduces semantic similarity measure
          • Not orthographic or phonetic
        • Uses the Web as a corpus
          • Does not rely on any preexisting corpora
        • Uses reverse-context lookup
          • Significant improvement in quality
        • Is applied to original problem
          • Classification of almost identically spelled true/false friends
    • Discussion
      • Very good accuracy: over 95%
      • It is not 100% accurate
        • Typical mistakes are synonyms, hyponyms, words influenced by cultural, historical and geographical differences
        • The Web as a corpus introduces noise
        • Google returns the first 1 000 results only
        • Google ranks higher news portals, travel agencies and retail sites than books, articles and forums posts
        • Local context could contains noise
    • Conclusion and Future Work
      • Conclusion
        • Algorithm that can distinguish between cognates and false friends
        • Analyzes words local contexts, using the Web as a corpus
      • Future Work
        • Better glossaries
        • Automatic augmenting the glossary
        • Different language pairs
    • Questions ? Cognate or False Friend? Ask the Web!