SpSim

This talk presents SpSim, a new string similarity measure for identifying cognates. It tolerates characteristic spelling differences that are automatically extracted from a set of cognates known a priori.

Talk given at EPIA 2011, October 10, 2011, Lisboa

Transcript

  • 1. Measuring Spelling Similarity for Cognate Identification. Luís Gomes, Faculdade de Ciências e Tecnologia da Universidade Nova de Lisboa. EPIA 2011, October 10, 2011, Lisboa
  • 2. What are cognates? In linguistics, cognates are words that have a common etymological origin. (Wikipedia)
  • 3. What are cognates? In linguistics, cognates are words that have a common etymological origin. (Wikipedia) Example: the words etymology (English) and etimologia (Portuguese) both derive from Greek etymología through Latin etymologia.
  • 4. What are cognates? I am particularly interested in cognates from different languages that retain the same meaning.
  • 5. What are cognates? I am particularly interested in cognates from different languages that retain the same meaning, such as:
    German      symbole    themen   operative
    English     symbols    themes   operational
    French      symboles   thèmes   opérationnelle
    Spanish     símbolos   temas    operativa
    Portuguese  símbolos   temas    operacional
    Italian     simboli    temi     operativa
  • 6. What are cognates? I am particularly interested in cognates from different languages that retain the same meaning, such as:
    German      demokratische  aspekte   justiz
    English     democratic     aspects   justice
    French      démocratique   aspects   justice
    Spanish     democrática    aspectos  justicia
    Portuguese  democrática    aspectos  justiça
    Italian     democratica    aspetti   giustizia
  • 7. Extracting cognates from parallel corpora. Example parallel sentences:
    Portuguese: Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .
    English: The Member States shall coordinate their economic policies within the Union .
  • 8. Extracting cognates from parallel corpora. Example parallel sentences:
    Portuguese: Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .
    English: The Member States shall coordinate their economic policies within the Union .
    Spelling similarity: cognates typically have similar spellings.
  • 9. Extracting cognates from parallel corpora. Example parallel sentences:
    Portuguese: Os Estados - Membros coordenam as suas políticas económicas no âmbito da União .
    English: The Member States shall coordinate their economic policies within the Union .
    Spelling similarity: cognates typically have similar spellings.
    Association: translations tend to co-occur systematically in parallel texts, while non-translations co-occur by chance.
  • 10. Extracting cognates from parallel corpora.
    Spelling similarity measures: EDSim (Edit-Distance-based Similarity), LCSR (Longest Common Subsequence Ratio), and a few others.
    Association measures: Dice, SCP (Symmetric Conditional Probability), Mutual Information, and many others.
  • 11. Extracting cognates from parallel corpora. Most commonly used spelling similarity measures:
    \( \mathrm{EDSim}(w_1, w_2) = 1 - \frac{\mathrm{ED}(w_1, w_2)}{\max(|w_1|, |w_2|)} \), where \( \mathrm{ED}(w_1, w_2) \) is the Edit Distance between words \( w_1 \) and \( w_2 \).
    \( \mathrm{LCSR}(w_1, w_2) = \frac{\mathrm{LCS}(w_1, w_2)}{\max(|w_1|, |w_2|)} \), where \( \mathrm{LCS}(w_1, w_2) \) is the length of the Longest Common Subsequence between words \( w_1 \) and \( w_2 \).
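The two measures on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are chosen here, the edit distance is assumed to be the unit-cost Levenshtein distance, and returning 1.0 for two empty strings is an assumption made only to avoid dividing by zero.

```python
def edit_distance(w1: str, w2: str) -> int:
    """Unit-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(w2) + 1))
    for i, c1 in enumerate(w1, start=1):
        curr = [i]
        for j, c2 in enumerate(w2, start=1):
            curr.append(min(
                prev[j] + 1,               # delete c1
                curr[j - 1] + 1,           # insert c2
                prev[j - 1] + (c1 != c2),  # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]


def lcs_length(w1: str, w2: str) -> int:
    """Length of the longest common subsequence of w1 and w2."""
    prev = [0] * (len(w2) + 1)
    for c1 in w1:
        curr = [0]
        for j, c2 in enumerate(w2, start=1):
            if c1 == c2:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def edsim(w1: str, w2: str) -> float:
    """EDSim = 1 - ED / max(|w1|, |w2|)."""
    longest = max(len(w1), len(w2))
    if longest == 0:  # assumption: two empty strings count as identical
        return 1.0
    return 1 - edit_distance(w1, w2) / longest


def lcsr(w1: str, w2: str) -> float:
    """LCSR = LCS / max(|w1|, |w2|)."""
    longest = max(len(w1), len(w2))
    if longest == 0:  # assumption: two empty strings count as identical
        return 1.0
    return lcs_length(w1, w2) / longest


if __name__ == "__main__":
    # "\u00e1" is the single code point for the accented a in the Portuguese word
    print(edsim("photographic", "fotogr\u00e1fica"))
    print(lcsr("photographic", "fotogr\u00e1fica"))
```

For the pair photographic / fotográfica this gives an edit distance of 6 and a longest common subsequence of length 7 over a maximum length of 12, so EDSim is 0.5 and LCSR is about 0.58. Both dynamic programs run in time proportional to the product of the two word lengths.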
  • 12. Extracting cognates from parallel corpora. Problem with these measures: they look at strings too literally!
    EDSim(photographic, fotográfica) = 0.5
    LCSR(photographic, fotográfica) = 0.58
    The spelling similarity score should be closer to 1.0 to reflect human judgement.
  • 13. How does SpSim work?
  • 14. How does SpSim work? First we align the two strings to find differences. This takes \( O(|w_1| \, |w_2|) \) time, just like computing \( \mathrm{ED}(w_1, w_2) \) or \( \mathrm{LCS}(w_1, w_2) \).
    ^ ph o t o g r aph i c   $
    ^ f  o t o g r áf  i c o $
  • 15. How does SpSim work? First we align the two strings to find differences. This takes \( O(|w_1| \, |w_2|) \) time, just like computing \( \mathrm{ED}(w_1, w_2) \) or \( \mathrm{LCS}(w_1, w_2) \).
    ^ ph o t o g r aph i c   $
    ^ f  o t o g r áf  i c o $
    Then we check which differences we may ignore: Is “ph → f” in the hash table? Is “aph → áf” in the hash table? Is “ → o” (an insertion) in the hash table?
  • 16. How does SpSim work? First we align the two strings to find differences. This takes \( O(|w_1| \, |w_2|) \) time, just like computing \( \mathrm{ED}(w_1, w_2) \) or \( \mathrm{LCS}(w_1, w_2) \).
    ^ ph o t o g r aph i c   $
    ^ f  o t o g r áf  i c o $
    Then we check which differences we may ignore: Is “ph → f” in the hash table? Is “aph → áf” in the hash table? Is “ → o” (an insertion) in the hash table?
    In learning mode we would insert these differences into the hash table instead.
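The align, look up, and learn steps on slides 14 to 16 can be sketched as follows. This is an illustration under stated assumptions rather than the talk's implementation: Python's `difflib.SequenceMatcher` stands in for the alignment step (it is not guaranteed to return a minimal edit-distance alignment), a plain `set` plays the role of the hash table, and the names `differences`, `learn` and `unexplained` are hypothetical.

```python
from difflib import SequenceMatcher


def differences(w1: str, w2: str) -> list[tuple[str, str]]:
    """Align the two words and return the mismatching spans as (source, target) pairs.

    The words are wrapped in ^ and $ boundary markers as on the slides.
    Assumes the words themselves contain neither ^ nor $.
    """
    a, b = f"^{w1}$", f"^{w2}$"
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep replace, delete and insert spans only
    ]


def learn(known: set[tuple[str, str]], w1: str, w2: str) -> None:
    """Learning mode: store every difference of a known cognate pair."""
    known.update(differences(w1, w2))


def unexplained(known: set[tuple[str, str]], w1: str, w2: str) -> list[tuple[str, str]]:
    """Scoring mode: return the differences that are NOT in the table."""
    return [d for d in differences(w1, w2) if d not in known]


if __name__ == "__main__":
    table: set[tuple[str, str]] = set()
    print(differences("photographic", "fotogr\u00e1fico"))
    learn(table, "photographic", "fotogr\u00e1fico")
    print(unexplained(table, "graphic", "gr\u00e1fico"))
```

For photographic / fotográfico the extracted differences are the three from the slide: ph against f, aph against áf, and an inserted o. After learning from that single pair, every difference in graphic / gráfico is already in the table, so nothing is left unexplained.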
  • 17. How does SpSim work? Finally, we compute SpSim:
    \( \mathrm{SpSim}(w_1, w_2) = 1 - \frac{\sum_i D_i}{|w_1| + |w_2|} \)
    \( D_i \) is the length of each difference that cannot be ignored. If no difference is ignored, then \( \mathrm{SpSim}(w_1, w_2) = \mathrm{EDSim}(w_1, w_2) \).
  • 18. How does SpSim work? Problem: over-generalization. Some differences, such as “insert an o in the Portuguese word”, are too vague and may occur by chance (i.e., even if the words are totally unrelated).
    ^ ph o t o g r aph i c   $
    ^ f  o t o g r áf  i c o $
  • 19. How does SpSim work? Solution: contextualize first and generalize afterwards. Contextualized differences are less likely to occur by chance. Example: insert an “o” at the end of the Portuguese word if the English word ends with a “c”.
    ^pho t o g raphi c$
    ^fo  t o g ráfi  co$
    Whenever we find the same difference in a different context we may generalize it.
    ^pha s e $
    ^fa  s e $
    “^pho → ^fo” + “^pha → ^fa” = “^ph → ^f”
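The contextualize-then-generalize idea on slide 19 can be sketched as below. Several things here are assumptions made for illustration: one character of context is kept on each side of a difference (as the slide's segmentation suggests), `difflib.SequenceMatcher` stands in for the alignment step, and the `Diff` type and the `contextualized` and `generalize` functions are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import NamedTuple, Optional


class Diff(NamedTuple):
    left: str   # shared character before the difference ("" once generalized away)
    src: str    # span in the first word
    tgt: str    # span in the second word
    right: str  # shared character after the difference ("" once generalized away)

    def __str__(self) -> str:
        return f"{self.left}{self.src}{self.right} → {self.left}{self.tgt}{self.right}"


def contextualized(w1: str, w2: str) -> list[Diff]:
    """Differences between the words, each with one character of context per side.

    The ^ and $ markers always align with each other, so every difference has
    a matching character on both sides. Assumes the words contain no ^ or $.
    """
    a, b = f"^{w1}$", f"^{w2}$"
    ops = SequenceMatcher(None, a, b, autojunk=False).get_opcodes()
    return [
        Diff(a[i1 - 1], a[i1:i2], b[j1:j2], a[i2])
        for tag, i1, i2, j1, j2 in ops
        if tag != "equal"
    ]


def generalize(d1: Diff, d2: Diff) -> Optional[Diff]:
    """Merge two sightings of the same difference, keeping only the context they share."""
    if (d1.src, d1.tgt) != (d2.src, d2.tgt):
        return None  # different substitutions cannot be merged
    left = d1.left if d1.left == d2.left else ""
    right = d1.right if d1.right == d2.right else ""
    return Diff(left, d1.src, d1.tgt, right)


if __name__ == "__main__":
    first = contextualized("photographic", "fotogr\u00e1fico")[0]
    second = contextualized("phase", "fase")[0]
    print(first, "+", second, "=", generalize(first, second))
```

With the slide's two examples, the word-initial difference is seen once followed by o and once followed by a. The two sightings agree on the left context (the ^ marker) but not on the right, so the generalized pattern keeps only the word-start context, as in the slide's “^ph → ^f”.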
  • 20. Experimental setup. Corpora: I used a parallel corpus of texts from the European Constitution in five language pairs. Method:
    1. Obtain a list of putative cognates by thresholding an association measure (Dice).
    2. Manually verify all putative cognates.
    3. Compare the precision, recall and f-measure of SpSim and EDSim for a series of different threshold values.
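Step 3 of the method amounts to sweeping a threshold over similarity scores and comparing the accepted pairs with the manual verdicts. The sketch below shows one way to do that; the function name, the toy scores, and the choice to accept a pair when its score is greater than or equal to the threshold are all assumptions, and none of the numbers come from the talk.

```python
def evaluate(scored_pairs, threshold):
    """Return (precision, recall, f_measure) at the given threshold.

    scored_pairs: iterable of (similarity_score, is_true_cognate) tuples, where
    is_true_cognate is the manual verification verdict.
    Assumption: a pair is accepted as a cognate when score >= threshold.
    """
    tp = fp = fn = 0
    for score, is_true_cognate in scored_pairs:
        accepted = score >= threshold
        if accepted and is_true_cognate:
            tp += 1
        elif accepted:
            fp += 1
        elif is_true_cognate:
            fn += 1
    # Conventions for empty denominators are a choice made for this sketch.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical scores and verdicts, for illustration only.
    toy = [(0.9, True), (0.8, True), (0.6, False), (0.5, True), (0.2, False)]
    for t in (0.1, 0.3, 0.55, 0.85):
        print(t, evaluate(toy, t))
```

Running the same sweep once with EDSim scores and once with SpSim scores over the verified list gives the curves that the comparison slides plot against the threshold.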
  • 21. Experimental setup. I used Dice to extract the initial list of putative cognates:
    \( \mathrm{Dice}(x, y) = \frac{2\,F(x, y)}{F(x) + F(y)} \)
    \( F(x, y) \) is the number of co-occurrences in all parallel segments \( s_i \): \( F(x, y) = \sum_i \min(f(x, s_i), f(y, s_i)) \)
    \( F(x) \) and \( F(y) \) are the total number of occurrences of \( x \) and \( y \) in all parallel segments \( s_i \): \( F(x) = \sum_i f(x, s_i) \); \( F(y) = \sum_i f(y, s_i) \)
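The Dice definition on slide 21 translates directly into code. In this sketch each parallel segment is assumed to be a pair of token lists (source side, target side); that representation, the function name, and the toy segments are illustrative choices, not taken from the talk.

```python
def dice(x: str, y: str, segments) -> float:
    """Dice(x, y) = 2 F(x, y) / (F(x) + F(y)) over parallel segments.

    segments: iterable of (source_tokens, target_tokens) pairs of lists.
    x is counted on the source side and y on the target side.
    """
    f_xy = f_x = f_y = 0
    for source_tokens, target_tokens in segments:
        fx = source_tokens.count(x)  # f(x, s_i)
        fy = target_tokens.count(y)  # f(y, s_i)
        f_xy += min(fx, fy)          # co-occurrences within this segment
        f_x += fx
        f_y += fy
    if f_x + f_y == 0:  # assumption: words that never occur score 0
        return 0.0
    return 2 * f_xy / (f_x + f_y)


if __name__ == "__main__":
    # Hypothetical three-segment corpus, for illustration only.
    toy = [
        (["a", "b"], ["x", "y"]),
        (["a"], ["x"]),
        (["b", "c"], ["y"]),
    ]
    print(dice("a", "x", toy))  # always co-occur
    print(dice("a", "y", toy))  # co-occur in one segment out of two each
```

In the toy corpus, a and x appear in exactly the same segments and score 1.0, while a and y share only one of their two occurrences each and score 0.5.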
  • 22. Experimental setup. Extracted all pairs of words with Dice ≥ 0.6. Summary of extraction and manual verification:
    Language Pair       Accepted  Rejected  Total
    German-English           269       878   1147
    English-Spanish          399       749   1148
    English-French           380       825   1205
    English-Portuguese       410       796   1206
    French-Italian           635       974   1609
  • 23. Comparing SpSim to EDSim. [Two plots of precision, recall and f-measure against threshold (0 to 1) for EDSim and SpSim. English–Portuguese: SpSim learned from 16 examples (edsim > 0.9). English–German: SpSim learned from 4 examples (edsim > 0.9).]
  • 24. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 57 examples (edsim > 0.8). English–German: SpSim learned from 25 examples (edsim > 0.8).]
  • 25. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 140 examples (edsim > 0.7). English–German: SpSim learned from 46 examples (edsim > 0.7).]
  • 26. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 202 examples (edsim > 0.6). English–German: SpSim learned from 61 examples (edsim > 0.6).]
  • 27. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 246 examples (edsim > 0.5). English–German: SpSim learned from 75 examples (edsim > 0.5).]
  • 28. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 267 examples (edsim > 0.4). English–German: SpSim learned from 89 examples (edsim > 0.4).]
  • 29. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 299 examples (edsim > 0.3). English–German: SpSim learned from 106 examples (edsim > 0.3).]
  • 30. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 346 examples (edsim > 0.2). English–German: SpSim learned from 147 examples (edsim > 0.2).]
  • 31. Comparing SpSim to EDSim. [Same plots. English–Portuguese: SpSim learned from 380 examples (edsim > 0.1). English–German: SpSim learned from 219 examples (edsim > 0.1).]
  • 32. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.9: EN-ES 18; EN-FR 31; EN-PT 16; DE-EN 4; FR-IT 14.
  • 33. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.8: EN-ES 103; EN-FR 93; EN-PT 57; DE-EN 25; FR-IT 124.
  • 34. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.7: EN-ES 168; EN-FR 149; EN-PT 140; DE-EN 46; FR-IT 251.
  • 35. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.6: EN-ES 203; EN-FR 181; EN-PT 202; DE-EN 61; FR-IT 362.
  • 36. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.5: EN-ES 244; EN-FR 220; EN-PT 246; DE-EN 75; FR-IT 449.
  • 37. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.4: EN-ES 255; EN-FR 234; EN-PT 267; DE-EN 89; FR-IT 502.
  • 38. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.3: EN-ES 286; EN-FR 260; EN-PT 299; DE-EN 106; FR-IT 538.
  • 39. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.2: EN-ES 329; EN-FR 301; EN-PT 346; DE-EN 147; FR-IT 581.
  • 40. Comparing SpSim to EDSim. Panels by language pair, examples with edsim > 0.1: EN-ES 368; EN-FR 343; EN-PT 380; DE-EN 219; FR-IT 622.
  • 41. Conclusions. SpSim learns fast.
  • 42. Conclusions. SpSim learns fast. SpSim has much better recall than EDSim (and LCSR).
  • 43. Conclusions. SpSim learns fast. SpSim has much better recall than EDSim (and LCSR). SpSim has the same time complexity as EDSim and LCSR.
  • 44. Conclusions. SpSim learns fast. SpSim has much better recall than EDSim (and LCSR). SpSim has the same time complexity as EDSim and LCSR. SpSim is almost as easy to implement as EDSim or LCSR.
  • 45. Thanks for listening. Questions?