The precision of existing extractors still does not match the quality of handcrafted resources, which motivates the development of novel extraction methods. In this work we develop several similarity measures for semantic relation extraction. The main research question we address is how to improve the precision and coverage of such measures. First, we perform a large-scale study of the baseline techniques. Second, we propose four novel measures: one of them significantly outperforms the baselines, and the others perform comparably to state-of-the-art techniques. Finally, we successfully apply one of the novel measures in two text processing systems.

- 1. Similarity Measures for Semantic Relation Extraction. Montclair State University, Brown Bag Seminar (USA). Alexander Panchenko, Université catholique de Louvain & Digital Society Laboratory LLC, alexander.panchenko@uclouvain.be. May 2, 2014. Sections: The Problem; Pattern-Based Measure; Comparison; Hybrid Measures; Applications.
- 2. Plan: (1) The Context and the Problem; (2) Pattern-Based Semantic Similarity Measure; (3) Comparison of Similarity Measures; (4) Hybrid Semantic Similarity Measures; (5) Applications of Semantic Similarity Measures.
- 3. Plan (expanded): part 5 covers the Lexico-Semantic Search Engine “Serelex” and the Filename Categorization System “iCOP”.
- 4. Computational Lexical Semantics. [Figure: picture adapted from the Computational Linguistics LINGI2263 course, http://www.uclouvain.be/en-cours-2013-LINGI2263.html]
- 5. Introduction.
  - Motivation: (1) synonyms, hypernyms and co-hyponyms are useful for text similarity (Šarić et al., 2012), query expansion (Hsu et al., 2006) and question answering (Sun et al., 2005); (2) manual resource construction is prohibitively expensive; (3) automatic extractors do not yet match the quality of handcrafted resources.
  - Focus: similarity-based semantic relation extraction.
  - Research question: how to improve the precision and coverage of such measures?
- 6. Semantic Resources. Definition: a semantic resource is an undirected graph $(C, R)$, where the nodes $C$ represent terms and the edges $R$ represent untyped semantic relations.
- 7. Semantic Relation Extractors. We study extractors based on two components: (1) semantic similarity measures; (2) nearest-neighbor procedures. [Figure: semantic relation extractor, with the components Feature Extractor (text-based data and terms $C$, features $F$), Similarity Measure and Normalizer (similarity scores $S$), and kNN Procedure (semantic relations $R$).]
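The two components named on slide 7 can be illustrated with a small sketch. It is not the author's implementation: the function names `normalize` and `knn_relations`, the min-max normalizer, the parameter `k` and the toy scores are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def normalize(scores: Dict[Pair, float]) -> Dict[Pair, float]:
    """Rescale raw scores to [0, 1] (min-max; an assumed choice of normalizer)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {pair: 0.0 for pair in scores}
    return {pair: (s - lo) / (hi - lo) for pair, s in scores.items()}


def knn_relations(scores: Dict[Pair, float], k: int) -> List[Pair]:
    """Keep, for every term, its k most similar terms as untyped relations."""
    neighbours: Dict[str, List[Tuple[float, str]]] = {}
    for (ci, cj), s in scores.items():
        # The similarity is symmetric, so register the pair in both directions.
        neighbours.setdefault(ci, []).append((s, cj))
        neighbours.setdefault(cj, []).append((s, ci))
    relations = set()
    for term, candidates in neighbours.items():
        for s, other in sorted(candidates, reverse=True)[:k]:
            if s > 0:  # a zero score means "unrelated", so it yields no edge
                relations.add(tuple(sorted((term, other))))
    return sorted(relations)


# Toy raw scores (hypothetical values, for illustration only).
raw = {("doctor", "nurse"): 8.0, ("doctor", "surgeon"): 6.0,
       ("doctor", "statute"): 0.5, ("nurse", "surgeon"): 4.0}
relations = knn_relations(normalize(raw), k=1)
print(relations)
```

With `k=1` every term keeps only its single nearest neighbour, so the resulting graph stays sparse; a larger `k` trades precision for coverage.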
- 8. Semantic Similarity Measures. Definition: a semantic similarity measure quantifies the semantic relatedness of the input terms $c_i, c_j$ with the similarity score $s_{ij} = sim(c_i, c_j)$:

  $$s_{ij} = \begin{cases} \text{high} & \text{if } \langle c_i, c_j \rangle \text{ is a pair of synonyms, hypernyms or co-hyponyms} \\ 0 & \text{otherwise} \end{cases}$$

  Properties:
  - Nonnegativity: $0 \le s_{ij} \le 1$;
  - Reflexivity: $s_{ij} = 1 \Leftrightarrow c_i = c_j$;
  - Symmetry: $s_{ij} = s_{ji}$;
  - Triangle inequality: $s_{ij} \le s_{ik} + s_{kj}$.
- 9. Semantic Similarity Measures. Many dissimilar pairs, few similar pairs: $s_{ij} \sim \exp(\lambda)$. [Figure: similarity distribution of the term “doctor”.]
- 10. Evaluation of Semantic Similarity Measures.
  - (1) Correlations with human judgments. Criteria: Pearson correlation (ρ) and Spearman correlation (r). Datasets: MC, RG, WordSim.
  - (2) Semantic relation ranking. Criteria: Precision, Recall, F-measure. Datasets: BLESS, SN.
  - (3) Semantic relation extraction. Criterion: Precision@k. Data: annotation and/or dictionaries.
  - (4) Application-based evaluation: short text classification system (iCOP); lexico-semantic search engine (Serelex).
  - Source: Panchenko A. Similarity Measures for Semantic Relation Extraction. PhD thesis, Université catholique de Louvain, 197 pages, 2013 (Chapter 1).
- 11. Correlations with human judgments.
- 12. Semantic Relation Ranking. Precision over the top 50% of the ranked candidates (six of the top seven are correct): $P(k = 50\%) = \frac{6}{7} \approx 0.86$.

  | word, $c_i$ | word, $c_j$ | relation type | $s_{ij}$ |
  |---|---|---|---|
  | aficionado | enthusiast | syn | 0.07197 |
  | aficionado | fan | syn | 0.05195 |
  | aficionado | admirer | syn | 0.01964 |
  | aficionado | addict | syn | 0.01326 |
  | aficionado | devotee | syn | 0.01163 |
  | aficionado | foundling | random | 0.00777 |
  | aficionado | fanatic | syn | 0.00414 |
  | aficionado | adherent | syn | 0.00353 |
  | aficionado | capital | random | 0.00232 |
  | aficionado | statute | random | 0.00029 |
  | aficionado | blot | random | 0.00025 |
  | aficionado | meddler | random | 0.00005 |
  | aficionado | enlargement | random | 0.00003 |
  | aficionado | bawdyhouse | random | 0.00000 |
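The precision value reported on slide 12 can be reproduced with a few lines of Python. This is an illustrative sketch: the helper name `precision_at_percent` is hypothetical, and reading k as a percentage of the ranked list is an assumption, chosen because it reproduces the reported value of about 0.86.

```python
from typing import List, Tuple

# (candidate word, relation type, similarity score) for the target word
# "aficionado", copied from the table on slide 12.
ranked: List[Tuple[str, str, float]] = [
    ("enthusiast", "syn", 0.07197), ("fan", "syn", 0.05195),
    ("admirer", "syn", 0.01964), ("addict", "syn", 0.01326),
    ("devotee", "syn", 0.01163), ("foundling", "random", 0.00777),
    ("fanatic", "syn", 0.00414), ("adherent", "syn", 0.00353),
    ("capital", "random", 0.00232), ("statute", "random", 0.00029),
    ("blot", "random", 0.00025), ("meddler", "random", 0.00005),
    ("enlargement", "random", 0.00003), ("bawdyhouse", "random", 0.00000),
]


def precision_at_percent(rows: List[Tuple[str, str, float]], percent: float) -> float:
    """Share of non-random relations among the top `percent`% of rows by score."""
    rows = sorted(rows, key=lambda row: row[2], reverse=True)
    top = rows[: round(len(rows) * percent / 100)]
    if not top:
        return 0.0
    correct = sum(1 for _, relation, _ in top if relation != "random")
    return correct / len(top)


p50 = precision_at_percent(ranked, 50)
print(f"P(k = 50%) = {p50:.2f}")
```

The top half of the list holds seven candidates, of which only “foundling” is a random pair, hence 6/7.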
- 14. Related publications.
  - This work stems from: Hearst M. A. Automatic acquisition of hyponyms from large text corpora. In ACL, pp. 539–545, 1992.
  - Selected publication: Panchenko A., Morozova O., Naets H. A Semantic Similarity Measure Based on Lexico-Syntactic Patterns. In Proceedings of KONVENS 2012, pp. 174–178, Vienna (Austria), 2012.
  - Selected publication: Panchenko A., Romanov P., Morozova O., Naets H., Philippovich A., Fairon C. Serelex: Search and Visualization of Semantically Related Words. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013), Moscow (Russia), 2013.
- 15. A live demo: http://serelex.cental.be/
- 16. Lexico-syntactic patterns: 18 patterns that extract hypernyms, co-hyponyms and synonyms.
- 17. Patterns are encoded as FSTs. Finite State Transducers (FSTs); open-source corpus processing tool Unitex: http://igm.univ-mlv.fr/~unitex/
- 18. A pattern encoded as an FST. FST patterns take linguistic variation into account, unlike string-based patterns (Bollegala et al., 2007).
- 19. Patterns extract concordances:
  - such diverse {[occupations]} as {[doctors]}, {[engineers]} and {[scientists]}[PATTERN=1]
  - such {non-alcoholic [sodas]} as {[root beer]} and {[cream soda]}[PATTERN=1]
  - {traditional[food]}, such as {[sandwich]},{[burger]}, and {[fry]}[PATTERN=2]
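The annotated concordances on slide 19 can be read back with plain regular expressions. The sketch below is only an illustration: it assumes that curly braces delimit a phrase, that square brackets inside a phrase mark the extracted term, and that the trailing `[PATTERN=n]` tag gives the pattern number. The function name `parse_concordance` is hypothetical and is not part of Unitex or Serelex.

```python
import re
from typing import List, Tuple

PHRASE_RE = re.compile(r"\{([^{}]*)\}")       # text between curly braces
TERM_RE = re.compile(r"\[([^\[\]]*)\]")       # text between square brackets
PATTERN_RE = re.compile(r"\[PATTERN=(\d+)\]")  # trailing pattern tag


def parse_concordance(line: str) -> Tuple[int, List[str]]:
    """Return (pattern number, bracketed terms) for one annotated concordance."""
    tag = PATTERN_RE.search(line)
    if tag is None:
        raise ValueError("no [PATTERN=n] tag found")
    terms = []
    for phrase in PHRASE_RE.findall(line):
        term = TERM_RE.search(phrase)
        if term:
            terms.append(term.group(1))
    return int(tag.group(1)), terms


line = "such {non-alcoholic [sodas]} as {[root beer]} and {[cream soda]}[PATTERN=1]"
pattern_id, terms = parse_concordance(line)
print(pattern_id, terms)
```

Because the pattern tag sits outside any curly braces, it is never mistaken for a term.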
- 20. Corpus. Wikipedia+ukWaC: $2.9 \cdot 10^{12}$ tokens. Extracted concordances: Wikipedia: 1,196,468; ukWaC: 2,227,025; WaCypedia+ukWaC: 3,423,493.
- 21. Reranking formula Efreq-Rnum-Cfreq-Pnum:

  $$s_{ij} = \sqrt{p_{ij}} \cdot \frac{2 \cdot \mu_b}{b_{i*} + b_{*j}} \cdot \frac{P(c_i, c_j)}{P(c_i) P(c_j)}$$

  where
  - $P(c_i, c_j) = \frac{e_{ij}}{\sum_{ij} e_{ij}}$ is the extraction probability of the pair $\langle c_i, c_j \rangle$, and $e_{ij}$ is the frequency of co-occurrence of $c_i$ and $c_j$ in the concordances $K$;
  - $P(c_i) = \frac{f_i}{\sum_i f_i}$ is the probability of the term $c_i$, and $f_i$ is the frequency of $c_i$;
  - $b_{i*} = \sum_{j: e_{ij} \ge \beta} 1$ is the number of extractions for the term $c_i$ with frequency $\ge \beta$;
  - $\mu_b = \frac{1}{|C|} \sum_{i=1}^{|C|} b_{i*}$ is the average number of extractions per term;
  - $p_{ij} \in [1; 18]$ is the number of distinct patterns which extracted the relation $\langle c_i, c_j \rangle$.
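The reranking formula on slide 21 can be written out directly. The sketch below is a hedged illustration, not the PatternSim code: the function name `rerank`, the dictionary-based inputs, the default `beta` and the toy counts are assumptions. It also assumes that every pair is stored once and treated as unordered, so that the count for a term serves as both $b_{i*}$ and $b_{*j}$, and that $|C|$ is the number of terms with a known frequency.

```python
import math
from collections import Counter
from typing import Dict, Tuple

Pair = Tuple[str, str]


def rerank(e: Dict[Pair, int], p: Dict[Pair, int], f: Dict[str, int],
           beta: int = 1) -> Dict[Pair, float]:
    """Efreq-Rnum-Cfreq-Pnum score for every extracted pair.

    e[pair]: extraction frequency, p[pair]: number of distinct patterns,
    f[term]: term frequency, beta: minimum frequency counted as an extraction.
    """
    total_e = sum(e.values())
    total_f = sum(f.values())
    # b[term]: number of extractions involving the term with frequency >= beta.
    b: Counter = Counter()
    for (ci, cj), freq in e.items():
        if freq >= beta:
            b[ci] += 1
            b[cj] += 1
    mu_b = sum(b.values()) / len(f)  # average number of extractions per term
    scores = {}
    for (ci, cj), freq in e.items():
        if b[ci] + b[cj] == 0:
            continue  # no extraction passed the beta threshold for this pair
        p_joint = freq / total_e
        p_i, p_j = f[ci] / total_f, f[cj] / total_f
        scores[(ci, cj)] = (math.sqrt(p[(ci, cj)])
                            * 2 * mu_b / (b[ci] + b[cj])
                            * p_joint / (p_i * p_j))
    return scores


# Toy counts (hypothetical, for illustration only).
e = {("fruit", "apple"): 4, ("fruit", "pear"): 2}
p = {("fruit", "apple"): 4, ("fruit", "pear"): 1}
f = {"fruit": 10, "apple": 5, "pear": 5}
scores = rerank(e, p, f)
print(scores)
```

The three factors reward, in order, pairs found by many distinct patterns, terms with few extractions overall, and pairs that co-occur more often than their individual frequencies would suggest.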
- 22. Semantic Relation Ranking. Precision is comparable to or better than that of the baselines; recall is lower than that of the baselines. [Figure: Precision-Recall graphs (the BLESS dataset).]
- 23. Semantic Relation Extraction. Precision@1 ≈ 0.80; “good” coverage.
- 25. Related publications.
  - Panchenko A. A Study of Heterogeneous Similarity Measures for Semantic Relation Extraction. In JEP-TALN-RECITAL 2012, Grenoble (France), 2012.
  - Panchenko A. Similarity Measures for Semantic Relation Extraction. PhD thesis, Université catholique de Louvain, 197 pages, 2013 (Chapters 2.1, 3.1).
- 26. Compared Semantic Similarity Measures. 37 distinct measures. Q1: Are the measures complementary? Q2: If yes, in which respects?
- 27. The Best Single Measures (MC, RG, WordSim, BLESS, SN). Each one extracts many co-hyponyms, e.g.: ⟨Canon, Nikon⟩, ⟨Lamborghini, Ferrari⟩, ⟨Obama, Romney⟩.
- 28. Further Results. Most dissimilar measures. [Figure: 21 measures grouped according to their relation distributions.] Measures are complementary with respect to: lexical coverage; performance; types of semantic relations they extract.
- 29. Implementation of the baseline measures.
  - Semantic Vectors: https://code.google.com/p/semanticvectors/
  - S-Space Package: https://code.google.com/p/airhead-research/
  - WordNet::Similarity: http://wn-similarity.sourceforge.net
  - NLTK: http://nltk.googlecode.com/svn/trunk/doc/howto/wordnet.html
  - WikiRelate!
  - PatternSim / Serelex: http://serelex.cental.be
  - Web-based metrics: http://cwl-projects.cogsci.rpi.edu/msr
  - LSA: http://lsa.colorado.edu
- 31. Related publications.
  - Panchenko A., Morozova O. A Study of Hybrid Similarity Measures for Semantic Relation Extraction. Innovative Hybrid Approaches to the Processing of Textual Data Workshop, EACL 2012, Avignon (France), 2012, pp. 10–18.
  - Panchenko A. Similarity Measures for Semantic Relation Extraction. PhD thesis, Université catholique de Louvain, 197 pages, 2013 (Chapter 4).
  - Panchenko A. A Study of Heterogeneous Similarity Measures for Semantic Relation Extraction. In JEP-TALN-RECITAL 2012, Grenoble (France), 2012, pp. 29–42.
- 32. Hybrid vs Single Measures. [Figure: semantic relation extractor based on (a) a single similarity measure (terms $C$, features, $sim_i$, norm, knn, relations $R$); (b) a hybrid similarity measure ($sim_1, \ldots, sim_N$, norm, combination method producing $S_{cmb}$, knn, relations $R$).]
- 33. 16 Features = 16 Single Similarity Measures.
  - 5 network-based measures: (1) WuPalmer; (2) Leacock and Chodorow; (3) Resnik; (4) Jiang and Conrath; (5) Lin.
  - 3 web-based measures (NGD-Yahoo/Bing/Google).
  - 5 corpus-based measures: 2 distributional (BDA, SDA); 1 based on lexico-syntactic patterns (PatternSim); 2 other co-occurrence-based (LSA, NGD-Factiva).
  - 3 definition-based measures: (1) ExtendedLesk; (2) GlossVectors; (3) DefVectors-WktWiki.
- 34. Unsupervised Combination Methods.
  - 1. Mean: $s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} s^{k}_{ij}$;
  - 2. Mean-Nnz: $s^{cmb}_{ij} = \frac{1}{|\{k : s^{k}_{ij} > 0,\ k = 1, \ldots, K\}|} \sum_{k=1}^{K} s^{k}_{ij}$;
  - 3. Mean-Zscore: $S^{cmb} = \frac{1}{K} \sum_{k=1}^{K} \frac{S^{k} - \mu_k}{\sigma_k}$;
  - 4. Median: $s^{cmb}_{ij} = \mathrm{median}(s^{1}_{ij}, \ldots, s^{K}_{ij})$;
  - 5. Max: $s^{cmb}_{ij} = \max(s^{1}_{ij}, \ldots, s^{K}_{ij})$;
  - 6. RankFusion: $s^{cmb}_{ij} = \frac{1}{K} \sum_{k=1}^{K} r^{k}_{ij}$;
  - 7. RelationFusion (Panchenko and Morozova, 2012).
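Methods 1 to 6 on slide 34 are simple enough to sketch with the standard library. The code below is an illustration under stated assumptions, not the author's implementation: the function names, the dictionary representation of a measure and the toy scores are hypothetical; Mean-Zscore uses the population standard deviation; and RankFusion assumes that $r^{k}_{ij}$ is the rank of the pair under measure $k$, with rank 1 for the most similar pair. RelationFusion is not covered.

```python
import statistics
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Measure = Dict[Pair, float]  # one similarity measure: pair -> score


def combine_pointwise(measures: List[Measure], how: str) -> Measure:
    """Mean, Mean-Nnz, Median or Max of the K scores of every pair."""
    combined = {}
    for pair in measures[0]:
        s = [m[pair] for m in measures]
        if how == "mean":
            combined[pair] = sum(s) / len(s)
        elif how == "mean-nnz":
            nnz = sum(1 for v in s if v > 0)  # number of non-zero scores
            combined[pair] = sum(s) / nnz if nnz else 0.0
        elif how == "median":
            combined[pair] = statistics.median(s)
        elif how == "max":
            combined[pair] = max(s)
        else:
            raise ValueError(f"unknown method: {how}")
    return combined


def mean_zscore(measures: List[Measure]) -> Measure:
    """Average of the measures after standardizing each one (Mean-Zscore)."""
    combined = {pair: 0.0 for pair in measures[0]}
    for m in measures:
        values = list(m.values())
        mu = statistics.mean(values)
        sigma = statistics.pstdev(values) or 1.0  # guard against constant measures
        for pair in combined:
            combined[pair] += (m[pair] - mu) / sigma / len(measures)
    return combined


def rank_fusion(measures: List[Measure]) -> Measure:
    """Average rank of every pair; rank 1 is the most similar pair (assumption)."""
    combined = {pair: 0.0 for pair in measures[0]}
    for m in measures:
        ordered = sorted(m, key=m.get, reverse=True)
        for rank, pair in enumerate(ordered, start=1):
            combined[pair] += rank / len(measures)
    return combined


# Two toy measures over the same three pairs (hypothetical scores).
a = {("cat", "dog"): 0.8, ("cat", "car"): 0.0, ("dog", "car"): 0.2}
b = {("cat", "dog"): 0.6, ("cat", "car"): 0.2, ("dog", "car"): 0.0}
mean_scores = combine_pointwise([a, b], "mean")
nnz_scores = combine_pointwise([a, b], "mean-nnz")
ranks = rank_fusion([a, b])
```

Note that with this reading of RankFusion a lower combined value means a more similar pair, whereas for the other methods a higher value does.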
- 35. Supervised Combination Methods. 8. Logit, Logit-L1, Logit-L2: a binary logistic regression.
  - Positive examples: synonyms, hyponyms, co-hyponyms from BLESS/SN.
  - Negative examples: random relations from BLESS/SN.
  - A relation $\langle c_i, t, c_j \rangle \in R$ is represented with a vector of pairwise similarities: $x = (s^{1}_{ij}, \ldots, s^{N}_{ij})$, $N = 2, \ldots, 16$.
  - Category $y_{ij}$:

    $$y_{ij} = \begin{cases} 0 & \text{if } \langle c_i, t, c_j \rangle \text{ is a random relation} \\ 1 & \text{otherwise} \end{cases}$$

  - Using the model $(w_1, \ldots, w_K)$ for combination: $s^{cmb}_{ij} = \frac{1}{1 + e^{-z}}$, $z = \sum_{k=1}^{K} w_k s^{k}_{ij} + w_0$.
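Applying a fitted Logit model, as in the last formula on slide 35, takes only a few lines. This is a minimal sketch: the function name `logit_combine` and the weights below are hypothetical and not taken from the source. A real model would be fitted on BLESS/SN relations as the slide describes.

```python
import math
from typing import Sequence


def logit_combine(s: Sequence[float], w: Sequence[float], w0: float) -> float:
    """s_cmb = 1 / (1 + exp(-z)), with z = sum_k w_k * s_k + w0."""
    z = sum(wk * sk for wk, sk in zip(w, s)) + w0
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights for K = 3 single measures (not from the source).
w, w0 = [2.0, 1.0, 3.0], -2.0
related = logit_combine([0.9, 0.5, 0.7], w, w0)    # high single scores
unrelated = logit_combine([0.0, 0.0, 0.0], w, w0)  # all single scores are zero
print(round(related, 3), round(unrelated, 3))
```

The output always lies strictly between 0 and 1, so the combined score can be ranked and thresholded like any single normalized measure.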
- 36. Supervised Combination Methods. 9. SVM. The weights $w$ and the support vectors $SV$: $w = \sum_{x_i \in SV} \alpha_i y_i x_i$. Using the model: $s^{cmb}_{ij} = w^{T} x + b = \sum_{k=1}^{K} w_k s^{k}_{ij} + b$.
- 37. Hybrid Similarity Measures. [Figure: Precision-Recall graphs calculated on the BLESS dataset: (a) 16 single measures and the best hybrid measure Logit-E15; (b) 8 hybrid measures.]
- 38. Hybrid Similarity Measure Logit-E15. [Figure: similarity scores between 74 words related to the word “acacia”.]
- 39. Supervised Hybrid Similarity Measures.
- 40. Supervised Hybrid Similarity Measures (cont.). [Figure: meta-parameter optimization with the grid search of the C-SVM-radial-E15 measure.]
- 43. Lexico-Semantic Search Engine “Serelex”. Related publications.
  - Panchenko A., Romanov P., Morozova O., Naets H., Philippovich A., Fairon C. Serelex: Search and Visualization of Semantically Related Words. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013), Moscow (Russia), 2013.
  - Panchenko A., Naets H., Brouwers L., Romanov P., Fairon C. Recherche et visualisation de mots sémantiquement liés. Actes de la 20e conférence sur le Traitement Automatique des Langues Naturelles (TALN’2013), Les Sables d’Olonne (France), pp. 747–754, 2013.
- 44. Serelex. Search for Related Words: the List and the Graph. http://serelex.cental.be/
- 45. Serelex. Search for Related Words: the List and the Graph.
- 46. Serelex. Search for Related Words: the Images.
- 47. Serelex. Evaluation of Serelex. [Figure: users’ satisfaction with the top 20 results.]
- 49. Filename Categorization System “iCOP”. Related publications.
  - Panchenko A., Naets H., Beaufort R., Fairon C. Towards Detection of Child Sexual Abuse Media: Classification of the Associated Filenames. In Proceedings of the 35th European Conference on Information Retrieval (ECIR 2013). LNCS 7814, pp. 776–779. Springer-Verlag Berlin Heidelberg, 2013.
  - Panchenko A., Beaufort R., Fairon C. Detection of Child Sexual Abuse Media on P2P Networks: Normalization and Classification of Associated Filenames. In Proceedings of the Workshop on Language Resources for Public Security Applications of the 8th International Conference on Language Resources and Evaluation (LREC), 2012.
- 50. iCOP. Short text classification with Vocabulary Projection.
- 51. iCOP. Evaluation of the Vocabulary Projection.

  | Training Dataset | Test Dataset | Accuracy | Accuracy (voc. projection) |
  |---|---|---|---|
  | Gallery (train) | Gallery | 96.41 | 96.83 (+0.42) |
  | PirateBay Title+Desc+Tags | PirateBay Title+Desc+Tags | 98.92 | 98.86 (-0.06) |
  | PirateBay Title+Tags | PirateBay Title+Tags | 97.73 | 97.63 (-0.10) |
  | Gallery | PirateBay Title+Desc+Tags | 90.57 | 91.48 (+0.91) |
  | Gallery | PirateBay Title+Tags | 84.23 | 88.89 (+4.66) |
  | PirateBay Title+Desc+Tags | Gallery | 88.83 | 89.04 (+0.21) |
  | PirateBay Title+Tags | Gallery | 91.16 | 91.30 (+0.14) |

  Table: Performance of a linear C-SVM classifier (10-fold cross-validation).
- 52. Thank you! Questions?

No public clipboards found for this slide

Be the first to comment