Presentation Transcript

  • Introduction The Method Evaluation Conclusion References
    A Graph-Based Approach to Skill Extraction from Text
    Higher School of Economics, School of Applied Mathematics and Information Science, Nizhny Novgorod, Russia
    Ilkka Kivimäki (1), Alexander Panchenko (4, 2), Adrien Dessy (1, 2), Dries Verdegem (3), Pascal Francq (1), Cédrick Fairon (2), Hugues Bersini (3) and Marco Saerens (1)
    alexander.panchenko@uclouvain.be
    (1) ICTEAM and (2) CENTAL, Université catholique de Louvain, Belgium
    (3) IRIDIA, Université libre de Bruxelles, Belgium
    (4) Digital Society Laboratory LLC, Russia
    December 18, 2013
  • Table of Contents
    1 Expertise retrieval and skill extraction
    2 The Elisit system for skill extraction
        Overview of the system
        Sample Queries
        Association with Wikipedia
        Spreading activation in Wikipedia
    3 Evaluation of system
    4 Conclusion and future work
  • Reference paper: Kivimäki I., Panchenko A., Dessy A., Verdegem D., Francq P., Bersini H. and Saerens M. “A Graph-Based Approach to Skill Extraction from Text”. In Proceedings of the 8th Workshop TextGraphs-8: Graph-based Methods for Natural Language Processing. EMNLP 2013: Conference on Empirical Methods in Natural Language Processing. Seattle, USA, October 18-21, 2013. http://aclweb.org/anthology/W/W13/W13-5011.pdf
  • Expertise retrieval [Balog et al., 2012]: Expertise Retrieval vs. Expertise Seeking
    Expertise retrieval: linking humans to expertise areas, and vice versa, from a system-centered perspective. It has primarily focused on identifying good topical matches between a need for expertise on the one hand and the content of documents associated with candidate experts on the other.
    Expertise seeking: linking humans to expertise areas from a human-centered perspective. It has been investigated mainly in the field of knowledge management, where the goal is to utilize human knowledge within an organization as well as possible.
  • Expertise retrieval [Balog et al., 2012]: Expert Profiling vs. Expert Seeking
    Person: a set of (text) documents generated by an individual.
    Expertise: a keyword or a keyphrase specifying a field of knowledge, e.g. “Machine Learning”, “Hadoop”, “NLP”.
    Expert profiling: given a person, retrieve (profile) their expertise. Person → Expertise
    Expert retrieval: given an expertise, retrieve persons with that expertise. Expertise → Person
  • Expertise Retrieval: Earlier Work
    TREC Enterprise Track [Balog et al., 2008]
    State-of-the-art overview [Balog et al., 2012]
    A skill extraction system [Crow and DeSanto, 2004]
    Skill extraction system [Skomoroch et al., 2012]
    Expertise retrieval in universities [Balog et al., 2007]
    Expert finding on DBLP data [Deng et al., 2008]
    e-Human Resource Management system [Biesalski, 2003]
  • Expertise Retrieval: Earlier Work
    Skill extraction system [Skomoroch et al., 2012] http://www.freepatentsonline.com/20120197863.pdf
  • Expertise Retrieval: Applications
    Expertise management systems
    Knowledge management in enterprises
    Employee profiling
    Reviewer selection for articles
    Recommendation systems of: jobs; job applicants; websites, blog texts, articles
  • Expertise retrieval [two figure-only slides; figures not preserved]
  • Skill extraction: We focus on skill extraction from texts, i.e. associating skills with text documents.
  • Overview of the system: The Elisit system for skill extraction
    Original goal of the system: associate professional skills with people based on the texts they produce (emails, blogs, forums, articles, etc.).
    Tools: a list of skills extracted from LinkedIn. The skills are linked to the corresponding Wikipedia pages.
    Method:
    1 Find Wikipedia pages relevant to a target document.
    2 Use spreading activation on Wikipedia’s hyperlink network to find skills that are “close” or “central” to these relevant pages.
  • Overview of the system: Skill extraction using Wikipedia [figure not preserved]
  • Overview of the system: Example [three figure-only slides; figures not preserved]
  • Overview of the system: Size of the problem
    Our current version of English Wikipedia consists of
    n = 3 983 338 encyclopedia entries
    m = 247 560 469 links
    27 513 of the encyclopedia entries correspond to LinkedIn skills.
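To give a feel for these figures, the following Python sketch derives a few quantities from them. The storage estimates assume 8-byte weights and 4-byte indices in a CSR-like layout; those sizes are illustrative assumptions, not details taken from the slides.

```python
# Figures quoted on the slide.
n_pages = 3_983_338      # encyclopedia entries (nodes of the link graph)
n_links = 247_560_469    # hyperlinks (edges of the link graph)
n_skills = 27_513        # entries that correspond to LinkedIn skills

avg_links_per_page = n_links / n_pages        # mean number of links per entry
skill_share = n_skills / n_pages              # fraction of entries that are skills
density = n_links / (n_pages * n_pages)       # share of non-zero adjacency entries

# Assumed storage costs (not from the slides): 8-byte floats for weights,
# 4-byte integers for column indices and row pointers.
dense_bytes = n_pages * n_pages * 8
sparse_bytes = n_links * (8 + 4) + (n_pages + 1) * 4

print(f"average links per page: {avg_links_per_page:.1f}")
print(f"share of skill pages:   {skill_share:.2%}")
print(f"adjacency density:      {density:.2e}")
print(f"dense matrix:           {dense_bytes / 1e12:.0f} TB")
print(f"sparse matrix:          {sparse_bytes / 1e9:.1f} GB")
```

Under these assumptions a dense adjacency matrix is far out of reach, while a sparse one fits in the memory of a single machine, which is consistent with the use of a sparse matrix library.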
  • Overview of the system: Implementation
    For computing the similarities between the target document and all Wikipedia pages, we use the Gensim library [Řehůřek and Sojka, 2010]. This part of the Elisit system is called the text2wiki module. It is currently the bottleneck of the computation.
    For performing spreading activation, we use the sparse matrix library of SciPy. This part is called the wiki2skill module.
  • Overview of the system: The Elisit system. At the moment not fully functional...
  • Sample Queries: Popular Article about Natural Language Understanding [two slides; figures not preserved]
  • Sample Queries: Blog Article about SEO Marketing [two slides; figures not preserved]
  • Sample Queries: Wikipedia Article about Geo Information Systems [two slides; figures not preserved]
  • Sample Queries: Scientific Article about Graph Mining [two slides; figures not preserved]
  • Sample Queries: Try it...
    Elisit Web Interface: http://elisit.cental.be/
    Elisit Web Service: http://elisit.cental.be:8080/
    This is only a demo: it is not optimized for multiple-user queries, high load, fast response, etc.
  • Association with Wikipedia
    1. Find Wikipedia pages relevant to a target document.
    We compute the similarity between the input document and all Wikipedia pages. We tried four different models:
    1 TF-IDF (300,000 dimensions)
    2 LogEntropy (300,000 dimensions)
    3 LogEntropy + LSA (200 dimensions)
    4 LogEntropy + LDA (200 topics)
    ⇒ The target document is represented as a semantic vector of size n, the number of Wikipedia pages (inspired by ESA [Gabrilovich and Markovitch, 2007]).
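The idea of representing a document by its similarity to every Wikipedia page can be sketched in plain Python. This is only an illustration under stated assumptions: the three toy "pages", the whitespace tokenizer, the smoothed IDF formula and all function names are hypothetical, and the actual system uses Gensim over the full Wikipedia rather than this code.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokenizer (assumption): lowercase and split on whitespace."""
    return text.lower().split()


def tfidf_vectors(token_lists):
    """Return one sparse TF-IDF vector (a dict) per document, plus the IDF table."""
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    # Smoothed IDF variant chosen for this sketch: log(N / df) + 1.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in token_lists
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_vector(document, pages):
    """Map a document to one similarity score per page (ESA-style)."""
    titles = list(pages)
    page_vectors, idf = tfidf_vectors([tokenize(pages[t]) for t in titles])
    doc_counts = Counter(tokenize(document))
    # Terms unseen in the page collection carry no weight.
    doc_vector = {t: tf * idf[t] for t, tf in doc_counts.items() if t in idf}
    return {title: cosine(doc_vector, vec) for title, vec in zip(titles, page_vectors)}


# Hypothetical miniature "Wikipedia" with three pages.
pages = {
    "Machine learning": "machine learning algorithms learn models from data",
    "Search engine optimization": "search engine optimization improves website ranking in search results",
    "Geographic information system": "geographic information system stores and analyses spatial map data",
}

scores = semantic_vector("we train learning algorithms on labelled data", pages)
best = max(scores, key=scores.get)
print(best, scores)
```

The resulting dictionary plays the role of the size-n semantic vector: one entry per page, later usable as the initial activation for the second step.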
  • Spreading activation in Wikipedia
    2. Use Wikipedia’s hyperlink network to find skills that are “close” or “central” to these relevant pages.
  • Spreading activation in Wikipedia
    Formalization of spreading activation by Shrager et al. [1987]: if $\mathbf{a}(0)$ is a vector of initial activations, then after each time step $t$ the vector of activations is
    $$\mathbf{a}(t) = \gamma\,\mathbf{a}(t-1) + \lambda\,\mathbf{W}^{\mathsf{T}}\mathbf{a}(t-1) + \mathbf{c}(t)$$
    Parameters:
    $T$, the number of time steps
    $\gamma \in [0, 1]$, a decay factor
    $\lambda \in [0, 1]$, a friction factor
    $\mathbf{c}(t)$, an activation source vector
    The link weight $w_{ij}$, an element of $\mathbf{W}$, determines the amount of activation that is spread from $i$ to $j$.
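The update rule above can be illustrated with a small pure-Python sketch. The dictionary-of-dictionaries graph layout, the function names and the three-node chain are assumptions made for this example; the slides state that the real system relies on SciPy sparse matrices instead.

```python
def spread_step(weights, activation, gamma, lam, source):
    """One update a(t) = gamma * a(t-1) + lam * W^T a(t-1) + c(t).

    `weights[i][j]` holds w_ij, the share of activation sent from i to j.
    """
    new = [gamma * a_i + c_i for a_i, c_i in zip(activation, source)]
    for i, out_links in weights.items():
        for j, w_ij in out_links.items():
            # W^T a: node j collects w_ij * a_i from every in-neighbour i.
            new[j] += lam * w_ij * activation[i]
    return new


def spread(weights, initial, steps, gamma, lam, source_fn):
    """Run `steps` updates starting from the initial activation vector."""
    activation = list(initial)
    for t in range(1, steps + 1):
        activation = spread_step(weights, activation, gamma, lam, source_fn(t))
    return activation


# Hypothetical three-node chain 0 -> 1 -> 2 with unit link weights.
chain = {0: {1: 1.0}, 1: {2: 1.0}, 2: {}}
no_source = lambda t: [0.0, 0.0, 0.0]   # c(t) = 0 at every step

result = spread(chain, [1.0, 0.0, 0.0], steps=2, gamma=0.5, lam=0.5, source_fn=no_source)
print(result)
```

With these toy settings half of each node's activation stays in place and half moves one link forward per step, so after two steps some activation has reached node 2.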
  • Spreading activation in Wikipedia
    $$\mathbf{a}(t) = \gamma\,\mathbf{a}(t-1) + \lambda\,\mathbf{W}^{\mathsf{T}}\mathbf{a}(t-1) + \mathbf{c}(t)$$
    Thorough model selection is difficult because of the size of the problem. We experimented with three versions of the model:
    model 1: $\mathbf{a}(t) = \mathbf{W}^{\mathsf{T}}\mathbf{a}(t-1)$
    model 2: $\mathbf{a}(t) = \mathbf{W}^{\mathsf{T}}\mathbf{a}(t-1) + \mathbf{a}(t-1)$
    model 3: $\mathbf{a}(t) = \mathbf{W}^{\mathsf{T}}\mathbf{a}(t-1) + \mathbf{a}(0)$
    In addition, $\mathbf{W}$ is constrained to be row-stochastic. We focused more on the selection of the link weights than on the other parameters.
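The three variants differ only in what is added back after each propagation, which a short sketch can make concrete. The toy graph, the uniform row normalization and the function names below are assumptions for illustration and are not taken from the Elisit code.

```python
def row_normalize(adjacency):
    """Turn an adjacency list into uniform row-stochastic weights w_ij."""
    return {
        i: {j: 1.0 / len(neighbours) for j in neighbours}
        for i, neighbours in adjacency.items()
    }


def propagate(weights, activation):
    """Compute W^T a for weights stored as weights[i][j] = w_ij."""
    out = [0.0] * len(activation)
    for i, out_links in weights.items():
        for j, w_ij in out_links.items():
            out[j] += w_ij * activation[i]
    return out


def run_model(weights, initial, steps, model):
    """Iterate one of the three model variants for `steps` time steps."""
    activation = list(initial)
    for _ in range(steps):
        spread = propagate(weights, activation)
        if model == 1:      # a(t) = W^T a(t-1)
            activation = spread
        elif model == 2:    # a(t) = W^T a(t-1) + a(t-1)
            activation = [s + a for s, a in zip(spread, activation)]
        elif model == 3:    # a(t) = W^T a(t-1) + a(0)
            activation = [s + a0 for s, a0 in zip(spread, initial)]
        else:
            raise ValueError("model must be 1, 2 or 3")
    return activation


# Hypothetical toy graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
W = row_normalize({0: [1, 2], 1: [2], 2: [0]})
a0 = [1.0, 0.0, 0.0]

m1 = run_model(W, a0, steps=2, model=1)
m2 = run_model(W, a0, steps=2, model=2)
m3 = run_model(W, a0, steps=2, model=3)
print(m1, m2, m3)
```

In this toy graph every node has an outgoing link, so model 1 with row-stochastic weights keeps the total activation constant, whereas models 2 and 3 accumulate activation over time.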
  • Spreading activation in Wikipedia
    Observation from initial results: hubs get easily activated even if they are not relevant. This is a common phenomenon with large graphs [Brand, 2005; von Luxburg et al., 2010].
    Solution: we bias the spreading to avoid hubs by
    $$w_{ij} = \frac{\pi_j^{\alpha}}{\sum_{k:(i,k)\in E} \pi_k^{\alpha}}$$
    where $\pi_j$ is a popularity index of $j$ (degree / PageRank / HITS). If $\alpha = 0$, there is no biasing; if $\alpha < 0$, popular nodes are avoided.
    Biased random walks have, e.g., shorter return times than unbiased random walks [Fronczak and Fronczak, 2009].
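A minimal sketch of the biased weighting follows, using in-degree as the popularity index. The toy graph and the function names are hypothetical, and in-degree is just one of the indices the slide lists; PageRank or HITS scores could be passed in the same way.

```python
def in_degrees(adjacency, n_nodes):
    """Count incoming links per node; used here as the popularity index pi."""
    degrees = [0] * n_nodes
    for neighbours in adjacency.values():
        for j in neighbours:
            degrees[j] += 1
    return degrees


def biased_weights(adjacency, popularity, alpha):
    """w_ij = pi_j**alpha / sum over out-neighbours k of i of pi_k**alpha."""
    weights = {}
    for i, neighbours in adjacency.items():
        scores = {j: popularity[j] ** alpha for j in neighbours}
        total = sum(scores.values())
        weights[i] = {j: score / total for j, score in scores.items()}
    return weights


# Hypothetical toy graph in which node 2 is a hub with three incoming links.
graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
pi = in_degrees(graph, n_nodes=4)

unbiased = biased_weights(graph, pi, alpha=0)     # uniform over out-links
avoid_hubs = biased_weights(graph, pi, alpha=-1)  # popular targets get less

print(unbiased[0])
print(avoid_hubs[0])
```

Each row still sums to one, so the weights stay row-stochastic; a negative α only shifts activation from the hub (node 2) toward the less popular neighbour (node 1).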
  • Evaluation of system: We evaluated the biasing strategy by seeing how well the system activates related skills, as defined by LinkedIn. [Figure not preserved; surviving label: ≤ 20]
  • Evaluation of system: We tested the biasing strategy by seeing how well the system activates related skills, as defined by LinkedIn.

    Metric    Index   α=0     α=-0.2  α=-0.4  α=-0.6  α=-0.8  α=-1
    Pre@5     d_in    0.119   0.206   0.225   0.238   0.213   0.169
    Pre@5     PR      0.119   0.238   0.263   0.225   0.181   0.156
    Pre@5     HITS    0.119   0.206   0.169   0.119   0.075   0.063
    Pre@10    d_in    0.156   0.222   0.203   0.200   0.191   0.178
    Pre@10    PR      0.156   0.216   0.200   0.197   0.197   0.197
    Pre@10    HITS    0.156   0.213   0.150   0.141   0.113   0.091
    R-Pre     d_in    0.154   0.172   0.185   0.186   0.171   0.154
    R-Pre     PR      0.154   0.193   0.204   0.193   0.185   0.172
    R-Pre     HITS    0.154   0.185   0.148   0.119   0.109   0.097
    Rec@100   d_in    0.439   0.469   0.503   0.511   0.515   0.493
    Rec@100   PR      0.439   0.469   0.498   0.517   0.524   0.518
    Rec@100   HITS    0.439   0.494   0.476   0.418   0.384   0.336

    Table: The effect of the biasing parameter α and the choice of popularity index on the results in the evaluation of the module.
    E.g., the top 5 most activated skills of all the ≈ 27 000 skills contain 1-2 of the ≤ 20 related skills, on average. Biasing also improves the retrieval results: with the PageRank index, for instance, Pre@5 rises from 0.119 at α = 0 to 0.263 at α = -0.4.
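For readers unfamiliar with the column names, the sketch below computes the three kinds of measures on a toy ranking. It assumes the standard information-retrieval definitions of precision at k, R-precision and recall at k, which the slides do not spell out; the skill names and the ranking are invented for the example.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant items that appear in the top k."""
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def r_precision(ranked, relevant):
    """Precision at R, where R is the number of relevant items."""
    return precision_at_k(ranked, relevant, len(relevant))


# Hypothetical system output (most activated skill first) and related skills.
ranked_skills = ["Python", "Java", "Statistics", "SQL", "Hadoop", "NLP"]
related_skills = {"Java", "Hadoop", "NLP", "Spark"}

p5 = precision_at_k(ranked_skills, related_skills, 5)
rp = r_precision(ranked_skills, related_skills)
rec5 = recall_at_k(ranked_skills, related_skills, 5)
print(p5, rp, rec5)
```

Read this way, a Pre@5 of 0.263 corresponds to roughly 1.3 related skills among the five most activated ones, which matches the "1-2" stated on the slide.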
  • Evaluation of system: We also ran a test comparing the different language models.

    VSM            Pre@5   Pre@10  R-Pre   Rec@100
    TF-IDF         0.231   0.214   0.190   0.516
    LogEntropy     0.216   0.212   0.193   0.525
    LogEnt + LSA   0.180   0.181   0.163   0.491
    LogEnt + LDA   0.193   0.174   0.159   0.470

    Table: Comparison of the different vector space models of the system in the performance of the whole system.
  • Conclusion
    The Elisit system extracts explicit skills that are related to an arbitrary text input.
    It combines ESA-style conceptual mapping with spreading activation on the Wikipedia network.
    Evaluation experiments suggest that using popularity-biased spreading activation improves retrieval results.
  • Future work
    Improvement of link weights, e.g. by
        computing content similarity of the Wikipedia pages
        trying other structural similarity measures
        using the category memberships of pages
    Comparison with other strategies
    More sophisticated (e.g. hierarchical) representation of results
    Also, the methodology could be applied for other purposes, e.g. as a general topic model by replacing skills with topics.
  • References
    Krisztian Balog, Toine Bogers, Leif Azzopardi, Maarten De Rijke, and Antal Van Den Bosch. Broad expertise retrieval in sparse data environments. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 551–558. ACM, 2007.
    Krisztian Balog, Paul Thomas, Nick Craswell, Ian Soboroff, Peter Bailey, and Arjen P De Vries. Overview of the TREC 2008 enterprise track. Technical report, DTIC Document, 2008.
    Krisztian Balog, Yi Fang, Maarten de Rijke, Pavel Serdyukov, and Luo Si. Expertise retrieval. Foundations and Trends in Information Retrieval, 6(2-3):127–256, 2012.
    Ernst Biesalski. Knowledge management and e-human resource management. FGWM 2003, 2003.
    M. Brand. A random walks perspective on maximizing satisfaction and profit. Proceedings of the 2005 SIAM International Conference on Data Mining, 2005.
    Dan Crow and John DeSanto. A hybrid approach to concept extraction and recognition-based matching in the domain of human resources. In Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on, pages 535–541. IEEE, 2004.
    Hongbo Deng, Irwin King, and Michael R Lyu. Formal models for expert finding on DBLP bibliography data. In Data Mining, 2008. ICDM’08. Eighth IEEE International Conference on, pages 163–172. IEEE, 2008.
    Agata Fronczak and Piotr Fronczak. Biased random walks in complex networks: The role of local navigation rules. Physical Review E, 80(1):016107, 2009.
    Evgeniy Gabrilovich and Shaul Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI’07: Proceedings of the 20th international joint conference on Artificial intelligence, pages 1606–1611, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc.
    Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010. ELRA.
    Jeff Shrager, Tad Hogg, and Bernardo A Huberman. Observation of phase transitions in spreading activation networks. Science, 236(4805):1092–1094, 1987.
    U. von Luxburg, A. Radl, and M. Hein. Getting lost in space: large sample analysis of the commute distance. Proceedings of the 23rd Neural Information Processing Systems conference (NIPS 2010), pages 2622–2630, 2010.