
Semantic Analysis using Wikipedia Taxonomy


This is an introduction to an algorithm and methodology to extract semantics from one or several documents using Natural Language Processing and Machine learning techniques. The presentation describes the different components of the semantic analyzer using Wikipedia and Dbpedia as data sets.



  1. Creating a Taxonomy for Wikipedia. Patrick Nicolas, Feb 11, 2012
  2. Introduction. The goal of this study is to build a taxonomy graph for the 3+ million Wikipedia entries by leveraging WordNet hyponyms as a training set. The model can be used in a wide variety of commercial applications, from context extraction and automated wiki classification to text summarization. Notes: • Definitions and notations are given in the appendices. • The presentation assumes the reader has basic knowledge of information retrieval, Natural Language Processing and Machine Learning. Copyright Patrick Nicolas 2012 - All rights reserved
  3. Process. The computation flow for generating the Wikipedia taxonomy is summarized in the following 5 steps: 1. Extract abstracts & categories from the Wikipedia datasets. 2. Generate the hypernym lineages for Wikipedia entries that overlap with WordNet synsets. 3. Extract, reduce and order N-Grams and their tags (NNP, NN, ...) from each Wikipedia abstract. 4. Create a training set of weighted graphs for each Wikipedia abstract that has a corresponding hypernym hierarchy. 5. Optimize and apply the model to generate taxonomy lineages for each Wikipedia entry.
  4. Semantic Data Sources. Term Frequency Corpora: the Reuters corpus and Google N-Grams frequencies are used to compute inverse document frequency values. WordNet Hypernyms: the WordNet database of synsets is used to generate hierarchies of hypernyms, e.g. entity/physical entity/object/location/region/district/country/European country/Italy. Wikipedia Datasets: the entry (label), long abstract and categories are extracted from the Wikipedia reference database.
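The '/'-separated hypernym path above can be turned into a root-to-leaf lineage list with a small helper. This is a sketch; the parsing convention (and the function name) is an assumption, not part of the original pipeline.

```python
def parse_lineage(path: str) -> list[str]:
    """Split a '/'-separated WordNet hypernym path into a root-to-leaf lineage."""
    return [taxon.strip() for taxon in path.split('/') if taxon.strip()]

lineage = parse_lineage(
    "entity/physical entity/object/location/region/district/country/European country/Italy"
)
```

The resulting list runs from the WordNet root 'entity' down to the instance 'Italy'.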
  5. N-Grams Extraction Model. The relevancy (weight ω) of an N-Gram to the context of a document depends on syntactic, semantic and probabilistic features: the frequency of the N-Gram in the document (fD), the frequency of the N-Gram in the category abstracts (ρ), the frequency of the N-Gram in the universe/corpus (idf), the similarity of the N-Gram with the categories (β), the N-Gram tag (α), the frequency of its terms, whether it has a semantic definition, and whether it is contained in the 1st sentence (φ). Fig. 1 Illustration of features of the N-Gram extraction model.
  6. Computation Flow. The computation flow is broken down into 'plug & play' processing units to enable design of experiments and auditing. The units consume the Wikipedia datasets (labels, abstracts, categories), the corpus N-Gram frequencies (idf) and the WordNet synsets; they produce weighted N-Grams, N-Gram tags, normalized N-Gram weights and labeled hypernym lineages, and finally the trained model and the taxonomy graph. Fig. 2 Typical computation flow for generation of the taxonomy.
  7. N-Gram Frequency Analysis. Let w(n) denote an N-Gram (e.g. w(3) for a 3-Gram). Its frequency within the corpus C and its inverse document frequency (IDF) follow the standard formulation: idf(w(n)) = log(N / df(w(n))), where N is the number of documents in C and df(w(n)) is the number of documents containing w(n). Let w(n) be an N-Gram with frequency count(w(n)), composed of terms wj, j = 1..n, each with frequency count(wj) within a document D; the frequency of the N-Gram within D is its count normalized by the counts of its constituent terms.
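A minimal sketch of the frequency computations, using the standard tf-idf formulation; the slides' exact normalization was not preserved, so the relative-frequency definition below is an assumption.

```python
import math
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[str]:
    """All contiguous n-grams of a token list, joined with spaces."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def idf(term: str, docs: list[set[str]]) -> float:
    """Standard inverse document frequency: log(N / df)."""
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df) if df else 0.0

def doc_frequency(ngram: str, doc_tokens: list[str]) -> float:
    """Relative frequency of an n-gram among the n-grams of a document."""
    n = len(ngram.split())
    grams = Counter(ngrams(doc_tokens, n))
    total = sum(grams.values())
    return grams[ngram] / total if total else 0.0
```

For example, with a two-document corpus in which 'european country' appears in one document, idf('european country') = log(2).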
  8. Weighting N-Grams. Most Wikipedia concepts are well described in the first sentence of their abstract. Therefore we can attribute a greater weight to N-Grams contained in the first sentence. The frequency f1D of an N-Gram in the 1st sentence of a document is defined relative to its frequency in the whole document. A simple regression analysis showed that a square-root function provides a more accurate contribution (weight) of an N-Gram in a document D.
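The square-root contribution can be sketched as follows. The exact functional form and the sentence-splitting convention are assumptions; only the square-root shape comes from the slide.

```python
import math

def first_sentence_weight(ngram: str, abstract: str) -> float:
    """Square-root boost for n-grams occurring in the first sentence of an
    abstract, relative to their total frequency in the document.
    The exact formula is an assumption based on the slide's description."""
    first = abstract.split('.')[0].lower()          # naive first-sentence split
    total = abstract.lower().count(ngram.lower())   # occurrences in the document
    if total == 0:
        return 0.0
    f1 = first.count(ngram.lower())                 # occurrences in 1st sentence
    return math.sqrt(f1 / total)
```

An N-Gram occurring once in the first sentence out of two occurrences overall would receive weight sqrt(1/2) ≈ 0.71, rather than the linear 0.5.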
  9. Tagging N-Grams. Although Conditional Random Fields are the predominant discriminative classifiers for predicting sentence boundaries and token tags, we found that Maximum Entropy with binary features was more appropriate for classifying the first term of a sentence (NNP or NN). The model's feature functions ft(w) → {0,1} are extracted by maximizing the entropy H(p) of the probability that a word w has a specific tag t, subject to the constraint that each feature's expected value under the model matches its empirical expectation.
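For binary classification, a maximum-entropy model with binary features reduces to a logistic model over the feature functions. The feature set and the hand-picked weights below are illustrative assumptions, not the trained model from the slides.

```python
import math

def features(word: str, at_sentence_start: bool) -> dict[str, int]:
    """Binary feature functions f_t(w) -> {0,1}; the feature set is illustrative."""
    return {
        'capitalized': int(word[:1].isupper()),
        'sentence_start': int(at_sentence_start),
        'all_caps': int(word.isupper() and len(word) > 1),
    }

def p_nnp(word: str, at_sentence_start: bool, weights: dict[str, float]) -> float:
    """Logistic (maximum-entropy) probability that the word is tagged NNP."""
    z = sum(weights[f] * v for f, v in features(word, at_sentence_start).items())
    return 1.0 / (1.0 + math.exp(-z))
```

Note that a sentence-initial capital is discounted (negative 'sentence_start' weight in the usage below), since capitalization at the start of a sentence is uninformative about NNP vs. NN.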
  10. Wikipedia Tags Distribution. We extract the tags of Wikipedia entries (1- to 4-Grams) in the context of their abstracts. The distribution of tag frequencies shows that proper nouns (NNP tags) are predominant. This frequency distribution is used as the prior probability of finding a Wikipedia entry with a specific tag.
  11. Tag Predictive Model. We use a multinomial Naïve Bayes classifier to predict the tag of any given Wikipedia entry. Let's define a set of classes Ck = { w(n) | tg(w(n)) = k } of Wikipedia entries with a specific tag (e.g. CNNP, CNN) and p(t | Ck) the prior probability that a tag t belongs to a class. The likelihood that a given Wikipedia entry has tag k is proportional to the product of the class prior and the conditional probabilities of its term tags.
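A minimal multinomial Naïve Bayes sketch with Laplace smoothing; the class labels and feature names are illustrative, and the log-space product mirrors the prior-times-conditionals likelihood described above.

```python
import math
from collections import Counter, defaultdict

class TagNB:
    """Minimal multinomial Naive Bayes over tag features (illustrative sketch)."""

    def fit(self, docs: list[list[str]], labels: list[str]) -> 'TagNB':
        self.class_counts = Counter(labels)
        self.term_counts: dict[str, Counter] = defaultdict(Counter)
        for doc, y in zip(docs, labels):
            self.term_counts[y].update(doc)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def predict(self, doc: list[str]) -> str:
        n = sum(self.class_counts.values())
        best, best_lp = None, float('-inf')
        for c, cc in self.class_counts.items():
            lp = math.log(cc / n)  # class prior p(Ck)
            denom = sum(self.term_counts[c].values()) + len(self.vocab)
            for t in doc:          # Laplace-smoothed p(t | Ck)
                lp += math.log((self.term_counts[c][t] + 1) / denom)
            if lp > best_lp:
                best, best_lp = c, lp
        return best
```

Trained on a toy set where capitalized features correlate with NNP, the classifier recovers that association.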
  12. Taxonomy Weighted Graph. Let's define: • a taxonomy class (or taxon) as a graph node representing a hypernym (e.g. class = 'person') • a taxonomy instance as an entity name (e.g. instance = 'Peter', where 'Peter' IS-A 'person') • a taxonomy lineage as the list of ancestors (hypernyms) of an instance. Fig. Example of taxonomy lineage.
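These definitions can be sketched as a small data structure; the node names below are illustrative, not taken from the actual WordNet hierarchy.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Taxon:
    """A taxonomy class (hypernym) node with a link to its parent class."""
    name: str
    parent: Optional['Taxon'] = None

def lineage(taxon: Taxon) -> list[str]:
    """Root-to-node list of class names: the taxonomy lineage."""
    chain = []
    node: Optional[Taxon] = taxon
    while node is not None:
        chain.append(node.name)
        node = node.parent
    return list(reversed(chain))

entity = Taxon('entity')
person = Taxon('person', parent=Taxon('object', parent=entity))
# The instance 'Peter' IS-A 'person'; its lineage is that of its class.
```

Walking parent links from 'person' yields the lineage ['entity', 'object', 'person'].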
  13. Document Taxonomy. Any document can be represented as a weighted graph of taxonomy classes and instances. Fig. Example of taxonomy graph.
  14. Propagation Rule for Taxonomy Weights. The flow model is applied to the taxonomy weighted graph to compute the weight of each taxonomy class from the normalized weights of the semantic N-Grams. The weights of the taxonomy classes are normalized so that the root 'entity' has ω = 1. The taxonomy instances (N-Grams) are ordered & normalized by their respective weights ω(wk(n)). Fig. Weight propagation in the taxonomy graph.
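A sketch of the propagation rule, assuming each class weight is the bottom-up sum of its descendants' instance weights, then normalized so the root 'entity' gets ω = 1. The summation rule is an assumed simplification of the slide's flow model.

```python
def propagate(children: dict[str, list[str]],
              instance_weights: dict[str, float],
              root: str = 'entity') -> dict[str, float]:
    """Propagate instance (N-Gram) weights bottom-up through the taxonomy,
    normalizing class weights so that the root has weight 1."""
    def raw(node: str) -> float:
        if node in instance_weights:        # leaf instance: use its N-Gram weight
            return instance_weights[node]
        return sum(raw(c) for c in children.get(node, []))

    weights = {n: raw(n) for n in children}
    scale = weights[root]
    return {n: w / scale for n, w in weights.items()}
```

With instances 'Peter' (0.6) and 'Paul' (0.4) under 'person' under 'entity', both classes end up with normalized weight 1.0.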
  15. Normalized Taxonomy Weights in Wikipedia. We analyze the distribution of weights along the taxonomy lineage for all Wikipedia entries.
  16. Lineage Weights Estimator. Training with the initial set of WordNet hypernyms shows that the distribution of normalized weights ωk along the taxonomy lineage, for a specific similarity class C, can be approximated with a polynomial function (spline). This estimator is used in the classification of the taxonomy lineages of a Wikipedia abstract.
  17. Similarity Metrics. In order to train a model using labeled WordNet hypernyms, a similarity (or distance) metric needs to be defined. Let's consider two taxonomy lineages Vj and Vk of respective lengths n(j) and n(k). Two metrics are considered: the cosine distance between the lineages, and the shortest-path distance through their deepest common ancestor.
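The two metrics can be sketched directly over lineage lists: the cosine distance treats each lineage as a bag of taxa, and the shortest-path distance counts the edges separating the two leaves through their deepest common ancestor. The slides' exact definitions were not preserved, so these are standard formulations.

```python
import math
from collections import Counter

def cosine_distance(vj: list[str], vk: list[str]) -> float:
    """1 - cosine similarity of the two lineages viewed as bags of taxa."""
    a, b = Counter(vj), Counter(vk)
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return 1.0 - dot / norm if norm else 1.0

def path_distance(vj: list[str], vk: list[str]) -> int:
    """Edges separating the two leaves via their deepest common ancestor,
    assuming both lineages are listed root-first."""
    common = 0
    for x, y in zip(vj, vk):
        if x != y:
            break
        common += 1
    return len(vj) + len(vk) - 2 * common
```

Two lineages that diverge only at the leaf ('Italy' vs. 'France') are two edges apart.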
  18. Taxonomy Generation Model. Let's consider m classes of taxonomy-lineage similarity and a labeled lineage VH. A class Ci is defined by its similarity to the labeled lineage VH. A taxonomy lineage Vj is classified using Naïve Bayes.
  19. Appendix: Notation
  20. Appendix: References • "Introduction to Information Retrieval", C. Manning, P. Raghavan, H. Schütze, Cambridge University Press • "The Elements of Statistical Learning", T. Hastie, R. Tibshirani, J. Friedman, Springer • "Semantic Taxonomy Induction from Heterogeneous Evidence", R. Snow, D. Jurafsky, A. Ng • "A Study on Linking Wikipedia Categories to WordNet Synsets using Text Similarity", A. Toral, O. Fernandez, E. Agirre, R. Muñoz • "Regularization Predicts While Discovering Taxonomy", Y. Mroueh, T. Poggio, L. Rosasco • "Natural Language Semantics Term Project", M. Tao • "A Maximum Entropy Approach to Natural Language Processing", A. Berger, V. Della Pietra, S. Della Pietra