Semantic Analysis using Wikipedia Taxonomy
This presentation introduces an algorithm and methodology for extracting semantics from one or more documents using natural language processing and machine learning techniques. It describes the components of a semantic analyzer that uses Wikipedia and DBpedia as data sets.


Presentation Transcript

    • Creating a taxonomy for Wikipedia Patrick Nicolas Feb 11, 2012 http://patricknicolas.blogspot.com http://www.slideshare.net/pnicolas https://github.com/prnicolas
    • Introduction The goal of this study is to build a taxonomy graph for the 3+ million Wikipedia entries by leveraging WordNet hyponyms as a training set. The model can be used in a wide variety of commercial applications, from context extraction and automated wiki classification to text summarization. Notes: • Definitions and notations are given in the appendices. • The presentation assumes the reader has basic knowledge of information retrieval, natural language processing and machine learning.
    • Process The computation flow for generating the Wikipedia taxonomy is summarized in the following 5 steps (a toy skeleton follows this list):
    1. Extract abstracts & categories from the Wikipedia datasets.
    2. Generate the hypernym lineages for Wikipedia entries that overlap with WordNet synsets.
    3. Extract, reduce and order N-Grams and their tags (NNP, NN, …) from each Wikipedia abstract.
    4. Create a training set of weighted graphs for each Wikipedia abstract that has a corresponding hypernym lineage.
    5. Optimize and apply the model to generate taxonomy lineages for each Wikipedia entry.
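    A toy, runnable skeleton of this flow, with every step stubbed out (all function names and data are hypothetical, not the author's code):

        # Toy, end-to-end skeleton of the 5-step flow above. Every function body
        # is a placeholder standing in for the real component; names are hypothetical.

        def extract_entries(dump):
            # Step 1: (label, abstract, categories) records from the Wikipedia dump.
            return [{'label': 'Italy',
                     'abstract': 'Italy is a European country ...',
                     'categories': ['Countries']}]

        def hypernym_lineage(label):
            # Step 2: WordNet hypernym lineage, when the label matches a synset.
            lineages = {'Italy': ['entity', 'location', 'country', 'Italy']}
            return lineages.get(label)

        def weighted_ngrams(abstract):
            # Step 3: reduced, ordered, tagged N-Grams with weights (stubbed).
            return [('European country', 'NNP', 0.8), ('Italy', 'NNP', 0.6)]

        def train(training_set):
            # Steps 4-5: fit a model on (weighted N-Grams, lineage) pairs (stubbed).
            return lambda ngrams: ['entity', 'location', 'country']

        entries = extract_entries('enwiki-dump')
        training = [(weighted_ngrams(e['abstract']), hypernym_lineage(e['label']))
                    for e in entries if hypernym_lineage(e['label'])]
        model = train(training)
        print(model(weighted_ngrams(entries[0]['abstract'])))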
    • Semantic Data Sources
    Terms Frequency Corpora: The Reuters corpus and Google N-Grams frequencies are used to compute the inverse document frequency values.
    WordNet Hypernyms: The WordNet database of synsets is used to generate the hierarchy of hypernyms, e.g. entity/physical entity/object/location/region/district/country/European country/Italy
    Wikipedia Datasets: The entry (label), long abstract and categories are extracted from the Wikipedia reference database.
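    For illustration, the hypernym lineage above can be reproduced with NLTK's WordNet interface (NLTK is an assumed tooling choice; the presentation does not name a library):

        # Example: extract the hypernym lineage of 'Italy' from WordNet using
        # NLTK (illustrative tooling choice, not specified by the author).
        # Requires: nltk.download('wordnet')
        from nltk.corpus import wordnet as wn

        synset = wn.synsets('Italy')[0]            # first matching synset
        for path in synset.hypernym_paths():       # root-to-leaf hypernym chains
            print('/'.join(s.lemmas()[0].name() for s in path))
        # Expected output includes a lineage similar to:
        # entity/physical_entity/object/location/region/district/country/European_country/Italy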
    • N-Grams Extraction Model The relevancy (or weight ω) of an N-Gram to the context of a document depends on syntactic, semantic and probabilistic features. [Fig. 1: Features of the N-Gram extraction model: frequency of the N-Gram in the document (fD), frequency of the N-Gram in the category abstracts, frequency of the N-Gram in the universe/corpus (idf), frequency of its terms, the N-Gram tag, similarity of the N-Gram with the categories, whether the N-Gram has a semantic definition, and whether it is contained in the 1st sentence.]
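    As a reading aid, the features of Fig. 1 can be gathered in a small record; the field names are descriptive guesses from the figure labels, not the author's notation:

        # Feature set of the N-Gram extraction model (Fig. 1), as a plain record.
        # Field names are descriptive guesses from the figure, not the author's API.
        from dataclasses import dataclass

        @dataclass
        class NGramFeatures:
            freq_in_document: float        # fD: frequency of the N-Gram in the abstract
            freq_in_categories: float      # frequency in the category abstracts
            idf: float                     # inverse document frequency over the corpus
            term_frequencies: list         # frequencies of the individual terms
            tag: str                       # syntactic tag (e.g. 'NNP', 'NN')
            category_similarity: float     # similarity of the N-Gram with the categories
            has_semantic_definition: bool  # does the N-Gram have a definition?
            in_first_sentence: bool        # contained in the 1st sentence of the abstract?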
    • Computation Flow The computation flow is broken down into 'plug & play' processing units to enable design of experiments and auditing. [Fig. 2: Typical computation flow for the generation of the taxonomy; the processing units include the corpus N-Gram frequencies and idf, the Wikipedia abstracts and categories, the WordNet synsets and hypernyms, weighted and tagged N-Grams, semantic label matching, labeled lineages, normalized N-Gram weights, the taxonomy graph and the trained model.]
    • N-Grams Frequency Analysis Let's define an N-Gram $w^{(n)}$ (e.g. $w^{(3)}$ for a 3-Gram). Its frequency within the corpus C is the normalized count of its occurrences in C, and its inverse document frequency is $idf(w^{(n)}) = \log\frac{|C|}{|\{D \in C : w^{(n)} \in D\}|}$. Now let $w^{(n)}$ be an N-Gram with count $count(w^{(n)})$, composed of terms $w_j$, $j = 1,\dots,n$, with counts $count(w_j)$ within a document D; the frequency of the N-Gram in D is computed from $count(w^{(n)})$ and the counts of its constituent terms.
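    A minimal sketch of these statistics; normalizing the document-level N-Gram frequency by its most frequent constituent term is an assumption, since the slide's exact formula is not reproduced here:

        # Sketch of the corpus/document frequency statistics. Normalizing the
        # document-level N-Gram frequency by its most frequent constituent term
        # is an assumption; the presentation does not spell out the formula.
        import math
        from collections import Counter

        def idf(ngram, documents):
            # Inverse document frequency: log(|C| / #documents containing the N-Gram)
            n_docs = sum(1 for doc in documents if ngram in doc)
            return math.log(len(documents) / n_docs) if n_docs else 0.0

        def doc_frequency(ngram, doc_tokens):
            # Count of the full N-Gram in the document...
            terms = ngram.split()
            n = len(terms)
            ngram_count = sum(1 for i in range(len(doc_tokens) - n + 1)
                              if doc_tokens[i:i + n] == terms)
            # ...normalized by the most frequent constituent term (assumed).
            term_counts = Counter(doc_tokens)
            max_term = max(term_counts[t] for t in terms)
            return ngram_count / max_term if max_term else 0.0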
    • Weighting N-Grams Most Wikipedia concepts are well described in the first sentence of their abstract, so a greater weight can be attributed to N-Grams contained in the first sentence. The frequency $f^{1}_{D}$ of an N-Gram in the 1st sentence of a document is defined as its frequency restricted to that sentence. A simple regression analysis showed that a square-root function provides a more accurate contribution (weight) of an N-Gram in a document D.
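    A sketch of the square-root weighting; blending the document and 1st-sentence frequencies with a boost factor is an assumption, as the slide only states that a square root fits the contribution better:

        # Square-root contribution of an N-Gram, boosting 1st-sentence occurrences.
        # The blend of document and 1st-sentence frequencies is an assumption; the
        # presentation only states that a square root fits the contribution better.
        import math

        def ngram_weight(freq_doc, freq_first_sentence, boost=1.0):
            # freq_doc: frequency of the N-Gram in the whole abstract
            # freq_first_sentence: frequency restricted to the 1st sentence
            blended = freq_doc + boost * freq_first_sentence
            return math.sqrt(blended)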
    • Tagging N-Grams Although Conditional Random Fields are the predominant discriminative classifiers for predicting sentence boundaries and token tags, we found that a Maximum Entropy model with binary features is more appropriate for classifying the tag of the first term in a sentence (NNP or NN). The model's feature functions $f_t(w) \to \{0,1\}$ are extracted by maximizing the entropy $H(p) = -\sum_{w,t} p(t|w)\log p(t|w)$ of the probability that a word w has a specific tag t, subject to the constraints $E_p[f_t] = E_{\tilde p}[f_t]$ (the expectation of each feature under the model matches its empirical expectation).
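    A small sketch of such a binary tagger, using logistic regression as the maximum-entropy classifier (the two coincide for this setup); the indicator features and toy data are illustrative assumptions:

        # Maximum-entropy (logistic regression) tagger for the first term of a
        # sentence: NNP vs NN. The binary features chosen here are illustrative
        # assumptions, not the feature set used in the presentation.
        from sklearn.linear_model import LogisticRegression

        def features(word):
            # Binary feature functions f_t(w) -> {0, 1}
            return [
                int(word[0].isupper()),      # capitalized first letter
                int(word.isupper()),         # fully capitalized (acronym)
                int(word.endswith('s')),     # plural-looking suffix
            ]

        # toy training set: (first word of a sentence, tag)
        train = [('Paris', 'NNP'), ('Einstein', 'NNP'), ('NASA', 'NNP'),
                 ('cats', 'NN'), ('history', 'NN'), ('music', 'NN')]
        X = [features(w) for w, _ in train]
        y = [tag for _, tag in train]

        model = LogisticRegression().fit(X, y)
        print(model.predict([features('Wikipedia')]))   # likely ['NNP']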
    • Wikipedia Tags Distribution We extract the tags of Wikipedia entries (1- to 4-Grams) in the context of their abstracts. The distribution of tag frequencies shows that proper nouns (NNP tags) are predominant. This frequency distribution is used as the prior probability of finding a Wikipedia entry with a specific tag.
    • Tag Predictive Model We use a multinomial Naïve Bayes model to predict the tag of any given Wikipedia entry. Let's define the set of classes $C_k = \{ w^{(n)} \mid tg(w^{(n)}) = k \}$ of Wikipedia entries with a specific tag ($C_{NNP}$, $C_{NN}$, …) and $p(t|C_k)$ the prior probability that a tag t belongs to a class. The likelihood that a given Wikipedia entry $w^{(n)}$, whose terms carry tags $t_1,\dots,t_n$, has tag k is $p(C_k \mid w^{(n)}) \propto p(C_k)\prod_{j=1}^{n} p(t_j \mid C_k)$.
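    A compact sketch of this classifier; representing an entry by the sequence of its term tags and the add-one smoothing are assumptions consistent with, but not stated in, the slide:

        # Multinomial Naive Bayes over term-tag sequences, with add-one smoothing.
        # Representing an entry by the tags of its terms is an assumption; the
        # exact features are not spelled out in the presentation.
        from collections import Counter, defaultdict
        import math

        def train_nb(entries):
            # entries: list of (term_tags, entry_tag), e.g. (['NNP','NN'], 'NNP')
            prior = Counter(tag for _, tag in entries)
            cond = defaultdict(Counter)
            for term_tags, tag in entries:
                cond[tag].update(term_tags)
            return prior, cond

        def predict(prior, cond, term_tags):
            vocab = {t for c in cond.values() for t in c}
            best, best_score = None, -math.inf
            for k, prior_count in prior.items():
                total = sum(cond[k].values())
                score = math.log(prior_count / sum(prior.values()))
                for t in term_tags:  # add-one smoothed log-likelihoods
                    score += math.log((cond[k][t] + 1) / (total + len(vocab)))
                if score > best_score:
                    best, best_score = k, score
            return best

        prior, cond = train_nb([(['NNP', 'NN'], 'NNP'), (['NN', 'NN'], 'NN')])
        print(predict(prior, cond, ['NNP', 'NN']))   # -> 'NNP' on this toy data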
    • Taxonomy Weighted Graph Let's define: • a taxonomy class (or taxon) as a graph node representing a hypernym (e.g. class='person') • a taxonomy instance as an entity name (e.g. instance='Peter', i.e. Peter IS-A person) • a taxonomy lineage as the list of ancestors (hypernyms) of an instance. [Fig.: Example of a taxonomy lineage]
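    These definitions map naturally onto a small node structure; this is a sketch, not the author's implementation:

        # Minimal node structure for the taxonomy weighted graph (sketch only).
        from dataclasses import dataclass, field

        @dataclass
        class TaxonomyNode:
            name: str                      # hypernym label, e.g. 'person'
            weight: float = 0.0            # propagated taxonomy weight
            children: list = field(default_factory=list)
            is_instance: bool = False      # True for leaves such as 'Peter'

        def lineage(root, target, path=()):
            # List of ancestors (hypernyms) from the root down to an instance.
            path = path + (root.name,)
            if root.name == target:
                return list(path)
            for child in root.children:
                found = lineage(child, target, path)
                if found:
                    return found
            return None

        root = TaxonomyNode('entity', children=[
            TaxonomyNode('person', children=[TaxonomyNode('Peter', is_instance=True)])])
        print(lineage(root, 'Peter'))   # ['entity', 'person', 'Peter']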
    • Document Taxonomy Any document can be represented as a weighted graph of taxonomy classes and instances. [Fig.: Example of a taxonomy graph]
    • Propagation Rule for Taxonomy Weights The flow model is applied to the taxonomy weighted graph to compute the weight of each taxonomy class from the normalized weights of the semantic N-Grams. The weights of the taxonomy classes are normalized so that the root 'entity' has ω = 1. The taxonomy instances (N-Grams) are ordered and normalized by their respective weights $\omega(w_k^{(n)})$. [Fig.: Weight propagation in the taxonomy graph]
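    A sketch of one plausible propagation rule, assuming each class weight is the sum of its children's weights before rescaling the root 'entity' to ω = 1 (the exact flow model lives in the figure):

        # Bottom-up weight propagation on a taxonomy graph (sketch). The rule
        # 'class weight = sum of children weights' is an assumption; the slide
        # only states that weights are normalized so the root 'entity' has w = 1.
        def propagate(graph, leaf_weights, node='entity'):
            # graph: {class_name: [children names]}; leaves are N-Gram instances
            children = graph.get(node, [])
            if not children:                       # leaf: normalized N-Gram weight
                return {node: leaf_weights.get(node, 0.0)}
            weights = {}
            for child in children:
                weights.update(propagate(graph, leaf_weights, child))
            weights[node] = sum(weights[c] for c in children)
            return weights

        graph = {'entity': ['person', 'location'],
                 'person': ['Peter'], 'location': ['Paris']}
        w = propagate(graph, {'Peter': 0.7, 'Paris': 0.3})
        root = w['entity']
        normalized = {k: v / root for k, v in w.items()}   # root 'entity' -> 1.0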
    • Normalized Taxonomy Weight in Wikipedia We analyze the distribution of weights along the taxonomy lineage for all Wikipedia entries.
    • Lineage Weights Estimator Training with the initial set of WordNet hypernyms shows that the distribution of normalized weights $\omega_k$ along the taxonomy lineage, for a specific similarity class C, can be approximated with a polynomial function (spline). This estimator is used in the classification of the taxonomy lineages of a Wikipedia abstract.
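    For illustration, a low-degree polynomial fit of normalized weights against lineage depth; the degree and the sample values are assumptions:

        # Polynomial (spline-like) estimator of normalized lineage weights as a
        # function of depth in the lineage. Degree and sample data are assumptions.
        import numpy as np

        depth = np.array([0, 1, 2, 3, 4, 5])                     # position in lineage
        weights = np.array([1.0, 0.82, 0.61, 0.45, 0.3, 0.22])   # hypothetical w_k

        coeffs = np.polyfit(depth, weights, deg=3)               # cubic fit
        estimator = np.poly1d(coeffs)
        print(estimator(2.5))   # estimated weight between depths 2 and 3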
    • Similarity Metrics In order to train a model using labeled WordNet hypernyms, a similarity (or distance) metric needs to be defined. Let's consider two taxonomy lineages $V_j$ and $V_k$ of respective lengths n(j) and n(k). Two metrics are used: the cosine distance, $1 - \frac{V_j \cdot V_k}{\|V_j\|\,\|V_k\|}$, and the shortest-path distance between the lineages in the taxonomy graph.
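    A sketch of both metrics over lineages given as root-first lists of class names; the binary bag-of-classes encoding for the cosine and the path through the deepest common ancestor are assumptions:

        # Cosine and shortest-path distances between two taxonomy lineages
        # (root-first lists of class names). Encodings are assumptions.
        import math

        def cosine_distance(v_j, v_k):
            # binary bag-of-classes encoding of each lineage
            classes = set(v_j) | set(v_k)
            a = [1 if c in v_j else 0 for c in classes]
            b = [1 if c in v_k else 0 for c in classes]
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(a)) * math.sqrt(sum(b))
            return 1 - dot / norm if norm else 1.0

        def shortest_path_distance(v_j, v_k):
            # hops from each leaf up to the deepest common ancestor and back down
            common = 0
            for x, y in zip(v_j, v_k):
                if x != y:
                    break
                common += 1
            return (len(v_j) - common) + (len(v_k) - common)

        italy = ['entity', 'object', 'location', 'country', 'Italy']
        france = ['entity', 'object', 'location', 'country', 'France']
        print(cosine_distance(italy, france), shortest_path_distance(italy, france))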
    • Taxonomy Generation Model Let's consider m classes of taxonomy-lineage similarity and a labeled lineage $V_H$. A class $C_i$ is defined by the similarity of a lineage with the labeled lineage $V_H$. A taxonomy lineage $V_j$ is classified using Naïve Bayes.
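    One plausible reading, as a sketch: similarity scores against the labeled lineage are bucketed into m classes, and a new lineage is assigned the most probable class under Naïve Bayes; the bucketing and the features are assumptions:

        # Sketch: bucket lineage similarities into m classes, then classify a new
        # lineage with a Gaussian Naive Bayes over its similarity features.
        # The bucketing scheme and feature choice are assumptions.
        import numpy as np
        from sklearn.naive_bayes import GaussianNB

        def similarity_class(sim, m=4):
            # map a similarity in [0, 1] to one of m classes C_0 .. C_{m-1}
            return min(int(sim * m), m - 1)

        # hypothetical training data: (similarity to V_H, normalized depth) features
        X = np.array([[0.9, 0.8], [0.85, 0.7], [0.4, 0.5], [0.1, 0.3], [0.2, 0.4]])
        y = np.array([similarity_class(s) for s, _ in X])

        model = GaussianNB().fit(X, y)
        print(model.predict([[0.7, 0.6]]))   # predicted similarity class for V_j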
    • Appendix: Notation
    • Appendix: References
    • "Introduction to Information Retrieval", C. Manning, P. Raghavan, H. Schütze, Cambridge University Press
    • "The Elements of Statistical Learning", T. Hastie, R. Tibshirani, J. Friedman, Springer
    • "Semantic Taxonomy Induction from Heterogeneous Evidence", R. Snow, D. Jurafsky, A. Ng
    • "A Study on Linking Wikipedia Categories to WordNet Synsets using Text Similarity", A. Toral, O. Fernández, E. Agirre, R. Muñoz
    • "Regularization Predicts While Discovering Taxonomy", Y. Mroueh, T. Poggio, L. Rosasco
    • "Natural Language Semantics Term Project", M. Tao
    • "A Maximum Entropy Approach to Natural Language Processing", A. Berger, V. Della Pietra, S. Della Pietra