MINING NAMED ENTITIES FROM
WIKIPEDIA
GROUP MEMBERS
- NIKHIL BAROTE
- KUNJ THAKKAR
- SHIVANI PODDAR
- ANKIT SHARMA
 In many search domains, both content and queries are
frequently tied to named entities such as a person, a
company, or similar.
 One challenge from an information retrieval point of view is
that a single entity can be referred to in more than one way.
 In this project we describe how to use Wikipedia content
to automatically generate a dictionary of named entities
and synonyms that all refer to the same entity.
 With our approach we can find named entities and their
synonyms with a high degree of accuracy.
 There are four Wikipedia features that are particularly
attractive as a mining source when building a large
collection of NEs:
1. INTERNAL LINKS
2. REDIRECT LINKS
3. EXTERNAL LINKS
4. CATEGORIES
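As an illustration, the first, second, and fourth of these features can be pulled straight out of an entry's raw wikitext. The regex-based sketch below is our own simplification (a real dump needs a proper MediaWiki parser), but it shows what each feature looks like in the markup:

```python
import re

def extract_features(title: str, wikitext: str) -> dict:
    """Pull link-based features of one Wikipedia entry from its raw wikitext."""
    # Internal links look like [[Target]] or [[Target|caption]].
    internal = re.findall(r"\[\[([^\]|#]+)(?:\|[^\]]*)?\]\]", wikitext)
    # A redirect page starts with #REDIRECT [[Target]].
    m = re.match(r"#REDIRECT\s*\[\[([^\]|#]+)\]\]", wikitext, re.IGNORECASE)
    redirect = m.group(1) if m else None
    # Categories are internal links into the Category namespace.
    categories = [t[len("Category:"):] for t in internal
                  if t.startswith("Category:")]
    return {"title": title, "links": internal,
            "redirect": redirect, "categories": categories}
```

A redirect such as the "Big Apple" page pointing at "New York City" is exactly the kind of synonym evidence the later steps consume.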
 Generic Named Entity Recognition
Generic named entity recognition only classifies a Wikipedia entry
as an entity or not. It starts by looking at the title of the entry
since, as mentioned earlier, most article titles are nouns, and the
only nouns we are interested in are proper nouns.
 Category-Based Named Entity Recognition
This is a subtask of information extraction that seeks to locate and
classify elements in text into pre-defined categories such as the
names of persons, organizations, and locations, expressions of time,
quantities, monetary values, percentages, etc.
 Synonym Extraction
After a set of NEs has been identified, we want to find their synonyms.
We intend to use internal links, redirects, and disambiguation pages
for this, and we can easily extract all of these once we have the NEs.
This gives us a list of captions (anchor texts), all used on links to a
particular entity.
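A minimal sketch of assembling the synonym dictionary from those captions and redirects (the pair-based input format here is an assumption for illustration, not our exact data layout):

```python
from collections import defaultdict

def build_synonym_dict(links, redirects, entities):
    """Collect every caption or redirect title that points at a recognized NE.

    links:     iterable of (caption, target_title) pairs from internal links
    redirects: iterable of (redirect_title, target_title) pairs
    entities:  set of titles already classified as named entities
    """
    synonyms = defaultdict(set)
    for caption, target in links:
        if target in entities:
            synonyms[target].add(caption)
    for alias, target in redirects:
        if target in entities:
            synonyms[target].add(alias)
    # Return a plain dict with deterministic ordering of synonyms.
    return {entity: sorted(names) for entity, names in synonyms.items()}
```

Links whose target was not classified as an entity are simply dropped, so the dictionary only ever maps synonyms onto confirmed NEs.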
 Generic Named Entity Recognition Algorithm
To classify the entries we implemented an algorithm using the
following steps, given a title T and the text of an entry:
1. Remove any domain suffix from T
2. Tokenize T into n units, w1, w2, ..., wn
3. Remove any wi from W where wi is included in S
4. Classify as an entity if any of these conditions holds:
• ∑ C(wi) = n and n >= 2
• ∑ D(wi) >= 2
• E(T)/N(T) >= α
 A domain suffix is the text enclosed in parentheses that follows
the title of entries with multiple senses.
 Such suffixes are used to disambiguate between the senses, but
since they are not part of the entity name, we must first strip
them from the title. Next we strip all wi found in S, which is a
list of stop words.
1. C(wi) = 1 if the first letter l1 of wi ∊ [A..Z], 0 otherwise
2. D(wi) = 1 if Q >= 2 where Q = ∑ C(li), 0 otherwise
In other words, C is a function that returns 1 if the parameter
is capitalized and 0 otherwise, while D is a function that
returns 1 if the parameter has multiple capital letters and 0
otherwise. α is a variable used as a threshold for the third
condition.
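The steps and conditions above can be sketched in Python as follows. The stop-word list, the default α, and the way the capitalized-occurrence counts E(T) and N(T) are passed in are illustrative assumptions, not the exact values we use:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and"}  # assumed stop-word list S

def capitalized(w):
    # C: 1 if the word's first letter is a capital, 0 otherwise
    return 1 if w[:1].isupper() else 0

def multi_capital(w):
    # D: 1 if the word contains two or more capital letters, 0 otherwise
    return 1 if sum(c.isupper() for c in w) >= 2 else 0

def is_entity(title, alpha=0.5, cap_in_text=None, total_in_text=None):
    """Classify a title as a named entity using the three slide conditions."""
    # 1. Remove a domain suffix such as "(film)" from the title.
    title = re.sub(r"\s*\([^)]*\)\s*$", "", title)
    # 2./3. Tokenize and drop stop words.
    words = [w for w in title.split() if w.lower() not in STOP_WORDS]
    n = len(words)
    # Condition 1: every remaining word is capitalized and n >= 2.
    if n >= 2 and sum(capitalized(w) for w in words) == n:
        return True
    # Condition 2: at least two words with multiple capital letters.
    if sum(multi_capital(w) for w in words) >= 2:
        return True
    # Condition 3: fraction of capitalized occurrences E(T)/N(T) in the text.
    if cap_in_text is not None and total_in_text:
        return cap_in_text / total_in_text >= alpha
    return False
```

Condition 3 rescues single-word titles like "History" vs. "Madonna": only when the title is almost always capitalized inside the article body does it count as an entity.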
Search System
 First we take unigrams, bigrams, and trigrams from our query
document.
 We look them up in our synonym database and get a list of
doc_titles and corresponding doc_ids.
 Next we look at the words in a window centered at the current
word, and we collect candidate documents and their doc_ids
(the window size is set beforehand).
 We use the vector space model to match our query document
to these candidates.
 We pick candidates with a score greater than a preset
threshold. Finally, we look up the category for these entities in
our database.
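The vector-space matching and thresholding steps might look like this minimal sketch, using raw term frequencies and cosine similarity (no idf weighting, and the threshold value is a made-up default):

```python
import math
from collections import Counter

def cosine(query_tf: Counter, doc_tf: Counter) -> float:
    """Cosine similarity between two raw term-frequency vectors."""
    dot = sum(tf * doc_tf[t] for t, tf in query_tf.items())
    norm = (math.sqrt(sum(v * v for v in query_tf.values()))
            * math.sqrt(sum(v * v for v in doc_tf.values())))
    return dot / norm if norm else 0.0

def rank_candidates(query_tokens, candidates, threshold=0.3):
    """candidates: dict of doc_id -> token list; keep docs above the threshold."""
    q = Counter(query_tokens)
    scores = {doc_id: cosine(q, Counter(tokens))
              for doc_id, tokens in candidates.items()}
    # Sort surviving candidates by score, best first.
    return sorted(((d, s) for d, s in scores.items() if s >= threshold),
                  key=lambda pair: -pair[1])
```

The returned (doc_id, score) pairs are the candidates whose categories we then look up in the database.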
 Zesch et al. evaluate the usefulness of Wikipedia as a lexical
semantic resource, and compare it to more traditional
resources, such as dictionaries, thesauri, semantic wordnets, etc.
 Bunescu and Paşca study how to use Wikipedia for detecting
and disambiguating NEs in open-domain text.
 R. C. Bunescu and M. Paşca. Using encyclopedic knowledge for
named entity disambiguation. In Proceedings of EACL 2006, 2006.
 R. Schenkel, F. M. Suchanek, and G. Kasneci. YAWN: A semantically
annotated Wikipedia XML corpus. In Proceedings of BTW 2007, 2007.
 T. Zesch, I. Gurevych, and M. Mühlhäuser. Analyzing and
accessing Wikipedia as a lexical semantic resource. In
Proceedings of the Biannual Conference of the Society for
Computational Linguistics and Language Technology, 2007.
 R. Baeza-Yates and B. Ribeiro-Neto. Modern Information
Retrieval. Addison Wesley, 1999.
THANK YOU!

Information Retrieval and Extraction, IIIT
