
# SetExpansion on Named Entities

Published in: Technology, News & Politics

### SetExpansion on Named Entities

1. SET EXPANSION ON NAMED ENTITIES
   Group #38, Project #3
   Ankit Choudhary (201206570), Lovlean Arora (201305590), Saksham Maheshwari (201130184), Aman Jain (201101132)
2. What is Set Expansion?
   ● In simple terms, set expansion means determining the set to which the given named entities belong.
   ● For example, Sachin Tendulkar and Rahul Dravid both belong to the set of Indian cricket team players.
   ● Definition: Set expansion refers to expanding a given partial set of objects into a more complete set. The user issues a query consisting of a small number of seeds x1, x2, ..., xk (we assume at least three valid seeds are given), where each xi is a member of some target set S. The answer to the query is a list of other probable elements of S.
3. Why Set Expansion?
   ● With the huge growth in data and service providers, users' needs have shifted from detailed queries to simple ones.
   ● Users now want the desired results quickly, using only a few words as the query.
   ● A user who wants a list of Indian cricket players can simply pass Sachin Tendulkar and Rahul Dravid as input and get a list of cricketers back from the set expansion technique.
   ● Example:
      – Input: {Sachin Tendulkar, Rahul Dravid}
      – Output: {Sourav Ganguly, VVS Laxman, Sachin Tendulkar, Rahul Dravid}
4. Related Work
   ● Our system works on Wikipedia data, and Wikipedia currently has no such feature.
   ● Wikipedia only provides title-based search.
   ● Google Sets:
      – Google Sets has been used for a number of purposes in the research community, including deriving features for named-entity recognition and evaluating question answering systems.
      – Shortcoming: Google Sets is a proprietary system that may be changed at any time.
   ● SEAL (Set Expander for Any Language): Exploits the semi-structured nature of web pages to find the seeds and the wrappers around them. The wrappers are then used to find other related entities.
   ● Others, such as the Boo!Wa! system, which uses Web wrapper technologies to extract and rank entities iteratively, also address this problem.
5. Approach
   ● Our work is divided into two parts:
      1. Indexing
      2. Searching
   ● We also apply external tools, such as a POS (part-of-speech) tagger, to the names of the finally retrieved documents to refine the results and restrict them to named entities.
6. Indexing
   ● To prepare the index, we apply tokenization, stop word removal, stemming, and diacritics normalization.
   ● We focus on the following fields provided by the Wiki data, listed in decreasing order of weight: Titles, Categories, Infobox, Body Text, External References. We build primary and secondary indexes on them.
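The indexing steps on this slide can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the stop word list, the toy suffix stripper `naive_stem` (standing in for a real stemmer), the field names, and the numeric values in `FIELD_WEIGHTS` are all assumptions. The slides give only the order of the field weights, not their values.

```python
import re
import unicodedata
from collections import defaultdict

# Tiny illustrative stop word list (assumption; a real index would use a fuller one).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}

# Hypothetical weights: the slides state only the decreasing order of the fields.
FIELD_WEIGHTS = {"title": 5.0, "categories": 4.0, "infobox": 3.0,
                 "body": 2.0, "references": 1.0}


def strip_diacritics(text):
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_stem(token):
    """Toy suffix stripper; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ers", "er", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Normalize diacritics, lowercase, tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z0-9]+", strip_diacritics(text).lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(docs):
    """Build term -> {doc_id: accumulated field weight} from {doc_id: {field: text}}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, fields in docs.items():
        for field, text in fields.items():
            for term in analyze(text):
                index[term][doc_id] += FIELD_WEIGHTS[field]
    return index
```

In this sketch a term that appears in a page title contributes more to that page's score than the same term in the body text, which is one simple way to realize the field ordering described above.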
7. Searching
   ● Document Fetcher: Retrieves the top 10 relevant documents for each seed.
   ● Attribute Classifier: Crawls each document and extracts attributes from its Category, Infobox/Taxobox, and Introductory Text.
   ● Ranker: Ranks the set of documents corresponding to the attributes, giving the highest weight to Category, followed by Infobox and then Text.
   ● POS Tagger: Retains only the named entities that belong to the resulting set.
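The ranking idea on this slide could look something like the following sketch. It assumes each page has already been reduced to sets of attribute values by the attribute classifier. The weights in `ATTRIBUTE_WEIGHTS`, the `expand` function, and the data layout are hypothetical. The slides state only that Category outweighs Infobox, which outweighs Text. The document fetcher and POS tagger stages are not modeled here.

```python
# Hypothetical weights; only the ordering Category > Infobox > Text comes from the slides.
ATTRIBUTE_WEIGHTS = {"category": 3.0, "infobox": 2.0, "text": 1.0}


def expand(seeds, pages, top_n=10):
    """Rank non-seed pages by the weighted attributes they share with all seeds.

    pages maps a page title to {"category": set, "infobox": set, "text": set}.
    """
    # For each attribute type, keep only the values common to every seed.
    shared = {
        attr: set.intersection(*(pages[seed][attr] for seed in seeds))
        for attr in ATTRIBUTE_WEIGHTS
    }
    scores = {}
    for title, attrs in pages.items():
        if title in seeds:
            continue
        score = sum(
            weight * len(attrs[attr] & shared[attr])
            for attr, weight in ATTRIBUTE_WEIGHTS.items()
        )
        if score > 0:
            scores[title] = score
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_n]


# Hand-made toy data for illustration only, not taken from the project's index.
pages = {
    "Lagaan": {"category": {"Indian films", "Cricket films"},
               "infobox": {"starring: Aamir Khan"}, "text": {"film"}},
    "Talaash": {"category": {"Indian films"},
                "infobox": {"starring: Aamir Khan"}, "text": {"film"}},
    "Ghajini": {"category": {"Indian films"},
                "infobox": {"starring: Aamir Khan"}, "text": {"film"}},
    "Welcome": {"category": {"Indian films"},
                "infobox": {"starring: Akshay Kumar"}, "text": {"film"}},
    "Malaria": {"category": {"Diseases"}, "infobox": set(), "text": {"disease"}},
}

print(expand(["Lagaan", "Talaash"], pages))
```

With this toy data, a page that shares both the category and the infobox value with the seeds ranks above one that shares only the category, and a page with nothing in common is dropped.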
8. Complete Architecture
9. Tests
   ● Input: Lagaan talaash
   ● Results: 3 idiots, sarfarosh champion (2003 film), lagaan, p.k. (film), afsana pyaar ka, delhi belly (film), dhoom 3, jo jeeta wohi sikandar, nation awakes, ghajini (2008 film), ready (2011 film), welcome (2007 film), luck by chance, elements trilogy
10. Applications
   ● Set expansion on the Wiki data itself
   ● General knowledge: for example, if you want a list of diseases but know only a few, such as malaria and cholera, you can give those as input and get a variety of diseases in the results
   ● Comparisons between named entities
   ● Search result suggestions on Wiki, similar to Google's
   ● And many more to come...
11. Conclusion
   ● In this project, we were supposed to expand a set of named entities, and we think we have been quite successful at it.
   ● Yes, it's possible that our results are not up to the mark in some cases, but they cover the most general results as expected.
   ● There is a lot of scope for future work on this project, and we are planning to get our hands dirty.
   ● We can try to handle the whole Wiki data more efficiently and, God knows, maybe our tool will be used by billions :)
12. References
   ● http://en.wikipedia.org/wiki/Collaborative_filtering
   ● https://www.cs.cmu.edu/afs/cs/Web/People/wcohen/postscript/icdm2007.pdf
   ● http://aclweb.org/anthology//P/P09/P091050.pdf
   ● http://su2010-projekt.googlecode.com/svn-history/r115/trunk/literatura/melville2002content.pdf
   ● http://www.cs.uic.edu/~lzhang3/paper/set_expansion.pdf
13. Thank You ☺