Entity linking in Advertisements
Team: Rounak Patni, Kumar Rishabh, Rohit Jain, Siva kumar
Mentor: Pulkit Goel
Goals
•Identify important entities within the
advertisements.
•Link them to the corresponding Wikipedia pages.
•Identify relevant concepts in order to
disambiguate each entity.
Benefits of Wikipedia
•Ever-expanding number of pages in the
Wikipedia corpus.
•A rigorous structure, but with low coverage,
which emulates real-world data very well.
•A large number of entities, including proper
names unlikely to be found in any other
collection.
•Redirect pages and disambiguation pages.
Process Overview
•Parser Module - Parses the given web page and
produces two documents: the advertisement
itself and the document, which is later used in
the final steps to disambiguate the results of
the search module.
•Tokenizer Module - Converts the
advertisement into a list of tokens.
•POS Tagger Module - Marks up each word in an
ad with its part of speech.
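The tokenizer step can be sketched as follows. This is a minimal illustration only: the regular expression `TOKEN_RE` and the function name `tokenize` are assumptions, not the tokenizer the team actually used. POS tagging normally relies on a trained tagger and is not shown here.

```python
import re

# Hypothetical token pattern (an assumption for illustration):
# words with optional internal apostrophes or hyphens, numbers,
# or single punctuation marks.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:\.\d+)?|[^\w\s]")

def tokenize(ad_text):
    """Split an advertisement into a list of word and punctuation tokens."""
    return TOKEN_RE.findall(ad_text)

tokens = tokenize("An Apple a day keeps the doctor away")
print(tokens)
```

Punctuation is kept as separate tokens so that a later tagging step can see phrase boundaries.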
Process Overview
•Parsing Module – Returns each advertisement in
tree format.
•Noun Phrase Extraction Module – Extracts noun
phrases (NPs) from the tree generated in the
previous step.
•Noun Phrase Ranking – Ranks the NPs using a
heuristic function.
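A rough sketch of NP extraction and ranking is given below. The nested-tuple tree format, the hand-built example tree, and the scoring rule in `score` are all assumptions made for illustration; the slides do not specify the parser output format or the heuristic function that was used.

```python
# Assumed tree format: (label, child, child, ...), where a leaf is (tag, word).
TREE = ("S",
        ("NP", ("DT", "The"), ("NN", "doctor")),
        ("VP", ("VBZ", "likes"),
               ("NP", ("NNP", "Apple"))))

DETERMINERS = {"a", "an", "the"}

def leaves(tree):
    """Return the words under a tree node, left to right."""
    _label, *children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

def extract_nps(tree):
    """Collect the word list of every NP subtree, in pre-order."""
    label, *children = tree
    found = [leaves(tree)] if label == "NP" else []
    for child in children:
        if not isinstance(child, str):
            found.extend(extract_nps(child))
    return found

def score(np_words):
    """Hypothetical heuristic: capitalized words score 3, other words 1,
    determiners 0."""
    return sum(3 if w[0].isupper() else 1
               for w in np_words if w.lower() not in DETERMINERS)

def rank_nps(tree):
    """Return the NPs of a tree, highest score first."""
    return sorted(extract_nps(tree), key=score, reverse=True)

print(rank_nps(TREE))
```

Here the capitalized NP outranks the common-noun NP, which reflects the intuition that proper names are more likely to be linkable entities.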
Process Overview
•Entity/Keyword Extraction Module – Probable
entities and keywords are extracted from the
highest-ranked NP.
•Search Module – Returns a list of relevant
documents. The search module is essentially an
inverted index of the wiki dump. We extract only
the title and summary of each page.
•Filtering of Results – Finds the most likely
(closest) wiki page.
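The search and filtering steps might look roughly like the sketch below. The two toy pages (with made-up summaries, not real Wikipedia text), the stopword list, and the overlap-based filter in `best_page` are assumptions for illustration; the slides only state that an inverted index over titles and summaries is used and that the results are then filtered.

```python
import re
from collections import defaultdict

# Toy stand-in for the wiki dump: title -> summary (invented text).
PAGES = {
    "Apple (fruit)": "The apple is an edible fruit; eating one a day "
                     "is a proverbial way to avoid the doctor.",
    "Apple Corporation": "A technology company that makes products "
                         "such as phones and computers.",
}

STOPWORDS = {"a", "an", "the", "to", "is", "and", "as", "that", "such"}

def terms(text):
    """Lowercase a text and return its non-stopword terms."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS]

def build_index(pages):
    """Map each term of a title or summary to the titles containing it."""
    index = defaultdict(set)
    for title, summary in pages.items():
        for term in terms(title + " " + summary):
            index[term].add(title)
    return index

def search(index, query):
    """Return the titles that contain at least one query term."""
    hits = set()
    for term in terms(query):
        hits |= index.get(term, set())
    return sorted(hits)

def best_page(pages, candidates, context):
    """Pick the candidate sharing the most terms with the context."""
    context_terms = set(terms(context))
    return max(candidates,
               key=lambda t: len(context_terms & set(terms(t + " " + pages[t]))))

index = build_index(PAGES)
candidates = search(index, "apple")
print(best_page(PAGES, candidates, "An Apple a day keeps the doctor away"))
```

The `context` argument could be the advertisement text or the document produced by the parser module; which works better is not stated in the slides.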
Entity Detection
•The basic technique for entity detection is
chunk detection via shallow parsing.
•This technique reduces the number of keywords
to be searched in the corpus, improving
performance and accuracy.
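Chunk detection over POS-tagged tokens can be illustrated with the simplified sketch below. It assumes Penn Treebank style tags assigned by hand and a single chunk pattern (optional determiners and adjectives followed by one or more nouns); the function name `np_chunks` and the pattern are illustrative assumptions, not the team's actual chunker.

```python
def np_chunks(tagged):
    """Group runs of determiners/adjectives followed by nouns into chunks.

    `tagged` is a list of (word, tag) pairs. A chunk is kept only if it
    contains at least one noun (a tag starting with "NN").
    """
    chunks, current, has_noun = [], [], False
    for word, tag in tagged:
        if tag.startswith("NN"):
            current.append(word)
            has_noun = True
        elif tag in ("DT", "JJ") and not has_noun:
            # Determiner or adjective before the nouns of a chunk.
            current.append(word)
        else:
            # Any other tag, or a modifier after a noun, closes the chunk.
            if has_noun:
                chunks.append(current)
            current, has_noun = [], False
            if tag in ("DT", "JJ"):
                current.append(word)  # this modifier starts the next chunk
    if has_noun:
        chunks.append(current)
    return chunks

# Hand-tagged example (the tags are an assumption, not tagger output).
TAGGED_AD = [("An", "DT"), ("Apple", "NNP"), ("a", "DT"), ("day", "NN"),
             ("keeps", "VBZ"), ("the", "DT"), ("doctor", "NN"), ("away", "RB")]
print(np_chunks(TAGGED_AD))
```

Only the words inside chunks are passed on as search candidates, which is how this step reduces the number of keywords to look up.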
Evaluation and Results
•Advertisement: An Apple a day keeps the
doctor away
Wiki Page: Apple (fruit)
•Advertisement: Apple innovates relentlessly to
make great products, buy an apple
Wiki Page: Apple Corporation
•Advertisement: Royal Stag, it's your life make it
large
Wiki Page: Royal Stag
Conclusions
•It is possible to use NLP techniques to narrow
down the list of words to be searched in the
search engine.
•Context can be extracted from the
advertisement itself using NLP techniques.
•The search module gives satisfactory results on
a simple inverted index created from page titles
and summaries.
Thank You !!
