Penguins in-sweaters-or-serendipitous-entity-search-on-user-generated-content
Penguins in-sweaters-or-serendipitous-entity-search-on-user-generated-content by Bordino

Published in Technology, Education
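  • The story behind the image: an internet user searching for "oil spill" unexpectedly came across a search result saying that penguins need to wear sweaters, and found it interesting.
    The paper uses entity search techniques to retrieve results of this kind from Yahoo! Answers and Wikipedia and present them to the user, improving the user experience.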
  • Both datasets (Yahoo! Answers and Wikipedia) are user-generated content.
    The content of each data source is represented as an entity network.
    The challenges include:
    Extracting entities from different datasets
    Building a meaningful similarity measure
  • Step 2 (resolving surface forms to Wikipedia entities): this approach employs a resolution model based on a rich set of both content-sensitive and content-independent features, derived from Wikipedia and various other data sources including web behavioral data.
  • Each entity e is represented by the (order-insensitive) concatenation of all the documents in C where e appears.
    The lexicon is extracted by tokenizing every document, removing stop words, and applying Porter's stemming algorithm to the resulting tokens.
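A minimal sketch of the entity similarity described in these notes, assuming a toy stop-word list and a smoothed idf; the function names and the sample data are hypothetical, and Porter stemming is left out to keep the example dependency-free.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; a real system would use a fuller one).
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are", "for", "on"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics, drop stop words (no stemming here)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]

def entity_term_counts(docs, entity_docs):
    """Concatenate (order-insensitively) the documents where each entity appears."""
    return {
        entity: Counter(t for doc_id in doc_ids for t in tokenize(docs[doc_id]))
        for entity, doc_ids in entity_docs.items()
    }

def tfidf_vectors(term_counts):
    """tf * smoothed idf; the smoothing is an assumption, not taken from the slides."""
    n = len(term_counts)
    df = Counter(t for counts in term_counts.values() for t in counts)
    return {
        entity: {t: tf * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, tf in counts.items()}
        for entity, counts in term_counts.items()
    }

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy corpus.
docs = {
    "d1": "Penguins wear sweaters after an oil spill",
    "d2": "Oil spill cleanup volunteers knit sweaters",
    "d3": "The stock market fell today",
}
entity_docs = {"penguin": ["d1"], "oil_spill": ["d1", "d2"], "stock_market": ["d3"]}
vectors = tfidf_vectors(entity_term_counts(docs, entity_docs))
```

With this toy data, `cosine(vectors["penguin"], vectors["oil_spill"])` is positive because the two entities share documents, while the similarity to `stock_market` is zero.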
  • The two graphs are almost fully connected.
    The largest connected component spans 92.5% of the nodes in YA, and 95.78% in WP.
    This is due to the presence of popular entities that appear ubiquitously in the two datasets.
    These entities represent very common concepts that are not specific to the subject of a document.
    They are removed from the entity networks because they are unlikely to be relevant to the input entity.
    The candidate entity space is further reduced by restricting it to pairs of entities that co-occur in at least one document.
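The pruning in these notes could look roughly like the sketch below. The notes do not say how popular entities are identified, so the document-frequency threshold `max_doc_fraction` and the function name are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def candidate_pairs(doc_entities, max_doc_fraction=0.5):
    """Return (pairs, popular).

    doc_entities: {doc_id: iterable of entities mentioned in that document}.
    Entities appearing in more than max_doc_fraction of the documents are
    treated as ubiquitous and dropped (hypothetical criterion).
    """
    n_docs = len(doc_entities)
    doc_freq = Counter(e for ents in doc_entities.values() for e in set(ents))
    popular = {e for e, c in doc_freq.items() if c / n_docs > max_doc_fraction}
    pairs = set()
    for ents in doc_entities.values():
        kept = sorted(set(ents) - popular)
        # Only entities that co-occur in this document become candidate pairs.
        pairs.update(combinations(kept, 2))
    return pairs, popular
```

Only the surviving pairs would then need a similarity score, which keeps the edge set far smaller than the full cross product of entities.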
  • eta = 0.9
    Setting alpha = 0.15 gave worse results, so the random walks use no jump.
    Stop criteria:
    The F-norm of the difference between two successive iterations is < 10E-6
    The maximum of 30 iterations is reached
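A minimal sketch of a lazy random walk with restart using the two stop criteria listed above. The notes give eta = 0.9 and alpha = 0.15 without spelling out the update rule, so the parameter names `stay` and `restart`, their default values, and the reading of the threshold as 1e-6 are illustrative assumptions rather than the paper's configuration.

```python
import math

def lazy_random_walk_with_restart(graph, query, stay=0.5, restart=0.15,
                                  tol=1e-6, max_iter=30):
    """graph: {node: {neighbour: weight}}; every neighbour must also be a key.

    At each step the walker returns to the query node with probability
    `restart`; otherwise it stays put with probability `stay` or moves to a
    neighbour with probability proportional to the edge weight.
    """
    scores = {node: 0.0 for node in graph}
    scores[query] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            if mass == 0.0:
                continue
            total = sum(graph[node].values())
            if total == 0.0:
                nxt[node] += mass  # isolated node: the mass stays where it is
                continue
            nxt[node] += stay * mass
            for nbr, weight in graph[node].items():
                nxt[nbr] += (1.0 - stay) * mass * weight / total
        for node in nxt:
            nxt[node] *= 1.0 - restart
        nxt[query] += restart
        # For a vector, the Frobenius norm is the Euclidean norm.
        diff = math.sqrt(sum((nxt[n] - scores[n]) ** 2 for n in graph))
        scores = nxt
        if diff < tol:
            break
    return scores

def rank_entities(graph, query, k=5, **walk_args):
    """Top-k entities by walk score, excluding the query entity itself."""
    scores = lazy_random_walk_with_restart(graph, query, **walk_args)
    ranked = sorted((n for n in graph if n != query), key=lambda n: (-scores[n], n))
    return ranked[:k]
```

Calling `rank_entities(graph, "penguin")` on a weighted entity network would return the five entities closest to the query under this walk.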
  • The lazy random walk with restart achieves a precision@5 of about 67% on WP and 72% on YA.
    The combination of WP and YA achieves the highest precision@5, 74.4%, and a Mean Average Precision of 78.2%.
  • Top: For each query, retrieve the 5 entities that occur most frequently in the top 5 search results provided by two major commercial search engines.
    Top Nwq: Similar to the previous case, but excluding the Wikipedia page of the input entity (if present) from the set of results returned by the search engines. Performance improved, which implies that entities from WP's entity network contribute to the serendipity of the search results.
    Rel: Return the top 5 entities in the related-query suggestions.
    Rel + Top: Return the union of the sets of entity recommendations provided by Top and Rel.
    The value in parentheses is almost always nearly as high as the corresponding serendipity value, which confirms that the methods proposed in the paper retrieve a considerable fraction of results that are both unexpected and relevant.


  • 1. Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content
    Mounia Lalmas et al. (Yahoo! Labs, CIKM 2013 Best Paper)
    chenwq, 2014/04/16
  • 2. Mounia Lalmas (@mounialalmas)
    Principal Research Scientist at Yahoo! Labs
    Professor of Information Retrieval at the Department of Computer Science at Queen Mary, University of London
    Her research focuses on three main areas: user engagement, social media, and search.
  • 3. Contents
    1. What/why serendipitous search
    2. How to build a serendipitous search system
    3. Experimental settings and analysis
  • 4. Why/when do penguins wear sweaters?
    Entity Search: Building an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers
    Serendipity: Finding something good or useful while not specifically looking for it
    Serendipitous search systems provide relevant and interesting results
  • 5. What is entity search: How people become entities
  • 6. What is entity search
    Entity extraction
    Proximity measure between two entities
    Entity ranking according to proximity to a query entity
  • 7. What is Serendipity
    “making fortunate discoveries by accident”
    M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys 2010.
    Serendipity = unexpectedness + relevance
    “Expected” result baselines from web search
    Serendipity = interestingness + relevance
    Result interestingness given the query
    Personal interest in result
    P. Andre, J. Teevan, and S. T. Dumais. From x-rays to silly putty via uranus: Serendipity and its role in web search. SIGCHI 2009.
  • 8. What is Serendipity
    Intuition from recsys: unexpectedness, usefulness u(RSi)
  • 9. Why/when do penguins wear sweaters?
    WHAT: What connections between entities do web community knowledge portals offer?
    WHY: How do they contribute to an interesting, serendipitous browsing experience?
  • 10. Why/when do penguins wear sweaters?
    Yahoo! Answers: community-driven question & answer portal; 67M questions & 262M answers; 2 years [2010/2011]; English-language
    Minimally curated; opinions, gossip, personal info; variety of points of view
    Wikipedia: community-driven encyclopedia; 3 795 865 articles; from end of December 2011; English Wikipedia
    Curated; high-quality knowledge; variety of niche topics
  • 11. Contents
    1. What/why serendipitous search
    2. How to build a serendipitous search system
    3. Experimental settings and analysis
  • 12. Entity & Relationship Extraction
    Entity: defined as any concept having a Wikipedia page
    1. Identify surface forms [http]
    2. Resolve to Wikipedia entities [Zhou]
    3. Rank entities using aboutness score [Paranjpe]
    Relationship: cosine similarity of tf/idf vectors (concatenation of documents where the entity appears)
    Zhou Y, Nie L, Rouhani-Kalleh O, et al. Resolving surface forms to Wikipedia topics. Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010: 1335-1343.
    D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. CIKM 2009.
  • 13. Entity & Relationship Extraction: Aboutness, Relationship
  • 14. Entity Networks
    Dataset          # Nodes     # Edges       # Isolated
    Yahoo! Answers     896,799   112,595,138   69,856
    Wikipedia        1,754,069   237,058,218   82,381
    [Figure panels: Wikipedia, Yahoo! Answers]
  • 15. Retrieval Algorithm: Lazy random walk with restart [Chung]
    Chung F R K. Spectral graph theory. American Mathematical Soc., 1997.
  • 16. Rank Aggregation
    For a given query, combine the results from different search engines
    Simple median-rank aggregation [Sculley]
    Example: the rankings A B C D E and C D E A B aggregate to C A D B E
    Sculley D. Rank Aggregation for Similar Items. SDM 2007.
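A minimal sketch of median-rank aggregation, assuming that an item missing from a ranking is placed one position past that ranking's end and that ties are broken alphabetically; neither rule is stated on the slide, and the function name is hypothetical.

```python
from statistics import median

def median_rank_aggregate(rankings):
    """Combine ranked lists by sorting items on their median rank (1-based)."""
    items = set().union(*rankings)
    positions = [{item: i for i, item in enumerate(r, start=1)} for r in rankings]

    def median_rank(item):
        # Items absent from a ranking get rank len(ranking) + 1 (assumption).
        return median(pos.get(item, len(pos) + 1) for pos in positions)

    return sorted(items, key=lambda item: (median_rank(item), item))

combined = median_rank_aggregate([list("ABCDE"), list("CDEAB")])
```

For the two rankings from the slide, the median ranks are C = 2, A = 2.5, D = 3, B = 3.5, E = 4, so `combined` is `['C', 'A', 'D', 'B', 'E']`, matching the slide's third sequence.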
  • 17. Contents
    1. What/why serendipitous search
    2. How to build a serendipitous search system
    3. Experimental settings and analysis
  • 18. Retrieval
                     Wikipedia   Yahoo! Answers   Combined
    Precision @ 5    0.668       0.724            0.744
    MAP              0.716       0.762            0.782
    3 labels per query-result pair
    Annotator agreement (overlap): 85%
    Average overlap in top 5 results: 12%
    Example, Steve Jobs:
    Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
    Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp.
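The two retrieval metrics on this slide can be computed along the following lines. This is a sketch under assumptions: relevance is treated as binary (the slide does not say how the 3 labels per pair are merged), and average precision is normalised by the number of relevant results retrieved; the function names are hypothetical.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for entity in ranked[:k] if entity in relevant) / k

def average_precision(ranked, relevant):
    """Mean of precision@i over the ranks i that hold a relevant result."""
    hits = 0
    total = 0.0
    for i, entity in enumerate(ranked, start=1):
        if entity in relevant:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

MAP is then the mean of `average_precision` over all test queries, which is what `mean_average_precision` returns.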
  • 19. Why/when do penguins wear sweaters?
    WHAT: What connections between entities do web community knowledge portals offer?
    WHY: How do they contribute to an interesting, serendipitous browsing experience?
  • 20. Entity Networks with Implicit Metadata
    Sentiment: using SentiStrength, compute positive & negative scores; compute attitude and sentimentality; entity-level scores
    Quality: Flesch Reading Ease score
    Topical Category: Yahoo Content Taxonomy
    Metadata dimensions: Attitude (Polarity), Sentimentality (Strength), Readability
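For the readability dimension, the Flesch Reading Ease score is 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words). The sketch below applies that formula with a crude vowel-group syllable counter, which is an assumption made to keep the example self-contained; the slides do not say which implementation was used.

```python
import re

def count_syllables(word):
    """Crude estimate: number of vowel groups, at least 1 (no silent-e handling)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text):
    """Higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * len(words) / len(sentences)
            - 84.6 * syllables / len(words))
```

An entity-level score could then be obtained by averaging over the documents that mention the entity, though that aggregation step is also an assumption here.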
  • 21. Entity Networks with Metadata
    Table 5: Serendipity across different runs
    | relevant & unexpected | / | unexpected |: number of serendipitous results out of all of the unexpected results retrieved
    | relevant & unexpected | / | retrieved |: serendipitous results out of all results retrieved
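The two ratios reported in Table 5 can be sketched as follows, treating a result as unexpected when it is absent from a baseline's result set. The function and argument names are hypothetical, and returning 0.0 for an empty denominator is an assumption.

```python
def serendipity_ratios(retrieved, relevant, expected):
    """Return (|rel & unexp| / |unexp|, |rel & unexp| / |retrieved|).

    retrieved: results returned by the system for one query
    relevant:  set of results judged relevant
    expected:  set of results returned by a baseline (e.g. web search)
    """
    retrieved = list(dict.fromkeys(retrieved))  # drop duplicates, keep order
    unexpected = [e for e in retrieved if e not in expected]
    serendipitous = [e for e in unexpected if e in relevant]
    of_unexpected = len(serendipitous) / len(unexpected) if unexpected else 0.0
    of_retrieved = len(serendipitous) / len(retrieved) if retrieved else 0.0
    return of_unexpected, of_retrieved
```

Swapping in a different baseline result set for `expected` gives one row of the table per baseline.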
  • 22. User-perceived Quality
    1. Which result is more relevant to the query?
    2. If someone is interested in the query, would they also be interested in these results?
    3. Even if you are not interested in the query, are these results interesting to you personally?
    4. Would you learn anything new about the query?
  • 23. Entity Networks with Metadata
    Table 6: Similarity (Kendall's tau-b [Fagin]) between result sets and reference ranking
    Which result is more relevant to the query?
        Data   General   +Topic
        WP     0.162     0.194
        YA     0.336     0.374
        Comb   0.201     0.222
    If someone is interested in the query, would they also be interested in the result?
        WP     0.162     0.176
        YA     0.312     0.343
        Comb   0.184     0.222
    Even if you are not interested in the query, is the result interesting to you personally?
        WP     0.139     0.144
        YA     0.324     0.359
        Comb   0.168     0.198
    Would you learn anything new about the query from this result?
        WP     0.167     0.164
        YA     0.307     0.346
        Comb   0.184     0.203
    The topical category constraint promotes results on the same topic as the query entity
    Sentiment and Readability constraints hurt performance
    Fagin R, Kumar R, Mahdian M, et al. Comparing and aggregating rankings with ties. Proceedings of the twenty-third ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. ACM, 2004: 47-58.
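Kendall's tau-b, the similarity measure named in Table 6, handles ties in either ranking. Below is a minimal quadratic-time sketch; how the reference ranking was built from the user judgments is not stated on the slide, so the inputs are just two equal-length score lists, and the function name is hypothetical.

```python
import math

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length lists of ranks or scores."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x = ties_y = 0
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied in both: counted in neither factor
            if dx == 0:
                ties_x += 1       # tied only in x
            elif dy == 0:
                ties_y += 1       # tied only in y
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = math.sqrt((concordant + discordant + ties_x)
                      * (concordant + discordant + ties_y))
    return (concordant - discordant) / denom if denom else 0.0
```

Identical rankings give 1.0 and fully reversed rankings give -1.0, so the values in Table 6 (roughly 0.14 to 0.37) indicate modest positive agreement with the reference ranking.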