Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content


In many cases, when browsing the Web, users are searching for specific information or answers to concrete questions.
Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further.
What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted
from two sources of user-generated content - Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers,
a more unconstrained question/answering forum - in promoting serendipitous search. In this work, the content of
each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing
quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and
the desirable metadata attributes in serendipitous search.

  1. Penguins in Sweaters, or Serendipitous Entity Search on User-generated Content. Ilaria Bordino, Yelena Mejova, and Mounia Lalmas (Yahoo Labs). ACM International Conference on Information and Knowledge Management (CIKM 2013), October 29th, 2013.
  2. Why/when do penguins wear sweaters? Serendipity: finding something good or useful while not specifically looking for it. Serendipitous search systems provide results that are both relevant and interesting. We build an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers.
  3. WHAT: What connections between entities do web community knowledge portals offer? WHY: How do they contribute to an interesting, serendipitous browsing experience?
  4. Yahoo! Answers vs. Wikipedia.
     - Yahoo! Answers: community-driven question & answer portal; 67M questions & 262M answers; 2 years [2010/2011]; English-language; minimally curated; opinions, gossip, personal info; variety of points of view.
     - Wikipedia: community-driven encyclopedia; 3,795,865 articles; snapshot from end of December 2011; English Wikipedia; curated; high-quality knowledge; variety of niche topics.
  5. Entity & Relationship Extraction. Entity: any concept having a Wikipedia page. We use an internal tool to (1) identify surface forms, (2) resolve them to Wikipedia entities, and (3) rank entities by aboutness score. Relationship: cosine similarity of tf-idf vectors, built from the concatenation of the documents in which each entity appears.
     W. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, and X. Li. Comparing Twitter and traditional media using topic models. ECIR 2011.
     D. Paranjpe. Learning document aboutness from implicit user feedback and document structure. CIKM 2009.
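The relationship weighting above can be sketched in a few lines. This is a minimal illustration of tf-idf plus cosine similarity, not the internal tool the slide refers to; the tokenization and the plain tf x idf weighting are simplifying assumptions:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse tf-idf vectors (dicts) for a list of token lists."""
    n = len(docs)
    df = Counter()                       # document frequency per term
    for doc in docs:
        df.update(set(doc))
    idf = {t: math.log(n / df[t]) for t in df}
    vecs = []
    for doc in docs:
        tf = Counter(doc)                # raw term frequency
        vecs.append({t: tf[t] * idf[t] for t in tf})
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Here each entity's pseudo-document would be the concatenation of all documents where the entity appears; edges with similarity below a threshold could then be dropped.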
  6. Dataset Features

     Dataset          # Nodes     # Edges        # Isolated
     Yahoo! Answers   896,799     112,595,138    69,856
     Wikipedia        1,754,069   237,058,218    82,381

     Metadata enrichment:
     - Sentiment: using SentiStrength, compute positive & negative scores, then attitude (polarity) and sentimentality (strength) [Kucuktunc'12], aggregated into entity-level scores
     - Topical Category: Yahoo Content Taxonomy
     - Quality: readability, via the Flesch Reading Ease score
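The attitude and sentimentality scores are derived from SentiStrength's positive score p in [1, 5] and negative score n in [-5, -1]. A plausible formulation in the spirit of [Kucuktunc'12]; the exact shift/normalization used here is an assumption:

```python
def attitude(p, n):
    """Attitude (polarity): net sentiment of a text.
    p is SentiStrength's positive score in [1, 5];
    n is the negative score in [-5, -1] (already negative)."""
    return p + n                 # range [-4, 4]; 0 is neutral

def sentimentality(p, n):
    """Sentimentality (strength): total emotional charge, ignoring sign.
    Shifted so a fully neutral text (p=1, n=-1) scores 0."""
    return p - n - 2             # range [0, 8]
```

Entity-level scores would then be averages of these per-document values over all documents mentioning the entity.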
  7. [Figure: visualizations of the Wikipedia and Yahoo! Answers entity networks.]
  8. Retrieval. Algorithm: lazy random walk with restart. 50 query entities: Justin Bieber, Nicki Minaj, Katy Perry, Shakira, Eminem, Lady Gaga, Jose Mourinho, Selena Gomez, Kim Kardashian, Miley Cyrus, Robert Pattinson, Adele (singer), Steve Jobs, Osama bin Laden, Ron Paul, Twitter, Facebook, Netflix, IPad, IPhone, Touchpad, Kindle, Olympic Games, Cricket, FIFA, Tennis, Mount Everest, Eiffel Tower, Oxford Street, Nürburgring, Haiti, Chile, Libya, Egypt, Middle East, Earthquake, Oil spill, Tsunami, Subprime mortgage crisis, Bailout, Terrorism, Asperger syndrome, McDonald's, Vitamin D, Appendicitis, Cholera, Influenza, Pertussis, Vaccine, Childbirth. Evaluation: 3 labels per query-result pair; annotator agreement (overlap): 0.85; average overlap in top 5 results: 12%.

     Metric        Wikipedia   Yahoo! Answers   Combined
     Precision@5   0.668       0.724            0.744
     MAP           0.716       0.762            0.782

     Example (Steve Jobs):
     - Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
     - Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp.
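A lazy random walk with restart can be sketched as follows: at each step the walker restarts at the query entity with some probability, otherwise stays put (the lazy component) or moves to a uniformly random neighbour; candidate entities are then ranked by visit probability. The parameter values and the plain fixed-iteration formulation are illustrative assumptions, not the paper's settings:

```python
def lazy_rwr(adj, seed, restart=0.15, laziness=0.5, iters=200):
    """Lazy random walk with restart on an undirected graph.

    adj: dict mapping node -> list of neighbours.
    Returns the visit distribution over nodes, which induces a
    ranking of candidate entities for the seed query entity.
    """
    pi = {v: 0.0 for v in adj}
    pi[seed] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in adj}
        for v, mass in pi.items():
            nxt[v] += (1 - restart) * laziness * mass      # stay put
            if adj[v]:
                share = (1 - restart) * (1 - laziness) * mass / len(adj[v])
                for u in adj[v]:                           # move to neighbour
                    nxt[u] += share
            else:                                          # dangling node: stay
                nxt[v] += (1 - restart) * (1 - laziness) * mass
        nxt[seed] += restart                               # restart at query
        pi = nxt
    return pi
```

In the paper's setting the edges would also carry tf-idf cosine weights and the walk could be biased by the metadata (sentiment, readability, topical category) attached to nodes.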
  9. Serendipity: "making fortunate discoveries by accident".
     - Serendipity = unexpectedness + relevance: "expected" result baselines come from web search
     - Serendipity = interestingness + relevance: result interestingness given the query, and personal interest in the result
     M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys 2010.
     P. Andre, J. Teevan, and S. T. Dumais. From x-rays to silly putty via Uranus: serendipity and its role in web search. SIGCHI 2009.
  10. Baselines ("expected" results from web search):
      - Top: 5 entities that occur most frequently in the top 5 search results provided by Bing and Google
      - Top–WP: same as above, but excluding the Wikipedia page from the set of results
      - Rel: top 5 entities in the related query suggestions provided by Bing and Google
      - Rel+Top: union of Top and Rel

      Serendipity per baseline, as |relevant & unexpected| / |unexpected| (serendipitous results out of all of the unexpected results retrieved), with |relevant & unexpected| / |retrieved| (serendipitous out of all retrieved) in parentheses:

      Baseline   Data   General        High Read.
      Top        WP     0.63 (0.58)    0.56 (0.53)
                 YA     0.69 (0.63)    0.71 (0.65)
                 Comb   0.70 (0.61)    0.68 (0.61)
      Top–WP     WP     0.63 (0.58)    0.56 (0.54)
                 YA     0.70 (0.64)    0.71 (0.66)
                 Comb   0.71 (0.64)    0.68 (0.63)
      Rel        WP     0.64 (0.61)    0.57 (0.56)
                 YA     0.70 (0.65)    0.71 (0.66)
                 Comb   0.72 (0.67)    0.69 (0.65)
      Rel+Top    WP     0.61 (0.54)    0.55 (0.51)
                 YA     0.68 (0.57)    0.69 (0.59)
                 Comb   0.68 (0.55)    0.66 (0.56)
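The two serendipity ratios on the slide can be computed as below, treating the retrieved results, the relevance judgments, and the baseline ("expected") results as sets of entities; the set-based formulation is a simplifying assumption:

```python
def serendipity_scores(retrieved, relevant, expected):
    """Two serendipity ratios for one query.

    retrieved: entities returned by the system
    relevant:  entities judged relevant by annotators
    expected:  entities surfaced by the web-search baselines
    """
    unexpected = retrieved - expected
    ser = unexpected & relevant          # serendipitous = relevant & unexpected
    frac_of_unexpected = len(ser) / len(unexpected) if unexpected else 0.0
    frac_of_retrieved = len(ser) / len(retrieved) if retrieved else 0.0
    return frac_of_unexpected, frac_of_retrieved
```

The first ratio corresponds to the unparenthesized numbers in the table, the second to the parenthesized ones.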
  11. User-perceived Quality:
      1. Which result is more relevant to the query?
      2. If someone is interested in the query, would they also be interested in these results?
      3. Even if you are not interested in the query, are these results interesting to you personally?
      4. Would you learn anything new about the query?
  12. Interestingness. Labelers provide pairwise comparisons between results; these are combined into a reference ranking, and each result ranking is compared to the optimal one (Kendall's tau-b). Agreement: relevance (83%), query interest (81%), personal interest (76%), learning something new (81%).
      - Interesting > relevant (WP): Oil Spill → Sweaters for Penguins; Robert Pattinson → Water for Elephants
      - Relevant > interesting (WP & YA): Egypt → Ptolemaic Kingdom (WP); Egypt → Cairo Conference (WP); Netflix → Blu-ray Disc (YA)
      J. Arguello, F. Diaz, J. Callan, and B. Carterette. A methodology for evaluating aggregated search results. ECIR 2011.
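Kendall's tau-b compares two rankings while correcting for ties, which matters here because aggregated pairwise labels produce tied positions. A self-contained sketch using the O(n^2) pairwise count (a production implementation such as scipy.stats.kendalltau would be faster):

```python
from itertools import combinations

def kendall_tau_b(x, y):
    """Kendall's tau-b between two equal-length score lists (handles ties)."""
    c = d = tx = ty = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                 # tied in both lists: counted in neither
        elif dx == 0:
            tx += 1                  # tied only in x
        elif dy == 0:
            ty += 1                  # tied only in y
        elif dx * dy > 0:
            c += 1                   # concordant pair
        else:
            d += 1                   # discordant pair
    denom = ((c + d + tx) * (c + d + ty)) ** 0.5
    return (c - d) / denom if denom else 0.0
```

A value of 1 means the system's ranking matches the reference ranking exactly, -1 means it is fully reversed, and 0 means no association.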
  13. Similarity (Kendall's tau-b) between result sets and the reference ranking:

      Question                                                        Data   General   +Topic
      Which result is more relevant to the query?                     WP     0.162     0.194
                                                                      YA     0.336     0.374
                                                                      Comb   0.201     0.222
      If someone is interested in the query, would they also be       WP     0.162     0.176
      interested in the result?                                       YA     0.312     0.343
                                                                      Comb   0.184     0.222
      Even if you are not interested in the query, is the result      WP     0.139     0.144
      interesting to you personally?                                  YA     0.324     0.359
                                                                      Comb   0.168     0.198
      Would you learn anything new about the query from this result?  WP     0.167     0.164
                                                                      YA     0.307     0.346
                                                                      Comb   0.184     0.203

      The topical-category constraint, which promotes results of the same topic as the query entity, helps; sentiment and readability constraints hurt performance.
  14. What did we learn? 1. What connections between entities do web community knowledge portals offer? Different ones: Wikipedia and Yahoo! Answers offer distinct entity connections (WP ≠ YA). 2. How do they contribute to an interesting, serendipitous browsing experience? Yahoo! Answers results are judged more relevant and interesting (YA > WP).