In many cases, when browsing the Web, users are searching for specific information or answers to concrete questions.
Sometimes, though, users find unexpected, yet interesting and useful results, and are encouraged to explore further.
What makes a result serendipitous? We propose to answer this question by exploring the potential of entities extracted
from two sources of user-generated content - Wikipedia, a user-curated online encyclopedia, and Yahoo! Answers,
a more unconstrained question/answering forum - in promoting serendipitous search. In this work, the content of
each data source is represented as an entity network, which is further enriched with metadata about sentiment, writing
quality, and topical category. We devise an algorithm based on lazy random walk with restart to retrieve entity recommendations from the networks. We show that our method provides novel results from both datasets, compared to standard web search engines. However, unlike previous research, we find that choosing highly emotional entities does not increase user interest for many categories of entities, suggesting a more complex relationship between topic matter and
the desirable metadata attributes in serendipitous search.
1. Penguins in Sweaters,
or Serendipitous Entity Search
on User-generated Content
Ilaria Bordino, Yelena Mejova, and Mounia Lalmas
(Yahoo Labs)
ACM International Conference on Information and Knowledge Management (CIKM 2013)
October 29th, 2013
2. Why/when do penguins wear sweaters?
Serendipity: finding something good or useful while not specifically looking for it; serendipitous search systems provide relevant and interesting results.
Entity Search: we build an entity-driven serendipitous search system based on enriched entity networks extracted from Wikipedia and Yahoo! Answers.
3. What & Why
1. What connections between entities do web community knowledge portals offer?
2. How do they contribute to an interesting, serendipitous browsing experience?
4. Yahoo! Answers vs Wikipedia
Yahoo! Answers:
• community-driven question & answer portal
• 67M questions & 262M answers
• 2 years [2010/2011]
• English-language
• minimally curated
• opinions, gossip, personal info
• variety of points of view
Wikipedia:
• community-driven encyclopedia
• 3,795,865 articles
• from end of December 2011
• English Wikipedia
• curated
• high-quality knowledge
• variety of niche topics
5. Entity & Relationship Extraction
Entity: any concept having a Wikipedia page.
We use an internal tool to
(1) identify surface forms,
(2) resolve them to Wikipedia entities,
(3) rank entities by an aboutness score.
Relationship: cosine similarity of tf-idf vectors
(each entity is represented by the concatenation of the documents in which it appears)
W. Zhao, J. Jiang, J. Weng, J. He, E.P. Lim, H. Yan, and X. Li. Comparing twitter and traditional
media using topic models. ECIR 2011.
D. Paranjpe. Learning document aboutness from implicit user feedback and document structure.
CIKM 2009.
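The internal extraction tool is not public; as a rough, hedged sketch of the relationship weighting only, the edge weight between two entities can be computed as the cosine similarity of tf-idf vectors built from each entity's pseudo-document (the concatenation of documents mentioning it). The toy pseudo-documents below are invented for illustration; a minimal version with scikit-learn:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    # Hypothetical pseudo-documents: all text mentioning each entity, concatenated.
    pseudo_docs = {
        "Steve Jobs": "apple co-founder ipod iphone keynote macintosh",
        "Steve Wozniak": "apple co-founder engineer apple ii macintosh",
        "Blu-ray Disc": "optical disc format high definition video storage",
    }
    entities = list(pseudo_docs)

    # tf-idf vectors over the pseudo-documents, one row per entity.
    tfidf = TfidfVectorizer().fit_transform(pseudo_docs[e] for e in entities)

    # Relationship weight = cosine similarity between entity vectors.
    weights = cosine_similarity(tfidf)
    print(entities[0], "--", entities[1], round(weights[0, 1], 3))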
8. Retrieval
Algorithm: Lazy Random walk with restart
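A minimal sketch of a lazy random walk with restart over an entity graph; the parameter values and the toy weight matrix are assumptions for illustration, not the paper's settings. The query entities used in the evaluation are listed below.

    import numpy as np

    def lazy_rwr(W, seed, restart=0.15, laziness=0.5, iters=100):
        """Score every node of a weighted entity graph for a seed (query) entity.
        At each step the walker stays put with probability `laziness`, restarts
        at the seed with probability `restart`, otherwise follows a weighted edge."""
        n = W.shape[0]
        P = W / W.sum(axis=1, keepdims=True)           # row-normalized transitions
        P = laziness * np.eye(n) + (1 - laziness) * P  # lazy self-loops
        r = np.zeros(n)
        r[seed] = 1.0                                  # restart distribution
        p = r.copy()
        for _ in range(iters):
            p = restart * r + (1 - restart) * P.T @ p
        return p

    # Toy graph: 3 entities with symmetric similarity weights.
    W = np.array([[0.0, 0.8, 0.1],
                  [0.8, 0.0, 0.3],
                  [0.1, 0.3, 0.0]])
    scores = lazy_rwr(W, seed=0)
    # Rank the non-query entities by score to obtain recommendations.
    print(scores)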
Query entities: Justin Bieber, Nicki Minaj, Katy Perry, Shakira, Eminem, Lady Gaga,
Jose Mourinho, Selena Gomez, Kim Kardashian, Miley Cyrus, Robert
Pattinson, Adele (singer), Steve Jobs, Osama bin Laden, Ron Paul,
Twitter, Facebook, Netflix, iPad, iPhone, Touchpad, Kindle, Olympic
Games, Cricket, FIFA, Tennis, Mount Everest, Eiffel Tower, Oxford
Street, Nürburgring, Haiti, Chile, Libya, Egypt, Middle East,
Earthquake, Oil spill, Tsunami, Subprime mortgage crisis, Bailout,
Terrorism, Asperger syndrome, McDonald's, Vitamin D, Appendicitis,
Cholera, Influenza, Pertussis, Vaccine, Childbirth
3 labels per query-result pair
                Wikipedia   Yahoo! Answers   Combined
Precision@5     0.668       0.724            0.744
MAP             0.716       0.762            0.782

Annotator agreement (overlap): 0.85
Average overlap in top 5 results: 12%
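For reference, Precision@5 and average precision (whose mean over all queries gives MAP) follow the standard definitions; the sketch below uses made-up result lists, not the study's data.

    def precision_at_k(results, relevant, k=5):
        """Fraction of the top-k retrieved results that are relevant."""
        return sum(r in relevant for r in results[:k]) / k

    def average_precision(results, relevant):
        """Mean of precision@i over the ranks i at which a relevant result appears."""
        hits, precisions = 0, []
        for i, r in enumerate(results, start=1):
            if r in relevant:
                hits += 1
                precisions.append(hits / i)
        return sum(precisions) / len(relevant) if relevant else 0.0

    # Toy ranking for one query; MAP is the mean of average_precision over queries.
    results = ["e1", "e2", "e3", "e4", "e5"]
    relevant = {"e1", "e3", "e4"}
    print(precision_at_k(results, relevant), average_precision(results, relevant))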
Steve Jobs
Yahoo! Answers: Jon Rubinstein, Timothy Cook, Kane Kramer, Steve Wozniak, Jerry York
Wikipedia: System 7, PowerPC G4, SuperDrive, Power Macintosh, Power Computing Corp.
9. Serendipity
“making fortunate discoveries by accident”
Serendipity = unexpectedness + relevance
“Expected” result baselines from web search
Serendipity = interestingness + relevance
Result interestingness given the query
Personal interest in result
M. Ge, C. Delgado-Battenfeld, and D. Jannach. Beyond accuracy: evaluating recommender systems by coverage and serendipity. RecSys 2010.
P. André, J. Teevan, and S. T. Dumais. From x-rays to silly putty via Uranus: Serendipity and its role in web search. SIGCHI 2009.
10. Baseline
• Top: 5 entities that occur most frequently in the top 5 search results provided by Bing and Google
• Top –WP: same as above, but excluding the Wikipedia page from the set of results
• Rel: top 5 entities in the related query suggestions provided by Bing and Google
• Rel + Top: union of Top and Rel

Serendipity against these baselines (each cell shows two values; the measures are defined below):

Baseline    Data    General        High Read.
Top         WP      0.63 (0.58)    0.56 (0.53)
            YA      0.69 (0.63)    0.71 (0.65)
            Comb    0.70 (0.61)    0.68 (0.61)
Top –WP     WP      0.63 (0.58)    0.56 (0.54)
            YA      0.70 (0.64)    0.71 (0.66)
            Comb    0.71 (0.64)    0.68 (0.63)
Rel         WP      0.64 (0.61)    0.57 (0.56)
            YA      0.70 (0.65)    0.71 (0.66)
            Comb    0.72 (0.67)    0.69 (0.65)
Rel + Top   WP      0.61 (0.54)    0.55 (0.51)
            YA      0.68 (0.57)    0.69 (0.59)
            Comb    0.68 (0.55)    0.66 (0.56)

Serendipity measures:
• | relevant & unexpected | / | unexpected |: number of serendipitous results out of all of the unexpected results retrieved
• | relevant & unexpected | / | retrieved |: serendipitous results out of all retrieved results
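A minimal sketch of the two serendipity measures above, computed over sets of entity ids; the helper and its toy inputs are illustrative, not the authors' evaluation code:

    def serendipity_scores(retrieved, relevant, expected):
        """Serendipitous results are relevant results that are also unexpected,
        i.e. not among the baseline ('expected') results from web search."""
        unexpected = retrieved - expected
        serendipitous = relevant & unexpected
        over_unexpected = len(serendipitous) / len(unexpected) if unexpected else 0.0
        over_retrieved = len(serendipitous) / len(retrieved) if retrieved else 0.0
        return over_unexpected, over_retrieved

    # Toy sets of entity ids for a single query.
    retrieved = {"a", "b", "c", "d", "e"}
    relevant = {"a", "b", "d"}
    expected = {"a", "c"}   # e.g. entities returned by the Bing/Google baselines
    print(serendipity_scores(retrieved, relevant, expected))  # roughly (0.667, 0.4)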
11. User-perceived Quality
1. Which result is more relevant to the query?
2. If someone is interested in the query, would they
also be interested in these results?
3. Even if you are not interested in the query, are
these results interesting to you personally?
4. Would you learn anything new about the query?
12. Interestingness
Labelers provide pairwise comparisons between results;
these are combined into a reference ranking, and each system's result ranking is
compared to the reference (Kendall's tau-b)
Agreement:
Relevance (83%), Query interest (81%),
Personal interest (76%), Learning something new (81%)
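The tau-b comparison itself is standard; a brief sketch assuming SciPy, with invented rank values:

    from scipy.stats import kendalltau

    # Reference ranking (from the aggregated pairwise labels) and a system ranking,
    # given as the rank assigned to each of the same five results (toy values).
    reference = [1, 2, 3, 4, 5]
    system = [2, 1, 3, 5, 4]

    tau, p_value = kendalltau(reference, system)  # tau-b by default, handles ties
    print(round(tau, 3))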
Interesting > Relevant:
• Oil Spill → Sweaters for Penguins (WP)
• Robert Pattinson → Water for Elephants (WP)
Relevant > Interesting:
• Egypt → Ptolemaic Kingdom (WP & YA)
• Egypt → Cairo Conference (WP)
• Netflix → Blu-ray Disc (YA)
J. Arguello, F. Diaz, J. Callan, and B. Carterette. A methodology for evaluating
aggregated search results. ECIR 2011.
13. Similarity (Kendall's tau-b) between result sets and reference ranking

Question                                                      Data   General   +Topic
Which result is more relevant to the query?                   WP     0.162     0.194
                                                              YA     0.336     0.374
                                                              Comb   0.201     0.222
If someone is interested in the query, would they also        WP     0.162     0.176
be interested in the result?                                  YA     0.312     0.343
                                                              Comb   0.184     0.222
Even if you are not interested in the query, is the           WP     0.139     0.144
result interesting to you personally?                         YA     0.324     0.359
                                                              Comb   0.168     0.198
Would you learn anything new about the query from             WP     0.167     0.164
this result?                                                  YA     0.307     0.346
                                                              Comb   0.184     0.203

+Topic: topical category constraint, promoting results of the same topic as the query entity.
Sentiment and readability constraints hurt performance.
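One plausible way to impose such a topical constraint (not necessarily the paper's exact formulation) is to up-weight edges that lead to entities sharing the query's topical category before running the walk; the boost factor and categories below are invented:

    import numpy as np

    def topic_biased_weights(W, categories, query_cat, boost=2.0):
        """Multiply the columns of W (edges pointing to each entity) by `boost`
        when that entity shares the query's topical category."""
        same_topic = np.array([c == query_cat for c in categories], dtype=float)
        scale = 1.0 + (boost - 1.0) * same_topic
        return W * scale   # broadcasts over columns; re-normalized inside the walk

    # Toy usage: bias the walk from a Music query toward Music entities.
    W = np.array([[0.0, 0.8, 0.1],
                  [0.8, 0.0, 0.3],
                  [0.1, 0.3, 0.0]])
    W_topic = topic_biased_weights(W, ["Music", "Music", "Sports"], query_cat="Music")
    print(W_topic)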
14. What did we learn?
1. What connections between entities do web community knowledge portals offer?
Wikipedia and Yahoo! Answers offer different (≠) connections between entities.
2. How do they contribute to an interesting, serendipitous browsing experience?
Yahoo! Answers (>) provides the more interesting, serendipitous results.