The Power of Social Media
VP, Yahoo! Research
Barcelona, Spain & Santiago, Chile
Today is the
The Internet and the Web today
Web 2.0 and Social Media
Example: Social Search
The Wisdom of the Crowds
Internet and the Web
Internet and the Web Today
Between 1 and 2.5 billion people connected
– 5 billion estimated for 2015
1.8 billion mobile phones today
– 500 million expected to have mobile broadband in 2010
Internet traffic has increased 20 times in the last 5 years
Today there are more than 185 million Web servers
– 50% Apache, 34% Windows
The Web is in practice unbounded
– Dynamic pages are unbounded
– Static pages are over 12 billion?
• Web 2.0, social networks
– Fragmentation of content ownership
– Fragmentation of the access (age, topic, etc.)
– Fragmentation of the right to access
• Increase of the Semantic Web
– RDF, microformats, metadata in general
• Increase of Internet advertising associated with search/content
Advertising and the Web 2.0
The power of word of mouth
The power of the influential bloggers
– Positive (Dove)
– Negative (HSBC)
Presence in virtual(?) worlds (Second Life)
Yahoo! Scale (2007)
24 languages, 20 countries
> 4 billion page views per day (largest in the world)
> 500 million unique users each month (half the Internet users!)
> 250 million mail users (1 million new accounts a day)
95 million group members
7 million moderators
4 billion music videos streamed in 2005
20 PB of storage (20M GB)
– US Library of Congress every day (28M books, 20 TB)
15 TB of data processed per day
7 billion song ratings
2 billion photos stored
2 billion Mail+Messenger messages sent per day
The Web: A Play in Three Acts
Public: “The”
Personal: “My”
Social: “Our”
Web 2.0: Ingredients
Photos, Tags
Bookmarks, Audio
Some Social Networks
– Directed collaborative topical discussions
– Buddy list
– Topically focused communities
MySpace, Facebook, Friendster, Orkut
– Friendship network
– Collaborative bookmarking
Flickr, YouTube
– Photo/video sharing and tagging
– People answering people
Web 2.0 in Yahoo!
Social sites had 115M unique visitors, 56M of them “under 35”.
• Yahoo! Groups 8 million, 1 of each 10 members
• Del.icio.us 2 million users
• Flickr 1 million pictures per day
• Yahoo! Respuestas 100M users, 150M answers
• Messenger 85M unique users
(2007 data)
Why do people come online?
To be informed
To be entertained
Increasingly… to be part of new forms of participation,
belonging and sharing
To be part of social media
– also referred to as Social Networks
Social Networks
Mainly young people (13-25)
Mobile use
Who are they?
Age, %, Representative interests
What makes Flickr special?
1. User Generated Content
Content not licensed from providers such as Corbis or Getty, but rather
contributed by users.
2. User Organized Content
Content is tagged, described, organized, discovered, etc. not by “editors” but
by the users themselves.
3. User Distributed Content
Flickr achieved distribution across the internet, not through “business deals”
per se, but rather through the Flickr community which distributed Flickr
content on 3rd-party blogs.
4. User Developed Functionality
Flickr exposed APIs (PHP, Perl, etc.) that allowed the community of
developers to build against the Flickr platform.
Entire ecosystem created by fewer than ten employees…
aided by millions in the Flickr community.
Visualizing Tags: Tag Cloud from Flickr
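A tag cloud draws each tag at a size that grows with how often it is used. The following is a minimal sketch of that idea, not Flickr's actual method: the function name `tag_cloud_sizes`, the logarithmic scaling, the font-size range, and the sample tags are all assumptions made for illustration.

```python
import math
from collections import Counter


def tag_cloud_sizes(tags, min_size=10, max_size=36):
    """Map each tag to a font size based on its frequency.

    Assumption: sizes are scaled on the logarithm of the count, so that
    a few very popular tags do not dwarf everything else.
    """
    counts = Counter(tags)
    if not counts:
        return {}
    lo = math.log(min(counts.values()))
    hi = math.log(max(counts.values()))
    sizes = {}
    for tag, n in counts.items():
        # If every tag has the same count, draw them all at max size.
        weight = 1.0 if hi == lo else (math.log(n) - lo) / (hi - lo)
        sizes[tag] = round(min_size + weight * (max_size - min_size))
    return sizes


# Hypothetical tag stream: one entry per time a user applied the tag.
sample_tags = ["beach"] * 100 + ["chile"] * 10 + ["cat"]
sizes = tag_cloud_sizes(sample_tags)
```

With this scaling, a tag used 10 times lands halfway between a tag used once and a tag used 100 times, which keeps rarely used tags readable.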
A Digression: Computer Vision is hard
Internet UGC (User Generated Content)
[Figure: survey charts. Have you experienced UGC? (No, Yes; Publisher, Consumer). Types of content (multiple choice): Photos/Images, Text, Videos, Music, Animation/Flash, Others.]
Source: National Internet Development Agency Report in June, 2006 (South Korea)
Simple acts create value and opportunity
Using a system of user-assigned ratings, LAUNCHcast builds up a profile of preferences for each individual.
Users can then share their custom radio station with friends through Yahoo! Messenger, taking all the hassle out of discovering new music.
The more ratings users make, the more intelligent the radio becomes.
We have over 6 billion ratings.
LAUNCHcast = music that listens to you
Next generation products will blur distinctions between
Creators, Synthesizers, and Consumers
Every act of consumption is an implicit act of production
that requires no incremental effort…
Listening itself implicitly creates a radio station…
Millions of users of Flickr share and tag each
others’ photographs (why???)
Fernando Flores: Blogs
– Look into the future
Individual or collaborative
– Community newspaper: www.elmorrocotudo.cl
Power law distribution
The Knowledge Challenge
Enabling users to share knowledge with their community to create a
better search experience
Example query: number of results
Vacation Chile: 26,800,000
“Everything Ricardo knows about Chile”: 0
The kinds of queries that rely on domain expertise…
“Do you know a reputable plumber in Southampton?”
“Where is the cool nightlife in Trento?”
“What political blogs do you think I’d enjoy reading?”
“Where can I buy a cool pair of shoes?”
These kinds of queries are ill-served by today’s search
engines, but are ironically the most valuable (i.e.
How do we capture the people’s experience?
Social Powered Search: Yahoo! Answers
Democratize process of “voting”
(whether explicit or implicit)
Move out of the purview of webmasters and hand
control back to users
Allow dynamic assignment to various authorities of
trust, new degree of freedom
“Better Search Through People”
Challenges in Social Search
How do we use UGC for better search?
What’s the ratings and reputation system?
How do you cope with (social) spam?
What are the incentive mechanisms?
The bigger challenge: Where else can you
leverage the power of the people?
European search vision
Knowledge - the next challenge
Making knowledge pay
Poorly formed questions
P. Jurczyk, E. Agichtein: “Discovering authorities in Q.A. communities by using link
What are the Problems?
Which questions are legitimate?
What is the incentive system?
How do we validate answers?
What is the role of the community?
What is the reputation system?
What are the challenges?
Community of users
– Social system
Incentives and reputations
– Economic system
Poorly phrased, “grammatically” limited queries
– Language analysis
Improving user experience from past data
– Data mining
What are the sciences?
Information retrieval & language processing
Data Mining Six Degrees of Separation
Sociology and human-computer interaction
The Wisdom of the Crowds
The Rationale behind Web Mining
The Wisdom of Crowds
- James Surowiecki - 2004
– “Under the right circumstances, groups are remarkably
• Importance of diversity, independence and decentralization
– “large groups of people are smarter than an elite
few, no matter how brilliant—they are better at
solving problems, fostering innovation, coming to
wise decisions, even predicting the future”.
• How to deploy this in the next generation of social search and
– SEMEDIA video retrieval EU Project
(with BBC, Glasgow U., Smoke & Mirrors, Joanneum & UPF)
The wisdom of the crowds can be used to search
The principle is not new – anchor text is used in
“standard” search: when indexing a document D, include
anchor text from links pointing to D
Anchor text examples pointing to www.ibm.com:
– “Armonk, NY-based computer giant IBM announced today”
– “Big Blue today announced record profits for the quarter”
– “Joe’s computer hardware links: Compaq, HP, IBM”
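The anchor-text principle can be sketched as a small inverted index: terms from a page's own text are indexed under that page, and terms from the anchor text of each incoming link are indexed under the link's target. This is an illustrative sketch only. The helper names (`tokenize`, `build_index`, `search`), the linking-page URLs, and the body text given for www.ibm.com are assumptions, not real data.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages, links):
    """Build a term -> set of URLs index.

    pages: {url: body text}
    links: list of (source_url, target_url, anchor_text)
    """
    index = defaultdict(set)
    for url, body in pages.items():
        for term in tokenize(body):
            index[term].add(url)
    # Anchor text is credited to the page the link points to.
    for _source, target, anchor in links:
        for term in tokenize(anchor):
            index[term].add(target)
    return index


def search(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical data: the body text and the source URLs are made up.
pages = {"www.ibm.com": "Welcome to our home page"}
links = [
    ("news.example/armonk", "www.ibm.com", "IBM"),
    ("news.example/profits", "www.ibm.com", "Big Blue"),
    ("joe.example/links", "www.ibm.com", "IBM"),
]
index = build_index(pages, links)
hits = search(index, "big blue")
```

In this toy data the page never calls itself “Big Blue”, yet the query finds it, because other authors described it that way in their links.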
Quality and Frequency
Chris Anderson: “The Long Tail”. Hyperion, 2006.
Quality and Quantity
Chris Anderson: “The Long Tail”. Hyperion, 2006.
Chris Martin from Coldplay in The Rolling Stone, Fortieth Anniversary, July 2007.
“We think it's all about quality over quantity now, because there's so much noise everywhere, there's no point in putting anything out unless it's fucking amazing.”
The Push for Quality
¼ of questions ask for an opinion: informal polls
¾ of questions seek information or advice
Q. Su, D. Pavlov, J.-H. Chow, W. C. Baker. “Internet-scale collection of
at least one
There are top contributors ...
... but they don't have all the answers
What about real quality?
Question quality and answer quality are not independent
and can be predicted reasonably well (Castillo et al., 2008)
Influence Leadership (Bopal et al, 2008)
Influence of social graph in particular actions
– Social graph: Yahoo! Instant Messenger
– Actions log: Yahoo! Movies
• Action = user u rated movie m at time t
– joined through common user identifiers
Started from Yahoo! Instant Messenger subgraph of
“most active” users (110M nodes) and 21M ratings from
– Ended with 217.5K nodes, 221.4K edges and 1.8M ratings.
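One simple way to combine a social graph with an actions log is to count, for each user, how many times a friend performed the same action later. The sketch below illustrates that join under stated assumptions. It is not the method of the cited study, and the function name `influence_scores`, the scoring rule, and the sample users and movies are hypothetical.

```python
from collections import defaultdict


def influence_scores(edges, actions):
    """Count follow-up actions by friends.

    edges: iterable of (u, v) undirected friendship pairs
    actions: iterable of (user, movie, time), i.e. user rated movie at time
    Returns {user: number of (friend, movie) pairs where the friend
    rated the movie strictly after the user did}.
    """
    friends = defaultdict(set)
    for u, v in edges:
        friends[u].add(v)
        friends[v].add(u)

    # Keep only the earliest rating per (user, movie).
    first_time = {}
    for user, movie, t in actions:
        key = (user, movie)
        if key not in first_time or t < first_time[key]:
            first_time[key] = t

    scores = defaultdict(int)
    for (user, movie), t in first_time.items():
        for friend in friends.get(user, ()):
            friend_time = first_time.get((friend, movie))
            if friend_time is not None and friend_time > t:
                scores[user] += 1
    return dict(scores)


# Hypothetical graph and log, joined on the shared user identifiers.
edges = [("ana", "ben"), ("ana", "carla"), ("ben", "carla")]
actions = [
    ("ana", "m1", 1), ("ben", "m1", 2), ("carla", "m1", 3),
    ("ben", "m2", 1), ("ana", "m2", 5),
]
scores = influence_scores(edges, actions)
```

A count like this only shows that a friend acted later, not that the earlier action caused it, so it is best read as a rough signal.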
Leaders vs. Tribe leaders
The Wisdom of Crowds
Crucial for Search Ranking
Text content: Web Writers
– not only for the Web!
Links: Web Publishers
Annotations: Web 2.0 Users
– Tags, bookmarks, comments, ratings, etc.
Queries: All Web Users!
– Queries and actions
Query Intention (Broder, 2000)
Mining Queries for ...
Improved Web Search
User Driven Design
– Information Scent
– The Web Site that the Users Want
– The Web Site that You should Have
– Improve content & structure
Bootstrap of pseudo-semantic resources
Query Mining: Relating Similar Queries
Implicit Knowledge (Baeza-Yates et al, 2007)
Some Open Issues
• Implicit social network
– Any fundamental similarities?
• How to evaluate with partial knowledge?
– Data volume amplifies the problem
• User aggregation vs. personalization
– Optimize common tasks: help more people
– Move away from privacy issues
The Web is scientifically young
It is intellectually diverse
– The human element
– The social element
The technology mirrors the economic,
legal and sociological reality
Mirror of the Society
Exports/Imports vs. Domain Links
Web Spam Challenge:
Baeza-Yates & Castillo, WWW2006
• UK Web Collection
• Training set with thousands of
What’s next? Fourth generation:
From Information Retrieval
to Information Supply
Explicit demand for information driven by a user query
Increase use of context
Active information supply driven by user activity and
We are at Web 2.0 beta
People want to get tasks done
– Where do I go for an original holiday with 1,000
Take into account the context of the task
I want to book a vacation in Tuscany.