6/28/13
1
Big Data
in
The Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
- 3 -
Agenda
• Big Data
• Aski...
6/28/13
2
- 4 -
4
Big Data
§  Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable ti...
6/28/13
3
- 6 -
6
Big Data: The Five V’s
Characteristic Data Issue Computing Issue
Volume Scale,
Redundancy
Scalability
Va...
6/28/13
4
- 8 -
8
Too Much Data Available
§  The Web is a database!
§  Data does not imply information
§  Many analyses...
6/28/13
5
- 11 -
11
The Structure of the Web
- 12 -
Big Data in the Web
Metadata
RDF
Wikipedia ODP
Flickr
Text
Anchors + l...
6/28/13
6
- 13 -
Quantity
Quality
User-
generated
Traditional
publishing
What is in the Web? How Good it is?
- 14 -
14
Wha...
6/28/13
7
- 15 -
15
Noise and Spam
§  Noise may come from many places:
§  Instruments that measure
§  How we interpret ...
6/28/13
8
- 17 -
17
- 18 -
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
6/28/13
9
- 19 -
Web Data Trends
•  User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time ...
6/28/13
10
- 21 -
21
Web Data Mining
•  Content: text & multimedia mining
•  Structure: link analysis, graph mining
•  Usa...
6/28/13
11
- 23 -
Popularity
- 24 -
Flickr: Geo-tagged pictures
24
24
6/28/13
12
- 27 -
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communitie...
6/28/13
13
- 30 -
30
Scalability
§  How to scale?
§  Doubling the data in the best case will double the time
§  Time co...
6/28/13
14
- 32 -
We can suggest tags: nice but ....
- 33 -
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 c...
6/28/13
15
- 34 -
Risks of Privacy
(ZIP code, date of birth, gender)
is enough to identify 87% of
US citizens using public...
6/28/13
16
- 36 -
36
Sparsity
§  The Long Tail is always Sparse
§  Why there is a long tail?
§  When the crowd dominate...
6/28/13
17
- 39 -
The Long Tail
Most measures in the Web follow a power law
- 42 -
People
Interests
42
Heavy tail of user ...
6/28/13
18
- 43 -
Many queries, each asked very few times, make
up a large fraction of all queries
Applies to word usage, ...
6/28/13
19
- 45 -
When the crowd dominates
Kills the long tail
See (obsolete now)
“shwarzneger” example
45
- 46 -
Empoweri...
6/28/13
20
- 47 -
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sp...
6/28/13
21
- 49 -
•  The long tail is important not only for e-
commerce, but because we are all there
•  Personalization ...
6/28/13
22
- 70 -
70
Mirror of Society
- 71 -
71
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006
6/28/13
23
Contact: rbaeza@acm.org
Thanks to many people at Yahoo! Labs
ASIST 2012
Book of the
Year Award
Questions?
Upcoming SlideShare
Loading in …5
×

Big data in the web

384
-1

Published on

Keynote talk presented at 25th International Conference on Advanced Information Systems Engineering

Published in: Technology
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total Views
384
On Slideshare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
5
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide

Big data in the web

  1. 1. 6/28/13 1 Big Data in The Web Ricardo Baeza-Yates Yahoo! Labs Barcelona & Santiago de Chile - 3 - Agenda • Big Data • Asking the Right Questions • Wisdom of Crowds in the Web • The Long Tail • Issues and Examples • Concluding Remarks
  2. 2. 6/28/13 2 - 4 - 4 Big Data §  Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time §  Large volume and growth §  Petabytes to exabytes §  Growth is estimated in 3 exabytes per day §  Structured vs. non-structured data §  Diversity §  Types, formats, complexity, topics, etc. §  Best Public Data Example: The Web §  Content: text, multimedia §  Structure: graphs §  Usage: real time streams - 5 - 5 Big Data §  Focus on analytics §  Many storage technologies: §  DBs, DWs, distributed file systems, … §  Many processing technologies: §  Cloud computing, map-reduce (Hadoop), … §  Data mining, clustering, classification, … §  Machine learning, A/B testing, NLP, … §  Simulation §  Several technology providers §  Initial best practices (see TDWI report, 2011) §  Main challenges: scalability, online
  3. 3. 6/28/13 3 - 6 - 6 Big Data: The Five V’s Characteristic Data Issue Computing Issue Volume Scale, Redundancy Scalability Variety Heterogeneity, Complexity Adaptability, Extensibility Veracity Completeness, Bias, Sparsity, Noise, Spam Reliability, Trust Velocity Real time Online Value Usefulness, Privacy Business dependent - 7 - 7 Asking the Right Questions §  Problem Driven §  What data we need? How much? §  How we collect it? How we store and transfer it? §  Understanding the Data §  How sparse is the data? How much noise? §  There is redundancy? There are biases? §  There is spam? Any outliers? §  Analyzing the Data §  Any privacy issues? Do we need to anonymize? §  How well our algorithms scale? §  Can we visualize the results?
  4. 4. 6/28/13 4 - 8 - 8 Too Much Data Available §  The Web is a database! §  Data does not imply information §  Many analyses for the sake of it (data driven) §  Analyzing data is not CS per se §  Publish in the right forum! §  Big Data or Right Data? - 9 - 9 The Different Facets of the Web
  5. 5. 6/28/13 5 - 11 - 11 The Structure of the Web - 12 - Big Data in the Web Metadata RDF Wikipedia ODP Flickr Text Anchors + links Y! Answers Logs (Clicks+Queries) Explicit Implicit Wordnet UGC Private Scale Blogs, Groups Quality?
  6. 6. 6/28/13 6 - 13 - Quantity Quality User- generated Traditional publishing What is in the Web? How Good it is? - 14 - 14 What else is in the Web?
  7. 7. 6/28/13 7 - 15 - 15 Noise and Spam §  Noise may come from many places: §  Instruments that measure §  How we interpret the data (example later) §  Spam is everywhere - 16 - 16 Web Spam Deceiving text, links, clicks… due to an economic incentive Depending on the goal and the data, spam is easier to generate Depending on the type & target data, spam is easier to fight Disincentives for spammers? •  Social •  Economical Web Spam is NOT Mail Spam
  8. 8. 6/28/13 8 - 17 - 17 - 18 - Content and Metadata Trends [Ramakrishnan and Tomkins 2007]
  9. 9. 6/28/13 9 - 19 - Web Data Trends •  User Generated Content – Massive (quality vs. quantity) – Social Networks – Real time (people + physical sensors) •  Impact – Fragmentation of ownership – Fragmentation of access (longer heavy tail) – Fragmentation of right to access •  Viability – Business model based in advertising - 20 - The Wisdom of Crowds •  James Surowiecki, a New Yorker columnist, published this book in 2004 – “Under the right circumstances, groups are remarkably intelligent” •  Importance of diversity, independence and decentralization “large groups of people are smarter than an elite few, no matter how brilliant—they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”. Aggregating data
  10. 10. 6/28/13 10 - 21 - 21 Web Data Mining •  Content: text & multimedia mining •  Structure: link analysis, graph mining •  Usage: log analysis, query mining •  Relate all of the above – Web characterization – Particular applications - 22 - Flickr: Clustering Pictures 22
  11. 11. 6/28/13 11 - 23 - Popularity - 24 - Flickr: Geo-tagged pictures 24 24
  12. 12. 6/28/13 12 - 27 - “Crowd Sourcing” Web-based “peer production” has produced a number of successful products and communities •  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ... Can this form of production be harnessed for other ends? •  Existing successes are hard to replicate at will Amazon Mechanical Turk (AMT) •  Like outsourcing, but in a micro-distributed fashion •  Thousands of “turkers” working on hundreds of “HITS” (tasks) •  Rates are typically few cents per task •  Quality of their work is positively evaluated (e.g. in IR) - 28 - The Wisdom of (Large) Crowds – Crucial for Search Ranking – Text: Web Writers & Editors • not only for the Web! – Links: Web Publishers – Tags: Web Taggers – Queries: All Web Users! • Queries and actions (or no action!)‫‏‬ The crowd implicitly knows the experts!
  13. 13. 6/28/13 13 - 30 - 30 Scalability §  How to scale? §  Doubling the data in the best case will double the time §  Time complexity vs. result quality trade-off §  Example: entity detection in linear time at almost state of the art quality §  That implies that there exists a text size n* for which the linear algorithm will produce more correct entities §  Distributed parallel processing §  Map-reduce not always works §  Parallelism is problem dependent §  Online processing needs a different approach - 31 - 31 Redundancy and Bias §  There is any dependency in the data? §  There is any duplication? §  Lexical duplication in the Web is around 25% §  Semantic duplication is larger §  Are there any biases? §  Example 1: clicks in search engines §  Bias to the ranking and the interface §  There is a ranking bias in the Web content §  Example 2: tag recommendation
  14. 14. 6/28/13 14 - 32 - We can suggest tags: nice but .... - 33 - Privacy Example: AOL Query Logs Release Incident No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”. Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.” Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs. A Face Is Exposed for AOL Searcher No. 4417749, By MICHAEL BARBARO and TOM ZELLER Jr, The New York Times, Aug 9 2006 33
  15. 15. 6/28/13 15 - 34 - Risks of Privacy (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DB (Sweeney, 2001) K-anonymity Suppress or generalize attributes until each entry is identical to at least k-1 other entries Federal Trade Commission in US: Privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010. Data Protection Directive in EU 34 - 35 - Risks of Privacy: Query Logs Profile: [Jones, Kumar, Pang, Tompkins, CIKM 2007] •  Gender: 84% •  Age (±10): 79% •  Location (ZIP3): 35% Vanity Queries: [Jones et al, CIKM 2008] •  Partial name: 8.9% •  Complete: 1.2% More information: •  A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008] A good anonymization is still an open problem
  16. 16. 6/28/13 16 - 36 - 36 Sparsity §  The Long Tail is always Sparse §  Why there is a long tail? §  When the crowd dominates §  Empowering the tail §  Example: Relations from Query Logs - 38 - The Wisdom of Crowds – Popularity – Diversity – Quality – Coverage Long tail Heavy tail
  17. 17. 6/28/13 17 - 39 - The Long Tail Most measures in the Web follow a power law - 42 - People Interests 42 Heavy tail of user interests Many queries, each asked very few times, make up a large fraction of all queries Movies watched, blogs read, words used, … Normal people Weirdos One explanation
  18. 18. 6/28/13 18 - 43 - Many queries, each asked very few times, make up a large fraction of all queries Applies to word usage, web page access, … We are all partially eclectic People Interests Broder, Gabrilovich, Goel, Pang; WSDM 2009 The reality Heavy tail of user interests - 44 - Example: Click Distribution User interaction is a power law! (Zipf’s principle of minimal effort)
  19. 19. 6/28/13 19 - 45 - When the crowd dominates Kills the long tail See (obsolete now) “shwarzneger” example 45 - 46 - Empowering the Tail The Filter “Bubble”, Eli Pariser •  Avoid the Poor get Poorer Syndrome Solutions: •  Diversity •  Novelty •  Serendipity 46 Explore & Exploit
  20. 20. 6/28/13 20 - 47 - How to Circumvent Sparsity? Wisdom of “ad-hoc” crowds? Aggregate data in the “right way” When data is sparse Aggregate users around same intent, task, facet, …. Change granularity “ad hoc” •  Middle age men •  Fans of Messi 47 - 48 - 48 Example: Mining Geo/time Data •  Optimal Touristic Paths from Flickr •  Good for tourists and locals De Choudhury et al, HT 2010
  21. 21. 6/28/13 21 - 49 - •  The long tail is important not only for e- commerce, but because we are all there •  Personalization vs. Contextualization User interaction is another long tail People Interests Aggregating in the Long Tail - 69 - 69 Epilogue l The Web is scientifically young l The Web is intellectually diverse l The technology mirrors the economic, legal and sociological reality l  Data must be interesting! (Gerhard Weikum) l  Problem driven l  Plenty of challenges
  22. 22. 6/28/13 22 - 70 - 70 Mirror of Society - 71 - 71 Exports/Imports vs. Domain Links Baeza-Yates & Castillo, WWW2006
  23. 23. 6/28/13 23 Contact: rbaeza@acm.org Thanks to many people at Yahoo! Labs ASIST 2012 Book of the Year Award Questions?

×