Big data in the web

Keynote talk presented at 25th International Conference on Advanced Information Systems Engineering

Published in: Technology

Transcript

  • 1. Big Data in The Web (6/28/13)
    Ricardo Baeza-Yates, Yahoo! Labs, Barcelona & Santiago de Chile
    Agenda
    • Big Data
    • Asking the Right Questions
    • Wisdom of Crowds in the Web
    • The Long Tail
    • Issues and Examples
    • Concluding Remarks
  • 2. Big Data
    • Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
    • Large volume and growth
      – Petabytes to exabytes
      – Growth is estimated at 3 exabytes per day
    • Structured vs. non-structured data
    • Diversity
      – Types, formats, complexity, topics, etc.
    • Best public data example: the Web
      – Content: text, multimedia
      – Structure: graphs
      – Usage: real-time streams
    Big Data (continued)
    • Focus on analytics
    • Many storage technologies
      – DBs, DWs, distributed file systems, …
    • Many processing technologies
      – Cloud computing, map-reduce (Hadoop), …
      – Data mining, clustering, classification, …
      – Machine learning, A/B testing, NLP, …
      – Simulation
    • Several technology providers
    • Initial best practices (see TDWI report, 2011)
    • Main challenges: scalability and online processing
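The map-reduce model named among the processing technologies can be illustrated with a word count. This is a minimal single-process sketch of the idea, not Hadoop's API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

def word_count(documents):
    # In a real cluster the map calls would run in parallel on different machines.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle(pairs))

counts = word_count(["big data in the web", "the web is big"])
```

The map step works on each document independently, which is what makes the computation easy to distribute; problems without that independence are harder to fit into this model.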
  • 3. Big Data: The Five V’s
    Characteristic   Data Issue                                   Computing Issue
    Volume           Scale, Redundancy                            Scalability
    Variety          Heterogeneity, Complexity                    Adaptability, Extensibility
    Veracity         Completeness, Bias, Sparsity, Noise, Spam    Reliability, Trust
    Velocity         Real time                                    Online
    Value            Usefulness, Privacy                          Business dependent
    Asking the Right Questions
    • Problem driven
      – What data do we need? How much?
      – How do we collect it? How do we store and transfer it?
    • Understanding the data
      – How sparse is the data? How much noise?
      – Is there redundancy? Are there biases?
      – Is there spam? Any outliers?
    • Analyzing the data
      – Any privacy issues? Do we need to anonymize?
      – How well do our algorithms scale?
      – Can we visualize the results?
  • 4. Too Much Data Available
    • The Web is a database!
    • Data does not imply information
    • Many analyses for the sake of it (data driven)
    • Analyzing data is not CS per se
    • Publish in the right forum!
    • Big Data or Right Data?
    The Different Facets of the Web [figure]
  • 5. The Structure of the Web [figure]
    Big Data in the Web [diagram; labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Wordnet, UGC, Blogs, Groups; annotations: Explicit, Implicit, Private, Scale, Quality?]
  • 6. What Is in the Web? How Good Is It? [figure; labels: Quantity, Quality, User-generated, Traditional publishing]
    What Else Is in the Web? [figure]
  • 7. Noise and Spam
    • Noise may come from many places:
      – Instruments that measure
      – How we interpret the data (example later)
    • Spam is everywhere
    Web Spam
    • Deceiving text, links, clicks… due to an economic incentive
    • Depending on the goal and the data, spam is easier to generate
    • Depending on the type and target data, spam is easier to fight
    • Disincentives for spammers?
      – Social
      – Economic
    • Web spam is NOT mail spam
  • 8. Content and Metadata Trends [Ramakrishnan and Tomkins 2007]
  • 9. Web Data Trends
    • User Generated Content
      – Massive (quality vs. quantity)
      – Social networks
      – Real time (people + physical sensors)
    • Impact
      – Fragmentation of ownership
      – Fragmentation of access (longer heavy tail)
      – Fragmentation of right to access
    • Viability
      – Business model based on advertising
    The Wisdom of Crowds
    • James Surowiecki, a New Yorker columnist, published this book in 2004
      – “Under the right circumstances, groups are remarkably intelligent”
    • Importance of diversity, independence and decentralization
    • “large groups of people are smarter than an elite few, no matter how brilliant; they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”
    • Aggregating data
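The "aggregating data" point can be illustrated with a toy estimate. The sketch below takes the median of several independent guesses; the guesses and the true value are made up for illustration, and the benefit shown depends on the errors being diverse and independent, as the slide stresses.

```python
from statistics import median

true_value = 100
# Hypothetical independent guesses from seven people (illustrative numbers only).
guesses = [70, 85, 95, 102, 110, 120, 160]

# Aggregate the crowd with a robust statistic.
crowd_estimate = median(guesses)
crowd_error = abs(crowd_estimate - true_value)

# Compare against how far a typical individual is from the truth.
individual_errors = [abs(g - true_value) for g in guesses]
typical_individual_error = median(individual_errors)
```

With these numbers the aggregated estimate lands closer to the true value than the typical individual guess. If everyone shared the same bias, aggregation would not remove it.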
  • 10. Web Data Mining
    • Content: text & multimedia mining
    • Structure: link analysis, graph mining
    • Usage: log analysis, query mining
    • Relate all of the above
      – Web characterization
      – Particular applications
    Flickr: Clustering Pictures [figure]
  • 11. Popularity [figure]
    Flickr: Geo-tagged Pictures [figure]
  • 12. “Crowd Sourcing”
    • Web-based “peer production” has produced a number of successful products and communities
      – Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
    • Can this form of production be harnessed for other ends?
      – Existing successes are hard to replicate at will
    • Amazon Mechanical Turk (AMT)
      – Like outsourcing, but in a micro-distributed fashion
      – Thousands of “turkers” working on hundreds of “HITs” (tasks)
      – Rates are typically a few cents per task
      – Quality of their work is positively evaluated (e.g. in IR)
    The Wisdom of (Large) Crowds
    • Crucial for search ranking
    • Text: Web writers & editors
      – Not only for the Web!
    • Links: Web publishers
    • Tags: Web taggers
    • Queries: all Web users!
      – Queries and actions (or no action!)
    • The crowd implicitly knows the experts!
  • 13. Scalability
    • How to scale?
      – In the best case, doubling the data will double the time
    • Time complexity vs. result quality trade-off
      – Example: entity detection in linear time at almost state-of-the-art quality
      – That implies that there exists a text size n* for which the linear algorithm will produce more correct entities
    • Distributed parallel processing
      – Map-reduce does not always work
      – Parallelism is problem dependent
    • Online processing needs a different approach
    Redundancy and Bias
    • Is there any dependency in the data?
    • Is there any duplication?
      – Lexical duplication in the Web is around 25%
      – Semantic duplication is larger
    • Are there any biases?
      – Example 1: clicks in search engines
        Bias toward the ranking and the interface
        There is a ranking bias in the Web content
      – Example 2: tag recommendation
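One way to read the n* claim is as a comparison under a fixed time budget: in the time a linear algorithm needs for n tokens, a slower but slightly more accurate algorithm finishes only part of the text. The sketch below assumes, purely for illustration, that the slower algorithm is quadratic; the quality values and cost constants are hypothetical and do not come from the talk.

```python
import math

def correct_fast(n, q_fast):
    """Correct entities found by the linear-time algorithm on n tokens."""
    return q_fast * n

def correct_slow_same_time(n, q_slow, c_fast, c_slow):
    """Correct entities found by an (assumed) quadratic algorithm in the same time."""
    budget = c_fast * n                       # time the fast algorithm needs for n tokens
    m = min(n, math.isqrt(budget // c_slow))  # largest m with c_slow * m**2 <= budget
    return q_slow * m

def crossover(q_fast, q_slow, c_fast, c_slow, limit=10**6):
    """First text size at which the linear algorithm yields more correct entities."""
    for n in range(1, limit + 1):
        if correct_fast(n, q_fast) > correct_slow_same_time(n, q_slow, c_fast, c_slow):
            return n
    return None

# Hypothetical setting: the fast method is less accurate and has a larger constant.
n_star = crossover(q_fast=0.90, q_slow=0.95, c_fast=100, c_slow=1)
```

For small texts the slower algorithm finishes everything and wins on quality; beyond the crossover it runs out of time and the linear algorithm returns more correct entities overall.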
  • 14. We can suggest tags: nice but .... [figure]
    Privacy Example: AOL Query Logs Release Incident
    • No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.
    • Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
    • Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.
    • Source: “A Face Is Exposed for AOL Searcher No. 4417749”, by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9 2006
  • 15. Risks of Privacy
    • (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DBs (Sweeney, 2001)
    • K-anonymity
      – Suppress or generalize attributes until each entry is identical to at least k-1 other entries
    • Federal Trade Commission in US: privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010
    • Data Protection Directive in EU
    Risks of Privacy: Query Logs
    • Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
      – Gender: 84%
      – Age (±10): 79%
      – Location (ZIP3): 35%
    • Vanity queries: [Jones et al, CIKM 2008]
      – Partial name: 8.9%
      – Complete: 1.2%
    • More information:
      – A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
    • Good anonymization is still an open problem
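The k-anonymity definition above can be made concrete with a small check. This is a minimal sketch under stated assumptions: the records are invented, the field names `zip`, `birth`, and `gender` are hypothetical, and `generalize` shows only one possible generalization (truncating the ZIP code and keeping the birth year).

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth", "gender")

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[a] for a in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize(record):
    """Coarsen the quasi-identifiers: 3-digit ZIP prefix and birth year only."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out

# Made-up records for illustration.
records = [
    {"zip": "12345", "birth": "1944-03-02", "gender": "F"},
    {"zip": "12399", "birth": "1944-11-19", "gender": "F"},
    {"zip": "54301", "birth": "1980-06-30", "gender": "M"},
    {"zip": "54377", "birth": "1980-01-15", "gender": "M"},
]

raw_ok = is_k_anonymous(records, QUASI_IDENTIFIERS, k=2)
generalized = [generalize(r) for r in records]
generalized_ok = is_k_anonymous(generalized, QUASI_IDENTIFIERS, k=2)
```

In the raw records every person is unique on the three attributes; after generalization each record shares its values with one other record, so the table is 2-anonymous but not 3-anonymous.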
  • 16. Sparsity
    • The long tail is always sparse
    • Why is there a long tail?
    • When the crowd dominates
    • Empowering the tail
    • Example: relations from query logs
    The Wisdom of Crowds [figure; labels: Popularity, Diversity, Quality, Coverage, Long tail, Heavy tail]
  • 17. The Long Tail
    • Most measures in the Web follow a power law
    People Interests: one explanation
    • Heavy tail of user interests
    • Many queries, each asked very few times, make up a large fraction of all queries
    • Movies watched, blogs read, words used, …
    • [figure; labels: Normal people, Weirdos]
  • 18. People Interests: the reality
    • Heavy tail of user interests
    • Many queries, each asked very few times, make up a large fraction of all queries
    • Applies to word usage, web page access, …
    • We are all partially eclectic
    • Broder, Gabrilovich, Goel, Pang; WSDM 2009
    Example: Click Distribution
    • User interaction is a power law! (Zipf’s principle of minimal effort)
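The statement that many rare queries make up a large fraction of all queries can be checked numerically for an idealized power law. The sketch below assumes a Zipf-like distribution where the query at rank r has frequency proportional to 1/r^s; the vocabulary size, head size, and exponent are illustrative assumptions, not measurements from the talk.

```python
def tail_share(num_items, head_size, exponent=1.0):
    """Fraction of total volume contributed by items ranked below the head."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    total = sum(weights)
    return sum(weights[head_size:]) / total

# Hypothetical setting: one million distinct queries, the top 1,000 form the "head".
share = tail_share(num_items=1_000_000, head_size=1_000)
```

Under these assumptions the queries outside the top 1,000 still account for a large share of the total volume, even though each of them is individually rare.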
  • 19. When the Crowd Dominates
    • Kills the long tail
    • See (obsolete now) “shwarzneger” example
    Empowering the Tail
    • The Filter “Bubble”, Eli Pariser
      – Avoid the “poor get poorer” syndrome
    • Solutions:
      – Diversity
      – Novelty
      – Serendipity
    • Explore & Exploit
  • 20. How to Circumvent Sparsity?
    • Wisdom of “ad-hoc” crowds?
    • Aggregate data in the “right way”
    • When data is sparse
      – Aggregate users around the same intent, task, facet, …
    • Change granularity “ad hoc”
      – Middle-aged men
      – Fans of Messi
    Example: Mining Geo/time Data
    • Optimal touristic paths from Flickr
    • Good for tourists and locals
    • De Choudhury et al, HT 2010
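The idea of aggregating users around a shared facet when individual data is sparse can be sketched as a simple back-off rule. Everything here is hypothetical: the event format, the `group_of` mapping, and the `min_events` threshold are assumptions made for the example.

```python
from collections import Counter

def top_item(events, user, group_of, min_events=3):
    """Most frequent item for a user, backing off to the user's group when data is sparse."""
    own = [e for e in events if e["user"] == user]
    if len(own) >= min_events:
        pool = own                      # enough personal data: use it directly
    else:
        group = group_of[user]          # too sparse: aggregate over the "ad hoc" crowd
        pool = [e for e in events if group_of[e["user"]] == group]
    return Counter(e["item"] for e in pool).most_common(1)[0][0]

# Made-up click events; all three users belong to the same hypothetical group.
group_of = {"u1": "messi_fans", "u2": "messi_fans", "u3": "messi_fans"}
events = (
    [{"user": "u1", "item": "a"}]
    + [{"user": "u2", "item": "c"}] * 3
    + [{"user": "u2", "item": "b"}]
    + [{"user": "u3", "item": "b"}] * 4
)

sparse_user_pick = top_item(events, "u1", group_of)
dense_user_pick = top_item(events, "u2", group_of)
```

The user with a single event receives the group's most popular item, while the user with enough history keeps a personal result. Choosing the group and the threshold is the "change granularity" decision the slide refers to.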
  • 21. Aggregating in the Long Tail
    • The long tail is important not only for e-commerce, but because we are all there
    • Personalization vs. contextualization
    • User interaction is another long tail
    • [figure; labels: People, Interests]
    Epilogue
    • The Web is scientifically young
    • The Web is intellectually diverse
    • The technology mirrors the economic, legal and sociological reality
    • Data must be interesting! (Gerhard Weikum)
    • Problem driven
    • Plenty of challenges
  • 22. Mirror of Society [figure]
    Exports/Imports vs. Domain Links [figure] (Baeza-Yates & Castillo, WWW2006)
  • 23. Questions?
    • Contact: rbaeza@acm.org
    • Thanks to many people at Yahoo! Labs
    • ASIST 2012 Book of the Year Award
