6/28/13
Big Data in the Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks
Big Data
§  Capture, transfer, store, search, share, analyze,
and visualize large data in reasonable time
§  Large volume and growth
§  Petabytes to exabytes
§  Growth is estimated at 3 exabytes per day
§  Structured vs. unstructured data
§  Diversity
§  Types, formats, complexity, topics, etc.
§  Best Public Data Example: The Web
§  Content: text, multimedia
§  Structure: graphs
§  Usage: real time streams
Big Data
§  Focus on analytics
§  Many storage technologies:
§  DBs, DWs, distributed file systems, …
§  Many processing technologies:
§  Cloud computing, map-reduce (Hadoop), …
§  Data mining, clustering, classification, …
§  Machine learning, A/B testing, NLP, …
§  Simulation
§  Several technology providers
§  Initial best practices (see TDWI report, 2011)
§  Main challenges: scalability, online processing
Big Data: The Five V’s
Characteristic | Data Issue                                | Computing Issue
Volume         | Scale, Redundancy                         | Scalability
Variety        | Heterogeneity, Complexity                 | Adaptability, Extensibility
Veracity       | Completeness, Bias, Sparsity, Noise, Spam | Reliability, Trust
Velocity       | Real time                                 | Online
Value          | Usefulness, Privacy                       | Business dependent
Asking the Right Questions
§  Problem Driven
§  What data do we need? How much?
§  How do we collect it? How do we store and transfer it?
§  Understanding the Data
§  How sparse is the data? How much noise?
§  Is there redundancy? Are there biases?
§  Is there spam? Any outliers?
§  Analyzing the Data
§  Any privacy issues? Do we need to anonymize?
§  How well do our algorithms scale?
§  Can we visualize the results?
Too Much Data Available
§  The Web is a database!
§  Data does not imply information
§  Many analyses for the sake of it (data driven)
§  Analyzing data is not CS per se
§  Publish in the right forum!
§  Big Data or Right Data?
The Different Facets of the Web
The Structure of the Web
Big Data in the Web
[Figure: data sources in the Web. Labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Wordnet, UGC, Private, Blogs, Groups. Annotations: Explicit vs. Implicit, Scale, Quality?]
What is in the Web? How Good is it?
[Figure: quantity vs. quality, user-generated content vs. traditional publishing]
What else is in the Web?
Noise and Spam
§  Noise may come from many places:
§  Instruments that measure
§  How we interpret the data (example later)
§  Spam is everywhere
Web Spam
Deceiving text, links, clicks…
due to an economic incentive
Depending on the goal and the data,
spam is easier to generate
Depending on the type & target data,
spam is easier to fight
Disincentives for spammers?
•  Social
•  Economic
Web Spam is NOT Mail Spam
Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]
Web Data Trends
•  User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
•  Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
•  Viability
– Business model based on advertising
The Wisdom of Crowds
•  James Surowiecki, a New Yorker columnist,
published this book in 2004
– “Under the right circumstances, groups are
remarkably intelligent”
•  Importance of diversity, independence and
decentralization
“large groups of people are smarter than an elite few,
no matter how brilliant—they are better at solving
problems, fostering innovation, coming to wise
decisions, even predicting the future”.
Aggregating data
Web Data Mining
•  Content: text & multimedia mining
•  Structure: link analysis, graph mining
•  Usage: log analysis, query mining
•  Relate all of the above
– Web characterization
– Particular applications
Flickr: Clustering Pictures
Popularity
Flickr: Geo-tagged pictures
“Crowd Sourcing”
Web-based “peer production” has produced a number of
successful products and communities
•  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
•  Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
•  Like outsourcing, but in a micro-distributed fashion
•  Thousands of “turkers” working on hundreds of “HITs” (tasks)
•  Rates are typically a few cents per task
•  The quality of their work has been evaluated positively (e.g., in IR)
The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)
The crowd implicitly
knows the experts!
Scalability
§  How to scale?
§  In the best case, doubling the data doubles the time
§  Time complexity vs. result quality trade-off
§  Example: entity detection in linear time at almost state-of-the-art quality
§  This implies that there exists a text size n* for which the linear algorithm produces more correct entities
§  Distributed parallel processing
§  Map-reduce does not always work
§  Parallelism is problem dependent
§  Online processing needs a different approach
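The time vs. quality trade-off above can be illustrated with a toy cost model. This is only a sketch under loudly stated assumptions that are not in the slides: a fixed time budget, a slower algorithm with quadratic cost, and made-up constants and quality values. Under that reading, the slower algorithm can only process part of a large text, so beyond some size n* the linear algorithm returns more correct entities despite its lower quality.

```python
import math

def processed_linear(n, budget, c_lin):
    """Text units a linear-time algorithm (cost c_lin * n) handles within the budget."""
    return min(n, budget / c_lin)

def processed_quadratic(n, budget, c_quad):
    """Text units a quadratic-time algorithm (cost c_quad * n**2) handles within the budget."""
    return min(n, math.sqrt(budget / c_quad))

def crossover_size(budget, c_lin, q_lin, c_quad, q_quad):
    """Text size n* beyond which the linear algorithm yields more correct entities.

    Correct entities are modelled as quality * text processed (an assumption).
    Returns None if the linear algorithm never overtakes within the budget.
    """
    cap_quad = math.sqrt(budget / c_quad)   # most text the quadratic algorithm can finish
    n_star = q_quad * cap_quad / q_lin      # where q_lin * n equals q_quad * cap_quad
    if n_star >= budget / c_lin:            # linear algorithm runs out of budget first
        return None
    return n_star

# Hypothetical numbers: quality 0.85 (linear) vs. 0.90 (quadratic).
n_star = crossover_size(budget=1e6, c_lin=1.0, q_lin=0.85, c_quad=0.01, q_quad=0.90)
print(n_star)
```

For texts smaller than the quadratic algorithm's capacity, both algorithms finish and the higher-quality one wins; the crossover only appears once the budget starts to bind.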
Redundancy and Bias
§  Is there any dependency in the data?
§  Is there any duplication?
§  Lexical duplication in the Web is around 25%
§  Semantic duplication is larger
§  Are there any biases?
§  Example 1: clicks in search engines
§  Bias to the ranking and the interface
§  There is a ranking bias in the Web content
§  Example 2: tag recommendation
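Lexical duplication, as mentioned above, is often estimated by comparing sets of word shingles. The sketch below shows that general technique (w-shingling with Jaccard similarity); it is not necessarily how the 25% figure was measured, and the shingle width and threshold are arbitrary assumptions.

```python
def shingles(text, w=3):
    """Set of w-word shingles (contiguous word sequences) of the lowercased text."""
    words = text.lower().split()
    if len(words) < w:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: |intersection| / |union|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc_a, doc_b, w=3, threshold=0.8):
    """Flag two documents as lexical near-duplicates (the threshold is an assumed value)."""
    return jaccard(shingles(doc_a, w), shingles(doc_b, w)) >= threshold

print(jaccard(shingles("the quick brown fox jumps"),
              shingles("the quick brown fox leaps")))
```

Semantic duplication (same meaning, different words) would not be caught by this kind of comparison.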
We can suggest tags: nice but ....
Privacy Example:
AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a
three-month period on topics ranging from “numb
fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several
people with the last name Arnold and “homes sold
in shadow lake subdivision gwinnett county
georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow
who lives in Lilburn, Ga., frequently researches her
friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749,
By MICHAEL BARBARO and TOM ZELLER Jr,
The New York Times, Aug 9 2006
Risks of Privacy
(ZIP code, date of birth, gender)
is enough to identify 87% of
US citizens using public databases
(Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until
each entry is identical to at least k-1
other entries
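The k-anonymity definition above can be checked mechanically. The following is a minimal sketch with made-up records and a hypothetical `generalize` helper (truncating the ZIP code and keeping only the birth year); real systems choose generalizations far more carefully.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize(record):
    """Hypothetical generalization: keep 3 ZIP digits and only the year of birth."""
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out

# Made-up example data, not taken from any real dataset.
records = [
    {"zip": "12345", "birth": "1950-03-02", "gender": "F"},
    {"zip": "12399", "birth": "1950-11-17", "gender": "F"},
    {"zip": "54321", "birth": "1980-06-09", "gender": "M"},
    {"zip": "54388", "birth": "1980-01-23", "gender": "M"},
]
QI = ("zip", "birth", "gender")

print(is_k_anonymous(records, QI, 2))
print(is_k_anonymous([generalize(r) for r in records], QI, 2))
```

In the raw records every (ZIP, birth date, gender) triple is unique; after generalization each triple is shared by two records, at the cost of coarser data.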
Federal Trade Commission in
US: Privacy policies should
“address the collection of data
itself and not just how the
data is used”, Dec 2010.
Data Protection Directive in EU
Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
•  Gender: 84%
•  Age (±10): 79%
•  Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
•  Partial name: 8.9%
•  Complete: 1.2%
More information:
•  A Survey of query log privacy-enhancing techniques
from a policy perspective [Cooper, ACM TWEB 2008]
Good anonymization is still an open problem
Sparsity
§  The Long Tail is always Sparse
§  Why is there a long tail?
§  When the crowd dominates
§  Empowering the tail
§  Example: Relations from Query Logs
The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail
The Long Tail
Most measures in the Web follow a power law
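To see why a power law produces a long tail that matters, the sketch below assumes an idealized Zipf distribution (the item at rank r has frequency proportional to 1 / r^s) and measures how much of the total volume comes from items outside the head. The exponent, the number of items and the head size are illustrative assumptions, not measurements from the talk.

```python
def zipf_weights(n_items, exponent=1.0):
    """Unnormalized Zipf frequencies: the item at rank r gets weight 1 / r**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_items + 1)]

def tail_share(n_items, head_size, exponent=1.0):
    """Fraction of the total volume contributed by items ranked after the top head_size."""
    weights = zipf_weights(n_items, exponent)
    return sum(weights[head_size:]) / sum(weights)

# Assumed setting: one million distinct items, head = the 1,000 most popular ones.
print(tail_share(1_000_000, 1_000, exponent=1.0))
```

A larger exponent concentrates more volume in the head, so the tail share shrinks as the exponent grows.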
Heavy tail of user interests
Many queries, each asked very few times, make
up a large fraction of all queries
Movies watched, blogs read, words used, …
One explanation
[Figure: people vs. interests, split into “normal people” and “weirdos”]
Heavy tail of user interests
Many queries, each asked very few times, make
up a large fraction of all queries
Applies to word usage, web page access, …
The reality
We are all partially eclectic
[Figure: people vs. interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009
Example: Click Distribution
User interaction follows a power law!
(Zipf’s principle of least effort)
When the crowd dominates
Kills the long tail
See (obsolete now)
“shwarzneger” example
Empowering the Tail
The Filter “Bubble”, Eli Pariser
•  Avoid the Poor get Poorer Syndrome
Solutions:
•  Diversity
•  Novelty
•  Serendipity
Explore & Exploit
How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around the same intent, task, facet, …
Change granularity “ad hoc”
•  Middle-aged men
•  Fans of Messi
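Changing granularity when data is sparse can be sketched as a simple back-off: use a user's own statistics when there are enough observations, otherwise fall back to an "ad hoc" crowd the user belongs to. Everything below (field names, the segment label, the minimum-observation threshold) is a hypothetical illustration, not the method used in the talk.

```python
from collections import defaultdict

def aggregate(events, key):
    """Sum clicks and impressions per value of `key` (e.g. 'user' or 'segment')."""
    totals = defaultdict(lambda: [0, 0])  # key value -> [clicks, impressions]
    for event in events:
        entry = totals[event[key]]
        entry[0] += event["clicked"]
        entry[1] += 1
    return totals

def click_rate(events, user, segment, min_impressions=3):
    """Use the user's own rate if there is enough data, else back off to the segment."""
    per_user = aggregate(events, "user")
    per_segment = aggregate(events, "segment")
    clicks, shown = per_user.get(user, [0, 0])
    if shown >= min_impressions:
        return clicks / shown, "user"
    clicks, shown = per_segment.get(segment, [0, 0])
    if shown >= min_impressions:
        return clicks / shown, "segment"
    return None, "insufficient data"

# Made-up interaction log: each user alone is too sparse, the segment is not.
events = [
    {"user": "u1", "segment": "fans_of_messi", "clicked": 1},
    {"user": "u2", "segment": "fans_of_messi", "clicked": 0},
    {"user": "u3", "segment": "fans_of_messi", "clicked": 1},
    {"user": "u3", "segment": "fans_of_messi", "clicked": 1},
]
print(click_rate(events, "u1", "fans_of_messi"))
```

The choice of segment is the "right way" of aggregating; a poorly chosen group gives denser but less relevant statistics.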
Example: Mining Geo/time Data
•  Optimal Touristic Paths from Flickr
•  Good for tourists and locals
De Choudhury et al, HT 2010
Aggregating in the Long Tail
•  The long tail is important not only for e-commerce, but because we are all there
•  Personalization vs. Contextualization
User interaction is another long tail
[Figure: people vs. interests]
Epilogue
l The Web is scientifically young
l The Web is intellectually diverse
l The technology mirrors the economic, legal and
sociological reality
l  Data must be interesting! (Gerhard Weikum)
l  Problem driven
l  Plenty of challenges
Mirror of Society
- 71 -
71
Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006
Contact: rbaeza@acm.org
Thanks to many people at Yahoo! Labs
ASIST 2012
Book of the
Year Award
Questions?

 
Mind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AIMind map of terminologies used in context of Generative AI
Mind map of terminologies used in context of Generative AI
 

Big data in the web

  • 1. 6/28/13
      Big Data in The Web
      Ricardo Baeza-Yates, Yahoo! Labs, Barcelona & Santiago de Chile
      Agenda
      • Big Data
      • Asking the Right Questions
      • Wisdom of Crowds in the Web
      • The Long Tail
      • Issues and Examples
      • Concluding Remarks
  • 2. Big Data
      §  Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
      §  Large volume and growth
         §  Petabytes to exabytes
         §  Growth is estimated at 3 exabytes per day
      §  Structured vs. unstructured data
      §  Diversity
         §  Types, formats, complexity, topics, etc.
      §  Best public data example: the Web
         §  Content: text, multimedia
         §  Structure: graphs
         §  Usage: real-time streams
      Big Data (continued)
      §  Focus on analytics
      §  Many storage technologies:
         §  DBs, DWs, distributed file systems, …
      §  Many processing technologies:
         §  Cloud computing, map-reduce (Hadoop), …
         §  Data mining, clustering, classification, …
         §  Machine learning, A/B testing, NLP, …
         §  Simulation
      §  Several technology providers
      §  Initial best practices (see TDWI report, 2011)
      §  Main challenges: scalability, online
  • 3. Big Data: The Five V’s
      Characteristic   Data Issue                                   Computing Issue
      Volume           Scale, Redundancy                            Scalability
      Variety          Heterogeneity, Complexity                    Adaptability, Extensibility
      Veracity         Completeness, Bias, Sparsity, Noise, Spam    Reliability, Trust
      Velocity         Real time                                    Online
      Value            Usefulness, Privacy                          Business dependent
      Asking the Right Questions
      §  Problem Driven
         §  What data do we need? How much?
         §  How do we collect it? How do we store and transfer it?
      §  Understanding the Data
         §  How sparse is the data? How much noise?
         §  Is there redundancy? Are there biases?
         §  Is there spam? Any outliers?
      §  Analyzing the Data
         §  Any privacy issues? Do we need to anonymize?
         §  How well do our algorithms scale?
         §  Can we visualize the results?
  • 4. Too Much Data Available
      §  The Web is a database!
      §  Data does not imply information
      §  Many analyses for the sake of it (data driven)
      §  Analyzing data is not CS per se
      §  Publish in the right forum!
      §  Big Data or Right Data?
      The Different Facets of the Web
  • 5. The Structure of the Web
      Big Data in the Web
      [Diagram labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Explicit, Implicit, Wordnet, UGC, Private, Scale, Blogs, Groups, Quality?]
  • 6. What is in the Web? How good is it?
      [Figure labels: Quantity, Quality, User-generated, Traditional publishing]
      What else is in the Web?
  • 7. Noise and Spam
      §  Noise may come from many places:
         §  Instruments that measure
         §  How we interpret the data (example later)
      §  Spam is everywhere
      Web Spam
      Deceiving text, links, clicks… due to an economic incentive
      Depending on the goal and the data, spam is easier to generate
      Depending on the type & target data, spam is easier to fight
      Disincentives for spammers?
      •  Social
      •  Economic
      Web Spam is NOT Mail Spam
  • 8. Content and Metadata Trends [Ramakrishnan and Tomkins 2007]
  • 9. Web Data Trends
      •  User Generated Content
         – Massive (quality vs. quantity)
         – Social Networks
         – Real time (people + physical sensors)
      •  Impact
         – Fragmentation of ownership
         – Fragmentation of access (longer heavy tail)
         – Fragmentation of right to access
      •  Viability
         – Business model based on advertising
      The Wisdom of Crowds
      •  James Surowiecki, a New Yorker columnist, published this book in 2004
         – “Under the right circumstances, groups are remarkably intelligent”
      •  Importance of diversity, independence and decentralization
      “large groups of people are smarter than an elite few, no matter how brilliant; they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”.
      Aggregating data
  • 10. Web Data Mining
      •  Content: text & multimedia mining
      •  Structure: link analysis, graph mining
      •  Usage: log analysis, query mining
      •  Relate all of the above
         – Web characterization
         – Particular applications
      Flickr: Clustering Pictures
  • 11. Popularity
      Flickr: Geo-tagged pictures
  • 12. “Crowd Sourcing”
      Web-based “peer production” has produced a number of successful products and communities
      •  Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
      Can this form of production be harnessed for other ends?
      •  Existing successes are hard to replicate at will
      Amazon Mechanical Turk (AMT)
      •  Like outsourcing, but in a micro-distributed fashion
      •  Thousands of “turkers” working on hundreds of “HITs” (tasks)
      •  Rates are typically a few cents per task
      •  Quality of their work is positively evaluated (e.g. in IR)
      The Wisdom of (Large) Crowds
      – Crucial for Search Ranking
      – Text: Web Writers & Editors
         • not only for the Web!
      – Links: Web Publishers
      – Tags: Web Taggers
      – Queries: All Web Users!
         • Queries and actions (or no action!)
      The crowd implicitly knows the experts!
  • 13. Scalability
      §  How to scale?
         §  In the best case, doubling the data doubles the time
      §  Time complexity vs. result quality trade-off
         §  Example: entity detection in linear time at almost state-of-the-art quality
         §  That implies that there exists a text size n* for which the linear algorithm will produce more correct entities
      §  Distributed parallel processing
         §  Map-reduce does not always work
         §  Parallelism is problem dependent
      §  Online processing needs a different approach
      Redundancy and Bias
      §  Is there any dependency in the data?
      §  Is there any duplication?
         §  Lexical duplication in the Web is around 25%
         §  Semantic duplication is larger
      §  Are there any biases?
         §  Example 1: clicks in search engines
            §  Bias to the ranking and the interface
            §  There is a ranking bias in the Web content
         §  Example 2: tag recommendation
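One possible reading of the entity-detection example above is as a fixed time budget: a slower, slightly more accurate detector can only afford part of a large text, while the linear one keeps going. The sketch below is a toy model under stated assumptions, not the method from the talk: the state-of-the-art detector is assumed to be quadratic, and the quality figures, unit costs, budget and entity density are all hypothetical.

```python
def affordable_tokens(budget, unit_cost, exponent):
    """Tokens an algorithm costing unit_cost * n**exponent can process within budget."""
    return (budget / unit_cost) ** (1.0 / exponent)


def correct_entities(n, budget, algo, density=0.01):
    """Expected correct entities when only the affordable prefix of n tokens is processed."""
    processed = min(n, affordable_tokens(budget, algo["unit_cost"], algo["exponent"]))
    return processed * density * algo["quality"]


def crossover_size(budget, fast, slow):
    """Text size n* beyond which the fast algorithm finds more correct entities.

    Only meaningful while n* is itself affordable for the fast algorithm.
    """
    slow_cap = affordable_tokens(budget, slow["unit_cost"], slow["exponent"])
    return slow_cap * slow["quality"] / fast["quality"]


# Hypothetical parameters (assumptions for illustration, not measurements).
FAST = {"unit_cost": 1.0, "exponent": 1, "quality": 0.90}  # linear time
SLOW = {"unit_cost": 1.0, "exponent": 2, "quality": 0.95}  # assumed quadratic
BUDGET = 1_000_000.0

n_star = crossover_size(BUDGET, FAST, SLOW)
for n in (500, 1_000, 2_000, 10_000):
    print(n, correct_entities(n, BUDGET, FAST), correct_entities(n, BUDGET, SLOW))
print("n* is about", round(n_star))
```

Below n* both detectors process the whole text and the more accurate one wins; above it, the linear detector's extra coverage outweighs its lower per-entity quality.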
  • 14. We can suggest tags: nice but ....
      Privacy Example: AOL Query Logs Release Incident
      No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.
      Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
      Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.
      “A Face Is Exposed for AOL Searcher No. 4417749”, by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9 2006
  • 15. Risks of Privacy
      (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public databases (Sweeney, 2001)
      K-anonymity: suppress or generalize attributes until each entry is identical to at least k-1 other entries
      Federal Trade Commission in the US: privacy policies should “address the collection of data itself and not just how the data is used” (Dec 2010)
      Data Protection Directive in the EU
      Risks of Privacy: Query Logs
      Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
      •  Gender: 84%
      •  Age (±10): 79%
      •  Location (ZIP3): 35%
      Vanity Queries: [Jones et al, CIKM 2008]
      •  Partial name: 8.9%
      •  Complete: 1.2%
      More information:
      •  A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
      Good anonymization is still an open problem
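The k-anonymity definition above can be checked mechanically. The following is a minimal sketch: the records are made up, and the `generalize` rule (ZIP code cut to a 3-digit prefix, birth date cut to a decade) is a hypothetical choice for illustration, not a recommended anonymization scheme.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth", "gender")


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


def generalize(record):
    """Hypothetical generalization: 3-digit ZIP prefix and birth decade."""
    return {
        "zip": record["zip"][:3] + "**",
        "birth": record["birth"][:3] + "0s",  # "1944-03-12" -> "1940s"
        "gender": record["gender"],
    }


# Made-up records, for illustration only.
records = [
    {"zip": "30047", "birth": "1944-03-12", "gender": "F"},
    {"zip": "30043", "birth": "1947-11-02", "gender": "F"},
    {"zip": "30301", "birth": "1980-06-30", "gender": "M"},
    {"zip": "30309", "birth": "1985-01-15", "gender": "M"},
]

generalized = [generalize(r) for r in records]
print(is_k_anonymous(records, QUASI_IDENTIFIERS, 2))      # raw rows are all unique
print(is_k_anonymous(generalized, QUASI_IDENTIFIERS, 2))  # each row now has a twin
```

Generalizing further raises k at the cost of precision, which is the trade-off behind the remark that good anonymization is still an open problem.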
  • 16. Sparsity
      §  The Long Tail is always Sparse
      §  Why is there a long tail?
      §  When the crowd dominates
      §  Empowering the tail
      §  Example: Relations from Query Logs
      The Wisdom of Crowds
      – Popularity
      – Diversity
      – Quality
      – Coverage
      [Figure labels: Long tail, Heavy tail]
  • 17. The Long Tail
      Most measures in the Web follow a power law
      People Interests: one explanation
      Heavy tail of user interests
      Many queries, each asked very few times, make up a large fraction of all queries
      Movies watched, blogs read, words used, …
      [Figure labels: Normal people, Weirdos]
  • 18. People Interests: the reality
      Heavy tail of user interests
      Many queries, each asked very few times, make up a large fraction of all queries
      Applies to word usage, web page access, …
      We are all partially eclectic
      Broder, Gabrilovich, Goel, Pang; WSDM 2009
      Example: Click Distribution
      User interaction is a power law! (Zipf’s principle of minimal effort)
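The heavy-tail claim above (many queries, each asked very few times, make up a large fraction of all queries) can be illustrated numerically. This is a minimal sketch assuming query frequencies follow an idealized Zipf law, with frequency proportional to 1/rank^s; the vocabulary size, head size and exponent are hypothetical and not taken from the talk.

```python
def tail_share(num_items, head_size, exponent=1.0):
    """Fraction of total volume that falls outside the head_size most popular items,
    assuming frequency proportional to 1 / rank**exponent (an idealized Zipf law)."""
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    return sum(weights[head_size:]) / sum(weights)


# Hypothetical setting: one million distinct queries, "head" = top 1,000.
share = tail_share(1_000_000, 1_000)
print(f"tail share with exponent 1.0: {share:.2%}")
print(f"tail share with exponent 2.0: {tail_share(1_000_000, 1_000, 2.0):.2%}")
```

Under these assumptions, with exponent 1 a bit under half of the volume lies outside the top 1,000 queries, while a steeper exponent of 2 leaves almost nothing in the tail. How heavy the tail is depends strongly on the exponent.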
  • 19. When the crowd dominates
      Kills the long tail
      See (obsolete now) “shwarzneger” example
      Empowering the Tail
      The Filter “Bubble”, Eli Pariser
      •  Avoid the “poor get poorer” syndrome
      Solutions:
      •  Diversity
      •  Novelty
      •  Serendipity
      Explore & Exploit
  • 20. How to Circumvent Sparsity?
      Wisdom of “ad-hoc” crowds?
      Aggregate data in the “right way”
      When data is sparse
      Aggregate users around same intent, task, facet, …
      Change granularity “ad hoc”
      •  Middle-aged men
      •  Fans of Messi
      Example: Mining Geo/time Data
      •  Optimal Touristic Paths from Flickr
      •  Good for tourists and locals
      De Choudhury et al, HT 2010
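The idea of changing granularity when data is sparse can be sketched as grouping events by a coarser key than the individual user. The events, the `segment` field and the `aggregate` helper below are hypothetical names and data invented for illustration; the segment labels reuse the two examples from the slide.

```python
from collections import defaultdict


def aggregate(events, key):
    """Count item occurrences per group, where key(event) picks the group."""
    counts = defaultdict(lambda: defaultdict(int))
    for event in events:
        counts[key(event)][event["item"]] += 1
    return {group: dict(items) for group, items in counts.items()}


# Made-up click events: each user contributes very little on their own.
events = [
    {"user": "u1", "segment": "fans of Messi", "item": "match highlights"},
    {"user": "u2", "segment": "fans of Messi", "item": "match highlights"},
    {"user": "u3", "segment": "fans of Messi", "item": "ticket prices"},
    {"user": "u4", "segment": "middle-aged men", "item": "ticket prices"},
]

by_user = aggregate(events, lambda e: e["user"])        # sparse: one count per user
by_segment = aggregate(events, lambda e: e["segment"])  # denser: counts accumulate
print(by_user)
print(by_segment)
```

At user granularity every count is 1, so nothing can be ranked; at segment granularity the counts start to separate popular items from rare ones, at the cost of treating everyone in a segment alike.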
  • 21. Aggregating in the Long Tail
      •  The long tail is important not only for e-commerce, but because we are all there
      •  Personalization vs. contextualization
      [Figure labels: People Interests; User interaction is another long tail]
      Epilogue
      •  The Web is scientifically young
      •  The Web is intellectually diverse
      •  The technology mirrors the economic, legal and sociological reality
      •  Data must be interesting! (Gerhard Weikum)
      •  Problem driven
      •  Plenty of challenges
  • 22. Mirror of Society
      Exports/Imports vs. Domain Links (Baeza-Yates & Castillo, WWW 2006)
  • 23. Contact: rbaeza@acm.org
      Thanks to many people at Yahoo! Labs
      ASIST 2012 Book of the Year Award
      Questions?