Big Data in the Web
Ricardo Baeza-Yates
Yahoo! Labs
Barcelona & Santiago de Chile
6/28/13

Agenda
• Big Data
• Asking the Right Questions
• Wisdom of Crowds in the Web
• The Long Tail
• Issues and Examples
• Concluding Remarks

Big Data
• Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time
• Large volume and growth
– Petabytes to exabytes
– Growth is estimated at 3 exabytes per day
• Structured vs. unstructured data
• Diversity
– Types, formats, complexity, topics, etc.
• Best public data example: the Web
– Content: text, multimedia
– Structure: graphs
– Usage: real-time streams

Big Data
• Focus on analytics
• Many storage technologies:
– DBs, DWs, distributed file systems, …
• Many processing technologies:
– Cloud computing, map-reduce (Hadoop), …
– Data mining, clustering, classification, …
– Machine learning, A/B testing, NLP, …
– Simulation
• Several technology providers
• Initial best practices (see TDWI report, 2011)
• Main challenges: scalability and online processing
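
As a rough illustration of the map-reduce model named on the slide above, the following single-process Python sketch counts words in three phases (map, shuffle, reduce). It is not Hadoop code: the function names and the two toy documents are assumptions made for this example, and a real cluster would run the map and reduce calls on different machines.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit one (word, 1) pair per word occurrence in a document.
    return [(word, 1) for word in doc.lower().split()]

def shuffle(pairs):
    # Group all values that were emitted for the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine the values of one key into a single result.
    return key, sum(values)

def word_count(docs):
    pairs = [pair for doc in docs for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Toy input, invented for this sketch.
counts = word_count(["big data in the web", "the web is big"])
```

Word counting fits this model because each key can be reduced independently of the others; problems without that property are harder to parallelize this way.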

Big Data: The Five V’s
Characteristic  Data Issue                                  Computing Issue
Volume          Scale, Redundancy                           Scalability
Variety         Heterogeneity, Complexity                   Adaptability, Extensibility
Veracity        Completeness, Bias, Sparsity, Noise, Spam   Reliability, Trust
Velocity        Real time                                   Online
Value           Usefulness, Privacy                         Business dependent

Asking the Right Questions
• Problem Driven
– What data do we need? How much?
– How do we collect it? How do we store and transfer it?
• Understanding the Data
– How sparse is the data? How much noise?
– Is there redundancy? Are there biases?
– Is there spam? Any outliers?
• Analyzing the Data
– Any privacy issues? Do we need to anonymize?
– How well do our algorithms scale?
– Can we visualize the results?

Too Much Data Available
• The Web is a database!
• Data does not imply information
• Many analyses for the sake of it (data driven)
• Analyzing data is not CS per se
– Publish in the right forum!
• Big Data or Right Data?

The Different Facets of the Web

The Structure of the Web

Big Data in the Web
[Diagram labels: Explicit vs. Implicit; Scale; Quality?; Metadata, RDF, Wordnet, Wikipedia, ODP, Flickr, Y! Answers, UGC (blogs, groups), Text, Anchors + links, Logs (clicks + queries), Private]

What Is in the Web? How Good Is It?
[Figure: quantity vs. quality, comparing user-generated content with traditional publishing]

What else is in the Web?

Noise and Spam
• Noise may come from many places:
– Instruments that measure
– How we interpret the data (example later)
• Spam is everywhere

Web Spam
Deceiving text, links, clicks, … due to an economic incentive
Depending on the goal and the data, spam is easier to generate
Depending on the type and target data, spam is easier to fight
Disincentives for spammers?
• Social
• Economic
Web Spam is NOT Mail Spam

Content and Metadata Trends
[Ramakrishnan and Tomkins 2007]

Web Data Trends
• User Generated Content
– Massive (quality vs. quantity)
– Social Networks
– Real time (people + physical sensors)
• Impact
– Fragmentation of ownership
– Fragmentation of access (longer heavy tail)
– Fragmentation of right to access
• Viability
– Business model based on advertising

The Wisdom of Crowds
• James Surowiecki, a New Yorker columnist, published this book in 2004
– “Under the right circumstances, groups are remarkably intelligent”
• Importance of diversity, independence and decentralization
“large groups of people are smarter than an elite few, no matter how brilliant; they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future”.
Aggregating data
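
A minimal sketch of what "aggregating data" can mean here, assuming the crowd answers a numeric question and the aggregate is the plain average. The guesses and the true value are invented for illustration and do not come from the talk or the book.

```python
from statistics import mean

def crowd_vs_individuals(guesses, truth):
    # Error of the aggregated (averaged) guess.
    crowd_error = abs(mean(guesses) - truth)
    # Average error of the individual guesses.
    avg_individual_error = mean(abs(g - truth) for g in guesses)
    return crowd_error, avg_individual_error

# Hypothetical guesses for a quantity whose true value is 100.
guesses = [80, 120, 95, 110, 70, 130, 105, 98]
crowd_error, avg_individual_error = crowd_vs_individuals(guesses, truth=100)
```

By the triangle inequality, the error of the averaged guess can never exceed the average individual error. How much smaller it gets depends on the guesses erring in different directions, which is where diversity and independence come in.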

Web Data Mining
• Content: text & multimedia mining
• Structure: link analysis, graph mining
• Usage: log analysis, query mining
• Relate all of the above
– Web characterization
– Particular applications

Flickr: Clustering Pictures

Popularity

Flickr: Geo-tagged pictures

“Crowd Sourcing”
Web-based “peer production” has produced a number of successful products and communities
• Wikipedia, Y! Answers, YouTube, Flickr, Digg, ...
Can this form of production be harnessed for other ends?
• Existing successes are hard to replicate at will
Amazon Mechanical Turk (AMT)
• Like outsourcing, but in a micro-distributed fashion
• Thousands of “turkers” working on hundreds of “HITs” (tasks)
• Rates are typically a few cents per task
• Quality of their work has been positively evaluated (e.g. in IR)

The Wisdom of (Large) Crowds
– Crucial for Search Ranking
– Text: Web Writers & Editors
• not only for the Web!
– Links: Web Publishers
– Tags: Web Taggers
– Queries: All Web Users!
• Queries and actions (or no action!)
The crowd implicitly knows the experts!

Scalability
• How to scale?
– Doubling the data will, in the best case, double the time
• Time complexity vs. result quality trade-off
– Example: entity detection in linear time at almost state-of-the-art quality
– That implies that there exists a text size n* for which the linear algorithm will produce more correct entities
• Distributed parallel processing
– Map-reduce does not always work
– Parallelism is problem dependent
• Online processing needs a different approach
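
One possible reading of the n* argument on the slide above, sketched under assumptions that are not in the talk: the slower state-of-the-art algorithm costs c_sota * n^2 time (the quadratic form is hypothetical), the linear algorithm costs c_lin * n, each finds a fixed fraction (q_sota, q_lin) of the entities, and entities are spread evenly over the text. In the time the slower algorithm needs for n tokens, the linear one can cover more text, and beyond n* = (q_sota * c_lin) / (q_lin * c_sota) it returns more correct entities.

```python
def crossover_size(q_lin, c_lin, q_sota, c_sota):
    # n* for the assumed cost models c_lin * n and c_sota * n**2.
    return (q_sota * c_lin) / (q_lin * c_sota)

def correct_entities_in_same_time(n, q_lin, c_lin, q_sota, c_sota, density=1.0):
    budget = c_sota * n ** 2   # time the slower algorithm needs for n tokens
    n_lin = budget / c_lin     # tokens the linear algorithm covers in that time
    return q_lin * density * n_lin, q_sota * density * n

# Illustrative constants only; none of them are measurements.
params = dict(q_lin=0.8, c_lin=1.0, q_sota=0.9, c_sota=0.001)
n_star = crossover_size(**params)
small = correct_entities_in_same_time(500, **params)
large = correct_entities_in_same_time(2000, **params)
```

With these made-up constants the crossover sits at about 1,125 tokens: at 500 tokens the slower algorithm is ahead, at 2,000 tokens the linear one is.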

Redundancy and Bias
• Is there any dependency in the data?
• Is there any duplication?
– Lexical duplication in the Web is around 25%
– Semantic duplication is larger
• Are there any biases?
– Example 1: clicks in search engines
– Bias to the ranking and the interface
– There is a ranking bias in the Web content
– Example 2: tag recommendation
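
To make "lexical duplication" concrete, here is a small sketch that flags near-duplicate pages by comparing sets of word shingles with Jaccard similarity. This is one common way to measure it, not necessarily the method behind the 25% figure; the shingle length, the threshold, and the sample pages are assumptions.

```python
def shingles(text, k=3):
    # Set of k-word windows ("shingles") of a lowercased text.
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Size of the intersection over size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def duplicate_fraction(docs, k=3, threshold=0.9):
    # Fraction of documents that near-duplicate an earlier document.
    seen, duplicates = [], 0
    for doc in docs:
        s = shingles(doc, k)
        if any(jaccard(s, t) >= threshold for t in seen):
            duplicates += 1
        else:
            seen.append(s)
    return duplicates / len(docs) if docs else 0.0

# Invented sample pages.
pages = [
    "the quick brown fox jumps over the lazy dog",
    "The quick brown fox jumps over the lazy dog",
    "big data in the web",
    "a completely different page about spam",
]
fraction = duplicate_fraction(pages)
```

The pairwise comparison is quadratic in the number of pages, so it only works for small collections; at Web scale, hashing techniques such as MinHash are typically used to approximate the same similarity.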

We can suggest tags: nice but ....

Privacy Example: AOL Query Logs Release Incident
No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from “numb fingers” to “60 single men”.
Other queries: “landscapers in Lilburn, Ga,” several people with the last name Arnold and “homes sold in shadow lake subdivision gwinnett county georgia.”
Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends’ medical ailments and loves her three dogs.
A Face Is Exposed for AOL Searcher No. 4417749, by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9 2006

Risks of Privacy
(ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public databases (Sweeney, 2001)
K-anonymity
Suppress or generalize attributes until each entry is identical to at least k-1 other entries
Federal Trade Commission in US: Privacy policies should “address the collection of data itself and not just how the data is used”, Dec 2010.
Data Protection Directive in EU
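
The k-anonymity definition above can be checked mechanically. The sketch below tests whether every combination of quasi-identifiers (ZIP code, date of birth, gender) occurs at least k times, and applies one hypothetical generalization step: keep only the first three ZIP digits and the birth year. The records and field names are fabricated for illustration.

```python
from collections import Counter

QUASI_IDENTIFIERS = ("zip", "birth", "gender")

def is_k_anonymous(records, k, keys=QUASI_IDENTIFIERS):
    # Every quasi-identifier combination must be shared by at least k records.
    groups = Counter(tuple(r[key] for key in keys) for r in records)
    return all(count >= k for count in groups.values())

def generalize(record):
    # Hypothetical generalization: 3-digit ZIP prefix and birth year only.
    out = dict(record)
    out["zip"] = record["zip"][:3] + "**"
    out["birth"] = record["birth"][:4]
    return out

# Fabricated records.
raw = [
    {"zip": "12345", "birth": "1944-03-12", "gender": "F"},
    {"zip": "12349", "birth": "1944-11-02", "gender": "F"},
    {"zip": "98701", "birth": "1980-01-05", "gender": "M"},
    {"zip": "98702", "birth": "1980-07-21", "gender": "M"},
]
generalized = [generalize(r) for r in raw]
```

In the raw table every record is unique on the three attributes, so it is not even 2-anonymous. After generalization each record shares its quasi-identifiers with one other record, at the cost of coarser data.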

Risks of Privacy: Query Logs
Profile: [Jones, Kumar, Pang, Tomkins, CIKM 2007]
• Gender: 84%
• Age (±10): 79%
• Location (ZIP3): 35%
Vanity Queries: [Jones et al, CIKM 2008]
• Partial name: 8.9%
• Complete: 1.2%
More information:
• A Survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]
A good anonymization is still an open problem

Sparsity
• The Long Tail is always Sparse
• Why is there a long tail?
• When the crowd dominates
• Empowering the tail
• Example: Relations from Query Logs

The Wisdom of Crowds
– Popularity
– Diversity
– Quality
– Coverage
Long tail
Heavy tail

The Long Tail
Most measures in the Web follow a power law
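
To get a feeling for how much weight the tail of a power law carries, the sketch below assumes a Zipf distribution in which the item of rank r has frequency proportional to 1 / r^s, and computes the share of total volume that falls outside the most popular items. The exponent s = 1, the one million distinct items, and the head of 1,000 items are illustrative assumptions, not figures from the talk.

```python
def tail_share(num_items, head_size, exponent=1.0):
    # Share of total volume outside the head_size most popular items,
    # assuming the frequency of rank r is proportional to 1 / r**exponent.
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    total = sum(weights)
    head = sum(weights[:head_size])
    return (total - head) / total

# Hypothetical setting: one million distinct queries, head of 1,000.
share = tail_share(num_items=1_000_000, head_size=1_000)
```

Under these assumptions roughly half of the volume lies outside the top 1,000 items, even though each tail item on its own is rare.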

Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Movies watched, blogs read, words used, …
One explanation
[Figure: people vs. interests, split into “normal people” and “weirdos”]

Heavy tail of user interests
Many queries, each asked very few times, make up a large fraction of all queries
Applies to word usage, web page access, …
The reality
We are all partially eclectic
[Figure: people vs. interests]
Broder, Gabrilovich, Goel, Pang; WSDM 2009

Example: Click Distribution
User interaction is a power law! (Zipf’s principle of least effort)

When the crowd dominates
Kills the long tail
See (obsolete now) “shwarzneger” example

Empowering the Tail
The Filter “Bubble”, Eli Pariser
• Avoid the Poor get Poorer Syndrome
Solutions:
• Diversity
• Novelty
• Serendipity
Explore & Exploit
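
The slide only names "Explore & Exploit". As one simple instance of the idea, the sketch below uses the epsilon-greedy heuristic: with probability epsilon it shows a random item, so that tail items get a chance to collect feedback, and otherwise it shows the item with the best estimated click rate. The item names and rates are invented, and epsilon-greedy is a swapped-in illustration, not the method used in the talk.

```python
import random

def epsilon_greedy(click_rates, epsilon, rng):
    # Explore a uniformly random item with probability epsilon,
    # otherwise exploit the item with the highest estimated click rate.
    items = list(click_rates)
    if rng.random() < epsilon:
        return rng.choice(items)
    return max(items, key=click_rates.get)

# Hypothetical click-rate estimates for one head item and two tail items.
rates = {"head": 0.30, "tail_a": 0.02, "tail_b": 0.01}
rng = random.Random(0)
exploit_only = {epsilon_greedy(rates, 0.0, rng) for _ in range(100)}
explore_only = {epsilon_greedy(rates, 1.0, rng) for _ in range(200)}
```

With epsilon set to 0 the head item is always chosen and the tail never gets exposure, which is the "poor get poorer" effect; any positive epsilon trades some short-term clicks for information about the tail.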

How to Circumvent Sparsity?
Wisdom of “ad-hoc” crowds?
Aggregate data in the “right way”
When data is sparse
Aggregate users around same intent, task, facet, …
Change granularity “ad hoc”
• Middle-aged men
• Fans of Messi
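
A minimal sketch of changing granularity when data is sparse, assuming each user belongs to one ad hoc group (for example "fans of Messi") and that we fall back to the group's aggregated counts whenever a user has fewer than min_events observations. The event log, the group names, and the threshold are hypothetical.

```python
from collections import Counter, defaultdict

def top_item(events, user, group_of, min_events=5):
    # events is a list of (user, item) pairs.
    by_user = defaultdict(Counter)
    by_group = defaultdict(Counter)
    for u, item in events:
        by_user[u][item] += 1
        by_group[group_of[u]][item] += 1
    own = by_user[user]
    # Back off to the group's counts when the user's own data is too sparse.
    counts = own if sum(own.values()) >= min_events else by_group[group_of[user]]
    return counts.most_common(1)[0][0] if counts else None

# Invented event log: u1 has a single event, u2 and u3 have more.
events = [("u1", "tennis")] + [("u2", "football")] * 6 + [("u3", "chess")] * 5
group_of = {"u1": "messi_fans", "u2": "messi_fans", "u3": "other"}
sparse_user_pick = top_item(events, "u1", group_of)
```

User u1 has one event, so the prediction comes from the group and is "football"; with min_events=1 the same call would trust the single observation and return "tennis".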

Example: Mining Geo/time Data
• Optimal Touristic Paths from Flickr
• Good for tourists and locals
De Choudhury et al, HT 2010

Aggregating in the Long Tail
• The long tail is important not only for e-commerce, but because we are all there
• Personalization vs. Contextualization
User interaction is another long tail
[Figure: people vs. interests]

Epilogue
• The Web is scientifically young
• The Web is intellectually diverse
• The technology mirrors the economic, legal and sociological reality
• Data must be interesting! (Gerhard Weikum)
• Problem driven
• Plenty of challenges

Mirror of Society

Exports/Imports vs. Domain Links
Baeza-Yates & Castillo, WWW2006

Contact: rbaeza@acm.org
Thanks to many people at Yahoo! Labs
ASIST 2012 Book of the Year Award
Questions?

Product School
Ā 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
Ā 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
Ā 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
Ā 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
Ā 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
Ā 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
Ā 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
Ā 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
Ā 

Recently uploaded (20)

FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
Ā 
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder ā€“ active learning and UiPath LLMs for do...
Ā 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Ā 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Ā 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ā 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Ā 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Ā 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
Ā 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
Ā 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Ā 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Ā 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Ā 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Ā 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
Ā 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
Ā 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Ā 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Ā 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Ā 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Ā 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Ā 

Keynote baezayates

  • 1. 6/28/13. Big Data in The Web. Ricardo Baeza-Yates, Yahoo! Labs, Barcelona & Santiago de Chile. Agenda: Big Data; Asking the Right Questions; Wisdom of Crowds in the Web; The Long Tail; Issues and Examples; Concluding Remarks.
  • 2. Big Data. Capture, transfer, store, search, share, analyze, and visualize large data in reasonable time. Large volume and growth: petabytes to exabytes; growth is estimated at 3 exabytes per day. Structured vs. non-structured data. Diversity: types, formats, complexity, topics, etc. Best public data example: the Web (content: text, multimedia; structure: graphs; usage: real-time streams). Big Data (continued). Focus on analytics. Many storage technologies: DBs, DWs, distributed file systems, ... Many processing technologies: cloud computing, map-reduce (Hadoop), ...; data mining, clustering, classification, ...; machine learning, A/B testing, NLP, ...; simulation. Several technology providers. Initial best practices (see TDWI report, 2011). Main challenges: scalability and online processing.
  • 3. Big Data: The Five V's (characteristic: data issue / computing issue). Volume: scale, redundancy / scalability. Variety: heterogeneity, complexity / adaptability, extensibility. Veracity: completeness, bias, sparsity, noise, spam / reliability, trust. Velocity: real time / online. Value: usefulness, privacy / business dependent. Asking the Right Questions. Problem driven: What data do we need? How much? How do we collect it? How do we store and transfer it? Understanding the data: How sparse is the data? How much noise? Is there redundancy? Are there biases? Is there spam? Any outliers? Analyzing the data: Any privacy issues? Do we need to anonymize? How well do our algorithms scale? Can we visualize the results?
  • 4. Too Much Data Available. The Web is a database! Data does not imply information. Many analyses for the sake of it (data driven). Analyzing data is not CS per se. Publish in the right forum! Big Data or Right Data? The Different Facets of the Web.
  • 5. The Structure of the Web. Big Data in the Web (diagram labels: Metadata, RDF, Wikipedia, ODP, Flickr, Text, Anchors + links, Y! Answers, Logs (Clicks + Queries), Explicit, Implicit, Wordnet, UGC, Private, Scale, Blogs, Groups, Quality?).
  • 6. What is in the Web? How good is it? (Chart labels: Quantity, Quality, User-generated, Traditional publishing.) What else is in the Web?
  • 7. Noise and Spam. Noise may come from many places: the instruments that measure; how we interpret the data (example later). Spam is everywhere. Web Spam. Deceiving text, links, clicks, ... due to an economic incentive. Depending on the goal and the data, spam is easier to generate. Depending on the type and target data, spam is easier to fight. Disincentives for spammers? Social; economic. Web spam is NOT mail spam.
  • 8. Content and Metadata Trends [Ramakrishnan and Tomkins 2007].
  • 9. Web Data Trends. User-generated content: massive (quality vs. quantity); social networks; real time (people + physical sensors). Impact: fragmentation of ownership; fragmentation of access (longer heavy tail); fragmentation of right to access. Viability: business model based on advertising. The Wisdom of Crowds. James Surowiecki, a New Yorker columnist, published this book in 2004: "Under the right circumstances, groups are remarkably intelligent". Importance of diversity, independence and decentralization. "Large groups of people are smarter than an elite few, no matter how brilliant; they are better at solving problems, fostering innovation, coming to wise decisions, even predicting the future". Aggregating data.
  • 10. Web Data Mining. Content: text & multimedia mining. Structure: link analysis, graph mining. Usage: log analysis, query mining. Relate all of the above: Web characterization; particular applications. Flickr: Clustering Pictures.
  • 11. Popularity. Flickr: Geo-tagged pictures.
  • 12. "Crowd Sourcing". Web-based "peer production" has produced a number of successful products and communities: Wikipedia, Y! Answers, YouTube, Flickr, Digg, ... Can this form of production be harnessed for other ends? Existing successes are hard to replicate at will. Amazon Mechanical Turk (AMT): like outsourcing, but in a micro-distributed fashion; thousands of "turkers" working on hundreds of "HITs" (tasks); rates are typically a few cents per task; quality of their work is positively evaluated (e.g. in IR). The Wisdom of (Large) Crowds. Crucial for search ranking. Text: Web writers & editors (not only for the Web!). Links: Web publishers. Tags: Web taggers. Queries: all Web users! Queries and actions (or no action!). The crowd implicitly knows the experts!
  • 13. Scalability. How to scale? Doubling the data will, in the best case, double the time. Time complexity vs. result quality trade-off. Example: entity detection in linear time at almost state-of-the-art quality. That implies that there exists a text size n* for which the linear algorithm will produce more correct entities than a slower, higher-quality algorithm given the same time. Distributed parallel processing: map-reduce does not always work; parallelism is problem dependent. Online processing needs a different approach. Redundancy and Bias. Is there any dependency in the data? Is there any duplication? Lexical duplication in the Web is around 25%; semantic duplication is larger. Are there any biases? Example 1: clicks in search engines (bias to the ranking and the interface; there is a ranking bias in the Web content). Example 2: tag recommendation.
  • 14. We can suggest tags: nice but .... Privacy Example: AOL Query Logs Release Incident. No. 4417749 conducted hundreds of searches over a three-month period on topics ranging from "numb fingers" to "60 single men". Other queries: "landscapers in Lilburn, Ga," several people with the last name Arnold and "homes sold in shadow lake subdivision gwinnett county georgia." Data trail led to Thelma Arnold, a 62-year-old widow who lives in Lilburn, Ga., frequently researches her friends' medical ailments and loves her three dogs. Source: A Face Is Exposed for AOL Searcher No. 4417749, by Michael Barbaro and Tom Zeller Jr., The New York Times, Aug 9, 2006.
  • 15. Risks of Privacy. (ZIP code, date of birth, gender) is enough to identify 87% of US citizens using public DBs (Sweeney, 2001). K-anonymity: suppress or generalize attributes until each entry is identical to at least k-1 other entries. Federal Trade Commission in the US: privacy policies should "address the collection of data itself and not just how the data is used", Dec 2010. Data Protection Directive in the EU. Risks of Privacy: Query Logs. Profile [Jones, Kumar, Pang, Tomkins, CIKM 2007]: gender 84%; age (±10) 79%; location (ZIP3) 35%. Vanity queries [Jones et al., CIKM 2008]: partial name 8.9%; complete 1.2%. More information: A survey of query log privacy-enhancing techniques from a policy perspective [Cooper, ACM TWEB 2008]. Good anonymization is still an open problem.
  • 16. Sparsity. The long tail is always sparse. Why is there a long tail? When the crowd dominates. Empowering the tail. Example: relations from query logs. The Wisdom of Crowds: popularity, diversity, quality, coverage. Long tail; heavy tail.
  • 17. The Long Tail. Most measures in the Web follow a power law. People Interests: heavy tail of user interests. Many queries, each asked very few times, make up a large fraction of all queries. Movies watched, blogs read, words used, ... One explanation: normal people vs. weirdos.
  • 18. People Interests: heavy tail of user interests. Many queries, each asked very few times, make up a large fraction of all queries. Applies to word usage, web page access, ... The reality: we are all partially eclectic (Broder, Gabrilovich, Goel, Pang; WSDM 2009). Example: Click Distribution. User interaction is a power law! (Zipf's principle of minimal effort).
  • 19. When the crowd dominates: kills the long tail. See the (now obsolete) "shwarzneger" example. Empowering the Tail. The Filter "Bubble", Eli Pariser: avoid the poor-get-poorer syndrome. Solutions: diversity, novelty, serendipity. Explore & exploit.
  • 20. How to Circumvent Sparsity? Wisdom of "ad-hoc" crowds? Aggregate data in the "right way". When data is sparse, aggregate users around the same intent, task, facet, ... Change granularity "ad hoc": middle-aged men; fans of Messi. Example: Mining Geo/time Data. Optimal touristic paths from Flickr; good for tourists and locals (De Choudhury et al., HT 2010).
  • 21. Aggregating in the Long Tail. The long tail is important not only for e-commerce, but because we are all there. Personalization vs. contextualization. User interaction is another long tail (people interests). Epilogue. The Web is scientifically young. The Web is intellectually diverse. The technology mirrors the economic, legal and sociological reality. Data must be interesting! (Gerhard Weikum). Problem driven. Plenty of challenges.
  • 22. Mirror of Society. Exports/Imports vs. Domain Links (Baeza-Yates & Castillo, WWW 2006).
  • 23. Contact: rbaeza@acm.org. Thanks to many people at Yahoo! Labs. ASIST 2012 Book of the Year Award. Questions?
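
The k-anonymity rule quoted in the "Risks of Privacy" slide (each entry must be identical to at least k-1 other entries on its identifying attributes) can be illustrated with a minimal sketch. The function names, the field names (`zip`, `birth_year`, `gender`) and the toy records are all hypothetical and are not taken from the talk. Truncating the ZIP code to three digits is only one possible generalization step, loosely mirroring the ZIP3 granularity mentioned in the query-log slide.

```python
from collections import Counter


def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if every combination of quasi-identifier values
    occurs in at least k records."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return all(count >= k for count in groups.values())


def generalize_zip(record, digits=3):
    """Hypothetical generalization step: keep the first `digits`
    characters of the ZIP code and mask the rest with '*'."""
    generalized = dict(record)
    zip_code = record["zip"]
    generalized["zip"] = zip_code[:digits] + "*" * (len(zip_code) - digits)
    return generalized


# Toy records, invented for illustration only.
records = [
    {"zip": "10001", "birth_year": 1950, "gender": "F"},
    {"zip": "10002", "birth_year": 1950, "gender": "F"},
    {"zip": "94105", "birth_year": 1970, "gender": "M"},
    {"zip": "94107", "birth_year": 1970, "gender": "M"},
]
quasi_identifiers = ["zip", "birth_year", "gender"]

# Every full ZIP code is unique, so the raw table is not 2-anonymous.
raw_ok = is_k_anonymous(records, quasi_identifiers, k=2)

# After masking the last two ZIP digits, each entry matches one other entry.
generalized = [generalize_zip(r) for r in records]
generalized_ok = is_k_anonymous(generalized, quasi_identifiers, k=2)

print(raw_ok, generalized_ok)
```

Satisfying this check does not make a data set safe by itself. As the slide notes, good anonymization is still an open problem.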