NICK HALSTEAD, FOUNDER
DATASIFT, @NIK
Big Data
“Myths and Legends”
#BDW13
Thursday, 25 April 13
BIG DATA + SOCIAL DATA
TV MONITORING, POLITICAL TRACKING, FINANCIAL FEEDS
1.5 BILLION ITEMS PER DAY
1.5 PETABYTES OF STORAGE
5000 CPU HADOOP CLUSTER
#DATASIFT
BIG DATA PERCEPTION
#GOOGLE
I THOUGHT I WOULD ASK GOOGLE....
BIG DATA VENDOR “MYTHS”
1. YOU MUST BUY ALL OF THIS (for one job!)
2. HOW BIG IS “BIG”
20 PETABYTES IN EACH SEARCH INDEX REBUILD (this was 2 years ago)
900,000 SERVERS
3.2 BILLION LIKES AND COMMENTS PER DAY
OVER HALF A PETABYTE … EVERY 24 HOURS
150 MILLION SENSORS DELIVERING DATA 40 MILLION TIMES PER SECOND
TENS OF PETABYTES PER YEAR
#HADRON
A TYPICAL COMPANY
100 EMPLOYEES
10,000 CUSTOMERS
1 MILLION TRANSACTION RECORDS
5,000 BYTES PER TRANSACTION
25 DATABASES (customers, transactions, etc.)
=4 GIGABYTES (for largest database)
=20 GIGABYTES (for ALL company data)
A TYPICAL HARD DRIVE
2000 GIGABYTES (2TB)
4000 GIGABYTES (4TB)
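The sizing on the "typical company" and "typical hard drive" slides can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, the 20 GB total is taken from the slide as stated rather than derived, and the record count and record size are the slide's own round numbers. One million records at 5,000 bytes each comes to 5 × 10^9 bytes, which is 5 GB in decimal units or about 4.66 GiB in binary units, so the slide's "4 GIGABYTES" looks like a rounded-down figure.

```python
# Figures quoted on the slides (round numbers, not measurements).
TRANSACTION_RECORDS = 1_000_000
BYTES_PER_TRANSACTION = 5_000
ALL_COMPANY_DATA_GB = 20   # slide's total for all 25 databases, taken as given
DRIVE_GB = 2_000           # the 2 TB drive from the hard drive slide

# Size of the largest database (the transactions table).
largest_db_bytes = TRANSACTION_RECORDS * BYTES_PER_TRANSACTION
largest_db_gb = largest_db_bytes / 10**9    # decimal gigabytes
largest_db_gib = largest_db_bytes / 2**30   # binary gibibytes

# How many such companies would fit on a single 2 TB drive.
companies_per_drive = DRIVE_GB // ALL_COMPANY_DATA_GB

print(f"largest database: {largest_db_gb:.1f} GB ({largest_db_gib:.2f} GiB)")
print(f"companies per 2 TB drive: {companies_per_drive}")
```

On these numbers, all of a typical company's data occupies about 1% of one commodity drive.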
3. YOU NEED *LOTS* OF DATA SCIENTISTS
#DILBERT
4. HOW BIG DATA IS USED
BANKING
COMMUNICATIONS
GOVERNMENT
WEB LOGS 51%
CLICK STREAM 35%
5. HADOOP GONE BAD
+ SQL
#HADOOPGONEBAD
RDBMS - RELATIONAL DATABASE
SCHEMA NEEDS TO BE PRE-DEFINED
REQUIRES INDEXES TO PERFORM
QUERIES ARE CONSTRAINED
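The relational database points above can be seen in miniature with Python's built-in `sqlite3` module. This is a small sketch, not a benchmark: the `transactions` table, its columns, and the `idx_customer` index are hypothetical names chosen for illustration. It shows that the table must be declared before data goes in, that a row with an undeclared column is rejected, and that the query planner switches from a full scan to an index lookup once an index exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to be declared up front.
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, amount REAL NOT NULL)"
)
conn.executemany(
    "INSERT INTO transactions (customer_id, amount) VALUES (?, ?)",
    [(i % 100, float(i)) for i in range(10_000)],
)

def query_plan(connection):
    """Return SQLite's plan for a lookup by customer_id as one string."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT SUM(amount) FROM transactions WHERE customer_id = ?",
        (42,),
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)

plan_without_index = query_plan(conn)   # full table scan
conn.execute("CREATE INDEX idx_customer ON transactions (customer_id)")
plan_with_index = query_plan(conn)      # lookup through idx_customer

# A column that was never declared is rejected by the engine.
try:
    conn.execute(
        "INSERT INTO transactions (customer_id, amount, note) VALUES (1, 2.0, 'x')"
    )
    schema_enforced = False
except sqlite3.OperationalError:
    schema_enforced = True

print(plan_without_index)
print(plan_with_index)
print("undeclared column rejected:", schema_enforced)
```

The exact wording of the plan text varies between SQLite versions, so the sketch prints it rather than assuming a fixed string.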
MAP REDUCE
PROCESS CLOSE TO THE DATA
PARALLEL EXECUTION
ANY TYPE OF ANALYSIS
HIDES DETAILS OF FAULT TOLERANCE, LOCALITY AND LOAD BALANCING
#MAPREDUCE
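The shape of a MapReduce job can be sketched on a single machine. The example below is a toy word count, not Hadoop: the function names are made up for illustration, a thread pool stands in for the cluster, and none of the fault tolerance, data locality, or load balancing that a real framework hides is modelled. It only shows the three steps: map each chunk independently, group the intermediate pairs by key, then reduce each group.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(mapped_chunks):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk is mapped independently, so the map tasks can run in parallel.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(map_phase, chunks))
    groups = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in groups.items())

counts = word_count(["big data myths", "big data legends", "data is the new oil"])
print(counts)
```

Swapping in different map and reduce functions changes the analysis without touching the surrounding plumbing, which is the sense in which the model supports "any type of analysis".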
BIG DATA SCHEMA #NOSQL
HBASE
COLUMNS FILES
(QUICK ASIDE)
#SIDEBAR
GOOGLE STARTED ALL THIS....
GOOGLE FILE SYSTEM (GFS), GOOGLE MAPREDUCE (GMR)
GOOGLE DREMEL
INTERACTIVE ANALYSIS
SCALE UP TO 10,000 SERVERS
COLUMN STORAGE
http://bit.ly/mS8QxX
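The "column storage" point can be illustrated with plain Python data structures. This is a simplified sketch with invented sample records and field names: it contrasts a row layout with a column layout for flat records only, and does not attempt Dremel's actual encoding for nested data. The idea it shows is that an aggregate over one field only needs to read that field's column.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"user": "a", "country": "UK", "bytes": 120},
    {"user": "b", "country": "US", "bytes": 300},
    {"user": "c", "country": "UK", "bytes": 80},
]

# Column-oriented layout: each field is stored as its own contiguous list.
columns = {field: [row[field] for row in rows] for field in rows[0]}

def total_bytes_row_store(records):
    """Touches every record, including fields the query does not need."""
    return sum(record["bytes"] for record in records)

def total_bytes_column_store(cols):
    """Reads only the one column the query asks about."""
    return sum(cols["bytes"])

row_total = total_bytes_row_store(rows)
column_total = total_bytes_column_store(columns)
print(row_total, column_total)
```

Both layouts give the same answer; the difference at scale is how much data has to be read to get it.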
OpenDremel
GOOGLE BIGQUERY
GOOGLE SPANNER
RELATIONAL DATABASE
GLOBALLY DISTRIBUTED
USES GPS / TRUETIME
NO OPEN SOURCE EQUIVALENT
http://research.google.com/archive/spanner.html
#SPANNER #NEWSQL
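The slide only names GPS and TrueTime. As a rough illustration of the idea described in Google's Spanner paper, where the clock API returns an interval rather than a single instant and a commit waits out the uncertainty before it becomes visible, here is a toy simulation. `SimulatedTrueTime`, `epsilon_s`, and `commit` are hypothetical names for this sketch; it uses the local monotonic clock with a fixed error bound, not GPS or atomic clocks, and it is not Spanner's implementation.

```python
import time
from dataclasses import dataclass

@dataclass(frozen=True)
class TTInterval:
    """The true time is assumed to lie somewhere in [earliest, latest]."""
    earliest: float
    latest: float

class SimulatedTrueTime:
    """Toy clock: local monotonic time plus or minus a fixed uncertainty."""

    def __init__(self, epsilon_s=0.005):
        self.epsilon_s = epsilon_s

    def now(self):
        t = time.monotonic()
        return TTInterval(t - self.epsilon_s, t + self.epsilon_s)

    def after(self, timestamp):
        """True once the timestamp has definitely passed."""
        return self.now().earliest > timestamp

def commit(clock):
    """Pick a commit timestamp, then wait until it is definitely in the past."""
    commit_ts = clock.now().latest
    while not clock.after(commit_ts):   # "commit wait"
        time.sleep(clock.epsilon_s / 10)
    return commit_ts

tt = SimulatedTrueTime()
first = commit(tt)
second = commit(tt)
print(second > first)
```

Because each commit waits out its own uncertainty window, a later commit always gets a larger timestamp than an earlier one, even though no clock in the sketch is assumed to be exact.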
BIG DATA IS THE NEW OIL
NICK HALSTEAD, FOUNDER
HTTP://DATASIFT.COM
WE ARE HIRING!!
Big Data Week - Myths and Legends