Big Data Anti-Patterns: 
Lessons from the Front Lines 
Strata NYC 
October 17, 2014 
Douglas Moore
| 2 
About Douglas Moore 
 Think Big – 3 Years 
- Delivery 
• BDW, Search, Streaming 
- Roadmaps 
- Tech Assessments 
 Before Big Data 
- Data Warehousing 
- OLTP 
- Systems Architecture 
- Electricity 
- High End Graphics 
- Supercomputers 
- Numerical Analysis 
Contact me at: 
@douglas_ma
| 3 
Think Big 
 4yr Old “Big Data” Professional Services Firm 
- Roadmaps 
- Engineering 
- Data Science 
- Hands on Training 
Recently acquired by Teradata 
• Maintaining Independence
| 4 
Content Drawn From Vast Amounts of Experience 
… 
50+ Clients 
Leading security software vendor 
Leading Discount Retailer
| 5 
Introduction 
 I started out with just 3 topics… 
 Then while on the road to Strata, 
 I met 7 big data architects 
- Who had 7 clients 
• Who had 7 projects 
• That demonstrated 7 Anti-Patterns 
Big Data Anti-pattern: 
“Commonly applied but bad solution” 
I95 Wikipedia
| 6 
Three Focus Areas 
• Hardware and Infrastructure 
• Tooling 
• Big Data Warehousing 
[Image source: HP: The transformation 
to HP Converged Infrastructure] 
| 7 
Hardware & Infrastructure 
 Reference Architecture Driven 
- 90’s & 00’s data center patterns 
- Servers MUST NOT FAIL 
- Standard Server Config 
• $35,000/node 
• Dual Power supply 
• RAID 
• SAS 15K RPM 
• SAN 
• VMs for Production 
• Flat Network 
Automated provisioning is a good thing!
 Co-locate data and compute 
 Locally Attached Storage 
 Localize & isolate network traffic 
 Rack Awareness 
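Rack awareness in Hadoop is commonly wired up with a topology script (configured through the `net.topology.script.file.name` property) that receives host names or IP addresses as arguments and prints one rack path per argument. The following is a minimal sketch of such a script; the host-to-rack mapping and rack names are hypothetical and would come from your own inventory.

```python
#!/usr/bin/env python3
"""Hypothetical rack topology script for Hadoop rack awareness."""
import sys

# Assumed host-to-rack mapping, for illustration only.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes one or more hosts as arguments and reads
    # one rack path per host from standard output.
    for rack in resolve_racks(sys.argv[1:]):
        print(rack)
```

Unknown hosts fall back to a default rack here, so a missing inventory entry degrades placement quality rather than failing the lookup.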
| 8 
#1 Locality 
 Locality Locality Locality 
- Bring Computation to Data 
[Diagram: Hadoop Cluster vs. VM Cluster, comparing the layout of CPU cores and disks]
| 9 
#2 Sequential IO 
 Sequential IO >> Random Access 
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html 
 Large block IO 
 Append only writes 
 JBOD 
Image credit: Wikipedia.org
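A rough back-of-the-envelope model can illustrate why sequential, large-block IO matters on spinning disks. The sketch below is not a benchmark; the seek time and transfer rate are assumed round figures for a single disk, and the function names are made up for illustration.

```python
def sequential_read_seconds(total_bytes, seek_s, throughput_bps):
    """One seek, then stream the whole range."""
    return seek_s + total_bytes / throughput_bps


def random_read_seconds(total_bytes, block_bytes, seek_s, throughput_bps):
    """One seek per block, then transfer that block."""
    blocks = -(-total_bytes // block_bytes)  # ceiling division
    return blocks * (seek_s + block_bytes / throughput_bps)


# Assumed figures for a single spinning disk, for illustration only.
SEEK_S = 0.010           # 10 ms average seek
THROUGHPUT_BPS = 100e6   # 100 MB/s sustained transfer
ONE_GIB = 1024 ** 3

seq = sequential_read_seconds(ONE_GIB, SEEK_S, THROUGHPUT_BPS)
rnd = random_read_seconds(ONE_GIB, 4096, SEEK_S, THROUGHPUT_BPS)
big = random_read_seconds(ONE_GIB, 128 * 1024 ** 2, SEEK_S, THROUGHPUT_BPS)

print(f"sequential:          {seq:8.1f} s")
print(f"random 4 KiB reads:  {rnd:8.1f} s")
print(f"random 128 MiB reads:{big:8.1f} s")
```

Under these assumptions, small random reads are dominated by seek time, while very large blocks bring the cost back close to a single sequential scan.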
|  Increase # parallel components 
- Reduce component cost 
 Data block replication 
- Availability 
- Performance 
 Commodity++ (2014) 
- High density data nodes 
- $8-12,000 
- ~12 drives 
- ~12-16 cores 
- Buy 4-5 servers for the cost of 1 
• 4-5x spindles 
• 4-5x cores 
#3 Increase parallelism 
10 
|  Expect Failure 
 Rack Awareness 
 Data Block Replication 
 Task Retry 
 Node Black Listing 
 Monitor Everything 
 Name Node HA 
#4 Failure 
11 
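Task retry and node blacklisting can be sketched in a few lines. This is a simplified illustration of the idea, not how Hadoop's scheduler is implemented; the function name, thresholds, and node names are all hypothetical.

```python
def run_with_retry(task, nodes, max_attempts=4, max_failures_per_node=2):
    """Run task(node), retrying on other nodes and blacklisting bad ones."""
    failures = {node: 0 for node in nodes}
    blacklisted = set()
    for attempt in range(max_attempts):
        candidates = [n for n in nodes if n not in blacklisted]
        if not candidates:
            break  # every node has been blacklisted
        node = candidates[attempt % len(candidates)]
        try:
            return task(node)
        except Exception:
            # Count the failure; stop scheduling on nodes that keep failing.
            failures[node] += 1
            if failures[node] >= max_failures_per_node:
                blacklisted.add(node)
    raise RuntimeError("task failed on all attempts")


calls = []


def flaky(node):
    """Hypothetical task that always fails on node n1."""
    calls.append(node)
    if node == "n1":
        raise IOError("disk failure")
    return f"done on {node}"


result = run_with_retry(flaky, ["n1", "n2"])
print(result, calls)
```

The design choice is that a single failure is treated as normal and simply retried elsewhere; only repeated failures on the same node take it out of rotation.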
| Tooling 
 Hadoop Ecosystem Tools 
12
| Tooling: Just looking inside the box 
 “If it came in the box then I should use it” 
 Example 
- Oozie for scheduling 
13 
Best Practice: 
• Use your current enterprise scheduler
| Tooling: NoSQL 
14 
• “Now I have all of my log data 
in NoSQL, let’s do analytics 
over it” 
 Example 
- Streaming data into MongoDB 
• Running aggregates 
• Running MR jobs
| Best Practice 
15 
Best Practice: 
• Split the stream 
• Real-time access in NoSQL 
• Batch analytics in Hadoop
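The split-the-stream practice can be sketched as a simple fan-out in which every event is written to both paths. The two classes below are in-memory stand-ins, not real NoSQL or Hadoop clients, and the event fields (`user`, `date`) are assumptions for illustration.

```python
from collections import defaultdict


class RealTimeStore:
    """In-memory stand-in for a NoSQL store serving key lookups."""

    def __init__(self):
        self.latest = {}

    def put(self, event):
        # Keep only the most recent event per user for fast lookup.
        self.latest[event["user"]] = event


class BatchSink:
    """In-memory stand-in for append-only files landed in Hadoop."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def append(self, event):
        # Append to a date partition; nothing is ever updated in place.
        self.partitions[event["date"]].append(event)


def split_stream(events, store, sink):
    """Send every event down both the real-time and the batch path."""
    for event in events:
        store.put(event)
        sink.append(event)


store, sink = RealTimeStore(), BatchSink()
split_stream(
    [
        {"user": "u1", "date": "2014-10-17", "action": "login"},
        {"user": "u1", "date": "2014-10-17", "action": "search"},
    ],
    store,
    sink,
)
print(store.latest["u1"]["action"], len(sink.partitions["2014-10-17"]))
```

The lookup path keeps only what is needed for real-time access, while the batch path retains the full history for aggregates and MapReduce-style jobs.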
|  Key Purpose 
- Integrate legacy code 
- Integrate analytic tools 
• Data science libs 
Right Framework, Right Need… 
 Hadoop supports integrating 
any type of application tooling 
- Hadoop Streaming 
• Python 
• R 
• C, C++ 
• Fortran 
• COBOL 
• Ruby 
18
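Hadoop Streaming runs any executable that reads lines from standard input and writes tab-separated key/value lines to standard output, passed to the job through its `-mapper` and `-reducer` options. The sketch below shows a counting job in Python; the script layout and the `map`/`reduce` command-line argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: "script.py map" as mapper, "script.py reduce" as reducer.
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

Because the framework sorts mapper output by key before the reduce step, the reducer only has to group adjacent lines; the same functions can be checked locally by sorting the mapper output in memory.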
| Right Use Case – ETL, Wrong Framework 
 Got to love Ruby 
- Very Cool (or it was) 
- Dynamic Language 
- Expressive 
- Compact 
- Fast Iteration 
 Got to Hate Ruby 
- Slow 
- Hard to follow & debug 
- Does not play well with threading 
19 
“It’s much faster to develop in, 
developer time is valuable, 
just throw a couple more boxes at it” 
Bench tested at 5,000 records/second
| Right Use Case – ETL, Wrong Framework… 
20 
DO THE MATH: 
Storm Java: ~ 1MM+ events / second / Server 
Storm Ruby: 5000 * 12 cores = 60,000 events / second / Server 
= 16.67 times more servers for Storm Ruby (1,000,000 / 60,000) 
“Test and Learn!” 
Best Practice: 
• Write new code in the fastest execution framework 
• For high-value legacy code and analytic tools, use Hadoop Streaming
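The server-count arithmetic from this slide can be reproduced in a few lines. The throughput figures come from the slide; the 5 million events/second target is a hypothetical workload added only to show how the ratio turns into server counts.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000  # slide: ~1MM+ events/second/server
RUBY_EVENTS_PER_CORE = 5_000        # slide: bench-tested rate
CORES_PER_SERVER = 12               # slide: 12 cores


def servers_needed(target_events_per_s, events_per_server):
    """Whole servers required to sustain the target rate."""
    return math.ceil(target_events_per_s / events_per_server)


ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server

TARGET = 5_000_000  # hypothetical workload, events/second
java_servers = servers_needed(TARGET, JAVA_EVENTS_PER_SERVER)
ruby_servers = servers_needed(TARGET, ruby_events_per_server)

print(f"Ruby per server: {ruby_events_per_server} events/s")
print(f"Ratio: {ratio:.2f}x more servers")
print(f"Servers for {TARGET} events/s: Java {java_servers}, Ruby {ruby_servers}")
```

At that assumed target, "a couple more boxes" becomes 84 servers instead of 5, which is the point of the slide.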
| Big Data Warehousing 
 #1 ETL Offload 
 #2 Data Warehousing 
21
| Right Schema 
22 
3NF - Transactional Source System Schema 
order 
customer 
order line 
product 
contract 
sales_person 
Dimensional Schema 
customer contract 
order 
product 
order line 
sales_person 
Data Warehouse 
Hadoop 
OLTP 
customer contract order order line product sales_person 
De-normalized schema
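De-normalizing for Hadoop amounts to resolving the joins once, at load time, so that each row carries everything a full scan needs. The sketch below flattens a few of the entities named on this slide; the column names and sample values are hypothetical.

```python
# Normalized (3NF-style) source tables, keyed by id. Sample data is made up.
customers = {1: {"name": "Acme"}}
products = {10: {"sku": "WIDGET"}}
orders = {100: {"customer_id": 1}}
order_lines = [{"order_id": 100, "product_id": 10, "qty": 3}]


def denormalize(order_lines, orders, customers, products):
    """Produce one wide row per order line with the lookups resolved."""
    rows = []
    for line in order_lines:
        order = orders[line["order_id"]]
        rows.append(
            {
                "order_id": line["order_id"],
                "customer_name": customers[order["customer_id"]]["name"],
                "product_sku": products[line["product_id"]]["sku"],
                "qty": line["qty"],
            }
        )
    return rows


rows = denormalize(order_lines, orders, customers, products)
print(rows)
```

The trade-off is redundancy: customer and product attributes are repeated on every row, in exchange for scans that need no joins.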
| 23 
Right Workload, Right Tool 
Workload Hadoop NoSQL MPP, Reporting DBs, Mainframe 
ETL 
Business Intelligence 
Cross business reporting 
Sub-set analytics 
Full scan analytics 
Decision Support TBs-PBs GB-TBs 
Operational Reports 
Complex security requirements 
Search 
Fast Lookup
| Summary 
 Understand strengths & weaknesses of each choice 
- Get help if needed 
 Deploy the right tool for the right workload 
 Test and Learn 
24
| Thank You 
25 
Douglas Moore 
@douglas_ma 
Work with the best on a wide variety of cool projects: 
• recruiting@thinkbiganalytics.com
Work with the 
Leading Innovator in Big Data 
DATA SCIENTISTS 
DATA ARCHITECTS 
DATA SOLUTIONS 
Think Big Start Smart Scale Fast 
26

Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 

Big Data Anti-Patterns: Lessons from the Front Lines

  • 1. Big Data Anti-Patterns: Lessons from the Front Lines Strata NYC October 17, 2014 Douglas Moore
  • 2. About Douglas Moore. Think Big – 3 Years: Delivery (BDW, Search, Streaming), Roadmaps, Tech Assessments. Before Big Data: Data Warehousing, OLTP, Systems Architecture, Electricity, High End Graphics, Supercomputers, Numerical Analysis. Contact me at: @douglas_ma
  • 3. Think Big. 4yr Old “Big Data” Professional Services Firm: Roadmaps, Engineering, Data Science, Hands on Training. Recently acquired by Teradata (Maintaining Independence)
  • 4. Content Drawn From Vast Amounts of Experience: 50+ Clients … Leading security software vendor, Leading Discount Retailer
  • 5. Introduction. I started out with just 3 topics… Then while on the road to Strata, I met 7 big data architects, who had 7 clients, who had 7 projects, that demonstrated 7 Anti-Patterns. Big Data Anti-pattern: “Commonly applied but bad solution” I95 Wikipedia
  • 6. Three Focus Areas: Hardware and Infrastructure • Tooling • Big Data Warehousing
  • 7. Hardware & Infrastructure. Reference Architecture Driven: 90’s & 00’s data center patterns; Servers MUST NOT FAIL; Standard Server Config ($35,000/node, Dual Power supply, RAID, SAS 15K RPM, SAN, VMs for Production, Flat Network). Automated provisioning is a good thing! [Image source: HP: The transformation to HP Converged Infrastructure]
  • 8. #1 Locality. Locality Locality Locality: Bring Computation to Data. Co-locate data and compute; Locally Attached Storage; Localize & isolate network traffic; Rack Awareness. [Diagram: Hadoop Cluster vs. VM Cluster, each drawn as a grid of CPU cores and disks]
  • 9. #2 Sequential IO. Sequential IO >> Random Access (http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html). Large block IO; Append only writes; JBOD. Image credit: Wikipedia.org
  • 10. #3 Increase parallelism. Increase # parallel components: Reduce component cost. Data block replication: Availability, Performance. Commodity++ (2014): High density data nodes; $8-12,000; ~12 drives; ~12-16 cores; Buy 4-5 servers for the cost of 1 (4-5x spindles, 4-5x cores). [Diagram: cluster nodes drawn as CPU cores and disks]
  • 11. #4 Failure. Expect Failure¹,²; Rack Awareness; Data Block Replication; Task Retry; Node Black Listing; Monitor Everything; Name Node HA. [Diagram: cluster nodes drawn as CPU cores and disks]
  • 12. Tooling: Hadoop Ecosystem Tools
  • 13. Tooling: Just looking inside the box. “If it came in the box then I should use it”. Example: Oozie for scheduling. Best Practice: Use your current enterprise scheduler
  • 14. Tooling: NoSQL. “Now I have all of my log data in NoSQL, let’s do analytics over it”. Example: Streaming data into MongoDB (Running aggregates, Running MR jobs)
  • 15. Best Practice: Split the stream • Real-time access in NoSQL • Batch analytics in Hadoop
  • 16. Right Framework, Right Need… Hadoop supports integrating any type of application tooling: Hadoop Streaming (Python, R, C, C++, Fortran, Cobol, Ruby). Key Purpose: Integrate legacy code; Integrate analytic tools (Data science libs)
  • 17. Right Use Case – ETL, Wrong Framework. Got to love Ruby: Very Cool (or it was), Dynamic Language, Expressive, Compact, Fast Iteration. Got to Hate Ruby: Slow, Hard to follow & debug, Does not play well with threading. “It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”. Bench tested at 5,000 records / second
  • 18. Right Use Case – ETL, Wrong Framework… DO THE MATH: Storm Java: ~ 1MM+ events / second / Server. Storm Ruby: 5000 * 12 cores = 60,000 events / second / Server = 16.67 times more servers. “Test and Learn!” Best Practice: Write new code in fastest execution framework • High value legacy code, analytic tools use Hadoop Streaming
  • 19. Big Data Warehousing: #1 ETL Offload, #2 Data Warehousing
  • 20. Right Schema. OLTP: 3NF - Transactional Source System Schema (order, customer, order line, product, contract, sales_person). Data Warehouse: Dimensional Schema (customer, contract, order, product, order line, sales_person). Hadoop: De-normalized schema (customer, contract, order, order line, product, sales_person)
  • 21. Right Workload, Right Tool. [Table. Columns: Workload; Hadoop; NoSQL; MPP, Reporting DBs, Mainframe. Rows: ETL; Business Intelligence; Cross business reporting; Sub-set analytics; Full scan analytics; Decision Support (TBs-PBs, GB-TBs); Operational Reports; Complex security requirements; Search; Fast Lookup]
  • 22. Summary. Understand strengths & weaknesses of each choice (Get help if needed); Deploy the right tool for the right workload; Test and Learn
  • 23. Thank You. Douglas Moore @douglas_ma. Work with the best on a wide variety of cool projects: recruiting@thinkbiganalytics.com
  • 24. Work with the Leading Innovator in Big Data: DATA SCIENTISTS, DATA ARCHITECTS, DATA SOLUTIONS. Think Big Start Smart Scale Fast
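
The "DO THE MATH" slide (slide 18) can be replayed as a short calculation. The sketch below is only an illustration: the per-core and per-server rates are the figures quoted on the slides, it assumes (as the slide does) that Ruby throughput scales linearly across 12 cores, and the 3,000,000 events/second target and the `servers_needed` helper are hypothetical names and numbers introduced here.

```python
import math

# Figures quoted on the slides.
JAVA_EVENTS_PER_SERVER = 1_000_000  # "Storm Java: ~1MM+ events / second / Server"
RUBY_EVENTS_PER_CORE = 5_000        # "Bench tested at 5,000 records / second"
CORES_PER_SERVER = 12               # "5000 * 12 cores"


def servers_needed(target_events_per_sec, per_server_rate):
    """Hypothetical helper: whole servers required to sustain a target rate."""
    return math.ceil(target_events_per_sec / per_server_rate)


# Assumes linear scaling across cores, as the slide's arithmetic does.
ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server

# ASSUMPTION: an illustrative workload, not a number from the talk.
target = 3_000_000
ruby_servers = servers_needed(target, ruby_events_per_server)
java_servers = servers_needed(target, JAVA_EVENTS_PER_SERVER)

print(f"Ruby: {ruby_events_per_server:,} events/s/server")
print(f"Ratio of servers, Ruby vs. Java: {server_ratio:.2f}x")
print(f"For {target:,} events/s: {ruby_servers} Ruby servers vs. {java_servers} Java servers")
```

Under these assumptions the ratio matches the slide's 16.67 figure, which is the point of the "just throw a couple more boxes at it" rebuttal: the gap is measured in racks, not boxes.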

Editor's Notes

  1. 3 years at Think Big: Big Data Warehouse, Search, Streaming, Big Data Roadmaps, Tech assessments. Worked on 5 distributions, including the original Apache.
  2. The strengths we bring into this presentation… This is not even half of it.
  3. I wrote the proposal for this spot with just 3 topics in mind, then I began discussing this with my colleagues and the topic generated quite a bit of buzz. It’s amazing how much energy people will put into explaining crazy things they’ve seen. With all of the architects, clients, and projects, how many anti-patterns did I come to Strata with? 343, if you’re doing the math in your head.
  4. Many of our customers are big, successful companies that have been around a long time. During the 90’s and the Oughts, they developed reference architectures based on input from companies like EMC, HP, and IBM. They developed the mindset “SERVERS MUST NOT FAIL”. And this is what you needed for your Oracle OLTP servers to supplant Mainframe DB2 & Tandem. These servers can range up to $35k/node. At one client they were too embarrassed to show me how much; they referenced “Proprietary Information”. I could tell they spent a lot, based on the data node specs: dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network… The best part of this reference architecture is automated provisioning & configuration management. Unfortunately I don’t see that as often as I would like. Let’s go back to first principles of Hadoop & Big Data… [turn] Also seeing Hadoop companies migrate back this way to capture dying or dead data, e.g. the Cloudera–Isilon partnership: not the best performance, but it does turn that archive data from “dead data” into data producing business value.
  5. Let’s talk about big data and Hadoop first principles. Everything is about locality: it is best to bring your computation to where your data is. What Hadoop does is shown here in the diagram.
  6. It doesn’t matter whether it is hard disk, SSD, main memory, or cache: sequential IO is always faster. See that actuator arm? On a modern drive it takes a good 4ms to move from one track to another. That’s a lifetime in terms of computing. SANs, virtual drives, and multi-tenant VM farms all essentially incur random access reads (and writes). Hadoop strives to move that arm as infrequently as possible. The only thing slower than a disk seek is a round trip to the Netherlands. So what Hadoop does: it does IO in large blocks; writes are append only; disks on each node are in a ‘Just a Bunch of Disks’ configuration, and Hadoop is the only workload accessing those drives, so it can force sequential access and optimal throughput.
  7. Increase your parallelism: increase the # of components. To keep in budget, though, spend less per component. Hadoop helps out in this area with data block replication. Buy more servers for the cost of one. You get more spindles and cores, ultimately to get more throughput.
  8. With more components you need to expect failure. Handle failures in software. “That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment.” Best practice: fix it in the morning. There are more first principles, namely no locking…
  9. Now I want to talk a bit about tooling: tools within and around Hadoop.
  10. A common anti-pattern is: “If it came in the box…” Your enterprise scheduler is well coordinated with the rest of your environment. Others include “We should use Pig”. Me: “But all your people are SQL programmers.” Another reason: SQL leads in terms of optimizations and performance. I like Pig for deeply nested data structures.
  11. A NoSQL anti-pattern develops over the course of time. First, streaming data is loaded into NoSQL to provide some near-real time content serving. This falls down because of some of the previous first principles, namely locality.
  12. Best practice: split the streams.
  13. There’s a place for each of these technologies; we see them as complementary. It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
  14. So, choose the right tool for the right job.
  15. Let’s talk about Hadoop Streaming, not to be confused with stream processing (Samza, Spark Streaming, Storm). The key purpose of Hadoop Streaming is to integrate legacy code and to integrate analytic tools, like Python, R…
  16. Got to love, got to hate. I often hear this argument.
  17. New code, especially high volume ETL code: write it in the fastest execution framework. High value legacy code: Hadoop Streaming.
  18. Top Hadoop use cases. #2 use case for Hadoop: quickly combine data from data silo systems.
  19. Thanks for your time today. We look forward to helping you drive new value from big data. Questions? Next steps?
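
The sequential IO argument in note 6 can be made concrete with a little arithmetic. This is a rough sketch, not a disk benchmark: the 4 ms seek time is the figure from the notes, while the 100 MB/s sequential transfer rate, the two block sizes, and the `effective_throughput` helper are assumptions chosen only to illustrate the effect.

```python
SEEK_SECONDS = 0.004        # from the notes: "a good 4ms" per track-to-track move
SEQ_BYTES_PER_SEC = 100e6   # ASSUMPTION: illustrative sequential transfer rate


def effective_throughput(block_bytes,
                         seek_seconds=SEEK_SECONDS,
                         seq_rate=SEQ_BYTES_PER_SEC):
    """Bytes/second when each block read pays one seek, then streams the block."""
    transfer_seconds = block_bytes / seq_rate
    return block_bytes / (seek_seconds + transfer_seconds)


# ASSUMPTION: 4 KiB stands in for small random reads, 128 MiB for a large block.
small_block = effective_throughput(4 * 1024)
large_block = effective_throughput(128 * 1024 * 1024)

print(f"4 KiB blocks:   {small_block / 1e6:.2f} MB/s")
print(f"128 MiB blocks: {large_block / 1e6:.2f} MB/s")
```

With these assumed numbers, small random reads deliver about 1 MB/s because nearly all of the time goes to seeking, while large blocks stay above 99% of the sequential rate. That is the motivation for large block IO and append only writes on JBOD.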