| 1 
Big Data Anti-Patterns: 
Lessons from the Front Lines 
Douglas Moore 
Principal Data Architect 
Think Big, a Teradata Company
| 2 
About Douglas Moore
• Think Big – 3 Years
  - Roadmaps
  - Delivery
    • BDW, Search, Streaming
  - Tech Assessments
• Before Big Data
  - Data Warehousing
  - OLTP
  - Systems Architecture
  - Electricity
  - High End Graphics
  - Supercomputers
  - Numerical Analysis
Contact me at: 
@douglas_ma
| 3 
Think Big
• 4yr Old “Big Data” Professional Services Firm
  - Roadmaps
  - Engineering
  - Data Science
  - Hands on Training
Recently acquired by Teradata
• Maintaining Independence
| 4 
Content Drawn From Vast Amounts of Experience
50+ Clients
• Leading security software vendor
• Leading Discount Retailer
• …
| 5
Introduction
• I started out with just 3 topics…
• Then while on the road to Strata,
• I met 7 big data architects
  - Who had 7 clients
    • Who had 7 projects
    • That demonstrated 7 Anti-Patterns
Big Data Anti-pattern:
“Commonly applied but bad solution”
I95 Wikipedia
| 6 
Three Focus Areas 
• Hardware and Infrastructure 
• Tooling 
• Big Data Warehousing 
| 7 
Hardware & Infrastructure
• Reference Architecture Driven
  - 90’s & 00’s data center patterns
  - Servers MUST NOT FAIL
  - Standard Server Config
    • $35,000/node
    • Dual Power supply
    • RAID
    • SAS 15K RPM
    • SAN
    • VMs for Production
    • Flat Network
[Image source: HP: The transformation to HP Converged Infrastructure]
Automated provisioning is a good thing!
| 8
#1 Locality
• Locality, Locality, Locality
  - Bring Computation to Data
• Co-locate data and compute
• Locally Attached Storage
• Localize & isolate network traffic
• Rack Awareness
[Figure panels: VM Cluster, Hadoop Cluster]
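Rack awareness is usually configured by pointing Hadoop at a topology script (the `net.topology.script.file.name` property) that receives host addresses as arguments and prints one rack path per address. The sketch below is a minimal illustration of that contract; the subnet-to-rack mapping and the rack names are hypothetical and not taken from the talk.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop rack-topology script (hypothetical subnet-to-rack map)."""
import sys

# Hypothetical mapping; a real cluster would derive this from inventory data.
RACKS = {
    "10.1.1.": "/dc1/rack1",
    "10.1.2.": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def rack_for(host: str) -> str:
    """Return the rack path for a host address, or the default rack."""
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return DEFAULT_RACK


if __name__ == "__main__":
    # Hadoop passes one or more addresses as arguments and expects
    # one rack path per argument on stdout.
    for host in sys.argv[1:]:
        print(rack_for(host))
```

With rack paths known, the scheduler and the block placement policy can keep work close to the data and spread replicas across racks.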
| 9 
#2 Sequential IO
• Sequential IO >> Random Access
  http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
• Large block IO
• Append only writes
• JBOD
Image credit: Wikipedia.org
| 10
#3 Increase Parallelism
• Increase # parallel components
  - Reduce component cost
• Data block replication
  - Performance
  - Availability
• Commodity++ (2014)
  - High density data nodes
  - $8-12,000
  - ~12 drives
  - ~12-16 cores
  - Buy more servers for the cost of one
    • 4-5x spindles
    • 4-5x cores
[Figure: Hadoop Cluster]
| 11
#4 Failure
• Expect Failure [1,2]
• Rack Awareness
• Data Block Replication
• Task Retry
• Node Black Listing
• Monitor Everything
• Name Node HA
[Figure: Hadoop Cluster]
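Task retry and node blacklisting can be pictured with a small simulation. This is only a sketch of the idea, not Hadoop's scheduler: the function name, the retry budget, and the blacklist threshold are assumptions chosen for illustration.

```python
from collections import Counter

MAX_ATTEMPTS = 4      # assumed retry budget per task
BLACKLIST_AFTER = 2   # assumed failures before a node is skipped


def run_with_retry(task, nodes, execute):
    """Try `task` on successive nodes, skipping blacklisted ones.

    `execute(task, node)` is supplied by the caller and raises on failure.
    Returns (result, failures_per_node).
    """
    failures = Counter()
    attempts = 0
    while attempts < MAX_ATTEMPTS:
        # Nodes that failed too often are treated as blacklisted.
        healthy = [n for n in nodes if failures[n] < BLACKLIST_AFTER]
        if not healthy:
            break
        node = healthy[attempts % len(healthy)]
        attempts += 1
        try:
            return execute(task, node), failures
        except Exception:
            failures[node] += 1  # record the failure and try elsewhere
    raise RuntimeError(f"task {task!r} failed after {attempts} attempts")
```

The point of the design is that a single bad node costs one retry rather than the whole job, which is why inexpensive, failure-prone hardware is acceptable.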
| 12
Tooling
• Hadoop Ecosystem Tools
Image source: http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG
| 13 
Tooling: Just looking inside the box
• “If it came in the box then I should use it”
• Example
  - Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
| 14 
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
• Example
  - Streaming data into MongoDB
    • Running aggregates
    • Running MR jobs
| 15 
Best Practice:
• Split the stream 
• Real-time access in NoSQL 
• Batch analytics in Hadoop
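Splitting the stream can be sketched as a simple fan-out: every event goes both to a key-value store for real-time lookups and to an append-only log for later batch processing. In this sketch a `dict` stands in for the NoSQL store and a `list` for a file in HDFS; both stand-ins and the event shape are assumptions.

```python
import json


def split_stream(events, kv_store, batch_log):
    """Send each event down both paths.

    kv_store:  dict standing in for a NoSQL store (latest event per id).
    batch_log: list standing in for an append-only file in HDFS.
    """
    for event in events:
        kv_store[event["id"]] = event                        # real-time access path
        batch_log.append(json.dumps(event, sort_keys=True))  # batch analytics path


kv, log = {}, []
split_stream(
    [{"id": "a", "v": 1}, {"id": "b", "v": 5}, {"id": "a", "v": 2}],
    kv,
    log,
)
```

The store holds only the latest state per key, while the log keeps every raw event, so full-scan analytics never have to run against the serving database.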
| 18
Right Framework, Right Need…
• Hadoop Streaming
  - Integrate legacy code
  - Integrate analytic tools
    • Data science libs
• Hadoop integrates any type of application tooling
  - Java
  - Python
  - R
  - C, C++
  - Fortran
  - Cobol
  - Ruby
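Hadoop Streaming runs any executable as a mapper or reducer: records arrive on standard input and tab-separated key/value pairs are written to standard output. A minimal mapper sketch in Python follows; the word-count logic is only an illustrative stand-in for legacy or analytic code.

```python
#!/usr/bin/env python3
"""Sketch of a Hadoop Streaming mapper (word count as a stand-in workload)."""
import sys


def map_line(line):
    """Yield (word, 1) pairs for one input line."""
    for word in line.split():
        yield word.lower(), 1


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Hadoop Streaming feeds input records on stdin and reads
    # tab-separated key/value pairs from stdout.
    for line in stdin:
        for key, value in map_line(line):
            stdout.write(f"{key}\t{value}\n")


# Only read stdin when it is piped in, so the file can also be imported.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

A script like this would typically be passed to the streaming jar through its `-mapper` option, with a matching reducer that sums the values per key.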
| 19
Right Use Case – ETL, Wrong Framework
• Got to love Ruby
  - Very Cool (or it was)
  - Dynamic Language
  - Expressive
  - Compact
  - Fast Iteration
• Got to Hate Ruby
  - Slow
  - Hard to follow & debug
  - Does not play well with threading
“It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”
Bench tested Ruby ETL framework at 5,000 records / second
| 20 
Right Use Case – ETL, Wrong Framework…
DO THE MATH:
Storm Java: ~1MM+ events / second / server
Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server
1,000,000 / 60,000 = 16.67 times more servers with Ruby
bit.ly/1t0HXJH
Best Practice:
• Write new code in the fastest execution framework
• For high-value legacy code and analytic tools, use Hadoop Streaming
• Innovation is important: Test and Learn
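The arithmetic on this slide can be reproduced in a few lines. The throughput figures come from the slide; the target load of 3 million events per second is a hypothetical number added only to show how the ratio turns into server counts.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000  # "~1MM+ events / second / server" (slide)
RUBY_EVENTS_PER_CORE = 5_000        # bench-tested Ruby ETL rate (slide)
CORES_PER_SERVER = 12               # core count used on the slide


def servers_needed(target_events_per_sec, events_per_server):
    """Round up: a fraction of a server still means buying a whole one."""
    return math.ceil(target_events_per_sec / events_per_server)


ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server

target = 3_000_000  # hypothetical load, not from the talk
print(f"Ruby per server: {ruby_events_per_server:,} events/s")
print(f"Server ratio:    {ratio:.2f}x")
print(f"Java servers:    {servers_needed(target, JAVA_EVENTS_PER_SERVER)}")
print(f"Ruby servers:    {servers_needed(target, ruby_events_per_server)}")
```

At that hypothetical load the difference is 3 servers against 50, which is the cost hidden behind "just throw a couple more boxes at it".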
| 21 
Big Data Warehousing
• Hadoop Use Cases
  1. ETL Offload
  2. Data Warehousing
• Hadoop Data Types
  1. Structured
  2. Semi-structured
  3. Multi or Unstructured
| 22 
First Principles: #5 Schema on Read
Don’t over curate:
• “We are going to
  - Define and parse 1,000 attributes from the machine log files on ETL servers,
  - load just what we need to,
  - this will take 6 months”
• HCatalog
• Navigator, Loom,…
• UDFs, UDTFs
  - JSON, Regex built in
  - Custom Java
  - Hadoop Streaming (e.g. use Python, Perl)
• Hive Partitions
• Recursive directory reads
• Bucket Joins
• Columnar formats
  - ORC
  - Parquet
Best Practices:
• Define what you need to
• Parse on Demand
• Structure to optimize
• Beware the data palace fountain & data swamp
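"Parse on demand" can be illustrated with raw log lines that are stored untouched and only parsed for the attributes a given question needs. The log format, field names, and sample lines below are hypothetical; in Hive the same idea would usually be expressed with the built-in regex or JSON functions rather than Python.

```python
import re

# Hypothetical raw machine-log lines, stored as-is (no up-front schema).
RAW_LOGS = [
    "2014-10-20T12:00:01 host=web01 status=200 bytes=512",
    "2014-10-20T12:00:02 host=web02 status=500 bytes=0",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def parse_on_demand(line, wanted):
    """Extract only the requested key=value attributes from a raw line."""
    fields = dict(FIELD.findall(line))
    return {name: fields.get(name) for name in wanted}


def error_hosts(lines):
    """Answer one question, reading just the two attributes it needs."""
    for line in lines:
        row = parse_on_demand(line, ("host", "status"))
        if row["status"] == "500":
            yield row["host"]


print(list(error_hosts(RAW_LOGS)))
```

Nothing about the other attributes has to be defined before this query runs; a new question simply asks for a different set of names.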
| 23 
Right Schema
• OLTP: 3NF - Transactional Source System Schema
  - order, customer, order line, product, contract, sales_person
• Data Warehouse: Dimensional Schema
  - customer, contract, order, product, order line, sales_person
• Hadoop: De-normalized schema
  - customer, contract, order, order line, product, sales_person
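De-normalizing means joining the normalized source tables once, at load time, into wide records that can be scanned without further joins. The sketch below shows the idea with in-memory dictionaries; the table contents and column names are hypothetical, and only a few of the entities from the slide are included.

```python
# Hypothetical normalized source tables, keyed by primary key.
customers = {1: {"customer_name": "Acme"}}
products = {10: {"product_name": "Widget"}}
orders = {100: {"customer_id": 1}}
order_lines = [
    {"order_id": 100, "product_id": 10, "quantity": 3},
]


def denormalize(order_lines, orders, customers, products):
    """Join each order line with its order, customer, and product."""
    for line in order_lines:
        order = orders[line["order_id"]]
        yield {
            "order_id": line["order_id"],
            "quantity": line["quantity"],
            **customers[order["customer_id"]],
            **products[line["product_id"]],
        }


wide_rows = list(denormalize(order_lines, orders, customers, products))
```

Each wide row repeats customer and product attributes, trading storage for scan-friendly reads, which fits the sequential IO principle from earlier in the deck.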
| 24 
Right Workload, Right Tool 
Workload | Hadoop | NoSQL | MPP, Reporting DBs, Mainframe
ETL 
Business Intelligence 
Cross business reporting 
Sub-set analytics 
Full scan analytics 
Decision Support TBs-PBs GB-TBs 
Operational Reports 
Complex security requirements 
Search 
Fast Lookup
| 25
Summary
• Understand strengths & weaknesses of each choice
  - Get help as needed to make your first effort successful
• Deploy the right tool for the right workload
• Test and Learn
http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
| 26 
Thank You 
Douglas Moore 
@douglas_ma 
Work with the best on a wide range of cool projects: 
recruiting@thinkbiganalytics.com
Work with the Leading Innovator in Big Data
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big, Start Smart, Scale Fast

Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 

Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

  • 1. | 1 Big Data Anti-Patterns: Lessons from the Front Lines Douglas Moore Principal Data Architect Think Big, a Teradata Company
  • 2. | 2 About Douglas Moore  Think Big – 3 Years - Roadmaps - Delivery • BDW, Search, Streaming - Tech Assessments 2  Before Big Data - Data Warehousing - OLTP - Systems Architecture - Electricity - High End Graphics - Supercomputers - Numerical Analysis Contact me at: @douglas_ma
  • 3. | 3 Think Big 3  4yr Old “Big Data” Professional Services Firm - Roadmaps - Engineering - Data Science - Hands on Training Recently acquired by Teradata • Maintaining Independence
  • 4. | 4 Content Drawn From Vast Amounts of Experience 4 … 50+ Clients Leading security software vendor Leading Discount Retailer
  • 5. | 5  I started out with just 3 topics…  Then while on the road to Strata,  I met 7 big data architects - Who had 7 clients • Who had 7 projects • That demonstrated 7 Anti-Patterns Introduction 5 Big Data Anti-pattern: “Commonly applied but bad solution” I95 Wikipedia
  • 6. | 6 Three Focus Areas • Hardware and Infrastructure • Tooling • Big Data Warehousing 6
  • 7. | 7 Hardware & Infrastructure  Reference Architecture Driven - 90’s & 00’s data center patterns - Servers MUST NOT FAIL - Standard Server Config • $35,000/node • Dual Power supply • RAID • SAS 15K RPM • SAN • VMs for Production • Flat Network 7 [Image source: HP: The transformation to HP Converged Infrastructure] Automated provisioning is a good thing!
  • 8. | 8  Locality Locality Locality - Bring Computation to Data #1 Locality 8  Co-locate data and compute  Locally Attached Storage  Localize & isolate network traffic  Rack Awareness VM Cluster Hadoop Cluster
  • 9. | 9 #2 Sequential IO  Sequential IO >> Random Access 9 http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html  Large block IO  Append only writes  JBOD Image credit: Wikipedia.org
  • 10. | 10  Increase # parallel components - Reduce component cost  Data block replication - Performance - Availability  Commodity++ (2014) - High density data nodes - $8-12,000 - ~12 drives - ~12-16 cores - Buy more servers for the cost of one • 4-5x spindles • 4-5x cores #3 Increase Parallelism 10 Hadoop Cluster
  • 11. | 11  Expect Failure  Rack Awareness Hadoop Cluster  Data Block Replication  Task Retry  Node Black Listing  Monitor Everything  Name Node HA #4 Failure 11
  • 12. | 12  Hadoop Ecosystem Tools Tooling 12 http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG
  • 13. | 13 Tooling: Just looking inside the box  “If it came in the box then I should use it”  Example - Oozie for scheduling 13 Best Practice: • Use your current enterprise scheduler
  • 14. | 14 Tooling: NoSQL 14 • “Now I have all of my log data in NoSQL, let’s do analytics over it”  Example - Streaming data into Mongo DB • Running aggregates • Running MR jobs
  • 15. | 15 Best Practice 15 Best Practice: • Split the stream • Real-time access in NoSQL • Batch analytics in Hadoop
  • 16. | 18  Hadoop Streaming - Integrate legacy code - Integrate analytic tools • Data science libs Right Framework, Right Need…  Hadoop integrates any type of application tooling - Java - Python - R - C, C++ - Fortran - Cobol - Ruby 18
  • 17. | 19  Got to love Ruby - Very Cool (or it was) - Dynamic Language - Expressive - Compact - Fast Iteration  Got to Hate Ruby - Slow - Hard to follow & debug - Does not play well with threading Right Use Case – ETL, Wrong Framework 19 “It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it” Bench tested Ruby ETL framework at 5,000 records / second
  • 18. | 20 Right Use Case – ETL, Wrong Framework… 20 DO THE MATH: Storm Java: ~ 1MM+ events / second / server Storm Ruby: 5000 * 12 cores = 60,000 events / second / server = 16.67 times more servers bit.ly/1t0HXJH Best Practice: • Write new code in fastest execution framework • High value legacy code, analytic tools use Hadoop Streaming • Innovation is Important: Test and Learn
  • 19. | 21 Big Data Warehousing  Hadoop Use Cases 1. ETL Offload 2. Data Warehousing 21  Hadoop Data Types 1. Structured 2. Semi-structured 3. Multi or Unstructured
  • 20. | 22 Don’t over curate:  “We are going to - Define and parse 1,000 attributes from the machine log files on ETL servers, - load just what we need to, - this will take 6 months”  HCatalog  Navigator, Loom,…  UDFs, UDTFs - JSON, Regex built in - Custom Java - Hadoop Streaming (e.g. use Python, Perl)  Hive Partitions  Recursive directory reads  Bucket Joins  Columnar formats - ORC - Parquet First Principles: #5 Schema on Read 22 Best Practices: • Define what you need to • Parse on Demand • Structure to optimize • Beware the data palace fountain & data swamp
  • 21. | 23 Right Schema 23 3NF - Transactional Source System Schema order customer order line product contract sales_person Dimensional Schema customer contract order product order line sales_person Data Warehouse OLTP customer contract order order line product sales_person Hadoop De-normalized schema
  • 22. | 24 Right Workload, Right Tool Workload Hadoop NoSQL MPP, Reporting DBs, Mainframe ETL Business Intelligence Cross business reporting Sub-set analytics Full scan analytics Decision Support TBs-PBs GB-TBs Operational Reports Complex security requirements Search Fast Lookup
  • 23. | 25  Understand strengths & weaknesses of each choice - Get help as needed to make your first effort successful  Deploy the right tool for the right workload  Test and Learn Summary 25 http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
  • 24. | 26 Thank You 26 Douglas Moore @douglas_ma Work with the best on a wide range of cool projects: recruiting@thinkbiganalytics.com
  • 25. Work with the Leading Innovator in Big Data DATA SCIENTISTS DATA ARCHITECTS DATA SOLUTIONS Think Big Start Smart Scale Fast 27
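
The "DO THE MATH" comparison on the Storm slide can be checked with a few lines of Python. This is a minimal sketch: the throughput figures are the ones quoted on the slide, while the helper `servers_needed` and the target load of 10 million events per second are hypothetical, chosen only to illustrate the effect on cluster size.

```python
import math

# Figures quoted on the slide.
JAVA_EVENTS_PER_SEC_PER_SERVER = 1_000_000  # Storm Java: ~1MM+ events/second/server
RUBY_EVENTS_PER_SEC_PER_CORE = 5_000        # bench-tested Ruby ETL framework
CORES_PER_SERVER = 12

ruby_events_per_sec_per_server = RUBY_EVENTS_PER_SEC_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SEC_PER_SERVER / ruby_events_per_sec_per_server


def servers_needed(target_events_per_sec: int, per_server_rate: int) -> int:
    """Whole servers required to sustain a target event rate (hypothetical helper)."""
    return math.ceil(target_events_per_sec / per_server_rate)


if __name__ == "__main__":
    target = 10_000_000  # assumed load, not a figure from the talk
    print(f"Ruby per server: {ruby_events_per_sec_per_server:,} events/s")
    print(f"Server ratio (Ruby vs Java): {server_ratio:.2f}x")
    print("Java servers:", servers_needed(target, JAVA_EVENTS_PER_SEC_PER_SERVER))
    print("Ruby servers:", servers_needed(target, ruby_events_per_sec_per_server))
```

Under these assumptions the ratio comes out at roughly 16.67, matching the slide, and the gap grows in absolute terms as the target load rises.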

Editor's Notes

  1. When I’m not rock climbing, I’m doing big data architecture for Think Big clients, helping them realize value with data analytics. 3 years at Think Big: Big Data Warehouse, Search, Streaming, Big Data Roadmaps, Tech Assessments. Worked on 5 distributions, including the original Apache.
  2. The strengths we bring into this presentation… This is not even half of it.
  3. I wrote the proposal for this spot with just 3 topics in mind, then I began discussing this with my colleagues and the topic generated quite a bit of buzz. It’s amazing how much energy people will put into explaining crazy things they’ve seen. With all of the architects, clients, and projects, how many anti-patterns did I come to Strata with? 343, if you’re doing the math in your head.
  4. Many of our customers are big, successful companies that have been around a long time. During the 90’s and the Oughts, they developed reference architectures based on input from companies like EMC, HP, and IBM. They developed the mindset “SERVERS MUST NOT FAIL”. And this is what you needed for your Oracle OLTP servers to supplant Mainframe DB2 & Tandem. These servers can range up to $35k/node. At one client they were too embarrassed to show me how much; they referenced “Proprietary Information”. I could tell they spent a lot, based on the data node specs: dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network… The best part of this reference architecture is automated provisioning & configuration management. Unfortunately I don’t see that as often as I would like. Let’s go back to first principles of Hadoop & Big Data… [turn] Also seeing Hadoop companies migrate back this way to capture dying or dead data, e.g. the Cloudera – Isilon partnership: not the best performance, but it does turn that archive data from “dead data” into data producing business value.
  5. Let’s talk about big data and Hadoop first principles: everything is about locality. It is best to bring your computation to where your data is. What Hadoop does is shown here in the diagram.
  6. It doesn’t matter whether it is hard disk, SSD, main memory, or cache: sequential IO is always faster. See that actuator arm? On a modern drive it takes a good 4 ms to move from one track to another. That’s a lifetime in terms of computing. SANs, virtual drives, and multi-tenant VM farms all essentially incur random access reads (and writes). Hadoop strives to move that arm as infrequently as possible. The only thing slower than a disk seek is a round trip to the Netherlands. So what Hadoop does: it does IO in large blocks, and writes are append only. Disks on each node are in a ‘Just a Bunch of Disks’ configuration, and Hadoop is the only workload accessing those drives, so it can force sequential access and optimal throughput.
  7. Increase your parallelism, which brings us to principle #0: do not lock, sync, wait, dawdle, dally, hover or loiter. Increase the number of components. To keep in budget, though, you must spend less per component to get more value per dollar, euro or yen. Hadoop helps out in this area: data block replication, and buying more servers for the cost of one. You get more spindles and cores. More spindles mean more IO; more cores mean more compute. Ultimately, you get more throughput.
  8. With more components you need to expect failure, and handle failures in software. “That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment.” Best practice: fix it in the morning. There are more first principles, namely no locking…
  9. Now I want to talk a bit about tooling: tools within and around Hadoop.
  10. A common anti-pattern is: “If it came in the box…” Example: Oozie. Your enterprise scheduler is well coordinated with the rest of your environment. Others include “We should use Pig”. Me: “But all your people are SQL programmers.” Customer: “We should use Pig.” We’re already demoing Hive. Another reason for Hive: Hive leads in terms of optimizations. I like Pig for deeply nested data structures.
  11. A NoSQL anti-pattern develops over the course of time. First, streaming data is loaded into NoSQL to provide some near-real-time content serving. Then someone wants to run analytics over that data. This falls down because of some of the previous first principles, namely locality. You end up having your storage in your Mongo cluster and your compute in the Hadoop cluster, so you move the data across the data center to do some aggregate.
  12. Best practice… split the streams
  13. There’s a place for each of these technologies; we see them as complementary. It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
  14. So, choose the right tool for the right job.
  15. Let’s talk about Hadoop Streaming, not to be confused with stream processing (Samza, Spark Streaming, Storm). The key purpose of Hadoop Streaming is to integrate legacy code and to integrate analytic tools, like Python, R…
  16. Got to love… Got to hate… I often hear this argument.
  17. New code, especially high-volume ETL code: write it in the fastest execution framework. High-value legacy code: use Hadoop Streaming.
  18. Top Hadoop use cases. The #2 use case for Hadoop: quickly combine data from data silos. The #1 data type is structured; #2 is semi-structured (machine logs, web server logs); #3 is multi-structured (text, image, voice, …).
  19. When I structure, I structure right.
  20. Thanks for your time today. We look forward to helping you drive new value from big data. Questions? Next steps?
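
The Hadoop Streaming point in note 15 can be made concrete with a small sketch. Hadoop Streaming runs any executable as a mapper or reducer, feeding it records on standard input and reading tab-separated key/value lines from its standard output. The example below counts log lines by their first whitespace-separated token; that log layout, the function names `map_lines` and `reduce_lines`, and the `map`/`reduce` command-line argument are all assumptions made for illustration, not something taken from the talk.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'key<TAB>1' for the first token of each non-empty line."""
    for line in lines:
        fields = line.split()
        if fields:
            yield f"{fields[0]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per key.

    Input must arrive sorted by key, which the shuffle phase provides
    between the map and reduce steps.
    """
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


# Only touch stdin when a mode is given, e.g. `python script.py map`.
# The mode argument is a convention of this sketch, not of Hadoop.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same script can be exercised locally by piping a log file through the map step, `sort`, and then the reduce step, which mimics the shuffle. On a cluster it would be passed to the streaming jar through its `-mapper` and `-reducer` options.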