| 1 
Big Data Anti-Patterns: 
Lessons from the Front Lines 
Douglas Moore 
Principal Data Architect 
Think Big, a Teradata Company
| 2 
About Douglas Moore 
 Think Big – 3 Years 
- Roadmaps 
- Delivery 
• BDW, Search, Streaming 
- Tech Assessments 
 Before Big Data 
- Data Warehousing 
- OLTP 
- Systems Architecture 
- Electricity 
- High End Graphics 
- Supercomputers 
- Numerical Analysis 
Contact me at: 
@douglas_ma
| 3 
Think Big 
 4yr Old “Big Data” Professional Services Firm 
- Roadmaps 
- Engineering 
- Data Science 
- Hands on Training 
Recently acquired by Teradata 
• Maintaining Independence
| 4 
Content Drawn From Vast Amounts of Experience 
… 
50+ Clients 
Leading 
security 
software 
vendor 
Leading 
Discount 
Retailer
| 5 
 I started out with just 3 topics… 
 Then while on the road to Strata, 
 I met 7 big data architects 
- Who had 7 clients 
• Who had 7 projects 
• That demonstrated 7 Anti-Patterns 
Introduction 
Big Data Anti-pattern: 
“Commonly applied but bad solution” 
I95 Wikipedia
| 6 
Three Focus Areas 
• Hardware and Infrastructure 
• Tooling 
• Big Data Warehousing 
| 7 
Hardware & Infrastructure 
 Reference Architecture Driven 
- 90’s & 00’s data center patterns 
- Servers MUST NOT FAIL 
- Standard Server Config 
• $35,000/node 
• Dual Power supply 
• RAID 
• SAS 15K RPM 
• SAN 
• VMs for Production 
• Flat Network 
[Image source: HP: The transformation 
to HP Converged Infrastructure] 
Automated provisioning is a good thing!
| 8 
 Locality Locality Locality 
- Bring Computation to Data 
#1 Locality 
 Co-locate data and compute 
 Locally Attached Storage 
 Localize & isolate network traffic 
 Rack Awareness 
VM Cluster Hadoop Cluster
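Rack awareness is usually configured by giving Hadoop a topology script: the script is invoked with one or more host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch follows. The host-to-rack table and the rack names are made up for illustration, and the script is assumed to be wired in through the `net.topology.script.file.name` property; a real cluster would load its own mapping.

```python
#!/usr/bin/env python3
"""Illustrative rack topology script (hypothetical hosts and racks)."""
import sys

# Assumption: a hand-maintained table; the addresses and racks are invented.
RACK_BY_HOST = {
    "10.0.1.11": "/dc1/rack1",
    "10.0.1.12": "/dc1/rack1",
    "10.0.2.21": "/dc1/rack2",
}
DEFAULT_RACK = "/default-rack"


def resolve_racks(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACK_BY_HOST.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # One rack path per argument, separated by whitespace.
    print(" ".join(resolve_racks(sys.argv[1:])))
```

Unknown hosts fall back to a default rack, so a missing table entry degrades placement quality rather than breaking the lookup.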
| 9 
#2 Sequential IO 
 Sequential IO >> Random Access 
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html 
 Large block IO 
 Append only writes 
 JBOD 
Image credit: Wikipedia.org
| 10 
 Increase # parallel components 
- Reduce component cost 
 Data block replication 
- Performance 
- Availability 
 Commodity++ (2014) 
- High density data nodes 
- $8-12,000 
- ~12 drives 
- ~12-16 cores 
- Buy more servers for the cost of 
one 
• 4-5x spindles 
• 4-5x cores 
#3 Increase Parallelism 
Hadoop Cluster
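The "buy more servers for the cost of one" point can be checked with the prices quoted on the slides. The sketch below uses only those figures ($35,000 per reference-architecture node, $8-12,000 per commodity node, ~12 drives per commodity node); the function name is hypothetical.

```python
# Prices and drive count are taken from the slides; the rest is arithmetic.
LEGACY_NODE_COST = 35_000               # "$35,000/node"
COMMODITY_NODE_COST = (8_000, 12_000)   # "$8-12,000"
DRIVES_PER_COMMODITY_NODE = 12          # "~12 drives"


def nodes_for_budget(budget, node_cost):
    """Number of nodes (possibly fractional) a budget buys at a given price."""
    return budget / node_cost


low_price, high_price = COMMODITY_NODE_COST
most_nodes = nodes_for_budget(LEGACY_NODE_COST, low_price)
fewest_nodes = nodes_for_budget(LEGACY_NODE_COST, high_price)

print(f"Commodity nodes per legacy node: {fewest_nodes:.1f} to {most_nodes:.1f}")
print(
    "Drives for the same spend: "
    f"{fewest_nodes * DRIVES_PER_COMMODITY_NODE:.0f} to "
    f"{most_nodes * DRIVES_PER_COMMODITY_NODE:.0f}"
)
```

The resulting spindle and core multiple also depends on how many drives and cores the legacy server carried, which the slides do not state.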
| 11 
 Expect Failure 
 Rack Awareness 
Hadoop Cluster 
 Data Block Replication 
 Task Retry 
 Node Black Listing 
 Monitor Everything 
 Name Node HA 
#4 Failure 
11
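Task retry and node blacklisting can be illustrated with a toy scheduler. This is a sketch of the idea only, not Hadoop's implementation: the `Scheduler` class, its thresholds, and the `flaky` task are all hypothetical.

```python
from collections import Counter


class Scheduler:
    """Toy scheduler: retry a task on other nodes, blacklist repeat offenders."""

    def __init__(self, nodes, max_attempts=3, blacklist_after=2):
        self.nodes = list(nodes)
        self.max_attempts = max_attempts
        self.blacklist_after = blacklist_after
        self.failures = Counter()
        self.blacklist = set()

    def run(self, task):
        """Run task(node); each retry goes to a different non-blacklisted node."""
        candidates = [n for n in self.nodes if n not in self.blacklist]
        for node in candidates[: self.max_attempts]:
            try:
                return task(node)
            except RuntimeError:
                self.failures[node] += 1
                if self.failures[node] >= self.blacklist_after:
                    self.blacklist.add(node)
        raise RuntimeError("task failed on all attempted nodes")


def flaky(node):
    """Hypothetical task that always fails on one bad node."""
    if node == "node-1":
        raise RuntimeError("disk error")
    return f"ok on {node}"


scheduler = Scheduler(["node-1", "node-2", "node-3"])
results = [scheduler.run(flaky) for _ in range(3)]
print(results, scheduler.blacklist)
```

After two failures the bad node is skipped entirely, so later tasks no longer pay for a doomed first attempt.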
| 12 
 Hadoop Ecosystem Tools 
Tooling 
http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG
| 13 
Tooling: Just looking inside the box 
 “If it came in the box then I should use it” 
 Example 
- Oozie for scheduling 
Best Practice: 
• Use your current enterprise scheduler
| 14 
Tooling: NoSQL 
• “Now I have all of my log data 
in NoSQL, let’s do analytics 
over it” 
 Example 
- Streaming data into Mongo DB 
• Running aggregates 
• Running MR jobs
| 15 
Best Practice 
Best Practice: 
• Split the stream 
• Real-time access in NoSQL 
• Batch analytics in Hadoop
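Splitting the stream amounts to delivering each event to two sinks with different jobs. A minimal sketch, using in-memory stand-ins for both sides: `RealTimeStore` and `BatchLog` are hypothetical classes, not client APIs of any particular NoSQL store or of Hadoop, and the sample events are invented.

```python
import json


class RealTimeStore:
    """Stand-in for a NoSQL store: keeps the latest event per host."""

    def __init__(self):
        self.latest = {}

    def write(self, event):
        self.latest[event["host"]] = event


class BatchLog:
    """Stand-in for append-only files that are later loaded into Hadoop."""

    def __init__(self):
        self.lines = []

    def write(self, event):
        self.lines.append(json.dumps(event, sort_keys=True))


def split_stream(events, sinks):
    """Deliver every event to every sink; return the number of events seen."""
    count = 0
    for event in events:
        for sink in sinks:
            sink.write(event)
        count += 1
    return count


events = [
    {"host": "web01", "status": 200},
    {"host": "web02", "status": 500},
    {"host": "web01", "status": 404},
]
store, log = RealTimeStore(), BatchLog()
seen = split_stream(events, [store, log])
print(seen, len(store.latest), len(log.lines))
```

The real-time side answers "what is the current state of web01", while the batch side keeps every event for full-scan analytics.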
| 18 
 Hadoop Streaming 
- Integrate legacy code 
- Integrate analytic tools 
• Data science libs 
Right Framework, Right Need… 
 Hadoop integrates any 
type of application tooling 
- Java 
- Python 
- R 
- C, C++ 
- Fortran 
- Cobol 
- Ruby 
18
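Hadoop Streaming runs any executable that reads records from standard input and writes tab-separated key/value lines to standard output, which is what makes the language list above possible. A minimal word-count sketch in Python follows; the script name and the `map`/`reduce` argument convention are assumptions made for this example.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: run as `wc.py map` or `wc.py reduce`."""
import sys
from itertools import groupby


def map_lines(lines):
    """Emit 'word<TAB>1' for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_lines(sorted_lines):
    """Sum counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = map_lines if sys.argv[1] == "map" else reduce_lines
    for out in step(sys.stdin):
        print(out)
```

The same script can be tested locally with a shell pipeline (map, sort, reduce) before it is submitted to the cluster through the streaming jar's `-mapper` and `-reducer` options.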
| 19 
 Got to love Ruby 
- Very Cool (or it was) 
- Dynamic Language 
- Expressive 
- Compact 
- Fast Iteration 
 Got to Hate Ruby 
- Slow 
- Hard to follow & debug 
- Does not play well with 
threading 
Right Use Case – ETL, Wrong Framework 
“It’s much faster to develop in, 
developer time is valuable, 
just throw a couple more boxes at it” 
Bench tested Ruby ETL framework 
at 5,000 records / second
| 20 
Right Use Case – ETL, Wrong Framework… 
DO THE MATH: 
Storm Java: ~ 1MM+ events / second / server 
Storm Ruby: 5000 * 12 cores = 60,000 events / second / server 
= 16.67 times more servers 
bit.ly/1t0HXJH 
Best Practice: 
• Write new code in fastest execution framework 
• High value legacy code, analytic tools use Hadoop Streaming 
• Innovation is Important: Test and Learn
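The "do the math" comparison can be reproduced directly from the slide's figures. The sketch assumes, as the slide does, that the Ruby rate scales linearly across 12 cores; the 3MM events/second target is a made-up workload used only to show the server counts.

```python
import math

JAVA_EVENTS_PER_SERVER = 1_000_000   # "~1MM+ events / second / server"
RUBY_EVENTS_PER_CORE = 5_000         # bench-tested Ruby ETL rate
CORES_PER_SERVER = 12

ruby_events_per_server = RUBY_EVENTS_PER_CORE * CORES_PER_SERVER
server_ratio = JAVA_EVENTS_PER_SERVER / ruby_events_per_server


def servers_needed(target_events_per_second, events_per_server):
    """Whole servers required to sustain a target event rate."""
    return math.ceil(target_events_per_second / events_per_server)


target = 3_000_000  # hypothetical workload, not from the slides
print(f"Ruby events/second/server: {ruby_events_per_server:,}")
print(f"Server ratio: {server_ratio:.2f}x")
print(f"Java servers: {servers_needed(target, JAVA_EVENTS_PER_SERVER)}")
print(f"Ruby servers: {servers_needed(target, ruby_events_per_server)}")
```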
| 21 
Big Data Warehousing 
 Hadoop Use Cases 
1. ETL Offload 
2. Data Warehousing 
 Hadoop Data Types 
1. Structured 
2. Semi-structured 
3. Multi or Unstructured
| 22 
Don’t over curate: 
 “We are going to 
- Define and parse 1,000 
attributes from the machine log 
files on ETL servers, 
- load just what we need to, 
- this will take 6 months” 
 HCatalog 
 Navigator, Loom,… 
 UDFs, UDTFs 
- JSON, Regex built in 
- Custom Java 
- Hadoop Streaming (e.g. use 
Python, Perl) 
 Hive Partitions 
 Recursive directory reads 
 Bucket Joins 
 Columnar formats 
- ORC 
- Parquet 
First Principles: #5 Schema on Read 
Best Practices: 
• Define what you need to 
• Parse on Demand 
• Structure to optimize 
• Beware the data palace 
fountain & data swamp
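Schema on read means storing the raw records and parsing out only the attributes a given query needs, when it needs them. A minimal sketch of "parse on demand" follows; the log format, field names, and sample lines are invented for illustration.

```python
import re

# Assumption: raw lines are kept as-is; these samples are made up.
RAW_LOG = [
    "2014-10-20T12:00:01 host=web01 status=200 bytes=512 user=alice",
    "2014-10-20T12:00:02 host=web02 status=500 bytes=0 user=bob",
    "2014-10-20T12:00:03 host=web01 status=200 bytes=2048 user=carol",
]

FIELD = re.compile(r"(\w+)=(\S+)")


def read_fields(lines, wanted):
    """Parse at read time, keeping only the requested attributes."""
    for line in lines:
        pairs = dict(FIELD.findall(line))
        yield {name: pairs.get(name) for name in wanted}


# This query needs two attributes, so only two are extracted.
bytes_by_host = {}
for row in read_fields(RAW_LOG, ["host", "bytes"]):
    bytes_by_host[row["host"]] = bytes_by_host.get(row["host"], 0) + int(row["bytes"])

print(bytes_by_host)
```

A later question about `user` or `status` needs a new `wanted` list, not a reload of the data or a six-month parsing project.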
| 23 
Right Schema 
OLTP: 3NF - Transactional Source System Schema 
- order, customer, order line, product, contract, sales_person 
Data Warehouse: Dimensional Schema 
- customer, contract, order, product, order line, sales_person 
Hadoop: De-normalized schema 
- customer, contract, order, order line, product, sales_person 
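De-normalizing for Hadoop means joining the source entities once, at load time, into wide records that can be scanned without further joins. A minimal sketch using in-memory tables; the keys, column names, and sample rows are hypothetical.

```python
# Hypothetical normalized source tables, keyed by id.
customers = {1: {"customer": "Acme"}}
products = {10: {"product": "Widget"}, 11: {"product": "Gadget"}}
orders = {100: {"customer_id": 1, "sales_person": "Lee"}}
order_lines = [
    {"order_id": 100, "product_id": 10, "qty": 2},
    {"order_id": 100, "product_id": 11, "qty": 1},
]


def denormalize(order_lines, orders, customers, products):
    """Yield one wide record per order line with all lookups resolved."""
    for line in order_lines:
        order = orders[line["order_id"]]
        yield {
            "order_id": line["order_id"],
            "customer": customers[order["customer_id"]]["customer"],
            "sales_person": order["sales_person"],
            "product": products[line["product_id"]]["product"],
            "qty": line["qty"],
        }


flat = list(denormalize(order_lines, orders, customers, products))
for record in flat:
    print(record)
```

The customer and sales person repeat on every line, trading storage for scan-friendly records.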
| 24 
Right Workload, Right Tool 
Columns: Workload; Hadoop; NoSQL; MPP, Reporting DBs, Mainframe 
ETL 
Business Intelligence 
Cross business reporting 
Sub-set analytics 
Full scan analytics 
Decision Support TBs-PBs GB-TBs 
Operational Reports 
Complex security requirements 
Search 
Fast Lookup
| 25 
 Understand strengths & weaknesses of each choice 
- Get help as needed to make your first effort successful 
 Deploy the right tool for the right workload 
 Test and Learn 
Summary 
http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
| 26 
Thank You 
Douglas Moore 
@douglas_ma 
Work with the best on a wide range of cool projects: 
recruiting@thinkbiganalytics.com
Work with the 
Leading Innovator in Big Data 
DATA SCIENTISTS 
DATA ARCHITECTS 
DATA SOLUTIONS 
Think Big Start Smart Scale Fast 
27

Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 

Teradata Partners Conference Oct 2014 Big Data Anti-Patterns

  • 1. | 1 Big Data Anti-Patterns: Lessons from the Front Lines Douglas Moore Principal Data Architect Think Big, a Teradata Company
  • 2. | 2 About Douglas Moore. Think Big – 3 Years: Roadmaps; Delivery (BDW, Search, Streaming); Tech Assessments. Before Big Data: Data Warehousing, OLTP, Systems Architecture, Electricity, High End Graphics, Supercomputers, Numerical Analysis. Contact me at: @douglas_ma
  • 3. | 3 Think Big. 4yr Old “Big Data” Professional Services Firm: Roadmaps, Engineering, Data Science, Hands on Training. Recently acquired by Teradata (Maintaining Independence).
  • 4. | 4 Content Drawn From Vast Amounts of Experience: … 50+ Clients; Leading security software vendor; Leading Discount Retailer.
  • 5. | 5 Introduction. I started out with just 3 topics… Then while on the road to Strata, I met 7 big data architects, who had 7 clients, who had 7 projects, that demonstrated 7 Anti-Patterns. Big Data Anti-pattern: “Commonly applied but bad solution” I95 Wikipedia
  • 6. | 6 Three Focus Areas: Hardware and Infrastructure; Tooling; Big Data Warehousing.
  • 7. | 7 Hardware & Infrastructure. Reference Architecture Driven: 90’s & 00’s data center patterns; Servers MUST NOT FAIL; Standard Server Config ($35,000/node, Dual Power supply, RAID, SAS 15K RPM, SAN, VMs for Production, Flat Network). Automated provisioning is a good thing! [Image source: HP: The transformation to HP Converged Infrastructure]
  • 8. | 8 #1 Locality. Locality Locality Locality: Bring Computation to Data. Co-locate data and compute; Locally Attached Storage; Localize & isolate network traffic; Rack Awareness. [Diagram: VM Cluster, Hadoop Cluster]
  • 9. | 9 #2 Sequential IO. Sequential IO >> Random Access. Large block IO; Append only writes; JBOD. http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html Image credit: Wikipedia.org
  • 10. | 10 #3 Increase Parallelism. Increase # parallel components: Reduce component cost. Data block replication: Performance, Availability. Commodity++ (2014): High density data nodes; $8-12,000; ~12 drives; ~12-16 cores; Buy more servers for the cost of one (4-5x spindles, 4-5x cores). [Diagram: Hadoop Cluster]
  • 11. | 11 #4 Failure. Expect Failure [1,2]: Rack Awareness; Data Block Replication; Task Retry; Node Black Listing; Monitor Everything; Name Node HA. [Diagram: Hadoop Cluster]
  • 12. | 12 Tooling. Hadoop Ecosystem Tools. http://en.wikipedia.org/wiki/File:Bicycle_multi-tool.JPG
  • 13. | 13 Tooling: Just looking inside the box. “If it came in the box then I should use it”. Example: Oozie for scheduling. Best Practice: Use your current enterprise scheduler.
  • 14. | 14 Tooling: NoSQL. “Now I have all of my log data in NoSQL, let’s do analytics over it”. Example: Streaming data into MongoDB (Running aggregates, Running MR jobs).
  • 15. | 15 Best Practice: Split the stream; Real-time access in NoSQL; Batch analytics in Hadoop.
  • 16. | 18 Right Framework, Right Need… Hadoop Streaming: Integrate legacy code; Integrate analytic tools (Data science libs). Hadoop integrates any type of application tooling: Java, Python, R, C, C++, Fortran, Cobol, Ruby.
  • 17. | 19 Right Use Case – ETL, Wrong Framework. Got to love Ruby: Very Cool (or it was); Dynamic Language; Expressive; Compact; Fast Iteration. Got to Hate Ruby: Slow; Hard to follow & debug; Does not play well with threading. “It’s much faster to develop in, developer time is valuable, just throw a couple more boxes at it”. Bench tested Ruby ETL framework at 5,000 records / second.
  • 18. | 20 Right Use Case – ETL, Wrong Framework… DO THE MATH: Storm Java: ~1MM+ events / second / server. Storm Ruby: 5,000 * 12 cores = 60,000 events / second / server = 16.67 times more servers. bit.ly/1t0HXJH Best Practice: Write new code in fastest execution framework; For high value legacy code and analytic tools, use Hadoop Streaming; Innovation is Important: Test and Learn.
  • 19. | 21 Big Data Warehousing. Hadoop Use Cases: 1. ETL Offload; 2. Data Warehousing. Hadoop Data Types: 1. Structured; 2. Semi-structured; 3. Multi or Unstructured.
  • 20. | 22 First Principles: #5 Schema on Read. Don’t over curate: “We are going to define and parse 1,000 attributes from the machine log files on ETL servers, load just what we need to, this will take 6 months”. HCatalog; Navigator, Loom,…; UDFs, UDTFs (JSON, Regex built in; Custom Java; Hadoop Streaming, e.g. use Python, Perl); Hive Partitions; Recursive directory reads; Bucket Joins; Columnar formats (ORC, Parquet). Best Practices: Define what you need to; Parse on Demand; Structure to optimize; Beware the data palace fountain & data swamp.
  • 21. | 23 Right Schema. OLTP: 3NF transactional source system schema (order, customer, order line, product, contract, sales_person). Data Warehouse: dimensional schema (customer, contract, order, product, order line, sales_person). Hadoop: de-normalized schema (customer, contract, order, order line, product, sales_person).
  • 22. | 24 Right Workload, Right Tool. [Table. Columns: Workload; Hadoop; NoSQL; MPP, Reporting DBs, Mainframe. Rows: ETL; Business Intelligence; Cross business reporting; Sub-set analytics; Full scan analytics; Decision Support (TBs-PBs, GB-TBs); Operational Reports; Complex security requirements; Search; Fast Lookup.]
  • 23. | 25 Summary. Understand strengths & weaknesses of each choice: Get help as needed to make your first effort successful. Deploy the right tool for the right workload. Test and Learn. http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
  • 24. | 26 Thank You. Douglas Moore, @douglas_ma. Work with the best on a wide range of cool projects: recruiting@thinkbiganalytics.com
  • 25. Work with the Leading Innovator in Big Data: DATA SCIENTISTS, DATA ARCHITECTS, DATA SOLUTIONS. Think Big. Start Smart. Scale Fast.
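
The “DO THE MATH” comparison in the transcript can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted on the slide (about 1MM events/second/server for Storm Java, 5,000 records/second for the Ruby framework, 12 cores per server). The function names and the 3,000,000 events/second target are hypothetical, chosen only for illustration.

```python
import math

# Figures quoted on the slide; treat them as rough bench numbers.
STORM_JAVA_EVENTS_PER_SERVER = 1_000_000  # ~1MM+ events / second / server
RUBY_RECORDS_PER_CORE = 5_000             # bench-tested Ruby ETL framework
CORES_PER_SERVER = 12                     # core count used on the slide


def ruby_events_per_server(per_core=RUBY_RECORDS_PER_CORE, cores=CORES_PER_SERVER):
    """Per-server throughput if every core runs one Ruby worker."""
    return per_core * cores


def server_ratio(fast_rate, slow_rate):
    """How many slow servers it takes to match one fast server."""
    return fast_rate / slow_rate


def servers_needed(target_events_per_second, events_per_server):
    """Whole servers required to sustain a target event rate."""
    return math.ceil(target_events_per_second / events_per_server)


if __name__ == "__main__":
    ruby_rate = ruby_events_per_server()
    print(ruby_rate)  # 60000
    print(round(server_ratio(STORM_JAVA_EVENTS_PER_SERVER, ruby_rate), 2))  # 16.67

    # Hypothetical workload, not from the deck.
    target = 3_000_000
    print(servers_needed(target, STORM_JAVA_EVENTS_PER_SERVER))
    print(servers_needed(target, ruby_rate))
```

The ratio is what matters for budgeting: “just throw a couple more boxes at it” turns into roughly seventeen boxes for every one.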

Editor's Notes

  1. When I’m not rock climbing, I’m doing big data architecture for Think Big clients, helping them realize value with data analytics. 3 years at Think Big: Big Data Warehouse, Search, Streaming, Big Data Roadmaps, Tech Assessments. Worked on 5 distributions, including the original Apache.
  2. The strengths we bring into this presentation…. This is not even half of it
  3. I wrote the proposal for this spot with just 3 topics in mind, then I began discussing this with my colleagues and the topic generated quite a bit of buzz. It’s amazing how much energy people will put into explaining crazy things they’ve seen. With all of the architects, clients, projects how many anti-patterns did I come to Strata with? 343 if you’re doing the math in your head.
  4. Many of our customers are big, successful companies that have been around a long time. During the 90’s and the Oughts, they developed reference architectures based on input from companies like EMC, HP, and IBM. They developed the mindset “SERVERS MUST NOT FAIL”. And this is what you needed for your Oracle OLTP servers to supplant Mainframe DB2 & Tandem. These servers can range up to $35k/node. At one client they were too embarrassed to show me how much; they cited “Proprietary Information”. I could tell they spent a lot, based on the data node specs: dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network… The best part of this reference architecture is automated provisioning & configuration management. Unfortunately I don’t see that as often as I would like. Let’s go back to first principles of Hadoop & Big Data…. [turn] Also seeing Hadoop companies migrate back this way to capture dying or dead data, e.g. the Cloudera – Isilon partnership: not the best performance, but it does turn that archive data from “dead data” into data producing business value.
  5. Let’s talk about big data and Hadoop first principles. Everything is about locality: it is best to bring your computation to where your data is. What Hadoop does is shown here in the diagram.
  6. It doesn’t matter whether it is hard disk, SSD, main memory, or cache: sequential IO is always faster. See that actuator arm? On a modern drive it takes a good 4ms to move from one track to another. That’s a lifetime in terms of computing. SANs, virtual drives, and multi-tenant VM farms all essentially incur random access reads (and writes). Hadoop strives to move that arm as infrequently as possible. The only thing slower than a disk seek is a round trip to the Netherlands. So what Hadoop does: it does IO in large blocks, and writes are append only. Disks on each node are in a ‘Just a Bunch of Disks’ (JBOD) configuration, and Hadoop is the only workload accessing those drives, so it can force sequential access and optimal throughput.
  7. Increase your parallelism, which brings us to principle #0: do not lock, sync, wait, dawdle, dally, hover or loiter. Increase the number of components. To keep in budget, though, you must spend less per component to get more value per dollar, euro or yen. Hadoop helps out in this area with data block replication. Buy more servers for the cost of one and you get more spindles and cores. More spindles mean more IO; more cores mean more compute. Ultimately you get more throughput.
  8. With more components you need to expect failures and handle them in software. “That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment.” Best practice: fix it in the morning. There are more first principles, namely no locking….
  9. Now I want to talk a bit about tooling: the tools within and around Hadoop.
  10. A common anti-pattern is: “If it came in the box…” Example: Oozie. Your enterprise scheduler is well coordinated with the rest of your environment. Others include “We should use Pig.” Me: “But all your people are SQL programmers.” Customer: “We should use Pig.” We’re already demoing Hive. Another reason for Hive: Hive leads in terms of optimizations. I like Pig for deeply nested data structures.
  11. A NoSQL anti-pattern develops over the course of time. First, streaming data is loaded into NoSQL to provide some near-real-time content serving. Then someone wants to run analytics over it. This falls down because of some of the previous first principles, namely locality: you end up having your storage in your Mongo cluster and your compute in the Hadoop cluster, so you move the data across the data center to do some aggregate.
  12. Best practice… split the streams
  13. There’s a place for each of these technologies; we see them as complementary. It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
  14. So, choose the right tool for the right job
  15. Let’s talk about Hadoop Streaming, not to be confused with stream processing (Samza, Spark Streaming, Storm). The key purpose of Hadoop Streaming is to integrate legacy code and analytic tools, like Python, R…
  16. Got to love, got to hate. I often hear this argument.
  17. New code, especially high-volume ETL code: write it in the fastest execution framework. High-value legacy code: use Hadoop Streaming.
  18. Top Hadoop use cases. The #2 use case for Hadoop: quickly combine data from siloed systems. The #1 data type is structured; #2 is semi-structured (machine logs, web server logs); #3 is multi-structured (text, image, voice, …).
  19. When I structure, I structure right.
  20. Thanks for your time today. We look forward to helping you drive new value from big data. Questions? Next steps?
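
The seek-cost argument in the note on sequential IO can be made concrete with a back-of-the-envelope model. This is a rough sketch, not a benchmark: the 4 ms seek comes from the note, while the 100 MiB/s transfer rate, the 1 GiB file, and the 4 KiB and 128 MiB chunk sizes are assumptions picked for illustration. The model charges one seek per chunk and ignores caching, rotational latency and queueing.

```python
SEEK_SECONDS = 0.004  # "a good 4ms" per head movement, from the note

KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

# Assumed sustained transfer rate of one spindle (hypothetical value).
TRANSFER_BYTES_PER_SECOND = 100 * MIB


def read_seconds(total_bytes, chunk_bytes,
                 transfer_rate=TRANSFER_BYTES_PER_SECOND,
                 seek_seconds=SEEK_SECONDS):
    """Estimated time to read total_bytes when every chunk costs one seek."""
    chunks = -(-total_bytes // chunk_bytes)  # ceiling division
    return chunks * seek_seconds + total_bytes / transfer_rate


if __name__ == "__main__":
    random_style = read_seconds(GIB, 4 * KIB)    # small scattered reads
    block_style = read_seconds(GIB, 128 * MIB)   # large sequential blocks
    print(f"4 KiB chunks:   {random_style:.1f} s")
    print(f"128 MiB blocks: {block_style:.1f} s")
    print(f"slowdown:       {random_style / block_style:.0f}x")
```

Under these assumptions nearly all of the small-chunk time is spent moving the arm, which is why large blocks, append-only writes and dedicated JBOD disks matter.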