Douglas Moore discusses common anti-patterns seen when implementing big data solutions, based on lessons learned from working with over 50 clients. He covers anti-patterns in hardware and infrastructure (relying on outdated reference architectures), in tooling (trying to do analytics directly in NoSQL databases), and in big data warehousing (over-curating data during ETL). The key is to understand the strengths and weaknesses of different tools and deploy the right solution for the intended workload.
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
1. | 1
Big Data Anti-Patterns:
Lessons from the Front Lines
Douglas Moore
Principal Data Architect
Think Big, a Teradata Company
2. | 2
About Douglas Moore
Think Big – 3 Years
- Roadmaps
- Delivery
• BDW, Search, Streaming
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at:
@douglas_ma
3. | 3
Think Big
4yr Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
4. | 4
Content Drawn From Vast Amounts of Experience
50+ Clients, including:
- Leading security software vendor
- Leading discount retailer
- …
5. | 5
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Introduction
Big Data Anti-pattern:
“Commonly applied but bad solution”
[Image credit: I-95, Wikipedia]
6. | 6
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
7. | 7
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
[Image source: HP: The transformation
to HP Converged Infrastructure]
Automated provisioning is a good thing!
8. | 8
Locality Locality Locality
- Bring Computation to Data
#1 Locality
Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
VM Cluster Hadoop Cluster
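A minimal Python sketch of the rack-aware, data-local placement idea above, loosely following the default HDFS policy (first replica on the writing node, second on a node in a different rack, third on another node in that second rack); the node names and rack map here are made up:

```python
import random

# Hypothetical cluster topology: node -> rack
RACKS = {
    "node1": "rack-A", "node2": "rack-A",
    "node3": "rack-B", "node4": "rack-B",
}

def place_replicas(writer_node, racks=RACKS):
    """Pick 3 replica nodes, HDFS-style: 1st on the writer's node,
    2nd on a node in a different rack, 3rd on another node in that
    same remote rack -- locality plus rack-level fault tolerance."""
    first = writer_node
    remote = [n for n, r in racks.items() if r != racks[first]]
    second = random.choice(remote)
    third = random.choice([n for n, r in racks.items()
                           if r == racks[second] and n != second])
    return [first, second, third]

print(place_replicas("node1"))   # e.g. ['node1', 'node4', 'node3']
```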
9. | 9
#2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Large block IO
Append only writes
JBOD
Image credit: Wikipedia.org
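A rough micro-benchmark sketch of the sequential-vs-random gap, in Python; the file size, block size, and path are arbitrary, and the OS page cache can mask the difference on a warm cache (on a cold spinning disk the gap is dramatic):

```python
import os, random, time

PATH = "io_test.bin"
SIZE = 64 * 1024 * 1024          # 64 MB test file (adjust to taste)
BLOCK = 128 * 1024               # 128 KB per read

with open(PATH, "wb") as f:      # create the test file once
    f.write(os.urandom(SIZE))

def timed_read(offsets):
    """Read BLOCK bytes at each offset and return elapsed seconds."""
    start = time.time()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.time() - start

offsets = list(range(0, SIZE, BLOCK))   # in-order offsets = sequential IO
shuffled = offsets[:]
random.shuffle(shuffled)                # shuffled offsets = random access

print("sequential read:", round(timed_read(offsets), 3), "s")
print("random read:    ", round(timed_read(shuffled), 3), "s")
os.remove(PATH)
```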
10. | 10
Increase # parallel components
- Reduce component cost
Data block replication
- Performance
- Availability
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one
• 4-5x spindles
• 4-5x cores
#3 Increase Parallelism
Hadoop Cluster
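A back-of-the-envelope check of this slide's claim using its own price figures; the enterprise node's drive and core counts are not given on the slides, so those two numbers are assumptions:

```python
# Slide figures: $35,000/node reference-architecture spec vs. $8-12,000 commodity
# nodes with ~12 drives and ~12-16 cores. The enterprise node's own drive and
# core counts are NOT on the slides -- the 8 and 8 below are assumptions.
enterprise_cost, ent_drives, ent_cores = 35000, 8, 8
commodity_cost, com_drives, com_cores = 12000, 12, 12

n = enterprise_cost / float(commodity_cost)     # commodity nodes per enterprise node
print("nodes for the cost of one: %.1f" % n)                     # ~2.9
print("spindle ratio: %.1fx" % (n * com_drives / ent_drives))    # ~4.4x
print("core ratio:    %.1fx" % (n * com_cores / ent_cores))      # ~4.4x
```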
11. | 11
Expect Failure
Rack Awareness
Hadoop Cluster
Data Block Replication
Task Retry
Node Black Listing
Monitor Everything
Name Node HA
#4 Failure
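The common thread of these mechanisms is "expect failure, handle it in software." A generic retry-with-backoff sketch in that spirit (this is not Hadoop's actual task-retry code; the names are illustrative):

```python
import time

def with_retries(task, attempts=4, backoff_s=1.0):
    """Run `task` (a zero-argument callable), retrying on failure with
    exponential backoff -- the same 'expect failure, retry the attempt'
    idea Hadoop applies to failed map/reduce tasks."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise                              # give up, surface the failure
            time.sleep(backoff_s * 2 ** (attempt - 1))

# usage: with_retries(lambda: flaky_remote_call())   # flaky_remote_call is hypothetical
```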
13. | 13
Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
14. | 14
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into MongoDB
• Running aggregates
• Running MR jobs
15. | 15
Best Practice
Best Practice:
• Split the stream
• Real-time access in NoSQL
• Batch analytics in Hadoop
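A hedged sketch of the split-stream idea: every event goes both to a NoSQL store for real-time serving and to an append-only spool destined for Hadoop batch loads. It assumes pymongo and a local MongoDB; the database, collection, and file names are hypothetical:

```python
import json
from pymongo import MongoClient

mongo = MongoClient("localhost", 27017)
realtime = mongo["events_db"]["recent_events"]   # serves low-latency lookups

def handle_event(event):
    # 1) real-time path: NoSQL for fast, key-based access
    realtime.insert_one(dict(event))             # copy so the raw event stays clean
    # 2) batch path: append raw JSON to a spool file that a periodic job
    #    ships into HDFS/Hive for full-scan analytics
    with open("events_spool.jsonl", "a") as spool:
        spool.write(json.dumps(event) + "\n")

handle_event({"user": "u123", "action": "login", "ts": "2014-10-20T12:00:00Z"})
```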
16. | 18
Hadoop Streaming
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Right Framework, Right Need…
Hadoop integrates any type of application tooling
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
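Hadoop Streaming makes this language list possible by piping each input split through any executable's stdin/stdout. A minimal word-count mapper/reducer pair in Python (file names are illustrative):

```python
#!/usr/bin/env python
# mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" pairs on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts can be summed per word
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

Run with something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input logs/ -output counts/ (the exact jar path varies by distribution).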
17. | 19
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
Right Use Case – ETL, Wrong Framework
“It’s much faster to develop in,
developer time is valuable,
just throw a couple more boxes at it”
Bench tested Ruby ETL framework
at 5,000 records / second
18. | 20
Right Use Case – ETL, Wrong Framework…
DO THE MATH:
Storm Java: ~ 1MM+ events / second / server
Storm Ruby: 5000 * 12 cores = 60,000 events / second / server
= 16.67 times more servers
bit.ly/1t0HXJH
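The same arithmetic as a quick sanity check, using the slide's own figures:

```python
java_rate = 1000000      # ~1MM+ events / second / server (Storm + Java)
ruby_rate = 5000 * 12    # 5,000 events/sec/core * 12 cores = 60,000
print(java_rate / float(ruby_rate))   # ~16.67x more servers for the same throughput
```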
Best Practice:
• Write new code in fastest execution framework
• High value legacy code, analytic tools use Hadoop Streaming
• Innovation is Important: Test and Learn
19. | 21
Big Data Warehousing
Hadoop Use Cases
1. ETL Offload
2. Data Warehousing
Hadoop Data Types
1. Structured
2. Semi-structured
3. Multi or Unstructured
20. | 22
Don’t over curate:
“We are going to
- Define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”
HCatalog
Navigator, Loom,…
UDFs, UDTFs
- JSON, Regex built in
- Custom Java
- Hadoop Streaming (e.g. use Python, Perl)
Hive Partitions
Recursive directory reads
Bucket Joins
Columnar formats
- ORC
- Parquet
First Principles: #5 Schema on Read
Best Practices:
• Define what you need to
• Parse on Demand
• Structure to optimize
• Beware the data palace fountain & data swamp
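A small sketch of "parse on demand" schema on read: a Hadoop Streaming-style Python mapper that pulls only the handful of fields a given job needs out of raw JSON log lines, leaving the rest unparsed (the field names are hypothetical):

```python
#!/usr/bin/env python
# extract_fields.py -- schema on read: the raw log line is stored as-is in HDFS;
# structure is applied only when a job actually needs these fields.
import json
import sys

WANTED = ("timestamp", "host", "status")   # parse just what this job needs

for raw in sys.stdin:
    try:
        record = json.loads(raw)
    except ValueError:
        continue                           # skip malformed lines, don't fail the job
    print("\t".join(str(record.get(f, "")) for f in WANTED))
```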
21. | 23
Right Schema
[Diagram] The same entities (customer, contract, order, order line, product, sales_person) in three forms:
- OLTP: 3NF transactional source-system schema
- Data Warehouse: dimensional schema
- Hadoop: de-normalized schema
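To make the de-normalized Hadoop schema concrete, a hypothetical single record: the order line carries its customer, product, contract, and sales-person attributes inline, so full-scan analytics need no joins (all values here are invented):

```python
# One de-normalized order-line record as it might land in Hadoop, versus six
# joined tables (order, order line, customer, product, contract, sales_person)
# in the 3NF source system.
order_line = {
    "order_id": 10452,
    "line_no": 3,
    "quantity": 2,
    "customer": {"id": 881, "name": "Acme Corp", "segment": "Enterprise"},
    "product":  {"id": "SKU-77", "name": "Widget", "list_price": 19.99},
    "contract": {"id": "C-2014-09", "discount_pct": 12.5},
    "sales_person": {"id": 42, "region": "Northeast"},
}
```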
22. | 24
Right Workload, Right Tool
Workloads compared across Hadoop, NoSQL, and MPP / Reporting DBs / Mainframe:
- ETL
- Business Intelligence
- Cross-business reporting
- Sub-set analytics
- Full-scan analytics
- Decision Support (TBs-PBs / GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
23. | 25
Understand strengths & weaknesses of each choice
- Get help as needed to make your first effort successful
Deploy the right tool for the right workload
Test and Learn
Summary
http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
24. | 26
Thank You
Douglas Moore
@douglas_ma
Work with the best on a wide range of cool projects:
recruiting@thinkbiganalytics.com
25. Work with the
Leading Innovator in Big Data
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Editor's Notes
When I’m not rock climbing, I’m doing big data data architecture for Think Big clients.
Helping them realize value with data analytics.
3 Years at Think Big
Big Data Warehouse
Search
Streaming
Big Data Roadmaps
Tech assessments
Worked on 5 distributions, including the original Apache
The strengths we bring into this presentation….
This is not even half of it
I wrote the proposal for this spot with just 3 topics in mind,
then
I began discussing this with my colleagues and the topic generated quite a bit of buzz.
It’s amazing how much energy people will put into explaining crazy things they’ve seen.
With all of the architects, clients, projects how many anti-patterns did I come to Strata with?
343 if you’re doing the math in your head.
Many of our customers are big successful companies that have been around a long time
During the 90’s and the Oughts, they developed reference architectures
Based on input from companies like EMC, HP, IBM
They developed the mindset: “SERVERS MUST NOT FAIL.”
And this is what you needed for your Oracle OLTP servers to supplant mainframe DB2 & Tandem.
These servers can range up to $35k/node. At one client they were too embarrassed to show me how much they spent, citing “Proprietary Information.”
I could tell they spent a lot, based on the data node specs:
Dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network…
The best part of this reference architecture is Automated provisioning & configuration management.
Unfortunately I don’t see that as often as I would like.
Let’s go back to first principles of Hadoop & Big Data….
[turn]
Also seeing Hadoop companies migrate back this way to capture dying or dead data.
E.g. Cloudera – Isilon partnership
- Not the best performance but does turn that archive data from “dead data” into data producing business value
Let’s talk about big data and hadoop first principles:
Everything is about locality
Best to bring your computation to where your data is.
What hadoop does is shown here in the diagram
Whether it's hard disk, SSD, main memory, or cache: sequential IO is always faster.
See that actuator Arm? On a modern drive it takes a good 4ms to move from one track to another.
That’s a lifetime in terms of computing.
SANs, virtual drives, and multi-tenant VM farms all essentially incur random-access reads (and writes).
Hadoop strives to move that arm as infrequently as possible.
The only thing slower than a disk seek is a round trip to the Netherlands.
So what Hadoop does:
Does IO in large blocks
Append Only Writes
Disks on each node are in a “Just a Bunch of Disks” (JBOD) configuration, and Hadoop is the only workload accessing those drives, so it can force sequential access and optimal throughput.
Increase your parallelism
- Which brings us to principle #0: do not lock, sync, wait, dawdle, dally, hover, or loiter
Increase # of components
To stay within budget, though, you must spend less per component to get more value per dollar, euro, or yen.
Hadoop helps out in this area:
Data block replication
Buy more servers for the cost of one
You get more spindles and cores
More spindles mean more IO
More cores mean more compute
Ultimately to get more throughput
With more components you need to expect failure. Handle them in software
“That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment”
Best practice: Fix it in the morning
There are more first principles, namely no locking….
Now I want to talk a bit about tooling
, tools within and around Hadoop
A common anti-pattern is: “If it came in the box, then I should use it.”
Example: Oozie
Your enterprise scheduler is well coordinated with the rest of your environment.
Others include, “We should use Pig”
Me: But all your people are SQL programmers
Customer: We should use Pig
We’re already demoing Hive.
Another reason for Hive: Hive leads in terms of optimizations
I like Pig for deeply nested data structures.
A NoSQL anti-pattern develops over the course of time:
First streaming data is loaded into NoSQL to provide some near-real time content serving
This falls down because of some of the previous first principles
Namely locality
You end up with your storage in your Mongo cluster and your compute in the Hadoop cluster, moving the data across the data center to compute an aggregate.
Best practice… split the streams
There’s a place for each of these technologies; we see them as complementary.
It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
So, choose the right tool for the right job
Let’s talk about Hadoop streaming
- Not to be confused with stream processing: Samza, Spark Streaming, Storm
The key purpose of hadoop streaming is to
Integrate legacy code
Integrate analytic tools, like Python, R…
Got to love
Got to hate
I often hear this argument
New code, especially high-volume ETL code: write it in the fastest execution framework
High-value legacy code and analytic tools: Hadoop Streaming
Top Hadoop Use Cases
#2 use case for Hadoop
Quickly combine data from siloed systems
#1 data type is structured
#2 is semi-structured – machine logs, web server logs
#3 is multi-structured – text, image, voice,…
When I structure, I structure right.
Thanks for your time today. We look forward to helping you drive new value from big data.
Questions?
Next steps?