Douglas Moore discusses common anti-patterns seen when implementing big data solutions, based on lessons learned from working with over 50 clients. He covers anti-patterns in hardware and infrastructure (relying on outdated reference architectures), in tooling (trying to do analytics directly in NoSQL databases), and in big data warehousing (over-curating data during ETL). The key is to understand the strengths and weaknesses of different tools and deploy the right solution for the intended workload.
Teradata Partners Conference Oct 2014 Big Data Anti-Patterns
1. | 1
Big Data Anti-Patterns:
Lessons from the Front Lines
Douglas Moore
Principal Data Architect
Think Big, a Teradata Company
2. | 2
About Douglas Moore
Think Big – 3 Years
- Roadmaps
- Delivery
• BDW, Search, Streaming
- Tech Assessments
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at:
@douglas_ma
3. | 3
Think Big
4yr Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
4. | 4
Content Drawn From Vast Amounts of Experience
50+ Clients, including:
- Leading security software vendor
- Leading discount retailer
- …
5. | 5
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
Introduction
Big Data Anti-pattern:
“Commonly applied but bad solution”
[Image credit: I-95, Wikipedia]
6. | 6
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
7. | 7
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
[Image source: HP: The transformation
to HP Converged Infrastructure]
Automated provisioning is a good thing!
8. | 8
Locality Locality Locality
- Bring Computation to Data
#1 Locality
Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
VM Cluster Hadoop Cluster
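A minimal Python sketch of the rack-aware, data-local placement idea above, loosely following the default HDFS policy (first replica on the writing node, second on a node in a different rack, third on another node in that second rack); the node names and rack map here are made up:

```python
import random

# Hypothetical cluster topology: node -> rack
RACKS = {
    "node1": "rack-A", "node2": "rack-A",
    "node3": "rack-B", "node4": "rack-B",
}

def place_replicas(writer_node, racks=RACKS):
    """Pick 3 replica nodes, HDFS-style: 1st on the writer's node,
    2nd on a node in a different rack, 3rd on another node in that
    same remote rack -- locality plus rack-level fault tolerance."""
    first = writer_node
    remote = [n for n, r in racks.items() if r != racks[first]]
    second = random.choice(remote)
    third = random.choice([n for n, r in racks.items()
                           if r == racks[second] and n != second])
    return [first, second, third]

print(place_replicas("node1"))   # e.g. ['node1', 'node4', 'node3']
```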
9. | 9
#2 Sequential IO
Sequential IO >> Random Access
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Large block IO
Append only writes
JBOD
Image credit: Wikipedia.org
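A rough micro-benchmark sketch of the sequential-vs-random gap, in Python; the file size, block size, and path are arbitrary, and the OS page cache can mask the difference on a warm cache (on a cold spinning disk the gap is dramatic):

```python
import os, random, time

PATH = "io_test.bin"
SIZE = 64 * 1024 * 1024          # 64 MB test file (adjust to taste)
BLOCK = 128 * 1024               # 128 KB per read

with open(PATH, "wb") as f:      # create the test file once
    f.write(os.urandom(SIZE))

def timed_read(offsets):
    """Read BLOCK bytes at each offset and return elapsed seconds."""
    start = time.time()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    return time.time() - start

offsets = list(range(0, SIZE, BLOCK))   # in-order offsets = sequential IO
shuffled = offsets[:]
random.shuffle(shuffled)                # shuffled offsets = random access

print("sequential read:", round(timed_read(offsets), 3), "s")
print("random read:    ", round(timed_read(shuffled), 3), "s")
os.remove(PATH)
```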
10. | 10
Increase # parallel components
- Reduce component cost
Data block replication
- Performance
- Availability
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy more servers for the cost of one
• 4-5x spindles
• 4-5x cores
#3 Increase Parallelism
Hadoop Cluster
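A back-of-the-envelope check of this slide's claim using its own price figures; the enterprise node's drive and core counts are not given on the slides, so those two numbers are assumptions:

```python
# Slide figures: $35,000/node reference-architecture spec vs. $8-12,000 commodity
# nodes with ~12 drives and ~12-16 cores. The enterprise node's own drive and
# core counts are NOT on the slides -- the 8 and 8 below are assumptions.
enterprise_cost, ent_drives, ent_cores = 35000, 8, 8
commodity_cost, com_drives, com_cores = 12000, 12, 12

n = enterprise_cost / float(commodity_cost)     # commodity nodes per enterprise node
print("nodes for the cost of one: %.1f" % n)                     # ~2.9
print("spindle ratio: %.1fx" % (n * com_drives / ent_drives))    # ~4.4x
print("core ratio:    %.1fx" % (n * com_cores / ent_cores))      # ~4.4x
```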
11. | 11
Expect Failure
Rack Awareness
Hadoop Cluster
Data Block Replication
Task Retry
Node Black Listing
Monitor Everything
Name Node HA
#4 Failure
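The common thread of these mechanisms is "expect failure, handle it in software." A generic retry-with-backoff sketch in that spirit (this is not Hadoop's actual task-retry code; the names are illustrative):

```python
import time

def with_retries(task, attempts=4, backoff_s=1.0):
    """Run `task` (a zero-argument callable), retrying on failure with
    exponential backoff -- the same 'expect failure, retry the attempt'
    idea Hadoop applies to failed map/reduce tasks."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise                              # give up, surface the failure
            time.sleep(backoff_s * 2 ** (attempt - 1))

# usage: with_retries(lambda: flaky_remote_call())   # flaky_remote_call is hypothetical
```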
13. | 13
Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
Best Practice:
• Use your current enterprise scheduler
14. | 14
Tooling: NoSQL
• “Now I have all of my log data in NoSQL, let’s do analytics over it”
Example
- Streaming data into MongoDB
• Running aggregates
• Running MR jobs
15. | 15
Best Practice
Best Practice:
• Split the stream
• Real-time access in NoSQL
• Batch analytics in Hadoop
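A hedged sketch of the split-stream idea: every event goes both to a NoSQL store for real-time serving and to an append-only spool destined for Hadoop batch loads. It assumes pymongo and a local MongoDB; the database, collection, and file names are hypothetical:

```python
import json
from pymongo import MongoClient

mongo = MongoClient("localhost", 27017)
realtime = mongo["events_db"]["recent_events"]   # serves low-latency lookups

def handle_event(event):
    # 1) real-time path: NoSQL for fast, key-based access
    realtime.insert_one(dict(event))             # copy so the raw event stays clean
    # 2) batch path: append raw JSON to a spool file that a periodic job
    #    ships into HDFS/Hive for full-scan analytics
    with open("events_spool.jsonl", "a") as spool:
        spool.write(json.dumps(event) + "\n")

handle_event({"user": "u123", "action": "login", "ts": "2014-10-20T12:00:00Z"})
```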
16. | 18
Hadoop Streaming
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Right Framework, Right Need…
Hadoop integrates any type of application tooling
- Java
- Python
- R
- C, C++
- Fortran
- Cobol
- Ruby
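Hadoop Streaming makes this language list possible by piping each input split through any executable's stdin/stdout. A minimal word-count mapper/reducer pair in Python (file names are illustrative):

```python
#!/usr/bin/env python
# mapper.py -- reads raw text lines on stdin, emits "word<TAB>1" pairs on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print("%s\t%d" % (word, 1))
```

```python
#!/usr/bin/env python
# reducer.py -- input arrives sorted by key, so counts can be summed per word
import sys

current, count = None, 0
for line in sys.stdin:
    word, n = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(n)
if current is not None:
    print("%s\t%d" % (current, count))
```

Run with something like: hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input logs/ -output counts/ (the exact jar path varies by distribution).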
17. | 19
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with threading
Right Use Case – ETL, Wrong Framework
“It’s much faster to develop in,
developer time is valuable,
just throw a couple more boxes at it”
Bench tested Ruby ETL framework
at 5,000 records / second
18. | 20
Right Use Case – ETL, Wrong Framework…
DO THE MATH:
Storm Java: ~ 1MM+ events / second / server
Storm Ruby: 5000 * 12 cores = 60,000 events / second / server
= 16.67 times more servers
bit.ly/1t0HXJH
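The same arithmetic as a quick sanity check, using the slide's own figures:

```python
java_rate = 1000000      # ~1MM+ events / second / server (Storm + Java)
ruby_rate = 5000 * 12    # 5,000 events/sec/core * 12 cores = 60,000
print(java_rate / float(ruby_rate))   # ~16.67x more servers for the same throughput
```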
Best Practice:
• Write new code in fastest execution framework
• High value legacy code, analytic tools use Hadoop Streaming
• Innovation is Important: Test and Learn
19. | 21
Big Data Warehousing
Hadoop Use Cases
1. ETL Offload
2. Data Warehousing
Hadoop Data Types
1. Structured
2. Semi-structured
3. Multi or Unstructured
20. | 22
Don’t over curate:
“We are going to
- Define and parse 1,000 attributes from the machine log files on ETL servers,
- load just what we need to,
- this will take 6 months”
HCatalog
Navigator, Loom,…
UDFs, UDTFs
- JSON, Regex built in
- Custom Java
- Hadoop Streaming (e.g. use Python, Perl)
Hive Partitions
Recursive directory reads
Bucket Joins
Columnar formats
- ORC
- Parquet
First Principles: #5 Schema on Read
Best Practices:
• Define what you need to
• Parse on Demand
• Structure to optimize
• Beware the data palace fountain & data swamp
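A small sketch of "parse on demand" schema on read: a Hadoop Streaming-style Python mapper that pulls only the handful of fields a given job needs out of raw JSON log lines, leaving the rest unparsed (the field names are hypothetical):

```python
#!/usr/bin/env python
# extract_fields.py -- schema on read: the raw log line is stored as-is in HDFS;
# structure is applied only when a job actually needs these fields.
import json
import sys

WANTED = ("timestamp", "host", "status")   # parse just what this job needs

for raw in sys.stdin:
    try:
        record = json.loads(raw)
    except ValueError:
        continue                           # skip malformed lines, don't fail the job
    print("\t".join(str(record.get(f, "")) for f in WANTED))
```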
21. | 23
Right Schema
[Diagram] The same entities (customer, contract, order, order line, product, sales_person) in three forms:
- OLTP: 3NF transactional source-system schema
- Data Warehouse: dimensional schema
- Hadoop: de-normalized schema
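To make the de-normalized Hadoop schema concrete, a hypothetical single record: the order line carries its customer, product, contract, and sales-person attributes inline, so full-scan analytics need no joins (all values here are invented):

```python
# One de-normalized order-line record as it might land in Hadoop, versus six
# joined tables (order, order line, customer, product, contract, sales_person)
# in the 3NF source system.
order_line = {
    "order_id": 10452,
    "line_no": 3,
    "quantity": 2,
    "customer": {"id": 881, "name": "Acme Corp", "segment": "Enterprise"},
    "product":  {"id": "SKU-77", "name": "Widget", "list_price": 19.99},
    "contract": {"id": "C-2014-09", "discount_pct": 12.5},
    "sales_person": {"id": 42, "region": "Northeast"},
}
```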
22. | 24
Right Workload, Right Tool
Workloads compared across Hadoop, NoSQL, and MPP / Reporting DBs / Mainframe:
- ETL
- Business Intelligence
- Cross-business reporting
- Sub-set analytics
- Full-scan analytics
- Decision Support (TBs-PBs / GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
23. | 25
Understand strengths & weaknesses of each choice
- Get help as needed to make your first effort successful
Deploy the right tool for the right workload
Test and Learn
Summary
http://www.keepcalm-o-matic.co.uk/p/keep-calm-and-climb-on-94/
24. | 26
Thank You
Douglas Moore
@douglas_ma
Work with the best on a wide range of cool projects:
recruiting@thinkbiganalytics.com
25. Work with the
Leading Innovator in Big Data
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
Editor's Notes
When I’m not rock climbing, I’m doing big data data architecture for Think Big clients.
Helping them realize value with data analytics.
3 Years at Think Big
Big Data Warehouse
Search
Streaming
Big Data Roadmaps
Tech assessments
Worked on 5 distributions, including the original Apache
The strengths we bring into this presentation….
This is not even half of it
I wrote the proposal for this spot with just 3 topics in mind,
then
I began discussing this with my colleagues and the topic generated quite a bit of buzz.
It’s amazing how much energy people will put into explaining crazy things they’ve seen.
With all of the architects, clients, projects how many anti-patterns did I come to Strata with?
343 if you’re doing the math in your head.
Many of our customers are big successful companies that have been around a long time
During the 90’s and the Oughts, they developed reference architectures
Based on input from companies like EMC, HP, IBM
They developed the mindset: “SERVERS MUST NOT FAIL.”
And this is what you needed for your Oracle OLTP servers to supplant mainframe DB2 & Tandem.
These servers can range up to $35k/node. At one client they were too embarrassed to show me how much they spent, citing “Proprietary Information.”
I could tell they spent a lot, based on the data node specs:
Dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network…
The best part of this reference architecture is Automated provisioning & configuration management.
Unfortunately I don’t see that as often as I would like.
Let’s go back to first principles of Hadoop & Big Data….
[turn]
Also seeing Hadoop companies migrate back this way to capture dying or dead data.
E.g. Cloudera – Isilon partnership
- Not the best performance but does turn that archive data from “dead data” into data producing business value
Let’s talk about big data and hadoop first principles:
Everything is about locality
Best to bring your computation to where your data is.
What hadoop does is shown here in the diagram
Whether it's hard disk, SSD, main memory, or cache: sequential IO is always faster.
See that actuator Arm? On a modern drive it takes a good 4ms to move from one track to another.
That’s a lifetime in terms of computing.
SANs, virtual drives, and multi-tenant VM farms all essentially incur random-access reads (and writes).
Hadoop strives to move that arm as infrequently as possible.
The only thing slower than a disk seek is a round trip to the Netherlands.
So what Hadoop does:
Does IO in large blocks
Append Only Writes
Disks on each node are in a “Just a Bunch of Disks” (JBOD) configuration, and Hadoop is the only workload accessing those drives, so it can force sequential access and optimal throughput.
Increase your parallelism
- Which brings us to principle #0: do not lock, sync, wait, dawdle, dally, hover, or loiter
Increase # of components
To stay within budget, though, you must spend less per component to get more value per dollar, euro, or yen.
Hadoop helps out in this area:
Data block replication
Buy more servers for the cost of one
You get more spindles and cores
More spindles mean more IO
More cores mean more compute
Ultimately to get more throughput
With more components you need to expect failure. Handle them in software
“That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment”
Best practice: Fix it in the morning
There are more first principles, namely no locking….
Now I want to talk a bit about tooling
, tools within and around Hadoop
A common anti-pattern is: “If it came in the box, then I should use it.”
Example: Oozie
Your enterprise scheduler is well coordinated with the rest of your environment.
Others include, “We should use Pig”
Me: But all your people are SQL programmers
Customer: We should use Pig
We’re already demoing Hive.
Another reason for Hive: Hive leads in terms of optimizations
I like Pig for deeply nested data structures.
A NoSQL anti-pattern develops over the course of time:
First streaming data is loaded into NoSQL to provide some near-real time content serving
This falls down because of some of the previous first principles
Namely locality
You end up with your storage in your Mongo cluster and your compute in the Hadoop cluster, moving the data across the data center to compute an aggregate.
Best practice… split the streams
There’s a place for each of these technologies; we see them as complementary.
It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
So, choose the right tool for the right job
Let’s talk about Hadoop streaming
- Not to be confused with stream processing: Samza, Spark Streaming, Storm
The key purpose of hadoop streaming is to
Integrate legacy code
Integrate analytic tools, like Python, R…
Got to love
Got to hate
I often hear this argument
New code, especially high-volume ETL code: write it in the fastest execution framework
High-value legacy code and analytic tools: Hadoop Streaming
Top Hadoop Use Cases
#2 use case for Hadoop
Quickly combine data from siloed systems
#1 data type is structured
#2 is semi-structured – machine logs, web server logs
#3 is multi-structured – text, image, voice,…
When I structure, I structure right.
Thanks for your time today. We look forward to helping you drive new value from big data.
Questions?
Next steps?