Big Data Anti-Patterns: Lessons From the Front Lines
1. Big Data Anti-Patterns:
Lessons from the Front Lines
Strata NYC
October 17, 2014
Douglas Moore
2. | 2
About Douglas Moore
Think Big – 3 Years
- Delivery
• BDW, Search, Streaming
- Roadmaps
- Tech Assessments
2
Before Big Data
- Data Warehousing
- OLTP
- Systems Architecture
- Electricity
- High End Graphics
- Supercomputers
- Numerical Analysis
Contact me at:
@douglas_ma
3. | 3
Think Big
3
4-Year-Old “Big Data” Professional Services Firm
- Roadmaps
- Engineering
- Data Science
- Hands on Training
Recently acquired by Teradata
• Maintaining Independence
4. | 4
Content Drawn From Vast Amounts of Experience
4
…
50+ Clients, including a leading security software vendor and a leading discount retailer
5. | 5
Introduction
I started out with just 3 topics…
Then while on the road to Strata,
I met 7 big data architects
- Who had 7 clients
• Who had 7 projects
• That demonstrated 7 Anti-Patterns
5
Big Data Anti-pattern:
“Commonly applied but bad solution”
Image credit: I-95 (Wikipedia)
6. | 6
Three Focus Areas
• Hardware and Infrastructure
• Tooling
• Big Data Warehousing
6
7. [Image source: HP: The transformation
to HP Converged Infrastructure]
| 7
Hardware & Infrastructure
Reference Architecture Driven
- 90’s & 00’s data center patterns
- Servers MUST NOT FAIL
- Standard Server Config
• $35,000/node
• Dual Power supply
• RAID
• SAS 15K RPM
• SAN
• VMs for Production
• Flat Network
7
Automated provisioning is a good thing!
8. Co-locate data and compute
Locally Attached Storage
Localize & isolate network traffic
Rack Awareness
| 8
#1 Locality
Locality Locality Locality
- Bring Computation to Data
8
[Diagram: Hadoop cluster vs. VM cluster. In the Hadoop cluster, each node's CPU cores sit next to locally attached disks; in the VM cluster, compute is separated from storage.]
9. | 9
#2 Sequential IO
Sequential IO >> Random Access
9
http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
Large block IO
Append only writes
JBOD
Image credit: Wikipedia.org
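The large-block, append-only idea above can be sketched in a few lines. This is a hypothetical writer (class and parameter names are illustrative, not a Hadoop API) that batches records into large blocks before each write, so the disk sees sequential IO instead of a seek per record:

```python
import io

class AppendOnlyLog:
    """Sketch of an append-only, large-block writer (illustrative only)."""

    def __init__(self, stream, block_size=64 * 1024 * 1024):
        self.stream = stream          # any writable binary stream
        self.block_size = block_size  # large blocks -> mostly sequential IO
        self.buffer = bytearray()

    def append(self, record: bytes):
        # Records accumulate in memory; the disk is only touched per block.
        self.buffer += record + b"\n"
        if len(self.buffer) >= self.block_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.stream.write(bytes(self.buffer))  # one large sequential write
            self.buffer.clear()

# Tiny block size just to show the batching behavior
stream = io.BytesIO()
log = AppendOnlyLog(stream, block_size=16)
log.append(b"event-1")
log.append(b"event-2")
log.flush()
```

Real HDFS writers work the same way in spirit: buffer, then write big contiguous chunks, never rewrite in place.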
10. | Increase # parallel components
- Reduce component cost
Data block replication
- Availability
- Performance
Commodity++ (2014)
- High density data nodes
- $8-12,000
- ~12 drives
- ~12-16 cores
- Buy 4-5 servers for the cost of 1
• 4-5x spindles
• 4-5x cores
#3 Increase parallelism
10
[Diagram: many commodity nodes, each with ~12 CPU cores and ~12 locally attached disks, scaling out in parallel]
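The budget arithmetic on this slide is easy to check with a throwaway calculation, using the rough 2014 prices quoted above (the exact figures are the slide's estimates, not a benchmark):

```python
premium_node_cost = 35_000    # "standard config" enterprise server
commodity_node_cost = 8_000   # low end of the $8-12k commodity range
drives_per_node = 12
cores_per_node = 12

# Same budget, commodity hardware: how many nodes, spindles, cores?
nodes_for_same_budget = premium_node_cost // commodity_node_cost
total_spindles = nodes_for_same_budget * drives_per_node
total_cores = nodes_for_same_budget * cores_per_node

print(nodes_for_same_budget, total_spindles, total_cores)
```

One premium node's budget buys 4 commodity nodes here, which is where the "4-5x spindles, 4-5x cores" claim comes from.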
11. | Expect Failure
Rack Awareness
Data Block Replication
Task Retry
Node Blacklisting
Monitor Everything
Name Node HA
#4 Failure
11
[Diagram: the same scale-out cluster of nodes with CPU cores and locally attached disks; replicated blocks and retried tasks keep work flowing when a disk or node fails]
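The failure-handling ideas above (task retry plus node blacklisting) can be sketched with a toy scheduler. This is a simplified illustration, not Hadoop's actual scheduler; node names and the failure threshold are made up:

```python
def run_tasks(tasks, nodes, run_on, max_failures=3):
    """Toy scheduler: retry a failed task on another node, and
    blacklist any node that fails too many times.

    run_on(task, node) returns True on success, False on failure.
    """
    failures = {node: 0 for node in nodes}
    blacklist = set()
    results = {}
    for task in tasks:
        for node in nodes:
            if node in blacklist:
                continue                   # never schedule onto a bad node
            if run_on(task, node):
                results[task] = node       # task succeeded here
                break
            failures[node] += 1            # task retry: try the next node
            if failures[node] >= max_failures:
                blacklist.add(node)        # stop trusting a flaky node
        else:
            raise RuntimeError(f"task {task} failed on all healthy nodes")
    return results, blacklist

# Demo: node-1 always fails, node-2 always succeeds
healthy = lambda task, node: node != "node-1"
results, bad_nodes = run_tasks(["t1", "t2", "t3", "t4"],
                               ["node-1", "node-2"], healthy)
```

The point of the anecdote in the speaker notes still stands: blacklisting keeps the job running, but someone still has to fix the blacklisted nodes.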
13. | Tooling: Just looking inside the box
“If it came in the box then I should use it”
Example
- Oozie for scheduling
13
Best Practice:
• Use your current enterprise scheduler
14. | Tooling: NoSQL
14
• “Now I have all of my log data
in NoSQL, let’s do analytics
over it”
Example
- Streaming data into MongoDB
• Running aggregates
• Running MR jobs
15. | Best Practice
15
Best Practice:
• Split the stream
• Real-time access in NoSQL
• Batch analytics in Hadoop
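"Split the stream" can be sketched as a fan-out dispatcher: every event goes to both a real-time store and a batch sink. The sink classes below are stand-ins for illustration, not a real MongoDB or HDFS client:

```python
class RealtimeSink:
    """Stand-in for a NoSQL store serving low-latency lookups."""
    def __init__(self):
        self.by_key = {}

    def write(self, event):
        self.by_key[event["id"]] = event  # upsert: keep latest state per key

class BatchSink:
    """Stand-in for append-only files landing in Hadoop for batch analytics."""
    def __init__(self):
        self.log = []

    def write(self, event):
        self.log.append(event)            # keep full history for full scans

def split_stream(events, sinks):
    for event in events:
        for sink in sinks:                # the same event goes to every sink
            sink.write(event)

realtime, batch = RealtimeSink(), BatchSink()
split_stream([{"id": 1, "v": "a"}, {"id": 1, "v": "b"}], [realtime, batch])
```

The real-time sink ends up holding only the latest state per key (good for serving), while the batch sink keeps every event (good for analytics), which is exactly why one store cannot serve both workloads well.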
16. | Key Purpose
- Integrate legacy code
- Integrate analytic tools
• Data science libs
Right Framework, Right Need…
Hadoop supports integrating
any type of application tooling
- Hadoop Streaming
• Python
• R
• C, C++
• Fortran
• Cobol
• Ruby
18
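Hadoop Streaming runs any executable as a mapper or reducer, feeding input splits via stdin and reading tab-separated key/value pairs from stdout. A minimal word-count mapper in Python might look like the sketch below (the job itself would be launched with the hadoop-streaming jar; those launch details are omitted):

```python
import sys

def map_line(line):
    """Emit (word, 1) pairs in Hadoop Streaming's tab-separated format."""
    return [f"{word}\t1" for word in line.split()]

def main(stdin):
    # Hadoop Streaming pipes each input split to the mapper via stdin;
    # whatever we print is shuffled and sorted by key before the reducer.
    for line in stdin:
        for pair in map_line(line):
            print(pair)

if __name__ == "__main__":
    # In a real job this would be: main(sys.stdin)
    main(["to be or not to be"])
```

The same stdin/stdout contract is what lets R, C, Fortran, or Cobol executables plug in as mappers and reducers.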
17. | Right Use Case – ETL, Wrong Framework
Got to love Ruby
- Very Cool (or it was)
- Dynamic Language
- Expressive
- Compact
- Fast Iteration
Got to Hate Ruby
- Slow
- Hard to follow & debug
- Does not play well with
threading
19
“It’s much faster to develop in,
developer time is valuable,
just throw a couple more boxes at it”
Bench tested at 5,000 records /
second
18. | Right Use Case – ETL, Wrong Framework…
20
DO THE MATH:
Storm Java: ~ 1MM+ events / second / Server
Storm Ruby: 5000 * 12 cores = 60,000 events / second / Server
= 16.67 times more servers
“Test and Learn!”
Best Practice:
• Write new code in fastest execution framework
• High value legacy code, analytic tools use Hadoop Streaming
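The "do the math" step is worth reproducing as a scratch calculation; the throughput figures are the rough numbers from the slide, not a fresh benchmark:

```python
java_events_per_server = 1_000_000  # Storm with a Java bolt, per server
ruby_events_per_core = 5_000        # the bench-tested Ruby bolt
cores_per_server = 12

# Per-server Ruby throughput, then the server-count penalty vs. Java
ruby_events_per_server = ruby_events_per_core * cores_per_server
extra_servers_factor = java_events_per_server / ruby_events_per_server

print(f"{extra_servers_factor:.2f}x more servers")
```

A "couple more boxes" turns out to be roughly 17x the hardware, which is the whole argument for writing new high-volume code in the fastest execution framework.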
19. | Big Data Warehousing
#1 ETL Offload
#2 Data Warehousing
21
20. | Right Schema
22
[Diagram: three schemas for the same data]
- OLTP: 3NF transactional source system schema with customer, contract, order, order line, product, sales_person
- Data Warehouse: dimensional schema with the same entities arranged around the order fact
- Hadoop: de-normalized schema, with customer, contract, order, order line, product, sales_person flattened into wide records
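The de-normalization step can be sketched in plain Python: join the 3NF tables into one wide record per order line. The table and column names below are illustrative toy data, not the client's schema:

```python
# Toy 3NF tables keyed by id (illustrative data)
customers = {1: {"name": "Acme"}}
products = {10: {"name": "Widget"}}
orders = {100: {"customer_id": 1}}
order_lines = [{"order_id": 100, "product_id": 10, "qty": 3}]

def denormalize(order_lines, orders, customers, products):
    """Join everything into flat, wide rows, as landed in Hadoop."""
    rows = []
    for line in order_lines:
        order = orders[line["order_id"]]
        rows.append({
            "order_id": line["order_id"],
            "customer": customers[order["customer_id"]]["name"],
            "product": products[line["product_id"]]["name"],
            "qty": line["qty"],
        })
    return rows

flat = denormalize(order_lines, orders, customers, products)
```

Storage is cheap in Hadoop, so repeating the customer and product attributes on every row trades space for scan-friendly, join-free analytics.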
21. | 23
Right Workload, Right Tool
Workload | Hadoop | NoSQL | MPP, Reporting DBs, Mainframe
- ETL
- Business Intelligence
- Cross-business reporting
- Sub-set analytics
- Full-scan analytics
- Decision Support (TBs-PBs vs. GB-TBs)
- Operational Reports
- Complex security requirements
- Search
- Fast Lookup
22. | Summary
Understand strengths & weaknesses of each choice
- Get help if needed
Deploy the right tool for the right workload
Test and Learn
24
23. | Thank You
25
Douglas Moore
@douglas_ma
Work with the best on a wide variety of cool projects:
• recruiting@thinkbiganalytics.com
24. Work with the
Leading Innovator in Big Data
DATA SCIENTISTS
DATA ARCHITECTS
DATA SOLUTIONS
Think Big Start Smart Scale Fast
26
Editor's Notes
3 Years at Think Big
Big Data Warehouse
Search
Streaming
Big Data Roadmaps
Tech assessments
Worked on 5 distributions, including the original Apache
The strengths we bring into this presentation….
This is not even half of it
I wrote the proposal for this spot with just 3 topics in mind,
then
I began discussing this with my colleagues and the topic generated quite a bit of buzz.
It’s amazing how much energy people will put into explaining crazy things they’ve seen.
With all of the architects, clients, projects how many anti-patterns did I come to Strata with?
343 if you’re doing the math in your head.
Many of our customers are big successful companies that have been around a long time
During the 90’s and the Oughts, they developed reference architectures
Based on input from companies like EMC, HP, IBM
They developed the mindset: “SERVERS MUST NOT FAIL.”
And this is what you needed for your Oracle OLTP servers to supplant Mainframe DB2 & Tandem.
These servers can range up to $35k/node. At one client they were too embarrassed to show me how much they spent; they just referenced “Proprietary Information.”
I could tell they spent a lot, based on the data node specs:
Dual power supplies, RAID, 15,000 RPM SAS, SAN, VMs, flattened network…
The best part of this reference architecture is Automated provisioning & configuration management.
Unfortunately I don’t see that as often as I would like.
Let’s go back to first principles of Hadoop & Big Data….
[turn]
Also seeing Hadoop companies migrate back this way to capture dying or dead data.
E.g. Cloudera – Isilon partnership
- Not the best performance but does turn that archive data from “dead data” into data producing business value
Let’s talk about big data and hadoop first principles:
Everything is about locality
Best to bring your computation to where your data is.
What hadoop does is shown here in the diagram
Doesn’t matter whether it’s hard disk, SSD, main memory, or cache: sequential IO is always faster.
See that actuator Arm? On a modern drive it takes a good 4ms to move from one track to another.
That’s a lifetime in terms of computing.
SANs, virtual drives, and multi-tenant VM farms all essentially incur random-access reads (and writes).
Hadoop strives to move that arm as infrequently as possible.
The only thing slower than a disk seek is a round trip to the Netherlands.
So what Hadoop does:
Does IO in large blocks
Append Only Writes
Disks on each node are in a “Just a Bunch of Disks” (JBOD) configuration, and Hadoop is the only workload accessing those drives. It can force sequential access and optimal throughput.
Increase your parallelism
Increase # of components
To keep in budget though, spend less per component
Hadoop helps out in this area:
Data block replication
Buy more servers for the cost of one
You get more spindles and cores
Ultimately to get more throughput
With more components you need to expect failures and handle them in software.
“That reminds me of the operations team that said, it's fault tolerant, so we never have to fix it. Imagine a 300 node cluster where 60 nodes were down (blacklisted) for over two months because the system was fault tolerant, and therefore the tickets to fix it were low priority. In some environments, low priority tickets never get touched. This was that kind of environment”
Best practice: Fix it in the morning
There are more first principles, namely no locking….
Now I want to talk a bit about tooling
, tools within and around Hadoop
A common anti-pattern is: “If it came in the box…
Your enterprise scheduler is well coordinated with the rest of your environment.
Others include, “We should use Pig”
Me: But all your people are SQL programmers
Another reason, SQL leads in terms of optimizations and performance.
I like Pig for deeply nested data structures.
A NoSQL anti-pattern develops over the course of time:
First streaming data is loaded into NoSQL to provide some near-real time content serving
This falls down because of some of the previous first principles
- Namely locality
Best practice… split the streams
There’s a place for each of these technologies; we see them as complementary.
It takes time for these tools to mature. For example, Hive date types didn’t mature until Hive 0.13.
So, choose the right tool for the right job
Let’s talk about Hadoop streaming
- Not to be confused with stream processing (Samza, Spark Streaming, Storm)
The key purpose of hadoop streaming is to
Integrate legacy code
Integrate analytic tools, like Python, R…
Got to love
Got to hate
I often hear this argument
New code , especially high volume ETL code
High value legacy code – Hadoop streaming
Top Hadoop Use Cases
#2 use case for Hadoop
Quickly combine data from siloed systems
Thanks for your time today. We look forward to helping you drive new value from big data.
Questions?
Next steps?