February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Problems while Operationalizing Big Data Apps

Slow, Stuck, or Runaway Apps?
Learn How to Quickly Fix Problems
While Operationalizing Big Data Apps
Shivnath Babu
CTO @ Unravel Data
shivnath@unraveldata.com

About me
Shivnath Babu
Co-founder/CTO, Unravel Data Systems
Adjunct Professor, Duke University
Menlo Park, CA 94025
• R&D on Hadoop, Spark, NoSQL, streaming, & MPP to simplify ongoing app/system management
• Led work at Duke on first self-tuning Hadoop platform: Starfish
• Awards from NSF, IBM, HP
• PhD, Stanford University

Missed SLAs
Poor performance
Failed applications
Underutilized clusters
Low throughput
Unused datasets
Poor data layout
Content
• Challenges in operationalizing big data apps
• How we can improve the state of the art
• Real-life examples

TYPICAL BIG DATA ENVIRONMENT
Image source: http://bits2byte.blogspot.com/
Big Data Apps: BI/SQL, ETL Pipeline, ML/Modeling Pipeline, Data Cubes, Real-time Stream Processing, AI / Deep Learning, Graph Analytics
Alerts, Insights

Operationalizing Big Data Apps is Challenging!
• Doesn’t scale
• Failed
• Slow
• Stuck
• Unpredictable / unreliable performance
• Missed SLA
• Rogue / Contention

Big Data Stack
[Diagram labels: Events, Activity Data, MLlib, Machine Learning, Analytics, Business Intelligence, Data Services, RDBMS]
Devs
• Why is my pipeline slow?
• Why is my app failing?
Ops
• How do I ensure apps are meeting SLAs?
• How to maximize resource utilization?
• How to plan for growth?
Stack is Complex. Expert DataOps Wanted!

What can go wrong?
• Failures
o My query failed after 6 hours!
o What does this Exception mean?
• Bad performance
o My app is very slow
o Pipeline is not meeting 4hr SLA
• Unreliable performance
o My app is stuck
o Latency is 3x worse today
• Poor scalability
o Oh, but it worked on the dev cluster!
• Bad App(le)s
o Tom’s query brought the cluster down
And why?
• Application Problems
o Poor joins/transformations
o Ineffective caching
o Bloated data structures
• Data/Storage Problems
o Skewed data, load imbalance
o Small files, poor data partitioning
• Configuration Problems
o Suboptimal container sizes
o Scheduler weight/capacity settings
• Resource Problems
o Resource contention
o Service degradation (ex: NameNode)

So Many Problems, So Little Time
• Network Settings
• Scheduler Settings
• Machine Degradation
• File Formats
• Wrong Results
• Data Layout
• Bad Joins
• Config Settings
• Code Bugs
DataOps

How do DataOps address this
problem today?

Look at Logs?
Logs in distributed systems are spread out, incomplete,
& usually very difficult to understand

There has to be a better way
Full Stack Performance Intelligence

Full Stack Performance Intelligence from 30k ft
Big Data Stack
• Applications: ETL, BI/SQL, Data Pipelines, Streaming, ML
• Hadoop, Spark, Kafka, Cassandra, Elasticsearch, MPP
• HW, Cloud
Logs, Profiles, Metrics, Events
Apply Predictive Analytics
Intelligence needed by DataOps

Why Full Stack?
• Because problems can happen all over the stack
o Otherwise, we will be blindsided and give wrong insights
• Because it is now possible to:
o Get full-stack telemetry data (high volume, velocity, & variety)
o Reuse distributed systems to process and store this data

What is “Intelligence”?
• Not just graphs and time-series charts
• And not simply throwing some AI/ML and seeing what comes out
Intelligence = Automation to Augment DataOps

Let us Dig Deeper
• We surveyed 250+ DataOps professionals across many verticals to understand where and how intelligence can benefit them
• Use cases from this survey fall into three categories (aka the Three P’s)
1. I have a Problem that I need to fix
2. I want to be Proactive in detecting and fixing problems
3. I need to Plan for future use

Intelligence = Automation to Augment DataOps
DataOps Need | Intelligence Needed
I have a problem | 1. Automated Diagnosis; 2. Automated Remediation
I want to be proactive | 1. Automated Detection; 2. Automated Diagnosis; 3. Automated Prevention/Remediation
I need to plan | 1. Automated Prediction; 2. Automated What-if Analysis

Real-life Example: Proactive Alerts for SLA Management

Unpredictable Pipeline Performance is Common
[Chart: Duration. Annotations: "Anomaly", "Duration trending upwards"]

[Chart: Duration. Annotation: "Duration trending upwards"]

[Chart: I/O. Annotation: "I/O is steady"]

But, This is Just One Type of Contention
• At Resource Manager Level
o App admission time
o Container allocation for Application Master
o Container allocation for tasks
o Container allocation for Executor
• At Application Level
o Workflow Scheduler, e.g., Oozie
o Query Engine, e.g., HiveServer2
• At Master Daemon Level
o NameNode
o Hive MetaStore

Key Takeaways
Resource contention at different levels affects app performance
• Different apps (Oozie workflows, MapReduce, Spark, Tez) are affected differently
• Manual diagnosis can be hard and time-consuming
It is possible to diagnose and remedy such problems automatically
• By analyzing full-stack telemetry data
• By combining: Automated Baselining, Anomaly Detection, & Correlation Analysis
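The combination named above can be sketched in a few lines of Python. This is a minimal illustration, not the method used in the talk: the z-score threshold, the Pearson correlation ranking, and all function and metric names (`is_anomaly`, `rank_suspects`, `container_wait_s`, `io_mb`) are assumptions made for the example.

```python
from statistics import mean, stdev


def baseline(history):
    """Automated baselining: summarize past run durations as (mean, std dev)."""
    return mean(history), stdev(history)


def is_anomaly(value, history, threshold=3.0):
    """Anomaly detection: flag a run whose z-score against the baseline exceeds threshold."""
    mu, sigma = baseline(history)
    if sigma == 0:
        return value != mu
    return abs(value - mu) / sigma > threshold


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def rank_suspects(durations, metrics):
    """Correlation analysis: rank telemetry metrics by |correlation| with run duration."""
    scores = {name: pearson(durations, series) for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)


if __name__ == "__main__":
    # Hypothetical telemetry for seven runs of one pipeline (made-up numbers).
    durations = [60, 62, 61, 59, 63, 61, 95]
    metrics = {
        "container_wait_s": [5, 6, 5, 4, 7, 5, 38],
        "io_mb": [500, 510, 495, 505, 500, 498, 502],
    }
    print(is_anomaly(durations[-1], durations[:-1]))
    print(rank_suspects(durations, metrics))
```

With data shaped like the example in the slides (duration jumps while I/O stays steady), the ranking puts the container wait metric ahead of I/O, which points the diagnosis toward resource contention. Correlation only narrows the search; it does not prove cause.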

Real-life Problem: Hive/Spark App Failure
• My SQL query failed. Why?
• A MapReduce job failed. Why?
• A Task failed. Why?
• JVM went Out-of-Memory. Why?
• Data skew. Where?
• Reduce-side. Got it!
• How to Fix it?
1. At Resource layer, e.g., larger containers
2. At Configuration layer, e.g., turn on dynamic adaptation to skew
3. At Data layer, e.g., separate skewed keys from others
4. At App layer, e.g., filter skew keys or change algorithm
5. Some combination of the above
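As a rough illustration of the data-skew steps above, the sketch below first finds skewed keys (which is what separating them at the data layer would need) and then applies key salting, a common app-layer technique that is not named on the slide. The function names, the `factor` threshold, and the bucket count are assumptions for the example, not settings from Hive or Spark.

```python
import random
from collections import Counter


def find_skewed_keys(keys, factor=5.0):
    """Return keys whose record count exceeds `factor` times the mean count per key."""
    counts = Counter(keys)
    mean_count = sum(counts.values()) / len(counts)
    return {k for k, c in counts.items() if c > factor * mean_count}


def salt_key(key, skewed, buckets=8, rng=random):
    """Spread a skewed key over `buckets` sub-keys; other keys keep salt 0."""
    if key in skewed:
        return (key, rng.randrange(buckets))
    return (key, 0)


if __name__ == "__main__":
    # Hypothetical input: one hot key ("us") plus 100 small keys.
    keys = ["us"] * 9000 + [f"c{i}" for i in range(100)] * 10
    skewed = find_skewed_keys(keys)
    rng = random.Random(0)
    salted = Counter(salt_key(k, skewed, rng=rng) for k in keys)
    print(skewed)
    print(max(salted.values()))  # largest group after salting
```

Salting changes the grouping key, so an aggregation needs a second pass that merges the partial results per original key, and a join needs the other side replicated across the same salt values.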

Real-life Planning: Migrate Apps to Cloud
• How to create perf baselines for on-prem vs. cloud comparison?
• What type of instances to get for same performance on cloud?
• How many permanent vs. spun-on-demand instances are needed?
• Which configuration settings will need tuning for on-prem vs. cloud?
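The question of permanent versus on-demand instances can be explored with a very small what-if calculation. The sketch below is only one possible rule, assumed for illustration: size the permanent pool at a chosen percentile of observed hourly node demand and serve the remainder with on-demand instances. The function name `split_capacity` and the demand numbers are hypothetical.

```python
import math


def split_capacity(hourly_demand, percentile=0.75):
    """Split node demand into a permanent base and on-demand burst capacity.

    Returns (permanent_nodes, peak_on_demand_nodes, on_demand_node_hours).
    """
    ordered = sorted(hourly_demand)
    # Nearest-rank percentile, clamped to a valid index.
    idx = max(0, min(len(ordered) - 1, math.ceil(percentile * len(ordered)) - 1))
    permanent = ordered[idx]
    peak_on_demand = max(hourly_demand) - permanent
    node_hours = sum(max(0, d - permanent) for d in hourly_demand)
    return permanent, peak_on_demand, node_hours


if __name__ == "__main__":
    # Hypothetical nodes needed in each of eight hours.
    demand = [10, 10, 12, 30, 10, 11, 10, 40]
    for p in (0.5, 0.75, 1.0):
        print(p, split_capacity(demand, percentile=p))
```

Sweeping the percentile shows the trade-off between paying for idle permanent nodes and paying for burst node-hours. A real analysis would also need prices, instance types, and the measured on-prem baselines.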

Real-life Planning: Migrate Apps to Cloud
Needs what-if analysis on full-stack data

To Summarize
• Operationalizing big data apps is very challenging for DataOps
• Full Stack Performance Intelligence will augment DataOps to:
1. Deliver quick and high ROI on the Big Data Stack
2. Do more in less time
3. Help them sleep better
