Using Apache Spark to Tune Spark with Shivnath Babu and Adrian Popescu

Adrian'Popescu,'Shivnath Babu
Using&Spark&to&Tune&Spark
#AI7SAIS

Meet$the$speakers
• Data%engineer%at%Unravel
• PhD%from%EPFL,%Switzerland
• 8+%years%of%experience%in%performance%
monitoring%&%modeling%of%data%management%
systems
• Focusing%on%tuning%and%optimization%of%Big%
Data%apps
2#Exp8SAIS
Adrian$Popescu
• Cofounder%and%CTO%at%Unravel,%Adjunct%
Professor%at%Duke%University
• Focusing%on%easeKofKuse%and%manageability%
of%dataKintensive%systems
• Recipient%of%US%National%Science%
Foundation%CAREER%Award,%three%IBM%
Faculty%Awards,%HP%Labs%Innovation%
Research%Award
Shivnath Babu

Many%apps%are%being%built%in%Spark
3#Exp8SAIS

But,%let%us%face%it:%
Running%Spark%apps%in%
production%is%hard
4#Exp8SAIS

My#app#often#fails#with#Out#of#
Memory…
5
DATA#SCIENTIST
#Exp8SAIS

My#app#is#too#slow…
6
DATA#ENGINEER
#Exp8SAIS

My#app#is#missing#SLA…
7
DATA#PIPELINE#OWNER
#Exp8SAIS

This%rogue%app%is%wasting%resources%
and%reducing%cluster%throughput
8
OPERATIONS%TEAMS
#Exp8SAIS

Many%factors%affect%app%performance
9#Exp8SAIS

To#add#to#Spark’s#complexity
• Many#types#of#Spark#apps###########################
• SQL
• Streaming
• AI/ML
• Graph
• Scala/Python/R
• Many#app#submission#methods#in#Spark####
• CLI
• Thrift=Server
• Notebooks=like=Zeppelin,=Jupyter,=Hue
• ETL=tools=like=Informatica,=Pentaho,=and=Talend
• Schedulers=like=Airflow,=Autosys,=Control=M,=Oozie,=Tidal,=TWS
• Many#infrastructure#choices#for#Spark#########
• OnNpremises=multiNtenant=clusters
• Transient=cloud=clusters
• AutoNscaling=clusters
• Containerized=deployments=
Simple#SQL#and#
Programming#Interface#
10#Exp8SAIS

11#Exp8SAIS
Can-we-convert-this-problem
into-a-data-problem?

First:'Bring'all'monitoring'data'to'a'
single'platform
One$complete$correlated$view.
History'Server'
API
Resource'
Manager'API
Container'
Metrics
Logs
Metadata
Data'
Statistics
SQL'Query'
Plans
Configuration
12#Exp8SAIS

Then:&Apply&intelligent&algorithms&to&
analyze&the&data&automatically
One$complete$correlated$view.
History&Server&
API
Resource&
Manager&API
Container&
Metrics
Logs
Metadata
Data&
Statistics
SQL&Query&
Plans
Configuration
Built4in$intelligence.
13#Exp8SAIS

Why$not$use$Spark$itself?
15#Exp8SAIS
History$Server$
API
Resource$
Manager$API
Container$
Metrics
Logs
Metadata
Data$
Statistics
SQL$Query$
Plans
Configuration
What$
application$&$
cluster$
management$
tasks$can$we$
automate$with$
intelligent$
algorithms$in$
Spark?

Let$us$take$three$(hard)$tasks
16#Exp8SAIS
• Failures$in$Spark
• SLA%management%for%real0time%data%pipelines
• Application%autotuning

Manual&Root&Cause&Analysis&of&Spark&Failures
Typical(Failure(in(Spark
• Many(levels(of(correlated(stack(traces
• Identifying(the(root(cause(is(hard(and(time(consuming
17#Exp8SAIS

• Reduce&troubleshooting&time&from&days&to&seconds
• Improve)productivity)of)data)scientists)and)analysts
Automated&Root&Cause&Analysis&of&Spark&Failures
18#Exp8SAIS

Automatic)Root)Cause)Analysis
Error$
Template$
Extraction
Feature$
vectors
Learning$
Algorithm
for$
Predictive$
Model
Container$
Logs
Predictive$
Model
Root$
causes
19#Exp8SAIS

We#have#created#a#Failure#Taxonomy
Configuration
Errors
Data
Errors
Resource1
Errors
Deployment1
Errors
Root1Node
Category1of1failure
Input1Path1
Not1
Available
Number1
Format1
Exception
SparkSQL
JsonProcessing
Exception
…
Root1cause1labels
20#Exp8SAIS

Two$Ways$to$get$Root-Cause$Labels
• Manual'diagnosis'by'a'domain'expert
• Automatic'injection'of'the'root'cause
21#Exp8SAIS

Unravel’s Large,scale.Lab.Framework.for.
Automatic.Root.Cause.Analysis
Spark.and.multi,tenant.Workloads:
, Variety(of(workloads:(Batch,(ML,(SQL,(Streaming,(etc.
Failures:
= Large(set(of(root(causes(learned(from(customers(&(
partners.(Constantly(updated
= Continuously(inject(these(root(causes(to(train(&(test(
models(for(root=cause(prediction(
Environment:
= Lab(created(on(demand(on(cloud(or(on=premises
= Workloads(are(run(and(failures(are(injected
22#Exp8SAIS

Injecting)Failures
Application
Execution
Application
Monitor
FAILED
Injected6
Failure
Label
Labeled6
Failures
• Invalid6input
• Invalid6memory6
configuration
• OOME:6Java6heap6space
• OOME:6GC6overhead6limit6
• Container6killed6by6YARN
• Runtime6incompatibility
Injected)failure)examples:
• No6space6left6on6device
• Transformations6inside6
other6transformations
• Runtime6error
• Arithmetic6error
• Invalid6configuration6
settings
Input6Feature6
Extraction
23#Exp8SAIS

Extracting*Input*Features*from*Logs
java.lang.OutOfMemoryError: Java heap space
at
scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:114)
at
scala.reflect.ManifestFactory$$anon$9.newArray(Manifest.scala:112)
at …
• Extracting+stack+traces+and+error+messages
• Tokenize+by+class+names+and+words
Tokens*example:
java.lang.OutOfmemoryError Java heap space at
scala.reflect.ManifestFactory$$anon$9.newArray(Manife
st.scala:114)
• Create+a+vocabulary+of+words+from+all+words+collected
24#Exp8SAIS

Input&Feature&extraction
• Bag&of&Words&with&TF8IDF
– Computes+a+vocabulary+of+words
– Uses+TF9IDF+to+reflect+importance+of+words+in+a+document
• Doc2Vec
– Maps+words,+paragraphs,+or+documents+to+multi9dimensional+vectors
– Evaluates+the+placement+of+words+wrt neighboring+words
– Uses+a+39layer+neural+network
25#Exp8SAIS

System'Architecture
Error$
Template$
Extraction
Feature$
vectors
Learning$
Algorithm
for$
Predictive$
Model
Container$
Logs
Predictive$
Model
Root$
causes
Root$cause$
of$the$
failure
New$failure
Feature$
vector
Container$
Logs
Error$
Template$
Extraction
26#Exp8SAIS

Predictive)Models
• Shallow)Learning
– Logistic*Regression
– Random*forests
• Deep)Learning
– Neural*networks
Very)easy)to)
implement)these)
in)Spark
27#Exp8SAIS

Predicting*the*Root*Cause*of*Failures
• Training and%testing*with%injected%failures
• Test%to%train%data%set%ratio%75%*to*25%
• Models:%logistic%regression,%random%forests%
80
85
90
95
100
TF>IDF Doc2Vec
Accuracy*Score*
[%]
Logistic%Regression Random%Forests
Work%with%
deep%learning%
is%in%progress
28#Exp8SAIS
See*our*talk*at*
Strata,*NY*2017*
for*more*details

29#Exp8SAIS
• Failures*in*Spark
• SLA$management$for$real>time$data$pipelines
• Application*autotuning

Many%problems%can%cause%unreliable%
performance%in%streaming%pipelines
0
STREAM%STORE
Kafka HBase Spark
REAL?TIME%PROCESSORIoT Sensors
Database
Other%Data
Dashboard
Result%Store
Other%Output
Untimely%
results
Poor%
partitioning
Inefficient%
Configuration
Resource%
contention
+ + =
30#Exp8SAIS

Forecasting,&,Anomaly,
Detection,are,important,for,
streaming,pipelines
1
31#Exp8SAIS

Predicting*when*SLAs*are*in*danger*of*being*missed
Latency*SLA*is*
3*minutes
Latency*SLA*can*be
missed*by*this*time
Current*time*is*here
32#Exp8SAIS

3
Detecting(anomalies(is(tricky
Is(this(an(unexpected(lag(
worth(alerting/acting(on?
33#Exp8SAIS

What%is%an%anomaly?
1. An%unexpected%change%that%needs%your%attention
2. Smart%alerts/actions:%
1. False%positives%should%be%minimal
2. False%negatives%should%be%minimal
4
34#Exp8SAIS

Time%Series%Analysis%in%Spark
5

Time%Series%Analysis%in%Spark
6
See%our%talk%at%Strata,%San%Jose%2018
for%more%insights

37#Exp8SAIS
• Failures*in*Spark
• SLA*management*for*real6time*data*pipelines
• Application$autotuning

spark.driver.cores 2
spark.executor.cores
…
10
spark.sql.shuffle.partitions 300
spark.sql.autoBroadcastJoinThres
hold
20MB
…
SKEW('orders',@'o_custId') true
spark.catalog.cacheTable(“orders") true
…
PERFORMANCE
#AI7SAIS 38
Today,@tuning@is@often@by@trialNandNerror

Imagine(a(new(world
INPUTS
1. App(=(Spark(Query
2. Goal(=(Speedup
“I#need#to#make#this#app#faster”
39#Exp8SAIS

Imagine(a(new(world
TIME
APP(DURATION
In#blink#of#an#eye,#user#
gets#recommendations#to#
make#the#app#30%(faster
As#user#finishes#
checking#email,#she#
has#a#verified#run#
that#is#60%(faster
User#comes#back#from#
lunch.#A#verified#run#that#
is#90%(faster
90%(
faster!
40#Exp8SAIS

Response'Surface'Methodology'Reinforcement'Learning
41#Exp8SAIS

See#our#talk#on#Unravel#Sessions#at#Spark+AI
Summit#2018#on#this#problem#
Frameworks#like#BigDL,#Deeplearning4j,#and#
Keras are#simplifying#the#use#of#deep#
reinforcement#learning#in#Spark
42#Exp8SAIS

In#Summary
• Spark'Performance'Management'can'be'
addressed'using'Spark'itself
43#Exp8SAIS

In#Summary
• Spark'Performance'Management'can'be'
addressed'using'Spark'itself
44#Exp8SAIS
ALGORITHMS
TASKS
YOU#MUST

Sign%up%for%a%free%trial.%We%value%your%feedback!%
https://unraveldata.com/free4trial
Join%the%Unravel%team!
adrian &%shivnath [@unraveldata.com]
45#Exp8SAIS

Using Apache Spark to Tune Spark with Shivnath Babu and Adrian Popescu

Recommended

Recommended

More Related Content

Similar to Using Apache Spark to Tune Spark with Shivnath Babu and Adrian Popescu

Similar to Using Apache Spark to Tune Spark with Shivnath Babu and Adrian Popescu (20)

More from Databricks

More from Databricks (20)

Recently uploaded

Recently uploaded (20)

Using Apache Spark to Tune Spark with Shivnath Babu and Adrian Popescu