Running Spark in Production

Vinay Shukla, Director, Product Management (Twitter: @neomythos)
Saisai (Jerry) Shao, Member, Technical Staff
April 13, 2016

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Who Are We?
Vinay Shukla
• Product Management
• Spark for 2+ years, Hadoop for 3+ years
• Recovering Programmer
• Blog at www.vinayshukla.com
• Twitter: @neomythos
• Addicted to Yoga, Hiking, & Coffee
Saisai (Jerry) Shao
• Member of Technical Staff
• Apache Spark Contributor

Spark Security : Secure Production Deployments

Context: Spark Deployment Modes
• Spark on YARN
– Spark driver (SparkContext) runs in the YARN AM (yarn-cluster)
– Spark driver (SparkContext) runs locally on the client (yarn-client)
• Spark Shell & Spark Thrift Server run in yarn-client mode only
[Figure: YARN-Client and YARN-Cluster diagrams, each showing the Client, App Master, Spark Driver, and Executor]

Spark on YARN
[Figure: a user runs Spark Submit against a Hadoop cluster containing the YARN RM, Node Manager, Spark AM, Executor, and HDFS; steps numbered 1 to 4]

Spark – Security – Four Pillars
• Authentication
• Authorization
• Audit
• Encryption
Spark leverages Kerberos on YARN.
This presumes the network is secured, which is the first step of security.

Kerberos Primer
[Figure: Client, KDC, NameNode (NN), DataNode (DN), and the client's Kerberos ticket cache]
1. kinit: log in and get a Ticket Granting Ticket (TGT)
2. Client stores the TGT in the ticket cache
3. Get a NameNode Service Ticket (NN-ST)
4. Client stores the NN-ST in the ticket cache
5. Read/write file given NN-ST and file name; returns block locations, block IDs, and Block Access Tokens if access is permitted
6. Read/write block given Block Access Token and block ID

Kerberos authentication within Spark
[Figure: John Doe, the KDC, and a Hadoop cluster containing the Spark AM and NameNode (NN); steps numbered 1 to 7. Labels:]
– kinit
– Get service ticket for Spark
– Use Spark ST, submit Spark job
– Spark gets NameNode (NN) service ticket
– YARN launches Spark executors using John Doe's identity
– Executor reads from HDFS using John Doe's delegation token

Spark + X (Source of Data)
[Figure: John Doe, the KDC, and a Hadoop cluster containing the Spark AM and data source X; steps numbered 1 to 7. Labels:]
– kinit
– Get Service Ticket (ST) for Spark
– Use Spark ST, submit Spark job
– Spark gets X ST
– YARN launches Spark executors using John Doe's identity
– Executor reads from X using John Doe's delegation token

Spark – Kerberos - Example
kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster --num-executors 3 --driver-memory 512m \
  --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10

Spark – Authorization
[Figure: John Doe, the KDC, Ranger, a YARN cluster (nodes A, B, C), and HDFS. Labels:]
– Client gets service ticket for Spark
– Use Spark ST, submit Spark job
– Get NameNode (NN) service ticket
– Executors read from HDFS
– Ranger: Can John launch this job?
– Ranger: Can John read this file?

Spark – Communication Channels – YARN-CLUSTER Mode
[Figure: components Spark Submit, RM, NM, AM/Driver, Shuffle Service, executors Ex 1 to Ex N, and Data Source; channels Control/RPC, Shuffle Data, Shuffle BlockTransfer, Read/Write Data, and FS (Broadcast, File Download)]

Spark – Channel Encryption - Example
[Figure: channels Shuffle Data, Control/RPC, Shuffle BlockTransfer, Read/Write Data, and FS (Broadcast, File Download), annotated with:]
– spark.authenticate.enableSaslEncryption = true
– spark.authenticate = true (leverages YARN to distribute keys)
– Depends on the data source: for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
– NM > Ex leverages YARN-based SSL
– spark.ssl.enabled = true
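
The Spark properties named on this slide can be passed to spark-submit as --conf flags. The following is a minimal sketch, assuming those three properties are the ones to enable. The helper name encryption_conf_args is hypothetical. Data-source encryption and YARN shuffle SSL are configured outside Spark and are not covered, and spark.ssl.enabled usually needs keystore settings that are not shown.

```python
# Spark properties taken from the slide; spark-submit expects string values.
ENCRYPTION_PROPS = {
    "spark.authenticate": "true",
    "spark.authenticate.enableSaslEncryption": "true",
    "spark.ssl.enabled": "true",
}


def encryption_conf_args(props=ENCRYPTION_PROPS):
    """Return a flat argument list with one --conf pair per property."""
    args = []
    for key in sorted(props):
        args += ["--conf", f"{key}={props[key]}"]
    return args


# Print the resulting command prefix; application arguments would follow.
print(" ".join(["spark-submit"] + encryption_conf_args()))
```

Keeping the properties in one dictionary makes it easy to review which channels are covered before a job is submitted.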

Gotchas with Spark Security
• Client > Spark Thrift Server > Spark Executors: no identity propagation on the 2nd hop
– Forces STS to run as the Hive user to read all data
– Reduces security
– Use SparkSQL via the shell or the programmatic API instead
– https://issues.apache.org/jira/browse/SPARK-5159
• SparkSQL: granular security unavailable
– Ranger integration will solve this problem (TP coming)
– Brings row/column-level security and masking features to SparkSQL
• Spark + HBase with Kerberos
– Issue fixed in Spark 1.4 (SPARK-6918)
• Spark Streaming + Kafka + Kerberos + SSL
– Issues fixed in HDP 2.4.x
• Spark jobs > 72 hours
– Kerberos token not renewed before Spark 1.5

Spark Performance

#1 Big or Small Executor?
spark-submit --master yarn \
  --deploy-mode client \
  --num-executors ? \
  --executor-cores ? \
  --executor-memory ? \
  --class MySimpleApp \
  mySimpleApp.jar \
  arg1 arg2

Show the Details
• Cluster resources: 72 cores + 288GB memory (3 nodes, each with 24 cores and 96GB memory)

Benchmark with Different Combinations
Cores per E#    Memory per E (GB)    E per node
1               6                    18
2               12                   9
3               18                   6
6               36                   3
9               54                   2
18              108                  1
# E stands for Executor
Each combination uses 18 cores and 108GB memory per node.
[Figure: benchmark results for each combination; the lower the better]
Except for executors that are too small or too big, performance is relatively close.

Big or Small Executor?
• Avoid executors that are too small or too big.
– Small executors decrease CPU and memory efficiency.
– Big executors introduce heavy GC overhead.
• Usually 3 ~ 6 cores and 10 ~ 40 GB of memory per executor is a good choice.
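
The benchmark combinations and the rule of thumb above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each node's usable resources (18 cores and 108GB in the benchmark) are split evenly among executors. The function names executor_layout and in_preferred_range are hypothetical.

```python
def executor_layout(node_cores, node_mem_gb, cores_per_executor):
    """Split one node's usable cores and memory among equally sized executors."""
    if not 1 <= cores_per_executor <= node_cores:
        raise ValueError("cores_per_executor must be between 1 and node_cores")
    executors_per_node = node_cores // cores_per_executor
    mem_per_executor_gb = node_mem_gb // executors_per_node
    return executors_per_node, mem_per_executor_gb


def in_preferred_range(cores, mem_gb):
    """Check the rule of thumb: 3 to 6 cores and 10 to 40 GB per executor."""
    return 3 <= cores <= 6 and 10 <= mem_gb <= 40


# Walk through the benchmark's cores-per-executor values.
for cores in (1, 2, 3, 6, 9, 18):
    executors, mem_gb = executor_layout(18, 108, cores)
    print(cores, mem_gb, executors, in_preferred_range(cores, mem_gb))
```

Of the six benchmark combinations, only the 3-core/18GB and 6-core/36GB layouts fall inside the suggested range.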

Anything Else?
• Executor memory != container memory
• Container memory = executor memory + overhead memory (10% of executor memory)
• Leave some resources for other services and the system
• Enable CPU scheduling if you want to constrain CPU usage#
#CPU scheduling -
http://hortonworks.com/blog/managing-cpu-resources-in-y
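
The container memory formula above can be sketched as follows. The 10% overhead comes from the slide. The 384MB floor is an assumption based on my understanding of the Spark-on-YARN default for executor memory overhead, and the function name container_memory_mb is hypothetical.

```python
def container_memory_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Estimate the YARN container size requested for one executor, in MB."""
    # Overhead is a fraction of executor memory, with an assumed lower bound.
    overhead_mb = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead_mb


# An 18 GB executor needs a container of roughly 19.8 GB.
print(container_memory_mb(18 * 1024))
# A 512 MB executor is dominated by the assumed 384 MB floor.
print(container_memory_mb(512))
```

When planning how many executors fit on a node, divide the node's memory by the container size, not by the executor memory.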

#2 Task Parallelism?
• What is the preferable number of partitions (task parallelism)?
rdd.groupByKey(numPartitions)
rdd.reduceByKey(func, [numPartitions])
rdd.sortByKey([ascending], [numPartitions])

Task Parallelism?
[Figure: three map tasks feeding two reduce tasks (reduce1, reduce2)]

Task Parallelism?
• Task parallelism == number of reducers
• Increasing the task parallelism reduces the data each task has to process
[Figure: three map tasks feeding three reduce tasks]
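
The effect of adding reducers can be illustrated without a cluster. This is a minimal sketch in plain Python, assuming integer keys and a modulo rule as a stand-in for hash partitioning. The function name partition_sizes is hypothetical and is not a Spark API.

```python
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each reduce partition under modulo hashing."""
    counts = Counter(key % num_partitions for key in keys)
    return [counts.get(p, 0) for p in range(num_partitions)]


records = range(1200)
# Two reducers: each processes 600 records.
print(partition_sizes(records, 2))
# Three reducers: each processes 400 records.
print(partition_sizes(records, 3))
```

With evenly distributed keys the per-task load shrinks in proportion to the number of partitions; skewed keys would break that proportionality.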

Big or Small Task?
• Small tasks
– Fine-grained partitions can mitigate data skew to some extent.
– Less data per task alleviates disk spill.
– More tasks increase the scheduling overhead.
• Big tasks
– A shuffle block that is too big (> 2GB) will crash the Spark application.
– Data skew appears more easily.
– GC becomes frequent.
• Summary
– Make full use of cluster parallelism.
– Consider a preferable task size according to the size of the shuffle input (~128MB).
– Prefer small tasks over big tasks (increase the task parallelism).
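
The summary can be turned into a starting value for numPartitions. This is a minimal sketch, assuming that ~128MB is the target shuffle input per task and that the partition count should never drop below the total number of executor cores. Both readings are interpretations of the slide, and the function name suggest_num_partitions is hypothetical.

```python
import math

MB = 1024 ** 2
GB = 1024 ** 3


def suggest_num_partitions(shuffle_input_bytes, total_cores, target_task_bytes=128 * MB):
    """Pick a partition count from the shuffle input size and cluster parallelism."""
    # Enough partitions that each task reads about target_task_bytes.
    by_size = math.ceil(shuffle_input_bytes / target_task_bytes)
    # Never fewer partitions than cores, so no core sits idle.
    return max(by_size, total_cores, 1)


# 100 GB of shuffle input on 54 cores: sized by data volume.
print(suggest_num_partitions(100 * GB, 54))
# 1 GB of shuffle input on 54 cores: sized by core count.
print(suggest_num_partitions(1 * GB, 54))
```

The result is only a first guess; monitored spill, GC, and skew should drive further adjustment.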

Benchmark with Different Task Parallelism
[Figure: benchmark results for different task parallelism settings; the lower the better]

Conclusion
• Different workloads have different processing patterns; knowing the pattern points tuning in the right direction.
• No one rule fits all situations; tuning a workload based on its monitored behavior saves effort and gets a better result.
• The more you know about the framework, the better your tuning results will be.

Thank You
Vinay Shukla, @neomythos
Saisai (Jerry) Shao
