Running Apache Spark & Apache Zeppelin in Production
Director, Product Management
August 31, 2016
Twitter: @neomythos
Vinay Shukla
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Who am I?
 Product Management
 Spark for 2.5+ years, Hadoop for 3+ years
 Recovering Programmer
 Blog at www.vinayshukla.com
 Twitter: @neomythos
 Addicted to Yoga, Hiking, & Coffee
 Smallest contributor to Apache Zeppelin
Vinay Shukla
Slide 3
Security: Rings of Defense
Perimeter Level Security
• Network security (e.g., firewalls)
OS Security
Authentication
• Kerberos
• Knox (and other gateways)
Authorization
• Apache Ranger/Sentry
• HDFS permissions
• HDFS ACLs
• YARN ACLs
Data Protection
• Wire encryption
• HDFS TDE/DARE
• Others
Slide 4
Interacting with Spark
[Diagram: Zeppelin, Spark Shell, the Spark Thrift Server, and a REST server each host a Driver that launches Executors (Ex) on Spark on YARN.]
Slide 5
Context: Spark Deployment Modes
• Spark on YARN
– yarn-cluster: Spark driver (SparkContext) runs in the YARN AM
– yarn-client: Spark driver (SparkContext) runs locally in the client
• Spark Shell & Spark Thrift Server run in yarn-client mode only
[Diagram: in YARN-Client the Spark Driver runs in the client process, with an App Master on the cluster; in YARN-Cluster the Spark Driver runs inside the App Master. Executors run on the cluster in both modes.]
Slide 6
Spark on YARN
[Diagram, steps 1-4: John Doe runs Spark Submit against the Hadoop cluster; the YARN RM accepts the job and starts the Spark AM on a Node Manager; the AM launches Executors, which work against HDFS.]
Slide 7
Spark – Security – Four Pillars
 Authentication
 Authorization
 Audit
 Encryption
Spark leverages Kerberos on YARN.
Ensure the network is secure.
Slide 8
Kerberos Authentication within Spark
[Diagram, steps 1-7: John Doe runs kinit against the KDC; the client gets a service ticket for Spark; the Spark ST is used to submit the Spark job; Spark gets a Namenode (NN) service ticket; YARN launches the Spark Executors using John Doe's identity; the Executors read from HDFS using John Doe's delegation token. Spark AM and NN run inside the Hadoop cluster.]
Slide 9
Spark – Kerberos – Example

kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM

./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster --num-executors 3 --driver-memory 512m \
  --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
Slide 10
Spark – Authorization
[Diagram: John Doe authenticates to the KDC and the client gets a service ticket for Spark; the Spark ST is used to submit the job, with Ranger answering "Can John launch this job?"; Spark gets a Namenode (NN) service ticket; the Executors read from HDFS across the YARN cluster, with Ranger answering "Can John read this file?"]
Slide 11
Encryption: Spark – Communication Channels
[Diagram of the channels between Spark Submit, the RM, the AM/Driver, the NM with its Shuffle Service, and the Executors (Ex 1 … Ex N): Control/RPC, Shuffle BlockTransfer, Shuffle Data, Read/Write against the Data Source, and FS (Broadcast, File Download).]
Slide 12
Spark Communication Encryption Settings
• Control/RPC: spark.authenticate = true; leverages YARN to distribute keys
• Shuffle BlockTransfer: spark.authenticate.enableSaslEncryption = true
• Shuffle Data (NM > Ex): leverages YARN-based SSL
• Read/Write Data: depends on the data source; for HDFS, RPC encryption (RC4 | 3DES), or SSL for WebHDFS
• FS (Broadcast, File Download): spark.ssl.enabled = true
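Collected into one place, these settings form a spark-defaults.conf fragment. This is a sketch only, using the property names from the Spark 1.x line; verify against the security documentation for your Spark version:

```properties
# Authenticate Spark internal connections (keys distributed via YARN)
spark.authenticate                       true
# SASL encryption for shuffle block transfers
spark.authenticate.enableSaslEncryption  true
# SSL for the file server / broadcast / file download channels
spark.ssl.enabled                        true
```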
Slide 13
Sharp Edges with Spark Security
 SparkSQL: only coarse-grained access control today
 Client > Spark Thrift Server > Spark Executors: no identity propagation on the 2nd hop
– Lowers security; forces the STS to run as the Hive user to read all data
– Workaround: use SparkSQL via the shell or the programmatic API
– https://issues.apache.org/jira/browse/SPARK-5159
 Spark Streaming + Kafka + Kerberos
– Issues fixed in HDP 2.4.x
– No SSL support yet
 Spark Shuffle: only SASL, no SSL support
 Spark Shuffle: no encryption for spill-to-disk or intermediate data

Fine-grained security for SparkSQL:
http://bit.ly/2bLghGz
http://bit.ly/2bTX7Pm
Slide 14
Apache Zeppelin Security
Slide 15
Apache Zeppelin: Authentication + SSL
[Diagram: John Doe connects to Zeppelin over SSL through a firewall (1); Zeppelin authenticates him against LDAP (2); his code runs on Spark on YARN Executors (3).]
Slide 16
Zeppelin + Livy E2E Security
[Diagram: John Doe authenticates to Zeppelin against LDAP; the Spark interpreter group calls the Livy APIs with SPNEGO (Kerberos); Livy submits to Spark on YARN over Kerberos/RPC, and the job runs as John Doe.]
Slide 17
Apache Zeppelin: Authorization

Authorization in Zeppelin
 Note-level authorization
– Grant permissions (Owner, Reader, Writer) to users/groups on notes
– LDAP group integration
 Zeppelin UI authorization
– Allow only admins to configure interpreters
– Configured in shiro.ini:

[urls]
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]

Authorization at Data Level
 For Spark with Zeppelin > Livy > Spark
– Identity propagation: jobs run as the end user
 For Hive with Zeppelin > JDBC interpreter
 Shell interpreter
– Runs as the end user
Slide 18
Apache Zeppelin: Credentials

Credentials in Zeppelin
 LDAP/AD account: Zeppelin leverages the Hadoop Credential API
 Interpreter credentials: not solved yet; this is still an open issue
Slide 19
Apache Zeppelin: AD Authentication

Zeppelin leverages Apache Shiro for authentication/authorization. Configure Zeppelin to authenticate users:

1. /etc/zeppelin/conf/shiro.ini
[urls]
/api/version = anon
/** = authc
Slide 20
Apache Zeppelin: Active Directory Authentication

1. Create an entry for the AD credential (skip this step if securing the LDAP password is not an issue)
– Zeppelin leverages the Hadoop Credential API:
hadoop credential create activeDirectoryRealm.systemPassword -provider jceks://etc/zeppelin/conf/credentials.jceks
– chmod 400 the credential file so that only the Zeppelin process can read it; no other user is allowed access

2. Configure Zeppelin to use AD:

activeDirectoryRealm = org.apache.zeppelin.server.ActiveDirectoryGroupRealm
activeDirectoryRealm.systemUsername = CN=Administrator,CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
#activeDirectoryRealm.systemPassword = Password1!
activeDirectoryRealm.hadoopSecurityCredentialPath = jceks://etc/zeppelin/conf/credentials.jceks
activeDirectoryRealm.searchBase = CN=Users,DC=HWQE,DC=HORTONWORKS,DC=COM
activeDirectoryRealm.url = ldap://ad-nano.qe.hortonworks.com:389
activeDirectoryRealm.groupRolesMap = "CN=admin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"admin","CN=finance,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"finance","CN=zeppelin,OU=groups,DC=HWQE,DC=HORTONWORKS,DC=COM":"zeppelin"
activeDirectoryRealm.authorizationCachingEnabled = true
Slide 21
Spark Performance
Slide 22
#1 Big or Small Executor?

spark-submit --master yarn \
  --deploy-mode client \
  --num-executors ? \
  --executor-cores ? \
  --executor-memory ? \
  --class MySimpleApp \
  mySimpleApp.jar \
  arg1 arg2
Slide 23
Show the Details
 Cluster resources: 72 cores + 288 GB memory (3 nodes of 24 cores / 96 GB each)
Slide 24
Benchmark with Different Combinations

Cores per E*   Memory per E (GB)   E per Node
1              6                   18
2              12                  9
3              18                  6
6              36                  3
9              54                  2
18             108                 1

*E stands for Executor; 18 cores / 108 GB of memory per node. (Benchmark chart: the lower the better.)
Except for the smallest and largest executors, performance is relatively close.
Slide 25
Big or Small Executor?
 Avoid too-small or too-big executors.
– Too-small executors decrease CPU and memory efficiency.
– Too-big executors introduce heavy GC overhead.
 Usually 3-6 cores and 10-40 GB of memory per executor is a preferable choice.
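As a quick sanity check, this guidance reduces to shell arithmetic. A minimal sketch, assuming the 24-core / 96 GB nodes from the earlier slide and a hypothetical choice of 4 cores per executor (inside the suggested 3-6 range):

```shell
# Sketch: executors per node for a 24-core / 96 GB worker node.
# 4 cores per executor is a hypothetical pick within the 3-6 range above.
node_cores=24
node_mem_gb=96
cores_per_executor=4

executors_per_node=$(( node_cores / cores_per_executor ))
mem_per_executor_gb=$(( node_mem_gb / executors_per_node ))

echo "${executors_per_node} executors/node, ${cores_per_executor} cores, ${mem_per_executor_gb} GB each"
```

Six 16 GB executors per node also satisfy the 10-40 GB memory guideline; in practice, subtract OS and NodeManager overhead from node_mem_gb before dividing.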
Slide 26
Any Other Things?
 Executor memory != container memory
 Container memory = executor memory + overhead memory (10% of executor memory)
 Leave some resources for the OS and other services
 Enable CPU scheduling if you want to constrain CPU usage*

*CPU scheduling: http://hortonworks.com/blog/managing-cpu-resources-in-y
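The container-size relation above can be checked the same way. A minimal sketch for a hypothetical 4 GB executor; note that Spark's actual default overhead is max(384 MB, 10% of executor memory), tunable via spark.yarn.executor.memoryOverhead:

```shell
# Sketch: YARN container size for a hypothetical 4 GB (4096 MB) executor.
executor_mb=4096
overhead_mb=$(( executor_mb / 10 ))     # the 10% rule from the slide
if [ "$overhead_mb" -lt 384 ]; then     # Spark also applies a 384 MB floor
  overhead_mb=384
fi
container_mb=$(( executor_mb + overhead_mb ))
echo "container = ${container_mb} MB"   # 4096 + 409 = 4505 MB
```

YARN then rounds the container up to a multiple of yarn.scheduler.minimum-allocation-mb, so the real grant can be slightly larger.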
Slide 27
Multi-tenancy for Spark: Cluster Resource Utilization
 Leverage YARN queues
– Set user quotas
– Set the default YARN queue in spark-defaults
– Users can override it per job
 Leverage dynamic resource allocation
– Specify the range of executors a job may use
– Requires the external shuffle service
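Both levers can be expressed in spark-defaults.conf. A sketch, assuming Spark 1.6+ property names and a hypothetical queue named analytics:

```properties
# Default YARN queue; users may still override per job with --queue
spark.yarn.queue                       analytics
# Dynamic allocation requires the external shuffle service on each NodeManager
spark.shuffle.service.enabled          true
spark.dynamicAllocation.enabled        true
spark.dynamicAllocation.minExecutors   1
spark.dynamicAllocation.maxExecutors   10
```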
Slide 28
Thank You
Vinay Shukla
@neomythos
Editor's Notes

  • #7 John Doe first authenticates to Kerberos before launching the Spark shell: kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM, then ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  • #8 The first step of security is network security; the second is authentication. Most Hadoop ecosystem projects rely on Kerberos for authentication. Kerberos, the three-headed guard dog: https://en.wikipedia.org/wiki/Cerberus
  • #9 John Doe first authenticates to Kerberos before launching the Spark shell: kinit -kt /etc/security/keytabs/johndoe.keytab johndoe@EXAMPLE.COM, then ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn-cluster --num-executors 3 --driver-memory 512m --executor-memory 512m --executor-cores 1 lib/spark-examples*.jar 10
  • #11 Controlling HDFS authorization is easy and done; controlling Hive row/column-level authorization in Spark is WIP.
  • #12 For HDFS as the data source, use RPC encryption or SSL with WebHDFS. For NM shuffle data, use YARN SSL. Spark supports SSL for FS (broadcast or file download). Shuffle block transfer supports SASL-based encryption; SSL is coming.
  • #18 Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  • #19 Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  • #20 Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  • #21 Thank you Prasad Wagle (Twitter) & Prabhjot Singh (Hortonworks)
  • #29 All images from Flickr Commons