How to create a multi-tenancy for interactive data analysis
1. How to create a multi-tenancy for interactive data analysis
Spark Cluster + Livy + Zeppelin
2. Introduction
With this presentation you should be able to create an architecture for a framework for interactive data analysis, using a Spark cluster with Kerberos, a Livy server, and a Zeppelin notebook with Kerberos authentication.
3. Architecture
This architecture enables the following:
● Transparent data-science development.
● Upgrades on the cluster won't affect developments.
● Controlled access to data and resources via Kerberos/Sentry.
● High availability.
● Several coding APIs (Scala, R, Python, PySpark, etc.).
5. Livy server configuration
Create User and Group for Livy
sudo useradd livy
sudo passwd livy
sudo usermod -G bdamanager livy
Create User Zeppelin for the IDE
sudo useradd zeppelin
sudo passwd zeppelin
Note 1: due to Livy impersonation, livy should be added to the cluster supergroup, so replace the highlighted name with your supergroup name.
Note 2: the chosen IDE here is Zeppelin; if you choose another, just replace the highlighted field.
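A quick sanity check that livy really landed in the supergroup, as a minimal Python sketch (bdamanager is this deck's example supergroup name; replace with yours):
import grp
import pwd

SUPERGROUP = "bdamanager"  # assumption: your cluster supergroup
group = grp.getgrnam(SUPERGROUP)
# livy counts as a member either via the group member list or via its primary gid
in_group = "livy" in group.gr_mem or pwd.getpwnam("livy").pw_gid == group.gr_gid
print("livy in %s: %s" % (SUPERGROUP, in_group))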
6. Livy server configuration
Download and installation
su livy
cd /home/livy
wget http://mirrors.up.pt/pub/apache/incubator/livy/0.5.0-incubating/livy-0.5.0-incubating-bin.zip
unzip livy-0.5.0-incubating-bin.zip
cd livy-0.5.0-incubating-bin/
mkdir logs
cd conf/
mv livy.conf.template livy.conf
mv livy-env.sh.template livy-env.sh
mv livy-client.conf.template livy-client.conf
Edit Livy environment variables
nano livy-env.sh
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.12.2-1.cdh5.12.2.p0.4/lib/spark/
export HADOOP_HOME=/opt/cloudera/parcels/CDH-5.12.2-1.cdh5.12.2.p0.4
export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera/
export HADOOP_CONF_DIR=/etc/hadoop/conf
export LIVY_HOME=/home/livy/livy-0.5.0-incubating-bin/
export LIVY_LOG_DIR=/var/log/livy2
export LIVY_SERVER_JAVA_OPTS="-Xmx2g"
Make Livy Hive-aware
sudo ln -s /etc/hive/conf/hive-site.xml /etc/spark/conf/hive-site.xml
7. Livy server configuration
Edit livy configuration file
nano livy.conf
# What spark master Livy sessions should use.
livy.spark.master = yarn
# What spark deploy mode Livy sessions should use.
livy.spark.deploy-mode = cluster
# Whether Livy should impersonate the requesting user when creating a new session.
livy.impersonation.enabled = true
# Whether to enable HiveContext in the Livy interpreter; if true, hive-site.xml will be detected on the user request and added to the Livy server classpath automatically.
livy.repl.enable-hive-context = true
8. Livy server configuration
Edit livy configuration file
# Add Kerberos Config
livy.server.launch.kerberos.keytab = /home/livy/livy.keytab
livy.server.launch.kerberos.principal = livy/cm1.localdomain@DOMAIN.COM
livy.server.auth.type = kerberos
livy.server.auth.kerberos.keytab = /home/livy/spnego.keytab
livy.server.auth.kerberos.principal = HTTP/cm1.localdomain@DOMAIN.COM
livy.server.access-control.enabled = true
livy.server.access-control.users = zeppelin,livy
livy.superusers = zeppelin,livy
Note 1: in this example the chosen IDE is Zeppelin.
Note 2: livy.impersonation.enabled = true implies that Livy will be able to impersonate any user present on the cluster (proxyUser).
Note 3: livy.server.auth.type = kerberos implies that any user interacting with Livy must be correctly authenticated.
Note 4: you only need to change the highlighted values, e.g. your hostname.
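With Kerberos authentication enabled, an unauthenticated request should be rejected. A minimal Python sketch to verify (hostname and port are this deck's example values):
import requests

# Without SPNEGO/Kerberos credentials the server should answer 401 Negotiate
r = requests.get("http://cm1.localdomain:8998/sessions")
print(r.status_code)  # expect 401 once livy.server.auth.type = kerberos is active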
9. Livy server configuration
Create Kerberos Livy and Zeppelin principals and keytabs
sudo kadmin.local <<eoj
addprinc -pw welcome1 livy/cm1.localdomain@DOMAIN.COM
modprinc -maxrenewlife 1week livy/cm1.localdomain@DOMAIN.COM
xst -norandkey -k /home/livy/livy.keytab livy/cm1.localdomain@DOMAIN.COM
addprinc -pw welcome1 zeppelin/cm1.localdomain@DOMAIN.COM
modprinc -maxrenewlife 1week zeppelin/cm1.localdomain@DOMAIN.COM
xst -norandkey -k /home/livy/zeppelin.keytab zeppelin/cm1.localdomain@DOMAIN.COM
xst -norandkey -k /home/livy/spnego.keytab HTTP/cm1.localdomain@DOMAIN.COM
eoj
Create Log Dir and add Permissions
cd /home
sudo chown -R livy:livy livy/
sudo mkdir /var/log/livy2
sudo chown -R livy:bdamanager /var/log/livy2
Note: you only need to change the highlighted names: your hostname and, lastly, your supergroup name.
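To confirm the keytabs contain the expected principals, a minimal sketch driving klist from Python (assumes klist is on the PATH; paths are the deck's examples):
import subprocess

for keytab in ("/home/livy/livy.keytab", "/home/livy/zeppelin.keytab", "/home/livy/spnego.keytab"):
    print("== %s ==" % keytab)
    # -k lists the keys in a keytab, -t adds entry timestamps
    subprocess.call(["klist", "-kt", keytab])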
10. Cloudera configuration
HUE - Create Users Livy, Zeppelin and add Livy to a Supergroup
HDFS - Add Livy proxyuser permissions
On the Cloudera Manager menu:
HDFS > Advanced Configuration Snippet for core-site.xml
you should add the following XML:
<property>
  <name>hadoop.proxyuser.livy.groups</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.livy.hosts</name>
  <value>*</value>
</property>
11. Interact with Livy server
Start Livy server
sudo -u livy /home/livy/livy-0.5.0-incubating-bin/bin/livy-server
Verify that the server is running by connecting to its web UI, which uses port 8998 by default:
http://cm1.localdomain:8998/ui
Authenticate with a user principal, for example:
kinit livy/cm1.localdomain@DOMAIN.COM
kinit tpsimoes/cm1.localdomain@DOMAIN.COM
Livy offers a REST API to create interactive sessions and submit Spark code the same way you would with a Spark shell or a PySpark shell. The following interaction examples with the Livy server will be in Python.
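The deck drives the API with curl below; the same calls can be scripted. A minimal Python sketch, assuming the requests and requests-kerberos packages and a valid Kerberos ticket (kinit done above):
import json
import requests
from requests_kerberos import HTTPKerberosAuth

LIVY = "http://cm1.localdomain:8998"
auth = HTTPKerberosAuth()

# Create a PySpark session, impersonating an end user via proxyUser
payload = {"kind": "pyspark", "proxyUser": "tpsimoes"}
r = requests.post(LIVY + "/sessions", auth=auth,
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(payload))
session = r.json()
print(session["id"], session["state"])  # state starts as "starting"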
12. Interact with Livy server
Create session
curl --negotiate -u:livy -H "Content-Type: application/json" -X POST -d '{"kind":"pyspark", "proxyUser": "livy"}' -i http://cm1.localdomain:8998/sessions
curl --negotiate -u:livy -H "Content-Type: application/json" -X POST -d '{"kind":"spark", "proxyUser": "livy"}' -i http://cm1.localdomain:8998/sessions
Check for sessions with details
curl --negotiate -u:livy cm1.localdomain:8998/sessions | python -m json.tool
Note 1: when using Livy with a kerberized cluster, all commands must include --negotiate -u:user or --negotiate -u:user:password.
Note 2: to create a session for a different language, just change the highlighted field.
13. Interact with Livy server
Submit a job
curl -H "Content-Type: application/json" -X POST -d '{"code":"2 + 2"}' -i --negotiate -u:livy cm1.localdomain:8998/sessions/0/statements
{"id":0,"code":"2 + 2","state":"waiting","output":null,"progress":0.0}
Check result from statement
curl --negotiate -u:livy cm1.localdomain:8998/sessions/0/statements/0
{"id":0,"code":"2 + 2","state":"available","output":{"status":"ok","execution_count":0,"data":{"text/plain":"4"}},"progress":1.0}
14. Interact with Livy server
Submit another job
curl -H "Content-Type: application/json" -X POST -d '{"code":"println(sc.parallelize(1 to 5).collect())"}' -i --negotiate -u:livy http://cm1.localdomain:8998/sessions/1/statements
curl -H "Content-Type: application/json" -X POST -d '{"code":"a = 10"}' -i --negotiate -u:livy cm1.localdomain:8998/sessions/2/statements
curl -H "Content-Type: application/json" -X POST -d '{"code":"a + 1"}' -i --negotiate -u:livy cm1.localdomain:8998/sessions/2/statements
Delete a session
curl --negotiate -u:livy cm1.localdomain:8998/sessions/0 -X DELETE
Note: while submitting jobs or checking details, pay attention to the session number in the highlighted field, e.g. sessions/2.
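Sessions hold YARN resources until deleted, so in a multi-tenant setup it can help to sweep finished ones. A minimal sketch (state names per the Livy REST API; same assumptions as above):
import requests
from requests_kerberos import HTTPKerberosAuth

LIVY = "http://cm1.localdomain:8998"
auth = HTTPKerberosAuth()

sessions = requests.get(LIVY + "/sessions", auth=auth).json()["sessions"]
for s in sessions:
    if s["state"] in ("idle", "dead", "error"):
        requests.delete("%s/sessions/%d" % (LIVY, s["id"]), auth=auth)
        print("deleted session %d (%s)" % (s["id"], s["state"]))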
15. Zeppelin Architecture
Zeppelin is a multi-purpose notebook that enables:
● Data Ingestion & Discovery.
● Data Analytics.
● Data Visualization & Collaboration.
And with the Livy interpreter it enables Spark integration with multiple language backends.
16. Configure Zeppelin Machine
Download and Install UnlimitedJCEPolicyJDK8 from Oracle
wget http://www.oracle.com/technetwork/java/javase/downloads/jce8-download-2133166.html
unzip jce_policy-8.zip
sudo cp local_policy.jar US_export_policy.jar /usr/java/jdk1.8.0_131/jre/lib/security/
Note: confirm the Java directory and replace it in the highlighted field.
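To check that the unlimited policy is in effect, a minimal Python sketch driving the JDK's jrunscript (assumes jrunscript from the same JDK is on the PATH):
import subprocess

# With the unlimited JCE policy installed this prints 2147483647
subprocess.call(["jrunscript", "-e",
                 'print(javax.crypto.Cipher.getMaxAllowedKeyLength("AES"))'])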
17. Configure Zeppelin Machine
Assuming that the Zeppelin machine requires Kerberos authentication and Kerberos is not yet installed, here are quick steps for the installation and respective configuration.
Install Kerberos server and OpenLDAP client
sudo yum install -y krb5-server openldap-clients krb5-workstation
Set the Kerberos realm
sudo sed -i.orig 's/EXAMPLE.COM/DOMAIN.COM/g' /etc/krb5.conf
Set the hostname for the Kerberos server
sudo sed -i.m1 's/kerberos.example.com/cm1.localdomain/g' /etc/krb5.conf
Change the domain name
sudo sed -i.m2 's/example.com/DOMAIN.COM/g' /etc/krb5.conf
Note: replace your hostname and realm in the highlighted fields.
18. Configure Zeppelin Machine
Create the Kerberos database
sudo kdb5_util create -s
The ACL file needs to be updated so that */admin is enabled with admin privileges
sudo sed -i 's/EXAMPLE.COM/DOMAIN.COM/' /var/kerberos/krb5kdc/kadm5.acl
Update the kdc.conf file to allow renewable tickets
sudo sed -i.m3 '/supported_enctypes/a default_principal_flags = +renewable, +forwardable' /var/kerberos/krb5kdc/kdc.conf
Fix the indenting
sudo sed -i.m4 's/^default_principal_flags/ default_principal_flags/' /var/kerberos/krb5kdc/kdc.conf
19. Configure Zeppelin Machine
Update the kdc.conf file
sudo sed -i.orig 's/EXAMPLE.COM/DOMAIN.COM/g' /var/kerberos/krb5kdc/kdc.conf
The ACL file needs to be updated so that */admin is enabled with admin privileges
sudo sed -i 's/EXAMPLE.COM/DOMAIN.COM/' /var/kerberos/krb5kdc/kadm5.acl
Add a line to the file with the ticket life
sudo sed -i.m1 '/dict_file/a max_life = 1d' /var/kerberos/krb5kdc/kdc.conf
Add a max renewable life
sudo sed -i.m2 '/dict_file/a max_renewable_life = 7d' /var/kerberos/krb5kdc/kdc.conf
Indent the two new lines in the file
sudo sed -i.m3 's/^max_/ max_/' /var/kerberos/krb5kdc/kdc.conf
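A quick way to confirm the sed edits landed, as a minimal Python sketch (file path per the commands above):
conf = open("/var/kerberos/krb5kdc/kdc.conf").read()
for needle in ("max_life = 1d",
               "max_renewable_life = 7d",
               "default_principal_flags = +renewable, +forwardable",
               "DOMAIN.COM"):
    # Each line should now be present in kdc.conf
    print("%-50s %s" % (needle, needle in conf))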
20. Configure Zeppelin Machine
Start up the KDC server and the admin server
sudo service krb5kdc start;
sudo service kadmin start;
Make the Kerberos services autostart
sudo chkconfig kadmin on
sudo chkconfig krb5kdc on
21. Configure Zeppelin Machine
Create Kerberos Zeppelin principal and keytab
sudo kadmin.local <<eoj
addprinc -pw welcome1 zeppelin/cm1.localdomain@DOMAIN.COM
modprinc -maxrenewlife 1week zeppelin/cm1.localdomain@DOMAIN.COM
xst -norandkey -k /home/zeppelin/zeppelin.keytab zeppelin/cm1.localdomain@DOMAIN.COM
eoj
Set the hostname and make Zeppelin aware of the Livy/Cluster machine
sudo nano /etc/hosts
# Zeppelin IP HOST
10.222.33.200 cm2.localdomain
# Livy/Cluster IP HOST
10.222.33.100 cm1.localdomain
sudo hostname cm2.localdomain
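A minimal Python sketch to confirm both hosts resolve from the Zeppelin machine (IPs and hostnames are the deck's examples):
import socket

for host in ("cm1.localdomain", "cm2.localdomain"):
    # Should print the addresses configured in /etc/hosts above
    print("%s -> %s" % (host, socket.gethostbyname(host)))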
22. Configure Zeppelin Machine
Set the hostname and make Zeppelin aware of the Livy machine
sudo nano /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=cm2.localdomain
NTPSERVERARGS=iburst
Disable SELinux
sudo nano /etc/selinux/config
SELINUX=disabled
sudo setenforce 0
Clean iptables rules
sudo iptables -F
sudo nano /etc/rc.local
iptables -F
Make rc.local executable so the operation runs at startup
sudo chmod +x /etc/rc.d/rc.local
Save iptables rules on restart
sudo nano /etc/sysconfig/iptables-config
# Save current firewall rules on restart.
IPTABLES_SAVE_ON_RESTART="yes"
Disable the firewall
sudo systemctl disable firewalld;
sudo systemctl stop firewalld;
Note: after all these operations a restart is recommended.
23. Configure Zeppelin Machine
Create User Zeppelin
sudo useradd zeppelin
sudo passwd zeppelin
Add user zeppelin to sudoers
sudo nano /etc/sudoers
## Same thing without a password
# %wheel ALL=(ALL) NOPASSWD: ALL
zeppelin ALL=(ALL) NOPASSWD: ALL
Download and Install Zeppelin
su zeppelin
cd ~
wget http://mirrors.up.pt/pub/apache/zeppelin/zeppelin-0.7.3/zeppelin-0.7.3-bin-all.tgz
tar -zxvf zeppelin-0.7.3-bin-all.tgz
cd /home/
sudo chown -R zeppelin:zeppelin zeppelin/
Create Zeppelin environment variables
cd /home/zeppelin/zeppelin-0.7.3-bin-all/conf
cp zeppelin-env.sh.template zeppelin-env.sh
cp zeppelin-site.xml.template zeppelin-site.xml
Export Java properties
export JAVA_HOME=/usr/java/jdk1.7.0_67
Note: in the highlighted fields replace with your chosen IDE and your available Java installation.
24. Configure Zeppelin Machine
Add user zeppelin and the authentication type in the configuration file (shiro.ini)
cd /home/zeppelin/zeppelin-0.7.3-bin-all/conf/
nano shiro.ini
[users]
# List of users with their password allowed to access Zeppelin.
# To use a different strategy (LDAP / Database / ...) check the shiro doc at http://shiro.apache.org/configuration.html
admin = welcome1, admin
zeppelin = welcome1, admin
user2 = password3, role3
…
[urls]
# This section is used for url-based security.
# anon means the access is anonymous.
# authc means Form based Auth Security
# To enforce security, comment the line below and uncomment the next one
/api/version = authc
/api/interpreter/** = authc, roles[admin]
/api/configurations/** = authc, roles[admin]
/api/credential/** = authc, roles[admin]
#/** = anon
/** = authc
25. Interact with Zeppelin
Kinit the user
cd /home/zeppelin/
kinit -kt zeppelin.keytab zeppelin/cm1.localdomain@DOMAIN.COM
Start/Stop Zeppelin
cd ~/zeppelin-0.7.3-bin-all
sudo ./bin/zeppelin-daemon.sh start
sudo ./bin/zeppelin-daemon.sh stop
Open the Zeppelin UI
http://cm2.localdomain:8080/#/
Note: change to your hostname and domain in the highlighted field.
Log in as the zeppelin user
Create a Livy notebook
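Since shiro.ini above protects /api/version with authc, you can verify the setup over Zeppelin's REST API. A minimal Python sketch (credentials per the shiro.ini example):
import requests

s = requests.Session()
r = s.post("http://cm2.localdomain:8080/api/login",
           data={"userName": "zeppelin", "password": "welcome1"})
print(r.status_code)  # expect 200 on successful login
# The session cookie now authorizes the protected endpoint
print(s.get("http://cm2.localdomain:8080/api/version").json())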
26. Interact with Zeppelin
Configure Livy Interpreter
zeppelin.livy.keytab: /home/zeppelin/zeppelin.keytab
zeppelin.livy.principal: zeppelin/cm1.localdomain@DOMAIN.COM
zeppelin.livy.url: http://cm1.localdomain:8998
Using Livy Interpreter
spark
%livy.spark
sc.version
sparkR
%livy.sparkr
hello <- function( name ) {
sprintf( "Hello, %s", name );
}
hello("livy")
pyspark
%livy.pyspark
print "1"
27. Interact with Zeppelin
Using Livy Interpreter
%pyspark
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
hiveCtx.sql("show databases").show()
hiveCtx.sql("select current_user()").show()
Note: due to Livy impersonation we will see every database in the Hive metadata, but only a valid user can access the corresponding data.
%pyspark
from pyspark.sql import HiveContext
hiveCtx = HiveContext(sc)
hiveCtx.sql("select * from notMyDB.TAB_TPS").show()
hiveCtx.sql("Create External Table myDB.TAB_TST (Operation_Type String, Operation String)")
hiveCtx.sql("Insert Into Table myDB.TAB_TST select 'ZEPPELIN','FIRST'")
hiveCtx.sql("select * from myDB.TAB_TST").show()
28. Interact with Zeppelin
Using Livy Interpreter
%livy.pyspark
from pyspark.sql import HiveContext
sc._conf.setAppName("Zeppelin-HiveOnSpark")
hiveCtx = HiveContext(sc)
hiveCtx.sql("set yarn.nodemanager.resource.cpu-vcores=4")
hiveCtx.sql("set yarn.nodemanager.resource.memory-mb=16384")
hiveCtx.sql("set yarn.scheduler.maximum-allocation-vcores=4")
hiveCtx.sql("set yarn.scheduler.minimum-allocation-mb=4096")
hiveCtx.sql("set yarn.scheduler.maximum-allocation-mb=8192")
hiveCtx.sql("set spark.executor.memory=1684354560")
hiveCtx.sql("set spark.yarn.executor.memoryOverhead=1000")
hiveCtx.sql("set spark.driver.memory=10843545604")
hiveCtx.sql("set spark.yarn.driver.memoryOverhead=800")
hiveCtx.sql("set spark.executor.instances=10")
hiveCtx.sql("set spark.executor.cores=8")
hiveCtx.sql("set hive.map.aggr.hash.percentmemory=0.7")
hiveCtx.sql("set hive.limit.pushdown.memory.usage=0.5")
countryList = hiveCtx.sql("select distinct country from myDB.SALES_WORLD")
countryList.show(4)
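The settings above are issued through SQL inside an already-running session; resource sizes such as executor memory generally have to be fixed before the YARN containers start. An alternative sketch, sizing the session up front when it is created through Livy's REST API (field names per Livy's POST /sessions body; hostname and user are this deck's examples):
import json
import requests
from requests_kerberos import HTTPKerberosAuth

payload = {
    "kind": "pyspark",
    "proxyUser": "tpsimoes",
    "driverMemory": "4g",    # applied at session creation, before YARN containers start
    "executorMemory": "2g",
    "executorCores": 4,
    "numExecutors": 10,
}
r = requests.post("http://cm1.localdomain:8998/sessions",
                  headers={"Content-Type": "application/json"},
                  data=json.dumps(payload), auth=HTTPKerberosAuth())
print(r.json()["id"], r.json()["state"])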