MODULE 2
Installing Spark
• High-level physical cluster architecture
• Software architecture of a standalone cluster
• Installing Spark standalone locally
• Running the Spark Shell
• Running in python shell and ipython shell
• Running sample Spark code
• Using the Spark Session and Spark Context
• Creating a parallelized collection
Installing Spark
Module 2. Installing Spark 2
What we’ll cover:
Spark Cluster Components
Module 2. Installing Spark 3
Driver Program
Spark Session
● Spark Context
Worker Node
Executor
● Cache
● Task
● Task
● ...
Cluster Manager
Worker Node
Executor
● Cache
● Task
● Task
● ...
• Driver Node on the Cluster Manager, aka Master Node
• Driver Node in a Spark Cluster
• Driver Program
• Driver Process
“Driver”
Module 2. Installing Spark 4
• Cluster Mode
o submit application to cluster manager
• Client Mode
o run the driver outside the cluster, with executors in the cluster
• Standalone Mode
o everything is on one machine
Execution Modes
Module 2. Installing Spark 5
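For concreteness, here is a minimal sketch of how these modes are typically selected when launching an application with spark-submit. The master URL, host name, and application file below are placeholders, not values from this course; also note that spark-submit calls the single-machine case "local" mode, while these slides use "standalone" for a single-machine install.
# Client mode (the default): the driver runs on the machine that called spark-submit,
# while executors run in the cluster (master URL is a placeholder)
$ spark-submit --master spark://some-host:7077 --deploy-mode client my_app.py
# Cluster mode: add --deploy-mode cluster so the cluster manager also hosts the driver
# (note: the standalone cluster manager does not accept Python applications in cluster mode)
# Local / single-machine mode: everything runs in one JVM on this machine, using all cores
$ spark-submit --master "local[*]" my_app.py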
Standalone Spark Cluster
Module 2. Installing Spark 6
Driver Program
Spark Session
● Spark Context
Worker Node
Executor
● Cache
● Task
● Task
● ...
Cluster Manager
• Ensure that you have Java installed, and verify your version
• Most recent version you can install with Java 7: version 2.1.x
• If you have Java 8+ installed you can install version 2.2.x
Installing Local Standalone
Module 2. Installing Spark 7
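A minimal way to do that check is shown below; the version string is example output only and will differ on your machine. Anything reporting 1.7.x limits you to Spark 2.1.x.
$ java -version
# example output (yours will vary):
# java version "1.8.0_151"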
• Work within a virtual environment or container
• If you are already familiar with this you can skip the next three slides
Recommendation for Success
Module 2. Installing Spark 8
• This is not required but is strongly recommended.
• a virtual environment creates an isolated environment for a project
o each project can have its own dependencies and not affect other projects
o you can easily and safely remove this environment when you are done with it
• Examples include Virtualenv and Anaconda, but there are many good options
• You may also use a container-based environment, such as Docker.
• Use your preferred approach or organization’s recommended practice
Creating a virtual environment
Module 2. Installing Spark 9
Example : creating a virtual environment.
Anaconda
$ conda create -n my-env python
$ source activate my-env
# Here is the same thing, but specifying Python 2.7
$ conda create -n my-env python=2.7
$ source activate my-env
# Here is the same thing, but with Python 3.6
$ conda create -n my-env python=3.6
$ source activate my-env
Module 2. Installing Spark 10
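As noted on the previous slide, one advantage of a virtual environment is that it can be removed safely when you are finished. With Anaconda that looks roughly like the following sketch (my-env matches the name used above):
$ source deactivate
$ conda env remove -n my-env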
Example : creating a virtual environment.
Virtualenv
$ cd /path/to/my/venvs
$ virtualenv ./my_venv
$ source ./my_venv/bin/activate
Module 2. Installing Spark 11
• The most recent version of Pyspark you can install with Java 7 is version 2.1.x
• If you have Java 8+ installed you can install Pyspark 2.2.x
Install Java
Module 2. Installing Spark 12
The simple way to install Pyspark
Recommended for Spark 2.2.x onwards
$ pip install pyspark
Module 2. Installing Spark 13
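If the pip route works for you, a quick sanity check is to confirm the package imports and report its version; run this inside the same virtual environment (the output shown is only an example of what you might see):
$ python -c "import pyspark; print(pyspark.__version__)"
2.2.1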
• Download source and Build
o Browse to https://spark.apache.org/downloads.html
o Choose the following settings :
Installing Pyspark 2.1.2
Module 2. Installing Spark 14
o Click on spark-2.1.2.tgz
Installing Pyspark 2.1.2
Module 2. Installing Spark 15
o Save this to /ext/spark
Installing Pyspark 2.1.2
Recommended for following along in this course
$ curl http://apache.mirrors.lucidnetworks.net/spark/spark-2.1.2/spark-2.1.2.tgz -o spark-2.1.2.tgz
# Confirm that the file size is as expected (~13MiB)
$ ls -lh spark-2.1.2.tgz
# Extract the contents
$ tar -xf spark-2.1.2.tgz
$ cd /ext/spark/spark-2.1.2
Module 2. Installing Spark 16
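Optionally, before extracting you can also verify the archive against the checksum file published alongside it on the Apache download page. This step is not in the slides; the command below is a sketch (on Linux you may have sha512sum instead of shasum):
$ shasum -a 512 spark-2.1.2.tgz
# compare the printed digest with the checksum file linked next to the download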
• Set Java home environment variable
o This is necessary only if Java home is not already set.
Installing Pyspark 2.1.2
Module 2. Installing Spark 17
Installing Pyspark 2.1.2
Set $JAVA_HOME
$ echo $JAVA_HOME
# If the above returns empty, then you’ll need to set it. On Mac OS, you can use the result of :
$ echo `/usr/libexec/java_home`
like so :
$ export JAVA_HOME=`/usr/libexec/java_home`
Module 2. Installing Spark 18
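On Linux there is no /usr/libexec/java_home helper; a common equivalent, assuming the java on your PATH is a symlink into the JDK you want to use, is the following sketch:
$ export JAVA_HOME=$(dirname $(dirname $(readlink -f $(which java))))
$ echo $JAVA_HOME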
• Get the build instructions
o It is good practice to vet these yourself before running.
o Understand what you are going to do before doing it
o Spark evolves rapidly, so know how to upgrade!
• Browse to : https://spark.apache.org/documentation.html
o Scroll down to your version, click into that link
o For our case, this is https://spark.apache.org/docs/2.1.2/building-spark.html
o “Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.”
o “Note that support for Java 7 was removed as of Spark 2.2.0.”
• Now to build it
Installing Pyspark 2.1.2
Module 2. Installing Spark 19
Installing Pyspark 2.1.2
Build it
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
$ ./build/mvn -DskipTests clean package
[INFO] Total time: 11:11 min
$ sudo pip install py4j
Module 2. Installing Spark 20
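Once the build finishes, a simple smoke test is to run one of the examples bundled with the distribution; the run-example launcher ships in the bin directory, and the trailing 10 is just the number of partitions used for the Pi estimate (output abbreviated, shown only as an example):
$ ./bin/run-example SparkPi 10
# ...
# Pi is roughly 3.14...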
Installing Pyspark 2.1.2
Pro tip : turn down overly verbose logs
$ cd /ext/spark/spark-2.1.2
$ cp conf/log4j.properties.template conf/log4j.properties
Module 2. Installing Spark 21
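Copying the template only creates the file; to actually quiet the console you then edit conf/log4j.properties and lower the root log level. WARN is a common choice (ERROR is quieter still):
# in conf/log4j.properties, change the first setting from INFO to WARN
log4j.rootCategory=WARN, console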
• Pro tip: ensure the following lines are in ~/.bash_profile
o export SPARK_HOME="/ext/spark/spark-2.1.2"
o export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PATH
• If you altered ~/.bash_profile, remember to run the following:
o $ source ~/.bash_profile
Done Installing Spark
Module 2. Installing Spark 22
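After sourcing ~/.bash_profile, you can confirm that the shell now resolves this install (the paths shown assume the /ext/spark location used throughout this module):
$ echo $SPARK_HOME
/ext/spark/spark-2.1.2
$ which pyspark
/ext/spark/spark-2.1.2/bin/pyspark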
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>>
Module 2. Installing Spark 23
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>
Module 2. Installing Spark 24
Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>
Module 2. Installing Spark 25
Running Pyspark Shell
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']
Module 2. Installing Spark 26
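Beyond count, collect, and take, you can try a simple transformation on the same RDD. This is not in the slides, just a small extra check that your lambdas are shipped to the executor processes correctly:
>>> print(rdd.map(lambda s: s.upper()).collect())
['HELLO', 'WORLD']
>>> print(rdd.filter(lambda s: s.startswith("H")).collect())
['Hello']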
Running Pyspark in python shell
$ python
>>>
Module 2. Installing Spark 27
Running Pyspark in python shell
We’ll simplify this later -- for now let’s ensure we’re using what we just installed
$ python
>>> import sys
>>> import os
>>> SPARK_HOME = '/ext/spark/spark-2.1.2'
>>> SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
>>> sys.path.insert(0, os.path.join(SPARK_HOME, "python")) # precede pre-existing
>>> sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J))
>>> os.environ["SPARK_HOME"] = SPARK_HOME
>>>
Module 2. Installing Spark 28
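An alternative to editing sys.path by hand, not used in these slides, is the third-party findspark package, which locates a Spark install and wires up the paths for you (this assumes you are willing to add the extra dependency):
>>> # pip install findspark   (extra dependency, not part of this course's setup)
>>> import findspark
>>> findspark.init('/ext/spark/spark-2.1.2')
>>> import pyspark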
Running Pyspark in python shell
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("Python Spark SQL basic example") \
...     .config("spark.some.config.option", "some-value") \
...     .getOrCreate()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> spark
<pyspark.sql.session.SparkSession object at 0x1112e4650>
Module 2. Installing Spark 29
Running Pyspark in python shell
>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']
Module 2. Installing Spark 30
Running Pyspark in ipython shell
Repeat the steps we did in the python shell :
$ ipython
In [1]: import os
...: import sys
...: SPARK_HOME = '/ext/spark/spark-2.1.2'
...: SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
...: sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
...: sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J)) # must precede IDE's py4j
...:
Module 2. Installing Spark 31
Running Pyspark in ipython shell
In [2]: java_home = "/Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home"
...: os.environ["JAVA_HOME"] = java_home
...: os.environ["SPARK_HOME"] = SPARK_HOME
...: from pyspark.sql import SparkSession
...:
Module 2. Installing Spark 32
Running Pyspark in ipython shell
In [3]: spark = SparkSession \
   ...:     .builder \
   ...:     .appName("Python Spark SQL basic example") \
   ...:     .config("spark.some.config.option", "some-value") \
   ...:     .getOrCreate()
...:
Module 2. Installing Spark 33
Running Pyspark in ipython shell
In [4]: spark?
Type: SparkSession
String form: <pyspark.sql.session.SparkSession object at 0x1111f5f10>
File: /ext/spark/spark-2.1.2/python/pyspark/sql/session.py
Docstring:
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:
[...]
Module 2. Installing Spark 34
Running Pyspark in ipython shell
Running the example code that we ran previously
In [5]: sc = spark.sparkContext
...: rdd = sc.parallelize(["Hello", "World"])
...: print(rdd.count())
...: print(rdd.collect())
...: print(rdd.take(2))
...:
2
['Hello', 'World']
['Hello', 'World']
Module 2. Installing Spark 35
Running Pyspark in ipython shell
In [6]: sc?
Type: SparkContext
String form: <pyspark.context.SparkContext object at 0x1111702d0>
File: /ext/spark/spark-2.1.2/python/pyspark/context.py
Docstring:
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create L{RDD} and
broadcast variables on that cluster.
Init docstring:
Create a new SparkContext. At least the master and app name should be set,
either through the named parameters here or through C{conf}.
Module 2. Installing Spark 36
Running Pyspark in ipython shell
In [7]: rdd?
Type: RDD
String form: ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475
File: /ext/spark/spark-2.1.2/python/pyspark/rdd.py
Docstring:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be
operated on in parallel.
Module 2. Installing Spark 37
Running Pyspark in ipython shell
Tab completion
In [8]: rdd.
rdd.aggregate rdd.coalesce rdd.context rdd.countByValue
rdd.aggregateByKey rdd.cogroup rdd.count rdd.ctx
rdd.cache rdd.collect rdd.countApprox rdd.distinct
rdd.cartesian rdd.collectAsMap rdd.countApproxDistinct rdd.filter
rdd.checkpoint rdd.combineByKey rdd.countByKey rdd.first
Module 2. Installing Spark 38
Running Pyspark in ipython shell
Magic functions
In [9]: %whos
Variable Type Data/Info
----------------------------------------
SPARK_HOME str /ext/spark/spark-2.1.2
SPARK_PY4J str python/lib/py4j-0.10.4-src.zip
SparkSession type <class 'pyspark.sql.session.SparkSession'>
java_home str /Library/Java/JavaVirtual<...>.7.0_80.jdk/Contents/Home
rdd RDD ParallelCollectionRDD[0] <...>ze at PythonRDD.scala:475
sc SparkContext <pyspark.context.SparkCon<...>xt object at 0x1111702d0>
spark SparkSession <pyspark.sql.session.Spar<...>on object at 0x1111f5f10>
Module 2. Installing Spark 39
• We learned several ways of installing Pyspark
• We ran the Spark Shell
• We showed how to run our new install in python and ipython
• We tested Pyspark by running sample code in all three of these shells
• We used two important objects: the Spark Session, and the Spark Context
• We created an RDD and inspected its contents.
What we covered
Module 2. Installing Spark 40
• Next time we’ll cover more ways of running Spark.
• We’ll show it in an IDE, and notebook.
• We’ll get our first view of the Spark UI
Next time: Running Spark
Module 2. Installing Spark 41


Editor's Notes

  1. Section Beginning (Dark Color Option )
  2. This module is geared toward getting your own local standalone version of Spark running. Several exercises get you working with your new Spark instance. This module is important as you will need an environment on which to complete the hands-on exercises in later modules. copyright Mark E Plutowski 2018
  3. https://spark.apache.org/docs/2.1.2/cluster-overview.html The cluster manager has its own abstractions which are separate from Spark but can use the same names ‘driver’ node, sometimes called its ‘master’ node This is different from the Spark Driver and the Driver Program ‘worker’ nodes These are tied to physical machines, whereas in Spark they are tied to processes. A Spark Application requests resources from the Cluster Manager depending on the application, this could include a place to run the Spark Driver Program or, just resources for running the Executors Here a Worker Node contains a single Executor. It can have more. How many executor processes run for each worker node in spark? If using Dynamic Allocation, Spark will decide this for you. Or, you can stipulate this When and How this is done is outside the scope of this course, however, this course will prepare you to dig deeper into the answers to this question. copyright Mark E Plutowski 2018
  4. The Cluster Manager has its own notion of Driver Node, not to be confused with Spark’s Driver Node The Driver Node in a cluster is the one that is running the Driver Program Driver Program The Driver Program is sometimes used interchangeably to refer to the Spark Application, and to code being executed within the Driver Process created within a Spark Session. Later in this module, we’ll illustrate this by visualizing the Spark Driver in the Spark UI The Driver Program declares the transformations and actions on data The main() method in the Spark application runs on the Driver, This is similar to but distinct from an Executor See the Cluster Overview documentation for reference http://spark.apache.org/docs/latest/cluster-overview.html Pro tip: to avoid confusion, when it isn’t apparent from context be clear what you mean when you say “Driver” when referring to the lines of code that comprise the Spark Application, I say Spark Application code or script when referring to the Driver object that is created by the Spark Session, I say Driver Process when referring to the server in a cluster that is running the driver, I say Master Node in writing, when necessary to disambiguate how you are using the term “Spark Driver” or “Driver Program”, provide a link to Apache Spark documentation for the particular usage you intend. We will see the Spark Driver Process in action using the Spark UI later in this module copyright Mark E Plutowski 2018
  5. We cover Execution Modes in more depth in Module 3 -- and I’ll touch on them briefly now. Cluster mode: the cluster manager places the driver on a worker node and the executors on other worker nodes; the application is provided as a Jar, Egg, or application script (python, scala, R). Client mode: same as cluster mode except that the driver stays on the client machine that submitted the application; you may be running the application from a machine not colocated with the workers in the cluster, aka a “gateway machine” or “edge node”. Local: runs everything on a single machine. You’ve seen how to create a Spark application that runs just as a traditional executable, but you’ve also used spark-submit --- which is a primary way to launch jobs on a Spark cluster. By “job” here I mean executing a Driver Program. copyright Mark E Plutowski 2018
  6. In the standalone mode we install today these components will run on your single laptop or server. https://spark.apache.org/docs/2.1.2/cluster-overview.html copyright Mark E Plutowski 2018
  7. You can also use the Databricks Community Edition, which I will demonstrate in the next module. There are many excellent guides https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f http://sigdelta.com/blog/how-to-install-pyspark-locally/ You have a lot of flexibility here. There is a quick and easy installation On the other end of this scale is a full blown cluster configuration. What we will install here is a happy medium : fairly easy with ample flexibility copyright Mark E Plutowski 2018
  8. The main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has. You can use Virtualenv or Anaconda (also referred to as conda, because that is what it uses on the command line to invoke commands); I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended. There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or another virtual environment. You may also use a container-based environment, such as Docker. Rather than impose one choice, I leave that to you, as typically each organization has its own choice for this and its own set of best practices. copyright Mark E Plutowski 2018
  9. The main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has. You can use Virtualenv or Anaconda (also referred to as conda, because that is what it uses on the command line to invoke commands); I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended. There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or another virtual environment. You may also use a container-based environment, such as Docker. Rather than impose one choice, I leave that to you, as typically each organization has its own choice for this and its own set of best practices. I emphasize this recommendation because of the potentially wide audience, with apologies to the many of you who already know this. As a software engineer you should be aware of dependencies between projects. Also, running the exact version that I am demonstrating is essential in case you have issues with your install, in order to debug your setup. copyright Mark E Plutowski 2018
  10. copyright Mark E Plutowski 2018
  11. copyright Mark E Plutowski 2018
  12. In the following, bulleted notes are instructions for what to do in a browser, whereas black-background slides contain command lines to be run in a terminal console. copyright Mark E Plutowski 2018
  13. It can be that easy. However, depending on your particular environment and the version you need, this could end up being limiting or more involved. If this works for you, great! You can skip the next 10 slides. If not, continue on -- it will take less than 15 minutes. In what follows I am going to provide instructions for installing from source. This allows complete access to the source, example code, and example datasets. If you want to be able to reproduce exactly what I do, please follow these steps. Otherwise, you may skip ahead. copyright Mark E Plutowski 2018
  14. This page gives instructions to download, build, and configure Spark 2.1.2 in standalone mode from source. I encourage you to use the latest, 2.2.x; however, version 2.1.2 is well tested. It also runs on Java 7, whereas 2.2.x requires Java 8+. Also, if you want to ensure that your examples run as closely to mine as possible, use the setup instructions offered here. Standalone mode runs on a single server and is a breeze to use within the Python shell, IPython, an IDE, or from the command line. These directions work on Ubuntu or Mac OS. copyright Mark E Plutowski 2018
  15. Save this to /ext/spark
  16. The first line is unnecessary if you did this as a save-as from within the browser. You could also use your file browser to inspect the size instead of using the second line. This is to ensure that you didn’t mistakenly download the wrong file when doing a save-as from the browser. Extract the contents and change directory.
  17. This page gives instructions to download, build, and configure Spark 2.1.2 in standalone mode from source. Standalone mode runs on a single server and is a breeze to use within the Python shell, IPython, an IDE, or from the command line. These directions work on Ubuntu or Mac OS. Why not just upgrade to Java 8+? If you are applying this to a production environment that still relies on Java 7, then upgrading could delay getting your application into production if it utilizes features that depend on Java 8+. Check the version of Java being utilized in your deployment environment, as well as the versions of Python, Scala, or R, if you are going to be using those for scripting your Spark application. copyright Mark E Plutowski 2018
  18. copyright Mark E Plutowski 2018
  19. Always Read the Notes. You might need to perform an additional installation to proceed, depending on your operating system, your environment, the version you chose, etc. copyright Mark E Plutowski 2018
  20. The build/mvn line will take ten to fifteen minutes. copyright Mark E Plutowski 2018
  21. The base install uses logging settings that are overly verbose. To enable settings that are less verbose, do this step. This also points you to where other logging configuration settings can be made.
  22. That is probably the trickiest part of this course. Once you have Spark installed, you should be able to follow along from here. If you have Pyspark 2.1.2 installed, then all subsequent code examples should work identically for you. Of course, there are always exceptions, depending upon your particular setup; however, we are probably past the hardest part. Hopefully it wasn’t too tricky for any of you. For many of you this should have been pretty straightforward. If you chose the quick and easy installation path, and are rejoining us now, welcome back! Do make sure that you know where to find the code examples and example datasets that we’ll be referring to subsequently, which were downloaded along with the source. copyright Mark E Plutowski 2018
  23. That lowest line is important -- let’s see what this means. Enter spark at the prompt, like so ... copyright Mark E Plutowski 2018
  24. This is your Spark Session -- it provides the point of entry for interacting with Spark. It also provides access to the Spark Context, the entry point for working with RDDs (commonly referred to using the variable sc), and to the SQL Context (commonly referred to using the variable sqlContext). copyright Mark E Plutowski 2018
  25. This is your Spark Session -- it provides the point of entry for interacting with Spark. It also provides access to the Spark Context, the entry point for working with RDDs (commonly referred to using the variable sc), and to the SQL Context (commonly referred to using the variable sqlContext). copyright Mark E Plutowski 2018
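  A minimal sketch of how these objects relate, assuming the Spark 2.x pyspark shell, where the variables spark, sc, and sqlContext are already defined for you (outside the shell you must create them yourself):
  spark.version              # e.g. '2.1.2'
  sc is spark.sparkContext   # True -- the Session exposes the same Spark Context
  type(sqlContext)           # pyspark.sql.context.SQLContext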
  26. We obtained a handle to the Spark Context from the Spark Session, created an RDD (which is a type of parallelized collection), confirmed that it contains two rows, displayed its contents, and then displayed its contents in a different way. A sketch of those steps appears below. copyright Mark E Plutowski 2018
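  Here is a hedged sketch of those steps as they might look at the pyspark prompt; the two rows of data are illustrative, not the exact values shown on the slide:
  sc = spark.sparkContext                            # handle to the Spark Context from the Spark Session
  rdd = sc.parallelize([("alice", 1), ("bob", 2)])   # an RDD: a parallelized collection
  rdd.count()                                        # confirm it contains two rows -> 2
  rdd.collect()                                      # display its contents as a list
  rdd.take(2)                                        # display its contents a different way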
  27. I’ll show you a way to use the version of Pyspark that you just installed. If you already have your environment variables and path settings configured, you may not need this step. copyright Mark E Plutowski 2018
  28. This helps ensure that Python is using the version we installed. If you set your environment variables and path settings properly, you can skip this step; however, this is a way to make sure that you are using the version we installed. We’ll simplify this in a later module. Note that we inserted the path instead of appending it -- this avoids very weird bugs that can arise, especially in an IDE that already has a different version of Py4j installed. A sketch of this step appears below. copyright Mark E Plutowski 2018
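  A minimal sketch of what that looks like in a plain python shell; the install location /ext/spark/spark-2.1.2 and the bundled Py4j version are assumptions based on the earlier download step, so adjust them to match where you extracted Spark:
  import sys
  SPARK_HOME = "/ext/spark/spark-2.1.2"                               # assumed install location
  sys.path.insert(0, SPARK_HOME + "/python")                          # insert, not append
  sys.path.insert(0, SPARK_HOME + "/python/lib/py4j-0.10.4-src.zip")  # Py4j version is an assumption
  import pyspark                                                      # should now resolve to the build above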
  29. Note that the Spark Shell gives you an already-instantiated Spark Session variable, ‘spark’. Here, we need to create that ourselves, as sketched below. copyright Mark E Plutowski 2018
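  A hedged sketch of creating the Spark Session yourself; the application name and master setting are illustrative choices, not values taken from the slide:
  from pyspark.sql import SparkSession
  spark = (SparkSession.builder
           .master("local[*]")            # run locally, using all available cores
           .appName("module2-example")    # illustrative application name
           .getOrCreate())
  sc = spark.sparkContext                 # the Spark Context, as before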
  30. This runs the same example we ran previously in the Spark shell. Note that it is exactly the same: once you get your development environment configured, you can typically run things in one place or the other without modification. We obtained a handle to the Spark Context from the Spark Session, created an RDD (which is a type of parallelized collection), confirmed that it contains two rows, displayed its contents, and then displayed its contents in a different way. copyright Mark E Plutowski 2018
  31. Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  32. Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  33. Repeat the steps we used in the python shell. copyright Mark E Plutowski 2018
  34. Repeat the steps we used in the python shell. Note that we can now inspect the object using object? notation, as in the example below. I clipped 20+ lines of the output it generated, which gives additional tips on how to use this object. copyright Mark E Plutowski 2018
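  For example (an illustrative IPython session; the rdd variable is the one created in the earlier snippet):
  In [1]: rdd?   # appending ? shows the object’s type, docstring, and source location
  In [2]: sc?    # the same works for the Spark Context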
  35. We of course get identical results using the code snippet we used to test it in the Spark shell and the python shell. copyright Mark E Plutowski 2018
  36. This tells us more about the sc variable we created. This is the Spark Context object. copyright Mark E Plutowski 2018
  37. This tells us more about the rdd variable we created, which is an RDD. copyright Mark E Plutowski 2018
  38. And of course we get the other niceties of the ipython shell, such as tab completion. copyright Mark E Plutowski 2018
  39. And of course we get the other niceties of the ipython shell, such as magic functions. A couple of examples are shown below. copyright Mark E Plutowski 2018
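  For instance (an illustrative IPython session; these are standard IPython magics, applied to the rdd created earlier):
  In [1]: %time rdd.count()     # time a single statement
  In [2]: %timeit rdd.count()   # time it over repeated runs
  In [3]: %history              # list the commands entered so far in this session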
  40. copyright Mark E Plutowski 2018
  41. The next module shows how to run a Spark application. It shows more ways you can work with Spark, including within an IDE and a notebook. We’ll get our first view of the Spark UI, using it to inspect a running application. copyright Mark E Plutowski 2018