4. • Driver Node on the Cluster Manager, aka Master Node
• Driver Node in a Spark Cluster
• Driver Program
• Driver Process
“Driver”
Module 2. Installing Spark 4
5. • Cluster Mode
o submit the application to the cluster manager; the driver runs inside the cluster
• Client Mode
o run the driver outside the cluster, with executors in the cluster
• Local (Standalone) Mode
o everything runs on a single machine
Execution Modes
Module 3. Spark Computing Framework 5
7. • Ensure that you have Java installed, and verify your version
• Most recent version you can install with Java 7: version 2.1.x
• If you have Java 8+ installed you can install version 2.2.x
Installing Local Standalone
Module 2. Installing Spark 7
8. • Work within a virtual environment or container
• If you are already familiar with this you can skip the next three slides
Recommendation for Success
Module 2. Installing Spark 8
9. • This is not required but is strongly recommended.
• A virtual environment creates an isolated environment for a project
o each project can have its own dependencies and not affect other projects
o you can easily and safely remove this environment when you are done with it
• Examples include Virtualenv and Anaconda, but there are many other good options
• You may also use a container-based environment, such as Docker.
• Use your preferred approach or your organization’s recommended practice
Creating a virtual environment
Module 2. Installing Spark 9
10. Example: creating a virtual environment.
Anaconda
$ conda create -n my-env python
$ source activate my-env
# Here is the same thing, but specifying Python 2.7
$ conda create -n my-env python=2.7
$ source activate my-env
# Here is the same thing, but with Python 3.6
$ conda create -n my-env python=3.6
$ source activate my-env
Module 2. Installing Spark 10
11. Example: creating a virtual environment.
Virtualenv
$ cd /path/to/my/venvs
$ virtualenv ./my_venv
$ source ./my_venv/bin/activate
Module 2. Installing Spark 11
12. • The most recent version of Pyspark you can install with Java 7 is version 2.1.x
• If you have Java 8+ installed you can install Pyspark 2.2.x
Install Java
Module 2. Installing Spark 12
13. The simple way to install Pyspark
Recommended for Spark 2.2.x onwards
$ pip install pyspark
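Not in the original slide: a quick, hedged way to confirm the pip install worked. The exact version string depends on what pip resolved.
$ python
>>> import pyspark
>>> pyspark.__version__   # prints the installed version, e.g. '2.2.0'
>>> exit()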
Module 2. Installing Spark 13
• Download the source and build it
o Browse to https://spark.apache.org/downloads.html
o Choose the following settings:
Installing Pyspark 2.1.2
Module 2. Installing Spark 14
o Click on spark-2.1.2.tgz
16. Installing Pyspark 2.1.2
Recommended for following along in this course
$ curl http://apache.mirrors.lucidnetworks.net/spark/spark-2.1.2/spark-2.1.2.tgz -o spark-2.1.2.tgz
# Confirm that the file size is as expected (~13MiB)
$ ls -lh spark-2.1.2.tgz
# Extract the contents
$ tar -xf spark-2.1.2.tgz
$ cd /ext/spark/spark-2.1.2
Module 2. Installing Spark 16
17. • Set Java home environment variable
o This is necessary only if Java home is not already set.
Installing Pyspark 2.1.2
Module 2. Installing Spark 17
18. Installing Pyspark 2.1.2
Set $JAVA_HOME
$ echo $JAVA_HOME
# If the above returns empty, you’ll need to set it. On Mac OS, you can use the result of:
$ echo `/usr/libexec/java_home`
like so:
$ export JAVA_HOME=`/usr/libexec/java_home`
Module 2. Installing Spark 18
19. • Get the build instructions
o It is good practice to vet these yourself before running.
o Understand what you are going to do before doing it
o Spark evolves rapidly, so know how to upgrade!
• Browse to: https://spark.apache.org/documentation.html
o Scroll down to your version and click its link
o For our case, this is https://spark.apache.org/docs/2.1.2/building-spark.html
o “Building Spark 2.2.1 using Maven requires Maven 3.3.9 or newer and Java 7+.”
o “Note that support for Java 7 was removed as of Spark 2.2.0.”
• Now to build it
Installing Pyspark 2.1.2
Module 2. Installing Spark 19
20. Installing Pyspark 2.1.2
Build it
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
$ ./build/mvn -DskipTests clean package
[INFO] Total time: 11:11 min
$ sudo pip install py4j
Module 2. Installing Spark 20
21. Installing Pyspark 2.1.2
Pro tip: turn down overly verbose logs
$ cd /ext/spark/spark-2.1.2
$ cp conf/log4j.properties.template conf/log4j.properties
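Aside, not on the original slide: once you have a SparkContext (the sc variable you’ll see shortly), you can also lower verbosity at runtime from Python. This affects only the current session, whereas the log4j.properties approach above changes the default.
>>> sc.setLogLevel("WARN")   # valid levels include INFO, WARN, ERROR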
Module 2. Installing Spark 21
22. • Pro tip: ensure the following lines are in ~/.bash_profile
o export SPARK_HOME="/ext/spark/spark-2.1.2"
o export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PATH
• If you altered ~/.bash_profile, remember to run the following:
o $ source ~/.bash_profile
Done Installing Spark
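Not from the original slides: a quick sanity check from Python that the environment variables above took effect (the paths shown are the ones exported above; yours may differ).
$ python
>>> import os
>>> os.environ.get("SPARK_HOME")   # expect '/ext/spark/spark-2.1.2'
>>> os.environ.get("JAVA_HOME")    # should not be None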
Module 2. Installing Spark 22
23. Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>>
Module 2. Installing Spark 23
24. Running Pyspark Shell
$ pyspark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.2
      /_/
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>
Module 2. Installing Spark 24
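An aside between the slides: outside the pyspark shell (for example in a plain python script), no spark variable is created for you, so you build the SparkSession yourself. A minimal hedged sketch, assuming SPARK_HOME and the python paths are configured as in the .bash_profile step above; the app name is just a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("local[*]") \
    .appName("my-first-app") \
    .getOrCreate()

sc = spark.sparkContext                      # same object the shells call sc
rdd = sc.parallelize(["Hello", "World"])
print(rdd.count())                           # 2
spark.stop()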
34. Running Pyspark in ipython shell
In [4]: spark?
Type: SparkSession
String form: <pyspark.sql.session.SparkSession object at 0x1111f5f10>
File: /ext/spark/spark-2.1.2/python/pyspark/sql/session.py
Docstring:
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:
[...]
Module 2. Installing Spark 34
35. Running Pyspark in ipython shell
Running the example code that we ran previously
In [5]: sc = spark.sparkContext
...: rdd = sc.parallelize(["Hello", "World"])
...: print(rdd.count())
...: print(rdd.collect())
...: print(rdd.take(2))
...:
2
['Hello', 'World']
['Hello', 'World']
Module 2. Installing Spark 35
36. Running Pyspark in ipython shell
In [6]: sc?
Type: SparkContext
String form: <pyspark.context.SparkContext object at 0x1111702d0>
File: /ext/spark/spark-2.1.2/python/pyspark/context.py
Docstring:
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create L{RDD} and
broadcast variables on that cluster.
Init docstring:
Create a new SparkContext. At least the master and app name should be set,
either through the named parameters here or through C{conf}.
Module 2. Installing Spark 36
37. Running Pyspark in ipython shell
In [7]: rdd?
Type: RDD
String form: ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475
File: /ext/spark/spark-2.1.2/python/pyspark/rdd.py
Docstring:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be
operated on in parallel.
Module 2. Installing Spark 37
39. Running Pyspark in ipython shell
Magic functions
In [9]: %whos
Variable Type Data/Info
----------------------------------------
SPARK_HOME str /ext/spark/spark-2.1.2
SPARK_PY4J str python/lib/py4j-0.10.4-src.zip
SparkSession type <class 'pyspark.sql.session.SparkSession'>
java_home str /Library/Java/JavaVirtual<...>.7.0_80.jdk/Contents/Home
rdd RDD ParallelCollectionRDD[0] <...>ze at PythonRDD.scala:475
sc SparkContext <pyspark.context.SparkCon<...>xt object at 0x1111702d0>
spark SparkSession <pyspark.sql.session.Spar<...>on object at 0x1111f5f10>
Module 2. Installing Spark 39
40. • We learned several ways of installing Pyspark
• We ran the Spark Shell
• We showed how to run our new install in python and ipython
• We tested Pyspark by running sample code in all three of these shells
• We used two important objects: the Spark Session, and the Spark Context
• We created an RDD and inspected its contents.
What we covered
Module 2. Installing Spark 40
41. • Next time we’ll cover more ways of running Spark.
• We’ll show it running in an IDE and in a notebook.
• We’ll get our first view of the Spark UI
Next time: Running Spark
Module 2. Installing Spark 41
Editor's Notes
This module is geared toward getting your own local standalone version of Spark running. Several exercises get you working with your new Spark instance. This module is important as you will need an environment on which to complete the hands-on exercises in later modules.
copyright Mark E Plutowski 2018
https://spark.apache.org/docs/2.1.2/cluster-overview.html
The cluster manager has its own abstractions which are separate from Spark but can use the same names
‘driver’ node, sometimes called its ‘master’ node
This is different from the Spark Driver and the Driver Program
‘worker’ nodes
These are tied to physical machines, whereas in Spark they are tied to processes.
A Spark Application requests resources from the Cluster Manager
depending on the application, this could include
a place to run the Spark Driver Program
or, just resources for running the Executors
Here a Worker Node contains a single Executor. It can have more.
How many executor processes run for each worker node in Spark?
If using Dynamic Allocation, Spark will decide this for you.
Or, you can stipulate this
When and how this is done is outside the scope of this course; however, this course will prepare you to dig deeper into the answers to this question.
copyright Mark E Plutowski 2018
The Cluster Manager has its own notion of Driver Node, not to be confused with Spark’s Driver Node
The Driver Node in a cluster is the one that is running the Driver Program
Driver Program
The term Driver Program is sometimes used interchangeably to refer to
the Spark Application, and to
code being executed within the Driver Process created within a Spark Session.
Later in this module, we’ll illustrate this by visualizing the Spark Driver in the Spark UI
The Driver Program declares the transformations and actions on data
The main() method in the Spark application runs on the Driver,
This is similar to but distinct from an Executor
See the Cluster Overview documentation for reference
http://spark.apache.org/docs/latest/cluster-overview.html
Pro tip: to avoid confusion, when it isn’t apparent from context be clear what you mean when you say “Driver”
when referring to the lines of code that comprise the Spark Application, I say Spark Application code or script
when referring to the Driver object that is created by the Spark Session, I say Driver Process
when referring to the server in a cluster that is running the driver, I say Master Node
in writing, when necessary to disambiguate how you are using the term “Spark Driver” or “Driver Program”,
provide a link to Apache Spark documentation for the particular usage you intend.
We will see the Spark Driver Process in action using the Spark UI later in this module
copyright Mark E Plutowski 2018
We cover Execution Modes in more depth in Module 3 -- and I’ll touch on them briefly now.
Cluster mode: the cluster manager places the driver on a worker node and the executors on other worker nodes
application is provided as Jar, Egg, or application script (python, scala, R)
Client mode: same as cluster mode except that the driver stays on the client machine that submitted the application
you may be running the application from a machine not colocated with the workers in the cluster
aka “gateway machine” or “edge node”
Local : runs everything on a single machine
You’ve seen how to create a Spark application that runs just like a traditional executable
But you’ve also used spark-submit, which is a primary way to launch jobs on a Spark cluster
By “job” here I mean executing a Driver Program
copyright Mark E Plutowski 2018
In the standalone mode we install today, these components will all run on your single laptop or server.
https://spark.apache.org/docs/2.1.2/cluster-overview.html
copyright Mark E Plutowski 2018
You can also use the Databricks Community Edition, which I will demonstrate in the next module.
There are many excellent guides
https://blog.sicara.com/get-started-pyspark-jupyter-guide-tutorial-ae2fe84f594f
http://sigdelta.com/blog/how-to-install-pyspark-locally/
You have a lot of flexibility here.
There is a quick and easy installation
On the other end of this scale is a full-blown cluster configuration.
What we will install here is a happy medium: fairly easy, with ample flexibility
copyright Mark E Plutowski 2018
the main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has.
You can use Virtualenv or Anaconda (also referred to as conda, because that is the command used to invoke it on the command line).
I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended.
There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or other virtual environment.
You may also use a container-based environment, such as Docker. Rather than impose one choice I leave that to you as typically each organization has their own choice for this and set of best practices.
copyright Mark E Plutowski 2018
the main purpose of a virtual environment is to create an isolated environment for a project. This means that each project can have its own dependencies, regardless of what dependencies every other project has.
You can use Virtualenv or Anaconda (also referred to as conda, because that is the command used to invoke it on the command line).
I leave that choice up to you. It is not required that you create a new virtual environment for this project, but it is highly recommended.
There are many quick and easy tutorials for learning how to install virtualenv, Anaconda, or other virtual environment.
You may also use a container-based environment, such as Docker. Rather than impose one choice I leave that to you as typically each organization has their own choice for this and set of best practices.
I emphasize this recommendation because of the potentially wide audience, with apologies to the many of you who already know this. As a software engineer you should be aware of dependencies between projects.
Also, running the exact version that I am demonstrating makes it much easier to debug your setup if you run into issues with your install.
copyright Mark E Plutowski 2018
copyright Mark E Plutowski 2018
copyright Mark E Plutowski 2018
In the following, bulleted notes are instructions for what to do in a browser, whereas black-backgrounded slides contain command lines to be run in a terminal console.
copyright Mark E Plutowski 2018
It can be that easy. However, depending on your particular environment and the version you need, this could end up being limiting or more involved. If this works for you, Great! You can skip the next 10 slides. If not, continue on -- it will take less than 15 minutes.
In what follows I am going to provide instructions for installing from source. This allows complete access to the source, example code, and example datasets. If you want to be able to reproduce exactly what I do, please follow these steps. Otherwise, you may skip ahead.
copyright Mark E Plutowski 2018
This page gives instructions to download, build, and configure Spark 2.1.2 in standalone mode from source.
I encourage you to use the latest, 2.2.x; however, version 2.1.2 is well tested. It also runs on Java 7, whereas 2.2.x requires Java 8+. Also, if you want to ensure that your examples run as closely to mine as possible, use the setup instructions offered here.
Standalone mode runs on a single server and is a breeze to use from the python shell, IPython, an IDE, or the command line. These directions work on Ubuntu or Mac OS.
copyright Mark E Plutowski 2018
Save this to /ext/spark
The first line is unnecessary if you did this as a save-as from within the browser.
You could also use your File Browser to inspect the size instead of using the second line.
This is to ensure that you didn’t mistakenly download the wrong file when doing save as from the browser.
Extract the contents and change directory.
This page gives instructions to download, build, and configure Spark 2.1.2 in standalone mode from source.
Standalone mode runs on a single server and is a breeze to use from the python shell, IPython, an IDE, or the command line. These directions work on Ubuntu or Mac OS.
Why not just upgrade to Java 8+? If you are targeting a production environment that still relies on Java 7, using features that depend on Java 8+ could delay getting your application into production. Check the version of Java used in your deployment environment, and likewise for Python, Scala, or R, if you will be using those to script your Spark application.
copyright Mark E Plutowski 2018
copyright Mark E Plutowski 2018
Always Read the Notes. You might need to perform an additional installation to proceed, depending on your operating system, your environment, the version you chose, etc.
copyright Mark E Plutowski 2018
The build/mvn line will take ten to fifteen minutes.
copyright Mark E Plutowski 2018
The base install uses logging settings that are overly verbose. To enable settings that are less verbose, do this step. This also points you to where other logging configuration settings can be made.
That is probably the trickiest part of this course.
Once you have Spark installed, you should be able to follow along from here.
If you have Pyspark 2.1.2 installed, then all subsequent code examples should work identically for you.
Of course, there are always exceptions, depending upon your particular setup; however, we are probably over the hardest part. Hopefully it wasn’t too tricky for any of you. For many of you this should have been pretty straightforward.
If you chose the quick and easy installation path, and are rejoining us now, welcome back!
Do make sure that you know where to find code examples and example datasets that we’ll be referring to subsequently, which were downloaded along with the source.
copyright Mark E Plutowski 2018
That lowest line is important -- let’s see what this means. Enter spark at the prompt, like so ...
copyright Mark E Plutowski 2018
This is your Spark Session -- this provides the point of entry for interacting with Spark.
This also provides access to the Spark Context, used for working with RDDs (commonly referred to using the variable sc)
It also provides access to the Sql Context (commonly referred to using the variable sqlContext)
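For illustration (my addition, not in the original notes), both are one call away in the pyspark shell:
>>> spark.sparkContext                 # the SparkContext, i.e. sc
>>> spark.sql("select 1").collect()    # the SQL entry point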
copyright Mark E Plutowski 2018
We obtained a handle to the Spark context from the Spark session
Created an RDD, which is a type of parallelized collection.
Confirmed that it contains two rows.
Displayed its contents.
Displayed its contents in a different way.
copyright Mark E Plutowski 2018
I’ll show you a way to use the version of pyspark that you just installed.
If you get your environment variables and path settings configured
copyright Mark E Plutowski 2018
This helps ensure that python is using the version we installed
if you set your environment variables and path settings properly, you can skip this step.
However, this is a way to ensure that you are using the version we installed.
We’ll simplify this in a later module.
Note that we inserted the path instead of appending it … this avoids very weird bugs that can arise, especially in an IDE that already has a different version of Py4j installed.
copyright Mark E Plutowski 2018
Note that the Spark Shell gives an already instantiated Spark Session variable, ‘spark’.
Here, we need to create that ourselves.
copyright Mark E Plutowski 2018
This runs the same example we ran previously in the Spark shell.
Note that it is exactly the same. Once you get your development environment configured, you can typically run things one place or the other without modification.
We obtained a handle to the Spark context from the Spark session
Created an RDD, which is a type of parallelized collection.
Confirmed that it contains two rows.
Displayed its contents.
Displayed its contents in a different way.
copyright Mark E Plutowski 2018
Repeat the steps we used in the python shell.
copyright Mark E Plutowski 2018
Repeat the steps we used in the python shell.
copyright Mark E Plutowski 2018
Repeat the steps we used in the python shell.
copyright Mark E Plutowski 2018
Repeat the steps we used in the python shell.
Note that we can now inspect the object using object? notation.
I clipped 20+ lines of output it generated, which gives additional tips on how to use this object.
copyright Mark E Plutowski 2018
We of course get the identical results using the code snippet we used to test it in the Spark shell and the python shell.
copyright Mark E Plutowski 2018
This tells us more about the sc variable we created. This is the Spark Context object.
copyright Mark E Plutowski 2018
This tells us more about the rdd variable we created, which is an RDD.
copyright Mark E Plutowski 2018
And of course we get the other niceties of the ipython shell, such as tab completion.
copyright Mark E Plutowski 2018
And of course we get the other niceties of the ipython shell, such as magic functions
copyright Mark E Plutowski 2018
copyright Mark E Plutowski 2018
Next module shows how to run a Spark application. It shows more ways you can work with Spark, including within an IDE and notebook. We’ll get our first view of the Spark UI, using it to inspect a running application.
copyright Mark E Plutowski 2018