Apache Spark SQL- Installing Spark

• High-level physical cluster architecture
• Software architecture of a standalone cluster
• Installing Spark standalone locally
• Running the Spark Shell
• Running in python shell and ipython shell
• Running sample Spark code
• Using the Spark Session and Spark Context
• Creating a parallelized collection
Installing Spark
Module 2. Installing Spark 2
What we’ll cover:

Spark Cluster Components
Driver Program
Spark Session
● Spark Context
Worker Node
Executor
● Cache
● Task
● Task
● ...
Cluster Manager
Worker Node
Executor
● Cache
● Task
● Task
● ...

• Driver Node on the Cluster Manager, aka Master Node
• Driver Node in a Spark Cluster
• Driver Program
• Driver Process
“Driver”
4Module 2. Installing Spark

• Cluster Mode
o submit application to cluster manager
• Client Mode
o run the driver outside the cluster, with executors in the cluster
• Standalone Mode
o everything is on one machine
Execution Modes
Module 3. Spark Computing Framework 5

Standalone Spark Cluster
Driver Program
Spark Session
● Spark Context
Worker Node
Executor
● Cache
● Task
● Task
● ...
Cluster Manager

• Ensure that you have Java installed, and verify your version
• Most recent version you can install with Java 7: version 2.1.x
• If you have Java 8+ installed you can install version 2.2.x
Installing Local Standalone

• Work within a virtual environment or container
• If you are already familiar with this you can skip the next three slides
Recommendation for Success

• This is not required but is strongly recommended.
• a virtual environment creates an isolated environment for a project
o each project can have its own dependencies and not affect other projects
o you can easily and safely remove this environment when you are done with it
• Examples include Virtualenv, Anaconda, but there are many to good options
• You may also use a container-based environment, such as Docker.
• Use your preferred approach or organization’s recommended practice
Creating a virtual environment

Example : creating a virtual environment.
Anaconda
$ conda create -n my-env python
$ source activate my-env
# Here is the same thing, but specifying Python 2.7
$ conda create -n my-env python=2.7
# Here is the same thing, but with Python 3.6
$ conda create -n my-env python=3.6

Example : creating a virtual environment.
Virtualenv
$ cd /path/to/my/venvs
$ virtualenv ./my_venv
$ source ./my_env/bin/activate

• The most recent version of Pyspark you can install with Java 7 is version 2.1.x
• If you have Java 8+ installed you can install Pyspark 2.2.x
Install Java

The simple way to install Pyspark
Recommended for Spark 2.2.x onwards
$ pip install pyspark

• Download source and Build
o Browse to https://spark.apache.org/downloads.html
o Browse https://spark.apache.org/downloads.html
o Choose the following settings :
Installing Pyspark 2.1.2
o Click on spark-2.1.2.tgz

o Save this to /ext/spark

Recommended for following along in this course
$ curl http://apache.mirrors.lucidnetworks.net/spark/spark-2.1.2/spark-2.1.2.tgz -o spark-
2.1.2.tgz
# Confirm that the file size is as expected (~13MiB)
$ ls -lh spark-2.1.2.tgz
# Extract the contents
$ tar -xf spark-2.1.2.tgz
$ cd /ext/spark/spark-2.1.2

• Set Java home environment variable
o This is necessary only if Java home is not already set.

Set $JAVA_HOME
$ echo $JAVA_HOME
# If above returns empty, then you’ll need to set it. On Mac OS, you can use result of :
$ echo `/usr/libexec/java_home`
like so :
$ export JAVA_HOME=`/usr/libexec/java_home`

• Get the build instructions
o It is good practice to vet these yourself before running.
o Understand what you are going to do before doing it
o Spark evolves rapidly, so know how to upgrade!
• Browse to : https://spark.apache.org/documentation.html
o Scroll down to your version, click into that link
o For our case, this is https://spark.apache.org/docs/2.1.2/building-spark.html
o “Building Spark 2.2.1 using Maven requires Maven 3.3.9 or newer and Java 7+.”
o “Note that support for Java 7 was removed as of Spark 2.2.0.”
• Now to build it

Build it
$ export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
$ ./build/mvn -DskipTests clean package
[INFO] Total time: 11:11 min
$ sudo pip install py4j

Pro tip : turn down overly verbose logs
$ cd /ext/spark/spark-2.1.2
$ cp conf/log4j.properties.template conf/log4j.properties

• Pro tip: ensure the following lines are in ~/.bash_profile
o export SPARK_HOME="/ext/spark/spark-2.1.2"
o export PATH=$SPARK_HOME/bin:$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.10.4-src.zip:$PATH
• If you altered ~/.bash_profile, remember to run the following:
o $ source ~/.bash_profile
Done Installing Spark

Running Pyspark Shell
$ pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/__ / .__/_,_/_/ /_/_ version 2.1.2
/_/
SparkSession available as 'spark'.
>>>

$ pyspark
Welcome to
____ __
/ __/__ ___ _____/ /__
_ / _ / _ `/ __/ '_/
/__ / .__/_,_/_/ /_/_ version 2.1.2
/_/
>>> spark
<pyspark.sql.session.SparkSession object at 0x1050f8d50>
>>>

>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']

Running Pyspark in python shell
$ python
>>>

We’ll simplify this later -- for now let’s ensure we’re using what we just installed
$ python
>>> import sys
>>> import os
>>> SPARK_HOME = '/ext/spark/spark-2.1.2'
>>> SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
>>> sys.path.insert(0, os.path.join(SPARK_HOME, "python")) # precede pre-existing
>>> sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J))
>>> os.environ["SPARK_HOME"] = SPARK_HOME
>>>

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession
... .builder
... .appName("Python Spark SQL basic example")
... .config("spark.some.config.option", "some-value")
... .getOrCreate()
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
>>> spark
<pyspark.sql.session.SparkSession object at 0x1112e4650>

>>> sc = spark.sparkContext
>>> rdd = sc.parallelize(["Hello", "World"])
>>> print(rdd.count())
2
>>> print(rdd.collect())
['Hello', 'World']
>>> print(rdd.take(2))
['Hello', 'World']

Running Pyspark in ipython shell
Repeat the steps we did in the python shell :
$ ipython
In [1]: import os
...: import sys
...: SPARK_HOME = '/ext/spark/spark-2.1.2'
...: SPARK_PY4J = "python/lib/py4j-0.10.4-src.zip"
...: sys.path.insert(0, os.path.join(SPARK_HOME, "python"))
...: sys.path.insert(0, os.path.join(SPARK_HOME, SPARK_PY4J)) # must precede IDE's py4j
...:

In [2]: java_home = "/Library/Java/JavaVirtualMachines/jdk1.7.0_80.jdk/Contents/Home"
...: os.environ["JAVA_HOME"] = java_home
...: os.environ["SPARK_HOME"] = SPARK_HOME
...: from pyspark.sql import SparkSession
...:

In [3]: spark = SparkSession
...: .builder
...: .appName("Python Spark SQL basic example")
...: .config("spark.some.config.option", "some-value")
...: .getOrCreate()
...:

In [4]: spark?
Type: SparkSession
String form: <pyspark.sql.session.SparkSession object at 0x1111f5f10>
File: /ext/spark/spark-2.1.2/python/pyspark/sql/session.py
Docstring:
The entry point to programming Spark with the Dataset and DataFrame API.
A SparkSession can be used create :class:`DataFrame`, register :class:`DataFrame` as
tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:
[...]

Running the example code that we ran previously
In [5]: sc = spark.sparkContext
...: rdd = sc.parallelize(["Hello", "World"])
...: print(rdd.count())
...: print(rdd.collect())
...: print(rdd.take(2))
...:
2
['Hello', 'World']
['Hello', 'World']

In [6]: sc?
Type: SparkContext
String form: <pyspark.context.SparkContext object at 0x1111702d0>
File: /ext/spark/spark-2.1.2/python/pyspark/context.py
Docstring:
Main entry point for Spark functionality. A SparkContext represents the
connection to a Spark cluster, and can be used to create L{RDD} and
broadcast variables on that cluster.
Init docstring:
Create a new SparkContext. At least the master and app name should be set,
either through the named parameters here or through C{conf}.

In [7]: rdd?
Type: RDD
String form: ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:475
File: /ext/spark/spark-2.1.2/python/pyspark/rdd.py
Docstring:
A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
Represents an immutable, partitioned collection of elements that can be
operated on in parallel.

Tab completion
T
In [8]: rdd.
rdd.aggregate rdd.coalesce rdd.context rdd.countByValue
rdd.aggregateByKey rdd.cogroup rdd.count rdd.ctx
rdd.cache rdd.collect rdd.countApprox rdd.distinct
rdd.cartesian rdd.collectAsMap rdd.countApproxDistinct rdd.filter
rdd.checkpoint rdd.combineByKey rdd.countByKey rdd.first

Magic functions
T
In [9]: %whos
Variable Type Data/Info
----------------------------------------
SPARK_HOME str /ext/spark/spark-2.1.2
SPARK_PY4J str python/lib/py4j-0.10.4-src.zip
SparkSession type <class 'pyspark.sql.session.SparkSession'>
java_home str /Library/Java/JavaVirtual<...>.7.0_80.jdk/Contents/Home
rdd RDD ParallelCollectionRDD[0] <...>ze at PythonRDD.scala:475
sc SparkContext <pyspark.context.SparkCon<...>xt object at 0x1111702d0>
spark SparkSession <pyspark.sql.session.Spar<...>on object at 0x1111f5f10>

• We learned several ways of installing Pyspark
• We ran the Spark Shell
• We showed how to run our new install in python and ipython
• We tested Pyspark by running sample code in all three of these shells
• We used two important objects: the Spark Session, and the Spark Context
• We created a RDD and inspected its contents.
What we covered

• Next time we’ll cover more ways of running Spark.
• We’ll show it in an IDE, and notebook.
• We’ll get our first view of the Spark UI
Next time: Running Spark

Apache Spark SQL- Installing Spark

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Apache Spark SQL- Installing Spark

Similar to Apache Spark SQL- Installing Spark (20)

More from Experfy

More from Experfy (20)

Recently uploaded

Recently uploaded (20)

Apache Spark SQL- Installing Spark

Editor's Notes