Philippe ROSSIGNOL : 2015/06/12
How to configure Eclipse in order
to develop with Spark and Python
http://enahwe.blogspot.fr/p/philippe-rossignol-20150612-how-to.html
Introduction
Under the cover of PySpark
Requirements
A brief note about Scala
1°) Install Eclipse
2°) Install Spark
3°) Install PyDev
4°) Configure PyDev with a Python interpreter
5°) Configure PyDev with the Spark Python sources
6°) Configure PyDev with the Spark Environment variables
7°) Create the Spark-Python project “CountWords”
8°) Run the Spark-Python project “CountWords”
Introduction
This document shows a way to configure Eclipse IDE in order to develop with Spark 1.3.1 and Python via the plugin PyDev.
PyDev is a plugin that enables Eclipse to be used as a Python IDE.
First we will install Eclipse, then Spark 1.3.1 and PyDev, then we will configure PyDev.
Finally we will develop and test a well-known example program named “Word Counts”, written in Python and running on Spark.
Under the cover of PySpark
The Spark Python API (PySpark) exposes the Spark programming model to Python.
By default, PySpark requires Python (2.6 or higher) to be available on the system PATH and uses it to run programs.
Let’s note that PySpark applications are executed by using a standard CPython interpreter (in order to support Python modules that use C extensions).
But an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable.
All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java.
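To make this architecture concrete, here is a minimal sketch of a PySpark driver (the interpreter path and the application name are illustrative examples, not values prescribed by PySpark):
# A minimal PySpark driver sketch (paths and names are examples only).
import os

# Optional: select a specific CPython interpreter before creating the context.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

from pyspark import SparkContext

# Creating the SparkContext launches the JVM through Py4J and creates
# the JavaSparkContext behind the scenes.
sc = SparkContext("local", "Py4jDemo")

# This Python transformation is mapped to a PythonRDD transformation in Java.
rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
print rdd.collect()  # prints [1, 4, 9, 16]
sc.stop()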
For more details please visit the pages below :
Installing and Configuring PySpark : https://spark.apache.org/docs/0.9.2/python-programming-guide.html
PySpark Internals : https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Requirements
Let’s note that Spark 1.3.1 runs on Java 6+ and Python 2.6+, so you will need on your computer :
● A JVM 6 or higher (JVM 7 is a good compromise)
● A Python 2.6 or higher
The following installation has been carried out with a JVM 7 and a Python interpreter 2.7.
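As a quick sanity check before going further, the following sketch (illustrative only) verifies both requirements from Python:
# Check the Python version and display the Java version found on the PATH.
import subprocess
import sys

assert sys.version_info >= (2, 6), "Python 2.6 or higher is required"
subprocess.call(["java", "-version"])  # 'java -version' prints to stderr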
A brief note about Scala
Keep in mind that a good idea would be to use this same Eclipse IDE to later develop for Spark in both Python and Scala.
To allow this, it’s important to know that Spark 1.3.1 uses a Scala API compatible with Scala version 2.10.x.
That’s the reason why the following installation uses Eclipse 4.3 (Kepler) because of its compatibility with Scala 2.10.
1°) Install Eclipse
Go to the Eclipse website then download and uncompress Eclipse 4.3 (Kepler) on your computer :
http://www.eclipse.org/downloads/packages/release/Kepler/SR2
Finally launch Eclipse and create your workspace as usual.
2°) Install Spark
Go to the Spark website then download and uncompress Spark 1.3.1 (e.g: “Pre-built for Hadoop 2.6 and later”) on your computer :
https://spark.apache.org/downloads.html
3°) Install PyDev
From Eclipse IDE :
Go to the menu Help > Install New Software...
From the “Install“ window :
Click on the button [Add…]
From the “Add Repository” dialog box :
Fill the field Name: PyDev
Fill the field Location: http://pydev.sf.net/updates
Validate with the button [OK]
From the “Install“ window :
Check the name PyDev and click twice on the button [Next >]
Accept the terms of the license agreement and click on the button [Finish]
If a “Security Warning” window appears with the message “Warning: you are installing software that contains unsigned content…” :
Click on the button [OK]
From the “Software Updates” window :
Click the button [Yes] to restart Eclipse so the changes take effect.
Now PyDev (e.g: 4.1.0) is installed in your Eclipse.
But you can’t develop in Python, because PyDev isn’t configured yet.
4°) Configure PyDev with a Python interpreter
Like PySpark, PyDev requires a Python interpreter installed on your computer.
Remember that with PySpark, Py4J is not a Python interpreter.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
The following installation has been carried out with a Python interpreter 2.7.
From Eclipse IDE :
Open the PyDev perspective (on top right of the Eclipse IDE)
Go to the menu Eclipse > Preferences… (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [Advanced Auto-Config]
Eclipse will introspect all the Python installations on your computer.
Choose a Python version 2.7 (e.g: /usr/bin/python2.7 on Mac), then validate with the button [OK]
From the “Selection needed” window :
Click on the button [OK] to accept the folders to be added to the system PYTHONPATH
From the “Preferences” window :
Validate with the button [OK]
Now PyDev is configured in your Eclipse.
You are able to develop in Python but not with Spark yet.
5°) Configure PyDev with the Spark Python sources
Now we are going to configure PyDev with the Spark Python sources.
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [New Folder]
Choose the python folder just under your Spark install directory and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python
Note : This path must be absolute (don’t use the Spark home environment variable)
Click on the button [New Egg/Zip(s)]
From the File Explorer, select [*.zip] rather than [*.egg]
Choose the file py4j-0.8.2.1-src.zip just under your Spark python folder and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip
Note : This path must be absolute (don’t use the Spark home environment variable)
Validate with the button [OK]
Now PyDev is configured with Spark Python sources.
But we can’t execute Spark yet, because the environment variables aren’t configured.
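As an aside, when a script runs outside Eclipse (e.g : from a plain shell), the equivalent of this PyDev configuration is to put the same two entries on sys.path yourself. A minimal sketch, assuming the example install path used above:
# Equivalent of the two PyDev PYTHONPATH entries, for scripts run outside Eclipse.
import sys

spark_home = "/home/foo/Spark_1.3.1-Hadoop_2.6"  # example path, adapt it
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, spark_home + "/python/py4j-0.8.2.1-src.zip")

from pyspark import SparkConf, SparkContext  # now importable
# Note : SPARK_HOME must still be set to actually run Spark (see next section).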
6°) Configure PyDev with the Spark Environment variables
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the central button [Environment]
Click on the button [New...] (close to the button [Select...]) to add a new Environment variable.
Add the environment variable SPARK_HOME and validate :
e.g 1 : Name: SPARK_HOME, Value: /home/foo/Spark_1.3.1-Hadoop_2.6
e.g 2 : Name: SPARK_HOME, Value: ${eclipse_home}../Spark_1.3.1-Hadoop_2.6
Note : Don’t rely on system environment variables such as the Spark home; define the variable here
It’s recommended to manage your own "log4j.properties" file in each of your projects.
To do that, add the environment variable SPARK_CONF_DIR as previously and validate :
e.g : Name: SPARK_CONF_DIR, Value: ${project_loc}/conf
If you experience some problems with the variable ${project_loc} (e.g: with Linux OS), specify an absolute path instead.
Or if you want to keep ${project_loc}, right-click on each Python source and choose Run As > Run Configurations…,
then create your SPARK_CONF_DIR variable in the Environment tab as described previously
Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP and so on :
e.g 1 : Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm
e.g 2 : Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)
Validate with the button [OK]
Now PyDev is fully ready to develop with Spark in Python.
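To check that these variables are actually visible from your code, a small sketch like this one (illustrative) can be dropped at the top of any module:
# Sanity-check the environment variables configured in PyDev.
import os

for name in ("SPARK_HOME", "SPARK_CONF_DIR"):
    print "%s = %s" % (name, os.environ.get(name, "<not set>"))

assert "SPARK_HOME" in os.environ, "SPARK_HOME must be set (see section 6)"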
7°) Create the Spark-Python project “CountWords”
Now that we can develop any kind of Spark project written in Python, let’s create the code example named “CountWords”.
This example will count the frequency of each word present in the “README.md” file belonging to the Spark installation.
To perform such a count, the well-known MapReduce paradigm will be applied in memory by using the two Spark transformations “flatMap” and “reduceByKey”.
Create the new project :
Check that you are on the PyDev perspective
Go to the Eclipse menu File > New > PyDev project
Name your new project “PythonSpark”, then click on the button [Finish]
Create a source folder :
To add a source folder (which will soon contain your Python source), right-click on the project icon and New > Folder
Name the new folder “src”, then click on the button [Finish]
To add the new Python source, right-click on the source folder icon and New > PyDev Module
Name the new Python source “WordCounts”, then click on the button [Finish], then click on the button [OK]
Copy-paste the following Python code into your PyDev module WordCounts.py :
# Imports
# Take care with unused imports (and also unused variables): comment them all out,
# otherwise you will get errors at execution. Note that neither the directive
# "@PydevCodeAnalysisIgnore" nor "@UnusedImport" will solve that issue.
#from pyspark.mllib.clustering import KMeans
from pyspark import SparkConf, SparkContext
import os
# Configure the Spark environment
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf = sparkConf)
# The WordCounts Spark program
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
for wc in wordCounts.collect(): print wc
In PyDev, take care with unused imports and unused variables.
Comment them all out, otherwise you will get errors at execution.
Note that neither the directive @PydevCodeAnalysisIgnore nor @UnusedImport will solve that issue.
Create a config folder :
To add a config folder (useful for log4j), right-click on the project icon and New > Folder
Name the new folder “conf”, then click on the button [Finish]
To add your new config file (the “log4j.properties” file) right-click on the config folder icon and New > File
Name the new config file “log4j.properties”, then click on the button [Finish], then click on the button [OK]
Copy-paste the content of the file “log4j.properties.template” (under $SPARK_HOME/conf) to your new config file “log4j.properties”
Edit your own config file “log4j.properties” to change the log level as you wish (e.g : INFO to WARN, or INFO to ERROR...)
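For example, assuming your copy still matches the standard template, lowering the verbosity is a one-line change in “log4j.properties”:
# Log only warnings and errors to the console (was INFO in the template)
log4j.rootCategory=WARN, console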
8°) Run the Spark-Python project “CountWords”
To execute your code, right-click on the Python module “WordCounts.py”, then choose Run As > 1 Python Run
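The results print as (word, count) tuples. As a variant (not part of the original example), you can replace the final loop of WordCounts.py to display only the most frequent words:
# Print the 10 most frequent words instead of all of them.
for word, count in sorted(wordCounts.collect(), key=lambda wc: wc[1], reverse=True)[:10]:
    print word, count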
Have fun :-)