Philippe ROSSIGNOL : 2015/06/12
How to configure Eclipse in order
to develop with Spark and Python
http://enahwe.blogspot.fr/p/philippe-rossignol-20150612-how-to.html
Introduction
Under the cover of PySpark
Requirements
A brief note about Scala
1°) Install Eclipse
2°) Install Spark
3°) Install PyDev
4°) Configure PyDev with a Python interpreter
5°) Configure PyDev with the Spark Python sources
6°) Configure PyDev with the Spark Environment variables
7°) Create the Spark-Python project “CountWords”
8°) Run the Spark-Python project “CountWords”
Introduction
This document shows a way to configure Eclipse IDE in order to develop with Spark 1.3.1 and Python via the plugin PyDev.
PyDev is a plugin that enables Eclipse to be used as a Python IDE.
First we will install Eclipse, then Spark 1.3.1 and PyDev, then we will configure PyDev.
Finally we will develop and test a well-known example program named “Word Counts”, written in Python and running on Spark.
Under the cover of PySpark
The Spark Python API (PySpark) exposes the Spark programming model to Python.
By default, PySpark requires Python (2.6 or higher) to be available on the system PATH and uses it to run programs.
Let’s note that PySpark applications are executed by using a standard CPython interpreter (in order to support Python modules that use C extensions).
But an alternate Python executable may be specified by setting the PYSPARK_PYTHON environment variable.
All of PySpark’s library dependencies, including Py4J, are bundled with PySpark and automatically imported.
In the Python driver program, SparkContext uses Py4J to launch a JVM and create a JavaSparkContext.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
RDD transformations in Python are mapped to transformations on PythonRDD objects in Java.
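To make this architecture concrete, here is a minimal sketch of a PySpark driver (the interpreter path and the application name are illustrative examples, not values prescribed by PySpark):
# A minimal PySpark driver sketch (paths and names are examples only).
import os

# Optional: select a specific CPython interpreter before creating the context.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python2.7"

from pyspark import SparkContext

# Creating the SparkContext launches the JVM through Py4J and creates
# the JavaSparkContext behind the scenes.
sc = SparkContext("local", "Py4jDemo")

# This Python transformation is mapped to a PythonRDD transformation in Java.
rdd = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
print rdd.collect()  # prints [1, 4, 9, 16]
sc.stop()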
For more details please visit the pages below :
Installing and Configuring PySpark : https://spark.apache.org/docs/0.9.2/python-programming-guide.html
PySpark Internals : https://cwiki.apache.org/confluence/display/SPARK/PySpark+Internals
Requirements
Let’s note that Spark 1.3.1 runs on Java 6+ and Python 2.6+, so you will need on your computer :
● A JVM 6 or higher (JVM 7 is a good compromise)
● A Python 2.6 or higher
The following installation has been carried out with a JVM 7 and a Python interpreter 2.7.
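As a quick sanity check before going further, the following sketch (illustrative only) verifies both requirements from Python:
# Check the Python version and display the Java version found on the PATH.
import subprocess
import sys

assert sys.version_info >= (2, 6), "Python 2.6 or higher is required"
subprocess.call(["java", "-version"])  # 'java -version' prints to stderr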
A brief note about Scala
Keep in mind that a good idea would be to use this same Eclipse IDE to later develop for Spark in both Python and Scala.
To allow this, it’s important to know that Spark 1.3.1 uses a Scala API compatible with Scala version 2.10.x.
That’s the reason why the following installation uses Eclipse 4.3 (Kepler) because of its compatibility with Scala 2.10.
1°) Install Eclipse
Go to the Eclipse website then download and uncompress Eclipse 4.3 (Kepler) on your computer :
http://www.eclipse.org/downloads/packages/release/Kepler/SR2
Finally launch Eclipse and create your workspace as usual.
2°) Install Spark
Go to the Spark website then download and uncompress Spark 1.3.1 (e.g: “Pre-built for Hadoop 2.6 and later”) on your computer :
https://spark.apache.org/downloads.html
3°) Install PyDev
From Eclipse IDE :
Go to the menu Help > Install New Software...
From the “Install“ window :
Click on the button [Add…]
From the “Add Repository” dialog box :
Fill the field Name: PyDev
Fill the field Location: http://pydev.sf.net/updates
Validate with the button [OK]
From the “Install“ window :
Check the name PyDev and click twice on the button [Next >]
Accept the terms of the license agreement and click on the button [Finish]
If a “Security Warning” window appears with the message “Warning: you are installing software that contains unsigned content…” :
Click on the button [OK]
From the “Software Updates” window :
Click the button [Yes] to restart Eclipse so the changes take effect.
Now PyDev (e.g: 4.1.0) is installed in your Eclipse.
But you can’t develop in Python, because PyDev isn’t configured yet.
4°) Configure PyDev with a Python interpreter
Like PySpark, PyDev requires a Python interpreter installed on your computer.
Remember that with PySpark, Py4J is not a Python interpreter.
Py4J is only used on the driver for local communication between the Python and Java SparkContext objects.
The following installation has been carried out with a Python interpreter 2.7.
From Eclipse IDE :
Open the PyDev perspective (on top right of the Eclipse IDE)
Go to the menu Eclipse > Preferences… (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [Advanced Auto-Config]
Eclipse will introspect all the Python installations on your computer.
Choose a Python version 2.7 (e.g: /usr/bin/python2.7 on Mac), then validate with the button [OK]
From the “Selection needed” window :
Click on the button [OK] to accept the folders to be added to the system PYTHONPATH
From the “Preferences” window :
Validate with the button [OK]
Now PyDev is configured in your Eclipse.
You are able to develop in Python but not with Spark yet.
5°) Configure PyDev with the Spark Python sources
Now we are going to configure PyDev with the Spark Python sources.
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the button [New Folder]
Choose the python folder just under your Spark install directory and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python
Note : This path must be absolute (don’t use the Spark home environment variable)
Click on the button [New Egg/Zip(s)]
From the File Explorer, select [*.zip] rather than [*.egg]
Choose the file py4j-0.8.2.1-src.zip just under your Spark python folder and validate :
e.g : /home/foo/Spark_1.3.1-Hadoop_2.6/python/py4j-0.8.2.1-src.zip
Note : This path must be absolute (don’t use the Spark home environment variable)
Validate with the button [OK]
Now PyDev is configured with Spark Python sources.
But we can’t execute Spark yet, because the environment variables aren’t configured.
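As an aside, when a script runs outside Eclipse (e.g : from a plain shell), the equivalent of this PyDev configuration is to put the same two entries on sys.path yourself. A minimal sketch, assuming the example install path used above:
# Equivalent of the two PyDev PYTHONPATH entries, for scripts run outside Eclipse.
import sys

spark_home = "/home/foo/Spark_1.3.1-Hadoop_2.6"  # example path, adapt it
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, spark_home + "/python/py4j-0.8.2.1-src.zip")

from pyspark import SparkConf, SparkContext  # now importable
# Note : SPARK_HOME must still be set to actually run Spark (see next section).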
6°) Configure PyDev with the Spark Environment variables
From Eclipse IDE :
Check that you are on the PyDev perspective
Go to the menu Eclipse > Preferences... (on Mac), or Window > Preferences... (on Linux and Windows)
From the “Preferences” window :
Go to PyDev > Interpreters > Python Interpreter
Click on the central button [Environment]
Click on the button [New...] (close to the button [Select...]) to add a new Environment variable.
Add the environment variable SPARK_HOME and validate :
e.g 1 : Name: SPARK_HOME, Value: /home/foo/Spark_1.3.1-Hadoop_2.6
e.g 2 : Name: SPARK_HOME, Value: ${eclipse_home}../Spark_1.3.1-Hadoop_2.6
Note : Don’t rely on system environment variables such as the Spark home; define the variable here
It’s recommended to manage your own "log4j.properties" file in each of your projects.
To do that, add the environment variable SPARK_CONF_DIR as previously and validate :
e.g : Name: SPARK_CONF_DIR, Value: ${project_loc}/conf
If you experience some problems with the variable ${project_loc} (e.g: with Linux OS), specify an absolute path instead.
Or if you want to keep ${project_loc}, right-click on each Python source and choose Run As > Run Configurations…,
then create your SPARK_CONF_DIR variable in the Environment tab as described previously
Optionally, you can add other environment variables such as TERM, SPARK_LOCAL_IP and so on :
e.g 1 : Name: TERM, Value on Mac: xterm-256color, Value on Linux: xterm
e.g 2 : Name: SPARK_LOCAL_IP, Value: 127.0.0.1 (it’s recommended to specify your real local IP address)
Validate with the button [OK]
Now PyDev is fully ready to develop with Spark in Python.
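To check that these variables are actually visible from your code, a small sketch like this one (illustrative) can be dropped at the top of any module:
# Sanity-check the environment variables configured in PyDev.
import os

for name in ("SPARK_HOME", "SPARK_CONF_DIR"):
    print "%s = %s" % (name, os.environ.get(name, "<not set>"))

assert "SPARK_HOME" in os.environ, "SPARK_HOME must be set (see section 6)"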
7°) Create the Spark-Python project “CountWords”
Now that we can develop any kind of Spark project written in Python, let’s create the code example named “CountWords”.
This example will count the frequency of each word present in the “README.md” file belonging to the Spark installation.
To perform such a count, the well-known MapReduce paradigm will be applied in memory by using the two Spark transformations “flatMap” and “reduceByKey”.
Create the new project :
Check that you are on the PyDev perspective
Go to the Eclipse menu File > New > PyDev project
Name your new project “PythonSpark”, then click on the button [Finish]
Create a source folder :
To add a source folder (which will soon contain your Python source), right-click on the project icon and New > Folder
Name the new folder “src”, then click on the button [Finish]
To add the new Python source, right-click on the source folder icon and New > PyDev Module
Name the new Python source “WordCounts”, then click on the button [Finish], then click on the button [OK]
Copy-paste the following Python code into your PyDev module WordCounts.py :
# Imports
# Take care with unused imports (and also unused variables): comment them all out,
# otherwise you will get errors at execution. Note that neither the directive
# "@PydevCodeAnalysisIgnore" nor "@UnusedImport" will solve that issue.
#from pyspark.mllib.clustering import KMeans
from pyspark import SparkConf, SparkContext
import os
# Configure the Spark environment
sparkConf = SparkConf().setAppName("WordCounts").setMaster("local")
sc = SparkContext(conf = sparkConf)
# The WordCounts Spark program
textFile = sc.textFile(os.environ["SPARK_HOME"] + "/README.md")
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
for wc in wordCounts.collect(): print wc
In PyDev, take care with unused imports and unused variables.
Comment them all out, otherwise you will get errors at execution.
Note that neither the directive @PydevCodeAnalysisIgnore nor @UnusedImport will solve that issue.
Create a config folder :
To add a config folder (useful for log4j), right-click on the project icon and New > Folder
Name the new folder “conf”, then click on the button [Finish]
To add your new config file (the “log4j.properties” file) right-click on the config folder icon and New > File
Name the new config file “log4j.properties”, then click on the button [Finish], then click on the button [OK]
Copy-paste the content of the file “log4j.properties.template” (under $SPARK_HOME/conf) to your new config file “log4j.properties”
Edit your own config file “log4j.properties” to change the log level as you wish (e.g : INFO to WARN, or INFO to ERROR...)
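For example, assuming your copy still matches the standard template, lowering the verbosity is a one-line change in “log4j.properties”:
# Log only warnings and errors to the console (was INFO in the template)
log4j.rootCategory=WARN, console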
8°) Run the Spark-Python project “CountWords”
To execute your code, right-click on the Python module “WordCounts.py”, then choose Run As > 1 Python Run
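The results print as (word, count) tuples. As a variant (not part of the original example), you can replace the final loop of WordCounts.py to display only the most frequent words:
# Print the 10 most frequent words instead of all of them.
for word, count in sorted(wordCounts.collect(), key=lambda wc: wc[1], reverse=True)[:10]:
    print word, count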
Have fun :-)