More Related Content
Similar to PySpark Best Practices (20)
More from Cloudera, Inc. (20)
PySpark Best Practices
- 1. ‹#›© Cloudera, Inc. All rights reserved.
Juliet Hougland
Sept 2015
@j_houg
PySpark Best Practices
- 3. ‹#›© Cloudera, Inc. All rights reserved.
• Core written, operates on the JVM
• Also has Python and Java APIs
• Hadoop Friendly
• Input from HDFS, HBase, Kafka
• Management via YARN
• Interactive REPL
• ML library == MLLib
Spark
- 4. ‹#›© Cloudera, Inc. All rights reserved.
Spark MLLib
• Model building and eval
• Fast
• Basics covered
• LR, SVM, Decision tree
• PCA, SVD
• K-means
• ALS
• Algorithms expect RDDs of
consistent types (i.e.
LabeledPoints)
!
- 5. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
- 6. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 7. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 8. ‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 9. ‹#›© Cloudera, Inc. All rights reserved.
…RDD …RDD
RDDs
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Count
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 12. ‹#›© Cloudera, Inc. All rights reserved.
PySpark Driver Program
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
Function closures need to be executed
on worker nodes by a python process.
- 13. ‹#›© Cloudera, Inc. All rights reserved.
How do we ship around Python functions?
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 15. ‹#›© Cloudera, Inc. All rights reserved.
Pickle!
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
- 19. ‹#›© Cloudera, Inc. All rights reserved.
Standard Python Project
my_pyspark_proj/
awesome/
__init__.py
bin/
docs/
setup.py
tests/
awesome_tests.py
__init__.py
- 20. ‹#›© Cloudera, Inc. All rights reserved.
What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U
- 21. ‹#›© Cloudera, Inc. All rights reserved.
!
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
PySpark Structure?
https://flic.kr/p/ZW54
Shout out to my
colleagues in the UK
- 22. ‹#›© Cloudera, Inc. All rights reserved.
PySpark Structure?
my_pyspark_proj/
awesome/
__init__.py
DataIO.py
Featurize.py
Model.py
bin/
docs/
setup.py
tests/
__init__.py
awesome_tests.py
resources/
data_source_sample.csv
!
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
- 24. ‹#›© Cloudera, Inc. All rights reserved.
• Write a function for
anything inside an
transformation
• Make it static
• Separate Feature
generation or data
standardization
from your modeling
Write Testable Code
Featurize.py
…
!
@static_method
def label(single_record):
…
return label_as_a_double
@static_method
def descriptive_name_of_feature1():
...
return a_double
!
@static_method
def create_labeled_point(data_usage_rdd, sms_usage_rdd):
...
return LabeledPoint(label, [feature1])
- 25. ‹#›© Cloudera, Inc. All rights reserved.
• Functions and the contexts
they need to execute
(closures) must be
serializable
• Keep functions simple. I
suggest static methods.
• Some things are impossiblish
• DB connections => Use
mapPartitions instead
Write Serializable Code
https://flic.kr/p/za5cy
- 26. ‹#›© Cloudera, Inc. All rights reserved.
• Provides a SparkContext
configures Spark master
• Quiets Py4J
• https://github.com/holdenk/
spark-testing-base
Testing with SparkTestingBase
- 27. ‹#›© Cloudera, Inc. All rights reserved.
• Unit test as much as possible
• Integration test the whole flow
!
• Test for:
• Deviations of data from
expected format
• RDDs with an empty partitions
• Correctness of results
Testing Suggestions
https://flic.kr/p/tucHHL
- 29. ‹#›© Cloudera, Inc. All rights reserved.
Writing distributed code is the easy part…
Running it is hard.
- 30. ‹#›© Cloudera, Inc. All rights reserved.
Get Serious About Logs
• Get the YARN app id from
the WebUI or Console
• yarn logs <app-id>
• Quiet down Py4J
• Log records that have
trouble getting processed
• Earlier exceptions more
relevant than later ones
• Look at both the Python
and Java stack traces
- 31. ‹#›© Cloudera, Inc. All rights reserved.
Know your environment
• You may want to use
python packages on your
cluster
• Actively manage
dependencies on your
cluster
• Anaconda or virtualenv is
good for this.
• Spark versions <1.4.0
require the same version of
Python on driver and
workers
- 33. ‹#›© Cloudera, Inc. All rights reserved.
Many Python Environments Path to Python binary to use
on the cluster can be set with
PYSPARK_PYTHON
!
Can be set it in spark-env.sh
if [ -n “${PYSPARK_PYTHON}" ]; then
export PYSPARK_PYTHON=<path>
fi