PySpark Best Practices by Juliet Hougland
  1. PySpark Best Practices. Juliet Hougland (@j_houg), Spark Summit Europe 2015.
  3-7. RDDs (one diagram built up across five slides)
      sc.textFile("hdfs://…", 4)
        .map(to_series)
        .filter(has_outlier)
        .count()
      Diagram: four HDFS partitions are read into a four-partition RDD; map and filter each produce another four-partition RDD; count collapses the final RDD into a single result. Thanks: Kostas Sakellis
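For readers following along outside the deck, here is a minimal runnable sketch of the snippet above. The slides never define to_series or has_outlier, so the versions below are illustrative guesses (a comma-separated line of floats, and a crude 3-sigma outlier test); the input path is a placeholder for the slide's elided "hdfs://…".

    # Minimal runnable sketch of the slides' snippet; to_series and
    # has_outlier are illustrative guesses, not the talk's actual code.
    from statistics import mean, stdev
    from pyspark import SparkContext

    sc = SparkContext("local[4]", "rdd-example")

    def to_series(line):
        # hypothetical parser: one comma-separated series of floats per line
        return [float(tok) for tok in line.split(",") if tok.strip()]

    def has_outlier(series):
        # hypothetical predicate: any point more than 3 standard deviations out
        if len(series) < 2:
            return False
        mu, sigma = mean(series), stdev(series)
        return any(abs(x - mu) > 3 * sigma for x in series) if sigma else False

    n = (sc.textFile("hdfs://path/to/data", 4)   # placeholder path
           .map(to_series)
           .filter(has_outlier)
           .count())
    print(n)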
  8. Spark Execution Model (architecture diagram; not captured in this transcript)
  9. PySpark Execution Model (architecture diagram; not captured in this transcript)
  10. PySpark Driver Program
      sc.textFile("hdfs://…", 4)
        .map(to_series)
        .filter(has_outlier)
        .count()
      Function closures need to be executed on worker nodes by a Python process.
  11. How do we ship around Python functions?
      sc.textFile("hdfs://…", 4)
        .map(to_series)
        .filter(has_outlier)
        .count()
  12. Pickle! (photo: https://flic.kr/p/c8N4sE)
  13. Pickle!
      sc.textFile("hdfs://…", 4)
        .map(to_series)
        .filter(has_outlier)
        .count()
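The deck doesn't show the mechanics, but the idea is that the driver pickles each function together with its closure and ships the bytes to the workers. A small sketch using the standalone cloudpickle package (PySpark bundles its own variant of it); has_outlier and threshold are hypothetical:

    import pickle
    import cloudpickle  # pip install cloudpickle; PySpark bundles a variant

    threshold = 3.0  # module-level value captured along with the function

    def has_outlier(series):  # hypothetical predicate from the earlier slides
        return any(abs(x) > threshold for x in series)

    payload = cloudpickle.dumps(has_outlier)  # bytes the driver can ship
    restored = pickle.loads(payload)          # what a worker process does
    print(restored([0.1, 5.2]))               # True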
  14. Best Practices for Writing PySpark
  15. REPLs and Notebooks (photo: https://flic.kr/p/5hnPZp)
  16. Share your code (photo: https://flic.kr/p/sw2cnL)
  17. Standard Python Project
      my_pyspark_proj/
          awesome/
              __init__.py
          bin/
          docs/
          setup.py
          tests/
              awesome_tests.py
              __init__.py
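The slide shows only the layout; a minimal setup.py to go with it might look like this (package name and version are placeholders):

    # Minimal setup.py matching the layout above; metadata is placeholder.
    from setuptools import setup, find_packages

    setup(
        name="awesome",
        version="0.1.0",
        packages=find_packages(exclude=["tests"]),
    )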
  18. What is the shape of a PySpark job? (photo: https://flic.kr/p/4vWP6U)
  19. PySpark Structure? (photo: https://flic.kr/p/ZW54; shout out to my colleagues in the UK)
      • Parse CLI args & configure Spark App
      • Read in data
      • Raw data into features
      • Fancy Maths with Spark
      • Write out data
  20. PySpark Structure?
      my_pyspark_proj/
          awesome/
              __init__.py
              DataIO.py
              Featurize.py
              Model.py
          bin/
          docs/
          setup.py
          tests/
              __init__.py
              awesome_tests.py
              resources/
                  data_source_sample.csv
      • Parse CLI args & configure Spark App
      • Read in data
      • Raw data into features
      • Fancy Maths with Spark
      • Write out data
  21. Simple Main Method (code slide; the code itself was not captured in this transcript)
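Since the slide's code is missing, here is a hedged reconstruction of what a simple main method for the slide-20 layout could look like. The DataIO/Featurize/Model functions (read, featurize, train, write) are hypothetical stand-ins, not the talk's actual API:

    # Hypothetical "simple main method" for the project layout on slide 20.
    import argparse
    from pyspark import SparkConf, SparkContext

    from awesome import DataIO, Featurize, Model  # modules from slide 20

    def main():
        # Parse CLI args & configure the Spark app
        parser = argparse.ArgumentParser()
        parser.add_argument("--input", required=True)
        parser.add_argument("--output", required=True)
        args = parser.parse_args()

        sc = SparkContext(conf=SparkConf().setAppName("awesome"))
        raw = DataIO.read(sc, args.input)        # read in data (hypothetical)
        features = Featurize.featurize(raw)      # raw data into features (hypothetical)
        model = Model.train(features)            # fancy maths with Spark (hypothetical)
        DataIO.write(model, args.output)         # write out data (hypothetical)
        sc.stop()

    if __name__ == "__main__":
        main()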
  22. Write Testable Code
      • Write a function for anything inside a transformation
      • Make it static
      • Separate feature generation or data standardization from your modeling
      Featurize.py:
          @staticmethod
          def label(single_record):
              ...
              return label_as_a_double

          @staticmethod
          def descriptive_name_of_feature1():
              ...
              return a_double

          @staticmethod
          def create_labeled_point(data_usage_rdd, sms_usage_rdd):
              ...
              return LabeledPoint(label, [feature1])
  23. Write Serializable Code (photo: https://flic.kr/p/za5cy)
      • Functions and the contexts they need to execute (closures) must be serializable
      • Keep functions simple. I suggest static methods.
      • Some things are impossiblish:
          • DB connections => use mapPartitions instead
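A sketch of that mapPartitions pattern: a live database connection can't be pickled and shipped, so open one per partition on the worker instead. connect_to_db, enrich, and raw_rdd are hypothetical stand-ins:

    def enrich_partition(records):
        conn = connect_to_db()  # hypothetical; created on the worker, never pickled
        try:
            for record in records:
                yield enrich(record, conn)  # hypothetical per-record lookup
        finally:
            conn.close()

    enriched = raw_rdd.mapPartitions(enrich_partition)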
  24. Testing with SparkTestingBase
      • Provides a SparkContext and configures the Spark master
      • Quiets Py4J
      • https://github.com/holdenk/spark-testing-base
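If you'd rather not take the dependency, a minimal hand-rolled version of the fixture the library provides looks roughly like this (local master, one context per test):

    # Hand-rolled fixture in the spirit of spark-testing-base; the library
    # linked above does this (and context reuse, Py4J quieting) for you.
    import unittest
    from pyspark import SparkConf, SparkContext

    class AwesomeTests(unittest.TestCase):
        def setUp(self):
            conf = SparkConf().setMaster("local[2]").setAppName("awesome-tests")
            self.sc = SparkContext(conf=conf)

        def tearDown(self):
            self.sc.stop()

        def test_count_survives_empty_partitions(self):
            # slide 25: test RDDs that contain empty partitions
            rdd = self.sc.parallelize([1.0, 2.0], numSlices=8)
            self.assertEqual(rdd.count(), 2)

    if __name__ == "__main__":
        unittest.main()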
  25. Testing Suggestions (photo: https://flic.kr/p/tucHHL)
      • Unit test as much as possible
      • Integration test the whole flow
      • Use a sample of real data
      • Test for:
          • Deviations of data from expected format
          • RDDs with empty partitions
          • Correctness of results
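The first bullet pairs nicely with slide 22: functions pulled out of transformations are plain Python and need no SparkContext to test. A sketch, with the same hypothetical to_series parser as before:

    # Pure functions from transformations can be unit tested without Spark.
    def to_series(line):  # hypothetical parser from the earlier slides
        return [float(tok) for tok in line.split(",") if tok.strip()]

    def test_to_series_tolerates_blank_fields():
        # slide 25: "deviations of data from expected format"
        assert to_series("1.0,,2.0") == [1.0, 2.0]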
  26. Best Practices for Running PySpark
  27. Writing distributed code is the easy part… Running it is hard.
  28. Get Serious About Logs
      • Get the YARN app id from the WebUI or console
      • yarn logs -applicationId <app-id>
      • Quiet down Py4J
      • Log records that have trouble getting processed
      • Earlier exceptions are more relevant than later ones
      • Look at both the Python and Java stack traces
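One way to quiet Py4J from the Python side; this assumes the "py4j" logger name, which is the namespace the py4j package logs under:

    import logging

    # Raise the py4j logger's level to cut Py4J's chatter in test/driver logs.
    logging.getLogger("py4j").setLevel(logging.ERROR)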
  29. Know your environment
      • You may want to use Python packages on your cluster
      • Actively manage dependencies on your cluster
      • Spark versions <1.4.0 require the same version of Python on driver and workers
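A quick way to check the last bullet empirically, assuming an existing SparkContext named sc: compare the driver's interpreter version against the distinct versions the workers report.

    import sys

    def partition_python_version(_):
        # runs on a worker; reports the interpreter that executed it
        import sys as worker_sys
        yield tuple(worker_sys.version_info[:3])

    driver_version = tuple(sys.version_info[:3])
    worker_versions = (sc.parallelize(range(8), 8)
                         .mapPartitions(partition_python_version)
                         .distinct()
                         .collect())
    print(driver_version, worker_versions)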
  30. Complex Dependencies (diagram slide; not captured in this transcript)
  31. Many Python Environments
      The path to the Python binary to use on the cluster can be set with PYSPARK_PYTHON. It can be set in spark-env.sh, for example to supply a default when the variable is unset:
          if [ -z "${PYSPARK_PYTHON}" ]; then
            export PYSPARK_PYTHON=<path>
          fi
      http://blog.cloudera.com/blog/2015/09/how-to-prepare-your-apache-hadoop-cluster-for-pyspark-jobs/
  32. Thank You. Questions?
      @j_houg
      juliet@cloudera.com
