SlideShare a Scribd company logo
1 of 34
Download to read offline
‹#›© Cloudera, Inc. All rights reserved.
Juliet Hougland
Sept 2015
@j_houg
PySpark Best Practices
‹#›© Cloudera, Inc. All rights reserved.
‹#›© Cloudera, Inc. All rights reserved.
• Core written, operates on the JVM
• Also has Python and Java APIs
• Hadoop Friendly
• Input from HDFS, HBase, Kafka
• Management via YARN
• Interactive REPL
• ML library == MLLib
Spark
‹#›© Cloudera, Inc. All rights reserved.
Spark MLLib
• Model building and eval
• Fast
• Basics covered
• LR, SVM, Decision tree
• PCA, SVD
• K-means
• ALS
• Algorithms expect RDDs of
consistent types (i.e.
LabeledPoints)
!
‹#›© Cloudera, Inc. All rights reserved.
RDDs
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
RDDs
…RDD …RDD
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
…RDD …RDD
RDDs
HDFS
Partition 1
Partition 2
Partition 3
Partition 4
Partition 1
Partition 2
Partition 3
Partition 4
…RDD
Partition 1
Partition 2
Partition 3
Partition 4
Count
Thanks: Kostas Sakellis
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
Spark Execution Model
‹#›© Cloudera, Inc. All rights reserved.
PySpark Execution Model
‹#›© Cloudera, Inc. All rights reserved.
PySpark Driver Program
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
Function closures need to be executed
on worker nodes by a python process.
‹#›© Cloudera, Inc. All rights reserved.
How do we ship around Python functions?
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
Pickle!
https://flic.kr/p/c8N4sE
‹#›© Cloudera, Inc. All rights reserved.
Pickle!
sc.textFile(“hdfs://…”, 4)
.map(to_series)
.filter(has_outlier)
.count()
‹#›© Cloudera, Inc. All rights reserved.
Best Practices for Writing PySpark
‹#›© Cloudera, Inc. All rights reserved.
REPLs and Notebooks
https://flic.kr/p/5hnPZp
‹#›© Cloudera, Inc. All rights reserved.
Share your code
https://flic.kr/p/sw2cnL
‹#›© Cloudera, Inc. All rights reserved.
Standard Python Project
my_pyspark_proj/
awesome/
__init__.py
bin/
docs/
setup.py
tests/
awesome_tests.py
__init__.py
‹#›© Cloudera, Inc. All rights reserved.
What is the shape of a PySpark job?
https://flic.kr/p/4vWP6U
‹#›© Cloudera, Inc. All rights reserved.
!
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
PySpark Structure?
https://flic.kr/p/ZW54
Shout out to my
colleagues in the UK
‹#›© Cloudera, Inc. All rights reserved.
PySpark Structure?
my_pyspark_proj/
awesome/
__init__.py
DataIO.py
Featurize.py
Model.py
bin/
docs/
setup.py
tests/
__init__.py
awesome_tests.py
resources/
data_source_sample.csv
!
• Parse CLI args & configure Spark App
• Read in data
• Raw data into features
• Fancy Maths with Spark
• Write out data
‹#›© Cloudera, Inc. All rights reserved.
Simple Main Method
‹#›© Cloudera, Inc. All rights reserved.
• Write a function for
anything inside an
transformation
• Make it static
• Separate Feature
generation or data
standardization
from your modeling
Write Testable Code
Featurize.py
…
!
@static_method
def label(single_record):
…
return label_as_a_double
@static_method
def descriptive_name_of_feature1():
...
return a_double
!
@static_method
def create_labeled_point(data_usage_rdd, sms_usage_rdd):
...
return LabeledPoint(label, [feature1])
‹#›© Cloudera, Inc. All rights reserved.
• Functions and the contexts
they need to execute
(closures) must be
serializable
• Keep functions simple. I
suggest static methods.
• Some things are impossiblish
• DB connections => Use
mapPartitions instead
Write Serializable Code
https://flic.kr/p/za5cy
‹#›© Cloudera, Inc. All rights reserved.
• Provides a SparkContext
configures Spark master
• Quiets Py4J
• https://github.com/holdenk/
spark-testing-base
Testing with SparkTestingBase
‹#›© Cloudera, Inc. All rights reserved.
• Unit test as much as possible
• Integration test the whole flow
!
• Test for:
• Deviations of data from
expected format
• RDDs with an empty partitions
• Correctness of results
Testing Suggestions
https://flic.kr/p/tucHHL
‹#›© Cloudera, Inc. All rights reserved.
Best Practices for Running PySpark
‹#›© Cloudera, Inc. All rights reserved.
Writing distributed code is the easy part…
Running it is hard.
‹#›© Cloudera, Inc. All rights reserved.
Get Serious About Logs
• Get the YARN app id from
the WebUI or Console
• yarn logs <app-id>
• Quiet down Py4J
• Log records that have
trouble getting processed
• Earlier exceptions more
relevant than later ones
• Look at both the Python
and Java stack traces
‹#›© Cloudera, Inc. All rights reserved.
Know your environment
• You may want to use
python packages on your
cluster
• Actively manage
dependencies on your
cluster
• Anaconda or virtualenv is
good for this.
• Spark versions <1.4.0
require the same version of
Python on driver and
workers
‹#›© Cloudera, Inc. All rights reserved.
Complex Dependencies
‹#›© Cloudera, Inc. All rights reserved.
Many Python Environments Path to Python binary to use
on the cluster can be set with
PYSPARK_PYTHON
!
Can be set it in spark-env.sh
if [ -n “${PYSPARK_PYTHON}" ]; then
export PYSPARK_PYTHON=<path>
fi
‹#›© Cloudera, Inc. All rights reserved.
Thank You
Questions?
!
@j_houg

More Related Content

What's hot

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupDatabricks
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsDatabricks
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Databricks
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introductioncolorant
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Edureka!
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsDatabricks
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsDatabricks
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDatabricks
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta LakeDatabricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideWhizlabs
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...Edureka!
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta LakeKnoldus Inc.
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaEdureka!
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeDatabricks
 

What's hot (20)

Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Spark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark MeetupSpark SQL Deep Dive @ Melbourne Spark Meetup
Spark SQL Deep Dive @ Melbourne Spark Meetup
 
Understanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIsUnderstanding Query Plans and Spark UIs
Understanding Query Plans and Spark UIs
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
Improving SparkSQL Performance by 30%: How We Optimize Parquet Pushdown and P...
 
Spark shuffle introduction
Spark shuffle introductionSpark shuffle introduction
Spark shuffle introduction
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
Pyspark Tutorial | Introduction to Apache Spark with Python | PySpark Trainin...
 
Best Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale PlatformsBest Practices for Enabling Speculative Execution on Large Scale Platforms
Best Practices for Enabling Speculative Execution on Large Scale Platforms
 
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and PitfallsRunning Apache Spark on Kubernetes: Best Practices and Pitfalls
Running Apache Spark on Kubernetes: Best Practices and Pitfalls
 
Deep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache SparkDeep Dive: Memory Management in Apache Spark
Deep Dive: Memory Management in Apache Spark
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Learn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive GuideLearn Apache Spark: A Comprehensive Guide
Learn Apache Spark: A Comprehensive Guide
 
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
What is Apache Spark | Apache Spark Tutorial For Beginners | Apache Spark Tra...
 
Spark with Delta Lake
Spark with Delta LakeSpark with Delta Lake
Spark with Delta Lake
 
What Is RDD In Spark? | Edureka
What Is RDD In Spark? | EdurekaWhat Is RDD In Spark? | Edureka
What Is RDD In Spark? | Edureka
 
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at RuntimeAdaptive Query Execution: Speeding Up Spark SQL at Runtime
Adaptive Query Execution: Speeding Up Spark SQL at Runtime
 

Viewers also liked

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at ClouderaDataconomy Media
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupFrens Jan Rumph
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Jon Haddad
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache SparkCloudera, Inc.
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
 

Viewers also liked (6)

"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
"Petascale Genomics with Spark", Sean Owen,Director of Data Science at Cloudera
 
PySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark MeetupPySpark Cassandra - Amsterdam Spark Meetup
PySpark Cassandra - Amsterdam Spark Meetup
 
Intro to py spark (and cassandra)
Intro to py spark (and cassandra)Intro to py spark (and cassandra)
Intro to py spark (and cassandra)
 
Spark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer AgarwalSpark Summit EU talk by Sameer Agarwal
Spark Summit EU talk by Sameer Agarwal
 
Anomaly Detection with Apache Spark
Anomaly Detection with Apache SparkAnomaly Detection with Apache Spark
Anomaly Detection with Apache Spark
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 

Similar to PySpark Best Practices for Writing, Testing and Running Distributed Python Code

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Cloudera, Inc.
 
5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 Minutes5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 MinutesCloudera, Inc.
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkJeremy Beard
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsDr. Mirko Kämpf
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsDr. Mirko Kämpf
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduJeremy Beard
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLhuguk
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseCloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Cloudera, Inc.
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedwhoschek
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform WebinarCloudera, Inc.
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark OperationsCloudera, Inc.
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingHari Shreedharan
 

Similar to PySpark Best Practices for Writing, Testing and Running Distributed Python Code (20)

Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
Introduction to Machine Learning on Apache Spark MLlib by Juliet Hougland, Se...
 
5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 Minutes5 Apache Spark Tips in 5 Minutes
5 Apache Spark Tips in 5 Minutes
 
Building Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache SparkBuilding Efficient Pipelines in Apache Spark
Building Efficient Pipelines in Apache Spark
 
Apache Spark in Scientific Applciations
Apache Spark in Scientific ApplciationsApache Spark in Scientific Applciations
Apache Spark in Scientific Applciations
 
Apache Spark in Scientific Applications
Apache Spark in Scientific ApplicationsApache Spark in Scientific Applications
Apache Spark in Scientific Applications
 
Next-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
 
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and KuduBuilding Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
 
Data Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry Analytics
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Data Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
 
Spark etl
Spark etlSpark etl
Spark etl
 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015Real Time Data Processing using Spark Streaming | Data Day Texas 2015
Real Time Data Processing using Spark Streaming | Data Day Texas 2015
 
Data Science and CDSW
Data Science and CDSWData Science and CDSW
Data Science and CDSW
 
Ingesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmedIngesting hdfs intosolrusingsparktrimmed
Ingesting hdfs intosolrusingsparktrimmed
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Spark One Platform Webinar
Spark One Platform WebinarSpark One Platform Webinar
Spark One Platform Webinar
 
Apache Spark Operations
Apache Spark OperationsApache Spark Operations
Apache Spark Operations
 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

 
Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 

More from Cloudera, Inc.

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxCloudera, Inc.
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera, Inc.
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards FinalistsCloudera, Inc.
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Cloudera, Inc.
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Cloudera, Inc.
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Cloudera, Inc.
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Cloudera, Inc.
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Cloudera, Inc.
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Cloudera, Inc.
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Cloudera, Inc.
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Cloudera, Inc.
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Cloudera, Inc.
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Cloudera, Inc.
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformCloudera, Inc.
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Cloudera, Inc.
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Cloudera, Inc.
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Cloudera, Inc.
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Cloudera, Inc.
 

More from Cloudera, Inc. (20)

Partner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptxPartner Briefing_January 25 (FINAL).pptx
Partner Briefing_January 25 (FINAL).pptx
 
Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists Cloudera Data Impact Awards 2021 - Finalists
Cloudera Data Impact Awards 2021 - Finalists
 
2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists2020 Cloudera Data Impact Awards Finalists
2020 Cloudera Data Impact Awards Finalists
 
Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019Edc event vienna presentation 1 oct 2019
Edc event vienna presentation 1 oct 2019
 
Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19Machine Learning with Limited Labeled Data 4/3/19
Machine Learning with Limited Labeled Data 4/3/19
 
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19Data Driven With the Cloudera Modern Data Warehouse 3.19.19
Data Driven With the Cloudera Modern Data Warehouse 3.19.19
 
Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19Introducing Cloudera DataFlow (CDF) 2.13.19
Introducing Cloudera DataFlow (CDF) 2.13.19
 
Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19Introducing Cloudera Data Science Workbench for HDP 2.12.19
Introducing Cloudera Data Science Workbench for HDP 2.12.19
 
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
Shortening the Sales Cycle with a Modern Data Warehouse 1.30.19
 
Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19Leveraging the cloud for analytics and machine learning 1.29.19
Leveraging the cloud for analytics and machine learning 1.29.19
 
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
Modernizing the Legacy Data Warehouse – What, Why, and How 1.23.19
 
Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18Leveraging the Cloud for Big Data Analytics 12.11.18
Leveraging the Cloud for Big Data Analytics 12.11.18
 
Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3Modern Data Warehouse Fundamentals Part 3
Modern Data Warehouse Fundamentals Part 3
 
Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2Modern Data Warehouse Fundamentals Part 2
Modern Data Warehouse Fundamentals Part 2
 
Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1Modern Data Warehouse Fundamentals Part 1
Modern Data Warehouse Fundamentals Part 1
 
Extending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the PlatformExtending Cloudera SDX beyond the Platform
Extending Cloudera SDX beyond the Platform
 
Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18Federated Learning: ML with Privacy on the Edge 11.15.18
Federated Learning: ML with Privacy on the Edge 11.15.18
 
Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360Analyst Webinar: Doing a 180 on Customer 360
Analyst Webinar: Doing a 180 on Customer 360
 
Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18Build a modern platform for anti-money laundering 9.19.18
Build a modern platform for anti-money laundering 9.19.18
 
Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18Introducing the data science sandbox as a service 8.30.18
Introducing the data science sandbox as a service 8.30.18
 

Recently uploaded

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsMehedi Hasan Shohan
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationkaushalgiri8080
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfkalichargn70th171
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...ICS
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...aditisharan08
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdfWave PLM
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfjoe51371421
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 

Recently uploaded (20)

chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
XpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software SolutionsXpertSolvers: Your Partner in Building Innovative Software Solutions
XpertSolvers: Your Partner in Building Innovative Software Solutions
 
Project Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanationProject Based Learning (A.I).pptx detail explanation
Project Based Learning (A.I).pptx detail explanation
 
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdfLearn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
Learn the Fundamentals of XCUITest Framework_ A Beginner's Guide.pdf
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
The Real-World Challenges of Medical Device Cybersecurity- Mitigating Vulnera...
 
Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...Unit 1.1 Excite Part 1, class 9, cbse...
Unit 1.1 Excite Part 1, class 9, cbse...
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf5 Signs You Need a Fashion PLM Software.pdf
5 Signs You Need a Fashion PLM Software.pdf
 
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Naraina Delhi 💯Call Us 🔝8264348440🔝
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 
why an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdfwhy an Opensea Clone Script might be your perfect match.pdf
why an Opensea Clone Script might be your perfect match.pdf
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 

PySpark Best Practices for Writing, Testing and Running Distributed Python Code

  • 1. ‹#›© Cloudera, Inc. All rights reserved. Juliet Hougland Sept 2015 @j_houg PySpark Best Practices
  • 2. ‹#›© Cloudera, Inc. All rights reserved.
  • 3. ‹#›© Cloudera, Inc. All rights reserved. • Core written, operates on the JVM • Also has Python and Java APIs • Hadoop Friendly • Input from HDFS, HBase, Kafka • Management via YARN • Interactive REPL • ML library == MLLib Spark
  • 4. ‹#›© Cloudera, Inc. All rights reserved. Spark MLLib • Model building and eval • Fast • Basics covered • LR, SVM, Decision tree • PCA, SVD • K-means • ALS • Algorithms expect RDDs of consistent types (i.e. LabeledPoints) !
  • 5. ‹#›© Cloudera, Inc. All rights reserved. RDDs sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count() HDFS Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis
  • 6. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 7. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 8. ‹#›© Cloudera, Inc. All rights reserved. RDDs …RDD …RDD HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 9. ‹#›© Cloudera, Inc. All rights reserved. …RDD …RDD RDDs HDFS Partition 1 Partition 2 Partition 3 Partition 4 Partition 1 Partition 2 Partition 3 Partition 4 …RDD Partition 1 Partition 2 Partition 3 Partition 4 Count Thanks: Kostas Sakellis sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 10. ‹#›© Cloudera, Inc. All rights reserved. Spark Execution Model
  • 11. ‹#›© Cloudera, Inc. All rights reserved. PySpark Execution Model
  • 12. ‹#›© Cloudera, Inc. All rights reserved. PySpark Driver Program sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count() Function closures need to be executed on worker nodes by a python process.
  • 13. ‹#›© Cloudera, Inc. All rights reserved. How do we ship around Python functions? sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 14. ‹#›© Cloudera, Inc. All rights reserved. Pickle! https://flic.kr/p/c8N4sE
  • 15. ‹#›© Cloudera, Inc. All rights reserved. Pickle! sc.textFile(“hdfs://…”, 4) .map(to_series) .filter(has_outlier) .count()
  • 16. ‹#›© Cloudera, Inc. All rights reserved. Best Practices for Writing PySpark
  • 17. ‹#›© Cloudera, Inc. All rights reserved. REPLs and Notebooks https://flic.kr/p/5hnPZp
  • 18. ‹#›© Cloudera, Inc. All rights reserved. Share your code https://flic.kr/p/sw2cnL
  • 19. ‹#›© Cloudera, Inc. All rights reserved. Standard Python Project my_pyspark_proj/ awesome/ __init__.py bin/ docs/ setup.py tests/ awesome_tests.py __init__.py
  • 20. ‹#›© Cloudera, Inc. All rights reserved. What is the shape of a PySpark job? https://flic.kr/p/4vWP6U
  • 21. ‹#›© Cloudera, Inc. All rights reserved. ! • Parse CLI args & configure Spark App • Read in data • Raw data into features • Fancy Maths with Spark • Write out data PySpark Structure? https://flic.kr/p/ZW54 Shout out to my colleagues in the UK
  • 22. ‹#›© Cloudera, Inc. All rights reserved. PySpark Structure? my_pyspark_proj/ awesome/ __init__.py DataIO.py Featurize.py Model.py bin/ docs/ setup.py tests/ __init__.py awesome_tests.py resources/ data_source_sample.csv ! • Parse CLI args & configure Spark App • Read in data • Raw data into features • Fancy Maths with Spark • Write out data
  • 23. ‹#›© Cloudera, Inc. All rights reserved. Simple Main Method
  • 24. ‹#›© Cloudera, Inc. All rights reserved. • Write a function for anything inside an transformation • Make it static • Separate Feature generation or data standardization from your modeling Write Testable Code Featurize.py … ! @static_method def label(single_record): … return label_as_a_double @static_method def descriptive_name_of_feature1(): ... return a_double ! @static_method def create_labeled_point(data_usage_rdd, sms_usage_rdd): ... return LabeledPoint(label, [feature1])
  • 25. ‹#›© Cloudera, Inc. All rights reserved. • Functions and the contexts they need to execute (closures) must be serializable • Keep functions simple. I suggest static methods. • Some things are impossiblish • DB connections => Use mapPartitions instead Write Serializable Code https://flic.kr/p/za5cy
  • 26. ‹#›© Cloudera, Inc. All rights reserved. • Provides a SparkContext configures Spark master • Quiets Py4J • https://github.com/holdenk/ spark-testing-base Testing with SparkTestingBase
  • 27. ‹#›© Cloudera, Inc. All rights reserved. • Unit test as much as possible • Integration test the whole flow ! • Test for: • Deviations of data from expected format • RDDs with an empty partitions • Correctness of results Testing Suggestions https://flic.kr/p/tucHHL
  • 28. ‹#›© Cloudera, Inc. All rights reserved. Best Practices for Running PySpark
  • 29. ‹#›© Cloudera, Inc. All rights reserved. Writing distributed code is the easy part… Running it is hard.
  • 30. ‹#›© Cloudera, Inc. All rights reserved. Get Serious About Logs • Get the YARN app id from the WebUI or Console • yarn logs <app-id> • Quiet down Py4J • Log records that have trouble getting processed • Earlier exceptions more relevant than later ones • Look at both the Python and Java stack traces
  • 31. ‹#›© Cloudera, Inc. All rights reserved. Know your environment • You may want to use python packages on your cluster • Actively manage dependencies on your cluster • Anaconda or virtualenv is good for this. • Spark versions <1.4.0 require the same version of Python on driver and workers
  • 32. ‹#›© Cloudera, Inc. All rights reserved. Complex Dependencies
  • 33. ‹#›© Cloudera, Inc. All rights reserved. Many Python Environments Path to Python binary to use on the cluster can be set with PYSPARK_PYTHON ! Can be set it in spark-env.sh if [ -n “${PYSPARK_PYTHON}" ]; then export PYSPARK_PYTHON=<path> fi
  • 34. ‹#›© Cloudera, Inc. All rights reserved. Thank You Questions? ! @j_houg