10.03.19, 17:15 Life of PySpark
A TALE OF TWO ENVIRONMENTS
LIFE OF PYSPARK
Mohanababu Sathyakumari Shankar
CONTENTS
Who I am!
A Brief History of Spark
Grapes of Spark
The Metamorphosis
Brave New PySpark
To Kill a Mocking Bear
Pride and Production
Sense and Scalability
A Song of Scala and Python
The Finkler Questions
The Sense of an Ending
WHO I AM
Data Engineer by day
Data Scientist by night
Data Geek, all day long
Natural habitat: KI labs
MSc Computer Science, TU München
Software Engineer, Oracle Financial Services Software, Bangalore
A BRIEF HISTORY OF SPARK
A BRIEF HISTORY OF SPARK
MAPREDUCE AND RECYCLE
Slow - Disk based access
Cumbersome programming - More lines of code
Abstractions-less - Default
Batch processing - Nothing more
Built-in Interactive mode - Not available
Support - Java, Python (verbose)
A BRIEF HISTORY OF SPARK
Unified analytics engine
Matei Zaharia
AMPLab, UC Berkeley
October 2012, Spark v0.6.0
June 2013, Apache Incubator
A BRIEF HISTORY OF SPARK
Fast - Processing in-memory
Concise - Fewer lines of code
Special Abstractions - RDDs ++
Stream and batch processing
Built-in Interactive mode
Java, Scala, Python & R
GRAPES OF SPARK
GRAPES OF SPARK
FASTER PROCESSING
Stream and batch processing
In-memory processing
DAG - Non-linear flow
Lazy Evaluation
Catalyst - Query optimiser
GRAPES OF SPARK
EASE OF USE
Concise - Fewer lines of code
Support - Java, Scala, Python & R
80 high-level operators
Built-in Interactive mode
Numerous projects atop Spark
GRAPES OF SPARK
DIVERSITY
Leverages multiple libraries
Spark SQL - SQL-styled processing
Spark Streaming - streaming data
MLlib - Machine Learning
GraphX and GraphFrames - Graphs
BlinkDB/Tachyon - SQL Analytics
GRAPES OF SPARK
RDD
Primary abstraction in Spark
Created - File
Created - RDDs
Immutable - Read-only
Partitioned - Across nodes
Distributed - Parallel
Fault-tolerant - Lineage
Object collection - Java/Scala
GRAPES OF SPARK
DATAFRAMES
Inspired by DFs in R/Python
Relational - Table structure
Named - Columns
Schema - Defined by a schema
SQL - API, build query plans
Catalyst - Query optimiser
GRAPES OF SPARK
DATASETS
Best features of RDDs and DFs
Unnamed - Columns
Schema-less - No schema
Non-relational - No table
Type safe - Compile-time type safety
THE METAMORPHOSIS
THE METAMORPHOSIS
TRANSFORMATIONS
Operates on RDDs and DFs
Creates - new RDDs
RDD Lineage - a DAG
Lazy Evaluation
map, filter, groupBy, sortBy
union, intersection, distinct
THE METAMORPHOSIS
ACTIONS
Operates on RDDs and DFs
Functions - applied on RDDs
Triggered - No new RDDs
Lazy Evaluation - Initiator
count, reduce, collect
aggregate, first, take, sum
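The split between transformations and actions can be illustrated without a cluster. What follows is a toy model in plain Python, not Spark's implementation: the class name LazyCollection and its internals are hypothetical, and only the method names mirror the RDD operations listed on the slides. Transformations merely record a step in the lineage; nothing is computed until an action is called.

```python
class LazyCollection:
    """Toy stand-in for an RDD: transformations are recorded, actions run them."""

    def __init__(self, source, lineage=()):
        self._source = source      # the original data, never modified
        self._lineage = lineage    # tuple of (name, function) steps

    # --- transformations: return a NEW object, compute nothing ---
    def map(self, fn):
        return LazyCollection(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyCollection(self._source, self._lineage + (("filter", pred),))

    # --- actions: walk the lineage and produce a result ---
    def _run(self):
        items = iter(self._source)
        for name, fn in self._lineage:
            items = map(fn, items) if name == "map" else filter(fn, items)
        return items

    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())

    def lineage(self):
        return [name for name, _ in self._lineage]


calls = []


def square(x):
    calls.append(x)    # record that work actually happened
    return x * x


numbers = LazyCollection(range(1, 6))
squares = numbers.map(square)                        # recorded only
even_squares = squares.filter(lambda x: x % 2 == 0)  # recorded only

calls_before_action = len(calls)    # nothing has run yet
result = even_squares.collect()     # the action triggers the whole pipeline
```

Because each transformation returns a new object and leaves its parent untouched, the recorded lineage is enough to recompute a result from the source, which is the idea behind lineage-based fault tolerance.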
BRAVE NEW PYSPARK
BRAVE NEW PYSPARK
PYTHON + SPARK
BRAVE NEW PYSPARK
TIME AND COMPLEXITY
BRAVE NEW PYSPARK
NOTEBOOK INTEGRATION
BRAVE NEW PYSPARK
SETUP
OPTION 1: DOWNLOAD TAR RELEASE
wget https://www.apache.org/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
export PATH="$PATH:$(pwd)/spark-2.4.0-bin-hadoop2.7/bin"
BRAVE NEW PYSPARK
SETUP
OPTION 2: USING BREW ON MACOS
brew install apache-spark
BRAVE NEW PYSPARK
SETUP
OPTION 3: USING PYPI
pip install pyspark
BRAVE NEW PYSPARK
SETUP
OPTION 4: USING CONDA
conda install -c conda-forge pyspark=2.3.1
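BRAVE NEW PYSPARK
SETUP
CONFIGURE AND START
## Running PySpark in cluster mode inside Jupyter
## Include additional python modules
## Spark 2.0+ rejects IPYTHON_OPTS; set the PYSPARK_DRIVER_PYTHON variables instead
PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark \
  --master spark://localhost:7077 \
  --executor-memory 7g \
  --py-files tensorflow-py2.7.egg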
BRAVE NEW PYSPARK
EASY TO PROTOTYPE
TO KILL A MOCKING BEAR
TO KILL A MOCKING BEAR
LOADING CSV
Pandas Dataframe
df = pd.read_csv("world_rankings.csv")
PySpark Dataframe
df = sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load("world_rankings.csv")
TO KILL A MOCKING BEAR
VIEW DATAFRAME
Pandas Dataframe
df
df.head(10)
PySpark Dataframe
df
df.show(10)
TO KILL A MOCKING BEAR
COLUMNS AND DATATYPES
Pandas Dataframe
df.columns
df.dtypes
PySpark Dataframe
df.columns
df.dtypes
TO KILL A MOCKING BEAR
DROP COLUMN
Pandas Dataframe
df.drop('column1', axis=1)
PySpark Dataframe
df.drop('column1')
TO KILL A MOCKING BEAR
FILL NULLS
Pandas Dataframe
df.fillna(0)
PySpark Dataframe
df.fillna(0)
TO KILL A MOCKING BEAR
AGGREGATION
Pandas Dataframe
df.groupby(['column1', 'column2']) \
    .agg({"column3": "mean", "column4": "min"})
PySpark Dataframe
df.groupby(['column1', 'column2']) \
    .agg({"column3": "mean", "column4": "min"})
TO KILL A MOCKING BEAR
MERGE/JOIN DATAFRAMES
Pandas Dataframe
left.merge(right, on='key')
left.merge(right, left_on='column1', right_on='column2')
PySpark Dataframe
left.join(right, on='key')
left.join(right, left.column1 == right.column2)
TO KILL A MOCKING BEAR
SUMMARY STATISTICS
Pandas Dataframe
df.describe()
PySpark Dataframe
df.describe().show()
TO KILL A MOCKING BEAR
RENAME COLUMNS
Pandas Dataframe
df.columns = ['C1', 'C2', 'C3']
df.rename(columns = {"C1": "c1", "C2": "c2", "C3": "c3"})
PySpark Dataframe
df.toDF('C1', 'C2', 'C3')
df.withColumnRenamed('C1', 'c1')
TO KILL A MOCKING BEAR
FILTER COLUMNS
Pandas Dataframe
df[(df.column1 < 10) & (df.column2 == 100)]
PySpark Dataframe
df.filter((df.column1 < 10) & (df.column2 == 100))
TO KILL A MOCKING BEAR
ADD COLUMN
Pandas Dataframe
df['column_inv'] = 1 / df.column
PySpark Dataframe
df.withColumn('column_inv', 1 / df.column)
TO KILL A MOCKING BEAR
STANDARD TRANSFORMATIONS
Pandas Dataframe
import numpy as np
df['log_values'] = np.log(df['values'])
PySpark Dataframe
import pyspark.sql.functions as F
df.withColumn('log_values', F.log(df['values']))
TO KILL A MOCKING BEAR
ROW CONDITIONAL STATEMENTS
Pandas Dataframe
df['conditional'] = df.apply(lambda x: 1 if x.column1 > 20
    else 10 if x.column2 == 100 else 42, axis=1)
PySpark Dataframe
import pyspark.sql.functions as F
df.withColumn('conditional',
    F.when(df.column1 > 20, 1)
    .when(df.column2 == 100, 10)
    .otherwise(42))
TO KILL A MOCKING BEAR
PIVOT TABLE
Pandas Dataframe
pd.pivot_table(df, values='column4',
    index=['column1', 'column2'], columns=['column3'],
    aggfunc=np.sum)
PySpark Dataframe
df.groupBy("column1", "column2").pivot("column3").sum("column4")
TO KILL A MOCKING BEAR
HISTOGRAM
Pandas Dataframe
df.hist()
PySpark Dataframe
df.sample(False, 0.1).toPandas().hist()
TO KILL A MOCKING BEAR
SQL QUERIES
Pandas Dataframe
Not Applicable
PySpark Dataframe
df.createOrReplaceTempView('TempTable')
df_query = spark.sql('select * from TempTable')
PRIDE AND PRODUCTION
PRIDE AND PRODUCTION
PYTHON FUNCTIONS IN SPARK
Iterate through complete data
Row-by-row access too slow
Distributed chunks of data
Production environment
No Conventional functions
PRIDE AND PRODUCTION
PYSPARK UDFS
(ROW-AT-A-TIME UDFS)
Primitive Python functions
map() and apply()
Output data type is specified
Series/Scalar operations only
Row-by-row access too slow
Non-vectorized ser/deser
PRIDE AND PRODUCTION
PANDAS UDFS
(VECTORIZED UDFS)
Optimised Python functions
Supports Pandas & Scikit-learn
Apache Arrow based
Vectorized ser/deser
Output data type required
PandasUDFType required
DataFrame Schema required
Scalar and Grouped Map
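Much of the gap between row-at-a-time and vectorized UDFs comes from how often data is serialised across the JVM/Python boundary. The sketch below mimics that boundary with pickle in plain Python. It is a deliberately simplified toy, not how Spark or Apache Arrow actually move data, and the names Boundary, row_at_a_time and batched are hypothetical.

```python
import pickle


class Boundary:
    """Counts serialisation round trips, standing in for the JVM <-> Python hop."""

    def __init__(self):
        self.round_trips = 0

    def send(self, obj):
        self.round_trips += 1
        return pickle.loads(pickle.dumps(obj))   # serialise, then deserialise


def row_at_a_time(values, fn, boundary):
    # One round trip and one function call per row.
    return [fn(boundary.send(v)) for v in values]


def batched(values, fn, boundary, batch_size):
    # One round trip per batch; fn receives a whole list at once.
    out = []
    for start in range(0, len(values), batch_size):
        batch = boundary.send(values[start:start + batch_size])
        out.extend(fn(batch))
    return out


values = list(range(10_000))

row_boundary = Boundary()
row_result = row_at_a_time(values, lambda v: v + 1, row_boundary)

batch_boundary = Boundary()
batch_result = batched(values, lambda vs: [v + 1 for v in vs],
                       batch_boundary, batch_size=1_000)
```

Both paths produce the same values; only the number of crossings differs (one per row versus one per batch), which is the cost a vectorized UDF is designed to amortise.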
DIFFERENCES
PRIDE AND PRODUCTION
SCALAR AND GROUPEDBY UDFS
PERFORMANCE
PRIDE AND PRODUCTION
SCALAR AND GROUPEDBY UDFS
PRIDE AND PRODUCTION
DBSCAN ON SPARK
Density-Based Spatial Clustering
Stay Points detection
Telematics data from trucks
Spark MLlib DBSCAN - No
O(n^2) Complexity
Implementations exist
Non-performant and with bugs
Scikit-learn required
ELKI - O(nlogn) - JAVA
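For reference, a naive DBSCAN fits in a few dozen lines. The sketch below is plain Python with a brute-force neighbour search, which is where the O(n^2) cost comes from. It is not the implementation used in the talk; the function name dbscan, the parameters eps and min_pts, and the sample points are assumptions made for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster label per point; NOISE (-1) marks outliers."""

    def neighbours(i):
        # Brute-force scan over all points: this is the O(n^2) part.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)    # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE        # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(nbrs)
    return labels


# Two tight groups of GPS-like points and one stray reading.
points = [(0.0, 0.0), (0.0, 0.1), (0.1, 0.0),
          (5.0, 5.0), (5.0, 5.1), (5.1, 5.0),
          (9.0, 0.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
```

In a stay-point setting, each dense cluster would correspond to a place where a vehicle lingered and the noise label to isolated readings; an indexed neighbour search is what brings the cost down towards O(n log n).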
PRIDE AND PRODUCTION
DBSCAN USING PANDAS UDF
SENSE AND SCALABILITY
SENSE AND SCALABILITY
SCALA UDFS
Driver in native Python
Non-native JVM objects
2x Ser/Deser required
Scala UDFs best approach
Spark v2.1
JVM only scope
Unnecessary Ser/Deser avoided
SENSE AND SCALABILITY
SCALA UDFS
Create Scala UDF as Scala project
Build JAR using SBT
Submit JAR to PySpark session
Register the Scala UDF
JVM only scope
SENSE AND SCALABILITY
Benchmark Python UDF vs Pandas UDF vs Scala UDF
A SONG OF SCALA AND PYTHON
A SONG OF SCALA AND PYTHON
PATCH-22
Python expertise is high
Spark MLlib not mature enough
Pandas and Scikit-learn required
Blackbox behaviour of UDFs
High-level column based usage
Objects conversion avoided
A SONG OF SCALA AND PYTHON
THE PY4J REDEMPTION
A SONG OF SCALA AND PYTHON
NO PYTHON FOR SPARK MAIN()
A SONG OF SCALA AND PYTHON
THE FINKLER QUESTIONS
THE SENSE OF AN ENDING

FP Growth Algorithm and its Applications
MaleehaSheikh2
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
StarCompliance.io
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 

Recently uploaded (20)

Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project PresentationPredicting Product Ad Campaign Performance: A Data Analysis Project Presentation
Predicting Product Ad Campaign Performance: A Data Analysis Project Presentation
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
Business update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMIBusiness update Q1 2024 Lar España Real Estate SOCIMI
Business update Q1 2024 Lar España Real Estate SOCIMI
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
FP Growth Algorithm and its Applications
FP Growth Algorithm and its ApplicationsFP Growth Algorithm and its Applications
FP Growth Algorithm and its Applications
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Investigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_CrimesInvestigate & Recover / StarCompliance.io / Crypto_Crimes
Investigate & Recover / StarCompliance.io / Crypto_Crimes
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 

Life of PySpark - A tale of two environments

  • 1. LIFE OF PYSPARK: A TALE OF TWO ENVIRONMENTS. Mohanababu Sathyakumari Shankar
  • 2. CONTENTS: Who I am!; A Brief History of Spark; Grapes of Spark; The Metamorphosis; Brave New PySpark; To Kill a Mocking Bear; Pride and Production; Sense and Scalability; A Song of Scala and Python; The Finkler Questions; The Sense of an Ending
  • 3. WHO I AM: Data Engineer by day, Data Scientist by night, Data Geek all day long. Natural habitat: KI labs. MSc Computer Science, TU München. Software Engineer, Oracle Financial Services Software, Bangalore
  • 4. A BRIEF HISTORY OF SPARK
  • 5. A BRIEF HISTORY OF SPARK: MAPREDUCE AND RECYCLE. Slow - Disk based access; Cumbersome programming - More lines of code; Abstractions-less - Default; Batch processing - Nothing more; Built-in Interactive mode - Not available; Support - Java, Python (verbose)
  • 7. A BRIEF HISTORY OF SPARK: Unified analytics engine. Matei Zaharia, AMPLab, UC Berkeley. October 2012, Spark v0.6.0. June 2013, Apache Incubator
  • 8. A BRIEF HISTORY OF SPARK: Fast - Processing in-memory; Concise - Fewer lines of code; Special Abstractions - RDDs ++; Stream and batch processing; Built-in Interactive mode; Java, Scala, Python & R
  • 9. GRAPES OF SPARK
  • 10. GRAPES OF SPARK: FASTER PROCESSING. Stream and batch processing; In-memory processing; DAG - Non-linear flow; Lazy Evaluation; Calcite - Query optimiser
  • 11. GRAPES OF SPARK: EASE OF USE. Concise - Fewer lines of code; Support - Java, Scala, Python & R; 80 high-level operators; Built-in Interactive mode; Numerous projects atop Spark
  • 12. GRAPES OF SPARK: DIVERSITY. Leverages multiple libraries; Spark SQL - SQL-styled processing; Spark Streaming - streaming data; MLlib - Machine Learning; GraphX and GraphFrames - Graphs; BlinkDB/Tachyon - SQL Analytics
  • 13. GRAPES OF SPARK: RDD. Primary abstraction in Spark; Created - File; Created - RDDs; Immutable - Read-only; Partitioned - Across nodes; Distributed - Parallel; Fault-tolerant - Lineage; Object collection - Java/Scala
  • 14. GRAPES OF SPARK: DATAFRAMES. Inspired by DFs in R/Python; Relational - Table structure; Named - Columns; Schema - Defined by a schema; SQL - API, build query plans; Catalyst - Query optimiser
  • 15. GRAPES OF SPARK: DATASETS. Best features of RDDs and DFs; Unnamed - Columns; Schema-less - No schema; Non-relational - No table; Type safe - Compile-time type safety
  • 16. THE METAMORPHOSIS
  • 17. THE METAMORPHOSIS: TRANSFORMATIONS. Operates on RDDs and DFs; Creates - new RDDs; RDD Lineage - a DAG; Lazy Evaluation; map, filter, groupBy, sortBy, union, intersection, distinct
  • 18. THE METAMORPHOSIS: ACTIONS. Operates on RDDs and DFs; Functions - applied on RDDs; Triggered - No new RDDs; Lazy Evaluation - Initiator; count, reduce, collect, aggregate, first, take, sum
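
The two slides above separate transformations, which only extend an RDD's lineage, from actions, which trigger the actual work. The following is a minimal pure-Python sketch of that idea. LazyCollection, square, and the sample data are hypothetical names chosen for illustration; this is a toy model under those assumptions, not Spark's implementation or API.

```python
class LazyCollection:
    """Toy stand-in for an RDD: records a lineage of steps, runs nothing until an action."""

    def __init__(self, source, lineage=()):
        self._source = source            # the original data
        self._lineage = tuple(lineage)   # ordered (kind, function) steps

    # Transformations: return a NEW collection with a longer lineage; no data is touched.
    def map(self, fn):
        return LazyCollection(self._source, self._lineage + (("map", fn),))

    def filter(self, fn):
        return LazyCollection(self._source, self._lineage + (("filter", fn),))

    def _evaluate(self):
        # Replay the recorded lineage over every source item.
        for item in self._source:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):       # a filter step rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: the only methods that actually run the lineage.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())


calls = []  # records every time the mapped function really executes


def square(x):
    calls.append(x)
    return x * x


numbers = LazyCollection(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has run yet
result = evens_squared.collect()   # the action replays the lineage
```

Because each transformation returns a new object, numbers itself is never modified, which mirrors the "Immutable" and "Lineage" points on the RDD slide.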
  • 19. BRAVE NEW PYSPARK
  • 20. BRAVE NEW PYSPARK: PYTHON + SPARK
  • 21. BRAVE NEW PYSPARK: TIME AND COMPLEXITY
  • 22. BRAVE NEW PYSPARK: NOTEBOOK INTEGRATION
  • 23. BRAVE NEW PYSPARK SETUP, OPTION 1: DOWNLOAD TAR RELEASE
      wget https://www.apache.org/spark-2.4.0/spark-2.4.0-bin-hadoop2.7.tgz
      tar -xzf spark-2.4.0-bin-hadoop2.7.tgz
      PATH="$PATH:$(pwd)/spark-2.4.0-bin-hadoop2.7/bin"
  • 25. BRAVE NEW PYSPARK SETUP, OPTION 2: USING BREW ON MACOS
      brew install apache-spark
  • 26. BRAVE NEW PYSPARK SETUP, OPTION 3: USING PYPI
      pip install pyspark
  • 27. BRAVE NEW PYSPARK SETUP, OPTION 4: USING CONDA
      conda install -c conda-forge pyspark=2.3.1
  • 28. BRAVE NEW PYSPARK SETUP: CONFIGURE AND START
      ## Running PySpark in cluster mode inside Jupyter
      ## Include additional python modules
      PYSPARK_DRIVER_PYTHON=jupyter PYSPARK_DRIVER_PYTHON_OPTS="notebook" pyspark --master spark://localhost:7077 --executor-memory 7g --py-files tensorflow-py2.7.egg
  • 30. BRAVE NEW PYSPARK: EASY TO PROTOTYPE
  • 31. TO KILL A MOCKING BEAR
  • 32. TO KILL A MOCKING BEAR: LOADING CSV
      Pandas Dataframe: df = pd.read_csv("world_rankings.csv")
      PySpark Dataframe: df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("world_rankings.csv")
  • 33. TO KILL A MOCKING BEAR: VIEW DATAFRAME
      Pandas Dataframe: df; df.head(10)
      PySpark Dataframe: df; df.show(10)
  • 34. TO KILL A MOCKING BEAR: COLUMNS AND DATATYPES
      Pandas Dataframe: df.columns; df.dtypes
      PySpark Dataframe: df.columns; df.dtypes
  • 35. TO KILL A MOCKING BEAR: DROP COLUMN
      Pandas Dataframe: df.drop('column1', axis=1)
      PySpark Dataframe: df.drop('column1')
  • 36. TO KILL A MOCKING BEAR: FILL NULLS
      Pandas Dataframe: df.fillna(0)
      PySpark Dataframe: df.fillna(0)
  • 37. TO KILL A MOCKING BEAR: AGGREGATION
      Pandas Dataframe: df.groupby(['column1', 'column2']).agg({"column3": "mean", "column4": "min"})
      PySpark Dataframe: df.groupby(['column1', 'column2']).agg({"column3": "mean", "column4": "min"})
  • 38. TO KILL A MOCKING BEAR: MERGE/JOIN DATAFRAMES
      Pandas Dataframe: left.merge(right, on='key'); left.merge(right, left_on='column1', right_on='column2')
      PySpark Dataframe: left.join(right, on='key'); left.join(right, left.column1 == right.column2)
  • 39. TO KILL A MOCKING BEAR: SUMMARY STATISTICS
      Pandas Dataframe: df.describe()
      PySpark Dataframe: df.describe().show()
  • 40. TO KILL A MOCKING BEAR: RENAME COLUMNS
      Pandas Dataframe: df.columns = ['C1', 'C2', 'C3']; df.rename(columns={"C1": "c1", "C2": "c2", "C3": "c3"})
      PySpark Dataframe: df.toDF('C1', 'C2', 'C3'); df.withColumnRenamed('C1', 'c1')
  • 41. TO KILL A MOCKING BEAR: FILTER COLUMNS
      Pandas Dataframe: df[(df.column1 < 10) & (df.column2 == 100)]
      PySpark Dataframe: df.filter((df.column1 < 10) & (df.column2 == 100))
  • 42. TO KILL A MOCKING BEAR: ADD COLUMN
      Pandas Dataframe: df['new_column'] = 1 / df.column
      PySpark Dataframe: df.withColumn('new_column', 1 / df.column)
  • 43. TO KILL A MOCKING BEAR: STANDARD TRANSFORMATIONS
      Pandas Dataframe: import numpy as np; df['log_values'] = np.log(df['values'])
      PySpark Dataframe: import pyspark.sql.functions as F; df.withColumn('log_values', F.log(df['values']))
  • 44. TO KILL A MOCKING BEAR: ROW CONDITIONAL STATEMENTS
      Pandas Dataframe: df['conditional'] = df.apply(lambda x: 1 if x.column1 > 20 else 10 if x.column2 == 100 else 42, axis=1)
      PySpark Dataframe: import pyspark.sql.functions as F; df.withColumn('conditional', F.when(df.column1 > 20, 1).when(df.column2 == 100, 10).otherwise(42))
  • 46. TO KILL A MOCKING BEAR: PIVOT TABLE
      Pandas Dataframe: pd.pivot_table(df, values='column4', index=['column1', 'column2'], columns=['column3'], aggfunc=np.sum)
      PySpark Dataframe: df.groupBy("column1", "column2").pivot("column3").sum("column4")
  • 47. TO KILL A MOCKING BEAR: HISTOGRAM
      Pandas Dataframe: df.hist()
      PySpark Dataframe: df.sample(False, 0.1).toPandas().hist()
  • 48. TO KILL A MOCKING BEAR: SQL QUERIES
      Pandas Dataframe: Not Applicable
      PySpark Dataframe: df.createOrReplaceTempView('TempTable'); df_query = spark.sql('select * from TempTable')
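
The ROW CONDITIONAL STATEMENTS slide pairs a nested Python conditional with a chained when/otherwise expression. Both rely on the same rule: branches are checked in order and the first match wins. The sketch below models only that rule in plain Python. when_chain and the sample rows are hypothetical, made up for illustration; it does not call pandas or PySpark.

```python
def when_chain(branches, otherwise):
    """Build a row function that returns the value of the first branch whose predicate holds."""
    def evaluate(row):
        for predicate, value in branches:
            if predicate(row):
                return value          # first matching branch wins
        return otherwise              # no branch matched
    return evaluate


# Same thresholds and values as on the slide.
conditional = when_chain(
    [
        (lambda row: row["column1"] > 20, 1),
        (lambda row: row["column2"] == 100, 10),
    ],
    otherwise=42,
)

rows = [
    {"column1": 25, "column2": 100},  # both predicates hold; the first branch is used
    {"column1": 5, "column2": 100},   # only the second predicate holds
    {"column1": 5, "column2": 7},     # neither holds; falls through to the default
]
results = [conditional(row) for row in rows]

# The nested conditional from the pandas column of the slide, applied to the same rows.
nested = [
    1 if row["column1"] > 20 else 10 if row["column2"] == 100 else 42
    for row in rows
]
```

The order of the branches matters: swapping the two predicates would change the value assigned to the first sample row.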
  • 49. PRIDE AND PRODUCTION
  • 50. PRIDE AND PRODUCTION: PYTHON FUNCTIONS IN SPARK. Iterate through complete data; Row-by-row access too slow; Distributed chunks of data; Production environment; No Conventional functions
  • 51. PRIDE AND PRODUCTION: PYSPARK UDFS (ROW-AT-A-TIME UDFS). Primitive Python functions; map() and apply(); Output data type is specified; Series/Scalar operations only; Row-by-row access too slow; Non-vectorized ser/deser
  • 52. PRIDE AND PRODUCTION: PANDAS UDFS (VECTORIZED UDFS). Optimised Python functions; Supports Pandas & Scikit-learn; Apache Arrow based; Vectorized ser/deser; Output data type required; PandasUDFType required; DataFrame Schema required; Scalar and Grouped Map
  • 53. PRIDE AND PRODUCTION: SCALAR AND GROUPEDBY UDFS, DIFFERENCES
  • 54. PRIDE AND PRODUCTION: SCALAR AND GROUPEDBY UDFS, PERFORMANCE
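
The UDF slides above attribute the cost of row-at-a-time UDFs to per-row access and serialization/deserialization, and the benefit of vectorized UDFs to moving data in batches. The sketch below is a deliberately simplified model of that difference using only the standard library: it counts how many serialize/deserialize hops are needed per value versus per batch. round_trip, row_at_a_time, batched, and the batch size are hypothetical; the code does not reproduce PySpark's actual wire protocol or Apache Arrow.

```python
import pickle


def round_trip(obj, counter):
    """Simulate one hop across a process boundary: serialize, then deserialize."""
    counter["round_trips"] += 1
    return pickle.loads(pickle.dumps(obj))


def row_at_a_time(values, fn, counter):
    """Ship every value to the worker and every result back individually."""
    out = []
    for v in values:
        arg = round_trip(v, counter)
        out.append(round_trip(fn(arg), counter))
    return out


def batched(values, fn, counter, batch_size):
    """Ship whole chunks of values at once, apply fn inside the chunk, ship results back."""
    out = []
    for start in range(0, len(values), batch_size):
        batch = round_trip(values[start:start + batch_size], counter)
        out.extend(round_trip([fn(v) for v in batch], counter))
    return out


def plus_one(v):
    return v + 1


values = list(range(1000))
row_counter = {"round_trips": 0}
batch_counter = {"round_trips": 0}

row_result = row_at_a_time(values, plus_one, row_counter)
batch_result = batched(values, plus_one, batch_counter, batch_size=250)
```

Both strategies produce the same output; only the number of hops differs (two per value in the first case, two per batch in the second), which is the overhead the vectorized approach is meant to reduce.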
  • 55. PRIDE AND PRODUCTION: DBSCAN ON SPARK. Density-Based Spatial Clustering; Stay Points detection; Telematics data from trucks; Spark MLlib - No; DBSCAN O(n^2) Complexity; Implementations exist; Non-performant and with bugs; Scikit-learn required; ELKI - O(nlogn) - JAVA
  • 56. PRIDE AND PRODUCTION: DBSCAN USING PANDAS UDF
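
The DBSCAN slide quotes O(n^2) complexity for the plain algorithm. A minimal pure-Python sketch shows where that comes from: every neighbourhood lookup is a linear scan over all points. This is a textbook-style DBSCAN written for illustration, not the speaker's Pandas UDF implementation (the slides do not show it). The sample coordinates, eps, and min_pts values are made up.

```python
import math

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of points[i]; a linear scan, so O(n) per call."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)        # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE            # may later be claimed as a border point
            continue
        cluster += 1                     # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point: joins the cluster, is not expanded
            if labels[j] is not None:
                continue                 # already assigned
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)   # j is also a core point: keep expanding
    return labels


# Hypothetical positions: two tight groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

With n points and one linear scan per point, the neighbour search alone is quadratic; implementations that reach O(n log n) replace region_query with a spatial index lookup.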
  • 57. SENSE AND SCALABILITY
  • 58. SENSE AND SCALABILITY: SCALA UDFS. Driver in native Python; Non-native JVM objects; 2x Ser/Deser required; Scala UDFs best approach; Spark v2.1; JVM only scope; Unnecessary Ser/Deser avoided
  • 59. SENSE AND SCALABILITY: SCALA UDFS. Create Scala UDF as Scala project; Build JAR using SBT; Submit JAR to PySpark session; Register the Scala UDF; JVM only scope
  • 60. SENSE AND SCALABILITY: Benchmark, Python UDF vs Pandas UDF vs Scala UDF
  • 61. A SONG OF SCALA AND PYTHON
  • 62. A SONG OF SCALA AND PYTHON: PATCH-22. Python expertise is high; Spark MLlib not mature enough; Pandas and Scikit-learn required; Blackbox behaviour of UDFs; High-level column based usage; Objects conversion avoided
  • 63. A SONG OF SCALA AND PYTHON: THE PY4J REDEMPTION
  • 64. A SONG OF SCALA AND PYTHON: NO PYTHON FOR SPARK MAIN()
  • 65. A SONG OF SCALA AND PYTHON
  • 67. THE FINKLER QUESTIONS
  • 68. THE SENSE OF AN ENDING