Sparkling Pandas
Scaling Pandas beyond a single machine
(or letting Pandas Roam)
With Special thanks to Juliet Hougland :)
Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
○ previously Databricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
What is Pandas?
user_id  panda_type
01234    giant
12345    red
23456    giant
34567    giant
45678    red
56789    giant
● DataFrames--Indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
What is Spark?
Fast, general engine for in-memory data processing.
tl;dr - 100x faster than Hadoop MapReduce*
The different pieces of Spark
● Apache Spark (core)
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: Bagel & GraphX
● Spark ML & MLlib
● Community packages
Some Spark terms
Spark Context (aka sc)
● The window to the world of Spark
sqlContext
● The window to the world of DataFrames
Transformation
● Takes an RDD (or DataFrame) and returns a new RDD
or DataFrame
Action
● Causes an RDD to be evaluated (often storing the
result)
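The split between transformations and actions can be sketched in plain Python. This is a toy model only: `LazyDataset` is a hypothetical stand-in and not part of Spark's API. It records each transformation and does no work until the `collect` action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = tuple(steps)

    # Transformations: return a new LazyDataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._steps + (("filter", pred),))

    # Action: run the recorded pipeline and return the results.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._steps:
            if kind == "map":
                items = map(fn, items)
            else:
                items = filter(fn, items)
        return list(items)


calls = []

def double(x):
    calls.append(x)  # record every element that is actually processed
    return x * 2

numbers = LazyDataset(range(5))
pipeline = numbers.map(double).filter(lambda x: x > 2)
calls_before_action = len(calls)  # nothing has been evaluated so far
result = pipeline.collect()       # the action triggers the work
```

Building `pipeline` costs almost nothing; `double` only runs once `collect` is called.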
Dataframes between Spark & Pandas
Spark
● Fast
● Distributed
● Limited API
● Some ML
● I/O Options
● Not indexed
Pandas
● Fast
● Single Machine
● Full Feature API
● Integration with ML
● Different I/O Options
● Indexed
● Easy to visualize
Panda image by Peter Beardsley
Simple Spark SQL Example
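input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
# The trailing space keeps "retweetCount" and "FROM" separate
# once the two string pieces are joined.
topTweets = sqlContext.sql("SELECT text, retweetCount " +
    "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()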
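The shape of this query can be tried locally without a cluster. The sketch below swaps Spark SQL for Python's built-in `sqlite3` module, so it is a stand-in rather than the Spark API, and the sample tweets are made up for illustration. It also shows that a plain `ORDER BY` sorts ascending, so a "top tweets" query usually wants `DESC`.

```python
import sqlite3

# Hypothetical sample data; the slide reads tweets from a JSON file instead.
tweets = [
    ("pandas are great", 5),
    ("spark is fast", 42),
    ("hello world", 0),
    ("sparkling pandas", 17),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tweets (text TEXT, retweetCount INTEGER)")
conn.executemany("INSERT INTO tweets VALUES (?, ?)", tweets)


def top_tweets(conn, limit, descending):
    """Return (text, retweetCount) rows ordered by retweet count."""
    order = "DESC" if descending else "ASC"
    query = ("SELECT text, retweetCount "
             "FROM tweets ORDER BY retweetCount " + order + " LIMIT ?")
    return conn.execute(query, (limit,)).fetchall()


least_retweeted = top_tweets(conn, 2, descending=False)
most_retweeted = top_tweets(conn, 2, descending=True)
```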
Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
return transformedDdf.toPandas()
Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
# createDataFrame accepts a local Pandas DataFrame
ddf = sqlContext.createDataFrame(df)
Let’s combine the two
● Spark DataFrames already provide some of what we need
○ Add UDFs / UDAFs
○ Use bits of Pandas code
● http://spark-packages.org - excellent place to get libraries
So where does the PB&J go?
[Architecture diagram. Boxes: Spark DataFrame; Sparkling Pandas API; Custom UDFs; Pandas Code; Sparkling Pandas Scala Code; PySpark RDDs; Pandas Code; Internal State]
Extending Spark - adding index support
self._index_names
def collect(self):
    """Collect the elements in a DataFrame
    and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df
Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
Adding a UDF in Python
from pyspark.sql.types import IntegerType
sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))
  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
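For readers unfamiliar with the statistic being exposed here, the sketch below computes kurtosis in plain Python. The slides do not say which definition the Scala `Kurtosis` expression uses, so the population (biased) form with an optional excess correction is an assumption, and the function is illustrative rather than Sparkling Pandas code.

```python
def kurtosis(values, excess=True):
    """Population kurtosis: fourth central moment over squared variance.

    With excess=True, 3 is subtracted so a normal distribution scores 0.
    Assumption: the slides do not state which variant the library uses.
    """
    n = len(values)
    if n == 0:
        raise ValueError("kurtosis of an empty sequence is undefined")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    if m2 == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    k = m4 / (m2 * m2)
    return k - 3.0 if excess else k


sample = [1, 2, 3, 4, 5]
excess_k = kurtosis(sample)
plain_k = kurtosis(sample, excess=False)
```

A distributed version would compute the same moments as aggregates over partitions rather than over a local list.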
Extending Spark SQL w/Scala for fun & profit
from pyspark import SparkContext
from pyspark.sql import Column

def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        # Look up the Scala function by name on the JVM side
        f = sc._jvm.com.sparklingpandas.functions
        jc = getattr(f, name)(col._jc if isinstance(col, Column) else
                              col)
        return Column(jc)
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
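The factory pattern on this slide can be tried without a JVM. In the sketch below, `FakeBackend` is a hypothetical stand-in for the `sc._jvm.com.sparklingpandas.functions` object, and the wrapper returns a string instead of a `Column`. Setting `__name__` and `__doc__` on the wrapper is an addition that the slide's code does not show.

```python
class FakeBackend:
    """Hypothetical stand-in for the JVM-side `functions` object."""

    @staticmethod
    def kurtosis(col):
        return "kurtosis(" + str(col) + ")"


def _create_function(backend, name, doc=""):
    """Build a Python wrapper that forwards to backend.<name>."""
    def _(col):
        # Resolve the target by name at call time, as the slide's code does.
        return getattr(backend, name)(col)
    _.__name__ = name
    _.__doc__ = doc
    return _


_functions = {
    "kurtosis": "Calculate the kurtosis, maybe!",
}

# One wrapper per entry, keyed by function name.
generated = {
    name: _create_function(FakeBackend, name, doc)
    for name, doc in _functions.items()
}
wrapped_kurtosis = generated["kurtosis"]
description = wrapped_kurtosis("age")
```

Adding another Scala function then only needs a new entry in `_functions`, provided the backend object has an attribute of the same name.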
Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt
plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
(Not yet merged in)
Why is SparklingPandas fast*?
Keep stuff in the JVM as much as
possible.
Lazy operations
Distributed
*For really flexible versions of the word fast
Coffee by eltpics
Panda image by Stéfan
Panda image by cactusroot
Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
Always onwards and upwards
[Chart: work done (y-axis) against time (x-axis), from "Now" to a "Hypothetical, Wonderful Future"]
Related Works
Blaze
● http://continuum.io/blog/blaze
Adatao’s Distributed DataFrame
● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us
Numba
● http://numba.pydata.org/
Using Sparkling Pandas
You can get Sparkling Pandas from
● Website:
http://www.sparklingpandas.com
● Code:
https://github.com/sparklingpandas/sparklingpandas
● Mailing List
https://groups.google.com/d/forum/sparklingpandas
Getting Sparkling Pandas friends
The examples from this will get merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/
many pandas by David Goehring
Any questions?

Sparkling pandas Letting Pandas Roam - PyData Seattle 2015

  • 1. Sparkling Pandas Scaling Pandas beyond a single machine (or letting Pandas Roam) With Special thanks to Juliet Hougland :)
  • 2. Sparkling Pandas Scaling Pandas beyond a single machine (or letting Pandas Roam) With Special thanks to Juliet Hougland :)
  • 3. Who am I? Holden ● I prefer she/her for pronouns ● Co-author of the Learning Spark book ● Engineer at Alpine Data Labs ○ previously DataBricks, Google, Foursquare, Amazon ● @holdenkarau ● http://www.slideshare.net/hkarau ● https://www.linkedin.com/in/holdenkarau
  • 4. What is Pandas? user_id panda_ty pe 01234 giant 12345 red 23456 giant 34567 giant 45678 red 56789 giant ● DataFrames--Indexed, tabular data structures ● Easy slicing, indexing, subsetting/filtering ● Excellent support for time series data ● Data alignment and reshaping http://pandas.pydata.org/
  • 5. What is Spark? Fast general engine for in memory data processing. tl;dr - 100x faster than Hadoop MapReduce*
  • 6. The different pieces of Spark Apache Spark SQL & DataFrames Streaming Language APIs (Scala, Java, Python, & R) Graph Tools (Bagel & GraphX) Spark ML MLlib Community Packages
  • 7. Some Spark terms Spark Context (aka sc) ● The window to the world of Spark sqlContext ● The window to the world of DataFrames Transformation ● Takes an RDD (or DataFrame) and returns a new RDD or DataFrame Action ● Causes an RDD to be evaluated (often storing the result)
  • 8. Dataframes between Spark & Pandas Spark ● Fast ● Distributed ● Limited API ● Some ML ● I/O Options ● Not indexed Pandas ● Fast ● Single Machine ● Full Feature API ● Integration with ML ● Different I/O Options ● Indexed ● Easy to visualize
  • 9. Panda IMG by Peter Beardsley
  • 10. Simple Spark SQL Example input = sqlContext.jsonFile(inputFile) input.registerTempTable("tweets") topTweets = sqlContext.sql("SELECT text, retweetCount " + "FROM tweets ORDER BY retweetCount LIMIT 10") local = topTweets.collect()
  • 11. Convert a Spark DataFrame to Pandas import pandas ... ddf = sqlContext.read.json("hdfs://...") # Some Spark transformations transformedDdf = ddf.filter(ddf['age'] > 21) return transformedDdf.toPandas()
  • 12. Convert a Pandas DataFrame to Spark import pandas ... df = pandas.DataFrame(...) ... ddf = sqlContext.createDataFrame(df)
  • 13. Let’s combine the two ● Spark DataFrames already provide some of what we need ○ Add UDFs / UDAFs ○ Use bits of Pandas code ● http://spark-packages.org - excellent place to get libraries
  • 14. So where does the PB&J go? Spark DataFrame Sparkling Pandas API Custom UDFS Pandas Code Sparkling Pandas Scala Code PySpark RDDs Pandas Code Internal State
  • 15. Extending Spark - adding index support self._index_names def collect(self): """Collect the elements in a DataFrame and concatenate the partitions.""" df = self._schema_rdd.toPandas() df = _update_index_on_df(df, self._index_names) return df
  • 16. Extending Spark - adding index support def _update_index_on_df(df, index_names): if index_names: df = df.set_index(index_names) # Remove names from unnamed indexes index_names = _denormalize_names(index_names) df.index.names = index_names return df
  • 17. Adding a UDF in Python sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
  • 18. Extending Spark SQL w/Scala for fun & profit // functions we want to be callable from python object functions { def kurtosis(e: Column): Column = new Column(Kurtosis(EvilSqlTools.getExpr(e))) def registerUdfs(sqlCtx: SQLContext): Unit = { sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _) } }
  • 19. Extending Spark SQL w/Scala for fun & profit def _create_function(name, doc=""): def _(col): sc = SparkContext._active_spark_context f = sc._jvm.com.sparklingpandas.functions jc = getattr(f, name)(col._jc if isinstance(col, Column) else col) return Column(jc) return _ _functions = { 'kurtosis': 'Calculate the kurtosis, maybe!', }
  • 20. Simple graphing with Sparkling Pandas import matplotlib.pyplot as plt plot = speaker_pronouns["pronoun"].plot() plot.get_figure().savefig("/tmp/fig") Not yet merged in
  • 21. Why is SparklingPandas fast*? Keep stuff in the JVM as much as possible. Lazy operations Distributed *For really flexible versions of the word fast Coffee by eltpics Panda image by Stéfan Panda image by cactusroot
  • 22. Supported operations: DataFrames ● to_spark_sql ● applymap ● groupby ● collect ● stats ● query ● axes ● ftype ● dtype Context ● simple ● read_csv ● from_data_frame ● parquetFile ● read_json ● stop GroupBy ● groups ● indices ● first ● median ● mean ● sum ● aggregate
  • 23. Always onwards and upwards [Figure: work done plotted against time, with "Now" and "Hypothetical, Wonderful Future" marked]
  • 24. Related Works Blaze ● http://continuum.io/blog/blaze Adatao’s Distributed DataFrame ● http://spark-summit.org/2014/talk/distributed-dataframe-ddf-on-apache-spark-simplifying-big-data-for-the-rest-of-us Numba ● http://numba.pydata.org/
  • 25. Using Sparkling Pandas You can get Sparkling Pandas from ● Website: http://www.sparklingpandas.com ● Code: https://github.com/sparklingpandas/sparklingpandas ● Mailing List: https://groups.google.com/d/forum/sparklingpandas
  • 26. Getting Sparkling Pandas friends The examples from this will get merged into master. Pandas ● http://pandas.pydata.org/ (or pip) Spark ● http://spark.apache.org/
  • 27. many pandas by David Goehring Any questions?
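
Slide 19's `_create_function` builds one Python wrapper per JVM function name and looks the target up by name at call time. The following is a minimal sketch of that lookup-by-name pattern that runs without Spark. `FakeJvmFunctions` is a hypothetical stand-in for `sc._jvm.com.sparklingpandas.functions`, and the returned string stands in for a Java column; neither is part of Sparkling Pandas.

```python
class FakeJvmFunctions:
    """Hypothetical stand-in for sc._jvm.com.sparklingpandas.functions."""

    @staticmethod
    def kurtosis(col):
        # A real JVM call would return a Java Column; a string stands in here.
        return "kurtosis({})".format(col)


def create_function(namespace, name, doc=""):
    """Build a Python wrapper that calls `namespace.<name>` by name."""
    def wrapper(col):
        # getattr takes the object and the attribute name as two arguments.
        target = getattr(namespace, name)
        return target(col)
    wrapper.__name__ = name
    wrapper.__doc__ = doc
    return wrapper


# Mirror the slide's table of function name -> docstring.
_functions = {"kurtosis": "Calculate the kurtosis, maybe!"}

wrappers = {
    name: create_function(FakeJvmFunctions, name, doc)
    for name, doc in _functions.items()
}

result = wrappers["kurtosis"]("age")
print(result)
```

Driving the wrappers from a name-to-docstring table means exposing another function only takes one more dictionary entry, assuming the JVM side already defines it.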