Pandas is a fast and expressive library for data analysis, but it doesn’t naturally scale to more data than can fit in memory. PySpark is the Python API for Apache Spark, which is designed to scale to huge amounts of data but lacks the natural expressiveness of Pandas. This talk introduces Sparkling Pandas, a library that brings together the best features of Pandas and PySpark: expressiveness, speed, and scalability.
While both Spark 1.3 and Pandas have classes named ‘DataFrame’, the Pandas DataFrame API is broader and not fully covered by the ‘DataFrame’ class in Spark. This talk will explore some of the differences between Spark’s DataFrames and Pandas DataFrames, and then examine some of the work done to implement Pandas-like DataFrames on top of Spark. In some cases, providing Pandas-like functionality is computationally expensive in a distributed environment, and we will explore some techniques to minimize this cost.
At the end of this talk you should have a better understanding of both Sparkling Pandas and Spark’s own DataFrames. Whether you end up using Sparkling Pandas or Spark directly, you will have a greater understanding of how to work with structured data in a distributed context using Apache Spark and familiar DataFrame APIs.
3. Who am I?
Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Engineer at Alpine Data Labs
○ previously DataBricks, Google, Foursquare, Amazon
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau
4. What is Pandas?
user_id  panda_type
01234    giant
12345    red
23456    giant
34567    giant
45678    red
56789    giant
● DataFrames: indexed, tabular data structures
● Easy slicing, indexing, subsetting/filtering
● Excellent support for time series data
● Data alignment and reshaping
http://pandas.pydata.org/
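The bullet points above can be seen in a few lines of Pandas. A minimal sketch reproducing the example table (column names follow the table on this slide):

```python
import pandas as pd

# Indexed, tabular data: build the example table of pandas by type
df = pd.DataFrame(
    {"user_id": ["01234", "12345", "23456", "34567", "45678", "56789"],
     "panda_type": ["giant", "red", "giant", "giant", "red", "giant"]}
).set_index("user_id")

# Easy subsetting/filtering: select only the red pandas
red_pandas = df[df["panda_type"] == "red"]
print(red_pandas)
```

Boolean masks like `df["panda_type"] == "red"` are the bread and butter of Pandas filtering, and one of the things Sparkling Pandas aims to preserve.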
5. What is Spark?
Fast, general engine for in-memory data processing.
tl;dr - 100x faster than Hadoop MapReduce*
6. The different pieces of Spark
Apache Spark core, plus:
● SQL & DataFrames
● Streaming
● Language APIs: Scala, Java, Python, & R
● Graph tools: GraphX & Bagel
● Spark ML & MLlib
● Community packages
7. Some Spark terms
Spark Context (aka sc)
● The window to the world of Spark
sqlContext
● The window to the world of DataFrames
Transformation
● Takes an RDD (or DataFrame) and returns a new RDD or DataFrame
Action
● Causes an RDD to be evaluated (often storing the result)
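The transformation/action split is just lazy evaluation. A toy, Spark-free sketch in plain Python (the `ToyRDD` class is illustrative only, not the PySpark API):

```python
class ToyRDD:
    """A toy stand-in for an RDD: transformations are recorded lazily,
    and only an action forces evaluation."""

    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []

    def map(self, f):
        # Transformation: returns a new ToyRDD, no work done yet
        return ToyRDD(self.data, self.ops + [("map", f)])

    def filter(self, f):
        # Transformation: still no work done
        return ToyRDD(self.data, self.ops + [("filter", f)])

    def collect(self):
        # Action: replays the recorded operations and returns results
        result = list(self.data)
        for kind, f in self.ops:
            if kind == "map":
                result = [f(x) for x in result]
            else:
                result = [x for x in result if f(x)]
        return result


squares = ToyRDD(range(5)).map(lambda x: x * x).filter(lambda x: x > 3)
# Nothing has been computed yet; collect() triggers evaluation:
print(squares.collect())  # [4, 9, 16]
```

Real Spark does the same bookkeeping, but builds a DAG it can optimize and distribute across machines rather than a flat list of ops.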
10. DataFrames between Spark & Pandas
Spark
● Fast
● Distributed
● Limited API
● Some ML
● I/O Options
● Not indexed
Pandas
● Fast
● Single Machine
● Full Feature API
● Integration with ML
● Different I/O options
● Indexed
● Easy to visualize
10. Simple Spark SQL Example
input = sqlContext.jsonFile(inputFile)
input.registerTempTable("tweets")
topTweets = sqlContext.sql("SELECT text, retweetCount " +
  "FROM tweets ORDER BY retweetCount LIMIT 10")
local = topTweets.collect()
11. Convert a Spark DataFrame to Pandas
import pandas
...
ddf = sqlContext.read.json("hdfs://...")
# Some Spark transformations
transformedDdf = ddf.filter(ddf['age'] > 21)
pandasDf = transformedDdf.toPandas()
12. Convert a Pandas DataFrame to Spark
import pandas
...
df = pandas.DataFrame(...)
...
ddf = sqlContext.createDataFrame(df)
13. Let’s combine the two
● Spark DataFrames already provide some of what we need
○ Add UDFs / UDAFs
○ Use bits of Pandas code
● http://spark-packages.org - an excellent place to get libraries
14. So where does the PB&J go?
[Architecture diagram: the Sparkling Pandas API sits on top of Spark DataFrames (extended with custom UDFs and Pandas code) and on PySpark RDDs (running Pandas code and holding internal state), glued together by Sparkling Pandas Scala code.]
15. Extending Spark - adding index support
self._index_names
def collect(self):
    """Collect the elements in a DataFrame
    and concatenate the partitions."""
    df = self._schema_rdd.toPandas()
    df = _update_index_on_df(df, self._index_names)
    return df
16. Extending Spark - adding index support
def _update_index_on_df(df, index_names):
    if index_names:
        df = df.set_index(index_names)
        # Remove names from unnamed indexes
        index_names = _denormalize_names(index_names)
        df.index.names = index_names
    return df
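The `set_index` call above is the heart of restoring a Pandas-style index after the data comes back from Spark, since `toPandas()` returns a plain range-indexed frame. A minimal illustration of the behaviour being relied on (data here is made up for the example):

```python
import pandas as pd

# What comes back from Spark: the index columns are ordinary columns
flat = pd.DataFrame({"user_id": ["01234", "12345"],
                     "panda_type": ["giant", "red"]})

# Re-promote the remembered index column(s) onto the frame
indexed = flat.set_index(["user_id"])

print(list(indexed.index.names))   # ['user_id']
print(indexed.loc["12345", "panda_type"])  # 'red'
```

This is also why Sparkling Pandas has to carry `self._index_names` alongside the underlying Spark data: Spark itself has no notion of an index to preserve.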
17. Adding a UDF in Python
from pyspark.sql.types import IntegerType

sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
18. Extending Spark SQL w/Scala for fun & profit
// functions we want to be callable from python
object functions {
  def kurtosis(e: Column): Column =
    new Column(Kurtosis(EvilSqlTools.getExpr(e)))

  def registerUdfs(sqlCtx: SQLContext): Unit = {
    sqlCtx.udf.register("rowKurtosis", helpers.rowKurtosis _)
  }
}
19. Extending Spark SQL w/Scala for fun & profit
def _create_function(name, doc=""):
    def _(col):
        sc = SparkContext._active_spark_context
        jvm_functions = sc._jvm.com.sparklingpandas.functions
        jc = getattr(jvm_functions, name)(
            col._jc if isinstance(col, Column) else col)
        return Column(jc)
    _.__doc__ = doc
    return _

_functions = {
    'kurtosis': 'Calculate the kurtosis, maybe!',
}
20. Simple graphing with Sparkling Pandas
import matplotlib.pyplot as plt
plot = speaker_pronouns["pronoun"].plot()
plot.get_figure().savefig("/tmp/fig")
(Not yet merged in)
21. Why is SparklingPandas fast*?
Keep stuff in the JVM as much as possible.
Lazy operations
Distributed
*For really flexible versions of the word fast
Photo credits: coffee by eltpics; panda images by Stéfan and cactusroot.
22. Supported operations:
DataFrames
● to_spark_sql
● applymap
● groupby
● collect
● stats
● query
● axes
● ftype
● dtype
Context
● simple
● read_csv
● from_data_frame
● parquetFile
● read_json
● stop
GroupBy
● groups
● indices
● first
● median
● mean
● sum
● aggregate
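The GroupBy operations listed above follow Pandas semantics, so plain Pandas shows what callers should expect back. A quick reference sketch (the weights here are made-up example data):

```python
import pandas as pd

df = pd.DataFrame({"panda_type": ["giant", "red", "giant"],
                   "weight_kg": [110.0, 6.0, 90.0]})

grouped = df.groupby("panda_type")

# mean/sum return one value per group, keyed by the group label
means = grouped["weight_kg"].mean()   # giant -> 100.0, red -> 6.0
totals = grouped["weight_kg"].sum()   # giant -> 200.0, red -> 6.0
```

Operations like `sum` and `mean` are friendly to a distributed implementation, since partial sums and counts can be combined per partition; `median` is the expensive one, which is part of why providing the full Pandas GroupBy API on Spark takes real work.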
23. Always onwards and upwards
[Chart: work done over time, from "now" toward a hypothetical, wonderful future]
25. Using Sparkling Pandas
You can get Sparkling Pandas from
● Website:
http://www.sparklingpandas.com
● Code:
https://github.com/sparklingpandas/sparklingpandas
● Mailing list:
https://groups.google.com/d/forum/sparklingpandas
26. Getting Sparkling Pandas friends
The examples from this talk will get merged into master.
Pandas
● http://pandas.pydata.org/ (or pip)
Spark
● http://spark.apache.org/