2. About Me
- Software Engineer @databricks
- Apache Spark Committer & PMC member
- Twitter: @ueshin
- GitHub: github.com/ueshin
3. Koalas
Announced April 24, 2019
Pure Python library
Aims to provide the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data
4. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem (e.g., NumPy, matplotlib)
Handles many different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
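Each of the situations above maps to a short pandas call. A minimal sketch (the tiny DataFrame below is illustrative, not data from the talk):

```python
import numpy as np
import pandas as pd

# A small frame with a missing value and a categorical-style column
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": ["a", "b", "a", "b"]})

mean_x = df["x"].mean()          # basic statistics skip NaN by default
filled = df["x"].fillna(0.0)     # handling missing data
counts = df["y"].value_counts()  # summarizing a categorical variable
```
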
5. Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark
6. Apache Spark
De facto unified analytics engine for large-scale data processing
(Streaming, ETL, ML)
Originally created at UC Berkeley by Databricks’ founders
PySpark provides the Python API; Scala, R, and SQL are also supported
8. A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

PySpark:
df = (spark.read
      .option("inferSchema", "true")
      .option("comment", "#")
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)
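The pandas half of the example can be tried end to end without a file on disk, by substituting an in-memory CSV for "my_data.csv" (the file contents below are hypothetical, chosen only for illustration):

```python
import io
import pandas as pd

# Stand-in for "my_data.csv" (illustrative contents)
csv = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

df = pd.read_csv(csv)
df.columns = ['x', 'y', 'z1']  # rename the columns, as in the slide
df['x2'] = df.x * df.x         # derive a squared column
```
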
10. Koalas
- Provide discoverable APIs for common data science tasks (i.e., follow pandas)
- Unify the pandas API and the Spark API, but pandas first
- Offer the pandas APIs that are appropriate for distributed datasets
- Easy conversion from/to pandas DataFrames and numpy arrays
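The conversion story builds on what pandas itself provides (Koalas mirrors it with methods such as to_pandas and to_numpy). The sketch below shows the pandas/numpy side only, so it runs without Spark:

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# numpy array -> pandas DataFrame
df = pd.DataFrame(arr, columns=["x", "y"])

# pandas DataFrame -> numpy array
back = df.to_numpy()
```
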
11. A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

Koalas:
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
12. Key Differences
Spark is lazy by nature:
- most operations are executed only when a DataFrame is displayed or written
Spark does not maintain row order
Performance characteristics differ when working at scale
14. Sessions related to Koalas
• Keynote: New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas
  https://databricks.com/session_eu19/new-developments-in-the-open-source-ecosystem-apache-spark-3-0-delta-lake-and-koalas
• Koalas: Making an Easy Transition from Pandas to Apache Spark
  https://databricks.com/session_eu19/koalas-making-an-easy-transition-from-pandas-to-apache-spark
• Koalas: Pandas on Apache Spark
  https://databricks.com/session_eu19/koalas-pandas-on-apache-spark
15. Current status
Bi-weekly releases, very active community with daily changes
The most common functions have been implemented:
- 60% of the DataFrame / Series API
- 60% of the DataFrameGroupBy / SeriesGroupBy API
- 15% of the Index / MultiIndex API
- to_datetime, get_dummies, …
16. New features
- 80% of the plot functions (since 0.16.0)
- Spark-related functions (since 0.8.0)
  - IO: to_parquet/read_parquet, to_csv/read_csv, to_json/read_json, to_spark_io/read_spark_io, to_delta/read_delta, ...
  - SQL
  - cache
- Support for multi-index columns (90%) (since 0.16.0)
- Options to configure Koalas' behavior (since 0.17.0)
18. How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
- Challenge: increasing scale and complexity of data operations
- Struggling with the "Spark switch" from pandas
- Result: more than 10X faster with less than 1% code changes
19. What to expect?
• Improve pandas API coverage
- rolling/expanding
• Support categorical data types
• More time-series related functions
• Improve performance
- Minimize the overhead at Koalas layer
- Optimal implementation of APIs
20. Getting started
pip install koalas
conda install koalas -c conda-forge
Look for docs on https://koalas.readthedocs.io/en/latest/
and updates on github.com/databricks/koalas
A 10-minute tutorial in a live Jupyter notebook is available from the docs.
21. Documentation picks
• User Guide: https://koalas.readthedocs.io/en/latest/user_guide/index.html
• Getting Started: https://koalas.readthedocs.io/en/latest/getting_started/index.html
• Working with pandas and PySpark: https://koalas.readthedocs.io/en/latest/user_guide/pandas_pyspark.html
• Best Practices: https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html
22. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
koalas.readthedocs.io/en/latest/development/contributing.html