2. About Me
- Software Engineer @databricks
- Apache Spark Committer & PMC member
- Twitter: @ueshin
- GitHub: github.com/ueshin
3. Koalas
Announced April 24, 2019
Pure Python library
Aims to provide the pandas API on top of Apache Spark:
- unifies the two ecosystems with a familiar API
- seamless transition between small and large data
4. pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into the Python data science ecosystem (e.g., NumPy, matplotlib)
Handles many different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
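Each of the situations above maps to a short pandas call. A minimal sketch (the tiny DataFrame below is illustrative, not data from the talk):

```python
import numpy as np
import pandas as pd

# A small frame with a missing value and a categorical-style column
df = pd.DataFrame({"x": [1.0, 2.0, np.nan, 4.0],
                   "y": ["a", "b", "a", "b"]})

mean_x = df["x"].mean()          # basic statistics skip NaN by default
filled = df["x"].fillna(0.0)     # handling missing data
counts = df["y"].value_counts()  # summarizing a categorical variable
```
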
5. Typical journey of a data scientist
Education (MOOCs, books, universities) → pandas
Analyze small data sets → pandas
Analyze big data sets → DataFrame in Spark
6. Apache Spark
De facto unified analytics engine for large-scale data processing
(Streaming, ETL, ML)
Originally created at UC Berkeley by Databricks’ founders
PySpark provides the Python API; Scala, R, and SQL are also supported
8. A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

PySpark:
df = (spark.read
      .option("inferSchema", "true")
      .option("comment", "#")
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)
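The pandas half of the example can be tried end to end without a file on disk, by substituting an in-memory CSV for "my_data.csv" (the file contents below are hypothetical, chosen only for illustration):

```python
import io
import pandas as pd

# Stand-in for "my_data.csv" (illustrative contents)
csv = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")

df = pd.read_csv(csv)
df.columns = ['x', 'y', 'z1']  # rename the columns, as in the slide
df['x2'] = df.x * df.x         # derive a squared column
```
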
10. Koalas
- Provide discoverable APIs for common data science tasks (i.e., follow pandas)
- Unify the pandas API and the Spark API, but pandas first
- Offer the pandas APIs that are appropriate for distributed datasets
- Easy conversion from/to pandas DataFrames and numpy arrays
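The conversion story builds on what pandas itself provides (Koalas mirrors it with methods such as to_pandas and to_numpy). The sketch below shows the pandas/numpy side only, so it runs without Spark:

```python
import numpy as np
import pandas as pd

arr = np.array([[1.0, 2.0], [3.0, 4.0]])

# numpy array -> pandas DataFrame
df = pd.DataFrame(arr, columns=["x", "y"])

# pandas DataFrame -> numpy array
back = df.to_numpy()
```
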
11. A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x

Koalas:
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
12. Key Differences
Spark is lazy by nature:
- most operations are executed only when a DataFrame is displayed or written
Spark does not maintain row order
Performance characteristics differ when working at scale
14. Sessions related to Koalas
• Keynote: New Developments in the Open Source Ecosystem: Apache Spark 3.0, Delta Lake, and Koalas
  https://databricks.com/session_eu19/new-developments-in-the-open-source-ecosystem-apache-spark-3-0-delta-lake-and-koalas
• Koalas: Making an Easy Transition from Pandas to Apache Spark
  https://databricks.com/session_eu19/koalas-making-an-easy-transition-from-pandas-to-apache-spark
• Koalas: Pandas on Apache Spark
  https://databricks.com/session_eu19/koalas-pandas-on-apache-spark
15. Current status
Bi-weekly releases, very active community with daily changes
The most common functions have been implemented:
- 60% of the DataFrame / Series API
- 60% of the DataFrameGroupBy / SeriesGroupBy API
- 15% of the Index / MultiIndex API
- to_datetime, get_dummies, …
16. New features
- 80% of the plot functions (since 0.16.0)
- Spark-related functions (since 0.8.0)
  - IO: to_parquet/read_parquet, to_csv/read_csv, to_json/read_json, to_spark_io/read_spark_io, to_delta/read_delta, ...
  - SQL
  - cache
- Support for multi-index columns (90%) (since 0.16.0)
- Options to configure Koalas' behavior (since 0.17.0)
18. How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
- Challenge: increasing scale and complexity of data operations
- Struggling with the "Spark switch" from pandas
- Result: more than 10X faster with less than 1% code changes
19. What to expect?
• Improve pandas API coverage
- rolling/expanding
• Support categorical data types
• More time-series related functions
• Improve performance
- Minimize the overhead at Koalas layer
- Optimal implementation of APIs
20. Getting started
pip install koalas
conda install koalas -c conda-forge
Look for docs on https://koalas.readthedocs.io/en/latest/
and updates on github.com/databricks/koalas
A 10-minute tutorial in a live Jupyter notebook is available from the docs.
21. Documentation picks
• User Guide: https://koalas.readthedocs.io/en/latest/user_guide/index.html
• Getting Started: https://koalas.readthedocs.io/en/latest/getting_started/index.html
• Working with pandas and PySpark: https://koalas.readthedocs.io/en/latest/user_guide/pandas_pyspark.html
• Best Practices: https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html
22. Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
koalas.readthedocs.io/en/latest/development/contributing.html