Koalas: Unifying Spark and pandas APIs

Xiao Li @ gatorsmile
PyBay Conf @ SF | Aug 2019

About Me
• Engineering Manager at Databricks
• Apache Spark Committer and PMC Member
• Previously, IBM Master Inventor
• Spark, Database Replication, Information Integration
• Ph.D. from the University of Florida
• GitHub: gatorsmile

Databricks Unified Analytics Platform
[Diagram]
- Users: data engineers, data scientists
- Databricks Workspace: notebooks, dashboards, jobs, models, APIs
- Databricks Runtime: Databricks Delta, ML frameworks
- Databricks Cloud Service
- Reliable & scalable; simple & integrated; end-to-end ML lifecycle

Apache Spark
Originally created by Databricks’ founders at UC Berkeley in 2009
A de facto unified analytics engine for large-scale data processing
- Just-in-time Data Warehouse [with Delta], Streaming, ETL,
ML, Graph Processing
PySpark API for Python; also API support for Scala, R and SQL

pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation and analysis in Python
Deeply integrated into Python data science ecosystem, e.g.
numpy, matplotlib
Can deal with a lot of different situations, including:
- basic statistical analysis
- handling missing data
- time series, categorical variables, strings
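
A minimal sketch of the situations listed above, using a small made-up dataset (the column names and values are hypothetical, not from the talk):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: four daily readings, one of them missing.
df = pd.DataFrame(
    {
        "day": pd.to_datetime(
            ["2019-08-01", "2019-08-02", "2019-08-03", "2019-08-04"]
        ),
        "city": pd.Categorical(["SF", "NYC", "SF", "SF"]),
        "temp": [18.0, np.nan, 20.0, 22.0],
    }
)

# Basic statistical analysis: mean() skips NaN by default.
mean_temp = df["temp"].mean()

# Handling missing data: fill the gap with the column mean.
df["temp_filled"] = df["temp"].fillna(mean_temp)

# Time series: index by date and take the maximum of each 2-day bucket.
two_day_max = df.set_index("day")["temp_filled"].resample("2D").max()

# Categorical variables and strings.
city_counts = df["city"].value_counts()
city_lower = df["city"].astype(str).str.lower()
```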

Why Is Spark Faster on Big Data?
Distributed computing in Spark
Lazy execution in Spark
- Nothing runs until the user calls an action API (collect, save, show)
- Transformations are combined, optimized, and executed holistically
Efficient execution in Spark
- Tungsten execution engine: whole-stage code generation
- Catalyst optimizer: heuristics-based and cost-based query
optimization, adaptive query optimization [Spark 3.0]
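
The lazy-execution idea can be illustrated with a toy pipeline in plain Python. This is only a sketch of the concept: the `LazyPipeline` class is hypothetical and is not how Spark is implemented. Transformations record a step and return a new plan; the `collect` action runs all recorded steps in one pass.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Step = Tuple[str, Callable]


class LazyPipeline:
    """Toy stand-in for a lazy query plan (hypothetical, not Spark's code)."""

    def __init__(self, source: Iterable[int], steps: Optional[List[Step]] = None):
        self._source = source
        self._steps = steps or []

    def map(self, fn: Callable) -> "LazyPipeline":
        # Transformation: only records the step and returns a new plan.
        return LazyPipeline(self._source, self._steps + [("map", fn)])

    def filter(self, pred: Callable) -> "LazyPipeline":
        # Transformation: also recorded, not executed.
        return LazyPipeline(self._source, self._steps + [("filter", pred)])

    def collect(self) -> list:
        # Action: runs every recorded step in a single pass over the data.
        out = []
        for row in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    row = fn(row)
                elif not fn(row):
                    keep = False
                    break
            if keep:
                out.append(row)
        return out


calls = []


def traced_double(x):
    calls.append(x)  # record each invocation so execution is observable
    return x * 2


plan = LazyPipeline(range(5)).map(traced_double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # still 0: building the plan ran nothing
result = plan.collect()           # the action triggers execution
```

In PySpark the same split applies: calls such as `filter` and `select` build up a plan, and an action such as `collect` or `show` triggers it.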

Spark-ify pandas Code???
• The increasing scale and complexity of data
operations
• pandas-based Python scripts become too slow
• But the switch to Spark is time-consuming and not
straightforward

Koalas
• Announced April 24, 2019
• Pure Python library
• Familiar if coming from pandas
• Aims at providing the pandas
API on top of Spark
• Unifies the two ecosystems
with a familiar API
• Seamless transition between
small and large data

API Differences
pandas
- Born of need + batteries included: providing APIs for common tasks
- Type system from NumPy
- Be Pythonic
PySpark
- Abstraction: tasks are implemented by composing primitives
- Type system from ANSI SQL
- Consistent with Scala DataFrame APIs

pandas DataFrame vs. Spark DataFrame
                 pandas DataFrame               Spark DataFrame
Column           df['col']                      df['col']
Mutability       Mutable                        Immutable
Add a column     df['c'] = df['a'] + df['b']    df.withColumn('c', df['a'] + df['b'])
Rename columns   df.columns = ['a', 'b']        df.select(df['c1'].alias('a'), df['c2'].alias('b'))
Value count      df['col'].value_counts()       df.groupBy(df['col']).count().orderBy('count', ascending=False)
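
The "value count" row above shows the difference in style: pandas ships a dedicated method, while Spark composes group, count, and sort. As a rough sketch, the same composition can be written in pandas itself to see that both routes agree (the sample data is made up):

```python
import pandas as pd

# Hypothetical sample column.
df = pd.DataFrame({"col": ["a", "b", "a", "c", "a", "b"]})

# "Batteries included": one dedicated method.
builtin = df["col"].value_counts()

# Composition of primitives, mirroring the Spark side of the table:
# group by the column, count rows per group, sort by count descending.
composed = df.groupby("col").size().sort_values(ascending=False)
```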

A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
PySpark:
# spark: an existing SparkSession
df = (spark.read
      .option("inferSchema", "true")
      .option("header", "true")  # pandas treats the first row as a header by default
      .csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)

A short example
pandas:
import pandas as pd
df = pd.read_csv("my_data.csv")
Koalas:
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
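
Since the pandas and Koalas snippets above differ only in the import, one way to picture the transition is a function that takes the library module as an argument. This is a sketch under assumptions: `load_and_square` and the temporary CSV are hypothetical, and the Koalas call is left as a comment because it needs the `koalas` package and a working Spark installation.

```python
import os
import tempfile

import pandas as pd


def load_and_square(lib, path):
    """Read a CSV and add a squared column using `lib` (pandas or Koalas)."""
    df = lib.read_csv(path)
    df.columns = ["x", "y", "z1"]
    df["x2"] = df.x * df.x
    return df


# Hypothetical three-column CSV standing in for my_data.csv.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "my_data.csv")
with open(csv_path, "w") as f:
    f.write("a,b,c\n1,2,3\n4,5,6\n")

pdf = load_and_square(pd, csv_path)

# With Koalas and Spark available, only the module changes:
#   import databricks.koalas as ks
#   kdf = load_and_square(ks, csv_path)
```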

Koalas
• Provide discoverable APIs for common data science
tasks (i.e., follow pandas)
• Unify the pandas API and the Spark API, but pandas first
• Implement the pandas APIs that are appropriate for
distributed datasets
• Easy conversion from/to a pandas DataFrame or
NumPy array
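
The conversion point above might look like the following round trip. It is a sketch that assumes the `koalas` package and a working Spark installation; `make_sample` and `round_trip` are hypothetical helper names, and the Koalas import is placed inside the function so the snippet still loads where Spark is not installed.

```python
import pandas as pd


def make_sample() -> pd.DataFrame:
    """Small hypothetical pandas DataFrame used for the round trip."""
    return pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})


def round_trip(pdf: pd.DataFrame) -> pd.DataFrame:
    """pandas -> Koalas -> Spark -> Koalas -> pandas.

    Assumes koalas and Spark are installed; not executed here.
    """
    import databricks.koalas as ks  # lazy import: needs Spark at call time

    kdf = ks.from_pandas(pdf)   # pandas DataFrame -> Koalas DataFrame
    sdf = kdf.to_spark()        # Koalas DataFrame -> Spark DataFrame
    kdf2 = sdf.to_koalas()      # Spark DataFrame -> Koalas DataFrame
    return kdf2.to_pandas()     # collects every row to the driver


sample = make_sample()
```

Note that `to_pandas()` brings the whole dataset into the driver's memory, so it is best kept for data that is already small.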

Koalas: a lean API layer
[Diagram: the Spark stack, top to bottom]
- Koalas (pandas API)
- DataFrame APIs, SQL
- Catalyst Optimization & Tungsten Execution
- Core
- Data Source Connectors

Current status
• Bi-weekly releases, very active community with daily changes
• The most common functions have been implemented:
- 60% of the DataFrame/Series API
- 50% of the DataFrameGroupBy/SeriesGroupBy API
- 15% of the Index/MultiIndex API
- to_datetime, get_dummies, …
- to_delta, to_parquet, to_spark_io, sql, cache, …

Quickly gaining traction
- 300+ patches merged
since announcement
- 20 significant
contributors outside of
Databricks
- 6K+ daily downloads

What to expect soon?
• Performance enhancements
• Better indexing support
• Better error handling
• Better coverage of pandas APIs
• More time series related functions
• Better visualization support

Getting started
• pip install koalas
• conda install koalas
• Look for docs and updates on github.com/databricks/koalas
• Project docs are published here: https://koalas.readthedocs.io

Do you have suggestions or requests?
Submit requests to github.com/databricks/koalas/issues
Very easy to contribute
github.com/databricks/koalas/blob/master/CONTRIBUTING.md

Thank you
Xiao Li
(lixiao@databricks.com)
