Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Koalas: pandas on Apache Spark
Niall Turbitt
Data Scientist @ Databricks
About
Niall Turbitt
Data Scientist at Databricks
● Professional Services and Training
● Previously:
○ e-Commerce
○ Supply Chain and Logistics
○ Recommender Systems and Personalization
● MS Statistics University College Dublin
● BA Mathematics & Economics Trinity College Dublin
Outline
▪ pandas vs Spark
▪ Why Koalas?
▪ Koalas under the hood
▪ Demo
▪ Koalas Roadmap
Typical Journey of a Data Scientist
▪ Education (MOOCs, Books, Universities) -> pandas
▪ Analyze Small Datasets -> pandas
▪ Analyze Big Datasets -> Apache Spark
▪ Authored by Wes McKinney in 2008
▪ The standard tool for data manipulation and analysis in Python
▪ Deeply integrated into Python data science ecosystem
▪ numpy
▪ matplotlib
▪ scikit-learn
▪ Can deal with a lot of different situations, including:
▪ Basic statistical analysis
▪ Handling missing data
▪ Time series, categorical variables, strings
▪ De facto unified analytics engine for large-scale data processing
▪ Streaming
▪ ETL
▪ ML
▪ Originally created at UC Berkeley by Databricks’ founders
▪ PySpark API for Python; also API support for Scala, R, Java and SQL
A short example
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
PySparkpandas
df = (spark.read
.option("inferSchema", "true")
.csv("my_data.csv"))
df = df.toDF('x', 'y', 'z1')
df = df.withColumn('x2', df.x * df.x)
▪ Announced April 2019
▪ Pure Python Library
▪ Aims at providing the pandas API on top of Apache Spark
▪ Unifies the two ecosystems with a familiar API
▪ Seamless transition between small and large data
Koalas
Koalas Growth
▪ > 30,000 daily downloads
▪ > 2,000 GitHub stars
A short example
import pandas as pd
df = pd.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
pandas
import databricks.koalas as ks
df = ks.read_csv("my_data.csv")
df.columns = ['x', 'y', 'z1']
df['x2'] = df.x * df.x
Koalas
Current Status
▪ Bi-weekly releases, very active community with daily changes
▪ Most common pandas functions have been implemented in Koalas:
▪ Series : 70%
▪ DataFrame : 74%
▪ Index : 56%
▪ MultiIndex : 51%
▪ DataFrameGroupBy : 67%
▪ SeriesGroupBy : 69%
▪ Plotting: 80%
Spark vs pandas - Key Differences
▪ DataFrame is immutable
▪ Lazy Evaluation
▪ Distributed
▪ Does not maintain row order
▪ Performance working at scale
▪ DataFrame is mutable
▪ Eager execution
▪ Single-machine
▪ Maintains row order
▪ Restricted to single machine
pandasSpark
Koalas - Architecture
Catalyst Optimization & Tungsten
Execution
DataFrame APIsSQL
Koalas
Core
Data Source
Connectors
pandas
SPARK
Lean API layer
Koalas - Under the hood
▪ InternalFrame
▪ Holds the current Spark DataFrame
▪ Internal immutable metadata
▪ Manages mappings from Koalas column names to Spark column names
▪ Manages mapping from Koalas index names to Spark column names
▪ Converts between Spark DataFrame and pandas DataFrame
InternalFrame
Koalas
DataFrame
InternalFrame
- column_labels
- index_map
Spark
DataFrame
InternalFrame
- column_labels
- index_map
Spark
DataFrame
Koalas
DataFrame
API call copy with new state
InternalFrame - Metadata update only
Koalas
DataFrame
InternalFrame
- column_labels
- index_map
Spark
DataFrame
InternalFrame
- column_labels
- index_map
Koalas
DataFrame
API call Copy with new state
Only updates
metadata
kdf.set_index(...)
InternalFrame
Koalas
DataFrame
InternalFrame
- column_labels
- index_map
Spark
DataFrame
InternalFrame
- column_labels
- index_map
Spark
DataFrame
API call copy with new state
e.g. inplace=True
kdf.dropna(..., inplace=True)
Koalas - Under the hood
▪ Default Index
▪ Koalas manages a group of columns as an index
▪ Behaves in same manner as pandas
▪ If no index is specified when creating Koalas DataFrame
▪ A “default index” is attached automatically
▪ Each “default index” has pros and cons
Default Index Type
▪ Used by default
▪ Implements sequence that
increases one by one
▪ Uses PySpark Window
function without specifying
partition
▪ Can end up with whole
partition on single node
▪ Avoid when data is large
▪ Implements sequence that
increases one by one
▪ Uses group-by and
group-map in distributed
manner
▪ Recommended if the index
must be sequential for a
large dataset and increasing
one by one
distributed-sequencesequence
▪ Implements monotonically
increasing sequence
▪ Uses PySpark’s
monotonically_increasing_id
in distributed manner
▪ Values are non-deterministic
▪ Recommended if the index
does not have to be a
sequence increasing one by
one
distributed
Configurable by the option compute.default_index_type
Demo: bit.ly/koalas_summit_2020
Koalas - On the roadmap
▪ June/July 2020: Koalas 1.0 release, which supports Spark 3.0
▪ July/Aug 2020: Release DBR/MLR 7.1 will pre-install Koalas 1.x
Koalas - Getting started
pip install koalas
conda install koalas
▪ Docs: https://koalas.readthedocs.io
▪ Github: https://github.com/databricks/koalas
Thank You!
Feedback
Your feedback is important to us.
Don’t forget to rate and
review the sessions.
Koalas: Pandas on Apache Spark

Koalas: Pandas on Apache Spark

  • 2.
    Feedback Your feedback isimportant to us. Don’t forget to rate and review the sessions.
  • 3.
    Koalas: pandas onApache Spark Niall Turbitt Data Scientist @ Databricks
  • 4.
    About Niall Turbitt Data Scientistat Databricks ● Professional Services and Training ● Previously: ○ e-Commerce ○ Supply Chain and Logistics ○ Recommender Systems and Personalization ● MS Statistics University College Dublin ● BA Mathematics & Economics Trinity College Dublin
  • 5.
    Outline ▪ pandas vsSpark ▪ Why Koalas? ▪ Koalas under the hood ▪ Demo ▪ Koalas Roadmap
  • 6.
    Typical Journey ofa Data Scientist ▪ Education (MOOCs, Books, Universities) -> pandas ▪ Analyze Small Datasets -> pandas ▪ Analyze Big Datasets -> Apache Spark
  • 7.
    ▪ Authored byWes McKinney in 2008 ▪ The standard tool for data manipulation and analysis in Python ▪ Deeply integrated into Python data science ecosystem ▪ numpy ▪ matplotlib ▪ scikit-learn ▪ Can deal with a lot of different situations, including: ▪ Basic statistical analysis ▪ Handling missing data ▪ Time series, categorical variables, strings
  • 8.
    ▪ De factounified analytics engine for large-scale data processing ▪ Streaming ▪ ETL ▪ ML ▪ Originally created at UC Berkeley by Databricks’ founders ▪ PySpark API for Python; also API support for Scala, R, Java and SQL
  • 9.
    A short example importpandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x PySparkpandas df = (spark.read .option("inferSchema", "true") .csv("my_data.csv")) df = df.toDF('x', 'y', 'z1') df = df.withColumn('x2', df.x * df.x)
  • 10.
    ▪ Announced April2019 ▪ Pure Python Library ▪ Aims at providing the pandas API on top of Apache Spark ▪ Unifies the two ecosystems with a familiar API ▪ Seamless transition between small and large data Koalas
  • 11.
    Koalas Growth ▪ >30,000 daily downloads ▪ > 2,000 GitHub stars
  • 12.
    A short example importpandas as pd df = pd.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x pandas import databricks.koalas as ks df = ks.read_csv("my_data.csv") df.columns = ['x', 'y', 'z1'] df['x2'] = df.x * df.x Koalas
  • 13.
    Current Status ▪ Bi-weeklyreleases, very active community with daily changes ▪ Most common pandas functions have been implemented in Koalas: ▪ Series : 70% ▪ DataFrame : 74% ▪ Index : 56% ▪ MultiIndex : 51% ▪ DataFrameGroupBy : 67% ▪ SeriesGroupBy : 69% ▪ Plotting: 80%
  • 14.
    Spark vs pandas- Key Differences ▪ DataFrame is immutable ▪ Lazy Evaluation ▪ Distributed ▪ Does not maintain row order ▪ Performance working at scale ▪ DataFrame is mutable ▪ Eager execution ▪ Single-machine ▪ Maintains row order ▪ Restricted to single machine pandasSpark
  • 15.
    Koalas - Architecture CatalystOptimization & Tungsten Execution DataFrame APIsSQL Koalas Core Data Source Connectors pandas SPARK Lean API layer
  • 16.
    Koalas - Underthe hood ▪ InternalFrame ▪ Holds the current Spark DataFrame ▪ Internal immutable metadata ▪ Manages mappings from Koalas column names to Spark column names ▪ Manages mapping from Koalas index names to Spark column names ▪ Converts between Spark DataFrame and pandas DataFrame
  • 17.
    InternalFrame Koalas DataFrame InternalFrame - column_labels - index_map Spark DataFrame InternalFrame -column_labels - index_map Spark DataFrame Koalas DataFrame API call copy with new state
  • 18.
    InternalFrame - Metadataupdate only Koalas DataFrame InternalFrame - column_labels - index_map Spark DataFrame InternalFrame - column_labels - index_map Koalas DataFrame API call Copy with new state Only updates metadata kdf.set_index(...)
  • 19.
    InternalFrame Koalas DataFrame InternalFrame - column_labels - index_map Spark DataFrame InternalFrame -column_labels - index_map Spark DataFrame API call copy with new state e.g. inplace=True kdf.dropna(..., inplace=True)
  • 20.
    Koalas - Underthe hood ▪ Default Index ▪ Koalas manages a group of columns as an index ▪ Behaves in same manner as pandas ▪ If no index is specified when creating Koalas DataFrame ▪ A “default index” is attached automatically ▪ Each “default index” has pros and cons
  • 21.
    Default Index Type ▪Used by default ▪ Implements sequence that increases one by one ▪ Uses PySpark Window function without specifying partition ▪ Can end up with whole partition on single node ▪ Avoid when data is large ▪ Implements sequence that increases one by one ▪ Uses group-by and group-map in distributed manner ▪ Recommended if the index must be sequential for a large dataset and increasing one by one distributed-sequencesequence ▪ Implements monotonically increasing sequence ▪ Uses PySpark’s monotonically_increasing_id in distributed manner ▪ Values are non-deterministic ▪ Recommended if the index does not have to be a sequence increasing one by one distributed Configurable by the option compute.default_index_type
  • 22.
  • 23.
    Koalas - Onthe roadmap ▪ June/July 2020: Koalas 1.0 release, which supports Spark 3.0 ▪ July/Aug 2020: Release DBR/MLR 7.1 will pre-install Koalas 1.x
  • 24.
    Koalas - Gettingstarted pip install koalas conda install koalas ▪ Docs: https://koalas.readthedocs.io ▪ Github: https://github.com/databricks/koalas
  • 25.
  • 26.
    Feedback Your feedback isimportant to us. Don’t forget to rate and review the sessions.