Koalas: Making an Easy Transition
from pandas to Apache Spark
Takuya Ueshin
Software Engineer @ Databricks

Koalas is an open-source project that aims to bridge the gap between big data and small data for data scientists and to simplify Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data.

  1. Koalas: Making an Easy Transition from pandas to Apache Spark
     Takuya Ueshin, Software Engineer @ Databricks
  2. About
     Takuya Ueshin, Software Engineer at Databricks
     ▪ Apache Spark committer and PMC member
     ▪ Focusing on Spark SQL and PySpark
     ▪ Koalas maintainer
  3. Outline
     ▪ What’s Koalas?
     ▪ pandas vs Apache Spark at a high level
     ▪ Koalas 1.0
     ▪ Demo
     ▪ InternalFrame
     ▪ Index and Default Index
     ▪ Roadmap
  4. What’s Koalas?
     ▪ Announced April 24, 2019
     ▪ Aims at providing the pandas API on top of Apache Spark
     ▪ Unifies the two ecosystems with a familiar API
     ▪ Seamless transition between small and large data
     ▪ For pandas users
       ▪ Scale out the pandas code using Koalas
       ▪ Make learning PySpark much easier
     ▪ For PySpark users
       ▪ More productive with pandas-like functions
  5. pandas
     ▪ Authored by Wes McKinney in 2008
     ▪ The standard tool for data manipulation and analysis in Python
     ▪ The current version: 1.0.4
     ▪ [Chart: Stack Overflow Trends]
  6. pandas
     ▪ Deeply integrated into the Python data science ecosystem
       ▪ numpy
       ▪ matplotlib
       ▪ scikit-learn
     ▪ Can deal with a lot of different situations, including:
       ▪ Basic statistical analysis
       ▪ Handling missing data
       ▪ Time series, categorical variables, strings
  7. Apache Spark
     ▪ De facto unified analytics engine for large-scale data processing
       ▪ Streaming
       ▪ ETL
       ▪ ML
     ▪ Originally created at UC Berkeley by Databricks’ founders
     ▪ PySpark API for Python; also API support for Scala, R and SQL
     ▪ The latest version: 3.0.0
  8. pandas DataFrame vs. PySpark DataFrame
     ▪ Column
       ▪ pandas: df['col']
       ▪ PySpark: df['col']
     ▪ Mutability
       ▪ pandas: Mutable
       ▪ PySpark: Immutable
     ▪ Execution
       ▪ pandas: Eagerly
       ▪ PySpark: Lazily
     ▪ Add a column
       ▪ pandas: df['c'] = df['a'] + df['b']
       ▪ PySpark: df = df.withColumn('c', df['a'] + df['b'])
     ▪ Rename columns
       ▪ pandas: df.columns = ['a','b']
       ▪ PySpark: df = df.select(df['c1'].alias('a'), df['c2'].alias('b'))
       ▪ PySpark (alternative): df = df.toDF('a', 'b')
     ▪ Value count
       ▪ pandas: df['col'].value_counts()
       ▪ PySpark: df.groupBy(df['col']).count().orderBy('count', ascending=False)
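The mutable versus immutable contrast in the comparison above can be tried without a Spark cluster. The following is a minimal sketch in pandas only: `DataFrame.assign` returns a new frame, in the spirit of PySpark's `withColumn`, and a group, count, and sort chain mirrors what `value_counts()` does in one call. The sample data and the variable names are made up for illustration and do not come from the slides.

```python
import pandas as pd

# Made-up sample data, for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30], "col": ["x", "y", "x"]})

# pandas style: add the column by mutating a DataFrame in place.
mutated = df.copy()
mutated["c"] = mutated["a"] + mutated["b"]

# PySpark-like style: build a new DataFrame and leave the original untouched.
derived = df.assign(c=df["a"] + df["b"])

# value_counts() and its explicit group / count / sort-descending equivalent.
counts = df["col"].value_counts()
grouped = df.groupby("col").size().sort_values(ascending=False)
```

After this runs, `df` still has no column `c`, while both `mutated` and `derived` do.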
  9. A short example: pandas vs. PySpark
     pandas:
         import pandas as pd
         df = pd.read_csv("/path/to/my_data.csv")
         df.columns = ['x', 'y', 'z1']
         df['x2'] = df.x * df.x
     PySpark:
         df = (spark.read
               .option("inferSchema", "true")
               .csv("/path/to/my_data.csv"))
         df = df.toDF('x', 'y', 'z1')
         df = df.withColumn('x2', df.x * df.x)
  10. A short example: pandas vs. Koalas
     pandas:
         import pandas as pd
         df = pd.read_csv("/path/to/my_data.csv")
         df.columns = ['x', 'y', 'z1']
         df['x2'] = df.x * df.x
     Koalas:
         import databricks.koalas as ks
         df = ks.read_csv("/path/to/my_data.csv")
         df.columns = ['x', 'y', 'z1']
         df['x2'] = df.x * df.x
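Because the pandas and Koalas versions of the example differ only in the import and the reader call, the transformation itself can be written once against the shared API. The sketch below wraps the slide's two statements in a helper (`add_square` is a hypothetical name) and exercises it with pandas on an in-memory CSV. The intent is that the same function would accept a Koalas DataFrame as well, but that is an assumption and is not exercised here.

```python
import io

import pandas as pd


def add_square(df):
    """Rename the three columns and add x2 = x * x, as in the slide's example."""
    df = df.copy()  # avoid mutating the caller's DataFrame
    df.columns = ["x", "y", "z1"]
    df["x2"] = df.x * df.x
    return df


# Made-up CSV content standing in for /path/to/my_data.csv.
csv_text = "a,b,c\n1,2,3\n4,5,6\n"
pdf = add_square(pd.read_csv(io.StringIO(csv_text)))
```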
  11. Koalas Growth
     ▪ 30,000+ downloads per day, 800,000+ downloads last month
  12. Koalas 1.0
     ▪ Spark 3.0 support
       ▪ Optimize using Spark 3.0 functions, such as mapInPandas().
     ▪ Python 3.8 support
     ▪ pandas 1.0 support (since 0.28.0)
       ▪ Basically Koalas will follow pandas 1.0 behavior.
     ▪ Remove deprecated functions
       ▪ Functions removed in pandas 1.0
       ▪ @pandas_wraps, DataFrame.map_in_pandas()
     ▪ Introduce spark property and move Spark-specific functions
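For readers who have not used it, Spark 3.0's `DataFrame.mapInPandas` takes a function that receives an iterator of pandas DataFrames (batches) and yields pandas DataFrames, together with the output schema. The sketch below shows only the shape of such a function and runs it locally on hand-made batches, so no Spark session is needed. The function name, column names, and data are hypothetical, and how Koalas uses `mapInPandas` internally is not shown here.

```python
from typing import Iterator

import pandas as pd


def keep_adults(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Yield each incoming batch filtered to rows with age >= 18."""
    for batch in batches:
        yield batch[batch["age"] >= 18]


# Local stand-in for the batches Spark would feed to the function.
batches = [
    pd.DataFrame({"id": [1, 2], "age": [15, 30]}),
    pd.DataFrame({"id": [3, 4], "age": [42, 7]}),
]
result = pd.concat(list(keep_adults(iter(batches))), ignore_index=True)

# With a Spark 3.0 DataFrame named spark_df (not created here), the same
# function would be passed as:
#     spark_df.mapInPandas(keep_adults, schema=spark_df.schema)
```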
  13. Koalas 1.0
     ▪ Most common pandas functions have been implemented in Koalas:
       ▪ Series: 70%
       ▪ DataFrame: 77%
       ▪ Index: 65%
       ▪ MultiIndex: 60%
       ▪ DataFrameGroupBy: 67%
       ▪ SeriesGroupBy: 69%
       ▪ Plotting: 80%
     ▪ APIs for Spark users:
       ▪ to_koalas(), to_spark()
       ▪ DataFrame.spark.to_spark_io(), ks.read_spark_io(), ...
       ▪ DataFrame.spark.cache(), ks.sql(), ...
  14. Demo
  15. InternalFrame
     [Diagram: a Koalas DataFrame, its InternalFrame (column_labels, index_map, spark_columns), and a Spark DataFrame]
  16. InternalFrame
     [Diagram: an API call on a Koalas DataFrame copies the InternalFrame with a new state; the result is a second Koalas DataFrame with its own InternalFrame (column_labels, index_map, spark_columns) and its own Spark DataFrame]
  17. InternalFrame
     [Diagram: an API call on a Koalas DataFrame copies the InternalFrame with a new state; the result is a second Koalas DataFrame with its own InternalFrame (column_labels, index_map, spark_columns), with a single Spark DataFrame shown]
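The InternalFrame slides describe an immutable piece of metadata that sits between a Koalas DataFrame and a Spark DataFrame, with each API call producing a copy "with a new state". The toy model below illustrates that pattern in plain Python. It is not Koalas's implementation: the class names, the `backing` dictionary standing in for the Spark DataFrame, and the two methods are all hypothetical, and reading the last diagram as "metadata-only changes reuse the same Spark DataFrame" is an interpretation of the slide.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Immutable metadata describing how backing data maps to a DataFrame."""
    column_labels: Tuple[str, ...]
    index_map: Tuple[str, ...]
    spark_columns: Tuple[str, ...]
    backing: Dict[str, tuple]  # hypothetical stand-in for the Spark DataFrame

    def copy(self, **changes) -> "ToyInternalFrame":
        """Return a new frame with some fields replaced ("a new state")."""
        return replace(self, **changes)


class ToyDataFrame:
    def __init__(self, internal: ToyInternalFrame):
        self._internal = internal

    def rename(self, mapping: Dict[str, str]) -> "ToyDataFrame":
        """Metadata-only change: new InternalFrame, same backing data."""
        labels = tuple(mapping.get(c, c) for c in self._internal.column_labels)
        return ToyDataFrame(self._internal.copy(column_labels=labels))

    def with_column(self, name: str, values) -> "ToyDataFrame":
        """Data change: new InternalFrame and new backing data."""
        backing = {**self._internal.backing, name: tuple(values)}
        return ToyDataFrame(self._internal.copy(
            column_labels=self._internal.column_labels + (name,),
            spark_columns=self._internal.spark_columns + (name,),
            backing=backing,
        ))


base = ToyDataFrame(ToyInternalFrame(
    column_labels=("a", "b"),
    index_map=("__index__",),
    spark_columns=("a", "b"),
    backing={"a": (1, 2), "b": (3, 4)},
))
renamed = base.rename({"a": "x"})
added = base.with_column("c", [5, 6])
```

In this sketch `base` is never modified: `renamed` shares the original backing data, while `added` gets its own.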
  18. Index and Default Index
     ▪ Koalas manages a group of columns as an index.
     ▪ The index behaves the same as pandas’.
     ▪ to_koalas() has index_col parameter to specify index columns.
     ▪ If no index is specified when creating a Koalas DataFrame, it attaches a “default index” automatically.
     ▪ Each “default index” has pros and cons.
  19. Comparison of Default Index Types
     Configurable by the option “compute.default_index_type”
     ▪ sequence
       ▪ Distributed computation: No, in a single worker node
       ▪ Map-side operation: No, requires a shuffle
       ▪ Continuous increment: Yes
     ▪ distributed-sequence
       ▪ Distributed computation: Yes
       ▪ Map-side operation: Yes, but requires another Spark job
       ▪ Continuous increment: Yes, in most cases
     ▪ distributed
       ▪ Distributed computation: Yes
       ▪ Map-side operation: Yes
       ▪ Continuous increment: No
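The trade-offs among the three default index types can be illustrated with a small simulation in plain Python, where a list of lists stands in for the partitions of a distributed dataset. This is a sketch under stated assumptions, not Koalas code: the function names are hypothetical, and the `distributed` variant borrows the bit layout documented for Spark's `monotonically_increasing_id` (partition ID in the upper bits, row position in the lower 33 bits). The simulation is also always continuous for `distributed-sequence`, whereas the slide says only "in most cases".

```python
from itertools import accumulate
from typing import List

# Toy stand-in for a distributed dataset: one inner list per partition.
partitions: List[List[str]] = [["a", "b", "c"], ["d"], ["e", "f"]]


def sequence_index(parts):
    """Number rows 0..n-1 after pulling every row into a single place."""
    rows = [row for part in parts for row in part]
    return list(range(len(rows)))


def distributed_sequence_index(parts):
    """Number rows 0..n-1 per partition, using per-partition offsets.

    Counting the rows of each partition first is the extra pass
    ("another Spark job" in the comparison above).
    """
    sizes = [len(part) for part in parts]
    offsets = [0] + list(accumulate(sizes))[:-1]
    return [[off + i for i in range(len(part))]
            for off, part in zip(offsets, parts)]


def distributed_index(parts):
    """Unique, increasing, but non-consecutive IDs with no extra pass."""
    return [[(pid << 33) + i for i in range(len(part))]
            for pid, part in enumerate(parts)]
```

In Koalas itself the type is selected through the option named on the slide, for example `ks.set_option("compute.default_index_type", "distributed")`.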
  20. Roadmap
     ▪ July/Aug 2020: the DBR/MLR 7.1 release will pre-install Koalas 1.x
     ▪ Improve the coverage and the behavior compatibility of APIs.
     ▪ Visualization
       ▪ Matplotlib
       ▪ ...
     ▪ ML libraries
     ▪ Documentation
       ▪ More examples
       ▪ Workarounds for APIs we won’t support
  21. Getting started
     ▪ pip install koalas
     ▪ conda install -c conda-forge koalas
     ▪ Look for docs on https://koalas.readthedocs.io/en/latest/ and updates on github.com/databricks/koalas
     ▪ A 10-minute tutorial in a live Jupyter notebook is available from the docs.
     ▪ Blog post: 10 Minutes from pandas to Koalas on Apache Spark
       https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
  22. Do you have suggestions or requests?
     ▪ Submit requests to github.com/databricks/koalas/issues
     ▪ Very easy to contribute: koalas.readthedocs.io/en/latest/development/contributing.html
  23. Koalas Session
     ▪ Koalas: pandas on Apache Spark
     ▪ Friday, June 26th 10:00 AM (PDT)
  24. Feedback
     Your feedback is important to us. Don’t forget to rate and review the sessions.
