
Koalas: Making an Easy Transition from Pandas to Apache Spark

Koalas is an open-source project that bridges the gap between big data and small data for data scientists and simplifies Apache Spark for people who are already familiar with the pandas library in Python. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data.


  1. Koalas: Making an Easy Transition from pandas to Apache Spark. Takuya Ueshin, Software Engineer @ Databricks
  2. About Takuya Ueshin: Software Engineer at Databricks ▪ Apache Spark committer and PMC member ▪ Focusing on Spark SQL and PySpark ▪ Koalas maintainer
  3. Outline ▪ What’s Koalas? ▪ pandas vs Apache Spark at a high level ▪ Koalas 1.0 ▪ Demo ▪ InternalFrame ▪ Index and Default Index ▪ Roadmap
  4. What’s Koalas? ▪ Announced April 24, 2019 ▪ Aims at providing the pandas API on top of Apache Spark ▪ Unifies the two ecosystems with a familiar API ▪ Seamless transition between small and large data ▪ For pandas users: scale out pandas code using Koalas; make learning PySpark much easier ▪ For PySpark users: be more productive with pandas-like functions
  5. pandas ▪ Authored by Wes McKinney in 2008 ▪ The standard tool for data manipulation and analysis in Python ▪ The current version: 1.0.4 (chart: Stack Overflow Trends)
  6. pandas ▪ Deeply integrated into the Python data science ecosystem: numpy, matplotlib, scikit-learn ▪ Can deal with a lot of different situations, including: basic statistical analysis; handling missing data; time series, categorical variables, strings
  7. Apache Spark ▪ De facto unified analytics engine for large-scale data processing: streaming, ETL, ML ▪ Originally created at UC Berkeley by Databricks’ founders ▪ PySpark API for Python; also API support for Scala, R and SQL ▪ The latest version: 3.0.0
  8. pandas DataFrame vs. PySpark DataFrame ▪ Column: df['col'] / df['col'] ▪ Mutability: mutable / immutable ▪ Execution: eager / lazy ▪ Add a column: df['c'] = df['a'] + df['b'] / df = df.withColumn('c', df['a'] + df['b']) ▪ Rename columns: df.columns = ['a', 'b'] / df = df.select(df['c1'].alias('a'), df['c2'].alias('b')) or df = df.toDF('a', 'b') ▪ Value count: df['col'].value_counts() / df.groupBy(df['col']).count().orderBy('count', ascending=False)
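The value-count row of the comparison above can be sketched as follows. This is a minimal illustration with made-up toy data; the pandas half runs as-is, while the PySpark equivalent is shown in a comment because it needs a live SparkSession.

```python
import pandas as pd

# Toy data; the column name "col" is illustrative.
df = pd.DataFrame({"col": ["a", "b", "a", "a", "b", "c"]})

# pandas: eager, returns counts already sorted in descending order.
counts = df["col"].value_counts()
print(counts.to_dict())  # {'a': 3, 'b': 2, 'c': 1}

# PySpark equivalent (lazy; nothing executes until an action such as .show()):
# df.groupBy(df['col']).count().orderBy('count', ascending=False)
```

Note the difference the slide highlights: pandas evaluates immediately and returns a Series, while PySpark builds a lazy plan and needs an explicit sort to match pandas’ ordering.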
  9. A short example ▪ pandas: import pandas as pd; df = pd.read_csv("/path/to/my_data.csv"); df.columns = ['x', 'y', 'z1']; df['x2'] = df.x * df.x ▪ PySpark: df = (spark.read.option("inferSchema", "true").csv("/path/to/my_data.csv")); df = df.toDF('x', 'y', 'z1'); df = df.withColumn('x2', df.x * df.x)
  10. A short example ▪ pandas: import pandas as pd; df = pd.read_csv("/path/to/my_data.csv"); df.columns = ['x', 'y', 'z1']; df['x2'] = df.x * df.x ▪ Koalas: import databricks.koalas as ks; df = ks.read_csv("/path/to/my_data.csv"); df.columns = ['x', 'y', 'z1']; df['x2'] = df.x * df.x
  11. Koalas Growth ▪ 30,000+ downloads per day, 800,000+ downloads last month
  12. Koalas 1.0 ▪ Spark 3.0 support ▪ Optimize using Spark 3.0 functions, such as mapInPandas() ▪ Python 3.8 support ▪ pandas 1.0 support (since 0.28.0): Koalas largely follows pandas 1.0 behavior ▪ Remove deprecated functions: functions removed in pandas 1.0; @pandas_wraps, DataFrame.map_in_pandas() ▪ Introduce the spark property and move Spark-specific functions under it
  13. Koalas 1.0 ▪ Most common pandas functions have been implemented in Koalas: Series 70%, DataFrame 77%, Index 65%, MultiIndex 60%, DataFrameGroupBy 67%, SeriesGroupBy 69%, plotting 80% ▪ APIs for Spark users: to_koalas(), to_spark(); DataFrame.spark.to_spark_io(), ks.read_spark_io(), ...; DataFrame.spark.cache(), ks.sql(), ...
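The to_koalas()/to_spark() APIs mentioned above can be sketched as a round trip. This is a hedged sketch, not the talk's demo: it assumes koalas (and a working Spark installation) is available, and falls back to plain pandas otherwise so the snippet still runs; the data is made up.

```python
import pandas as pd

pdf = pd.DataFrame({"x": [1, 2, 3], "y": [10, 20, 30]})

try:
    import databricks.koalas as ks  # requires koalas + a Spark installation

    kdf = ks.from_pandas(pdf)      # pandas -> Koalas DataFrame
    sdf = kdf.to_spark()           # Koalas -> PySpark DataFrame
    kdf2 = sdf.to_koalas()         # PySpark -> Koalas again
    roundtrip = kdf2.to_pandas()   # and back to pandas for local work
except ImportError:
    roundtrip = pdf                # fallback when koalas is not installed

print(int(roundtrip["x"].sum()))  # 6
```

The same data can thus move freely between the pandas, Koalas, and PySpark representations, which is the "seamless transition" the deck advertises.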
  14. Demo
  15. InternalFrame (diagram: Koalas DataFrame → InternalFrame (column_labels, index_map, spark_columns) → Spark DataFrame)
  16. InternalFrame (diagram: an API call returns a new Koalas DataFrame backed by a copy of the InternalFrame with a new state and a new Spark DataFrame)
  17. InternalFrame (diagram: an API call copies the InternalFrame with a new state but reuses the existing Spark DataFrame)
  18. Index and Default Index ▪ Koalas manages a group of columns as an index ▪ The index behaves the same as pandas’ ▪ to_koalas() has an index_col parameter to specify index columns ▪ If no index is specified when creating a Koalas DataFrame, it attaches a “default index” automatically ▪ Each “default index” type has pros and cons
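The index_col parameter described above can be sketched like this. A minimal sketch with toy data, assuming koalas and Spark are installed; the try/except fallback to plain pandas keeps the snippet runnable without them.

```python
import pandas as pd

# Toy frame whose meaningful index ("id") we want to preserve across Spark.
pdf = pd.DataFrame({"id": [101, 102], "value": [3.5, 4.0]}).set_index("id")

try:
    import databricks.koalas as ks

    # to_spark(index_col=...) writes the index out as a regular column, and
    # to_koalas(index_col=...) restores that column as the index, so no
    # "default index" needs to be attached on the way back.
    sdf = ks.from_pandas(pdf).to_spark(index_col="id")
    restored = sdf.to_koalas(index_col="id").to_pandas()
except ImportError:
    restored = pdf  # fallback when koalas is unavailable

print(list(restored.index))  # [101, 102]
```

Without index_col, the round trip would drop the "id" index and Koalas would attach a default index instead, which is what the next slide's comparison is about.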
  19. Comparison of Default Index Types, configurable via the option “compute.default_index_type” ▪ sequence: distributed computation: no, runs in a single worker node; map-side operation: no, requires a shuffle; continuous increment: yes ▪ distributed-sequence: distributed computation: yes; map-side operation: yes, but requires another Spark job; continuous increment: yes, in most cases ▪ distributed: distributed computation: yes; map-side operation: yes; continuous increment: no
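Switching default index types per the comparison above is a one-line option. A hedged sketch, again guarded so it runs without koalas installed; the option name comes from the slide, the fallback value is illustrative.

```python
# Picking the default index type is a trade-off between distributed
# execution and getting a continuous, monotonically increasing index.
try:
    import databricks.koalas as ks

    # "distributed-sequence" is the middle ground in the table: it is
    # computed in parallel, at the cost of one extra Spark job.
    ks.set_option("compute.default_index_type", "distributed-sequence")
    current = ks.get_option("compute.default_index_type")
except ImportError:
    current = "distributed-sequence"  # illustrative fallback

print(current)  # distributed-sequence
```

Any Koalas DataFrame created afterwards without an explicit index then gets this index type attached.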
  20. Roadmap ▪ July/Aug 2020: the DBR/MLR 7.1 release will pre-install Koalas 1.x ▪ Improve API coverage and behavior compatibility ▪ Visualization: Matplotlib, ... ▪ ML libraries ▪ Documentation: more examples; workarounds for APIs we won’t support
  21. Getting started ▪ pip install koalas ▪ conda install -c conda-forge koalas ▪ Look for docs at https://koalas.readthedocs.io/en/latest/ and updates at github.com/databricks/koalas ▪ A 10-minute tutorial in a live Jupyter notebook is available from the docs ▪ Blog post: 10 Minutes from pandas to Koalas on Apache Spark https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
  22. Do you have suggestions or requests? ▪ Submit requests at github.com/databricks/koalas/issues ▪ It’s very easy to contribute: koalas.readthedocs.io/en/latest/development/contributing.html
  23. Koalas Session ▪ Koalas: pandas on Apache Spark ▪ Friday, June 26th, 10:00 AM (PDT)
  24. Feedback ▪ Your feedback is important to us. Don’t forget to rate and review the sessions.
