Koalas: How Well Does Koalas Work?

Koalas is an open source project that provides pandas APIs on top of Apache Spark. pandas is the standard tool for data science and is typically the first step in exploring and manipulating a data set, but it does not scale well to big data. Koalas fills the gap by providing pandas-equivalent APIs that work on Apache Spark.

Many other libraries also try to scale the pandas APIs, such as Vaex and Modin. Dask is one of them: it is very popular among pandas users and runs on its own cluster, much as Koalas runs on top of a Spark cluster. In this talk, we introduce Koalas and its current status, and compare Koalas with Dask, including benchmarks.

  1. 1. Koalas: How Well Does Koalas Work? Takuya Ueshin and Xinrong Meng, Software Engineers @ Databricks
  2. 2. About Us. Takuya Ueshin: ▪ Software Engineer @ Databricks ▪ Apache Spark committer and PMC member ▪ Focusing on Spark SQL and PySpark ▪ Koalas maintainer. Xinrong Meng: ▪ Software Engineer @ Databricks ▪ Koalas maintainer
  3. 3. Agenda ▪ Introduction of Koalas (pandas, PySpark) ▪ Koalas Internal ▪ Benchmark (introduction of Dask, Koalas benchmark against Dask) ▪ Koalas Updates
  4. 4. Introduction of Koalas
  5. 5. What’s Koalas? Announced April 24, 2019. Provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes. For pandas users: scale out pandas code using Koalas; make learning PySpark much easier. For PySpark users: be more productive with pandas-like functions.
  6. 6. pandas. Authored by Wes McKinney in 2008. The standard tool for data manipulation and analysis in Python. Deeply integrated into the Python data science ecosystem: NumPy, Matplotlib, scikit-learn. [Figure: Stack Overflow Trends]
  7. 7. Apache Spark. De facto unified analytics engine for large-scale data processing: streaming, ETL, ML. Originally created at UC Berkeley by Databricks’ founders. PySpark for Python; APIs are also supported for Scala/Java, R, and SQL.
  9. 9. Koalas DataFrame and PySpark DataFrame. Koalas DataFrame: follows the structure of pandas; provides pandas APIs; implements an index/identifier; translates pandas APIs into a logical plan of Spark SQL, which is then optimized and executed by the Spark SQL engine. PySpark DataFrame: more compliant with the relations/tables of relational databases; does not have unique row identifiers.
  10. 10. Koalas Internal
  11. 11. InternalFrame. Internal immutable metadata: the current PySpark DataFrame; PySpark Columns; index names/data column names; index dtypes/data dtypes. Provides conversions between PySpark DataFrame and pandas DataFrame.
  16. 16. InternalFrame. [Diagram: a Koalas DataFrame holds an InternalFrame (index/data_spark_columns, index_names/column_labels, index/data_dtypes), which refers to a PySpark DataFrame.]
  17. 17. InternalFrame. [Diagram: an API call on a Koalas DataFrame yields a new Koalas DataFrame whose InternalFrame (index/data_spark_columns, index_names/column_labels, index/data_dtypes) is a copy with new state and refers to its own PySpark DataFrame.]
  18. 18. InternalFrame. [Diagram: an API call that only updates metadata yields a new Koalas DataFrame whose InternalFrame (index/data_spark_columns, index_names/column_labels, index/data_dtypes) is a copy with new state, while only one PySpark DataFrame is shown.]
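
The "copy with new state" behaviour on the InternalFrame slides can be sketched with a frozen dataclass. This is a toy model only, not the Koalas implementation: `ToyInternalFrame`, `ToyDataFrame`, `spark_frame_id`, and the `rename`/`select` methods are hypothetical names, and the slides do not say which real Koalas APIs are metadata-only.

```python
from dataclasses import dataclass, replace
from typing import Dict, Tuple


@dataclass(frozen=True)
class ToyInternalFrame:
    """Hypothetical stand-in for the immutable metadata on the slides."""
    spark_frame_id: str              # placeholder for the current PySpark DataFrame
    index_names: Tuple[str, ...]
    column_labels: Tuple[str, ...]


class ToyDataFrame:
    """Hypothetical wrapper: every API call returns a new frame."""

    def __init__(self, internal: ToyInternalFrame) -> None:
        self._internal = internal

    def rename(self, mapping: Dict[str, str]) -> "ToyDataFrame":
        # Metadata-only call: the underlying frame id is reused unchanged.
        labels = tuple(mapping.get(c, c) for c in self._internal.column_labels)
        return ToyDataFrame(replace(self._internal, column_labels=labels))

    def select(self, labels: Tuple[str, ...]) -> "ToyDataFrame":
        # A call that would also derive a new underlying frame (modeled as a new id).
        new_id = f"{self._internal.spark_frame_id}.select{labels}"
        return ToyDataFrame(
            replace(self._internal, spark_frame_id=new_id, column_labels=labels)
        )


original = ToyDataFrame(ToyInternalFrame("sdf0", ("idx",), ("a", "b")))
renamed = original.rename({"a": "x"})
selected = original.select(("b",))
```

Because the metadata object is frozen, `original` is never mutated; each call builds a fresh copy via `dataclasses.replace`.
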
  19. 19. Benchmark
  20. 20. Introduction of Dask • A parallel computing framework • Written in pure Python • Uses blocked algorithms and task scheduling
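
To make "blocked algorithms and task scheduling" concrete, here is a minimal sketch of a blocked sum expressed as a task graph and evaluated by a tiny recursive scheduler. It does not import Dask and is not Dask's scheduler; `build_graph`, `evaluate`, and `resolve` are hypothetical helpers, and the dict-of-tuples layout is only loosely modeled on the way Dask describes task graphs.

```python
def build_graph(values, block_size):
    """Split `values` into blocks and describe a blocked sum as a task graph."""
    graph = {}
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    for i, block in enumerate(blocks):
        graph[("block", i)] = block                    # raw data node
        graph[("partial", i)] = (sum, ("block", i))    # task: sum one block
    # Final task: combine the partial sums.
    graph["total"] = (sum, [("partial", i) for i in range(len(blocks))])
    return graph


def resolve(graph, arg, cache):
    """Replace graph keys (possibly nested in lists) by their computed values."""
    if isinstance(arg, list):
        return [resolve(graph, a, cache) for a in arg]
    if arg in graph:
        return evaluate(graph, arg, cache)
    return arg


def evaluate(graph, key, cache=None):
    """Compute one key, evaluating its dependencies first and caching results."""
    cache = {} if cache is None else cache
    if key in cache:
        return cache[key]
    node = graph[key]
    if isinstance(node, tuple) and node and callable(node[0]):
        func, *args = node
        result = func(*(resolve(graph, a, cache) for a in args))
    else:
        result = node
    cache[key] = result
    return result


graph = build_graph(list(range(1, 11)), block_size=4)
total = evaluate(graph, "total")
```

A real scheduler would run independent `("partial", i)` tasks in parallel; this sketch evaluates them one after another to keep the dependency structure visible.
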
  24. 24. Dask is different from Koalas. Execution engine: Apache Spark, a unified analytics engine for large-scale data processing (Koalas) vs. Dask, a graph execution engine (Dask). Aim: a single codebase that works with both pandas and Spark (Koalas) vs. scale pandas workflow (Dask). Abstraction: query plan (Koalas) vs. task graph and task scheduler (Dask). Collections: DataFrame (Koalas) vs. Array, DataFrame, Bag (Dask).
  25. 25. Benchmark setup - Methodology • Dataset: 157 GB Yellow Taxi Trip Records (2009 - 2013) • Operations: basic statistical calculations, joins, grouping • Operations were applied to: the whole dataset; filtered data (36% of the whole dataset); cached filtered data (36% of the whole dataset). The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
  26. 26. Benchmark setup - Environment • Local execution: a single i3.16xlarge VM (488 GB memory | 64 cores | 25 Gigabit Ethernet) • Distributed execution: 1 driver node and 3 worker nodes; each node is an i3.4xlarge VM (122 GB memory | 16 cores | 10 Gigabit Ethernet)
  27. 27. Benchmark results - Overview. Koalas outperformed Dask: local execution, 2.1x (geometric mean) and 4x (simple average); distributed execution, 4.6x (geometric mean) and 7.9x (simple average).
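
The overview reports both a geometric mean and a simple average of the speedups. A short sketch of why the two can differ so much: the per-operation ratios below are hypothetical values chosen for illustration, not the benchmark's measurements.

```python
from statistics import geometric_mean, mean

# Hypothetical per-operation speedup ratios (not taken from the benchmark).
# One operation is a large outlier, as can happen with joins or filters.
speedups = [1.0, 1.5, 2.0, 24.0]

avg = mean(speedups)            # arithmetic mean, pulled up by the outlier
gm = geometric_mean(speedups)   # n-th root of the product, less outlier-sensitive

print(f"simple average: {avg:.2f}x, geometric mean: {gm:.2f}x")
```

The geometric mean is the usual choice for summarizing ratios, since a 2x speedup and a 0.5x slowdown cancel out to 1x under it, which is not the case for the simple average.
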
  28. 28. Benchmark results - On the whole dataset Local execution: Koalas is ~1.2x faster Distributed execution: Koalas is ~2x faster
  29. 29. Benchmark results - On the filtered data Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x faster
  30. 30. Benchmark results - On the cached filtered data Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x faster
  31. 31. Why is Koalas fast? ● Query plan optimization by Catalyst ● Whole-stage code generation
  33. 33. Why is Koalas fast - Catalyst optimizer. Query plan of mean calculation on the filtered data, shown before and after the Catalyst optimization. Pseudocode: expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5); df[expr_filter].fare_amt.mean()
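
The pseudocode on the Catalyst slide is ordinary pandas syntax, so it can be run as-is on a small in-memory frame. The sketch below uses plain pandas so that it runs without a Spark cluster; the column names come from the slide, while the row values are made up for illustration. With Koalas installed on a Spark environment, the frame would instead be built with `import databricks.koalas as ks` and `ks.DataFrame(...)`, and the same two lines would be translated into a Spark SQL plan.

```python
import pandas as pd

# Illustrative rows only; the benchmark used the Yellow Taxi Trip Records.
df = pd.DataFrame({
    "tip_amt": [0.0, 1.0, 2.5, 5.0, 7.0],
    "fare_amt": [4.0, 6.0, 10.0, 20.0, 30.0],
})

# Keep trips whose tip is between 1 and 5 (inclusive), then average the fare.
expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5)
mean_fare = df[expr_filter].fare_amt.mean()
```
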
  34. 34. Why is Koalas fast - Whole-stage code generation ~650% improvement ~1200% improvement
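
Whole-stage code generation collapses the operators of a query stage into a single generated function instead of passing rows through a chain of operator calls. The sketch below shows only the shape of that idea in Python (Spark generates JVM code, and nothing here measures performance): `mean_iterator_model` chains one generator per operator, while `mean_fused` is the hand-written equivalent of a fused loop. All function names and the sample rows are hypothetical.

```python
def scan(rows):
    for row in rows:
        yield row


def filter_op(child, predicate):
    for row in child:
        if predicate(row):
            yield row


def project_op(child, column):
    for row in child:
        yield row[column]


def mean_iterator_model(rows):
    """Operator-at-a-time: each row passes through three generator layers."""
    values = project_op(
        filter_op(scan(rows), lambda r: 1 <= r["tip_amt"] <= 5), "fare_amt"
    )
    total, count = 0.0, 0
    for value in values:
        total += value
        count += 1
    return total / count if count else None


def mean_fused(rows):
    """Fused: scan, filter, projection, and aggregation in one loop."""
    total, count = 0.0, 0
    for row in rows:
        if 1 <= row["tip_amt"] <= 5:
            total += row["fare_amt"]
            count += 1
    return total / count if count else None


rows = [
    {"tip_amt": 0.0, "fare_amt": 4.0},
    {"tip_amt": 1.0, "fare_amt": 6.0},
    {"tip_amt": 2.5, "fare_amt": 10.0},
    {"tip_amt": 5.0, "fare_amt": 20.0},
    {"tip_amt": 7.0, "fare_amt": 30.0},
]
```

Both functions compute the same result; the fused form simply avoids the per-row hand-off between operators.
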
  35. 35. Benchmark conclusions • SQL optimizers improve the performance of DataFrame APIs • Caching accelerates both Koalas and Dask dramatically • Koalas outperforms Dask in the majority of use cases. Reference blog post: "Benchmark: Koalas (PySpark) and Dask"
  36. 36. Koalas updates
  37. 37. Version 1.0.0~1.8.0 ▪ Improve Plotly backend support, and switch the default plotting backend to Plotly ▪ Extension dtypes support ▪ More Index types ▪ Create Index from Series or Index objects ▪ Support setting to a Series via attribute access ▪ Operations between Series and Index ▪ Standardize binary operations between int and str columns ▪ Index operations support ▪ Better type support ▪ Return type annotations for major Koalas objects
  38. 38. Version 1.0.0~1.8.0 ▪ Support for non-string names ▪ Non-named Series support ▪ Wider support of in-place update ▪ Improve distributed-sequence default index ▪ pandas 1.1, 1.1.4 support ▪ Better pandas API coverage ▪ Introduced koalas and Spark accessors ▪ Improve testing infrastructure ▪ Apache Spark 3.0 support ▪ Python 3.8 support ▪ Support for API extensions ▪ Better type hints support
  39. 39. Porting Koalas to Spark. SPIP: Support pandas API layer on PySpark https://issues.apache.org/jira/browse/SPARK-34849
  40. 40. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.
