Koalas: Interoperability Between Koalas and Apache Spark

Koalas is an open source project that provides pandas APIs on top of Apache Spark. pandas is a Python package commonly used among data scientists, but it does not scale out in a distributed manner. Koalas fills the gap by providing pandas-equivalent APIs that work on Apache Spark. Koalas is useful not only for pandas users but also for PySpark users. For example, PySpark users can visualize their data directly from a PySpark DataFrame via the Koalas plotting APIs. In addition, Koalas users can leverage PySpark-specific APIs such as higher-order functions and a rich set of SQL APIs. In this talk, we focus on the PySpark side and the interaction between PySpark and Koalas, so that PySpark users can leverage their knowledge of Apache Spark in Koalas.


  1. Koalas: Interoperability between Koalas and Apache Spark
     Takuya Ueshin and Haejoon Lee, Software Engineers @ Databricks
  2. About Us
     Takuya Ueshin, Software Engineer @ Databricks
     - Apache Spark committer and PMC member
     - Focusing on Spark SQL and PySpark
     - Koalas contributor
     Haejoon Lee, Software Engineer @ Databricks
     - Koalas contributor
  3. Agenda
     - Introduction of Koalas
       - pandas
       - PySpark
     - Conversion from and to PySpark
       - Index and Default Index
     - Spark I/O
       - pandas
       - Koalas specific
     - Spark accessor
     - Demo
  4. Agenda (repeated before the first section)
     - Introduction of Koalas
       - pandas
       - PySpark
     - Conversion from and to PySpark
       - Index and Default Index
     - Spark I/O
       - pandas
       - Koalas specific
     - Spark accessor
     - Demo
  5. What's Koalas?
     Announced April 24, 2019
     Provides a drop-in replacement for pandas, enabling efficient scaling out to hundreds of worker nodes
     For pandas users:
     - Scale out pandas code using Koalas
     - Makes learning PySpark much easier
     For PySpark users:
     - More productive thanks to pandas-like functions
  6. pandas
     Authored by Wes McKinney in 2008
     The standard tool for data manipulation and analysis in Python
     Deeply integrated into the Python data science ecosystem:
     - NumPy
     - Matplotlib
     - scikit-learn
     (chart: Stack Overflow Trends)
  7. Apache Spark
     The de facto unified analytics engine for large-scale data processing (streaming, ETL, ML)
     Originally created at UC Berkeley by Databricks' founders
     PySpark is the Python API; Spark also provides APIs for Scala/Java, R, and SQL
  8. Koalas DataFrame and PySpark DataFrame
     Koalas DataFrame:
     - Follows the structure of pandas and provides pandas APIs
     - Implements an index/identifier
     - Translates pandas APIs into a logical plan of Spark SQL; the plan is then optimized and executed by the Spark SQL engine
     PySpark DataFrame:
     - More compliant with the relations/tables in relational databases
     - Does not have unique row identifiers
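     To make the contrast concrete, here is a minimal sketch (not from the deck; the column names and data are illustrative) of the same computation in the pandas-style Koalas API and the relational-style PySpark API:

        import databricks.koalas as ks
        from pyspark.sql import SparkSession, functions as F

        # pandas-style API; Koalas translates this into a Spark SQL logical plan
        kdf = ks.DataFrame({"x": [1, 2, 3]})
        kdf["y"] = kdf.x + 1

        # the equivalent relational-style PySpark API
        spark = SparkSession.builder.getOrCreate()
        sdf = spark.createDataFrame([(1,), (2,), (3,)], ["x"])
        sdf = sdf.withColumn("y", F.col("x") + 1)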
  9. Agenda (repeated before the next section)
     - Introduction of Koalas
       - pandas
       - PySpark
     - Conversion from and to PySpark
       - Index and Default Index
     - Spark I/O
       - pandas
       - Koalas specific
     - Spark accessor
     - Demo
  10. Conversion from PySpark DataFrame
     spark_df.to_koalas()
     - Attached to Spark DataFrame when Koalas is imported
     - The index_col parameter indicates which columns should be used as the index
     - If not specified, a "default index" is attached
  11. Conversion from PySpark DataFrame
     (screenshot of a conversion example in the original deck; a sketch of the same idea follows)
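     A minimal sketch, with illustrative column names, of the conversion described above:

        import databricks.koalas as ks  # importing Koalas attaches to_koalas() to Spark DataFrames
        from pyspark.sql import SparkSession

        spark = SparkSession.builder.getOrCreate()
        spark_df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

        kdf = spark_df.to_koalas(index_col="id")  # "id" becomes the Koalas index
        kdf2 = spark_df.to_koalas()               # no index_col: a default index is attached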
  12. Conversion to PySpark DataFrame
     koalas_df.to_spark(), also available as koalas_df.spark.frame()
     - The index_col parameter indicates the column names to use for the index
     - If not specified, the index columns are lost
  13. Conversion to PySpark DataFrame
     (screenshot of a conversion example in the original deck; a sketch of the same idea follows)
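     A minimal sketch, with illustrative names, of the reverse conversion:

        import databricks.koalas as ks

        kdf = ks.DataFrame({"value": ["a", "b"]}, index=[1, 2])

        sdf = kdf.to_spark(index_col="id")  # the index is kept as a column named "id"
        sdf2 = kdf.to_spark()               # without index_col, the index columns are lost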
  14. Index and Default Index
     - Koalas manages a group of columns as an index, and the index behaves the same as pandas' index.
     - to_koalas() has the index_col parameter to specify the index columns.
     - If no index is specified when creating a Koalas DataFrame, a "default index" is attached automatically.
     - Koalas has 3 types of "default index", each with its own pros and cons (compared on the next slide).
  15. Comparison of Default Index Types
     Configurable via the option "compute.default_index_type".

     Default index type    | Distributed computation     | Map-side operation                  | Continuous increment | Performance
     ----------------------+-----------------------------+-------------------------------------+----------------------+------------------------
     sequence              | No, in a single worker node | No, requires a shuffle              | Yes                  | Bad for large datasets
     distributed-sequence  | Yes                         | Yes, but requires another Spark job | Yes                  | Good enough
     distributed           | Yes                         | Yes                                 | No                   | Good
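     A minimal sketch of switching the default index type via the option named above:

        import databricks.koalas as ks

        ks.set_option("compute.default_index_type", "distributed-sequence")
        kdf = ks.range(10)  # created without an explicit index, so the default index type applies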
  16. Agenda (repeated before the next section)
     - Introduction of Koalas
       - pandas
       - PySpark
     - Conversion from and to PySpark
       - Index and Default Index
     - Spark I/O
       - pandas
       - Koalas specific
     - Spark accessor
     - Demo
  17. Using Spark I/O
     Functions to read/write data use Spark I/O under the hood:
     - ks.read_csv / DataFrame.to_csv
     - ks.read_json / DataFrame.to_json
     - ks.read_parquet / DataFrame.to_parquet
     - ks.read_sql_table
     - ks.read_sql_query
     The index_col parameter is available to specify the index columns, and keyword arguments are passed through as additional Spark I/O options (as sketched below).
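     A minimal sketch; the paths and the "id" column are illustrative:

        import databricks.koalas as ks

        # read a CSV with Spark under the hood, keeping "id" as the index
        kdf = ks.read_csv("/tmp/input.csv", index_col="id")

        # write Parquet; index_col keeps the index as a regular column in the output
        kdf.to_parquet("/tmp/output.parquet", index_col="id")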
  18. Using Spark I/O
     Koalas-specific I/O functions:
     - ks.read_table / DataFrame.to_table
     - ks.read_spark_io / DataFrame.to_spark_io
     - ks.read_delta / DataFrame.to_delta
     The index_col parameter is available, and keyword arguments are passed through as additional Spark I/O options (as sketched below).
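     A minimal sketch; the path and the table name are illustrative, and the Delta functions assume Delta Lake is available:

        import databricks.koalas as ks

        kdf = ks.read_delta("/tmp/events_delta", index_col="id")
        kdf.to_table("my_database.events", mode="overwrite", index_col="id")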
  19. Agenda (repeated before the next section)
     - Introduction of Koalas
       - pandas
       - PySpark
     - Conversion from and to PySpark
       - Index and Default Index
     - Spark I/O
       - pandas
       - Koalas specific
     - Spark accessor
     - Demo
  20. Spark accessor
     Provides functions to leverage the existing PySpark APIs more easily:
     - transform/apply for using Spark APIs directly
       - Series.spark.transform
       - Series.spark.apply
       - DataFrame.spark.apply
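     A minimal sketch of Series.spark.transform; the data and the logarithm are illustrative. The passed function receives the underlying Spark Column and must return a Column:

        import databricks.koalas as ks
        from pyspark.sql import functions as F

        kdf = ks.DataFrame({"a": [1.0, 2.0, 3.0]})
        # apply a Spark SQL function directly to the underlying Column
        kdf["log_a"] = kdf.a.spark.transform(lambda col: F.log(col))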
  21. Spark accessor
     Provides functions to leverage the existing PySpark APIs more easily:
     - Check the underlying Spark data type or schema
       - Series.spark.data_type
       - DataFrame.spark.schema / print_schema
     - Check the execution plan
       - DataFrame.spark.explain
     - Cache the DataFrame
       - DataFrame.spark.cache
     - Hints
       - DataFrame.spark.hint
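     A minimal sketch of these helpers on an illustrative DataFrame:

        import databricks.koalas as ks

        kdf = ks.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

        print(kdf.a.spark.data_type)  # underlying Spark data type of the column
        kdf.spark.print_schema()      # underlying Spark schema
        kdf.spark.explain()           # Spark SQL execution plan

        with kdf.spark.cache() as cached:  # cache() returns a context manager that unpersists on exit
            print(cached.count())

        hinted = kdf.spark.hint("broadcast")  # attach a join hint to the underlying plan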
  22. Demo
     The notebook is available here.
  23. Getting started
     - Pre-installed in Databricks Runtime 7.1 and higher
     - pip install koalas
     - conda install -c conda-forge koalas
     - GitHub: github.com/databricks/koalas
     - Docs: https://koalas.readthedocs.io/en/latest/
       - A 10-minute tutorial in a live Jupyter notebook is available from the docs.
     - Blog posts:
       - 10 Minutes from pandas to Koalas on Apache Spark: https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html
       - Interoperability between Koalas and Apache Spark: https://databricks.com/blog/2020/08/11/interoperability-between-koalas-and-apache-spark.html
  24. Do you have suggestions or requests?
     - Submit requests to github.com/databricks/koalas/issues
     - Contributing is very easy: koalas.readthedocs.io/en/latest/development/contributing.html
  25. Feedback
     Your feedback is important to us. Don't forget to rate and review the sessions.
