
Koalas: Making an Easy Transition from Pandas to Apache Spark

In this talk, we present Koalas, a new open-source project that aims at bridging the gap between big data and small data for data scientists, and at simplifying Apache Spark for people who are already familiar with the pandas library in Python.

pandas is the standard tool for data science in Python, and it is typically the first tool data scientists reach for to explore and manipulate a data set. The problem is that pandas does not scale well to big data: it was designed for small data sets that a single machine can handle.

When data scientists work with very large data sets today, they either have to migrate to PySpark to leverage Spark, or downsample their data so that they can keep using pandas. This presentation gives a deep dive into the conversion between Spark and pandas DataFrames.

Through live demonstrations and code samples, you will understand:
  • how to effectively leverage both pandas and Spark inside the same code base
  • how to leverage powerful pandas concepts such as lightweight indexing with Spark
  • technical considerations for unifying the different behaviors of Spark and pandas



  1. Koalas: Unifying Spark and pandas APIs. Spark + AI Summit Europe 2019. Tim Hunter, Takuya Ueshin
  2. About
     • Takuya Ueshin, Software Engineer at Databricks: Apache Spark committer and PMC member, focusing on Spark SQL and PySpark; Koalas committer
     • Tim Hunter, Software Engineer at Databricks: co-creator of the Koalas project; contributes to the Apache Spark MLlib, GraphFrames, TensorFrames and Deep Learning Pipelines libraries; Ph.D. in Machine Learning from Berkeley, M.S. in Electrical Engineering from Stanford
  3. Outline
     ● pandas vs Spark at a high level
     ● why Koalas (combine everything in one package)
       ○ key differences
     ● current status & new features
     ● demo
     ● technical topics
       ○ InternalFrame
       ○ operations on different DataFrames
       ○ default index
     ● roadmap
  4. Typical journey of a data scientist
     • Education (MOOCs, books, universities) → pandas
     • Analyze small data sets → pandas
     • Analyze big data sets → DataFrame in Spark
  5. pandas
     Authored by Wes McKinney in 2008, pandas is the standard tool for data manipulation and analysis in Python, deeply integrated into the Python data science ecosystem (e.g. NumPy, matplotlib). It can deal with a lot of different situations, including:
     - basic statistical analysis
     - handling missing data
     - time series, categorical variables, strings
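A minimal, self-contained sketch of the missing-data handling mentioned above, using toy data of our own (not from the talk):

```python
import pandas as pd
import numpy as np

# Toy Series with a missing value (illustrative data)
s = pd.Series([1.0, np.nan, 3.0])

# Handling missing data: fill NaN with a default value
filled = s.fillna(0.0)

# Basic statistics skip NaN by default
mean = s.mean()

print(filled.tolist())  # [1.0, 0.0, 3.0]
print(mean)             # 2.0
```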
  6. Apache Spark
     The de facto unified analytics engine for large-scale data processing (streaming, ETL, ML). Originally created at UC Berkeley by Databricks' founders. PySpark is the API for Python; there is also API support for Scala, R and SQL.
  7. pandas DataFrame vs Spark DataFrame

     |                | pandas DataFrame            | Spark DataFrame                                                 |
     |----------------|-----------------------------|-----------------------------------------------------------------|
     | Column         | df['col']                   | df['col']                                                       |
     | Mutability     | Mutable                     | Immutable                                                       |
     | Add a column   | df['c'] = df['a'] + df['b'] | df.withColumn('c', df['a'] + df['b'])                           |
     | Rename columns | df.columns = ['a','b']      | df.select(df['c1'].alias('a'), df['c2'].alias('b'))             |
     | Value count    | df['col'].value_counts()    | df.groupBy(df['col']).count().orderBy('count', ascending=False) |
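The pandas column of the table can be exercised with a small runnable sketch; the column names and toy values here are our own, chosen for illustration:

```python
import pandas as pd

# Toy DataFrame with hypothetical columns 'a' and 'b'
df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

# Add a column: pandas DataFrames are mutable
df['c'] = df['a'] + df['b']

# Rename columns by assigning to .columns
df.columns = ['x', 'y', 'z']

# Value count: occurrences of each value, most frequent first
counts = pd.Series(['u', 'v', 'u']).value_counts()

print(df['z'].tolist())   # [11, 22, 33]
print(counts.index[0])    # 'u' (the most frequent value)
```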
  8. A short example

     pandas:

     ```python
     import pandas as pd
     df = pd.read_csv("my_data.csv")
     df.columns = ['x', 'y', 'z1']
     df['x2'] = df.x * df.x
     ```

     PySpark:

     ```python
     df = (spark.read
           .option("inferSchema", "true")
           .option("comment", True)
           .csv("my_data.csv"))
     df = df.toDF('x', 'y', 'z1')
     df = df.withColumn('x2', df.x * df.x)
     ```
  9. Koalas
     • Announced April 24, 2019
     • Pure Python library
     • Aims at providing the pandas API on top of Apache Spark:
       - unifies the two ecosystems with a familiar API
       - seamless transition between small and large data
  10. Quickly gaining traction
     • Bi-weekly releases!
     • > 500 patches merged since the announcement
     • > 20 significant contributors outside of Databricks
     • > 8k daily downloads
  11. A short example

     pandas:

     ```python
     import pandas as pd
     df = pd.read_csv("my_data.csv")
     df.columns = ['x', 'y', 'z1']
     df['x2'] = df.x * df.x
     ```

     Koalas:

     ```python
     import databricks.koalas as ks
     df = ks.read_csv("my_data.csv")
     df.columns = ['x', 'y', 'z1']
     df['x2'] = df.x * df.x
     ```
  12. Koalas
     - Provides discoverable APIs for common data science tasks (i.e., follows pandas)
     - Unifies the pandas API and the Spark API, but pandas first
     - pandas APIs that are appropriate for distributed data sets
     - Easy conversion from/to a pandas DataFrame or NumPy array
  13. Key differences
     • Spark is more lazy by nature: most operations only happen when displaying or writing a DataFrame
     • Spark does not maintain row order
     • Performance considerations when working at scale
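Spark's laziness can be illustrated without a cluster. The sketch below is only a loose analogy using a Python generator (not Spark code): building the pipeline does no work, and only an "action" triggers computation.

```python
# Loose analogy for Spark's laziness (no Spark required):
# a generator, like a Spark plan, does nothing until evaluated.
log = []

def doubled(xs):
    for x in xs:
        log.append(x)   # record when work actually happens
        yield x * 2

pipeline = doubled(range(3))  # a plan; nothing computed yet
assert log == []              # no work has been done so far
result = list(pipeline)       # the "action" forces evaluation
print(result)                 # [0, 2, 4]
```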
  14. Current status
     Bi-weekly releases; a very active community with daily changes. The most common functions have been implemented:
     - 60% of the DataFrame / Series API
     - 60% of the DataFrameGroupBy / SeriesGroupBy API
     - 15% of the Index / MultiIndex API
     - to_datetime, get_dummies, …
  15. New features
     - 80% of the plot functions (0.16.0-)
     - Spark-related functions (0.8.0-)
       - IO: to_parquet/read_parquet, to_csv/read_csv, to_json/read_json, to_spark_io/read_spark_io, to_delta/read_delta, ...
       - SQL
       - cache
     - Support for multi-index columns (90%) (0.16.0-)
     - Options to configure Koalas' behavior (0.17.0-)
  16. Demo
  17. How Virgin Hyperloop One reduced processing time from hours to minutes with Koalas
     • Challenge: increasing scale and complexity of data operations
     • Struggling with the "Spark switch" from pandas
     • More than 10x faster with less than 1% code changes
  18. InternalFrame
     Internal immutable metadata. It:
     - holds the current Spark DataFrame
     - manages the mapping from Koalas column names to Spark column names
     - manages the mapping from Koalas index names to Spark column names
     - converts between Spark DataFrames and pandas DataFrames
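A highly simplified sketch of the idea, with hypothetical names of our own (the real implementation lives in `databricks.koalas.internal` and holds an actual Spark DataFrame rather than the string placeholder used here):

```python
# Hypothetical, simplified model of the InternalFrame concept:
# immutable metadata that maps Koalas names to Spark columns.
class SimpleInternalFrame:
    def __init__(self, sdf, column_map, index_map):
        self.sdf = sdf                # placeholder for the (immutable) Spark DataFrame
        self.column_map = column_map  # Koalas column name -> Spark column name
        self.index_map = index_map    # Koalas index name  -> Spark column name

    def with_new_sdf(self, new_sdf):
        # State is never mutated: an API call copies the metadata with new state.
        return SimpleInternalFrame(new_sdf, dict(self.column_map), dict(self.index_map))

frame = SimpleInternalFrame("sdf_v1", {'x': '__x__'}, {None: '__index__'})
frame2 = frame.with_new_sdf("sdf_v2")  # a copy; 'frame' is untouched
```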
  19. InternalFrame (diagram: a Koalas DataFrame wraps an InternalFrame, which holds column_index and index_map and points to a Spark DataFrame)
  20. InternalFrame (diagram: an API call copies the InternalFrame with new state, producing a new Spark DataFrame and a new Koalas DataFrame)
  21. InternalFrame (diagram: some calls, e.g. kdf.set_index(...), only update metadata; the copied InternalFrame keeps pointing at the same Spark DataFrame)
  22. InternalFrame (diagram: with inplace=True, e.g. kdf.dropna(..., inplace=True), the existing Koalas DataFrame is re-pointed at the copied InternalFrame)
  23. Operations on different DataFrames
     By default, we only allow Series derived from the same DataFrame.
     OK:
     - df.a + df.b
     - df['c'] = df.a * df.b
     Not OK:
     - df1.a + df2.b
     - df1['c'] = df2.a * df2.b
  24. Operations on different DataFrames
     Equivalent Spark code? For the OK case: sdf.select(sdf['a'] + sdf['b']). For the Not OK case: sdf1.select(sdf1['a'] + sdf2['b']) ... ?
  25. Operations on different DataFrames
     sdf1.select(sdf1['a'] + sdf2['b']) raises an AnalysisException: Spark cannot select a column from one DataFrame out of another DataFrame.
  26. Operations on different DataFrames
     With ks.set_option('compute.ops_on_diff_frames', True), both cases become OK: Koalas combines the two DataFrames with a join on the index, e.g. sdf1.join(sdf2, on="_index_").select('a * b').
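pandas itself aligns on the index when combining Series from different sources, which is essentially the join Koalas must ask Spark to perform. A pure-pandas sketch of that alignment, with toy data of our own:

```python
import pandas as pd

# Two Series with only partially overlapping indexes
s1 = pd.Series([1, 2, 3], index=[0, 1, 2])
s2 = pd.Series([10, 20, 30], index=[1, 2, 3])

# Adding them aligns by index label; labels present on only one
# side yield NaN, much like an outer join on the index.
result = s1 + s2
print(result.to_dict())
```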
  27. Default index
     Koalas manages a group of columns as the index, and the index behaves the same as pandas'. If no index is specified when creating a Koalas DataFrame, a "default index" is attached automatically. Each type of "default index" has pros and cons.
  28. Default indexes
     Configurable via the option "compute.default_index_type".

     | default index type   | requires collecting data into a single node | requires shuffle | continuous increments |
     |----------------------|---------------------------------------------|------------------|-----------------------|
     | sequence             | YES                                         | YES              | YES                   |
     | distributed-sequence | NO                                          | YES / NO         | YES / NO              |
     | distributed          | NO                                          | NO               | NO                    |

     See also: https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type
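The "default index" emulates what pandas does implicitly; a small pandas-only sketch of the behavior being emulated (toy data of our own):

```python
import pandas as pd

# In pandas, a DataFrame created without an explicit index gets a
# RangeIndex (0, 1, 2, ...) for free. The default index types in the
# table are Koalas' ways of emulating this on distributed data, each
# trading off single-node collection, shuffles, and continuity.
df = pd.DataFrame({'a': [10, 20, 30]})
print(type(df.index).__name__)  # RangeIndex
print(list(df.index))           # [0, 1, 2]
```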
  29. What to expect?
     • Improve pandas API coverage: rolling/expanding
     • Support categorical data types
     • More time-series related functions
     • Improve performance
       - minimize the overhead at the Koalas layer
       - optimal implementation of APIs
  30. Getting started
     pip install koalas
     conda install koalas
     Look for docs at https://koalas.readthedocs.io/en/latest/ and updates at github.com/databricks/koalas. A 10-minute tutorial in a live Jupyter notebook is available from the docs.
  31. Do you have suggestions or requests?
     Submit requests to github.com/databricks/koalas/issues. It is very easy to contribute: koalas.readthedocs.io/en/latest/development/contributing.html
  32. Koalas sessions
     • Koalas: Pandas on Apache Spark (Tutorial), 14:30, Room G104
     • AMA: Koalas, 16:00, Expo Hall
  33. Thank you! Q&A. Get started at databricks.com/try
