SlideShare a Scribd company logo
Koalas
How Well Does Koalas Work?
Takuya Ueshin, Xinrong Meng
Software Engineer @ Databricks
About Us
Takuya Ueshin
▪ Software Engineer @ Databricks
▪ Apache Spark committer and PMC
member
▪ Focusing on Spark SQL and PySpark
▪ Koalas maintainer
Xinrong Meng
▪ Software Engineer @ Databricks
▪ Koalas maintainer
Agenda
▪ Introduction of Koalas
pandas
PySpark
▪ Koalas Internal
▪ Benchmark
Introduction of Dask
Koalas benchmark against Dask
▪ Koalas Updates
Introduction of Koalas
What’s Koalas?
Announced April 24, 2019
Provides a drop-in replacement for pandas
- enabling efficient scaling out to hundred of worker nodes
For pandas users
- Scale out the pandas code using Koalas
- Make learning PySpark much easier
For PySpark users
- More productive by pandas-like functions
pandas
Authored by Wes McKinney in 2008
The standard tool for data manipulation
and analysis in Python
Deeply integrated into Python data
science ecosystem
- NumPy
- Matplotlib
- scikit-learn
Stack Overflow Trends
Apache Spark
De facto unified analytics engine for large-scale data processing
- Streaming
- ETL
- ML
Originally created at UC Berkeley by Databricks’ founders
PySpark for Python;
also APIs support for Scala/Java, R, and SQL
Koalas DataFrame and PySpark DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- More compliant with the
relations/tables in relational
databases
- Does not have unique row identifiers
PySpark DataFrame
Koalas DataFrame
Koalas DataFrame and PySpark DataFrame
- Follow the structure of pandas
- Provide pandas APIs
- Implement index/identifier
- Translate pandas APIs into a logical
plan of Spark SQL
- The plan will be optimized and
executed by Spark SQL engine
- More compliant with the
relations/tables in relational
databases
- Does not have unique row identifiers
PySpark DataFrame
Koalas DataFrame
Koalas Internal
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Internal Immutable metadata.
- The current PySpark DataFrame
- PySpark Columns
- Index names/data column names
- Index dtypes/data dtypes
- Provides conversions between PySpark DataFrame
and pandas DataFrame
InternalFrame
Koalas
DataFrame
PySpark
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
InternalFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
API call
copy with new
state
InternalFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
PySpark
DataFrame
Koalas
DataFrame
InternalFrame
- index/data_spark_columns
- index_names/column_labels
- index/data_dtypes
API call
Only updates metadata
copy with new
state
Benchmark
Introduction of Dask
• A parallel computing framework
• Written in pure python
• Using blocked algorithms and
task scheduling
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
Abstraction
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction Query plan Task graph and task scheduler
Collections
Dask is different from Koalas
Koalas Dask
Execution engine
Apache Spark, a unified analytics engine
for large-scale data processing
Dask, a graph execution engine
Aim
A single codebase that works with both
pandas and Spark
Scale pandas workflow
Abstraction Query plan Task graph and task scheduler
Collections DataFrame Array, DataFrame, Bag
Benchmark setup - Methodology
• Dataset
157 GB Yellow Taxi Trip Records (2009 - 2013)
• Operations
Basic statistical calculations
Joins
Grouping
• Operations were applied to
The whole dataset
Filtered data (36% whole dataset)
Cached filtered data (36% whole dataset)
The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
Benchmark setup - Environment
• Local execution
A single i3.16xlarge VM:
(488 GB memory | 64 cores | 25 Gigabit Ethernet)
• Distributed execution
1 driver node, 3 worker nodes
Each node is a i3.4xlarge VM:
(122 GB memory | 16 cores | 10 Gigabit Ethernet)
Benchmark results - Overview
Geometric Mean Simple Average
Local execution 2.1x 4x
Distributed execution 4.6x 7.9x
Koalas outperformed Dask:
Benchmark results - On the whole dataset
Local execution: Koalas is ~1.2x
faster
Distributed execution: Koalas is ~2x
faster
Benchmark results - On the filtered data
Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x
faster
Benchmark results - On the cached filtered data
Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x
faster
Why is Koalas fast?
● Query plan optimization by Catalyst
● Whole-stage code generation
Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst’s optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) &
(df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
Why is Koalas fast - Catalyst optimizer
Query plan of mean calculation on the filtered data
• Before the Catalyst optimization
• After the Catalyst optimization
# Pseudocode
expr_filter = (df.tip_amt >= 1) &
(df.tip_amt <= 5)
df[expr_filter].fare_amt.mean()
Why is Koalas fast - Whole-stage code generation
~650%
improvement
~1200%
improvement
Benchmark conclusions
• SQL optimizers improve the performance of DataFrame
APIs
• Caching accelerates both Koalas and Dask dramatically
• Koalas outperforms Dask in the majority of use cases
Reference blog post : Benchmark: Koalas (PySpark) and Dask
Koalas updates
Version 1.0.0~1.8.0
▪ Improve Plotly backend support, and
switch the default plotting backend
to Plotly
▪ Extension dtypes support
▪ More Index types
▪ Create Index from Series or Index
objects
▪ Support setting to a Series via
attribute access
▪ Operations between Series and Index
▪ Standardize binary operations
between int and str columns
▪ Index operations support
▪ Better type support
▪ Return type annotations for major
Koalas objects
Version 1.0.0~1.8.0
▪ Support for non-string names
▪ Non-named Series support
▪ Wider support of in-place update
▪ Improve distributed-sequence
default index
▪ pandas 1.1, 1.1.4 support
▪ Better pandas API coverage
▪ Introduced koalas and Spark
accessors
▪ Improve testing infrastructure
▪ Apache Spark 3.0 support
▪ Python 3.8 support
▪ Support for API extensions
▪ Better type hints support
Porting Koalas to Spark
SPIP: Support pandas API layer on PySpark
https://issues.apache.org/jira/browse/SPARK-
34849
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

What's hot

Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Yoshiyasu SAEKI
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
Databricks
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
Dremio Corporation
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Databricks
 
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Cloudera Japan
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
Databricks
 
がっつりMongoDB事例紹介
がっつりMongoDB事例紹介がっつりMongoDB事例紹介
がっつりMongoDB事例紹介
Tetsutaro Watanabe
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Dremio Corporation
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
elephantscale
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
NTT DATA Technology & Innovation
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
Lam Le
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
Databricks
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
Databricks
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Cloudera Japan
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
ScyllaDB
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
Databricks
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
James Serra
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debezium
Kasun Don
 

What's hot (20)

Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
Apache Sparkにおけるメモリ - アプリケーションを落とさないメモリ設計手法 -
 
Diving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction LogDiving into Delta Lake: Unpacking the Transaction Log
Diving into Delta Lake: Unpacking the Transaction Log
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark ClustersFrom HDFS to S3: Migrate Pinterest Apache Spark Clusters
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
 
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
Impala + Kudu を用いたデータウェアハウス構築の勘所 (仮)
 
Introduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse ArchitectureIntroduction SQL Analytics on Lakehouse Architecture
Introduction SQL Analytics on Lakehouse Architecture
 
がっつりMongoDB事例紹介
がっつりMongoDB事例紹介がっつりMongoDB事例紹介
がっつりMongoDB事例紹介
 
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark JobsFine Tuning and Enhancing Performance of Apache Spark Jobs
Fine Tuning and Enhancing Performance of Apache Spark Jobs
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
Changing the game with cloud dw
Changing the game with cloud dwChanging the game with cloud dw
Changing the game with cloud dw
 
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
分析指向データレイク実現の次の一手 ~Delta Lake、なにそれおいしいの?~(NTTデータ テクノロジーカンファレンス 2020 発表資料)
 
Module 2 - Datalake
Module 2 - DatalakeModule 2 - Datalake
Module 2 - Datalake
 
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
The Modern Data Team for the Modern Data Stack: dbt and the Role of the Analy...
 
Building End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCPBuilding End-to-End Delta Pipelines on GCP
Building End-to-End Delta Pipelines on GCP
 
Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理Apache Hadoop YARNとマルチテナントにおけるリソース管理
Apache Hadoop YARNとマルチテナントにおけるリソース管理
 
Apache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the CoversApache Iceberg: An Architectural Look Under the Covers
Apache Iceberg: An Architectural Look Under the Covers
 
Building Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta LakeBuilding Reliable Data Lakes at Scale with Delta Lake
Building Reliable Data Lakes at Scale with Delta Lake
 
Introduction to Azure Databricks
Introduction to Azure DatabricksIntroduction to Azure Databricks
Introduction to Azure Databricks
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Kafka Connect - debezium
Kafka Connect - debeziumKafka Connect - debezium
Kafka Connect - debezium
 

Similar to Koalas: How Well Does Koalas Work?

Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Xiao Li
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
Miklos Christine
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
MapR Technologies
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Databricks
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
Rahul Jain
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
Databricks
 
Koalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache SparkKoalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache Spark
Databricks
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
Databricks
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
Rahul Jain
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
Amazon Web Services
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
Eva Tse
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
Data Con LA
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Ceph Community
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
Chetan Khatri
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Spark Summit
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
Maloy Manna, PMP®
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Helena Edelson
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
DB Tsai
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
NoSQLmatters
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
Yousun Jeong
 

Similar to Koalas: How Well Does Koalas Work? (20)

Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov... Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
Apache Spark for RDBMS Practitioners: How I Learned to Stop Worrying and Lov...
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
Koalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache SparkKoalas: Interoperability Between Koalas and Apache Spark
Koalas: Interoperability Between Koalas and Apache Spark
 
Koalas: Pandas on Apache Spark
Koalas: Pandas on Apache SparkKoalas: Pandas on Apache Spark
Koalas: Pandas on Apache Spark
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform(BDT303) Running Spark and Presto on the Netflix Big Data Platform
(BDT303) Running Spark and Presto on the Netflix Big Data Platform
 
Running Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data PlatformRunning Presto and Spark on the Netflix Big Data Platform
Running Presto and Spark on the Netflix Big Data Platform
 
Explore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and SnappydataExplore big data at speed of thought with Spark 2.0 and Snappydata
Explore big data at speed of thought with Spark 2.0 and Snappydata
 
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu YongUnlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
Unlock Bigdata Analytic Efficiency with Ceph Data Lake - Zhang Jian, Fu Yong
 
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
HKOSCon18 - Chetan Khatri - Scaling TB's of Data with Apache Spark and Scala ...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
Data processing with spark in r &amp; python
Data processing with spark in r &amp; pythonData processing with spark in r &amp; python
Data processing with spark in r &amp; python
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
Johnny Miller – Cassandra + Spark = Awesome- NoSQL matters Barcelona 2014
 
Big Telco Real-Time Network Analytics
Big Telco Real-Time Network AnalyticsBig Telco Real-Time Network Analytics
Big Telco Real-Time Network Analytics
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Recently uploaded

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
jerlynmaetalle
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
javier ramirez
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
Roger Valdez
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
2023240532
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Subhajit Sahu
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
u86oixdj
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
v3tuleee
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
dwreak4tg
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 

Recently uploaded (20)

The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...The affect of service quality and online reviews on customer loyalty in the E...
The affect of service quality and online reviews on customer loyalty in the E...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
The Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series DatabaseThe Building Blocks of QuestDB, a Time Series Database
The Building Blocks of QuestDB, a Time Series Database
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Everything you wanted to know about LIHTC
Everything you wanted to know about LIHTCEverything you wanted to know about LIHTC
Everything you wanted to know about LIHTC
 
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
Quantitative Data AnalysisReliability Analysis (Cronbach Alpha) Common Method...
 
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTESAdjusting OpenMP PageRank : SHORT REPORT / NOTES
Adjusting OpenMP PageRank : SHORT REPORT / NOTES
 
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
原版制作(swinburne毕业证书)斯威本科技大学毕业证毕业完成信一模一样
 
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理一比一原版(UofS毕业证书)萨省大学毕业证如何办理
一比一原版(UofS毕业证书)萨省大学毕业证如何办理
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
一比一原版(BCU毕业证书)伯明翰城市大学毕业证如何办理
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 

Koalas: How Well Does Koalas Work?

  • 1. Koalas How Well Does Koalas Work? Takuya Ueshin, Xinrong Meng Software Engineer @ Databricks
  • 2. About Us Takuya Ueshin ▪ Software Engineer @ Databricks ▪ Apache Spark committer and PMC member ▪ Focusing on Spark SQL and PySpark ▪ Koalas maintainer Xinrong Meng ▪ Software Engineer @ Databricks ▪ Koalas maintainer
  • 3. Agenda ▪ Introduction of Koalas pandas PySpark ▪ Koalas Internal ▪ Benchmark Introduction of Dask Koalas benchmark against Dask ▪ Koalas Updates
  • 5. What’s Koalas? Announced April 24, 2019 Provides a drop-in replacement for pandas - enabling efficient scaling out to hundred of worker nodes For pandas users - Scale out the pandas code using Koalas - Make learning PySpark much easier For PySpark users - More productive by pandas-like functions
  • 6. pandas Authored by Wes McKinney in 2008 The standard tool for data manipulation and analysis in Python Deeply integrated into Python data science ecosystem - NumPy - Matplotlib - scikit-learn Stack Overflow Trends
  • 7. Apache Spark De facto unified analytics engine for large-scale data processing - Streaming - ETL - ML Originally created at UC Berkeley by Databricks’ founders PySpark for Python; also APIs support for Scala/Java, R, and SQL
  • 8. Koalas DataFrame and PySpark DataFrame - Follow the structure of pandas - Provide pandas APIs - Implement index/identifier - More compliant with the relations/tables in relational databases - Does not have unique row identifiers PySpark DataFrame Koalas DataFrame
  • 9. Koalas DataFrame and PySpark DataFrame - Follow the structure of pandas - Provide pandas APIs - Implement index/identifier - Translate pandas APIs into a logical plan of Spark SQL - The plan will be optimized and executed by Spark SQL engine - More compliant with the relations/tables in relational databases - Does not have unique row identifiers PySpark DataFrame Koalas DataFrame
  • 11. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 12. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 13. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 14. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 15. InternalFrame Internal Immutable metadata. - The current PySpark DataFrame - PySpark Columns - Index names/data column names - Index dtypes/data dtypes - Provides conversions between PySpark DataFrame and pandas DataFrame
  • 17. InternalFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame API call copy with new state
  • 18. InternalFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes PySpark DataFrame Koalas DataFrame InternalFrame - index/data_spark_columns - index_names/column_labels - index/data_dtypes API call Only updates metadata copy with new state
  • 20. Introduction of Dask • A parallel computing framework • Written in pure python • Using blocked algorithms and task scheduling
  • 21. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim Abstraction Collections
  • 22. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Collections
  • 23. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Query plan Task graph and task scheduler Collections
  • 24. Dask is different from Koalas Koalas Dask Execution engine Apache Spark, a unified analytics engine for large-scale data processing Dask, a graph execution engine Aim A single codebase that works with both pandas and Spark Scale pandas workflow Abstraction Query plan Task graph and task scheduler Collections DataFrame Array, DataFrame, Bag
  • 25. Benchmark setup - Methodology • Dataset 157 GB Yellow Taxi Trip Records (2009 - 2013) • Operations Basic statistical calculations Joins Grouping • Operations were applied to The whole dataset Filtered data (36% whole dataset) Cached filtered data (36% whole dataset) The scenario used in this benchmark was inspired by https://github.com/xdssio/big_data_benchmarks.
  • 26. Benchmark setup - Environment • Local execution A single i3.16xlarge VM: (488 GB memory | 64 cores | 25 Gigabit Ethernet) • Distributed execution 1 driver node, 3 worker nodes Each node is a i3.4xlarge VM: (122 GB memory | 16 cores | 10 Gigabit Ethernet)
  • 27. Benchmark results - Overview Geometric Mean Simple Average Local execution 2.1x 4x Distributed execution 4.6x 7.9x Koalas outperformed Dask:
  • 28. Benchmark results - On the whole dataset Local execution: Koalas is ~1.2x faster Distributed execution: Koalas is ~2x faster
  • 29. Benchmark results - On the filtered data Local execution: Koalas is ~6x faster Distributed execution: Koalas is ~9x faster
  • 30. Benchmark results - On the cached filtered data Local execution: Koalas is ~1.4x faster Distributed execution: Koalas is ~5x faster
  • 31. Why is Koalas fast? ● Query plan optimization by Catalyst ● Whole-stage code generation
  • 32. Why is Koalas fast - Catalyst optimizer Query plan of mean calculation on the filtered data • Before the Catalyst’s optimization # Pseudocode expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5) df[expr_filter].fare_amt.mean()
  • 33. Why is Koalas fast - Catalyst optimizer Query plan of mean calculation on the filtered data • Before the Catalyst optimization • After the Catalyst optimization # Pseudocode expr_filter = (df.tip_amt >= 1) & (df.tip_amt <= 5) df[expr_filter].fare_amt.mean()
  • 34. Why is Koalas fast - Whole-stage code generation ~650% improvement ~1200% improvement
  • 35. Benchmark conclusions • SQL optimizers improve the performance of DataFrame APIs • Caching accelerates both Koalas and Dask dramatically • Koalas outperforms Dask in the majority of use cases Reference blog post : Benchmark: Koalas (PySpark) and Dask
  • 37. Version 1.0.0~1.8.0 ▪ Improve Plotly backend support, and switch the default plotting backend to Plotly ▪ Extension dtypes support ▪ More Index types ▪ Create Index from Series or Index objects ▪ Support setting to a Series via attribute access ▪ Operations between Series and Index ▪ Standardize binary operations between int and str columns ▪ Index operations support ▪ Better type support ▪ Return type annotations for major Koalas objects
  • 38. Version 1.0.0~1.8.0 ▪ Support for non-string names ▪ Non-named Series support ▪ Wider support of in-place update ▪ Improve distributed-sequence default index ▪ pandas 1.1, 1.1.4 support ▪ Better pandas API coverage ▪ Introduced koalas and Spark accessors ▪ Improve testing infrastructure ▪ Apache Spark 3.0 support ▪ Python 3.8 support ▪ Support for API extensions ▪ Better type hints support
  • 39. Porting Koalas to Spark SPIP: Support pandas API layer on PySpark https://issues.apache.org/jira/browse/SPARK- 34849
  • 40. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.