SlideShare a Scribd company logo
Submit Search
Upload
High Performance Python on Apache Spark
Report
Share
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Follow
•
41 likes
•
16,556 views
1
of
35
High Performance Python on Apache Spark
•
41 likes
•
16,556 views
Report
Share
Download Now
Download to read offline
Technology
Slides from my talk at Spark Summit West 2016 on June 7, 2016
Read more
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Follow
Recommended
Improving Python and Spark (PySpark) Performance and Interoperability by
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
19.8K views
•
37 slides
Integrating Existing C++ Libraries into PySpark with Esther Kundin by
Integrating Existing C++ Libraries into PySpark with Esther Kundin
Databricks
4.5K views
•
44 slides
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
1.1K views
•
26 slides
Using LLVM to accelerate processing of data in Apache Arrow by
Using LLVM to accelerate processing of data in Apache Arrow
DataWorks Summit
2.1K views
•
31 slides
Apache Arrow Flight Overview by
Apache Arrow Flight Overview
Jacques Nadeau
6K views
•
8 slides
Scaling your Data Pipelines with Apache Spark on Kubernetes by
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
2.1K views
•
37 slides
More Related Content
What's hot
Optimizing Delta/Parquet Data Lakes for Apache Spark by
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
2.5K views
•
51 slides
Migrating to Apache Spark at Netflix by
Migrating to Apache Spark at Netflix
Databricks
2.2K views
•
34 slides
Fine Tuning and Enhancing Performance of Apache Spark Jobs by
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
2.5K views
•
31 slides
Session découverte de la Data Virtualization by
Session découverte de la Data Virtualization
Denodo
209 views
•
61 slides
Databricks: A Tool That Empowers You To Do More With Data by
Databricks: A Tool That Empowers You To Do More With Data
Databricks
457 views
•
9 slides
The Apache Spark File Format Ecosystem by
The Apache Spark File Format Ecosystem
Databricks
2.1K views
•
44 slides
What's hot
(20)
Optimizing Delta/Parquet Data Lakes for Apache Spark by Databricks
Optimizing Delta/Parquet Data Lakes for Apache Spark
Databricks
•
2.5K views
Migrating to Apache Spark at Netflix by Databricks
Migrating to Apache Spark at Netflix
Databricks
•
2.2K views
Fine Tuning and Enhancing Performance of Apache Spark Jobs by Databricks
Fine Tuning and Enhancing Performance of Apache Spark Jobs
Databricks
•
2.5K views
Session découverte de la Data Virtualization by Denodo
Session découverte de la Data Virtualization
Denodo
•
209 views
Databricks: A Tool That Empowers You To Do More With Data by Databricks
Databricks: A Tool That Empowers You To Do More With Data
Databricks
•
457 views
The Apache Spark File Format Ecosystem by Databricks
The Apache Spark File Format Ecosystem
Databricks
•
2.1K views
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp... by FIWARE
FIWARE Wednesday Webinars - Performing Big Data Analysis Using Cosmos With Sp...
FIWARE
•
611 views
Aggregated queries with Druid on terrabytes and petabytes of data by Rostislav Pashuto
Aggregated queries with Druid on terrabytes and petabytes of data
Rostislav Pashuto
•
9.7K views
From HDFS to S3: Migrate Pinterest Apache Spark Clusters by Databricks
From HDFS to S3: Migrate Pinterest Apache Spark Clusters
Databricks
•
490 views
Architect’s Open-Source Guide for a Data Mesh Architecture by Databricks
Architect’s Open-Source Guide for a Data Mesh Architecture
Databricks
•
3.1K views
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf by Anya Bida
Run Apache Spark on Kubernetes in Large Scale_ Challenges and Solutions-2.pdf
Anya Bida
•
210 views
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri... by Spark Summit
From DataFrames to Tungsten: A Peek into Spark's Future-(Reynold Xin, Databri...
Spark Summit
•
5K views
Integrating Apache NiFi and Apache Flink by Hortonworks
Integrating Apache NiFi and Apache Flink
Hortonworks
•
13.5K views
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da... by Andrew Lamb
2022-06-23 Apache Arrow and DataFusion_ Changing the Game for implementing Da...
Andrew Lamb
•
179 views
VictoriaLogs: Open Source Log Management System - Preview by VictoriaMetrics
VictoriaLogs: Open Source Log Management System - Preview
VictoriaMetrics
•
2.2K views
Free Training: How to Build a Lakehouse by Databricks
Free Training: How to Build a Lakehouse
Databricks
•
3.3K views
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler by Databricks
Cloud-Native Apache Spark Scheduling with YuniKorn Scheduler
Databricks
•
693 views
Iceberg + Alluxio for Fast Data Analytics by Alluxio, Inc.
Iceberg + Alluxio for Fast Data Analytics
Alluxio, Inc.
•
528 views
A Deep Dive into Query Execution Engine of Spark SQL by Databricks
A Deep Dive into Query Execution Engine of Spark SQL
Databricks
•
6.6K views
Incremental View Maintenance with Coral, DBT, and Iceberg by Walaa Eldin Moustafa
Incremental View Maintenance with Coral, DBT, and Iceberg
Walaa Eldin Moustafa
•
597 views
Similar to High Performance Python on Apache Spark
Enabling Python to be a Better Big Data Citizen by
Enabling Python to be a Better Big Data Citizen
Wes McKinney
6K views
•
19 slides
Next-generation Python Big Data Tools, powered by Apache Arrow by
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
13K views
•
22 slides
Python Data Ecosystem: Thoughts on Building for the Future by
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
5.4K views
•
37 slides
Apache Arrow and Python: The latest by
Apache Arrow and Python: The latest
Wes McKinney
5.8K views
•
19 slides
An Incomplete Data Tools Landscape for Hackers in 2015 by
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
8.1K views
•
22 slides
PyData: The Next Generation by
PyData: The Next Generation
Wes McKinney
22.2K views
•
31 slides
Similar to High Performance Python on Apache Spark
(20)
Enabling Python to be a Better Big Data Citizen by Wes McKinney
Enabling Python to be a Better Big Data Citizen
Wes McKinney
•
6K views
Next-generation Python Big Data Tools, powered by Apache Arrow by Wes McKinney
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney
•
13K views
Python Data Ecosystem: Thoughts on Building for the Future by Wes McKinney
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney
•
5.4K views
Apache Arrow and Python: The latest by Wes McKinney
Apache Arrow and Python: The latest
Wes McKinney
•
5.8K views
An Incomplete Data Tools Landscape for Hackers in 2015 by Wes McKinney
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney
•
8.1K views
PyData: The Next Generation by Wes McKinney
PyData: The Next Generation
Wes McKinney
•
22.2K views
Improving data interoperability in Python and R by Wes McKinney
Improving data interoperability in Python and R
Wes McKinney
•
2.6K views
Improving Data Interoperability for Python and R by Work-Bench
Improving Data Interoperability for Python and R
Work-Bench
•
10.3K views
PyData: The Next Generation | Data Day Texas 2015 by Cloudera, Inc.
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.
•
1.8K views
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016... by Hadoop / Spark Conference Japan
Apache Kudu Fast Analytics on Fast Data (Hadoop / Spark Conference Japan 2016...
Hadoop / Spark Conference Japan
•
2.5K views
PyData Boston 2013 by Travis Oliphant
PyData Boston 2013
Travis Oliphant
•
3.9K views
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl... by Codemotion
Luciano Resende - Scaling Big Data Interactive Workloads across Kubernetes Cl...
Codemotion
•
103 views
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data by Hakka Labs
DatEngConf SF16 - Apache Kudu: Fast Analytics on Fast Data
Hakka Labs
•
978 views
IoT Edge Data Processing with NVidia Jetson Nano oct 3 2019 by Timothy Spann
IoT Edge Data Processing with NVidia Jetson Nano oct 3 2019
Timothy Spann
•
1.5K views
TensorFlow meetup: Keras - Pytorch - TensorFlow.js by Stijn Decubber
TensorFlow meetup: Keras - Pytorch - TensorFlow.js
Stijn Decubber
•
570 views
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney by Hakka Labs
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs
•
1.1K views
Building a Hadoop Data Warehouse with Impala by huguk
Building a Hadoop Data Warehouse with Impala
huguk
•
2K views
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage by Alluxio, Inc.
Accelerate and Scale Big Data Analytics with Disaggregated Compute and Storage
Alluxio, Inc.
•
66 views
Building a Hadoop Data Warehouse with Impala by Swiss Big Data User Group
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group
•
7.3K views
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ... by Cloudera, Inc.
Apache Kudu (Incubating): New Hadoop Storage for Fast Analytics on Fast Data ...
Cloudera, Inc.
•
5.1K views
More from Wes McKinney
Solving Enterprise Data Challenges with Apache Arrow by
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
1.1K views
•
31 slides
Apache Arrow: High Performance Columnar Data Framework by
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
1.4K views
•
53 slides
New Directions for Apache Arrow by
New Directions for Apache Arrow
Wes McKinney
1.9K views
•
27 slides
Apache Arrow Flight: A New Gold Standard for Data Transport by
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
2.2K views
•
31 slides
ACM TechTalks : Apache Arrow and the Future of Data Frames by
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
2K views
•
47 slides
Apache Arrow: Present and Future @ ScaledML 2020 by
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
970 views
•
36 slides
More from Wes McKinney
(20)
Solving Enterprise Data Challenges with Apache Arrow by Wes McKinney
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
•
1.1K views
Apache Arrow: High Performance Columnar Data Framework by Wes McKinney
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
•
1.4K views
New Directions for Apache Arrow by Wes McKinney
New Directions for Apache Arrow
Wes McKinney
•
1.9K views
Apache Arrow Flight: A New Gold Standard for Data Transport by Wes McKinney
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
•
2.2K views
ACM TechTalks : Apache Arrow and the Future of Data Frames by Wes McKinney
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
•
2K views
Apache Arrow: Present and Future @ ScaledML 2020 by Wes McKinney
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
•
970 views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by Wes McKinney
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
•
2.1K views
Apache Arrow: Leveling Up the Analytics Stack by Wes McKinney
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
•
1.4K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session by Wes McKinney
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
•
2.5K views
Apache Arrow: Leveling Up the Data Science Stack by Wes McKinney
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
•
3.5K views
Ursa Labs and Apache Arrow in 2019 by Wes McKinney
Ursa Labs and Apache Arrow in 2019
Wes McKinney
•
4.2K views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" by Wes McKinney
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
•
1.1K views
Apache Arrow at DataEngConf Barcelona 2018 by Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
•
2K views
Apache Arrow: Cross-language Development Platform for In-memory Data by Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
•
6.6K views
Apache Arrow -- Cross-language development platform for in-memory data by Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
•
2.9K views
Shared Infrastructure for Data Science by Wes McKinney
Shared Infrastructure for Data Science
Wes McKinney
•
8.5K views
Data Science Without Borders (JupyterCon 2017) by Wes McKinney
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
•
6.2K views
Memory Interoperability in Analytics and Machine Learning by Wes McKinney
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
•
5.6K views
Raising the Tides: Open Source Analytics for Data Science by Wes McKinney
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
•
3.2K views
Python Data Wrangling: Preparing for the Future by Wes McKinney
Python Data Wrangling: Preparing for the Future
Wes McKinney
•
12.5K views
Recently uploaded
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson
66 views
•
32 slides
AMAZON PRODUCT RESEARCH.pdf by
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta
19 views
•
13 slides
1st parposal presentation.pptx by
1st parposal presentation.pptx
i238212
9 views
•
3 slides
Piloting & Scaling Successfully With Microsoft Viva by
Piloting & Scaling Successfully With Microsoft Viva
Richard Harbridge
12 views
•
160 slides
The Research Portal of Catalonia: Growing more (information) & more (services) by
The Research Portal of Catalonia: Growing more (information) & more (services)
CSUC - Consorci de Serveis Universitaris de Catalunya
79 views
•
25 slides
Melek BEN MAHMOUD.pdf by
Melek BEN MAHMOUD.pdf
MelekBenMahmoud
14 views
•
1 slide
Recently uploaded
(20)
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N... by James Anderson
GDG Cloud Southlake 28 Brad Taylor and Shawn Augenstein Old Problems in the N...
James Anderson
•
66 views
AMAZON PRODUCT RESEARCH.pdf by JerikkLaureta
AMAZON PRODUCT RESEARCH.pdf
JerikkLaureta
•
19 views
1st parposal presentation.pptx by i238212
1st parposal presentation.pptx
i238212
•
9 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft Viva
Richard Harbridge
•
12 views
The Research Portal of Catalonia: Growing more (information) & more (services) by CSUC - Consorci de Serveis Universitaris de Catalunya
The Research Portal of Catalonia: Growing more (information) & more (services)
CSUC - Consorci de Serveis Universitaris de Catalunya
•
79 views
Melek BEN MAHMOUD.pdf by MelekBenMahmoud
Melek BEN MAHMOUD.pdf
MelekBenMahmoud
•
14 views
ChatGPT and AI for Web Developers by Maximiliano Firtman
ChatGPT and AI for Web Developers
Maximiliano Firtman
•
187 views
Scaling Knowledge Graph Architectures with AI by Enterprise Knowledge
Scaling Knowledge Graph Architectures with AI
Enterprise Knowledge
•
28 views
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10
•
237 views
Report 2030 Digital Decade by Massimo Talia
Report 2030 Digital Decade
Massimo Talia
•
15 views
Data-centric AI and the convergence of data and model engineering:opportunit... by Paolo Missier
Data-centric AI and the convergence of data and model engineering:opportunit...
Paolo Missier
•
39 views
20231123_Camunda Meetup Vienna.pdf by Phactum Softwareentwicklung GmbH
20231123_Camunda Meetup Vienna.pdf
Phactum Softwareentwicklung GmbH
•
33 views
Attacking IoT Devices from a Web Perspective - Linux Day by Simone Onofri
Attacking IoT Devices from a Web Perspective - Linux Day
Simone Onofri
•
15 views
Tunable Laser (1).pptx by Hajira Mahmood
Tunable Laser (1).pptx
Hajira Mahmood
•
24 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker
•
33 views
From chaos to control: Managing migrations and Microsoft 365 with ShareGate! by sammart93
From chaos to control: Managing migrations and Microsoft 365 with ShareGate!
sammart93
•
9 views
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdf
Aitana
•
16 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software
•
257 views
Perth MeetUp November 2023 by Michael Price
Perth MeetUp November 2023
Michael Price
•
19 views
Spesifikasi Lengkap ASUS Vivobook Go 14 by Dot Semarang
Spesifikasi Lengkap ASUS Vivobook Go 14
Dot Semarang
•
37 views
High Performance Python on Apache Spark
1.
1 © Cloudera,
Inc. All rights reserved. High Performance Python on Apache Spark Wes McKinney @wesmckinn Spark Summit West -‐-‐ June 7, 2016
2.
2 © Cloudera,
Inc. All rights reserved. Me • Data Science Tools at Cloudera • Serial creator of structured data tools / user interfaces • Wrote bestseller Python for Data Analysis 2012 • Working on expanded and revised 2nd edi-on, coming 2017 • Open source projects • Python {pandas, Ibis, statsmodels} • Apache {Arrow, Parquet, Kudu (incubaUng)} • Focused on C++, Python, and Hybrid projects
3.
3 © Cloudera,
Inc. All rights reserved. Agenda • Why care about Python? • What does “high performance Python” even mean? • A modern approach to Python data so]ware • Spark and Python: performance analysis and development direcUons
4.
4 © Cloudera,
Inc. All rights reserved. • Accessible, “swiss army knife” programming language • Highly producUve for so]ware engineering and data science alike • Has excelled as the agile “orchestraUon” or “glue” layer for applicaUon business logic • Easy to interface with C / C++ / Fortran code. Well-‐designed Python C API Why care about (C)Python?
5.
5 © Cloudera,
Inc. All rights reserved. Defining “High Performance Python” • The end-‐user workflow involves primarily Python programming; programs can be invoked with “python app_entry_point.py ...” • The so]ware uses system resources within an acceptable factor of an equivalent program developed completely in Java or C++ • Preferably 1-‐5x slower, not 20-‐50x • The so]ware is suitable for interacUve / exploratory compuUng on modestly large data sets (= gigabytes) on a single node
6.
6 © Cloudera,
Inc. All rights reserved. Building fast Python so]ware means embracing certain limitaUons
7.
7 © Cloudera,
Inc. All rights reserved. Having a healthy relaUonship with the interpreter • The Python interpreter itself is “slow”, as compared with hand-‐coded C or Java • Each line of Python code may feature mulUple internal C API calls, temporary data structures, etc. • Python built-‐in data structures (numbers, strings, tuples, lists, dicts, etc.) have significant memory and performance use overhead • Threads performing concurrent CPU or IO work must take care not to block other threads
8.
8 © Cloudera,
Inc. All rights reserved. Mantras for great success • Key quesUon 1: Am I making the Python interpreter do a lot of work? • Key quesUon 2: Am I blocking other interpreted code from execu-ng? • Key quesUon 3: Am I handling data (memory) in a “good” way?
9.
9 © Cloudera,
Inc. All rights reserved. Toy example: interpreted vs. compiled code
10.
10 © Cloudera,
Inc. All rights reserved. Toy example: interpreted vs. compiled code Cython: 78x faster than interpreted
11.
11 © Cloudera,
Inc. All rights reserved. Toy example: interpreted vs. compiled code NumPy Creating a full 80MB temporary array + PyArray_Sum is only 35% slower than a fully inlined Cython ( C ) function Interesting: ndarray.sum by itself is almost 2x faster than the hand-coded Cython function...
12.
12 © Cloudera,
Inc. All rights reserved. Submarines and Icebergs: metaphors for fast Python so]ware
13.
13 © Cloudera,
Inc. All rights reserved. SubopUmal control flow Elsewhere Data Python code Python data structures Pure Python computation Python data structures Pure Python computation Python data structures Data Deserialization Serialization Time for a coffee break...
14.
14 © Cloudera,
Inc. All rights reserved. Beler control flow Extension code (C / C++) Native data Python code C Func Native data Native data C Func Python app logic Python app logic Users only see this! Zoom zoom! (if the extension code is good)
15.
15 © Cloudera,
Inc. All rights reserved. But it’s much easier to write 100% Python! • Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process • (but it’s o]en worth it) • See: Cython, SWIG, Boost.Python, Pybind11, and other “hybrid” so]ware creaUon tools • BONUS: Python programs can orchestrate mulU-‐threaded / concurrent systems wrilen in C/C++ (no Python C API needed) • The GIL only comes in when you need to “bubble up” data or control flow (e.g. Python callbacks) into the Python interpreter
16.
16 © Cloudera,
Inc. All rights reserved. A story of reading a CSV file f = get_stream(...) df = pandas.read_csv(f, **csv_options) while more_data(): buffer = f.read() parse_bytes(buffer) df = type_infer_columns() internally, pseudocode Concerns Uses PyString_FromStringAndSize, must hold GIL for this Synchronous or asynchronous with IO? Type infer in parallel? Data structures used?
17.
17 © Cloudera,
Inc. All rights reserved. It’s All About the Benjamins (Data Structures) • The hard currency of data so]ware is: in-‐memory data structures • How costly are they to send and receive? • How costly to manipulate and munge in-‐memory? • How difficult is it to add new proprietary computaUon logic? • In Python: NumPy established a gold standard for interoperable array data • pandas is built on NumPy, and made it easy to “plug in” to the ecosystem • (but there are plenty of warts sUll)
18.
18 © Cloudera,
Inc. All rights reserved. What’s this have to do with Spark? • Some known performance issues in PySpark • IO throughput • Python to Spark • Spark to Python (or Python extension code) • Running interpreted Python code on RDDs / Spark DataFrames • Lambda mappers / reducers (rdd.map(...)) • Spark SQL UDFs (registerFuncUon(...))
19.
19 © Cloudera,
Inc. All rights reserved. Spark IO throughput to/from Python 1.15 MB/s in 9.82 MB/s out Spark 1.6.1 running on localhost 76 MB pandas.DataFrame
20.
20 © Cloudera,
Inc. All rights reserved. Spark IO throughput to/from Python Unofficial improved toPandas 25.6 MB/s out
21.
21 © Cloudera,
Inc. All rights reserved. Compared with HiveServer2 Thri] RPC fetch Impala 2.5 + Parquet file on localhost ibis + impyla 41.46 MB/s read hs2client (C++ / Python) 90.8 MB/s Task benchmarked: Thrift TFetchResultsReq + deserialization + conversion to pandas.DataFrame
22.
22 © Cloudera,
Inc. All rights reserved. Back of envelope comp w/ file formats Feather: 1105 MB/s write CSV (pandas): 6.2 MB/s write Feather: 2414 MB/s read CSV (pandas): 51.9 MB/s read disclaimer: warm NVMe / OS file cache
23.
23 © Cloudera,
Inc. All rights reserved. Aside: CSVs can be fast See: https://github.com/wiseio/paratext
24.
24 © Cloudera,
Inc. All rights reserved. How Python lambdas work in PySpark Spark RDD Python task worker pool Python worker Python worker Python worker Python worker Python worker Data stream + pickled PyFunction See: spark/api/python/PythonRDD.scala python/pyspark/worker.py The inner loop of RDD.map map(f, iterator)
25.
25 © Cloudera,
Inc. All rights reserved. How Python lambdas perform NumPy array-oriented operations are about 100x faster… but that’s not the whole story Disclaimer: this isn’t a remotely “fair” comparison, but it helps illustrate the real pitfalls associated with introducing serialization and RPC/IPC into a computational process
26.
26 © Cloudera,
Inc. All rights reserved. How Python lambdas perform 8 cores 1 core Lessons learned: Python data analytics should not be based around scalar object iteration
27.
27 © Cloudera,
Inc. All rights reserved. Asides / counterpoints • Spark<-‐>Python IO may not be important -‐-‐ can leave all of the data remote • Spark DataFrame operaUons have reduced the need for many types of Lambda funcUons • Can use binary file formats as an alternate IO interface • Parquet (Python support soon via apache/parquet-‐cpp) • Avro (see cavro, fastavro, pyavroc) • ORC (needs a Python champion) • ...
28.
28 © Cloudera,
Inc. All rights reserved. Apache Arrow http://arrow.apache.org Some slides from Strata-HW talk w/ Jacques Nadeau
29.
29 © Cloudera,
Inc. All rights reserved. Apache Arrow in a Slide • New Top-‐level Apache So]ware FoundaUon project • hlp://arrow.apache.org • Focused on Columnar In-‐Memory AnalyUcs 1. 10-‐100x speedup on many workloads 2. Common data layer enables companies to choose best of breed systems 3. Designed to work with any programming language 4. Support for both relaUonal and complex data as-‐is • Oriented at collaboraUon amongst other OSS projects Calcite Cassandra Deeplearning4j Drill Hadoop HBase Ibis Impala Kudu Pandas Parquet Phoenix Spark Storm R
30.
30 © Cloudera,
Inc. All rights reserved. High Performance Sharing & Interchange Today With Arrow • Each system has its own internal memory format • 70-80% CPU wasted on serialization and deserialization • Similar functionality implemented in multiple projects • All systems utilize the same memory format • No overhead for cross-system communication • Projects can share functionality (eg, Parquet-to-Arrow reader)
31.
31 © Cloudera,
Inc. All rights reserved. Arrow and PySpark • Build a C API level data protocol to move data between Spark and Python • Either • (Fast) Convert Arrow to/from pandas.DataFrame • (Faster) Perform naUve analyUcs on Arrow data in-‐memory • Use Arrow • For efficiently handling nested Spark SQL data in-‐memory • IO: pandas/NumPy data push/pull • Lambda/UDF evaluaUon
32.
32 © Cloudera,
Inc. All rights reserved. • Problem: fast, language-‐ agnosUc binary data frame file format • Creators: Wes McKinney (Python) and Hadley Wickham (R) • Read speeds close to disk IO performance Arrow in acUon: Feather File Format for Python and R
33.
33 © Cloudera,
Inc. All rights reserved. More on Feather array 0 array 1 array 2 ... array n - 1 METADATA Feather File libfeather C++ library Rcpp Cython R data.frame pandas DataFrame
34.
34 © Cloudera,
Inc. All rights reserved. Summary • It’s essenUal to improve Spark’s low-‐level data interoperability with the Python data ecosystem • I’m personally excited to work with the Spark + Arrow + PyData + other communiUes to help make this a reality
35.
35 © Cloudera,
Inc. All rights reserved. Thank you Wes McKinney @wesmckinn Views are my own