SlideShare a Scribd company logo
Wes McKinney @wesmckinn
SHARED INFRASTRUCTURE FOR
DATA SCIENCE
WES MCKINNEY @WESMCKINN
Rice Data Science Conference | October 2017
ME
2
I M P O R TA N T L E G A L I N F O R M AT I O N
• The information presented here is offered for informational purposes only and should not be used for any other purpose
(including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes
only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any
offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of
Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at
any time.
• Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such
copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for
identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright
or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma,
nor vice versa.
• Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
Wes McKinney @wesmckinn 3
THINKING ON THE LAST 10 YEARS
4
2007 2017
CLOSED SOURCE OPEN SOURCE
5
Shared front-ends
for data science
THE NEXT 10 YEARS AND BEYOND
7
2017 2027 …
THE AI ARMS RACE
Wes McKinney @wesmckinn 8
CHANGING HARDWARE LANDSCAPE
DISK
PROCESSIN
G
MEMORY
9
T
DATA SCIENCE “LANGUAGE “SILOS”
FRONT-END
PYTHON R JVM JULIA …
10
WHAT’S IN A
SILO?
STORAGE /
DATA ACCESS
DATA
STRUCTURES /
IN-MEMORY
FORMATS
GENERAL
COMPUTE
ENGINE(S)
ADVANCED
ANALYTICS
11
WHAT’S IN A
SILO?
STORAGE /
DATA ACCESS
DATA
STRUCTURES /
IN-MEMORY
FORMATS
GENERAL
COMPUTE
ENGINE(S)
ADVANCED
ANALYTICS
pandas NumPy
pandas
NumPy
pandas
scikit-learn
12
RENOVATING PANDAS
Wes McKinney @wesmckinn 13
27
T
MAKING THE SILOS “SMALLER”
FRONT-END
PYTHON R JVM JULIA
?
…
14
PROGRAMMING LANGUAGES
AS USER INTERFACES
15
GRAPHIC: Iceberg under sea (only top
part visible to naked eye)
T
df <- read_csv(…)
df % group_by(…) % summarise(…)
df = read_csv(…)
df.groupby(…).aggregate(…)
PYTHON
R
SAME ANALYSIS, DIFFERENT
IMPLEMENTATION
17
T
A SHARED RUNTIME FOR DATA SCIENCE
FRONT-END
PYTHON R JVM JULIA
SHARED DATA SCIENCE RUNTIME
…
18
FROM IDEA TO ACTION
19
T
PART 1: STANDARD IN-MEMORY FORMAT
R
PYTHON
JVM
PORTABLE DATA
FRAME
Non-Portable Data Frames
20…
T
PART 2: ZERO COPY INTERCHANGE
RPYTHON JVM
SHARED MEMORY + STANDARD MEMORY FORMATS
…
21
T
PART 3: HIGH PERFORMANCE DATA
ACCESS
BINARY
COLUMNAR
CSV
SQL
PORTABLE
DATA FRAME
Storage Formats/ Databases
… 22
T
PART 4: FLEXIBLE COMPUTATION ENGINE
• Zero-overhead User-defined Functions
• Portable Operator “Graphs”
• “Embeddable” in Larger Systems
23
APACHE ARROW
Language-agnostic Data Frame Format
Zero-Copy Interchange
24
24
Without Arrow With Arrow
Simple, fast data interchange
24
• Cache-efficient columnar memory: optimized for CPU affinity and
SIMD / parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata,
batch/file-based and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, Javascript in
various stages of development
Big picture Arrow goals
T
BUILDING THE ARROW FORMAT
• “Superset” of representations supported by
R, pandas, SQL engines
• Optimized for CPU cache affinity
• ASF Governance: Open + Transparent
Community Project
25
FEATHER: MINIMALIST ARROW ON DISK
Some Arrow OSS Users
Feather Format
Ray Project
27
FROM ARROW TO PANDAS2
28
Logical Operator Graphs
27
(a + b).log()
Log Add
a
b
Terminology
27
• Kernel functions: atomic units of
computation
• Operator nodes: input/output types,
operator parallelism properties
Parallel Execution of Operator Graphs
27
a b
ADD LOG
tmp out
Some Optimization strategies
27
• Multicore scheduling
• Elimination of temporaries
• Operator fusion / pipelinng
A
28
Arrow-optimized data connectors
Arrow in-memory format
Logical Data Frame Expression Graphs
Parallel Dataflow Execution Engine
Python user API, DataFrame semantics,
User-defined functions
pandas2
Apache Arrow
BUILDING THE FUTURE
28
Wes McKinney @wesmckinn
THANK YOU
WES MCKINNEY @WESMCKINN
Apache Arrow: http://arrow.apache.org

More Related Content

What's hot

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
Adam Doyle
 
Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
Data Science Thailand
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
Guido Schmutz
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Databricks
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
Flink Forward
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
Amazon Web Services
 
Hyperion Essbase - Ravi Kurakula
Hyperion Essbase   -   Ravi KurakulaHyperion Essbase   -   Ravi Kurakula
Hyperion Essbase - Ravi Kurakula
Ravi kurakula
 
Using Pentaho with MariaDB ColumnStore
Using Pentaho with MariaDB ColumnStoreUsing Pentaho with MariaDB ColumnStore
Using Pentaho with MariaDB ColumnStore
MariaDB plc
 
Oracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API ExamplesOracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API Examples
Bobby Curtis
 
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
Amazon Web Services
 
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
Phil Wilkins
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
Julian Hyde
 
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
Sandesh Rao
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
Amazon Web Services
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
Databricks
 
Data Warehouse Concepts and Architecture
Data Warehouse Concepts and ArchitectureData Warehouse Concepts and Architecture
Data Warehouse Concepts and Architecture
Mohd Tousif
 
Azure Data Engineering.pptx
Azure Data Engineering.pptxAzure Data Engineering.pptx
Azure Data Engineering.pptx
priyadharshini626440
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
James Serra
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
Amazon Web Services
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
Mark Kromer
 

What's hot (20)

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Data Lake,beyond the Data Warehouse
Data Lake,beyond the Data WarehouseData Lake,beyond the Data Warehouse
Data Lake,beyond the Data Warehouse
 
Spark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka StreamsSpark (Structured) Streaming vs. Kafka Streams
Spark (Structured) Streaming vs. Kafka Streams
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Dongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of FlinkDongwon Kim – A Comparative Performance Evaluation of Flink
Dongwon Kim – A Comparative Performance Evaluation of Flink
 
Building Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS GlueBuilding Serverless ETL Pipelines with AWS Glue
Building Serverless ETL Pipelines with AWS Glue
 
Hyperion Essbase - Ravi Kurakula
Hyperion Essbase   -   Ravi KurakulaHyperion Essbase   -   Ravi Kurakula
Hyperion Essbase - Ravi Kurakula
 
Using Pentaho with MariaDB ColumnStore
Using Pentaho with MariaDB ColumnStoreUsing Pentaho with MariaDB ColumnStore
Using Pentaho with MariaDB ColumnStore
 
Oracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API ExamplesOracle GoldenGate 18c - REST API Examples
Oracle GoldenGate 18c - REST API Examples
 
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
What's New with Amazon Redshift ft. McDonald's (ANT350-R1) - AWS re:Invent 2018
 
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
Look at Oracle Integration Cloud – its relationship to ICS. Customer use Case...
 
Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!Don’t optimize my queries, optimize my data!
Don’t optimize my queries, optimize my data!
 
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
AIOUG : OTNYathra - Troubleshooting and Diagnosing Oracle Database 12.2 and O...
 
Amazon Redshift
Amazon Redshift Amazon Redshift
Amazon Redshift
 
Intro to Delta Lake
Intro to Delta LakeIntro to Delta Lake
Intro to Delta Lake
 
Data Warehouse Concepts and Architecture
Data Warehouse Concepts and ArchitectureData Warehouse Concepts and Architecture
Data Warehouse Concepts and Architecture
 
Azure Data Engineering.pptx
Azure Data Engineering.pptxAzure Data Engineering.pptx
Azure Data Engineering.pptx
 
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solutionDifferentiate Big Data vs Data Warehouse use cases for a cloud solution
Differentiate Big Data vs Data Warehouse use cases for a cloud solution
 
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
RDS Postgres and Aurora Postgres | AWS Public Sector Summit 2017
 
Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22Mapping Data Flows Training deck Q1 CY22
Mapping Data Flows Training deck Q1 CY22
 

Similar to Shared Infrastructure for Data Science

Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Got Big Data? Get OpenSplice!
Got Big Data? Get OpenSplice!Got Big Data? Get OpenSplice!
Got Big Data? Get OpenSplice!
Angelo Corsaro
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney
 
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
scoopnewsgroup
 
5 Tips to Building a Successful Big Data Strategy
5 Tips to Building a Successful Big Data Strategy5 Tips to Building a Successful Big Data Strategy
5 Tips to Building a Successful Big Data Strategy
Western Digital
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017
Ray Bugg
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Spark Summit
 
Big Data Mining Keynote presentation Sept 2013 09012013
Big Data Mining Keynote presentation Sept 2013 09012013Big Data Mining Keynote presentation Sept 2013 09012013
Big Data Mining Keynote presentation Sept 2013 09012013Julio Da Silva
 
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
Quantopian
 
District Office of Info and KM - Proposed - by Joel Magnussen - 2004
District Office of Info and KM - Proposed - by Joel Magnussen - 2004District Office of Info and KM - Proposed - by Joel Magnussen - 2004
District Office of Info and KM - Proposed - by Joel Magnussen - 2004
Peter Stinson
 
Building ML platforms in Financial Services with serverless technology - FSV2...
Building ML platforms in Financial Services with serverless technology - FSV2...Building ML platforms in Financial Services with serverless technology - FSV2...
Building ML platforms in Financial Services with serverless technology - FSV2...
Amazon Web Services
 
GPSBUS201-GPS Demystifying Artificial Intelligence
GPSBUS201-GPS Demystifying Artificial IntelligenceGPSBUS201-GPS Demystifying Artificial Intelligence
GPSBUS201-GPS Demystifying Artificial Intelligence
Amazon Web Services
 
Taking Complexity Out of Data Science with AWS and Zoomdata PPT
Taking Complexity Out of Data Science with AWS and Zoomdata PPTTaking Complexity Out of Data Science with AWS and Zoomdata PPT
Taking Complexity Out of Data Science with AWS and Zoomdata PPT
Amazon Web Services
 
Watson data platform_sofia_20171017
Watson data platform_sofia_20171017Watson data platform_sofia_20171017
Watson data platform_sofia_20171017
Mladen Jovanovski
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013
nkabra
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
Adam Muise
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
Adam Muise
 
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
BCS Data Management Specialist Group
 

Similar to Shared Infrastructure for Data Science (20)

Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Got Big Data? Get OpenSplice!
Got Big Data? Get OpenSplice!Got Big Data? Get OpenSplice!
Got Big Data? Get OpenSplice!
 
Raising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
 
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
Digital Innovation Trends in Government Blockchain Machine Learning and Inter...
 
5 Tips to Building a Successful Big Data Strategy
5 Tips to Building a Successful Big Data Strategy5 Tips to Building a Successful Big Data Strategy
5 Tips to Building a Successful Big Data Strategy
 
Big Data Scotland 2017
Big Data Scotland 2017Big Data Scotland 2017
Big Data Scotland 2017
 
Improving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
 
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
Improving Python and Spark Performance and Interoperability: Spark Summit Eas...
 
Big Data Mining Keynote presentation Sept 2013 09012013
Big Data Mining Keynote presentation Sept 2013 09012013Big Data Mining Keynote presentation Sept 2013 09012013
Big Data Mining Keynote presentation Sept 2013 09012013
 
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
Empowering Quants in the Data Economy by Napoleon Hernandez at QuantCon 2016
 
District Office of Info and KM - Proposed - by Joel Magnussen - 2004
District Office of Info and KM - Proposed - by Joel Magnussen - 2004District Office of Info and KM - Proposed - by Joel Magnussen - 2004
District Office of Info and KM - Proposed - by Joel Magnussen - 2004
 
Building ML platforms in Financial Services with serverless technology - FSV2...
Building ML platforms in Financial Services with serverless technology - FSV2...Building ML platforms in Financial Services with serverless technology - FSV2...
Building ML platforms in Financial Services with serverless technology - FSV2...
 
GPSBUS201-GPS Demystifying Artificial Intelligence
GPSBUS201-GPS Demystifying Artificial IntelligenceGPSBUS201-GPS Demystifying Artificial Intelligence
GPSBUS201-GPS Demystifying Artificial Intelligence
 
Taking Complexity Out of Data Science with AWS and Zoomdata PPT
Taking Complexity Out of Data Science with AWS and Zoomdata PPTTaking Complexity Out of Data Science with AWS and Zoomdata PPT
Taking Complexity Out of Data Science with AWS and Zoomdata PPT
 
Watson data platform_sofia_20171017
Watson data platform_sofia_20171017Watson data platform_sofia_20171017
Watson data platform_sofia_20171017
 
Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013Big data in marketing at harvard business club nick1 june 15 2013
Big data in marketing at harvard business club nick1 june 15 2013
 
What is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMACWhat is Hadoop? Nov 20 2013 - IRMAC
What is Hadoop? Nov 20 2013 - IRMAC
 
What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013What is Hadoop? Oct 17 2013
What is Hadoop? Oct 17 2013
 
Big data
Big dataBig data
Big data
 
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
Horse meat or beef? (3) D Murphy, National Grid, 21/3/13
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 
Python Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
 
PyCon APAC 2016 Keynote
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
 

Recently uploaded

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
Paul Groth
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
Dorra BARTAGUIZ
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Albert Hoitingh
 

Recently uploaded (20)

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
 
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMsTo Graph or Not to Graph Knowledge Graph Architectures and LLMs
To Graph or Not to Graph Knowledge Graph Architectures and LLMs
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
Elevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object CalisthenicsElevating Tactical DDD Patterns Through Object Calisthenics
Elevating Tactical DDD Patterns Through Object Calisthenics
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
Encryption in Microsoft 365 - ExpertsLive Netherlands 2024
 

Shared Infrastructure for Data Science

  • 1. Wes McKinney @wesmckinn SHARED INFRASTRUCTURE FOR DATA SCIENCE WES MCKINNEY @WESMCKINN Rice Data Science Conference | October 2017
  • 3. I M P O R TA N T L E G A L I N F O R M AT I O N • The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. • Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. • Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved Wes McKinney @wesmckinn 3
  • 4. THINKING ON THE LAST 10 YEARS 4 2007 2017
  • 7. THE NEXT 10 YEARS AND BEYOND 7 2017 2027 …
  • 8. THE AI ARMS RACE Wes McKinney @wesmckinn 8
  • 10. T DATA SCIENCE “LANGUAGE “SILOS” FRONT-END PYTHON R JVM JULIA … 10
  • 11. WHAT’S IN A SILO? STORAGE / DATA ACCESS DATA STRUCTURES / IN-MEMORY FORMATS GENERAL COMPUTE ENGINE(S) ADVANCED ANALYTICS 11
  • 12. WHAT’S IN A SILO? STORAGE / DATA ACCESS DATA STRUCTURES / IN-MEMORY FORMATS GENERAL COMPUTE ENGINE(S) ADVANCED ANALYTICS pandas NumPy pandas NumPy pandas scikit-learn 12
  • 14. 27
  • 15. T MAKING THE SILOS “SMALLER” FRONT-END PYTHON R JVM JULIA ? … 14
  • 17. GRAPHIC: Iceberg under sea (only top part visible to naked eye)
  • 18. T df <- read_csv(…) df % group_by(…) % summarise(…) df = read_csv(…) df.groupby(…).aggregate(…) PYTHON R SAME ANALYSIS, DIFFERENT IMPLEMENTATION 17
  • 19. T A SHARED RUNTIME FOR DATA SCIENCE FRONT-END PYTHON R JVM JULIA SHARED DATA SCIENCE RUNTIME … 18
  • 20. FROM IDEA TO ACTION 19
  • 21. T PART 1: STANDARD IN-MEMORY FORMAT R PYTHON JVM PORTABLE DATA FRAME Non-Portable Data Frames 20…
  • 22. T PART 2: ZERO COPY INTERCHANGE RPYTHON JVM SHARED MEMORY + STANDARD MEMORY FORMATS … 21
  • 23. T PART 3: HIGH PERFORMANCE DATA ACCESS BINARY COLUMNAR CSV SQL PORTABLE DATA FRAME Storage Formats/ Databases … 22
  • 24. T PART 4: FLEXIBLE COMPUTATION ENGINE • Zero-overhead User-defined Functions • Portable Operator “Graphs” • “Embeddable” in Larger Systems 23
  • 25. APACHE ARROW Language-agnostic Data Frame Format Zero-Copy Interchange 24
  • 26. 24 Without Arrow With Arrow Simple, fast data interchange
  • 27. 24 • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, Javascript in various stages of development Big picture Arrow goals
  • 28. T BUILDING THE ARROW FORMAT • “Superset” of representations supported by R, pandas, SQL engines • Optimized for CPU cache affinity • ASF Governance: Open + Transparent Community Project 25
  • 30. Some Arrow OSS Users Feather Format Ray Project 27
  • 31. FROM ARROW TO PANDAS2 28
  • 32. Logical Operator Graphs 27 (a + b).log() Log Add a b
  • 33. Terminology 27 • Kernel functions: atomic units of computation • Operator nodes: input/output types, operator parallelism properties
  • 34. Parallel Execution of Operator Graphs 27 a b ADD LOG tmp out
  • 35. Some Optimization strategies 27 • Multicore scheduling • Elimination of temporaries • Operator fusion / pipelinng
  • 36. A 28 Arrow-optimized data connectors Arrow in-memory format Logical Data Frame Expression Graphs Parallel Dataflow Execution Engine Python user API, DataFrame semantics, User-defined functions pandas2 Apache Arrow
  • 38.
  • 39. Wes McKinney @wesmckinn THANK YOU WES MCKINNEY @WESMCKINN Apache Arrow: http://arrow.apache.org