Memory Interoperability in Analytics and Machine Learning

Wes McKinney
Wes McKinneyDirector of Ursa Labs, Open Source Developer at Ursa Labs
www.twosigma.com
Memory Interoperability for
Analytics and Machine Learning
March 26, 2017All Rights Reserved
Wes McKinney @wesmckinn
ScaledML @ Stanford
March 25, 2017
Me
March 26, 2017
• Currently: Software Architect at Two Sigma Investments
• Creator of Python pandas project
• PMC member for Apache Arrow and Apache Parquet
• Author of Python for Data Analysis
• Other Python projects: Ibis, Feather, statsmodels
All Rights Reserved 2
Important Legal Information
March 26, 2017
The information presented here is offered for informational purposes only and should not be used for
any other purpose (including, without limitation, the making of investment decisions). Examples
provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing
herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest;
tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments,
LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any
time.
Some of the images, logos or other material used herein may be protected by copyright and/or
trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the
material and are used purely for identification and comment as fair use under international copyright
and/or trademark laws. Use of such image, copyright or trademark does not imply any association with
such organization (or endorsement of such organization) by Two Sigma, nor vice versa.
Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved
All Rights Reserved 3
This talk
4March 26, 2017
• Benefits of interoperable data and metadata
• Challenges to sharing memory between runtime environments
• Apache Arrow: Purpose and C++ architecture
• Opportunities for collaboration
• Example application: pandas 2.0
All Rights Reserved
Changing hardware landscape
March 26, 2017
• Intel has released first production 3D Xpoint SSD
• Reported 1000x faster than NAND, less expensive than RAM
• Convergence between RAM vs. shared memory / mmap performance
All Rights Reserved 5
Changing software landscape
March 26, 2017
• Next-gen ML / AI frameworks (TensorFlow, Torch, etc.)
• DIY open source architectures for machine learning in production
• Streaming / batch data processing pipelines
• Data cleaning and feature engineering
• Model fitting / scoring / serving
All Rights Reserved 6
“Zero-copy” memory interfaces
March 26, 2017
• Enables computational tools to process a dataset without any additional
serialization, or transfer to a different memory space
• Can do random access on a dataset that does not fit in RAM
• Another interpretation: reading a dataset is a metadata-only conversion
All Rights Reserved 7
Challenges to zero-copy memory sharing
March 26, 2017
• Cross-language issues
• Type metadata + logical types
• Byte/bit-level memory layout
• Language-specific issues
• In-memory data structures
• Memory allocation and sharing constructs
All Rights Reserved 8
What is pandas?
March 26, 2017
• Popular in-memory data manipulation tool for Python
• Focused on tabular datasets (“data frames”)
• Sprawling codebase spanning multiple areas
• IO for many data formats
• Array manipulations / data preparation
• OLAP-style analytics
• Internals implemented using NumPy array objects
All Rights Reserved 9
NumPy
March 26, 2017
• Tensor memory model ("ndarray") for numeric data
• Strided, homogeneously-typed, byte-addressable memory
• APL-inspired semantics
• Zero-copy construction from compatible memory layouts
• Computational tools support both strided and contiguous memory access
All Rights Reserved 10
pandas: Technical debt + Architectural issues
March 26, 2017
• Tensor library like NumPy awkward fit for pandas use cases
• Multidimensionality + strided memory access complicated algorithms
• Lack of built-in missing value support
• Weak on native string, variable length, or nested types
• pandas at core a “in-memory columnar” problem, similar to analytical SQL
engines
All Rights Reserved 11
Thesis: Tensors and Tables
March 26, 2017
• 2 data structures best suited for zero-copy sharing
• Tensors: N-dimensional, homogeneously-typed arrays
• Tables: Column-oriented, heterogeneously typed
• These data structures can be defined using common memory and metadata
primitives
All Rights Reserved 12
Observations
March 26, 2017
• A Tensor is semantically a multidimensional view of a 1D block of memory
• Writing computational code targeting arbitrary tensors is much more difficult
than 1D contiguous arrays
• Tensors of non-fixed size types (e.g. strings) occur less frequently
All Rights Reserved 13
Apache Arrow
March 26, 2017
• github.com/apache/arrow
• Collaboration amongst broad set of OSS projects around language-agnostic
shared data structures
• Initial focus
• In-memory columnar tables
• Canonical metadata
• Interoperability between JVM and native code (C/C++) ecosystem
All Rights Reserved 14
High performance data interchange
March 26, 2017All Rights Reserved
Today With Arrow
Source: Apache Arrow
15
What does Apache Arrow give you?
March 26, 2017
• Cache-efficient columnar memory: optimized for CPU affinity and SIMD /
parallel processing, O(1) random value access
• Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based
and streaming binary formats
• Complex schema support: Flat and nested data types
• Main implementations in C++ and Java: with integration tests
• Bindings / implementations for C, Python, Ruby, Javascript in various stages
of development
All Rights Reserved 16
Arrow in C++
March 26, 2017
• Reusable memory management and IO subsystem for native code applications
• Layered in multiple components
• Memory management
• Type metadata / schemas
• Array / Table containers
• IO interfaces
• Zero-copy IPC / messaging
All Rights Reserved 17
Arrow C++: Memory management
March 26, 2017
• arrow::Buffer
• RAII-based memory lifetime with std::shared_ptr<Buffer>
• arrow::MemoryMappedBuffer: for memory maps
• arrow::MemoryPool
• Abstract memory allocator for tracking all allocations
All Rights Reserved 18
Arrow C++: Type metadata
March 26, 2017
• arrow::DataType
• Base class for fixed size, variable size, and nested datatypes
• arrow::Field
• Type + name + additional metadata
• arrow::Schema
• Collection of fields
All Rights Reserved 19
Arrow C++: Array / Table containers
March 26, 2017
• arrow::Array
• 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc.
• Support for dictionary-encoded arrays
• arrow::RecordBatch
• Collection of equal-length arrays
• arrow::Column
• Logical table “column” as chunked array
• arrow::Table
• Collection of columns
All Rights Reserved 20
Arrow C++: IO interfaces
March 26, 2017
• arrow::{InputStream, OutputStream}
• arrow::RandomAccessFile
• Abstract file interface
• arrow::MemoryMappedFile
• Zero-copy reads to arrow::Buffer
• Specific implementations for OS files, HDFS, etc.
All Rights Reserved 21
Arrow C++: Messaging / IPC
March 26, 2017
• Metadata read/write using Google’s Flatbuffers library
• Encapsulated Message type
• Write record batches, read with zero-copy
• arrow::{FileWriter, FileReader}
• Random access / “batch” binary format
• arrow::{StreamWriter, StreamReader}
• Streaming binary format
All Rights Reserved 22
In development: arrow::Tensor
March 26, 2017
• Targeting interoperability with memory layouts as used in NumPy,
TensorFlow, Torch, or other standard tensor-based frameworks
• data: arrow::Buffer
• shape: dimension sizes
• strides: memory ordering
• Zero-copy reads using Arrow’s shared memory tools
• Support Tensor math libraries for C++ like xtensor
All Rights Reserved 23
Example use: Ray ML framework from Berkeley RISELab
March 26, 2017All Rights Reserved 24
Source: https://arxiv.org/abs/1703.03924
• Shared memory-based object
store
• Zero-copy tensor reads using
Arrow libraries
Example use: pandas 2.0
March 26, 2017
• In-planning rearchitecture of pandas’s internals
• libpandas — largely Python-agnostic C++11 library
• Decoupling pandas data structures from NumPy tensors
• Support analytics targeting native Arrow memory
• Multicore / parallel algorithms
• Leverage latest SIMD intrinsics
• Lazy-loading DataFrames from primary input formats
• CSV, JSON, HDF5, Apache Parquet
All Rights Reserved 25
Other examples
March 26, 2017
• Spark integration (SPARK-13534)
• Weld integration (ARROW-649)
All Rights Reserved 26
Thank you
March 26, 2017
• Building code and community around
• IO subsystems
• Metadata
• Data structures and in-memory formats
All Rights Reserved 27
1 of 27

Recommended

Python Data Wrangling: Preparing for the Future by
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
12.5K views27 slides
Python Data Ecosystem: Thoughts on Building for the Future by
Python Data Ecosystem: Thoughts on Building for the FuturePython Data Ecosystem: Thoughts on Building for the Future
Python Data Ecosystem: Thoughts on Building for the FutureWes McKinney
5.4K views37 slides
Data Science Languages and Industry Analytics by
Data Science Languages and Industry AnalyticsData Science Languages and Industry Analytics
Data Science Languages and Industry AnalyticsWes McKinney
5.5K views19 slides
Apache Arrow -- Cross-language development platform for in-memory data by
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
2.9K views23 slides
DataFrames: The Extended Cut by
DataFrames: The Extended CutDataFrames: The Extended Cut
DataFrames: The Extended CutWes McKinney
8.5K views34 slides
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" by
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"Wes McKinney
1.1K views32 slides

More Related Content

What's hot

Apache Arrow: Present and Future @ ScaledML 2020 by
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Wes McKinney
970 views36 slides
Next-generation Python Big Data Tools, powered by Apache Arrow by
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
13K views22 slides
An Incomplete Data Tools Landscape for Hackers in 2015 by
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
8.1K views22 slides
ACM TechTalks : Apache Arrow and the Future of Data Frames by
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
2K views47 slides
Data Analysis and Statistics in Python using pandas and statsmodels by
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodelsWes McKinney
19.8K views29 slides
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future Wes McKinney
2.1K views52 slides

What's hot(20)

Apache Arrow: Present and Future @ ScaledML 2020 by Wes McKinney
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney970 views
Next-generation Python Big Data Tools, powered by Apache Arrow by Wes McKinney
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache Arrow
Wes McKinney13K views
An Incomplete Data Tools Landscape for Hackers in 2015 by Wes McKinney
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015
Wes McKinney8.1K views
ACM TechTalks : Apache Arrow and the Future of Data Frames by Wes McKinney
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney2K views
Data Analysis and Statistics in Python using pandas and statsmodels by Wes McKinney
Data Analysis and Statistics in Python using pandas and statsmodelsData Analysis and Statistics in Python using pandas and statsmodels
Data Analysis and Statistics in Python using pandas and statsmodels
Wes McKinney19.8K views
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future by Wes McKinney
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney2.1K views
Apache Arrow: Cross-language Development Platform for In-memory Data by Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney6.6K views
PyData: The Next Generation by Wes McKinney
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next Generation
Wes McKinney22.2K views
Apache Arrow Workshop at VLDB 2019 / BOSS Session by Wes McKinney
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney2.5K views
Ibis: Scaling the Python Data Experience by Wes McKinney
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data Experience
Wes McKinney3.8K views
Improving data interoperability in Python and R by Wes McKinney
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
Wes McKinney2.6K views
My Data Journey with Python (SciPy 2015 Keynote) by Wes McKinney
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney7.4K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... by Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney103.9K views
PyCon Singapore 2013 Keynote by Wes McKinney
PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney94.6K views
Apache Arrow (Strata-Hadoop World San Jose 2016) by Wes McKinney
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney17K views
Future of pandas by Jeff Reback
Future of pandasFuture of pandas
Future of pandas
Jeff Reback5.4K views
Apache Arrow Flight: A New Gold Standard for Data Transport by Wes McKinney
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney2.2K views
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa... by Dremio Corporation
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
The Future of Column-Oriented Data Processing With Apache Arrow and Apache Pa...
Dremio Corporation1.1K views
Apache Arrow and Python: The latest by Wes McKinney
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney5.8K views

Viewers also liked

Raising the Tides: Open Source Analytics for Data Science by
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
3.2K views28 slides
ドローン農業最前線 by
ドローン農業最前線ドローン農業最前線
ドローン農業最前線tetsuya furukawa
8.5K views43 slides
Functional go by
Functional goFunctional go
Functional goGeison Goes
9.3K views27 slides
Goをカンストさせる話 by
Goをカンストさせる話Goをカンストさせる話
Goをカンストさせる話Moriyoshi Koizumi
6.4K views25 slides
Startup Pitch Decks by
Startup Pitch DecksStartup Pitch Decks
Startup Pitch DecksSteve Schlafman
80.7K views42 slides
Angular of things: angular2 + web bluetooth by
Angular of things: angular2 + web bluetoothAngular of things: angular2 + web bluetooth
Angular of things: angular2 + web bluetoothSergio Castillo Yrizales
737 views8 slides

Viewers also liked(20)

Raising the Tides: Open Source Analytics for Data Science by Wes McKinney
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney3.2K views
Functional go by Geison Goes
Functional goFunctional go
Functional go
Geison Goes9.3K views
マイクロサービスバックエンドAPIのためのRESTとgRPC by disc99_
マイクロサービスバックエンドAPIのためのRESTとgRPCマイクロサービスバックエンドAPIのためのRESTとgRPC
マイクロサービスバックエンドAPIのためのRESTとgRPC
disc99_19.9K views
Introduction to Search Systems - ScaleConf Colombia 2017 by Toria Gibbs
Introduction to Search Systems - ScaleConf Colombia 2017Introduction to Search Systems - ScaleConf Colombia 2017
Introduction to Search Systems - ScaleConf Colombia 2017
Toria Gibbs1.3K views
Mapping Experiences - Workshop Presentation by Jim Kalbach
Mapping Experiences - Workshop PresentationMapping Experiences - Workshop Presentation
Mapping Experiences - Workshop Presentation
Jim Kalbach7.4K views
Chainerを使って細胞を数えてみた by samacoba1983
Chainerを使って細胞を数えてみたChainerを使って細胞を数えてみた
Chainerを使って細胞を数えてみた
samacoba198352.2K views
Large Scale Deep Learning with TensorFlow by Jen Aman
Large Scale Deep Learning with TensorFlow Large Scale Deep Learning with TensorFlow
Large Scale Deep Learning with TensorFlow
Jen Aman24.8K views
HoloLens x Graphics 入門 by hecomi
HoloLens x Graphics 入門HoloLens x Graphics 入門
HoloLens x Graphics 入門
hecomi 13.8K views
PyCon APAC 2016 Keynote by Wes McKinney
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney3.6K views
Surrounded by flowers (Michael and Inessa Garmash ) by Makala (D)
Surrounded by flowers (Michael and Inessa Garmash )Surrounded by flowers (Michael and Inessa Garmash )
Surrounded by flowers (Michael and Inessa Garmash )
Makala (D)969 views
High Performance Python on Apache Spark by Wes McKinney
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney16.6K views
フォントの選び方・使い方 by k maztani
フォントの選び方・使い方フォントの選び方・使い方
フォントの選び方・使い方
k maztani43.3K views
Krenn algorithmic democracy_ab_jan_2016 by democracyGPS
Krenn algorithmic democracy_ab_jan_2016Krenn algorithmic democracy_ab_jan_2016
Krenn algorithmic democracy_ab_jan_2016
democracyGPS720 views

Similar to Memory Interoperability in Analytics and Machine Learning

Engineering patterns for implementing data science models on big data platforms by
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platformsHisham Arafat
556 views39 slides
Big data berlin by
Big data berlinBig data berlin
Big data berlinkammeyer
1.7K views62 slides
Design Choices for Cloud Data Platforms by
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data PlatformsAshish Mrig
233 views25 slides
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud by
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloudJeff Hung
286 views37 slides
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir by
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.iraminnezarat
88 views78 slides
Ursa Labs and Apache Arrow in 2019 by
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Wes McKinney
4.2K views30 slides

Similar to Memory Interoperability in Analytics and Machine Learning(20)

Engineering patterns for implementing data science models on big data platforms by Hisham Arafat
Engineering patterns for implementing data science models on big data platformsEngineering patterns for implementing data science models on big data platforms
Engineering patterns for implementing data science models on big data platforms
Hisham Arafat556 views
Big data berlin by kammeyer
Big data berlinBig data berlin
Big data berlin
kammeyer1.7K views
Design Choices for Cloud Data Platforms by Ashish Mrig
Design Choices for Cloud Data PlatformsDesign Choices for Cloud Data Platforms
Design Choices for Cloud Data Platforms
Ashish Mrig233 views
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud by Jeff Hung
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
Jeff Hung286 views
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir by aminnezarat
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
01 introduction fundamentals_of_parallelism_and_code_optimization-www.astek.ir
aminnezarat88 views
Ursa Labs and Apache Arrow in 2019 by Wes McKinney
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney4.2K views
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ... by MLconf
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
MLconf933 views
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow by PyData
Simplifying And Accelerating Data Access for Python With Dremio and Apache ArrowSimplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
Simplifying And Accelerating Data Access for Python With Dremio and Apache Arrow
PyData817 views
An architecture for federated data discovery and lineage over on-prem datasou... by DataWorks Summit
An architecture for federated data discovery and lineage over on-prem datasou...An architecture for federated data discovery and lineage over on-prem datasou...
An architecture for federated data discovery and lineage over on-prem datasou...
DataWorks Summit2.9K views
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney by Hakka Labs
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs1.1K views
Optimizing Hortonworks Apache Spark machine learning workloads for contempora... by Indrajit Poddar
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Optimizing Hortonworks Apache Spark machine learning workloads for contempora...
Indrajit Poddar396 views
The Challenges of Bringing Machine Learning to the Masses by Alice Zheng
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
Alice Zheng3.3K views
Nisha talagala keynote_inflow_2016 by Nisha Talagala
Nisha talagala keynote_inflow_2016Nisha talagala keynote_inflow_2016
Nisha talagala keynote_inflow_2016
Nisha Talagala602 views
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote) by Peter Wang
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Python for Data: Past, Present, Future (PyCon JP 2017 Keynote)
Peter Wang2.4K views
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده by Web Standards School
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان دادهمعرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017 by Amazon Web Services
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017
Architectures for HPC/HTC Workloads on AWS - CMP306 - re:Invent 2017

More from Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow by
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
1.1K views31 slides
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
1.1K views26 slides
Apache Arrow: High Performance Columnar Data Framework by
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
1.5K views53 slides
New Directions for Apache Arrow by
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
1.9K views27 slides
Apache Arrow: Leveling Up the Analytics Stack by
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackWes McKinney
1.4K views37 slides
Apache Arrow: Leveling Up the Data Science Stack by
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackWes McKinney
3.5K views20 slides

More from Wes McKinney(10)

Solving Enterprise Data Challenges with Apache Arrow by Wes McKinney
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney1.1K views
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by Wes McKinney
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney1.1K views
Apache Arrow: High Performance Columnar Data Framework by Wes McKinney
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney1.5K views
New Directions for Apache Arrow by Wes McKinney
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney1.9K views
Apache Arrow: Leveling Up the Analytics Stack by Wes McKinney
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney1.4K views
Apache Arrow: Leveling Up the Data Science Stack by Wes McKinney
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney3.5K views
Shared Infrastructure for Data Science by Wes McKinney
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney8.5K views
Data Science Without Borders (JupyterCon 2017) by Wes McKinney
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney6.2K views
Enabling Python to be a Better Big Data Citizen by Wes McKinney
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney6K views
Data Tools and the Data Scientist Shortage by Wes McKinney
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney3.7K views

Recently uploaded

iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...Bernd Ruecker
50 views69 slides
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueShapeBlue
134 views54 slides
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TShapeBlue
81 views34 slides
NTGapps NTG LowCode Platform by
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform Mustafa Kuğu
287 views30 slides
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueShapeBlue
63 views15 slides
The Power of Heat Decarbonisation Plans in the Built Environment by
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built EnvironmentIES VE
67 views20 slides

Recently uploaded(20)

iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker50 views
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue by ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlueVNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
VNF Integration and Support in CloudStack - Wei Zhou - ShapeBlue
ShapeBlue134 views
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T by ShapeBlue
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&TCloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
CloudStack and GitOps at Enterprise Scale - Alex Dometrius, Rene Glover - AT&T
ShapeBlue81 views
NTGapps NTG LowCode Platform by Mustafa Kuğu
NTGapps NTG LowCode Platform NTGapps NTG LowCode Platform
NTGapps NTG LowCode Platform
Mustafa Kuğu287 views
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue by ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlueCloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
CloudStack Object Storage - An Introduction - Vladimir Petrov - ShapeBlue
ShapeBlue63 views
The Power of Heat Decarbonisation Plans in the Built Environment by IES VE
The Power of Heat Decarbonisation Plans in the Built EnvironmentThe Power of Heat Decarbonisation Plans in the Built Environment
The Power of Heat Decarbonisation Plans in the Built Environment
IES VE67 views
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O... by ShapeBlue
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
Declarative Kubernetes Cluster Deployment with Cloudstack and Cluster API - O...
ShapeBlue59 views
Igniting Next Level Productivity with AI-Infused Data Integration Workflows by Safe Software
Igniting Next Level Productivity with AI-Infused Data Integration Workflows Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Safe Software373 views
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading... by The Digital Insurer
Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...Webinar : Desperately Seeking Transformation - Part 2:  Insights from leading...
Webinar : Desperately Seeking Transformation - Part 2: Insights from leading...
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates by ShapeBlue
Keynote Talk: Open Source is Not Dead - Charles Schulz - VatesKeynote Talk: Open Source is Not Dead - Charles Schulz - Vates
Keynote Talk: Open Source is Not Dead - Charles Schulz - Vates
ShapeBlue178 views
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue by ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlueWhat’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
What’s New in CloudStack 4.19 - Abhishek Kumar - ShapeBlue
ShapeBlue191 views
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti... by ShapeBlue
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
DRaaS using Snapshot copy and destination selection (DRaaS) - Alexandre Matti...
ShapeBlue69 views
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ... by ShapeBlue
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
Backup and Disaster Recovery with CloudStack and StorPool - Workshop - Venko ...
ShapeBlue114 views
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT by ShapeBlue
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBITUpdates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
Updates on the LINSTOR Driver for CloudStack - Rene Peinthor - LINBIT
ShapeBlue138 views
Business Analyst Series 2023 - Week 4 Session 7 by DianaGray10
Business Analyst Series 2023 -  Week 4 Session 7Business Analyst Series 2023 -  Week 4 Session 7
Business Analyst Series 2023 - Week 4 Session 7
DianaGray10110 views
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue by ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
2FA and OAuth2 in CloudStack - Andrija Panić - ShapeBlue
ShapeBlue75 views
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f... by TrustArc
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc Webinar - Managing Online Tracking Technology Vendors_ A Checklist f...
TrustArc130 views

Memory Interoperability in Analytics and Machine Learning

  • 1. www.twosigma.com Memory Interoperability for Analytics and Machine Learning March 26, 2017All Rights Reserved Wes McKinney @wesmckinn ScaledML @ Stanford March 25, 2017
  • 2. Me March 26, 2017 • Currently: Software Architect at Two Sigma Investments • Creator of Python pandas project • PMC member for Apache Arrow and Apache Parquet • Author of Python for Data Analysis • Other Python projects: Ibis, Feather, statsmodels All Rights Reserved 2
  • 3. Important Legal Information March 26, 2017 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved All Rights Reserved 3
  • 4. This talk 4March 26, 2017 • Benefits of interoperable data and metadata • Challenges to sharing memory between runtime environments • Apache Arrow: Purpose and C++ architecture • Opportunities for collaboration • Example application: pandas 2.0 All Rights Reserved
  • 5. Changing hardware landscape March 26, 2017 • Intel has released first production 3D Xpoint SSD • Reported 1000x faster than NAND, less expensive than RAM • Convergence between RAM vs. shared memory / mmap performance All Rights Reserved 5
  • 6. Changing software landscape March 26, 2017 • Next-gen ML / AI frameworks (TensorFlow, Torch, etc.) • DIY open source architectures for machine learning in production • Streaming / batch data processing pipelines • Data cleaning and feature engineering • Model fitting / scoring / serving All Rights Reserved 6
  • 7. “Zero-copy” memory interfaces March 26, 2017 • Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space • Can do random access on a dataset that does not fit in RAM • Another interpretation: reading a dataset is a metadata-only conversion All Rights Reserved 7
  • 8. Challenges to zero-copy memory sharing March 26, 2017 • Cross-language issues • Type metadata + logical types • Byte/bit-level memory layout • Language-specific issues • In-memory data structures • Memory allocation and sharing constructs All Rights Reserved 8
  • 9. What is pandas? March 26, 2017 • Popular in-memory data manipulation tool for Python • Focused on tabular datasets (“data frames”) • Sprawling codebase spanning multiple areas • IO for many data formats • Array manipulations / data preparation • OLAP-style analytics • Internals implemented using NumPy array objects All Rights Reserved 9
  • 10. NumPy March 26, 2017 • Tensor memory model ("ndarray") for numeric data • Strided, homogeneously-typed, byte-addressable memory • APL-inspired semantics • Zero-copy construction from compatible memory layouts • Computational tools support both strided and contiguous memory access All Rights Reserved 10
  • 11. pandas: Technical debt + Architectural issues March 26, 2017 • Tensor library like NumPy awkward fit for pandas use cases • Multidimensionality + strided memory access complicated algorithms • Lack of built-in missing value support • Weak on native string, variable length, or nested types • pandas at core a “in-memory columnar” problem, similar to analytical SQL engines All Rights Reserved 11
  • 12. Thesis: Tensors and Tables March 26, 2017 • 2 data structures best suited for zero-copy sharing • Tensors: N-dimensional, homogeneously-typed arrays • Tables: Column-oriented, heterogeneously typed • These data structures can be defined using common memory and metadata primitives All Rights Reserved 12
  • 13. Observations March 26, 2017 • A Tensor is semantically a multidimensional view of a 1D block of memory • Writing computational code targeting arbitrary tensors is much more difficult than 1D contiguous arrays • Tensors of non-fixed size types (e.g. strings) occur less frequently All Rights Reserved 13
  • 14. Apache Arrow March 26, 2017 • github.com/apache/arrow • Collaboration amongst broad set of OSS projects around language-agnostic shared data structures • Initial focus • In-memory columnar tables • Canonical metadata • Interoperability between JVM and native code (C/C++) ecosystem All Rights Reserved 14
  • 15. High performance data interchange March 26, 2017All Rights Reserved Today With Arrow Source: Apache Arrow 15
  • 16. What does Apache Arrow give you? March 26, 2017 • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, Javascript in various stages of development All Rights Reserved 16
  • 17. Arrow in C++ March 26, 2017 • Reusable memory management and IO subsystem for native code applications • Layered in multiple components • Memory management • Type metadata / schemas • Array / Table containers • IO interfaces • Zero-copy IPC / messaging All Rights Reserved 17
  • 18. Arrow C++: Memory management March 26, 2017 • arrow::Buffer • RAII-based memory lifetime with std::shared_ptr<Buffer> • arrow::MemoryMappedBuffer: for memory maps • arrow::MemoryPool • Abstract memory allocator for tracking all allocations All Rights Reserved 18
  • 19. Arrow C++: Type metadata March 26, 2017 • arrow::DataType • Base class for fixed size, variable size, and nested datatypes • arrow::Field • Type + name + additional metadata • arrow::Schema • Collection of fields All Rights Reserved 19
  • 20. Arrow C++: Array / Table containers March 26, 2017 • arrow::Array • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc. • Support for dictionary-encoded arrays • arrow::RecordBatch • Collection of equal-length arrays • arrow::Column • Logical table “column” as chunked array • arrow::Table • Collection of columns All Rights Reserved 20
  • 21. Arrow C++: IO interfaces March 26, 2017 • arrow::{InputStream, OutputStream} • arrow::RandomAccessFile • Abstract file interface • arrow::MemoryMappedFile • Zero-copy reads to arrow::Buffer • Specific implementations for OS files, HDFS, etc. All Rights Reserved 21
  • 22. Arrow C++: Messaging / IPC March 26, 2017 • Metadata read/write using Google’s Flatbuffers library • Encapsulated Message type • Write record batches, read with zero-copy • arrow::{FileWriter, FileReader} • Random access / “batch” binary format • arrow::{StreamWriter, StreamReader} • Streaming binary format All Rights Reserved 22
  • 23. In development: arrow::Tensor March 26, 2017 • Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks • data: arrow::Buffer • shape: dimension sizes • strides: memory ordering • Zero-copy reads using Arrow’s shared memory tools • Support Tensor math libraries for C++ like xtensor All Rights Reserved 23
  • 24. Example use: Ray ML framework from Berkeley RISELab March 26, 2017All Rights Reserved 24 Source: https://arxiv.org/abs/1703.03924 • Shared memory-based object store • Zero-copy tensor reads using Arrow libraries
  • 25. Example use: pandas 2.0 March 26, 2017 • In-planning rearchitecture of pandas’s internals • libpandas — largely Python-agnostic C++11 library • Decoupling pandas data structures from NumPy tensors • Support analytics targeting native Arrow memory • Multicore / parallel algorithms • Leverage latest SIMD intrinsics • Lazy-loading DataFrames from primary input formats • CSV, JSON, HDF5, Apache Parquet All Rights Reserved 25
  • 26. Other examples March 26, 2017 • Spark integration (SPARK-13534) • Weld integration (ARROW-649) All Rights Reserved 26
  • 27. Thank you March 26, 2017 • Building code and community around • IO subsystems • Metadata • Data structures and in-memory formats All Rights Reserved 27