Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Memory Interoperability in Analytics and Machine Learning

4,552 views

Published on

March 25, 2017 @ ScaledML in Stanford. Talk about pandas, Apache Arrow in C++, and more

Published in: Technology
  • Hello! Get Your Professional Job-Winning Resume Here - Check our website! https://vk.cc/818RFv
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Memory Interoperability in Analytics and Machine Learning

  1. 1. www.twosigma.com Memory Interoperability for Analytics and Machine Learning March 26, 2017All Rights Reserved Wes McKinney @wesmckinn ScaledML @ Stanford March 25, 2017
  2. 2. Me March 26, 2017 • Currently: Software Architect at Two Sigma Investments • Creator of Python pandas project • PMC member for Apache Arrow and Apache Parquet • Author of Python for Data Analysis • Other Python projects: Ibis, Feather, statsmodels All Rights Reserved 2
  3. 3. Important Legal Information March 26, 2017 The information presented here is offered for informational purposes only and should not be used for any other purpose (including, without limitation, the making of investment decisions). Examples provided herein are for illustrative purposes only and are not necessarily based on actual data. Nothing herein constitutes: an offer to sell or the solicitation of any offer to buy any security or other interest; tax advice; or investment advice. This presentation shall remain the property of Two Sigma Investments, LP (“Two Sigma”) and Two Sigma reserves the right to require the return of this presentation at any time. Some of the images, logos or other material used herein may be protected by copyright and/or trademark. If so, such copyrights and/or trademarks are most likely owned by the entity that created the material and are used purely for identification and comment as fair use under international copyright and/or trademark laws. Use of such image, copyright or trademark does not imply any association with such organization (or endorsement of such organization) by Two Sigma, nor vice versa. Copyright © 2017 TWO SIGMA INVESTMENTS, LP. All rights reserved All Rights Reserved 3
  4. 4. This talk 4March 26, 2017 • Benefits of interoperable data and metadata • Challenges to sharing memory between runtime environments • Apache Arrow: Purpose and C++ architecture • Opportunities for collaboration • Example application: pandas 2.0 All Rights Reserved
  5. 5. Changing hardware landscape March 26, 2017 • Intel has released first production 3D Xpoint SSD • Reported 1000x faster than NAND, less expensive than RAM • Convergence between RAM vs. shared memory / mmap performance All Rights Reserved 5
  6. 6. Changing software landscape March 26, 2017 • Next-gen ML / AI frameworks (TensorFlow, Torch, etc.) • DIY open source architectures for machine learning in production • Streaming / batch data processing pipelines • Data cleaning and feature engineering • Model fitting / scoring / serving All Rights Reserved 6
  7. 7. “Zero-copy” memory interfaces March 26, 2017 • Enables computational tools to process a dataset without any additional serialization, or transfer to a different memory space • Can do random access on a dataset that does not fit in RAM • Another interpretation: reading a dataset is a metadata-only conversion All Rights Reserved 7
  8. 8. Challenges to zero-copy memory sharing March 26, 2017 • Cross-language issues • Type metadata + logical types • Byte/bit-level memory layout • Language-specific issues • In-memory data structures • Memory allocation and sharing constructs All Rights Reserved 8
  9. 9. What is pandas? March 26, 2017 • Popular in-memory data manipulation tool for Python • Focused on tabular datasets (“data frames”) • Sprawling codebase spanning multiple areas • IO for many data formats • Array manipulations / data preparation • OLAP-style analytics • Internals implemented using NumPy array objects All Rights Reserved 9
  10. 10. NumPy March 26, 2017 • Tensor memory model ("ndarray") for numeric data • Strided, homogeneously-typed, byte-addressable memory • APL-inspired semantics • Zero-copy construction from compatible memory layouts • Computational tools support both strided and contiguous memory access All Rights Reserved 10
  11. 11. pandas: Technical debt + Architectural issues March 26, 2017 • Tensor library like NumPy awkward fit for pandas use cases • Multidimensionality + strided memory access complicated algorithms • Lack of built-in missing value support • Weak on native string, variable length, or nested types • pandas at core a “in-memory columnar” problem, similar to analytical SQL engines All Rights Reserved 11
  12. 12. Thesis: Tensors and Tables March 26, 2017 • 2 data structures best suited for zero-copy sharing • Tensors: N-dimensional, homogeneously-typed arrays • Tables: Column-oriented, heterogeneously typed • These data structures can be defined using common memory and metadata primitives All Rights Reserved 12
  13. 13. Observations March 26, 2017 • A Tensor is semantically a multidimensional view of a 1D block of memory • Writing computational code targeting arbitrary tensors is much more difficult than 1D contiguous arrays • Tensors of non-fixed size types (e.g. strings) occur less frequently All Rights Reserved 13
  14. 14. Apache Arrow March 26, 2017 • github.com/apache/arrow • Collaboration amongst broad set of OSS projects around language-agnostic shared data structures • Initial focus • In-memory columnar tables • Canonical metadata • Interoperability between JVM and native code (C/C++) ecosystem All Rights Reserved 14
  15. 15. High performance data interchange March 26, 2017All Rights Reserved Today With Arrow Source: Apache Arrow 15
  16. 16. What does Apache Arrow give you? March 26, 2017 • Cache-efficient columnar memory: optimized for CPU affinity and SIMD / parallel processing, O(1) random value access • Zero-copy messaging / IPC: Language-agnostic metadata, batch/file-based and streaming binary formats • Complex schema support: Flat and nested data types • Main implementations in C++ and Java: with integration tests • Bindings / implementations for C, Python, Ruby, Javascript in various stages of development All Rights Reserved 16
  17. 17. Arrow in C++ March 26, 2017 • Reusable memory management and IO subsystem for native code applications • Layered in multiple components • Memory management • Type metadata / schemas • Array / Table containers • IO interfaces • Zero-copy IPC / messaging All Rights Reserved 17
  18. 18. Arrow C++: Memory management March 26, 2017 • arrow::Buffer • RAII-based memory lifetime with std::shared_ptr<Buffer> • arrow::MemoryMappedBuffer: for memory maps • arrow::MemoryPool • Abstract memory allocator for tracking all allocations All Rights Reserved 18
  19. 19. Arrow C++: Type metadata March 26, 2017 • arrow::DataType • Base class for fixed size, variable size, and nested datatypes • arrow::Field • Type + name + additional metadata • arrow::Schema • Collection of fields All Rights Reserved 19
  20. 20. Arrow C++: Array / Table containers March 26, 2017 • arrow::Array • 1-dimensional columnar arrays: Int32Array, ListArray, StructArray, etc. • Support for dictionary-encoded arrays • arrow::RecordBatch • Collection of equal-length arrays • arrow::Column • Logical table “column” as chunked array • arrow::Table • Collection of columns All Rights Reserved 20
  21. 21. Arrow C++: IO interfaces March 26, 2017 • arrow::{InputStream, OutputStream} • arrow::RandomAccessFile • Abstract file interface • arrow::MemoryMappedFile • Zero-copy reads to arrow::Buffer • Specific implementations for OS files, HDFS, etc. All Rights Reserved 21
  22. 22. Arrow C++: Messaging / IPC March 26, 2017 • Metadata read/write using Google’s Flatbuffers library • Encapsulated Message type • Write record batches, read with zero-copy • arrow::{FileWriter, FileReader} • Random access / “batch” binary format • arrow::{StreamWriter, StreamReader} • Streaming binary format All Rights Reserved 22
  23. 23. In development: arrow::Tensor March 26, 2017 • Targeting interoperability with memory layouts as used in NumPy, TensorFlow, Torch, or other standard tensor-based frameworks • data: arrow::Buffer • shape: dimension sizes • strides: memory ordering • Zero-copy reads using Arrow’s shared memory tools • Support Tensor math libraries for C++ like xtensor All Rights Reserved 23
  24. 24. Example use: Ray ML framework from Berkeley RISELab March 26, 2017All Rights Reserved 24 Source: https://arxiv.org/abs/1703.03924 • Shared memory-based object store • Zero-copy tensor reads using Arrow libraries
  25. 25. Example use: pandas 2.0 March 26, 2017 • In-planning rearchitecture of pandas’s internals • libpandas — largely Python-agnostic C++11 library • Decoupling pandas data structures from NumPy tensors • Support analytics targeting native Arrow memory • Multicore / parallel algorithms • Leverage latest SIMD intrinsics • Lazy-loading DataFrames from primary input formats • CSV, JSON, HDF5, Apache Parquet All Rights Reserved 25
  26. 26. Other examples March 26, 2017 • Spark integration (SPARK-13534) • Weld integration (ARROW-649) All Rights Reserved 26
  27. 27. Thank you March 26, 2017 • Building code and community around • IO subsystems • Metadata • Data structures and in-memory formats All Rights Reserved 27

×