Standardizing arrays -- Microsoft Presentation


  1. Standardizing ND-Arrays (Tensors) in Python. Quansight Labs. travis@quansight.com, @quansightai, @teoliphant. Python Summit, December 2018. © 2017 Continuum Analytics - Confidential & Proprietary; © 2018 Quansight - Confidential & Proprietary.
  2. Python Data Analysis and Machine Learning Time-Line [Figure: timeline, 1991 to 2018]
  3. Maintenance Problem: Funding for Community Devs. Open source is too important to be left to volunteer time alone; the current situation is not working to sustain millions of users: • No funding for creators of these libraries to continue their work • GPU support could have been added to NumPy years ago • SciPy took 17 years to hit 1.0 • NumPy should already be at 2.0, but not without full-time guidance and leadership. [Figure: full-time developers per project: 2, 0, 1/2, 2, 0]
  4. 2012 - Created Two Orgs for Sustainability: Company and Community. Company: enterprise software company initially built on services and supporting open source. Became
  5. Quansight: continuing Continuum momentum. Key members of the management team at Continuum Analytics ==> Anaconda was our first (spin-out) company. [Figure: Incubate, Spin Out, Replaced by; 2012, 2015, 2018, 2019 and beyond…; two entries marked ?]
  6. Build and Connect Companies and Communities to Solve Challenging Problems with Data. Continuing my quest to find more ways to pay developers to work on open source!
  7. Open Source Directions: webinar series to promote and encourage accessible publicity about what community developers are thinking about.
  8. LABS: Sustaining the Future. Open-source innovation and maintenance around the entire data-science and AI workflow. • Hire and fund a “PyData Core Team” • GPU Support for NumPy Ecosystem • Improve foundations of Array computing • JupyterLab development and plugins • Data Catalog standards and demos • Packaging (conda-forge, PyPA, etc.) • Cross Language Integration. uarray: unified array interface and “symbolic” NumPy. xnd: re-factored NumPy (low-level cross-language libraries for N-D (tensor) computing). Partnered with NumFOCUS and Ursa Labs (supporting Apache Arrow). Bokeh. Adapted from Jake Vanderplas PyCon 2017 Keynote. http://quansight.com/labs
  9. Quansight Labs Team: Pearu Peterson, Saul Shanabrook, Hameer Abbasi, Stefan Krah, Tony Fast, Anirrudh Krishnan, Aaron Meurer, David Charboneau, Chris Ostrouchov, Ryan Henning, Carlos Cordoba, Anthony Scopatz, James Bourbeau, Sameer Deshmukh, Ivan Ogasawara, Ian Rose, Hugo Shi
  10. NumPy was created to merge array objects in Python and unify the PyData community: Numeric and Numarray became NumPy, 2005 to 2006.
  11. Now a large community effort: SciPy ~ 673 contributors, NumPy ~ 709 contributors.
  12. Bokeh. Adapted from Jake Vanderplas PyCon 2017 Keynote.
  13. Python’s Scientific Ecosystem. Bokeh. Adapted from Jake Vanderplas PyCon
  14. Explosion of ML Frameworks and libraries (TVM/NNVM). Sources: https://github.com/josephmisiti/awesome-machine-learning#python-general-purpose http://deeplearning.net/software_links/ http://scikit-learn.org/stable/related_projects.html
  15. Now array-like objects everywhere: Sparse Arrays, Neon, CUDArray.
  16. We have a “divided” community again! Numeric, Numarray, NumPy.
  17. Real problem is packages have little re-use: FastAI, skorch, Pyro, Eduard, anyrl, Braid, PyMC4, MLFlow, torchdiffeq.
  18. Two additional efforts in 2006. Buffer Protocol (PEP 3118): way for all Python objects to share memory using NumPy-like data structures (strided memory layout with a shape); “memoryview”. Type system not solved at the time (punted to the struct module syntax extended with character codes): (“I 2s f”) == dtype(‘u4, 2S, f’). __array_interface__: protocol approach. Any object can define this attribute to explain how it could be interpreted as an array; still tied to NumPy structure (strided layout).
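The buffer-protocol half of this slide can be tried from pure Python through `memoryview`. The sketch below is only an illustration: it uses the standard-library `array` and `struct` modules instead of NumPy, and the variable names are invented for the example. It shows a shape and strides laid over shared memory, plus a struct-style format string for a record made of an unsigned int, a 2-byte string and a float.

```python
import struct
from array import array

# Six unsigned ints in one contiguous block of memory.
storage = array("I", range(6))

# memoryview exposes the buffer protocol (PEP 3118) to Python code.
flat = memoryview(storage)

# Reinterpret the same memory as a 2 x 3 grid. cast() has to pass through
# a byte format when the shape changes for a non-byte item format.
grid = flat.cast("B").cast("I", shape=[2, 3])

print(grid.shape)    # shape of the view
print(grid.strides)  # bytes to step along each dimension

# No copy was made: a write through one view is visible in the other.
flat[0] = 42
print(grid.tolist())

# The struct-module syntax describes a record field by field:
# unsigned 32-bit int, 2-byte string, 32-bit float ("<" disables padding).
record_format = "<I2sf"
packed = struct.pack(record_format, 7, b"ab", 1.5)
print(struct.calcsize(record_format), struct.unpack(record_format, packed))
```

The `"<"` prefix is a choice made for this example so that the record size does not depend on platform alignment; the slide's own format string has no prefix.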
  19. What if we revisit these earlier efforts? Buffer Protocol (PEP 3118): cross-language buffer protocol plus NumPy-like math libraries. __array_interface__: uarray, a new project to formalize and generalize the array protocol for Python so that downstream projects can depend on it (rather than on a single array).
  20. NumPy’s Key Parts: dtype, ndarray, umath. dtype: description of what is “in the array”; a data-description language, but missing key primitives (pointer, missing-data types, categoricals, new float types, etc.). Strictly extensible, but not easily. Innovation was the ability to map to any memory pointer that you could describe via the dtype “language” and then “slice and dice”. ndarray: pointer to data described by “dtype”, with shape and strides information and powerful “indexing” capabilities. Mapping pointers to the start of a data structure you can describe with dtype and then applying (generalized) ufuncs is the essence of array-oriented computing. umath: math and functions for arrays. Started as “scalar” kernels (ufuncs) that are applied over the array. D. E. Shaw added “generalized ufuncs”, which allow the kernel applied over the array to involve “inner dimensions” (i.e. dot, cholesky, svd and argmax can each be a kernel).
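The "pointer plus dtype plus shape and strides" idea can be sketched without NumPy. The class below is a hypothetical pure-Python stand-in, not NumPy's implementation: `StridedView`, `column` and `apply_scalar_kernel` are names invented here, and a `struct` format string plays the role of the dtype.

```python
import struct
from itertools import product


class StridedView:
    """Pure-Python stand-in for 'pointer + dtype + shape + strides'."""

    def __init__(self, buffer, fmt, shape, strides, offset=0):
        self.buffer = buffer           # raw bytes (the "pointer")
        self.fmt = fmt                 # struct format of one element (the "dtype")
        self.shape = tuple(shape)
        self.strides = tuple(strides)  # bytes to step along each dimension
        self.offset = offset

    def __getitem__(self, index):
        if isinstance(index, int):
            index = (index,)
        # Element address = base offset + sum(index[k] * strides[k]).
        position = self.offset + sum(i * s for i, s in zip(index, self.strides))
        return struct.unpack_from(self.fmt, self.buffer, position)[0]

    def transpose(self):
        # Reversing shape and strides reorders the axes without copying.
        return StridedView(self.buffer, self.fmt,
                           self.shape[::-1], self.strides[::-1], self.offset)

    def column(self, j):
        # "Slice and dice": a column is the same memory with a new offset.
        return StridedView(self.buffer, self.fmt, self.shape[:1],
                           self.strides[:1], self.offset + j * self.strides[1])

    def indices(self):
        return product(*(range(n) for n in self.shape))


def apply_scalar_kernel(kernel, view):
    """Apply a scalar function to every element, in the style of a ufunc."""
    return [kernel(view[index]) for index in view.indices()]


raw = struct.pack("<6i", 0, 1, 2, 3, 4, 5)  # 2 x 3 int32 values, row-major
a = StridedView(raw, "<i", shape=(2, 3), strides=(12, 4))

print(a[1, 2])
print(a.transpose()[2, 1])  # same element reached through swapped strides
print([a.column(1)[i] for i in range(2)])
print(apply_scalar_kernel(lambda x: x * x, a))
```

Transposing and taking a column never touch `raw`; only the shape, strides and offset change, which is the property the slide calls the essence of array-oriented computing.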
  21. libndtypes, libgumath, libxnd: C libraries with defined API/ABI. Language bindings (Python, Ruby, …): ndtypes, gumath, xnd. Together: a generalization of dtype (description of “any” container), a generalization of the NumPy array container, and universal functions (arbitrary kernels applied over the data). Need: C++, Scala, Node, F#, C#, Go, Java. Not a NumPy replacement, but could be used by NumPy!
  22. Is a generalization of Arrow: you could describe an Arrow container with XND, like Pandas columns are NumPy arrays.
  23. Unified (or Universal) Array Interface. Need to fix the “string / bytes” problem of the array world! Logical array vs. strided pointer of NumPy. “uarray” interface …… CuPy
  24. Big Hairy Audacious Goal (BHAG): enhance the array ecosystem (initially for Python) with an abstract interface that downstream libraries can use (with a concrete interface based on xnd). • Reuse as much of the existing ecosystem as possible. • Easily allow multiple implementations of an array (sparse, hardware-backed, delayed) with a common interface. • Libraries (e.g. SciPy and PyData) that depend only on the interface could be compiled down to hardware or use a backend runtime.
  25. Collaboration with Mathematics. Apply reduction rules from the “Mathematics of Arrays” to code that uses the array_interface. Lenore Mullin worked with Ken Iverson on APL and has since developed a formal mathematics of arrays that shows how arbitrary array-based calculations (based on the Psi function) can be consistently defined, simplified and formalized to be optimally implemented on arbitrary hardware. https://www.researchgate.net/profile/Lenore_Mullin https://arxiv.org/abs/0907.0796 Tensors and n-d Arrays: A Mathematics of Arrays (MoA), psi-Calculus and the Composition of Tensor and Array Operations
  26. Similar to but learning from…
  27. Current NumPy (API is huge…). NumPy API counts by category (Compute/Transform, Creation/Reading, Reporting/Output, Indexing/Subsetting, MetaData/Attributes, Other, Total): Functions 33, 7, 6, 12, 11, 2, 71; Methods 226, 170, 22, 38, 21, 68, 545. • Generalized ufuncs on top of this, including segmentation (grouping) and reduction • Input/output rules for reducing and simplifying functions • Method for defining pipelines of functions (with automatic differentiation)
  28. What is an array (or tensor)? Core API that might be necessary. Fundamental concepts: • shape (a named tuple) • a function that takes a tuple of indexes and returns another array (Psi function) • a “dtype” or “memory-type” (what the elements are) • math that works with arrays. Other important concepts: • for each dimension, an “index” mapping from index space to 0..N-1 (labels) • data pointer (including device ID) • slicing, sub-selection, and indexing capability • conversion from (0-d array) to Python scalar type • optional bit-array for masking missing data • functions for concatenation • functions for creating and filling the array (from a file, from a string, from Python objects, from ODBC)
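Several of these concepts fit in a few lines of Python. The sketch below is an assumption-laden illustration, not part of any library: `FunctionalArray`, `loc` and `missing_bits` are hypothetical names, and for simplicity the index function returns an element rather than another array. It combines a shape, an index function, per-dimension label mappings onto 0..N-1, and a bit mask for missing data.

```python
class FunctionalArray:
    """An array given by a shape, an index function, labels and a bit mask."""

    def __init__(self, shape, psi, labels=None, missing_bits=0):
        self.shape = tuple(shape)
        self.psi = psi                    # tuple of ints -> element
        self.labels = labels or {}        # axis -> {label: position}
        self.missing_bits = missing_bits  # bit k set => flat element k is missing

    def _flat_position(self, index):
        # Row-major position of a multi-index, used to look up the mask bit.
        position = 0
        for i, n in zip(index, self.shape):
            if not 0 <= i < n:
                raise IndexError(index)
            position = position * n + i
        return position

    def __getitem__(self, index):
        if len(index) != len(self.shape):
            raise IndexError(index)
        if (self.missing_bits >> self._flat_position(index)) & 1:
            return None                   # masked: the value is missing
        return self.psi(tuple(index))

    def loc(self, *keys):
        # Translate labels to positions 0..N-1; unlabeled axes take plain ints.
        index = tuple(self.labels.get(axis, {}).get(key, key)
                      for axis, key in enumerate(keys))
        return self[index]


table = FunctionalArray(
    shape=(2, 3),
    psi=lambda index: 10 * index[0] + index[1],
    labels={0: {"a": 0, "b": 1}},
    missing_bits=0b000100,  # flat position 2, which is index (0, 2)
)

print(table[1, 2])
print(table.loc("b", 2))
print(table[0, 2])
```

Nothing here stores the elements: the array is defined entirely by its shape and index function, so a dense, sparse or lazily computed backend could sit behind the same lookup.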
  29. First part of the general idea: __uarray__ returns an object that implements the array interface. uarray interface (strawperson phase…). Required: __u_psi__ : function mapping from a sequence of integers to an mtype; __u_shape__ : a named tuple showing the shape of the uarray (or None if unknown); __u_mtype__ : what this array contains, i.e. the Python type object of each element of the array; __u_attr__ : named tuple of attributes (version, ndim, jagged, strided, c-like, f-like, …). Optional: __u_llvm__ : return named tuple of LLVM snippets for the psi function; __u_llfuncs__ : return named tuple of low-level function pointers; __u_psi_dim__ : function mapping from a sequence of integers to a __uarray__ one dimension smaller; __u_setelement__ : a function that sets an element of the array with an object of type mtype; __u_getelement__ : a function that gets an element of the array; __u_fromiter__ : function to build an array from an iterator; __u_frombuffer__ : function to build a “gamma-based” uarray from a buffer; __u_concat__ : concatenate a sequence of __uarray__ objects along an axis.
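A rough sketch of how the required part of this strawperson could look in Python follows. The dunder names are taken from the slide; everything else (`DenseGrid`, `SparseGrid`, `Shape2D`, `total`, `to_lists`) is hypothetical and written only for this illustration. It is not the API of any released uarray package.

```python
from collections import namedtuple
from itertools import product

Shape2D = namedtuple("Shape2D", ["rows", "cols"])


class DenseGrid:
    """Array backed by nested lists."""

    def __init__(self, rows):
        self._rows = [list(row) for row in rows]
        self.__u_shape__ = Shape2D(len(self._rows), len(self._rows[0]))
        self.__u_mtype__ = float

    def __uarray__(self):
        return self

    def __u_psi__(self, index):
        i, j = index
        return self._rows[i][j]


class SparseGrid:
    """Array that stores only its non-zero entries."""

    def __init__(self, shape, entries):
        self._entries = dict(entries)
        self.__u_shape__ = Shape2D(*shape)
        self.__u_mtype__ = float

    def __uarray__(self):
        return self

    def __u_psi__(self, index):
        return self._entries.get(tuple(index), 0.0)


def total(obj):
    """Sum all elements using nothing but the interface."""
    ua = obj.__uarray__()
    indices = product(*(range(n) for n in ua.__u_shape__))
    return sum(ua.__u_psi__(index) for index in indices)


def to_lists(obj):
    """Materialize a 2-D uarray as nested lists, again via the interface only."""
    ua = obj.__uarray__()
    rows, cols = ua.__u_shape__
    return [[ua.__u_psi__((i, j)) for j in range(cols)] for i in range(rows)]


dense = DenseGrid([[1.0, 2.0], [3.0, 4.0]])
sparse = SparseGrid((2, 2), {(0, 1): 2.0, (1, 0): 3.0})

print(total(dense), total(sparse))
print(to_lists(sparse))
```

`total` and `to_lists` never inspect how either class stores its data, which is the kind of re-use across implementations (sparse, hardware-backed, delayed) that the goal on slide 24 asks for.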
  30. Core C-API from NumPy. Core Array Container: PyArray_FromAny, PyArray_Shape, PyArray_New, PyArray_Fill, PyArray_Copy, PyArray_Take, PyArray_Put, PyArray_NDIM, PyArray_GETITEM, PyArray_SETITEM, … DType: * EquivArrTypes * GetItem * SetItem * CopySwapN * CopySwap * ScanFunc * FromStr * FillFunc * meta-data. Basic idea: provide a place for these function pointers in the Python TypeObject.
  31. Start of a Proposal. Core Array Container: tp_as_ndarray (PyNDArrayMethods), analogous to PySequenceMethods. dtype (Mtype): standardized function pointers for “bits” in an “element” of a data structure. Inherit from PyHeapTypeObject.
