
Blaze: a large-scale, array-oriented infrastructure for Python


This talk gives a high-level overview of the motivation, design goals, and status of Blaze, a large-scale array object for Python developed at Continuum Analytics.

Published in: Technology


  1. Blaze: a large-scale, array-oriented infrastructure for Python
     Travis E. Oliphant
     PyData Silicon Valley 2013
     Tuesday, March 19, 2013
  2. Brief History
     Person                                      Package                    Year
     Jim Fulton                                  Matrix Object in Python    1994
     Jim Hugunin                                 Numeric                    1995
     Perry Greenfield, Rick White, Todd Miller   Numarray                   2001
     Travis Oliphant                             NumPy                      2005
  3. Early pieces of SciPy
     • fftw wrappers: June 1998
     • cephesmodule: November 1998
     • stats.py (Gary Strangman): December 1998
  4. 1999: Early SciPy emerges
     Discussions on the matrix-sig from 1997 to 1999 called for a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998 led to increased interest in 1999.
     In response, on 15 Jan 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package, which eventually became SciPy.
     • Gaussian quadrature: 5 Jan 1999
     • cephes 1.0: 30 Jan 1999
     • sigtools 0.40: 23 Feb 1999
     • Numeric docs: March 1999
     • cephes 1.1: 9 Mar 1999
     • multipack 0.3: 13 Apr 1999
     • Helper routines: 14 Apr 1999
     • multipack 0.6 (leastsq, ode, fsolve, quad): 29 Apr 1999
     • sparse plan described: 30 May 1999
     • multipack 0.7: 14 Jun 1999
     • SparsePy 0.1: 5 Nov 1999
     • cephes 1.2 (vectorize): 29 Dec 1999
     • Helping with f2py
     Plotting??: Gist, XPLOT, DISLIN, Gnuplot
  5. SciPy 2001
     Founded in 2001 with Travis Vaught
     • Travis Oliphant: optimize, sparse, interpolate, integrate, special, signal, stats, fftpack, misc
     • Eric Jones: weave, cluster, GA*
     • Pearu Peterson: linalg, interpolate, f2py
  6. Community effort
     • Chuck Harris
     • Pauli Virtanen
     • David Cournapeau
     • Stefan van der Walt
     • Jake Vanderplas
     • Josef Perktold
     • Anne Archibald
     • Dag Sverre Seljebotn
     • Robert Kern
     • Matthew Brett
     • Warren Weckesser
     • Ralf Gommers
     • Joe Harrington (documentation effort)
     • Andrew Straw (www.scipy.org)
     • many, many others (forgive me!)
  7. 1,000,000 to 2,000,000 users of NumPy!
  8. Now What?
  9. What is good about NumPy?
     • Array-oriented
     • Extensive DType system (including structures)
     • C-API (lots of libraries)
     • Simple-to-understand data structure
     • Memory mapping
     • Syntax support from Python
     • Large community of users
     • Ufuncs and more
     • Broadcasting
     • Easy to interface C/C++/Fortran code
  10. What is wrong with NumPy
     • Dtype system is difficult to extend
     • Immediate mode creates huge temporaries (spawning Numexpr)
     • “Almost” an in-memory database comparable to SQLite (missing indexes)
     • Integration with sparse arrays
     • Lots of un-optimized parts
     • Minimal support for multi-core / GPU
  11. Improvements needed: NDArray improvements
     • Indexes (esp. for structured arrays)
     • SQL front-end
     • Multi-level, hierarchical labels
     • Selection via mappings (labeled arrays)
     • Memory spaces (array made up of regions)
     • Distributed arrays (global array)
     • Compressed arrays
     • Standard distributed persistence
     • Fancy indexing as view and optimizations
     • Streaming arrays
  12. Improvements needed: Dtype improvements
     • Enumerated types (including dynamic enumeration)
     • Derived fields
     • Specification as a class (or JSON)
     • Pointer dtype (i.e. C++ object, or varchar)
     • Missing data: masks and bit-patterns
     • Parameterized field names
     • Computed fields
  13. Improvements needed: Ufunc improvements
     • Generalized ufuncs support more than just contiguous arrays
     • Specification of ufuncs in Python
     • Move most dtype “array functions” to ufuncs
     • Unify error-handling for all computations
     • Allow lazy evaluation and remote computation (streaming and generator data)
     • Structured and string dtype ufuncs
     • Multi-core and GPU optimized ufuncs
     • Group-by reduction
  14. More improvements needed: Miscellaneous improvements
     • ABI management
     • Eventual move to library (NDLib)?
     • NDLib could serve as base for JavaScript and other high-level languages?
     • Integration with LLVM
     • Possible dtype / shape / stride unification into a “table interface”
     • Remote computation
     • Fast I/O for CSV and Excel
  15. New Project
     • NumPy
     • Blaze: Out of Core, Distributed and Optimized NumPy
  16. NumPy Array (diagram; recoverable label: shape)
  17. Blaze: Different kinds of Arrays
     Indexable
     • Record Type: NDTable (Deferred or Concrete)
     • Primitive Type: NDArray (Deferred or Concrete)
  18. Blaze Deferred Arrays
     • Symbolic objects which build a graph
     • Represents deferred computation
     • Usually what you have when you have a Blaze Array
     Diagram: the expression A + B*C as a graph, with "+" at the root over A and "*", and "*" over B and C.
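The graph-building idea on this slide can be illustrated with a small sketch. This is not Blaze's API: the class names `Node`, `Leaf`, and `Op` are hypothetical, and plain Python lists stand in for arrays. It only shows how operator overloading can record `A + B*C` as a graph and evaluate it later.

```python
import operator


class Node:
    """Base class: arithmetic on nodes builds a graph instead of computing."""

    def __add__(self, other):
        return Op("+", self, other)

    def __mul__(self, other):
        return Op("*", self, other)


class Leaf(Node):
    """A named array (hypothetical stand-in: a plain list of numbers)."""

    def __init__(self, name, data):
        self.name = name
        self.data = data

    def evaluate(self):
        return list(self.data)

    def __repr__(self):
        return self.name


class Op(Node):
    """A deferred binary operation; nothing is computed until evaluate()."""

    FUNCS = {"+": operator.add, "*": operator.mul}

    def __init__(self, symbol, left, right):
        self.symbol = symbol
        self.left = left
        self.right = right

    def evaluate(self):
        func = self.FUNCS[self.symbol]
        left = self.left.evaluate()
        right = self.right.evaluate()
        return [func(x, y) for x, y in zip(left, right)]

    def __repr__(self):
        return f"({self.left!r} {self.symbol} {self.right!r})"


A = Leaf("A", [1, 2, 3])
B = Leaf("B", [4, 5, 6])
C = Leaf("C", [7, 8, 9])

expr = A + B * C          # builds the graph only
print(expr)               # shows the recorded structure
print(expr.evaluate())    # computation happens here
```

Because the expression is an object, an engine is free to inspect or rewrite the graph before running it, which is what makes the later optimization and out-of-core steps possible.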
  19. Deferred allows handling large arrays
     Can be handled out-of-core using chunks to stream through memory.
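As a rough illustration of streaming chunks through memory, the sketch below reduces two sequences block by block, so only one chunk of each input is materialized at a time. The helper names `chunks` and `chunked_sum_of_products` are hypothetical, and `range` objects stand in for data that would really live on disk.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    iterator = iter(iterable)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


def chunked_sum_of_products(a, b, chunk_size=4):
    """Compute sum(a * b) while holding only one chunk of each input."""
    total = 0
    for block_a, block_b in zip(chunks(a, chunk_size), chunks(b, chunk_size)):
        total += sum(x * y for x, y in zip(block_a, block_b))
    return total


result = chunked_sum_of_products(range(10), range(10), chunk_size=4)
print(result)
```

The chunk size is the knob that trades memory use against per-chunk overhead.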
  20. Blaze Concrete Array
     • Data Descriptor (URLs and Indexes): where are the bytes?
     • DataShape (extensible type system, which includes shape): what do the bytes mean?
     • MetaData (dictionary): labels, provenance, etc.
  21. Multiple URLs comprising an array
  22. URLs Provide Bytes
     • Memory-Like: arbitrarily sliced, random seeks
     • File-Like: deal with in chunks, random seeks
     • Stream-Like: deal with in chunks, sequential seeks
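The three access patterns can be mimicked with standard-library objects. This is only an analogy under stated assumptions, not Blaze's byte-provider interface: `memoryview` plays the memory-like source, `io.BytesIO` the file-like one, and a hypothetical `stream` generator the stream-like one.

```python
import io

data = bytes(range(16))

# Memory-like: arbitrary slicing, including strides, without copying first.
memory = memoryview(data)
memory_slice = memory[4:12:2].tobytes()

# File-like: seek anywhere, then read a chunk.
file_like = io.BytesIO(data)
file_like.seek(8)
file_chunk = file_like.read(4)


# Stream-like: chunks arrive in order; there is no going back.
def stream(source, chunk_size):
    """Yield consecutive chunks from a readable source until exhausted."""
    while True:
        block = source.read(chunk_size)
        if not block:
            return
        yield block


stream_chunks = list(stream(io.BytesIO(data), 6))

print(list(memory_slice))
print(list(file_chunk))
print([len(block) for block in stream_chunks])
```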
  23. Blaze Data Container (diagram)
     Recoverable labels: Index; Data Buffer; Operation; ByteProvider; Data Descriptor Protocol; NumPy; BLZ; RDBMS; Persistent Format; Data Stream; CSV
  24. Indexes
     • Contiguous / Strided: NumPy-like
     • Chunked / Tiled: special access
     • Opaque: element-only
     • Opaque: iterator-access
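For the contiguous / strided case, the index is just arithmetic: the byte offset of an element is the dot product of its index with the strides. The sketch below assumes C (row-major) order and uses hypothetical helper names `strides_for` and `byte_offset`.

```python
def strides_for(shape, itemsize):
    """Row-major (C-order) strides in bytes for a contiguous array."""
    strides = []
    step = itemsize
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def byte_offset(index, strides):
    """Byte offset of the element at `index` from the start of the buffer."""
    return sum(i * s for i, s in zip(index, strides))


shape = (3, 4)                            # 3 rows, 4 columns
strides = strides_for(shape, itemsize=4)  # 4-byte items, e.g. int32
offset = byte_offset((2, 1), strides)

print(strides)
print(offset)
```

Chunked, tiled, and opaque indexes replace this closed-form arithmetic with a lookup or an iterator, which is why the slide treats them as separate kinds.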
  25. Indexes allow for many orderings
  26. DataShape Type System
     Diagram labels: Shape, DType, DataShape
     • A data description language
     • A super-set of NumPy’s dtype
     • Provides more flexibility
  27. Allows for all kinds of containers
  28. Advanced Types
     Record types:
         type Point = { x : int; y : int }
         type Space = { a: Point; b: Point }
         5, 10, Space
     Parametrized Types:
         type SquareMatrix T = N, N, T
     Alias Types:
         type IntMatrix N = N, N, int32
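To make the `5, 10, Space` example concrete, the sketch below models the record types as nested dictionaries and computes how many bytes such an array would occupy. It rests on two assumptions that the slide does not state: `int` is taken to be 4 bytes, and records are packed with no padding. The names `INT_SIZE`, `itemsize`, and `nbytes` are hypothetical.

```python
from math import prod

INT_SIZE = 4  # assumption: the slide's `int` is a 4-byte integer

# Record types from the slide, modelled as nested dicts.
Point = {"x": "int", "y": "int"}
Space = {"a": Point, "b": Point}


def itemsize(dtype):
    """Size in bytes of one element, assuming packed records (no padding)."""
    if dtype == "int":
        return INT_SIZE
    return sum(itemsize(field) for field in dtype.values())


def nbytes(shape, dtype):
    """Total bytes for an array with the given fixed shape and element type."""
    return prod(shape) * itemsize(dtype)


space_size = itemsize(Space)     # two Points, each with two ints
total = nbytes((5, 10), Space)   # the datashape `5, 10, Space`

print(space_size)
print(total)
```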
  29. Advanced Shapes
     {1,2,4,2,1}, int32
     could represent
         [ [1], [1,2], [1,3,2,9], [3,2], [3] ]
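The correspondence between the ragged shape and the nested data can be checked directly. The sketch also shows one common way to store such data, a flat buffer plus row offsets; that storage layout is an illustration only, not something the slide claims Blaze uses.

```python
rows = [[1], [1, 2], [1, 3, 2, 9], [3, 2], [3]]

# The per-row lengths are exactly the ragged dimension {1,2,4,2,1}.
row_lengths = [len(row) for row in rows]

# Illustrative storage: one flat buffer plus cumulative row offsets.
flat = [value for row in rows for value in row]
offsets = [0]
for length in row_lengths:
    offsets.append(offsets[-1] + length)


def get_row(i):
    """Recover row i from the flat buffer using the offsets table."""
    return flat[offsets[i]:offsets[i + 1]]


print(row_lengths)
print(offsets)
print(get_row(2))
```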
  30. Execution Model
     • Graphs dispatch to specialized library code that is “registered with the system” based on type and meta-data of array (Blaze modules)
     • Many operations can be compiled with LLVM to machine code
       • BLIR (simple typed expression syntax)
       • Numba (Python compiler)
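Dispatching to code “registered with the system” can be pictured as a lookup table keyed on operation and type. The sketch below is a generic registry pattern, not Blaze's module system; `REGISTRY`, `register`, and `dispatch` are hypothetical names, and type names are plain strings.

```python
REGISTRY = {}


def register(op, dtype):
    """Decorator that registers a kernel for an (operation, dtype) pair."""
    def decorator(func):
        REGISTRY[(op, dtype)] = func
        return func
    return decorator


@register("add", "int32")
def add_int32(a, b):
    # Elementwise integer addition.
    return [x + y for x, y in zip(a, b)]


@register("add", "string")
def add_string(a, b):
    # For strings, "add" is registered as elementwise concatenation.
    return [x + y for x, y in zip(a, b)]


def dispatch(op, dtype, *args):
    """Look up the kernel registered for this operation and type, then run it."""
    try:
        kernel = REGISTRY[(op, dtype)]
    except KeyError:
        raise TypeError(f"no kernel registered for {op!r} on {dtype!r}") from None
    return kernel(*args)


print(dispatch("add", "int32", [1, 2], [3, 4]))
print(dispatch("add", "string", ["a", "b"], ["c", "d"]))
```

A real system would also key on metadata such as layout or location, so that the same graph node could run on a specialized or compiled kernel when one is available.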
  31. Blaze Agents (diagram)
     • Code: Blaze code graph with arrays
     • Data: one Blaze Agent each for a CSV directory, MongoDB, Vertica, and HDFS
  32. How?
     “I think you should be more explicit here in step two.”
  33. Team
     • Travis Oliphant (NumPy, SciPy)
     • Stephen Diehl
     • Mark Florisson (Numba)
     • Peter Wang (Chaco, Bokeh)
     • Francesc Alted (PyTables)
     • Mark Wiebe (NumPy, DyND)
     • Oscar Villellas
  34. DARPA providing help
     DARPA-BAA-12-38: XDATA
     • TA-1: Scalable analytics and data processing technology
     • TA-2: Visual user interface technology
  35. Status
  36. Type System = DataShape
     Best Type System this side of Haskell!
  37. BLZ persistence
     BLZ layout at a glance (diagram). Recoverable labels:
     • Dataset (Blaze (BLZ) format): root; meta; data; __0__.blp, __1__.blp
     • Super-Chunk (Bloscpack (BLP) format): Header; Offset 0, Offset 1, Offset 2; Chunk 0, Chunk 1, Chunk 2
     • Chunk (Blosc format): Header; Offset 0, Offset 1, Offset 2; Block 0, Block 1, Block 2
  38. Blaze Server
     https://wakari.io/nb/urls/raw.github.com/ContinuumIO/blaze-web/master/example/notebooks/Kiva-Tiny%20Example.ipynb
     Computed Columns!
  39. Out-of-core calculations
  40. Distributed Array
     Coming soon....
  41. Roadmap
     • 0.1 release expected in May
     • 0.3 release at end of August
     • 1.0 release by PyData West Coast 2014
     • For now, only get involved if you want to develop
     • Then, continue building the PyData ecosystem around the scalable array
  42. NumFOCUS
     Num(Py) Foundation for Open Code for Usable Science
     http://www.numfocus.org
