• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Large-scale Array-oriented Computing with Python
 

Large-scale Array-oriented Computing with Python

on

  • 7,570 views

Keynote for Day 1 of PyCon Taiwan 2012, by Travis Oliphant

Keynote for Day 1 of PyCon Taiwan 2012, by Travis Oliphant
http://tw.pycon.org/2012/speaker/

Statistics

Views

Total Views
7,570
Views on SlideShare
7,567
Embed Views
3

Actions

Likes
6
Downloads
141
Comments
0

2 Embeds 3

https://si0.twimg.com 2
https://twitter.com 1

Accessibility

Categories

Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment

    Large-scale Array-oriented Computing with Python Large-scale Array-oriented Computing with Python Presentation Transcript

    • Large-scale array-oriented computing with Python Travis E. Oliphant PyCon Taiwan, June 9, 2012Friday, June 8, 12
    • My RootsFriday, June 8, 12
    • My Roots Images from BYU Mers LabFriday, June 8, 12
    • Science led to Python 2 ⇢0 (2⇡f ) Ui (a, f ) = [Cijkl (a, f ) Uk,l (a, f )],j Raja Muthupillai Richard Ehman 1997 Armando ManducaFriday, June 8, 12
    • Finding derivatives of 5-d data ⌅=r⇥UFriday, June 8, 12
    • Scientist at heartFriday, June 8, 12
    • Python origins. http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html Version Date 0.9.0 Feb. 1991 0.9.4 Dec. 1991 0.9.6 Apr. 1992 0.9.8 Jan. 1993 1.0.0 Jan. 1994 1.2 Apr. 1995 1.4 Oct. 1996 1.5.2 Apr. 1999Friday, June 8, 12
    • Python origins. http://python-history.blogspot.com/2009/01/brief-timeline-of-python.html Version Date 0.9.0 Feb. 1991 0.9.4 Dec. 1991 0.9.6 Apr. 1992 0.9.8 Jan. 1993 1.0.0 Jan. 1994 1.2 Apr. 1995 1.4 Oct. 1996 1.5.2 Apr. 1999Friday, June 8, 12
    • Brief History Person Package Year Matrix Object Jim Fulton 1994 in Python Jim Hugunin Numeric 1995 Perry Greenfield, Rick White, Todd Miller Numarray 2001 Travis Oliphant NumPy 2005Friday, June 8, 12
    • 1999 : Early SciPy emerges Discussions on the matrix-sig from 1997 to 1999 wanting a complete data analysis environment: Paul Barrett, Joe Harrington, Perry Greenfield, Paul Dubois, Konrad Hinsen, and others. Activity in 1998, led to increased interest in 1999. In response on 15 Jan, 1999, I posted to matrix-sig a list of routines I felt needed to be present and began wrapping / writing in earnest. On 6 April 1999, I announced I would be creating this uber-package which eventually became SciPy Gaussian quadrature 5 Jan 1999 cephes 1.0 30 Jan 1999 Plotting?? sigtools 0.40 23 Feb 1999 Numeric docs cephes 1.1 March 1999 9 Mar 1999 Gist multipack 0.3 13 Apr 1999 XPLOT Helper routines 14 Apr 1999 DISLIN multipack 0.6 (leastsq, ode, fsolve, quad) 29 Apr 1999 Gnuplot sparse plan described 30 May 1999 multipack 0.7 SparsePy 0.1 14 Jun 1999 5 Nov 1999 Helping with f2py cephes 1.2 (vectorize) 29 Dec 1999Friday, June 8, 12
    • SciPy 2001 Travis Oliphant optimize sparse interpolate integrate special signal stats Founded in 2001 with Travis Vaught fftpack misc Eric Jones weave cluster Pearu Peterson GA* linalg interpolate f2pyFriday, June 8, 12
    • Community effort • Chuck Harris • Pauli Virtanen • David Cournapeau • Stefan van der Walt • Dag Sverre Seljebotn • Robert Kern • Warren Weckesser • Ralf Gommers • Mark Wiebe • Nathaniel SmithFriday, June 8, 12
    • Why Python for Technical Computing • Syntax (it gets out of your way) • Over-loadable operators • Complex numbers built-in early • Just enough language support for arrays • “Occasional” programmers can grok it • Supports multiple programming styles • Expert programmers can also use it effectively • Has a simple, extensible implementation • General-purpose language --- can build a system • Critical massFriday, June 8, 12
    • What is wrong with Python? • Packaging is still not solved well (distribute, pip, and distutils2 don’t cut it) • Missing anonymous blocks • The CPython run-time is aged and needs an overhaul (GIL, global variables, lack of dynamic compilation support) • No approach to language extension except for “import hooks” (lightweight DSL need) • The distraction of multiple run-times... • Array-oriented and NumPy not really understood by most Python devs.Friday, June 8, 12
    • Putting Science back in Comp Sci • Much of the software stack is for systems programming --- C++, Java, .NET, ObjC, web - Complex numbers? - Vectorized primitives? • Array-oriented programming has been supplanted by Object-oriented programming • Software stack for scientists is not as helpful as it should be • Fortran is still where many scientists end upFriday, June 8, 12
    • Array-Oriented Computing Example1: Fibonacci Numbers fn = fn 1 + fn 2 f0 = 0 f1 = 1 f = 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . .Friday, June 8, 12
    • Common Python approaches Recursive Iterative Algorithm matters!!Friday, June 8, 12
    • Array-oriented approaches Using Formula Using LFilterFriday, June 8, 12
    • Array-oriented approachesFriday, June 8, 12
    • NumPy: an Array-Oriented Extension • Data: the array object – slicing and shaping – data-type map to Bytes • Fast Math: – vectorization – broadcasting – aggregationsFriday, June 8, 12
    • NumPy Array shapeFriday, June 8, 12
    • Zen of NumPy • strided is better than scattered • contiguous is better than strided • descriptive is better than imperative • array-oriented is better than object-oriented • broadcasting is a great idea • vectorized is better than an explicit loop • unless it’s too complicated --- then use Cython/Numba • think in higher dimensionsFriday, June 8, 12
    • More NumPy DemonstrationFriday, June 8, 12
    • Conway’s game of Life • Dead cell with exactly 3 live neighbors will come to life • A live cell with 2 or 3 neighbors will survive • With too few or too many neighbors, the cell diesFriday, June 8, 12
    • Interesting Patterns emergeFriday, June 8, 12
    • APL : the first array-oriented language • Appeared in 1964 • Originated by Ken Iverson • Direct descendants (J, K, Matlab) are still used heavily and people pay a lot of money for them APL • NumPy is a descendent J K Matlab Numeric NumPyFriday, June 8, 12
    • Conway’s Game of Life APL NumPy Initialization Update StepFriday, June 8, 12
    • Demo Python Version Array-oriented NumPy VersionFriday, June 8, 12
    • Memory using Object-oriented Object Object Object Attr1 Attr1 Attr1 Attr2 Attr2 Attr2 Attr3 Attr3 Attr3 Object Attr1 Object Attr2 Attr1 Object Attr3 Attr2 Attr1 Attr3 Attr2 Attr3Friday, June 8, 12
    • Array-oriented (Table) approach Attr1 Attr2 Attr3 Object1 Object2 Object3 Object4 Object5 Object6Friday, June 8, 12
    • Benefits of Array-oriented • Many technical problems are naturally array- oriented (easy to vectorize) • Algorithms can be expressed at a high-level • These algorithms can be parallelized more simply (quite often much information is lost in the translation to typical “compiled” languages) • Array-oriented algorithms map to modern hard-ware caches and pipelines.Friday, June 8, 12
    • We need more focus on complied array-oriented languages with fast compilers!Friday, June 8, 12
    • What is good about NumPy? • Array-oriented • Extensive Dtype System (including structures) • C-API • Simple to understand data-structure • Memory mapping • Syntax support from Python • Large community of users • Broadcasting • Easy to interface C/C++/Fortran codeFriday, June 8, 12
    • What is wrong with NumPy • Dtype system is difficult to extend • Immediate mode creates huge temporaries (spawning Numexpr) • “Almost” an in-memory data-base comparable to SQL-lite (missing indexes) • Integration with sparse arrays • Lots of un-optimized parts • Minimal support for multi-core / GPU • Code-base is organic and hard to extendFriday, June 8, 12
    • Improvements needed • NDArray improvements • Indexes (esp. for Structured arrays) • SQL front-end • Multi-level, hierarchical labels • selection via mappings (labeled arrays) • Memory spaces (array made up of regions) • Distributed arrays (global array) • Compressed arrays • Standard distributed persistance • fancy indexing as view and optimizations • streaming arraysFriday, June 8, 12
    • Improvements needed • Dtype improvements • Enumerated types (including dynamic enumeration) • Derived fields • Specification as a class (or JSON) • Pointer dtype (i.e. C++ object, or varchar) • Finishing datetime • Missing data with bit-patterns • Parameterized field namesFriday, June 8, 12
    • Example of Object-defined Dtype @np.dtype class Stock(np.DType): symbol = np.Str(4) open = np.Int(2) close = np.Int(2) high = np.Int(2) low = np.Int(2) @np.Int(2) def mid(self): return (self.high + self.low) / 2.0Friday, June 8, 12
    • Improvements needed • Ufunc improvements • Generalized ufuncs support more than just contiguous arrays • Specification of ufuncs in Python • Move most dtype “array functions” to ufuncs • Unify error-handling for all computations • Allow lazy-evaluation and remote computation --- streaming and generator data • Structured and string dtype ufuncs • Multi-core and GPU optimized ufuncs • Group-by reductionFriday, June 8, 12
    • More Improvements needed • Miscellaneous improvements • ABI-management • Eventual Move to library (NDLib)? • Integration with LLVM • Sparse dimensions • Remote computation • Fast I/O for CSV and Excel • Out-of-core calculations • Delayed-mode executionFriday, June 8, 12
    • New Project NumPy Blaze Next Generation NumPy Out-of-core Distributed TablesFriday, June 8, 12
    • Blaze Main Features • New ndarray with multiple memory segments • Distributed ndtable which can span the world • Fast, out-of-core algorithms for all functions • Delayed-mode execution: expressions build up graph which gets executed where the data is • Built-in Indexes (beyond searchsorted) • Built-in labels (data-array) • Sparse dimensions (defined by attributes or elements of another dimension) • Direct adapters to all data (move code to data)Friday, June 8, 12
    • Delayed execution Demo Code OnlyFriday, June 8, 12
    • Dimensions defined by Attributes dim1 Day Month Year High Low 15 3 2012 30 20 16 3 2012 35 25 20 3 2012 40 30 21 3 2012 41 29Friday, June 8, 12
    • Outline NDTable NDArray Domain GFunc DType BytesFriday, June 8, 12
    • NDTable (Example) Proc0 Proc1 Proc2 Proc3 Each Partition: • Remote Proc0 Proc1 Proc2 Proc3 • Expression • NDArray Proc0 Proc1 Proc2 Proc3 Proc4 Proc4 Proc4 Proc4Friday, June 8, 12
    • Data URLs • Variables in script are global addresses (DATA URLs). All the world’s data you can see via web can be in used as part of an algorithm by referencing it as a part of an array. • Dynamically interpret bytes as data-type • Scheduler will push code based on data-type to the data instead of pulling data to the code.Friday, June 8, 12
    • Overview Processing Code Node Processing Code Node Main Code Processing Script Node Code Processing Processing Node NodeFriday, June 8, 12
    • NDArray • Local ndarray (NumPy++) • Multiple byte-buffers (streaming or random access) • Variable-length arrays • All kinds of data-types (everything...) • Multiple patterns of memory access possible (Z-order, Fortran-order, C-order) • Sparse dimensionsFriday, June 8, 12
    • GFunc • Generalized Function • All NumPy functions • element-by-element • linear algebra • manipulation • Fourier Transform • Iteration and Dispatch to low-level kernels • Kernels can be written in anything that builds a C-like interfaceFriday, June 8, 12
    • Early Timeline Date Milestone July 2012 Pre-alpha release December 2012 Early Beta Release June 2013 Version 1.0Friday, June 8, 12
    • PyData All computing modules known to work with Blaze will be placed under PyData umbrella of projects over the coming years.Friday, June 8, 12
    • Introducing Numba (lots of kernels to write)Friday, June 8, 12
    • NumPy Users • Want to be able to write Python to get fast code that works on arrays and scalars • Need access to a boat-load of C-extensions (NumPy is just the beginning) PyPy doesn’t cut it for us!Friday, June 8, 12
    • Friday, June 8, 12 Ufuncs Generalized UFuncs Python Function Window Kernel Funcs Function- based Indexing Memory Dynamic compilation Filters Dynamic Compilation NumPy Runtime I/O Filters Reduction Filters Computed Columns function pointer
    • SciPy needs a Python compiler optimize integrate special ode writing more of SciPy at high-levelFriday, June 8, 12
    • Numba -- a Python compiler • Replays byte-code on a stack with simple type- inference • Translates to LLVM (using LLVM-py) • Uses LLVM for code-gen • Resulting C-level function-pointer can be inserted into NumPy run-time • Understands NumPy arrays • Is NumPy / SciPy awareFriday, June 8, 12
    • NumPy + Mamba = Numba Python Function Machine Code LLVM-PY LLVM 3.1 ISPC OpenCL OpenMP CUDA CLANG Intel AMD Nvidia AppleFriday, June 8, 12
    • ExamplesFriday, June 8, 12
    • ExamplesFriday, June 8, 12
    • Software Stack Future? Plateaus of Code re-use + DSLs SQL R TDPL Matlab Python OBJC C FORTRAN C++ LLVMFriday, June 8, 12
    • How to pay for all this?Friday, June 8, 12
    • Dual strategy BlazeFriday, June 8, 12
    • NumFOCUS Num(Py) Foundation for Open Code for Usable ScienceFriday, June 8, 12
    • NumFOCUS • Mission • To initiate and support educational programs furthering the use of open source software in science. • To promote the use of high-level languages and open source in science, engineering, and math research • To encourage reproducible scientific research • To provide infrastructure and support for open source projects for technical computingFriday, June 8, 12
    • NumFOCUS • Activites • Sponsor sprints and conferences • Provide scholarships and grants for people using these tools • Pay for documentation development and basic course development • Fund continuous integration and build systems • Work with domain-specific organizations • Raise funds from industries using Python and NumPyFriday, June 8, 12
    • NumFOCUS Core Projects NumPy SciPy IPython Matplotlib Other Projects (seeking more --- need representatives) Scikits ImageFriday, June 8, 12
    • NumFOCUS • Directors • Perry Greenfield • John Hunter • Jarrod Millman • Travis Oliphant • Fernando Perez • Members • Basically people who donate for now. In time, a body that elects directors.Friday, June 8, 12
    • • Large-scale data analysis products • Python and NumPy training • NumPy support and consulting • Rich-client or web user-interfaces • Blaze and PyData DevelopmentFriday, June 8, 12