PyData Boston 2013

A description of Continuum Analytics projects focusing on the open source work. Includes descriptions of Wakari, Numba, Conda, Bokeh, Blaze, and CDX.

PyData Boston 2013 Presentation Transcript

  • Continuum Analytics: What we are doing and why. Travis Oliphant, PhD, Continuum Analytics, Inc.
  • This talk will be (mostly) about the free and/or open source software we are building. Enterprise Python, Scientific Computing, Data Processing, Data Analysis, Visualisation, Scalable Computing
    • Products
    • Training
    • Support
    • Consulting
  • Began operations in January 2012 with 5 people with big dreams
  • We are big backers of NumFOCUS and organizers of PyData Spyder
  • Our Team: Scientists, Developers, +5 contractors, +4 interns, +1 part-time, Business
  • Big Picture: expertise, insights
  • NumPy and SciPy are quite successful, thanks to the large and diverse community around them (Matplotlib, IPython, SymPy, Pandas, etc.). I estimate 1.5 million to 2 million users. Only incremental improvements are possible with these projects at this point. Thus, we needed to start new projects...
  • Related Open Source Projects
    • Blaze: High-performance Python library for modern vector computing, distributed and streaming data
    • Numba: Vectorizing Python compiler for multicore and GPU, using LLVM
    • Bokeh: Interactive, grammar-based visualization system for large datasets
    Common Thread: High-level, expressive language for domain experts; innovative compilers & runtimes for efficient, powerful data transformation
  • Conda and Anaconda
    • Cross-platform package management
    • Multiple environments allow you to have multiple versions of packages installed on one system
    • Easy app deployment
    • Taming open-source
    Free for all users. Enterprise support available!
  • Why Conda?
    • Linux users stopped complaining about Python deployment
    • I made major mistakes in management of NumPy/SciPy:
      • too much in SciPy (SciPy as distribution) --- scikits model is much better (tighter libraries developed by smaller teams)
      • gave in to community desire for binary ABI compatibility in NumPy, motivated by the difficulty of reproducing a Python install
    • Need for a cross-platform way to install major Python extensions (many with dependencies on large C or C++ libraries)
    • Python can’t be ubiquitous if people struggle to just get it and then manage it.
  • Why conda? Information Technology
  • What is Conda
    • Full package management (like yum or apt-get) but cross-platform
    • Control over environments (using hard-link farms) --- better than virtualenv. virtualenv today is like distutils and setuptools of several years ago (great at first, but you will end up hating it)
    • Architected to be able to manage any packages (R, Scala, Clojure, Haskell, Ruby, JS)
    • SAT solver to manage dependencies
    • User-definable repositories
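
As a rough illustration of the hard-link idea mentioned above (not conda's actual implementation): one copy of a package file lives in a shared cache, and each environment gets a hard link to it, so several environments share the same bytes on disk. The directory names `pkgs`, `env_a`, `env_b` and the helper `link_into_env` are hypothetical.

```python
import os
import tempfile


def link_into_env(cache_file, env_dir):
    """Hard-link one cached package file into an environment directory."""
    os.makedirs(env_dir, exist_ok=True)
    target = os.path.join(env_dir, os.path.basename(cache_file))
    os.link(cache_file, target)  # new directory entry, same file on disk
    return target


root = tempfile.mkdtemp()
cache_dir = os.path.join(root, "pkgs")  # hypothetical shared package cache
os.makedirs(cache_dir)
cached = os.path.join(cache_dir, "module.py")
with open(cached, "w") as f:
    f.write("VERSION = '1.0'\n")

# Two hypothetical environments, each linked to the same cached file.
in_a = link_into_env(cached, os.path.join(root, "env_a"))
in_b = link_into_env(cached, os.path.join(root, "env_b"))

same_file = os.path.samefile(in_a, in_b)
link_count = os.stat(cached).st_nlink  # cache entry + one link per environment
print(same_file, link_count)
```

Because the links point at one underlying file, adding another environment costs a directory entry rather than a second copy of the package.
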
  • New Features and Binstar
    • Build command from recipe --- many recipes here: https://github.com/ContinuumIO/conda-recipes
    • Upload recipes to Binstar (last mile in binary package hosting and deployment for any language).
    • “binstar in beta” is the beta code
    • Personal conda repositories --- https://conda.binstar.org/travis
    • Free Continuum Anaconda repo will be on binstar.org
    • Private packages and behind-the-firewall satellites available
    • Free build queue on Linux (Mac and Windows coming soon) for hosted conda recipes
  • Demo
    • create Python 3 environment with IPython and scipy
    • create new recipe from PyPI (yunomi)
  • Packaging and Distribution Solved
    • conda and binstar solve most of the problems that we have seen people encounter in managing Python installations (especially in large-scale institutions).
    • they are supported solutions that can remove the technology pain of managing Python
    • some problems, though, are people
  • Anaconda (open): Free enterprise-ready Python distribution of open-source tools for large-scale data processing, predictive analytics, and scientific computing
  • Anaconda Add-Ons (paid-for)
    • Revolutionary Python to GPU compiler
    • Extends Numba to take a subset of Python to the GPU (program CUDA in Python)
    • CUDA FFT / BLAS interfaces
    Fast, memory-efficient Python interface for SQL databases, NoSQL stores, Amazon S3, and large data files.
    NumPy, SciPy, scikit-learn, NumExpr compiled against Intel’s Math Kernel Library (MKL)
  • Launcher
  • Why Numba?
    • Python is too slow for loops
    • Most people are not learning C/C++/Fortran today
    • Cython is an improvement (but still verbose and needs a C compiler)
    • NVIDIA is using LLVM for the GPU
    • Many people are working with large typed containers (NumPy arrays)
    • We want to take high-level, array-oriented expressions and compile them to fast code
  • NumPy + Mamba = Numba (diagram labels: Python Function, LLVMPY, LLVM Library, Machine Code; Intel, Nvidia, Apple, AMD, ARM; OpenCL, ISPC, CUDA, CLANG, OpenMP)
  • Example Numba
  • Numba

        from numba import jit

        @jit('void(f8[:,:],f8[:,:],f8[:,:])')
        def filter(image, filt, output):
            M, N = image.shape
            m, n = filt.shape
            # Slide the filter over every pixel that has a full neighbourhood
            for i in range(m//2, M-m//2):
                for j in range(n//2, N-n//2):
                    result = 0.0
                    for k in range(m):
                        for l in range(n):
                            result += image[i+k-m//2,j+l-n//2]*filt[k, l]
                    output[i,j] = result

    ~1500x speed-up
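
A minimal usage sketch for the filter above, assuming NumPy is installed. Numba is used if it can be imported, and the function runs as plain Python otherwise. The lazy `jit(nopython=True)` form used here infers types at the first call, in place of the explicit signature string on the slide. The names `filter2d` and `compile_fn` and the test data are illustrative only.

```python
import numpy as np

try:
    from numba import jit
    compile_fn = jit(nopython=True)
except ImportError:  # fall back to plain Python so the sketch still runs
    def compile_fn(func):
        return func


@compile_fn
def filter2d(image, filt, output):
    M, N = image.shape
    m, n = filt.shape
    for i in range(m // 2, M - m // 2):
        for j in range(n // 2, N - n // 2):
            result = 0.0
            for k in range(m):
                for l in range(n):
                    result += image[i + k - m // 2, j + l - n // 2] * filt[k, l]
            output[i, j] = result


image = np.arange(36, dtype=np.float64).reshape(6, 6)
filt = np.full((3, 3), 1.0 / 9.0)  # 3x3 box (mean) filter
output = np.zeros_like(image)
filter2d(image, filt, output)

# Interior pixels hold the mean of their 3x3 neighbourhood; borders stay zero.
print(output[1, 1], image[0:3, 0:3].mean())
```
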
  • Numba changes the game! Numba turns (a subset of) Python into a “compiled language” as fast as C (but much more flexible). You don’t have to reach for C/C++. (Diagram labels: C, C++, Fortran, Python; LLVM IR; x86, ARM, PTX)
  • Laplace Example

        from numba import jit

        # Explicit loops over the interior points
        @jit('void(double[:,:], double, double)')
        def numba_update(u, dx2, dy2):
            nx, ny = u.shape
            for i in range(1, nx-1):
                for j in range(1, ny-1):
                    u[i,j] = ((u[i+1,j] + u[i-1,j]) * dy2 +
                              (u[i,j+1] + u[i,j-1]) * dx2) / (2*(dx2+dy2))

        # Same stencil written with array slices
        @jit('void(double[:,:], double, double)')
        def numbavec_update(u, dx2, dy2):
            u[1:-1,1:-1] = ((u[2:,1:-1]+u[:-2,1:-1])*dy2 +
                            (u[1:-1,2:] + u[1:-1,:-2])*dx2) / (2*(dx2+dy2))

    Adapted from http://www.scipy.org/PerformancePython, originally by Prabhu Ramachandran
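
One detail worth checking when comparing the two versions above: the loop version updates `u` in place, so each point sees already-updated neighbours (a Gauss-Seidel-style sweep), while the sliced version evaluates the whole right-hand side from the old values before assigning (a Jacobi-style sweep). A plain NumPy sketch, without Numba and with hypothetical function names and a made-up 5x5 grid, shows that a single sweep gives different numbers:

```python
import numpy as np


def loop_update(u, dx2, dy2):
    nx, ny = u.shape
    for i in range(1, nx - 1):
        for j in range(1, ny - 1):
            u[i, j] = ((u[i + 1, j] + u[i - 1, j]) * dy2 +
                       (u[i, j + 1] + u[i, j - 1]) * dx2) / (2 * (dx2 + dy2))


def sliced_update(u, dx2, dy2):
    u[1:-1, 1:-1] = ((u[2:, 1:-1] + u[:-2, 1:-1]) * dy2 +
                     (u[1:-1, 2:] + u[1:-1, :-2]) * dx2) / (2 * (dx2 + dy2))


def make_grid():
    u = np.zeros((5, 5))
    u[0, :] = 1.0  # fixed boundary value along one edge
    return u


u_loop, u_sliced = make_grid(), make_grid()
loop_update(u_loop, 1.0, 1.0)
sliced_update(u_sliced, 1.0, 1.0)

print(u_loop[1, 1], u_sliced[1, 1])  # first interior point agrees
print(u_loop[1, 2], u_sliced[1, 2])  # later points differ
```

Timing comparisons between the two forms are therefore per sweep; the intermediate values along the way are not identical.
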
  • Results of Laplace example

        Version          Time    Speed Up
        NumPy            3.19    1.0
        Numba            2.32    1.38
        Vect. Numba      2.33    1.37
        Cython           2.38    1.34
        Weave            2.47    1.29
        Numexpr          2.62    1.22
        Fortran Loops    2.30    1.39
        Vect. Fortran    1.50    2.13

    https://github.com/teoliphant/speed.git
  • LLVMPy is worth looking at. LLVM (via LLVMPy) has done much of the heavy lifting. LLVMPy = Compilers for everybody
  • What is wrong with NumPy?
    • Dtype system is difficult to extend
    • Many Dtypes needed (missing data, enums, variable-length strings)
    • Immediate mode creates huge temporaries (spawning Numexpr)
    • “Almost” an in-memory database comparable to SQLite (missing indexes)
    • Integration with sparse arrays
    • Standard structure of arrays representation...
    • Missing multi-methods
    • Optimization / Minimal support for multi-core / GPU
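
To make the “huge temporaries” point concrete: in immediate mode, `a + b * c` first allocates a full-size array for `b * c` and then another for the sum. A small NumPy sketch, with illustrative array sizes, computes the same result into one preallocated buffer using the `out=` argument of `np.multiply` and `np.add`, which avoids the extra allocation by hand:

```python
import numpy as np

n = 100_000  # illustrative size
a = np.random.rand(n)
b = np.random.rand(n)
c = np.random.rand(n)

# Immediate mode: b * c is materialised as a temporary, then a + temp
# allocates a second array for the result.
eager = a + b * c

# Manual alternative: reuse one preallocated buffer for both steps.
result = np.empty_like(a)
np.multiply(b, c, out=result)   # result = b * c, written into the buffer
np.add(a, result, out=result)   # result = a + result, in place

print(np.allclose(eager, result))
```

Writing every expression this way does not scale to real code, which is the motivation for tools that restructure the whole expression before evaluating it.
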
  • Now What? After watching NumPy and SciPy get used all over Wall Street and by many scientists / engineers in industry --- what would we do differently?
  • New Project: Blaze. Out of Core, Distributed and Optimized NumPy
  • Blaze Array or Table (diagram labels: Data Descriptor, Data Buffer, Index, Operation; NumPy, BLZ, Persistent Format, RDBMS, CSV, Data Stream)
  • Blaze Deferred Arrays: A + B*C (expression graph: “+” at the root with children A and “*”; “*” has children B and C)
    • Symbolic objects which build a graph
    • Represents deferred computation
    Usually what you have when you have a Blaze Array
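
The deferred-array idea can be sketched in a few lines of plain Python. This is not Blaze's API; the classes `Node`, `Leaf` and `Op` and the method `evaluate` are hypothetical names used only to show how `A + B*C` can build a graph first and compute later.

```python
import operator


class Node:
    """Base class: arithmetic on nodes builds a graph instead of computing."""

    def __add__(self, other):
        return Op(operator.add, "+", self, other)

    def __mul__(self, other):
        return Op(operator.mul, "*", self, other)


class Leaf(Node):
    """A named input holding concrete data (a list of numbers here)."""

    def __init__(self, name, values):
        self.name, self.values = name, values

    def evaluate(self):
        return list(self.values)

    def __repr__(self):
        return self.name


class Op(Node):
    """An interior graph node: an operator applied to two child nodes."""

    def __init__(self, func, symbol, left, right):
        self.func, self.symbol = func, symbol
        self.left, self.right = left, right

    def evaluate(self):
        # Computation happens only now, element by element.
        lhs, rhs = self.left.evaluate(), self.right.evaluate()
        return [self.func(x, y) for x, y in zip(lhs, rhs)]

    def __repr__(self):
        return "(%r %s %r)" % (self.left, self.symbol, self.right)


A = Leaf("A", [1, 2, 3])
B = Leaf("B", [10, 20, 30])
C = Leaf("C", [2, 2, 2])

expr = A + B * C          # builds the graph; nothing is computed yet
print(expr)               # (A + (B * C))
print(expr.evaluate())    # [21, 42, 63]
```

Holding the graph before evaluating is what gives a system room to fuse operations or run them where the data lives.
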
  • DataShape Type System
    • A data description language
    • A superset of NumPy’s dtype
    • Provides more flexibility
    (Diagram labels: Shape, DType, DataShape)
  • Blaze (architecture diagram labels: Database, GPU Node, NFS, Array Server x3; array+sql://, array://, file://, array://; Blaze Client, Synthesized Array/Table view; Python REPL, Scripts, Viz, Data Server; C, C++, FORTRAN, JVM languages)
  • Progress
    • Basic calculations work out-of-core (via Numba and LLVM)
    • Hard dependency on dynd and dynd-python (a dynamic C++-only multi-dimensional library like NumPy but with many improvements)
    • Persistent arrays from BLZ
    • Basic array-server functionality for layering over CSV files
    • 0.2 release in 1-2 weeks. 0.3 within a month after that (first usable release)
  • DARPA providing help. DARPA-BAA-12-38: XDATA
    • TA-1: Scalable analytics and data processing technology
    • TA-2: Visual user interface technology
  • Bokeh Plotting Library
    • Interactive graphics for the web
    • Designed for large datasets
    • Designed for streaming data
    • Native interface in Python
    • Fast JavaScript component
    • DARPA funded
    • v0.1 release imminent
  • Reasons for Bokeh
    1. Plotting must happen near the data too
    2. Quick iteration is essential => interactive visualization
    3. Interactive visualization on remote data => use the browser
    4. Almost all web plotting libraries are either:
       1. Designed for JavaScript programmers
       2. Designed to output static graphs
    5. We designed Bokeh to be dynamic graphing on the web for Python programmers
    6. Will include “Abstract” or “synthetic” rendering (working on Hadoop and Spark compatibility)
  • Wakari
    • Browser-based data analysis and visualization platform
    • WordPress / YouTube / GitHub for data analysis
    • Full Linux environment with Anaconda Python
    • Can be installed on internal clusters & servers
  • Why Wakari?
    • Data is too big to fit on your desktop
    • You need compute power but don’t have easy access to a large cluster (cloud is sitting there with lots of power)
    • Configuration of software on a new system stinks (especially a cluster).
    • Collaborative Data Analytics --- you want to build a complex technical workflow and then share it with others easily (without requiring they do painful configuration to see your results)
    • IPython Notebook is awesome --- let’s share it (but we also need the dependencies and data).
  • Wakari
    • Free account has 512 MB RAM / 10 GB disk and shared multi-core CPU
    • Easily spin up map-reduce (Disco and Hadoop clusters)
    • Use IPython Parallel on many nodes in the cloud
    • Develop GUI apps (possibly in Anaconda) and publish them easily to Wakari, based on the full power of scientific Python --- complex technical workflows (IPython notebook for now)
  • Basic Data Explorer
  • Continuum Data Explorer (CDX)
    • Open Source
    • Goal is interactivity
    • Combination of IPython REPL, Bokeh, and tables
    • Tight integration between GUI elements and REPL
    • Current features
      - Namespace viewer (mapped to IPython namespace)
      - DataTable widget with group-by, computed columns, advanced filters
      - Interactive Plots connected to tables
  • CDX
  • Conclusion: Projects circle around giving tools to experts (occasional programmers or domain experts) to enable them to move their expertise to the data to get insights (keep data where it is and move high-level but performant code). Join us or ask how we can help you!