
Scaling PyData Up and Out


Making NumPy-style and Pandas-style code faster and run in parallel. Continuum has been working on scaled versions of NumPy and Pandas for 4 years. This talk describes how Numba and Dask provide scaled Python today.



  1. 1. Scaling PyData Up and Out Travis E. Oliphant, PhD @teoliphant Python Enthusiast since 1997 Making NumPy- and Pandas-style code faster and run in parallel. PyData London 2016
  2. 2. @teoliphant Scale Up vs Scale Out 2 Big Memory & Many Cores / GPU Box Best of Both (e.g. GPU Cluster) Many commodity nodes in a cluster Scale Up (Bigger Nodes) Scale Out (More Nodes)
  3. 3. @teoliphant 3 @jakevdp "Python's Scientific Ecosystem"
  4. 4. @teoliphant 4 I spent the first 15 years of my “Python life” working to build the foundation of the Python Scientific Ecosystem as a practitioner/developer PEP 357 PEP 3118 SciPy
  5. 5. @teoliphant 5 I plan to spend the next 15 years ensuring this ecosystem can scale both up and out as an entrepreneur, evangelist, and director of innovation. CONDA NUMBA DASK DYND BLAZE DATA-FABRIC
  6. 6. @teoliphant 6 In 2012 we “took off” in this effort to form the foundation of scalable Python — Numba (scale-up) and Blaze (scale-out) were the major efforts. In early 2013, I formally “retired” from NumPy maintenance to pursue “next-generation” NumPy. Not so much a single new library, but a new emphasis on scale
  7. 7. @teoliphant 7 Rally a community “Apache” for Open Data Science Community-led and directed Conference Series with sponsorship proceeds going directly to NumFocus
  8. 8. @teoliphant 8 Organize a company Empower people to solve the world’s greatest challenges. We help people discover, analyze, and collaborate by connecting their curiosity and experience with any data. Purpose Mission
  9. 9. @teoliphant 9 Start working on key technology Blaze — blaze, odo, datashape, dynd, dask Bokeh — interactive visualization in web including Jupyter Numba — Array-oriented Python subset compiler Conda — Cross-platform package manager with environments Bootstrap and seed funding through 2014 VC funded in 2015
  10. 10. @teoliphant 10 Anaconda is… the Open Data Science Platform powered by Python… the fastest-growing open data science language • Accelerate Time-to-Value • Connect Data, Analytics & Compute • Empower Data Science Teams Create a Platform! Free and Open Source!
  11. 11. @teoliphant 11 (Anaconda platform diagram)
  • OPERATIONS: Governance, Provenance, Security
  • DATA SCIENCE LANGUAGES: Python, R, Spark | Scala, JavaScript, Java, Fortran, C/C++
  • DATA: Flat Files (CSV, XLS…), SQL DB, NoSQL, Hadoop, Streaming
  • HARDWARE: Workstation, Server, Cluster
  • APPLICATIONS: Interactive Presentations, Notebooks, Reports & Apps; Solution Templates; Visual Data Exploration; Advanced Spreadsheets; APIs
  • ANALYTICS: Data Exploration, Querying, Data Prep, Data Connectors, Visual Programming, Notebooks; Advanced Analytics: Stats, Data Mining, Deep Learning, Machine Learning, Simulation & Optimization, Geospatial, Text & NLP, Graph & Network, Image Analysis
  • SOFTWARE DEVELOPMENT: IDEs, CI/CD, Package, Dependency & Environment Management, Web & Desktop App Dev
  • HIGH PERFORMANCE: Distributed Computing, Parallelism & Multi-threading, Compiled Assets, GPUs & Multi-core, Compiler
  • DATA SCIENCE TEAM: Business Analyst, Data Scientist, Developer, Data Engineer, DevOps — Cloud or On-premises
  12. 12. @teoliphant Free Anaconda now with MKL as default 12 •Intel MKL (Math Kernel Library) provides enhanced algorithms for basic math functions. •Using MKL provides optimal performance for basic BLAS, LAPACK, FFT, and math functions. •Since version 2.5, Anaconda ships MKL as the default in the free download (you can also distribute binaries linked against these MKL-enhanced libraries).
  13. 13. @teoliphant Community maintained conda packages 13 Automatic building and testing for your recipes Awesome! conda install -c conda-forge conda config --add channels conda-forge
  14. 14. @teoliphant Promises Made 14 •In 2012 we talked about the need for parallel Pandas and Parallel NumPy (both scale-up and scale-out). •At the first PyData NYC in 2012 we described Blaze as an effort to deliver scale out and Numba as well as DyND as an effort to deliver scale-up. •We introduced them early to rally a developer community. •We can now rally a user-community around working code!
  15. 15. @teoliphant 15 Milestone success — 2016 Numba is delivering on scaling up • NumPy’s ufuncs and gufuncs on multiple CPU and GPU threads • General GPU code • General CPU code Dask is delivering on scaling out • dask.array (NumPy scaled out) • dask.dataframe (Pandas scaled out) • dask.bag (lists scaled out) • dask.delayed (general scale-out construction)
  16. 16. @teoliphant Numba + Dask Look at all of the data with Bokeh’s datashader. Decouple the data-processing from the visualization. Visualize arbitrarily large data. 16 • E.g. Open Street Map data: • About 3 billion GPS coordinates • 2012/04/01/bulk-gps-point-data/. • This image was rendered in one minute on a standard MacBook with 16 GB RAM • Renders in milliseconds on several 128GB Amazon EC2 instances
  17. 17. @teoliphant Categorical data: 2010 US Census 17 • One point per person • 300 million total • Categorized by race • Interactive rendering with Numba+Dask • No pre-tiling
  18. 18. @teoliphant Bottom Line: 10-100X faster performance • Interact with data in HDFS and Amazon S3 natively from Python • Distributed computations without the JVM & Python/Java serialization • Framework for easy, flexible parallelism using directed acyclic graphs (DAGs) • Interactive, distributed computing with in-memory persistence/caching • Leverage Python & R with Spark (diagram: Spark (PySpark & SparkR) on the JVM over YARN for batch and interactive processing, vs. the Python & R ecosystem (NumPy, Pandas, 720+ packages) with native HDFS read & write, plus Ibis/Impala and MPI for high-performance, interactive, and batch processing) 18
  19. 19. What happened to Blaze? Still going strong — just at another place in the stack (and sometimes a description of an ecosystem).
  20. 20. @teoliphant 20 Expressions Metadata Runtime
  21. 21. @teoliphant 21 + - / * ^ [] join, groupby, filter map, sort, take where, topk datashape,dtype, shape,stride hdf5,json,csv,xls protobuf,avro,... NumPy,Pandas,R, Julia,K,SQL,Spark, Mongo,Cassandra,...
  22. 22. @teoliphant APIs, syntax, language 22 Data Runtime Expressions metadata storage/containers compute datashape blaze dask odo parallelize optimize, JIT numba
  23. 23. @teoliphant Blaze 23 Interface to query data on different storage systems: from blaze import Data • CSV: iris = Data('iris.csv') • SQL: iris = Data('sqlite:///flowers.db::iris') • MongoDB: iris = Data('mongodb://localhost/mydb::iris') • JSON: iris = Data('iris.json') • S3: iris = Data('s3://blaze-data/iris.csv') … Current focus is SQL and the PyData stack for the run-time (dask, dynd, numpy, pandas, xray, etc.) plus customer-driven needs (e.g. kdb, mongo, PostgreSQL).
  24. 24. @teoliphant Blaze 24 • Select columns: iris[['sepal_length', 'species']] • Operate: log(iris.sepal_length * 10) • Reduce: iris.sepal_length.mean() • Split-apply-combine: by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean()) • Add new columns: transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width) • Text matching: iris.species.like('*versicolor') • Relabel columns: iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH') • Filter: iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
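Blaze compiles expressions like these down to backends such as pandas. As an illustration only, here are the same queries written directly in pandas against a tiny made-up iris-like frame (the data values are invented for the sketch, not taken from the real iris dataset):

```python
import pandas as pd

# A tiny stand-in for the iris table (values are illustrative only).
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 7.0, 6.4],
    "sepal_width":  [3.5, 3.0, 3.2, 3.2],
    "petal_length": [1.4, 1.4, 4.7, 4.5],
    "petal_width":  [0.2, 0.2, 1.4, 1.5],
    "species": ["Iris-setosa", "Iris-setosa",
                "Iris-versicolor", "Iris-versicolor"],
})

# Filter: the same predicate as the Blaze expression above.
setosa = iris[(iris.species == "Iris-setosa") & (iris.sepal_length > 5.0)]

# Split-apply-combine: shortest/longest/average petal length per species.
summary = iris.groupby("species").petal_length.agg(
    shortest="min", longest="max", average="mean")

# Add new columns (Blaze's transform()).
ratios = iris.assign(
    sepal_ratio=iris.sepal_length / iris.sepal_width,
    petal_ratio=iris.petal_length / iris.petal_width)
```

The point of Blaze is that the *same* expression tree can instead be executed by SQL, Mongo, or Spark backends without rewriting the query.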
  25. 25. @teoliphant 25 datashape & blaze: Blaze (like DyND) uses datashape as its type system >>> iris = Data('iris.json') >>> iris.dshape dshape("""var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }""")
  26. 26. @teoliphant Datashape 26 A structured data-description language for all kinds of data. A datashape combines dimensions with a dtype, e.g. var * 3 * { x : int32, y : string, z : float64 } — here var (variable-length) and 3 are dimensions, and the record dtype { x : int32, y : string, z : float64 } is an ordered struct: a collection of types keyed by labels. A dimension times a record dtype gives a tabular datashape; unit types include int32, float64, and string.
  27. 27. @teoliphant Datashape 27
  { flowersdb: { iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }, iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string }, irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }, irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string } }
  # Arrays of Structures
  100 * { name: string, birthday: date, address: { street: string, city: string, postalcode: string, country: string } }
  # Structure of Arrays
  { x: 100 * 100 * float32, y: 100 * 100 * float32, u: 100 * 100 * float32, v: 100 * 100 * float32 }
  # Function prototype
  (3 * int32, float64) -> 3 * float64
  # Function prototype with broadcasting dimensions
  (A... * int32, A... * int32) -> A... * int32
  # Arrays
  3 * 4 * int32
  10 * var * float64
  3 * complex[float64]
  28. 28. @teoliphant Blaze Server: lights up your dark data 28 Builds on Blaze's uniform interface to host data remotely through a JSON web API. $ blaze-server server.yaml -e localhost:6363/compute.json
  server.yaml:
  iriscsv: source: iris.csv
  irisdb: source: sqlite:///flowers.db::iris
  irisjson: source: iris.json dshape: "var * {name: string, amount: float64}"
  irismongo: source: mongodb://localhost/mydb::iris
  29. 29. @teoliphant Blaze Client 29 >>> from blaze import Data >>> t = Data('blaze://localhost:6363') >>> t.fields [u'iriscsv', u'irisdb', u'irisjson', u'irismongo'] >>> t.iriscsv sepal_length sepal_width petal_length petal_width species 0 5.1 3.5 1.4 0.2 Iris-setosa 1 4.9 3.0 1.4 0.2 Iris-setosa 2 4.7 3.2 1.3 0.2 Iris-setosa >>> t.irisdb petal_length petal_width sepal_length sepal_width species 0 1.4 0.2 5.1 3.5 Iris-setosa 1 1.4 0.2 4.9 3.0 Iris-setosa 2 1.3 0.2 4.7 3.2 Iris-setosa
  30. 30. @teoliphant Compute recipes work with existing libraries and have multiple backends — write once and run anywhere. Layer over existing data! • Python list • NumPy arrays • DyND • Pandas DataFrame • Spark, Impala, Ibis • Mongo • Dask 30 • Hint: Use odo to copy from one format and/or engine to another!
  31. 31. @teoliphant Data Fabric (pre-alpha, demo-ware) • New initiative just starting (if you can fund it, please tell me). • Generalization of the Python buffer protocol to all languages • A library to build a catalog of shared-memory chunks across the cluster holding data that can be described by datashape • A helper type library in DyND that any language can use to select appropriate analytics • Foundation for a cross-language “super-set” of the PyData stack 31
  32. 32. Brief Overview of Numba Go to Graham Markall’s Talk at 2:15 in LG6 to hear more.
  33. 33. @teoliphant Classic Example 33 from numba import jit @jit def mandel(x, y, max_iters): c = complex(x,y) z = 0j for i in range(max_iters): z = z*z + c if z.real * z.real + z.imag * z.imag >= 4: return 255 * i // max_iters return 255 Mandelbrot
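The slide's kernel is runnable as-is; here is a self-contained version that falls back to pure Python when numba is not installed (the fallback decorator is the only addition):

```python
try:
    from numba import jit
except ImportError:  # numba absent: run the identical code in pure Python
    def jit(f):
        return f

@jit
def mandel(x, y, max_iters):
    """Return an escape-time shade (0-255) for the point (x, y)."""
    c = complex(x, y)
    z = 0j
    for i in range(max_iters):
        z = z * z + c
        if z.real * z.real + z.imag * z.imag >= 4:
            return 255 * i // max_iters
    return 255

# Points far outside the Mandelbrot set escape immediately;
# points inside never escape within max_iters.
print(mandel(2.0, 2.0, 20))   # escapes on the first iteration -> 0
print(mandel(0.0, 0.0, 20))   # never escapes -> 255
```

The same function body compiles to machine code under @jit and still runs unchanged as ordinary Python, which is what makes Numba easy to adopt incrementally.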
  34. 34. @teoliphant The Basics 34 Mandelbrot benchmark: • CPython: 1x • NumPy array-wide operations: 13x • Numba (CPU): 120x • Numba (NVidia Tesla K20c): 2100x
  35. 35. @teoliphant How Numba works 35 @jit def do_math(a,b): … >>> do_math(x, y) Pipeline: Python Function + Function Arguments → Bytecode Analysis → Type Inference → Numba IR → Rewrite IR → Lowering → LLVM IR → LLVM JIT → Machine Code → Cache → Execute!
  36. 36. @teoliphant Numba Features • Numba supports: – Windows, OS X, and Linux – 32- and 64-bit x86 CPUs and NVIDIA GPUs – Python 2 and 3 – NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system. • < 70 MB to install. • Does not replace the standard Python interpreter (all of your existing Python libraries are still available) 36
  37. 37. @teoliphant How to Use Numba 1. Create a realistic benchmark test case (do not use your unit tests as a benchmark!). 2. Run a profiler on your benchmark (cProfile is a good choice). 3. Identify hotspots that could potentially be compiled by Numba with a little refactoring (see the online documentation). 4. Apply @numba.jit and @numba.vectorize as needed to critical functions (small rewrites may be needed to work around Numba limitations). 5. Re-run the benchmark to check whether there was a performance improvement. 37
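The five steps above can be sketched end to end. Here slow_sum is a made-up hotspot standing in for real application code, and the numba import is guarded so the sketch still runs where numba is absent:

```python
import cProfile

def slow_sum(n):
    """A made-up hotspot: a tight pure-Python loop."""
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

# Steps 1-2: run the realistic workload under a profiler to find hotspots.
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Step 4: apply numba.jit to the identified hotspot.
try:
    from numba import jit
    fast_sum = jit(nopython=True)(slow_sum)
except ImportError:
    fast_sum = slow_sum  # pure-Python fallback when numba is absent

# Step 5: re-run and confirm identical results before trusting any speedup.
assert fast_sum(100_000) == result
```

In real use you would inspect the profiler output (profiler.print_stats()) to pick the function worth decorating, rather than guessing.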
  38. 38. @teoliphant Example: Filter an array 38
  39. 39. @teoliphant Example: Filter an array 39 (code annotations): array allocation; looping over ndarray x as an iterator; using numpy math functions; returning a slice of the array; the Numba decorator (nopython=True not required). Result: 2.7x speedup over NumPy!
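The filtering code itself is an image in the slides; this is a sketch reconstructed from the annotations above (the function name, math, and threshold are assumptions, not the talk's actual code):

```python
import numpy as np

try:
    from numba import jit
except ImportError:  # fallback: plain-Python decorator when numba is absent
    def jit(*args, **kwargs):
        return args[0] if args and callable(args[0]) else (lambda f: f)

@jit(nopython=True)  # the Numba decorator (nopython=True not required)
def filter_above(x, threshold):
    out = np.empty_like(x)              # array allocation
    count = 0
    for v in x:                          # looping over ndarray x as an iterator
        if np.sqrt(v * v) > threshold:   # using numpy math functions (|v|)
            out[count] = v
            count += 1
    return out[:count]                   # returning a slice of the array

kept = filter_above(np.array([-3.0, 1.0, 4.0, 2.0]), 1.5)
```

Everything inside the loop lowers to machine code under nopython mode, which is where the reported speedup over array-at-a-time NumPy comes from.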
  40. 40. @teoliphant NumPy UFuncs and GUFuncs 40 NumPy ufuncs (and gufuncs) are functions that operate “element-wise” (or “sub-dimension-wise”) across an array without an explicit loop. This implicit loop (which is in machine code) is at the core of why NumPy is fast. Dispatch is done internally to a particular code segment based on the type of the array. It is a very powerful abstraction in the PyData stack. Making new fast ufuncs used to be possible only in C — painful! With numba.vectorize and numba.guvectorize it is now easy! The inner secrets of NumPy are now at your fingertips for you to make your own magic!
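A sketch of building such a ufunc with numba.vectorize (the kernel rel_diff is invented for illustration; when numba is unavailable it falls back to np.vectorize, which matches the broadcasting semantics but not the speed):

```python
import numpy as np

try:
    from numba import vectorize

    @vectorize(["float64(float64, float64)"])
    def rel_diff(a, b):
        # Scalar kernel; numba compiles it into a true NumPy ufunc.
        return (a - b) / (a + b)
except ImportError:
    # Semantically equivalent (but slow) stand-in when numba is absent.
    @np.vectorize
    def rel_diff(a, b):
        return (a - b) / (a + b)

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 2.0, 1.0])
out = rel_diff(x, y)  # broadcasts element-wise like any NumPy ufunc
```

The numba-compiled version behaves like a native ufunc: it broadcasts, accepts scalars or arrays, and the loop runs in machine code.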
  41. 41. @teoliphant Example: Making a windowed compute filter 41 Perform a computation on a finite window of the input. For a linear system, this is a FIR filter, which np.convolve or scipy.signal.lfilter can do. But what if you want some arbitrary computation?
  42. 42. @teoliphant Example: Making a windowed compute filter 42 Hand-coded implementation vs. building a ufunc for the kernel, which is faster for large arrays! This can now run easily on the GPU with ‘target=gpu’ and on many cores with ‘target=parallel’. Array-oriented!
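The slide's windowed kernel is shown as an image; here is a hedged sketch of the same idea as a gufunc, using a trailing moving mean as the per-window computation (the actual kernel from the talk is not in the transcript, so this one is illustrative):

```python
import numpy as np

try:
    from numba import guvectorize

    @guvectorize(["void(float64[:], int64, float64[:])"], "(n),()->(n)")
    def moving_mean(x, w, out):
        # Arbitrary per-window computation; here: mean of the last w samples.
        for i in range(x.shape[0]):
            lo = max(0, i - w + 1)
            out[i] = x[lo:i + 1].mean()
except ImportError:
    # Pure-NumPy stand-in with the same kernel when numba is absent.
    def moving_mean(x, w):
        out = np.empty_like(x)
        for i in range(x.shape[0]):
            lo = max(0, i - w + 1)
            out[i] = x[lo:i + 1].mean()
        return out

y = moving_mean(np.array([1.0, 2.0, 3.0, 4.0]), 2)
```

Because the window body is ordinary Python, you can swap the mean for any arbitrary computation, which is exactly what np.convolve and scipy.signal.lfilter cannot do.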
  43. 43. @teoliphant Example: Making a windowed compute filter 43
  44. 44. @teoliphant Other interesting things • CUDA Simulator to debug your code in the Python interpreter • Generalized ufuncs (@guvectorize) including GPU support and multi-core (threaded) support • Call ctypes and cffi functions directly and pass them as arguments • Support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump an HTML-annotated version of compiled code • See: 44
  45. 45. @teoliphant Recently Added Numba Features • Support for named tuples in nopython mode • Support for sets in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • JIT classes (zero-cost abstraction) • Support of (and ‘@‘ operator on Python 3.5) • Support for some of np.linalg • generated_jit (jit the functions that are the return values of the decorated function — meta-programming) • SmartArrays which can exist on host and GPU (transparent data access). • Ahead of Time Compilation • Disk-caching of pre-compiled code 45
  46. 46. Overview of Dask as a Parallel Processing Framework with Distributed
  47. 47. @teoliphant Precursors to Parallelism 47 • Consider the following approaches first: 1. Use better algorithms 2. Try Numba or C/Cython 3. Store data in efficient formats 4. Subsample your data • If you have to parallelize: 1. Start with your laptop (4 cores, 16 GB RAM, 1 TB disk) 2. Then a large workstation (24 cores, 1 TB RAM) 3. Finally, scale out to a cluster
  48. 48. @teoliphant Overview of Dask 48 Dask is a Python parallel computing library that is: • Familiar: implements parallel NumPy and Pandas objects • Fast: optimized for demanding numerical applications • Flexible: supports sophisticated and messy algorithms • Scales out: runs resiliently on clusters of 100s of machines • Scales down: pragmatic in a single process on a laptop • Interactive: responsive and fast for interactive data science Dask complements the rest of Anaconda. It was developed with NumPy, Pandas, and scikit-learn developers.
  49. 49. @teoliphant Spectrum of Parallelization 49 Explicit control (fast but hard) → implicit control (restrictive but easy): Threads → Processes → MPI → ZeroMQ → Dask → Hadoop → Spark → SQL (Hive, Pig, Impala)
  50. 50. @teoliphant Dask: From User Interaction to Execution 50
  51. 51. @teoliphant Dask Collections: Familiar Expressions and API 51 • Dask array (mimics NumPy): x.T - x.mean(axis=0) • Dask dataframe (mimics Pandas): df.groupby(df.index).value.mean() • Dask imperative (wraps custom code): def load(filename): … def clean(data): … def analyze(result): … • Dask bag (collection of data)
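Under the hood, each collection compiles to dask's task-graph format: a plain dict mapping keys to literals or to tuples of a callable and its arguments, where arguments may name other keys. A toy sketch of that format with a minimal recursive evaluator (toy_get is a hypothetical stand-in; dask's real schedulers add parallelism, shared-intermediate handling, and resilience):

```python
from operator import add, mul

# A dask-style graph: key -> literal, or key -> (callable, *args),
# where args may themselves be keys of the graph.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),
    "result": (mul, "sum", 10),
}

def toy_get(dsk, key):
    """Recursively evaluate one key of a dask-style task graph."""
    value = dsk[key]
    if isinstance(value, tuple) and callable(value[0]):
        func, *args = value
        return func(*(toy_get(dsk, a) if a in dsk else a for a in args))
    return value

print(toy_get(graph, "result"))  # (1 + 2) * 10 -> 30
```

Because graphs are just dicts, any tool can build, inspect, or optimize them before handing them to a scheduler, which is what dask.array, dask.dataframe, and dask.delayed all do.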
  52. 52. @teoliphant Dask Graphs: Example Machine Learning Pipeline 52
  53. 53. @teoliphant Dask Graphs: Example Machine Learning Pipeline + Grid Search 53
  54. 54. @teoliphant • simple: easy-to-use API • flexible: perform a lot of actions with a minimal amount of code • fast: dispatching to run-time engines & cython • database-like: familiar ops • interop: integration with the PyData Stack 54 (((A + 1) * 2) ** 3)
  55. 55. @teoliphant 55 (B - B.mean(axis=0)) + (B.T / B.std())
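Evaluated eagerly with NumPy, the expression on this slide is straightforward; a dask.array holding B evaluates the same expression lazily, chunk by chunk. A minimal sketch (B is assumed square here so that B.T aligns with B under broadcasting):

```python
import numpy as np

# B must be square for B.T to broadcast against B (an assumption of this sketch).
B = np.arange(9.0).reshape(3, 3)

# The slide's expression, evaluated eagerly with NumPy...
expected = (B - B.mean(axis=0)) + (B.T / B.std())

# ...is what dask.array would evaluate lazily over chunks, roughly:
#   import dask.array as da
#   Bd = da.from_array(B, chunks=(3, 3))
#   ((Bd - Bd.mean(axis=0)) + (Bd.T / Bd.std())).compute()
```

The dask version builds a task graph of per-chunk operations (including tree reductions for mean and std) and only materializes results on .compute().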
  56. 56. @teoliphant Dask Schedulers: Example - Distributed Scheduler 56 (diagram: a Client on the user machine (laptop) connects over the network to the Scheduler, which dispatches tasks to many Workers)
  57. 57. @teoliphant Cluster Architecture Diagram 57 Client Machine Compute Node Compute Node Compute Node Head Node
  58. 58. @teoliphant Using Anaconda and Dask on your Cluster 58 • Single machine with multiple threads or processes • On a cluster with SSH (dcluster) • Resource management: YARN (knit), SGE, Slurm • On the cloud with Amazon EC2 (dec2) • On a cluster with Anaconda for cluster management • Manage multiple conda environments and packages on bare-metal or cloud-based clusters
  59. 59. @teoliphant Scheduler Visualization with Bokeh 59
  60. 60. @teoliphant Examples 60 1. Analyzing NYC Taxi CSV data using distributed Dask DataFrames • Demonstrate Pandas at scale • Observe responsive user interface 2. Distributed language processing with text data using Dask Bags • Explore data using a distributed memory cluster • Interactively query data using libraries from Anaconda 3. Analyzing global temperature data using Dask Arrays • Visualize complex algorithms • Learn about dask collections and tasks 4. Handle custom code and workflows using Dask Imperative • Deal with messy situations • Learn about scheduling
  61. 61. @teoliphant Example 1: Using Dask DataFrames on a cluster with CSV data 61 • Built from Pandas DataFrames • Match Pandas interface • Access data from HDFS, S3, local, etc. • Fast, low latency • Responsive user interface (diagram: one Pandas DataFrame per month, January 2016 through May 2016, combined into one Dask DataFrame)
  62. 62. @teoliphant Example 2: Using Dask Bags on a cluster with text data 62 • Distributed natural language processing with text data stored in HDFS • Handles standard computations • Looks like other parallel frameworks (Spark, Hive, etc.) • Access data from HDFS, S3, local, etc. • Handles the common case (diagram: a function mapped over partitions of data, with results merged)
  63. 63. @teoliphant Example 3: Using Dask Arrays with global temperature data 63 • Built from NumPy n-dimensional arrays (diagram: many NumPy arrays tiled into one Dask Array) • Matches NumPy interface (subset) • Solve medium-large problems • Complex algorithms
  64. 64. @teoliphant Example 4: Using Dask Delayed to handle custom workflows 64 • Manually handle functions to support messy situations • Life saver when collections aren't flexible enough • Combine futures with collections for the best of both worlds • Scheduler provides resilient and elastic execution
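A minimal sketch of dask.delayed wiring the load/clean/analyze pattern from the collections slide into a task graph (the function bodies are invented for illustration; a tiny eager stand-in is used when dask is not installed):

```python
try:
    from dask import delayed
except ImportError:
    # Eager stand-in when dask is absent: call immediately, but keep a
    # .compute() method so the call site below works either way.
    class _Now:
        def __init__(self, value):
            self.value = value
        def compute(self):
            return self.value

    def delayed(func):
        def wrapper(*args):
            args = [a.value if isinstance(a, _Now) else a for a in args]
            return _Now(func(*args))
        return wrapper

@delayed
def load(n):
    return list(range(n))       # stand-in for reading messy input

@delayed
def clean(data):
    return [v for v in data if v % 2 == 0]

@delayed
def analyze(result):
    return sum(result)

# With dask, nothing runs until .compute(): the calls just build a graph,
# which the scheduler then executes (in parallel where possible).
total = analyze(clean(load(10))).compute()
```

This is the escape hatch for workflows that do not fit dask.array or dask.dataframe: any plain Python functions can be chained into a graph and handed to the same schedulers.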
  65. 65. @teoliphant Scale Up vs Scale Out 65 Big Memory & Many Cores / GPU Box Best of Both (e.g. GPU Cluster) Many commodity nodes in a cluster Scale Up (Bigger Nodes) Scale Out (More Nodes) Numba Dask Blaze
  66. 66. @teoliphantContribute to the next stage of the journey 66 Numba, Blaze and the Blaze ecosystem (dask, odo, dynd, datashape, data-fabric, and beyond…)