
Scale up and Scale Out Anaconda and PyData

With Anaconda (in particular Numba and Dask) you can scale up your NumPy and Pandas stack to many cpus and GPUs as well as scale-out to run on clusters of machines including Hadoop.



  1. 1. Scale Up and Scale Out with the Anaconda Platform Travis Oliphant CEO
  2. 2. Travis Oliphant, PhD — About me • PhD 2001 from Mayo Clinic in Biomedical Engineering • MS/BS degrees in Elec. Comp. Engineering from BYU • Created SciPy (1999-2009) • Professor at BYU (2001-2007) • Author and Principal Dev of NumPy (2005-2012) • Started Numba (2012) • Founding Chair of NumFocus / PyData • Former PSF Director (2015) • Founder of Continuum Analytics in 2012. 2 SciPy
  3. 3. 3 Anaconda enables Scale Up and Scale Out. Vertical Scaling (Bigger Nodes): a big-memory, many-core/GPU box. Horizontal Scaling (More Nodes): many commodity nodes in a cluster. Best of both: e.g. a GPU cluster.
  4. 4. 4 Anaconda enables Scale Up and Scale Out. Vertical Scaling (Bigger Nodes): Numba, DyND, Anaconda + MKL. Horizontal Scaling (More Nodes): Dask, Blaze, conda, Anaconda Inside Hadoop.
  5. 5. © 2015 Continuum Analytics- Confidential & Proprietary Open Source Communities Creates Powerful Technology for Data Science 5 Numba dask xlwings Airflow Blaze Distributed Systems Business Intelligence Machine Learning / Statistics Web Scientific Computing / HPC
  6. 6. © 2015 Continuum Analytics- Confidential & Proprietary Python is the Common Language 6 Numba dask xlwings Airflow Blaze Distributed Systems Business Intelligence Machine Learning / Statistics Web Scientific Computing / HPC
  7. 7. © 2015 Continuum Analytics- Confidential & Proprietary Not the Only One… 7 SQL Distributed Systems Business Intelligence Machine Learning / Statistics Web Scientific Computing / HPC
  8. 8. © 2015 Continuum Analytics- Confidential & Proprietary But it’s also a Great Glue Language 8 SQL Distributed Systems Business Intelligence Machine Learning / Statistics Web Scientific Computing / HPC
  9. 9. © 2015 Continuum Analytics- Confidential & Proprietary Anaconda is the Open Data Science Platform Bringing Technology Together… 9 Numba dask Airflow SQL xlwings Blaze Distributed Systems Business Intelligence Machine Learning / Statistics Web Scientific Computing / HPC
  10. 10. 10 • Package, dependency and environment manager • Language agnostic (Python, R, Java, C, FORTRAN…) • Cross-platform (Windows, OS X, Linux) $ conda install python=2.7 $ conda install pandas $ conda install -c r r $ conda install mongodb Conda
  11. 11. Where packages, notebooks, and environments are shared. Powerful collaboration and package management for open source and private projects. Public projects and notebooks are always free. REGISTER TODAY! ANACONDA.ORG
  12. 12. SCALE UP
  13. 13. 13 Anaconda now with MKL as default • Intel MKL (Math Kernel Library) provides optimized implementations of basic math functions. • Using MKL gives optimal performance for BLAS, LAPACK, FFT, and other core math routines. • Version 2.5 of the free Anaconda download ships with MKL as the default.
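A quick way to check whether the NumPy in a given environment is actually linked against MKL (a small sketch, not from the slides) is to inspect NumPy's build configuration:

     import numpy as np

     # The blas_opt_info / lapack_opt_info sections list the linked libraries;
     # an MKL-backed build mentions the mkl libraries here.
     np.show_config()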
  14. 14. Space of Python Compilation 14. Relies on CPython / libpython: Ahead Of Time (Cython, Shedskin, Nuitka today, Pythran, Numba) and Just In Time (Numba, HOPE, Theano, Pyjion). Replaces CPython / libpython: Ahead Of Time (Nuitka someday) and Just In Time (Pyston, PyPy).
  15. 15. Compiler overview 15: language frontends parse C++, C, Fortran, and ObjC into an Intermediate Representation (IR); code-generation backends turn the IR into machine code for x86, ARM, and PTX.
  16. 16. 16 Numba plays the parsing-frontend role for Python, producing LLVM Intermediate Representation (IR); LLVM provides the code-generation backends for x86, ARM, and PTX.
  17. 17. Example 17 Numba
  18. 18. 18 (~1500x speed-up)
     from numba import jit

     @jit('void(f8[:,:],f8[:,:],f8[:,:])')
     def filter(image, filt, output):
         M, N = image.shape
         m, n = filt.shape
         for i in range(m//2, M-m//2):
             for j in range(n//2, N-n//2):
                 result = 0.0
                 for k in range(m):
                     for l in range(n):
                         result += image[i+k-m//2, j+l-n//2] * filt[k, l]
                 output[i, j] = result
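A hypothetical driver for the kernel above (not from the slides): allocate float64 arrays matching the declared signature and call the function; the first call triggers compilation, later calls run at native speed.

     import numpy as np

     image = np.random.random((2000, 2000))     # f8[:,:] as declared in the signature
     filt = np.ones((15, 15)) / 225.0           # simple averaging filter
     output = np.zeros_like(image)

     filter(image, filt, output)                # compiles on first call, then runs fast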
  19. 19. How Numba works 19 Bytecode Analysis Python Function Function Arguments Type Inference Numba IR LLVM IR Machine Code @jit def do_math(a,b): … >>> do_math(x, y) Cache Execute! Rewrite IR Lowering LLVM JIT
  20. 20. Numba Features 20 • Numba supports: Windows, OS X, and Linux 32 and 64-bit x86 CPUs and NVIDIA GPUs Python 2 and 3 NumPy versions 1.6 through 1.9 • Does not require a C/C++ compiler on the user’s system. • < 70 MB to install. • Does not replace the standard Python interpreter
 (all of your existing Python libraries are still available)
  21. 21. Numba Modes 21 • object mode: Compiled code operates on Python objects. Only significant performance improvement is compilation of loops that can be compiled in nopython mode (see below). • nopython mode: Compiled code operates on “machine native” data. Usually within 25% of the performance of equivalent C or FORTRAN.
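A minimal sketch (not from the slides) contrasting the two modes in the Numba of this era: the nopython-mode function operates purely on machine-native data, while the dict in the second function forces an object-mode fallback around a still-compiled numeric call.

     import numpy as np
     from numba import jit

     @jit(nopython=True)               # nopython mode: machine-native data only
     def dot(a, b):
         s = 0.0
         for i in range(a.shape[0]):
             s += a[i] * b[i]
         return s

     @jit                              # falls back to object mode because of the dict,
     def summarize(a, b):              # but the hot loop in dot() is still compiled
         return {"dot": dot(a, b), "n": a.shape[0]}

     a = np.random.random(1000)
     b = np.random.random(1000)
     summarize(a, b)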
  22. 22. How to Use Numba 22 1. Create a realistic benchmark test case.
 (Do not use your unit tests as a benchmark!) 2. Run a profiler on your benchmark.
 (cProfile is a good choice) 3. Identify hotspots that could potentially be compiled by Numba with a little refactoring.
 (see rest of this talk and online documentation) 4. Apply @numba.jit and @numba.vectorize as needed to critical functions. 
 (Small rewrites may be needed to work around Numba limitations.) 5. Re-run benchmark to check if there was a performance improvement.
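A hedged sketch of that workflow (all names below are invented for illustration): profile a realistic benchmark, compile the hotspot it points at, then re-run.

     import cProfile
     import numpy as np
     import numba

     def hotspot(a):                        # the function the profile points at
         s = 0.0
         for i in range(a.shape[0]):
             s += a[i] * a[i]
         return s

     def benchmark():                       # step 1: a realistic benchmark
         a = np.random.random(2000000)
         for _ in range(20):
             hotspot(a)

     cProfile.run("benchmark()", sort="cumulative")   # step 2: find the hot functions

     hotspot = numba.jit(nopython=True)(hotspot)      # step 4: compile the hotspot
     benchmark()                                      # step 5: re-run and compare timings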
  23. 23. The Basics 23
  24. 24. The Basics 24: array allocation, looping over ndarray x as an iterator, using numpy math functions, returning a slice of the array. 2.7x speedup over NumPy with just the Numba decorator (nopython=True not required).
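The code on that slide is an image and is not in this transcript; the following is a hedged reconstruction that matches the bullets, not the original listing.

     import numpy as np
     from numba import jit

     @jit                                # plain @jit; nopython=True not required here
     def smooth(x):
         out = np.empty_like(x)          # array allocation
         i = 0
         for v in x:                     # looping over ndarray x as an iterator
             out[i] = np.sqrt(np.abs(v)) # using numpy math functions
             i += 1
         return out[1:-1]                # returning a slice of the array

     smooth(np.linspace(-1.0, 1.0, 1000))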
  25. 25. Calling Other Functions 25
  26. 26. Calling Other Functions 26 This function is not inlined This function is inlined 9.8x speedup compared to doing this with numpy functions
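The compared listings are images on the slide; here is a small sketch in the same spirit (names invented): a jitted helper called from another jitted function is compiled into the caller and is typically inlined by LLVM.

     import numpy as np
     from numba import jit

     @jit(nopython=True)
     def clamp(v, lo, hi):                  # small helper, a good inlining candidate
         if v < lo:
             return lo
         if v > hi:
             return hi
         return v

     @jit(nopython=True)
     def clamp_all(x, lo, hi):
         out = np.empty_like(x)
         for i in range(x.shape[0]):
             out[i] = clamp(x[i], lo, hi)   # direct call into another compiled function
         return out

     clamp_all(np.random.randn(10000), -1.0, 1.0)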
  27. 27. Making Ufuncs 27
  28. 28. Making Ufuncs 28 Monte Carlo simulating 500,000 tournaments in 50 ms
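The tournament simulation itself is not in the transcript; here is a minimal @vectorize sketch (an invented example) showing how Numba turns a scalar function into a NumPy ufunc.

     import numpy as np
     from numba import vectorize

     @vectorize(['float64(float64, float64)'])
     def rel_diff(a, b):                  # written for scalars, compiled into a ufunc
         return (a - b) / (a + b)

     x = np.random.random(1000000)
     y = np.random.random(1000000)
     rel_diff(x, y)                       # broadcasts and loops like any NumPy ufunc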
  29. 29. Case study: j0 from scipy.special 29 • scipy.special was one of the first libraries I wrote (in 1999) • extended the “umath” module by adding new “universal functions” to compute many scientific functions by wrapping C and Fortran libs. • Bessel functions are solutions to the differential equation $x^2 \frac{d^2 y}{dx^2} + x \frac{dy}{dx} + (x^2 - \alpha^2) y = 0$, with $y = J_\alpha(x)$ and $J_n(x) = \frac{1}{\pi} \int_0^{\pi} \cos(n\tau - x \sin\tau)\, d\tau$
  30. 30. scipy.special.j0 wraps the cephes algorithm 30. Don’t need this anymore!
  31. 31. Result --- equivalent to compiled code 31 In [6]: %timeit vj0(x) 10000 loops, best of 3: 75 us per loop In [7]: from scipy.special import j0 In [8]: %timeit j0(x) 10000 loops, best of 3: 75.3 us per loop But! Now code is in Python and can be experimented with more easily (and moved to the GPU / accelerator more easily)!
  32. 32. Numba is very popular! 32 A numba mailing-list post reports an experiment by a SciPy author who got a 2x speed-up by removing their Cython type annotations and surrounding the function with numba.jit (with a few minor changes to the code). With Numba’s ahead-of-time compilation one can legitimately use Numba to create a library that you ship to others (who then don’t need to have Numba installed, or just need a Numba run-time installed). SciPy (and NumPy) would look very different if Numba had existed 16 years ago when SciPy was getting started, and you would all be happier.
  33. 33. Releasing the GIL 33 Only nopython mode functions can release the GIL
  34. 34. Releasing the GIL 34 2.8x speedup with 4 cores
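The benchmark code is an image on the slide; a hedged sketch of the pattern (function and sizes invented): a nopython-mode function compiled with nogil=True can run concurrently on several threads, which is where a ~2.8x speedup on 4 cores can come from.

     from concurrent.futures import ThreadPoolExecutor
     import numpy as np
     from numba import jit

     @jit(nopython=True, nogil=True)        # releases the GIL while the loop runs
     def chunk_sum(x):
         s = 0.0
         for i in range(x.shape[0]):
             s += x[i]
         return s

     x = np.random.random(4000000)
     chunks = np.array_split(x, 4)          # one chunk per thread
     with ThreadPoolExecutor(max_workers=4) as pool:
         total = sum(pool.map(chunk_sum, chunks))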
  35. 35. CUDA Python (in open-source Numba!) 35 CUDA Development using Python syntax for optimal performance! You have to understand CUDA at least a little — writing kernels that launch in parallel on the GPU
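A small hedged CUDA-Python sketch (an invented elementwise kernel, not the Black-Scholes code on the next slide); it needs an NVIDIA GPU and the CUDA toolkit to run.

     import numpy as np
     from numba import cuda

     @cuda.jit
     def scale(out, x, factor):
         i = cuda.grid(1)                   # global thread index
         if i < x.shape[0]:
             out[i] = x[i] * factor

     x = np.arange(1000000, dtype=np.float64)
     out = np.zeros_like(x)
     threads = 256
     blocks = (x.size + threads - 1) // threads
     scale[blocks, threads](out, x, 2.0)    # kernel launch: grid and block sizes in brackets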
  36. 36. Example: Black-Scholes 36
  37. 37. Black-Scholes: Results 37 (Core i7 vs GeForce GTX 560 Ti): about 9x faster on this GPU, roughly the same speed as CUDA-C
  38. 38. 38 Mandelbrot
     from numba import jit

     @jit
     def mandel(x, y, max_iters):
         c = complex(x, y)
         z = 0j
         for i in range(max_iters):
             z = z*z + c
             if z.real * z.real + z.imag * z.imag >= 4:
                 return 255 * i // max_iters
         return 255
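A hypothetical driver (not on the slide) that fills an image by evaluating mandel() for each pixel; with @jit on the driver as well, the whole double loop is compiled.

     import numpy as np
     from numba import jit

     @jit
     def create_fractal(xmin, xmax, ymin, ymax, image, iters):
         h, w = image.shape
         for i in range(h):
             y = ymin + i * (ymax - ymin) / h
             for j in range(w):
                 x = xmin + j * (xmax - xmin) / w
                 image[i, j] = mandel(x, y, iters)
         return image

     image = np.zeros((1024, 1536), dtype=np.uint8)
     create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)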
  39. 39. 39 Mandelbrot speedups: CPython 1x; NumPy array-wide operations 13x; Numba (CPU) 120x; Numba (NVIDIA Tesla K20c) 2100x
  40. 40. Other interesting things 40 • CUDA Simulator to debug your code in Python interpreter • Generalized ufuncs (@guvectorize) including GPU support and multi-core (threaded) support • Call ctypes and cffi functions directly and pass them as arguments • Support for types that understand the buffer protocol • Pickle Numba functions to run on remote execution engines • “numba annotate” to dump HTML annotated version of compiled code • See: http://numba.pydata.org/numba-doc/0.23.0/
  41. 41. What Doesn’t Work? 41 (A non-comprehensive list) • Sets, lists, dictionaries, user defined classes (tuples do work!) • List, set and dictionary comprehensions • Recursion • Exceptions with non-constant parameters • Most string operations (buffer support is very preliminary!) • yield from • closures inside a JIT function (compiling JIT functions inside a closure works…) • Modifying globals • Debugging of compiled code (you have to debug in Python mode).
  42. 42. Recently Added Numba Features 42 • Support for named tuples in nopython mode • Limited support for lists in nopython mode • On-disk caching of compiled functions (opt-in) • JIT classes (zero-cost abstraction) • Support of np.dot (and ‘@‘ operator on Python 3.5) • Support for some of np.linalg • generated_jit (jit the functions that are the return values of the decorated function) • SmartArrays which can exist on host and GPU (transparent data access).
  43. 43. © 2015 Continuum Analytics- Confidential & Proprietary More New Features • Support for ARMv7 (Raspberry Pi 2) • Python 3.5 support • NumPy 1.10 support • Faster loading of pre-compiled functions from the disk cache • Ahead-of-time compilation (you can write code with Numba, compile it ahead of time, and ship a binary without requiring Numba). • ufunc compilation for multithreaded CPU and GPU targets (features previously only in Accelerate). 43
  44. 44. Conclusion 44 • Lots of progress in the past year! • Try out Numba on your numerical and NumPy-related projects: conda install numba • Your feedback helps us make Numba better!
 Tell us what you would like to see:
 
 https://github.com/numba/numba • Extension API coming soon and support for more data structures
  45. 45. SCALE OUT
  46. 46. 46 • Infrastructure for meta-data, meta-compute, and expression graphs/dataflow • Data glue for scale-up or scale-out • Generic remote computation & query system • (NumPy+Pandas+LINQ+OLAP+PADL).mashup() Blaze is an extensible high-level interface for data analytics. It feels like NumPy/Pandas. It drives other data systems. Blaze expressions enable high-level reasoning. An ecosystem of tools. http://blaze.pydata.org Blaze
  47. 47. 47 Expressions Metadata Runtime
  48. 48. 48 Expressions: + - / * ^ [], join, groupby, filter, map, sort, take, where, topk. Metadata: datashape, dtype, shape, stride over formats such as hdf5, json, csv, xls, protobuf, avro, ... Runtimes: NumPy, Pandas, R, Julia, K, SQL, Spark, Mongo, Cassandra, ...
  49. 49. APIs, syntax, language 49. Expressions: blaze (compute). Metadata: datashape. Data (storage/containers): odo. Runtime: dask (parallelize, optimize, JIT).
  50. 50. Blaze 50: an interface to query data on different storage systems. http://blaze.pydata.org/en/latest/
     from blaze import Data

     iris = Data('iris.csv')                          # CSV
     iris = Data('sqlite:///flowers.db::iris')        # SQL
     iris = Data('mongodb://localhost/mydb::iris')    # MongoDB
     iris = Data('iris.json')                         # JSON
     iris = Data('s3://blaze-data/iris.csv')          # S3
     ...
 Current focus is “dark data” and the PyData stack for the run-time (dask, dynd, numpy, pandas, x-ray, etc.) plus customer needs (e.g. kdb, mongo).
  51. 51. Blaze 51
     Select columns:       iris[['sepal_length', 'species']]
     Operate:              log(iris.sepal_length * 10)
     Reduce:               iris.sepal_length.mean()
     Split-apply-combine:  by(iris.species, shortest=iris.petal_length.min(), longest=iris.petal_length.max(), average=iris.petal_length.mean())
     Add new columns:      transform(iris, sepal_ratio=iris.sepal_length / iris.sepal_width, petal_ratio=iris.petal_length / iris.petal_width)
     Text matching:        iris.like(species='*versicolor')
     Relabel columns:      iris.relabel(petal_length='PETAL-LENGTH', petal_width='PETAL-WIDTH')
     Filter:               iris[(iris.species == 'Iris-setosa') & (iris.sepal_length > 5.0)]
  52. 52. 52 Blaze uses datashape as its type system (like DyND)
     >>> iris = Data('iris.json')
     >>> iris.dshape
     dshape("""var * {
         petal_length: float64,
         petal_width: float64,
         sepal_length: float64,
         sepal_width: float64,
         species: string
     }""")
  53. 53. Datashape 53: a structured data description language. http://datashape.pydata.org/ A datashape combines dimensions (e.g. var, 3, 4) with unit dtypes (e.g. string, int32, float64), joined by *. For example, var * { x: int32, y: string, z: float64 } is a tabular datashape whose record dtype { x: int32, y: string, z: float64 } is an ordered struct, i.e. a collection of types keyed by labels.
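A small, hedged example using the datashape library itself to parse the type above (attribute names assumed from the library's API of this era):

     from datashape import dshape

     ds = dshape("var * { x: int32, y: string, z: float64 }")
     print(ds.shape)     # the dimensions, e.g. (var,)
     print(ds.measure)   # the record dtype: { x: int32, y: string, z: float64 }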
  54. 54. Datashape 54
     {
         flowersdb: {
             iris: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }
         },
         iriscsv: var * { sepal_length: ?float64, sepal_width: ?float64, petal_length: ?float64, petal_width: ?float64, species: ?string },
         irisjson: var * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string },
         irismongo: 150 * { petal_length: float64, petal_width: float64, sepal_length: float64, sepal_width: float64, species: string }
     }

     # Arrays of Structures
     100 * {
         name: string,
         birthday: date,
         address: {
             street: string,
             city: string,
             postalcode: string,
             country: string
         }
     }

     # Structure of Arrays
     {
         x: 100 * 100 * float32,
         y: 100 * 100 * float32,
         u: 100 * 100 * float32,
         v: 100 * 100 * float32,
     }

     # Function prototype
     (3 * int32, float64) -> 3 * float64

     # Function prototype with broadcasting dimensions
     (A... * int32, A... * int32) -> A... * int32

     # Arrays
     3 * 4 * int32
     10 * var * float64
     3 * complex[float64]
  55. 55. Blaze Server 55: Lights up your Dark Data. Builds off of Blaze's uniform interface to host data remotely through a JSON web API.
     $ blaze-server server.yaml -e localhost:6363/compute.json

     server.yaml:
     iriscsv:
       source: iris.csv
     irisdb:
       source: sqlite:///flowers.db::iris
     irisjson:
       source: iris.json
       dshape: "var * {name: string, amount: float64}"
     irismongo:
       source: mongodb://localhost/mydb::iris
  56. 56. Blaze Server 56 / Blaze Client
     >>> from blaze import Data
     >>> t = Data('blaze://localhost:6363')
     >>> t.fields
     [u'iriscsv', u'irisdb', u'irisjson', u'irismongo']
     >>> t.iriscsv
        sepal_length  sepal_width  petal_length  petal_width      species
     0           5.1          3.5           1.4          0.2  Iris-setosa
     1           4.9          3.0           1.4          0.2  Iris-setosa
     2           4.7          3.2           1.3          0.2  Iris-setosa
     >>> t.irisdb
        petal_length  petal_width  sepal_length  sepal_width      species
     0           1.4          0.2           5.1          3.5  Iris-setosa
     1           1.4          0.2           4.9          3.0  Iris-setosa
     2           1.3          0.2           4.7          3.2  Iris-setosa
  57. 57. Compute recipes work with existing libraries and have multiple backends — write once and run anywhere. • python list • numpy arrays • dynd • pandas DataFrame • Spark, Impala • Mongo • dask 57
  58. 58. • You can layer expressions over any data
 • Write once, deploy anywhere
 • Practically, expressions will work better on specific data structures, formats, and engines
 • Use odo to copy from one format and/or engine to another 58
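A hedged odo sketch (file and table names reused from the Blaze examples above): one call copies the same data between formats or engines.

     import pandas as pd
     from odo import odo

     # CSV file -> SQLite table (created if it does not exist)
     odo('iris.csv', 'sqlite:///flowers.db::iris')

     # SQLite table -> in-memory pandas DataFrame
     df = odo('sqlite:///flowers.db::iris', pd.DataFrame)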
  59. 59. 59 Dask: Distributed PyData • A parallel computing framework • That leverages the excellent Python ecosystem (NumPy and Pandas) • Using blocked algorithms and task scheduling • Written in pure Python Core Ideas • Dynamic task scheduling yields sane parallelism • Simple library to enable parallelism • Dask.array/dataframe to encapsulate the functionality • Distributed scheduler
  60. 60. DAG of Computation 60
  61. 61. • Collections build task graphs • Schedulers execute task graphs • Graph specification = uniting interface • A generalization of RDDs 61
  62. 62. Simple Architecture for Scaling 62: the Python ecosystem feeds Dask collections (dask.array, dask.dataframe, dask.bag, dask.imperative*), which build a Dask graph specification that Dask schedulers execute.
  63. 63. dask.array: OOC, parallel, ND array 63 Arithmetic: +, *, ... Reductions: mean, max, ... Slicing: x[10:, 100:50:-2] Fancy indexing: x[:, [3, 1, 2]] Some linear algebra: tensordot, qr, svd Parallel algorithms (approximate quantiles, topk, ...) Slightly overlapping arrays Integration with HDF5
  64. 64. Dask Array 64
     numpy:
     >>> import numpy as np
     >>> np_ones = np.ones((5000, 1000))
     >>> np_ones
     array([[ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            ...,
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.]])
     >>> np_y = np.log(np_ones + 1)[:5].sum(axis=1)
     >>> np_y
     array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, 693.14718056])

     dask:
     >>> import dask.array as da
     >>> da_ones = da.ones((5000000, 1000000), chunks=(1000, 1000))
     >>> da_ones.compute()
     array([[ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            ...,
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.],
            [ 1., 1., 1., ..., 1., 1., 1.]])
     >>> da_y = da.log(da_ones + 1)[:5].sum(axis=1)
     >>> np_da_y = np.array(da_y)  # fits in memory
     array([ 693.14718056, 693.14718056, 693.14718056, 693.14718056, ..., 693.14718056])
     # If the result doesn't fit in memory
     >>> da_y.to_hdf5('myfile.hdf5', 'result')
  65. 65. dask.dataframe: OOC, parallel DataFrame 65 Elementwise operations: df.x + df.y Row-wise selections: df[df.x > 0] Aggregations: df.x.max() groupby-aggregate: df.groupby(df.x).y.max() Value counts: df.x.value_counts() Drop duplicates: df.x.drop_duplicates() Join on index: dd.merge(df1, df2, left_index=True, right_index=True)
  66. 66. Dask Dataframe 66
     pandas:
     >>> import pandas as pd
     >>> df = pd.read_csv('iris.csv')
     >>> df.head()
        sepal_length  sepal_width  petal_length  petal_width      species
     0           5.1          3.5           1.4          0.2  Iris-setosa
     1           4.9          3.0           1.4          0.2  Iris-setosa
     2           4.7          3.2           1.3          0.2  Iris-setosa
     3           4.6          3.1           1.5          0.2  Iris-setosa
     4           5.0          3.6           1.4          0.2  Iris-setosa
     >>> max_sepal_length_setosa = df[df.species == 'setosa'].sepal_length.max()
     5.7999999999999998

     dask:
     >>> import dask.dataframe as dd
     >>> ddf = dd.read_csv('*.csv')
     >>> ddf.head()
        sepal_length  sepal_width  petal_length  petal_width      species
     0           5.1          3.5           1.4          0.2  Iris-setosa
     1           4.9          3.0           1.4          0.2  Iris-setosa
     2           4.7          3.2           1.3          0.2  Iris-setosa
     3           4.6          3.1           1.5          0.2  Iris-setosa
     4           5.0          3.6           1.4          0.2  Iris-setosa
     ...
     >>> d_max_sepal_length_setosa = ddf[ddf.species == 'setosa'].sepal_length.max()
     >>> d_max_sepal_length_setosa.compute()
     5.7999999999999998
  67. 67. • simple: easy-to-use API • flexible: perform a lot of actions with a minimal amount of code • fast: dispatching to run-time engines & Cython • database-like: familiar ops • interop: integration with the PyData stack 67 (((A + 1) * 2) ** 3)
  68. 68. 68 (B - B.mean(axis=0)) 
 + (B.T / B.std())
  69. 69. More Complex Graphs 69 cross validation
  70. 70. 70 http://continuum.io/blog/xray-dask
  71. 71. 71
     from dask import dataframe as dd
     columns = ["name", "amenity", "Longitude", "Latitude"]
     data = dd.read_csv('POIWorld.csv', usecols=columns)

     with_name = data[data.name.notnull()]
     with_amenity = data[data.amenity.notnull()]

     is_starbucks = with_name.name.str.contains('[Ss]tarbucks')
     is_dunkin = with_name.name.str.contains('[Dd]unkin')

     starbucks = with_name[is_starbucks]
     dunkin = with_name[is_dunkin]

     locs = dd.compute(starbucks.Longitude, starbucks.Latitude,
                       dunkin.Longitude, dunkin.Latitude)

     # extract arrays of values from the series:
     lon_s, lat_s, lon_d, lat_d = [loc.values for loc in locs]

     %matplotlib inline
     import matplotlib.pyplot as plt
     from mpl_toolkits.basemap import Basemap

     def draw_USA():
         """initialize a basemap centered on the continental USA"""
         plt.figure(figsize=(14, 10))
         return Basemap(projection='lcc', resolution='l',
                        llcrnrlon=-119, urcrnrlon=-64,
                        llcrnrlat=22, urcrnrlat=49,
                        lat_1=33, lat_2=45, lon_0=-95,
                        area_thresh=10000)

     m = draw_USA()
     # Draw map background
     m.fillcontinents(color='white', lake_color='#eeeeee')
     m.drawstates(color='lightgray')
     m.drawcoastlines(color='lightgray')
     m.drawcountries(color='lightgray')
     m.drawmapboundary(fill_color='#eeeeee')

     # Plot the values in Starbucks Green and Dunkin Donuts Orange
     style = dict(s=5, marker='o', alpha=0.5, zorder=2)
     m.scatter(lon_s, lat_s, latlon=True, label="Starbucks", color='#00592D', **style)
     m.scatter(lon_d, lat_d, latlon=True, label="Dunkin' Donuts", color='#FC772A', **style)
     plt.legend(loc='lower left', frameon=False);
  72. 72. dask distributed 72: Pythonic multiple-machine parallelism that understands Dask graphs.
 1) Defines a Center (dcenter) and Workers (dworker)
 2) Simplified setup with dcluster, for example
        dcluster 192.168.0.{1,2,3,4}
    or
        dcluster --hostfile hostfile.txt
 3) Create Executor objects like concurrent.futures (Python 3) or futures (Python 2.7 back-port)
 4) Data locality supported with ad-hoc task graphs by returning futures wherever possible
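A hedged sketch of the Executor interface described above (the address is only an example; in later dask.distributed releases this class was renamed Client):

     from distributed import Executor

     e = Executor('192.168.1.100:8787')     # connect to the running center/scheduler

     def inc(x):
         return x + 1

     future = e.submit(inc, 10)             # a Future, like concurrent.futures
     futures = e.map(inc, range(100))       # map over many inputs in parallel
     future.result()                        # 11 -- pulls the result back only when asked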
  73. 73. 73 Anaconda (PyData) Inside Hadoop conda Dask MPI High Performance All of Python/R • Part of the Dask project • native HDFS reader • YARN/Mesos integration • Parquet, Avro, Thrift readers • Preview releases available now; coming GA in Q2 of 2016. • The native way to do Hadoop with the PyData stack! • For Python users it’s better than Spark (faster, and it integrates with existing code). • Integrates easily with NumPy, Pandas, scikit-learn, etc.
  74. 74. 74 PyData Inside Hadoop conda. Two key libraries enable the connection: • knit (http://knit.readthedocs.org/en/latest/): enables Python code to interact with Hadoop schedulers • hdfs3 (http://hdfs3.readthedocs.org/en/latest/): a wrapper around Pivotal’s libhdfs3 that provides native reading and writing of HDFS (without the JVM). https://github.com/dask/dec2 might also be useful for you (to ease starting dask distributed clusters on EC2).
  75. 75. 75
  76. 76. HDFS without Java 76 1. HDFS splits large files into many small blocks replicated on many datanodes 2. For efficient computation we must use data directly on datanodes 3. distributed.hdfs queries the locations of the individual blocks 4. distributed executes functions directly on those blocks on the datanodes 5. distributed+pandas enables distributed CSV processing on HDFS in pure Python 6. Distributed, pythonic computation, directly on hdfs!
  77. 77. 77
     $ hdfs dfs -cp yellow_tripdata_2014-01.csv /data/nyctaxi/

     >>> from distributed import hdfs
     >>> blocks = hdfs.get_locations('/data/nyctaxi/', '192.168.50.100', 9000)
     >>> columns = ['vendor_id', 'pickup_datetime', 'dropoff_datetime', 'passenger_count',
     ...            'trip_distance', 'pickup_longitude', 'pickup_latitude', 'rate_code',
     ...            'store_and_fwd_flag', 'dropoff_longitude', 'dropoff_latitude',
     ...            'payment_type', 'fare_amount', 'surcharge', 'mta_tax', 'tip_amount',
     ...            'tolls_amount', 'total_amount']

     >>> from distributed import Executor
     >>> executor = Executor('192.168.1.100:8787')
     >>> dfs = [executor.submit(pd.read_csv, block['path'], workers=block['hosts'],
     ...                        columns=columns, skiprows=1)
     ...        for block in blocks]
 These operations produce Future objects that point to remote results on the worker computers. This does not pull results back to local memory. We can use these futures in later computations with the executor.
  78. 78. 78
     def sum_series(seq):
         result = seq[0]
         for s in seq[1:]:
             result = result.add(s, fill_value=0)
         return result

     >>> counts = executor.map(lambda df: df.passenger_count.value_counts(), dfs)
     >>> total = executor.submit(sum_series, counts)
     >>> total.result()
     0         259
     1     9727301
     2     1891581
     3      566248
     4      267540
     5      789070
     6      540444
     7           7
     8           5
     9          16
     208        19
