Strata Beijing 2017: Jumpy, a python interface for nd4j


This covers Jumpy, our new Python interface.

Published in: Data & Analytics

  1. Who are we? This slide shows that GPUs should complement the big data stack in the Hadoop ecosystem rather than try to replace Hadoop and similar tools outright. Wholesale replacement of the big data stack would be cost-prohibitive for many clients. We believe the right approach is to sell GPUs for accelerated computation and a few other use cases. That's our beachhead. (Obviously, the widening functionality of Volta will change the GPU ecosystem.)
     ● Founded 2014
     ● Distributed worldwide
     ● Lots of activity in China
  2. Skymind in China
  3. Most JVM Python interfaces
     ● Network-based: require a gateway and Py4J
     ● Tons of overhead, often a bottleneck in real Spark jobs
     ● Place a focus on "pushing logic down to Scala"
     ● Don't interoperate well with the existing Python ecosystem
     ● Often have API compatibility issues
     ● "Good enough" for basic use cases despite the overhead
  4. Basic facts about overhead
     ● In-depth paper:
     ● Python vs. Scala: 15x slower
     ● Much of this is due to network traffic
     ● Serialization is another big problem
     ● Imagine saving objects every time you run compute
  5. Distributed deep learning bottlenecks
     ● Network overhead from parameter servers
     ● Data movement between CPU and GPU
     ● Buffer allocation for compute
     ● Data loading and input creation (creating tensors from data)
  6. Linear algebra in Python
     ● C-based internally
     ● Python is just an interface
     ● Tends to interoperate with NumPy pointers directly
     ● Supports CPU and GPU
     ● For DL, often varied engines (MPI, gRPC, ...)
     ● Often extended in C
  7. Linear algebra in Spark
     ● Based on Breeze and netlib-java (no longer maintained, limited to CPU)
     ● Most routines are Scala-based
     ● On-heap memory (bad for latency)
     ● CUDA support is sparse at best
     ● Doesn't conform to the industry standard (Python)
     ● Not meant for heavy compute (hardware acceleration)
     ● Relies on Spark for most ops (you can't do this with deep learning)
  8. Minor conclusions
     ● One of these is not like the other
     ● Spark is hard to interoperate with the Python ecosystem
     ● Spark tries to be something it's not when it comes to linear algebra
     ● Spark should do data loading, not linear algebra, which is better handled by C++ (SIMD, GPUs, ...)
     ● Alternatives are needed: more specialization, with a focus on C++ with Pythonic conventions
  9. Nd4j
     ● Java-based API, C++ core
     ● Its own off-heap memory management (even for GPU)
     ● Soon: autodiff, graph execution (a graph of operations), and sparse support
     ● Similar architecture to NumPy (easy interop)
     ● Works with BLAS/LAPACK
     ● Generally faster than NumPy, even from Python (as we'll see soon)
     ● It's not Python, though!
  10. Nd4j Parameter Server
     ● Aeron: more stable latency than gRPC and much faster (25x!) than TF
  11. Jumpy: a better Python interface
     ● Low latency, using C internally
     ● Interfaces nd4j <-> NumPy via direct pointers
     ● Syntax sugar similar to NumPy
     ● Uses jnius underneath
     ● jnius starts and manages a JVM for you and interoperates via JNI and Cython
     ● Easy to extend
  12. Jumpy examples
  13. Thanks! Join our QQ group:
  14. Conclusions and future work
     ● No networks! An actual path to improvement
     ● Reflection can be a bottleneck
     ● Like most useful things in Python, most of it is C!
     ● Plans to optimize pyjnius itself
     ● Can enable us to interoperate with other parts of Python
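
The slides contrast two ways of moving array data across a language boundary: serializing it (the overhead slide) or handing over a direct pointer (the Jumpy slide). The sketch below illustrates that difference using only NumPy and the standard library. It is not Jumpy's or nd4j's API; the helper names `share_by_pointer` and `share_by_serialization` are hypothetical, and the example assumes a C-contiguous float64 array.

```python
import ctypes
import pickle

import numpy as np


def share_by_pointer(arr):
    """Return a second array that views the same memory as `arr` (no copy).

    Hypothetical helper: assumes `arr` is C-contiguous with dtype float64.
    """
    # Raw address of the first element, as a native library would receive it.
    address = arr.ctypes.data
    # Rebuild a ctypes buffer at that address, then wrap it as a NumPy array.
    buf = (ctypes.c_double * arr.size).from_address(address)
    return np.ctypeslib.as_array(buf).reshape(arr.shape)


def share_by_serialization(arr):
    """Return an independent copy of `arr` produced by a pickle round trip."""
    return pickle.loads(pickle.dumps(arr))


source = np.arange(6, dtype=np.float64).reshape(2, 3)
view = share_by_pointer(source)
copy = share_by_serialization(source)

# Writing through the pointer-based view changes the original buffer.
view[0, 0] = 42.0
# Writing to the deserialized copy leaves the original untouched.
copy[1, 2] = -1.0

print(source[0, 0], source[1, 2])
print(np.shares_memory(source, view), np.shares_memory(source, copy))
```

The pointer-based view costs the same regardless of array size, while the serialized copy grows with the data, which is the kind of per-call overhead the slides attribute to gateway-based interfaces.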