Microservices & Teraflops: Effortlessly Scaling Data Science with PyWren | AnacondaCON 2017

One of the earliest challenges facing new data science practitioners is how to scale their work from something that runs on their laptop to larger-scale jobs. Tools like Spark and Hadoop can have a steep learning curve, and often require explicit management of a compute cluster. Here we talk about PyWren, a Python library that lets users run their workloads on hundreds of cloud machines with no distributed computing knowledge, for a few dollars at a time. We will walk the audience from writing simple data analysis functions on their laptop to running on 1000 cores on Amazon Web Services, in 30 minutes.

Presented at AnacondaCON 2017 by Eric Jonas, UC Berkeley.

Published in: Data & Analytics

  1. 1. MICROSERVICES & TERAFLOPS: Effortlessly scaling data science. #thecloudistoodamnhard. Eric Jonas, Postdoctoral Researcher, jonas@eecs.berkeley.edu | @stochastician
  2. 2. A BIG FAN OF ANACONDA
  3. 3. “BIG” DATA
             (near-by) stars   neurons     nuclei
 size        10^9 m            10^-5 m     10^-14 m
 number      1                 10^11       10^26
 data size   2 PB              12 TB/sec   ??/sec
  4. 4. Sun in UV (304 Å). You are here. Images courtesy NASA SOHO
  5. 5. Solar Flare Prediction Using Photospheric and Coronal Image Data. Jonas, Bobra, Shankar, Recht. American Geophysical Union, 2016
  6. 6. NEUROSCIENCE AT ALL SCALES
  7. 7. Could a Neuroscientist understand a microprocessor? Jonas, Kording. PLOS Computational Biology, 2017
  8. 8. AND I WANT MORE!
  9. 9. Superresolution, Phase contrast Tomography, Adaptive Optics
  10. 10. How do you get busy physicists and electrical engineers to give up Matlab? How do we get busy astronomers to give up IDL?
  11. 11. Why is there no “cloud button”? PREVIOUSLY, ON
  12. 12. The cloud is too damn hard! Jimmy McMillan, Founder and Chairman, The Rent Is Too Damn High Party. Fewer than half of the graduate students in our group have ever written a Spark or Hadoop job
  13. 13. –Eric Jonas, 2017 “I hate computers”
  14. 14. #THECLOUDISTOODAMNHARD • What type? What instance? What base image? • How many to spin up? What price? Spot? • wait, Wait, WAIT oh god • now what? DEVOPS
  15. 15. WHAT DO WE WANT? 1. Very little overhead for setup 
 once someone has an AWS account. In particular, no persistent overhead -- you don't have to keep a large (expensive) cluster up and you don't have to wait 10+ min for a cluster to come up
  16. 16. WHAT DO WE WANT? 2. As close to zero overhead for users as possible.
 In particular, anyone who can write Python should be able to invoke it through a reasonable interface. It should support all legacy code.
  17. 17. WHAT DO WE WANT? 3. Target jobs that run in the minutes-or-more regime.
  18. 18. WHAT DO WE WANT? 4. I don't want to run a service. 
 That is, I personally don't want to offer the front-end for other people to use, rather, I want to directly pay AWS.
  19. 19. WHAT DO WE WANT? 5. It has to be from a cloud player that's likely to give out an academic grant -- AWS, Google, MS Azure. 
 
 There are startups in this space that might build cool technology, but often don't want to be paid in AWS research credits.
  20. 20. WHAT WE WANT
 1. Very little overhead for setup once someone has an AWS account. In particular, no persistent overhead -- you don't have to keep a large (expensive) cluster up and you don't have to wait 10+ min for a cluster to come up.
 2. As close to zero overhead for users as possible -- in particular, anyone who can write Python should be able to invoke it through a reasonable interface.
 3. Target jobs that run in the minutes-or-more regime.
 4. I don't want to run a service. That is, I personally don't want to offer the front-end for other people to use, rather, I want to directly pay AWS.
 5. It has to be from a cloud player that's likely to give out an academic grant -- AWS, Google, Azure. There are startups in this space that might build cool technology, but often don't want to be paid in AWS research credits.
  21. 21. Powered by Continuum Analytics +
  22. 22. –Eric Jonas, 2017 “I hate computers” servers
  23. 23. AWS LAMBDA • 300 seconds single-core (AVX2) • 512 MB in /tmp • 1.5 GB RAM • Python, Java, Node
  24. 24. THE API
  25. 25. LAMBDA SCALABILITY Compute Data
  26. 26. YOU CAN DO A LOT OF WORK WITH MAP! ETL parameter tuning
  27. 27. IMAGENET EXAMPLE Preprocess 1.4M images from IMAGENET. Compute GIST image descriptor (some random Python code off the internet)
  28. 28. HOW IT WORKS
 Your laptop: future = runner.map(fn, data): serialize func and data, put on S3, invoke Lambda.
 The cloud: pull job from S3, download Anaconda runtime, Python to run code, pickle result, stick in S3.
 Your laptop: future.result(): poll S3, unpickle and return result.
  29. 29. A BRIEF HISTORY OF SHARING. Axes: Overhead, Isolation.
 Processes: 1960s, MULTICS
 Virtual Machines: 1990s, VMware, Xen
 Renting/VPS: 1990s, SGE
 HW VMs: 2000s, Intel VT-x
 Containers: 2008, chroot/LXC
 (mostly wrong)
 • Process isolation • network isolation • filesystem isolation • memory / cpu constraints
  30. 30. (Leptotyphlops carlae) Steps: Start, Delete non-AVX2 MKL, strip shared libs, conda clean, eliminate pkg, delete pyc. Sizes: 977 MB, 1205 MB, 441 MB, 946 MB, 670 MB, 510 MB. Want our runtime to include
  31. 31. MAP IS NOT ENOUGH? A lot of data analytics looks like: Data, ETL / preprocessing, featurization, machine learning. Great PyWren Fit. Distributed! Scale! TensorFlow, Deep, MLBase
  32. 32. –Paul Barham, quoted in McSherry, 2015 “You can have a second computer when you’ve shown you know how to use the first one.”
  33. 33. Scalability! But at what COST? Frank McSherry, Michael Isard, Derek G. Murray. USENIX Workshop on Hot Topics in Operating Systems (HotOS), 2015
  34. 34. SINGLE-MACHINE REDUCE But I don’t have a big server!
 futures = exec.map(function, data)
 answer = exec.reduce(reduce_func, futures)
 Instance      cores           RAM      COST
 x1.32xlarge   64              2 TB     $14/hr
 x1.16xlarge   32              1 TB     $7/hr
 p2.16xlarge   32 + 16 GPUs    750 GB   $14/hr
 r4.16xlarge   32              500 GB   $4/hr
  35. 35. STUPID LAMBDA TRICKS Shivaram told me today he has this up to 6M transactions/sec (!)
  36. 36. BUT I CAN’T USE THE CLOUD!
  37. 37. PYWREN MAKES SCALE A BIT EASIER • Do you have a Python function? • Do you want to scale it? • Try it out! • Map: Today • BigReduce: 1.0 in a week • Parameter server: Experimental
  38. 38. THANKS! https://github.com/ericmjonas/pywren Shivaram Venkataraman Ben Recht Ion Stoica
  39. 39. EXTRA SLIDES
  40. 40. BEHIND THE HOOD
  41. 41. UNDERSTANDING HOST ALLOCATION
  42. 42. SO WHEN IS THIS USEFUL? • Parameter searching • Last-minute NIPS experiments • Expensive forward models. Diagram: alternating phases of massively parallel compute and serial / local work.
  43. 43. GETTING AROUND THE LIMITATIONS • Runtime [anaconda] • Job lifetime [generators] • Synchronization (memcache / redis?) • inter-lambda IPC
  44. 44. WORKER REUSE
  45. 45. COORDINATION?
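
The flow on slide 28 (serialize the function and data, put them in storage, invoke a worker, poll for the result) can be sketched locally. This is a minimal sketch under stated assumptions, not PyWren's implementation: a dictionary named `FAKE_S3` stands in for S3, a thread pool stands in for Lambda invocations, and `marshal` ships the function's bytecode, which only works for simple functions with no closures or module-level dependencies. All names here (`FAKE_S3`, `cloud_worker`, `run_map`, `Future`) are hypothetical.

```python
import builtins
import marshal
import pickle
import time
import types
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an S3 bucket: object key -> bytes.
FAKE_S3 = {}


def cloud_worker(job_key):
    """'The cloud' side: pull job, rebuild and run the function, store the pickled result."""
    code_bytes, arg = pickle.loads(FAKE_S3[job_key])
    fn = types.FunctionType(marshal.loads(code_bytes), {"__builtins__": builtins})
    FAKE_S3[job_key + "/result"] = pickle.dumps(fn(arg))


class Future:
    """'Your laptop' side: a handle that polls storage until the result appears."""

    def __init__(self, job_key):
        self.job_key = job_key

    def result(self, poll_interval=0.01, timeout=10.0):
        result_key = self.job_key + "/result"
        deadline = time.monotonic() + timeout
        while result_key not in FAKE_S3:
            if time.monotonic() > deadline:
                raise TimeoutError(f"no result for {self.job_key}")
            time.sleep(poll_interval)
        return pickle.loads(FAKE_S3[result_key])


def run_map(fn, data, pool):
    """Serialize func and data, put each job in storage, and invoke one worker per item."""
    code_bytes = marshal.dumps(fn.__code__)
    futures = []
    for i, item in enumerate(data):
        job_key = f"job-{i}"
        FAKE_S3[job_key] = pickle.dumps((code_bytes, item))
        pool.submit(cloud_worker, job_key)  # stands in for "Invoke Lambda"
        futures.append(Future(job_key))
    return futures


def square(x):
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    results = [f.result() for f in run_map(square, [1, 2, 3, 4], pool)]
print(results)
```

The caller never talks to a worker directly: every hand-off goes through storage, which is why the laptop side only needs to poll for a key to appear.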

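The map-then-reduce pattern on slide 34, applied to the parameter tuning use case from slide 26, might look like the sketch below. It assumes a local thread pool in place of cloud workers; `LocalExecutor`, `score`, and `keep_best` are hypothetical names for illustration, and the `map`/`reduce` methods only mirror the shape of the calls shown on the slide, not PyWren's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce as fold


class LocalExecutor:
    """Hypothetical local stand-in for the executor on slide 34."""

    def __init__(self, max_workers=8):
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def map(self, function, data):
        # Fan out: one independent task per input item.
        return [self._pool.submit(function, item) for item in data]

    def reduce(self, reduce_func, futures):
        # Fan in: gather every result, then combine them in a single process,
        # which plays the role of the one big machine on the slide.
        results = [f.result() for f in futures]
        return fold(reduce_func, results)

    def shutdown(self):
        self._pool.shutdown()


def score(learning_rate):
    """Toy objective for parameter tuning: the loss is lowest at 0.1."""
    return (learning_rate, (learning_rate - 0.1) ** 2)


def keep_best(a, b):
    """Reduce step: keep whichever (parameter, loss) pair has the lower loss."""
    return a if a[1] <= b[1] else b


executor = LocalExecutor()
futures = executor.map(score, [0.001, 0.01, 0.1, 1.0])
best_param, best_loss = executor.reduce(keep_best, futures)
executor.shutdown()
print(best_param, best_loss)
```

Because each `score` call is independent, the map step parallelizes trivially; only the final comparison needs all results in one place.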