
Big data with Python on kubernetes (pyspark on k8s) - Big Data Spain 2018



Big Data applications are increasingly being run on Kubernetes. Data scientists commonly use Python-based workflows, with tools like PySpark and Jupyter for wrangling large amounts of data. Over the past year, the Kubernetes community has been actively investing in tools and support for frameworks such as Apache Spark, Jupyter and Apache Airflow. Attendees will learn how these tools can be used together to build a scalable self-service platform for data science on Kubernetes, as well as the benefits that Kubernetes can provide over traditional options.

Published in: Internet


  1. 1. Big Data w/Python On Kubernetes & Apache Spark with @holdenkarau Missing Ilan :(
  2. 2. Some links (slides & recordings will be at): CatLoversShow
  3. 3. Holden: ● My name is Holden Karau ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC, Beam contributor ● previously IBM, Alpine, Databricks, Google, Foursquare & Amazon ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share ● Code review livestreams: / ● Spark Talk Videos ● Talk feedback (if you are so inclined):
  4. 4. What is going to be covered: ● What is Kubernetes ● How it’s different from YARN and other similar systems we use Spark on ● How “simple” it is to switch cluster managers ○ Plus the not so simple (where’s my HDFS and auto-scaling?) ● WiFi co-operating, a PySpark on K8s demo (everyone loves wordcount!) ● A brief detour into Kubeflow ● Future work and directions Andrew
  5. 5. Kubernetes “New” open-source cluster manager. Runs programs in Linux containers. 1600+ contributors and 60,000+ commits.
  6. 6. Kubernetes “New” open-source cluster manager. Runs programs in Linux containers. [diagram: several containers, each with its own libs and app, on one kernel] 1600+ contributors and 60,000+ commits.
  7. 7. More isolation is good Kubernetes provides each program with: ● a lightweight virtual file system -- Docker image ○ an independent set of S/W packages ● a virtual network interface ○ a unique virtual IP address ○ an entire range of ports Aleksei I
  8. 8. Other isolation layers ● Separate process ID space ● Max memory limit ● CPU share throttling ● Mountable volumes ○ Config files -- ConfigMaps ○ Credentials -- Secrets ○ Local storage -- EmptyDir, HostPath ○ Network storage -- PersistentVolumes Jarek Reiner
  9. 9. Dependencies ● Spark alone isn’t enough ● Think: spaCy, scikit-learn, TensorFlow, etc. ● YARN: Shared conda env, but supporting different versions is hard Fuzzy Gerdes
  10. 10. Kubernetes architecture [diagram: node A and node B hosting Pod 1, Pod 2 and Pod 3] Pod: a unit of scheduling and isolation. ● runs a user program in a primary container ● holds isolation layers like a virtual IP in an infra container Robbt
  11. 11. Big Data on Kubernetes Since Spark 2.3, the community has been working on a few important new features that make Spark on Kubernetes more usable and ready for a broader spectrum of use cases: ● non-JVM binding support and memory customization ● client-mode support for running interactive apps ● Kerberos support ● large framework refactors: removal of the init-container; scheduler The Last Cookie
  12. 12. Spark on Kubernetes [diagram labels: Spark Core; Kubernetes Scheduler Backend; Kubernetes Cluster; new executors; remove executors] Configuration: • Resource Requests • Authnz • Communication with K8s babbagecabbage
  13. 13. Spark on Kubernetes [diagram: node A and node B; Job 1 and Job 2 each have a Client, a Driver Pod, Executor Pod 1 and Executor Pod 2]
  14. 14. How to change to running on Kubernetes? In theory “just”: --master yarn to --master k8s://[...] In practice: ● Build a container with your dependencies ● Possibly change your storage (HDFS to S3 or GCS) ● Change your cluster manager ● Re-do your tuning work Hisashi
  15. 15. Demo: Everyone loves wordcount! It’s big data which means we have to do WordCount Recorded demo - Hisashi
  16. 16. Demo #2: Wordcount in client mode on K8s Recorded demo - Luxus M
  17. 17. Demo #3: Wordcount in a notebook on K8s Everyone loves notebooks, except ops, qa and your very stressed out data engineers. Recorded demo - Tim (Timothy) Pearce
  18. 18. What do we need to do next? ● Support dynamic scaling ● Storage? ● Better auth integration ● Better documentation (ugh client mode) Hisashi
  19. 19. Dynamic Scaling: ● Need a separate shuffle service ● We could do smart scale down maybe - Jennifer C.
  20. 20. Related talks & blog posts ● Running custom Spark on GKE and Azure - d-kubernetes ● Deploying Spark on Kubernetes - ● Getting PySpark 2.4 working on GKE recorded livestream - Interested in OSS (especially Spark)? ● Check out my Twitch & Youtube for livestreams - & Becky Lai
  21. 21. Learning Spark ● Fast Data Processing with Spark (Out of Date) ● Fast Data Processing with Spark (2nd edition) ● Advanced Analytics with Spark ● Spark in Action ● High Performance Spark ● Learning PySpark
  22. 22. High Performance Spark! Available today, not a lot on testing and almost nothing on validation, but that should not stop you from buying several copies (if you have an expense account). Cats love it! Amazon sells it: :D
  23. 23. Sign up for the mailing list @
  24. 24. And some upcoming talks: ● November ○ Saturday - Scale By The Bay - San Francisco ● December ○ ScalaX - London ● January ○ Data Day Texas ● February ○ TBD ● March ○ Strata San Francisco
  25. 25. Cat wave photo by Quinn Dombrowski k thnx bye! (or questions…) If you want to fill out the survey: Give feedback on this presentation I’ll be around - I have a light up jacket but you can message me on twitter too.
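
As a rough illustration of slide 14's "in theory just change --master" step, here is a minimal sketch that assembles the spark-submit arguments for YARN and for Kubernetes side by side. It is not the command used in the recorded demos. The API server address, image name, namespace and script paths are hypothetical placeholders, and `build_spark_submit_args` is a helper invented for this example. The `spark.kubernetes.*` keys and the `local://` scheme are the ones described in the Spark on Kubernetes documentation.

```python
import shlex
from typing import Dict, List


def build_spark_submit_args(
    master: str,
    app: str,
    conf: Dict[str, str],
    deploy_mode: str = "cluster",
) -> List[str]:
    """Assemble a spark-submit argument list (hypothetical helper for this sketch)."""
    args = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    # Emit --conf pairs in sorted order so the command is reproducible.
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    args.append(app)
    return args


# Hypothetical values: substitute your own API server and container image.
K8S_API = "https://k8s-apiserver.example.com:6443"
IMAGE = "registry.example.com/pyspark-wordcount:latest"

# Before: YARN finds the script and its dependencies on the cluster.
yarn_args = build_spark_submit_args(
    "yarn",
    "wordcount.py",
    {"spark.executor.instances": "2"},
)

# After: the master URL changes, and the dependencies travel in a container image.
k8s_args = build_spark_submit_args(
    f"k8s://{K8S_API}",
    # local:// points at a file already inside the container image.
    "local:///opt/spark/app/wordcount.py",
    {
        "spark.executor.instances": "2",
        "spark.kubernetes.container.image": IMAGE,
        "spark.kubernetes.namespace": "default",
    },
)

print(shlex.join(yarn_args))
print(shlex.join(k8s_args))
```

The difference between the two commands is small, which is the "in theory" part. The "in practice" items from the slide (building the image, moving storage off HDFS, re-tuning) all happen outside this command.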