
Just enough DevOps for Data Scientists (Part II)

Imagine Ada, our data science intern. Let's run through a very simple wordcount Spark job and find a handful of potential failure points. Dozens of failures can and should happen when running Spark jobs on commodity hardware. Given a basic foundation of infrastructure-level expectations, this talk gives Ada the tools to ensure her job isn't caught dead. Once the simple example job runs reliably, with the potential to scale, our data scientist can apply the same toolset to more interesting algorithms. Turn SNAFUs into successes by anticipating and handling infra failures gracefully.

Note: this talk is a Spark-focused extension of Part I, "Just Enough DevOps for Data Scientists," from Scale By the Bay 2018

https://www.youtube.com/watch?v=RqpnBl5NgW0&t=19s


  1. Just enough DevOps for Data Scientists (Part II)
  2. About Anya (she/her): Sr. Member of Technical Staff (SRE), Salesforce Production Engineering, Salesforce Einstein Platform. Co-organizer, SF Big Analytics. Spark Tuning: cheat-sheet, talks. Previously at Alpine Data, SRI; PhD Mayo Clinic, BS Johns Hopkins. @anyabida1
  3. Fourth Industrial Revolution: 1700s, 1st Industrial Revolution (Steam); 1800s, 2nd Industrial Revolution (Electricity); 1900s, 3rd Industrial Revolution (Computing); Today, 4th Industrial Revolution (Intelligence). Intelligence is transforming the customer experience
  4. Just Enough DevOps for Data Scientists Part II: Handling Infra Failures When Running Spark. abida@salesforce.com @anyabida1 Anya Bida, SRE at Salesforce
  5. What is DevOps? Software Development, Network & Security, Infrastructure, Build & Release
  6. What is DevOps? Software Development, Network & Security, Infrastructure, Build & Release
  7. What is DevOps? Software Development, Network & Security, Infrastructure, Build & Release, Data Science. Hello Ada!
  8. Spark Primer: Apache Spark
  9. https://spark.apache.org/examples.html
  10. https://spark.apache.org/examples.html
  11. Blue Green Deployments (https://docs.mobingi.com/official/guide/bg-deploy): Blue Machine (old), Green Machine (new), Users
  12. https://spark.apache.org/examples.html How to avoid potential HDFS failures:
     - Use high availability for the namenode
     - Plenty of disk space for HDFS
     - Plenty of disk space per disk
     - Block replication = 3
     - Monitor disk I/O and network connectivity
     - Correct permissions
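A couple of the points above map directly onto HDFS configuration. This is a minimal, illustrative hdfs-site.xml fragment (values are assumptions to adjust per cluster, not recommendations from the talk):

```xml
<!-- hdfs-site.xml: illustrative values only -->
<property>
  <name>dfs.replication</name>
  <!-- block replication = 3, as the slide suggests -->
  <value>3</value>
</property>
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- reserve ~10 GB per disk for non-HDFS use, so HDFS never fills a disk completely -->
  <value>10737418240</value>
</property>
```

Namenode high availability requires additional settings (nameservice, journal nodes) beyond this fragment.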
  13. https://spark.apache.org/examples.html The Spark Context defines the application
  14. https://spark.apache.org/examples.html Spark operations: textFile, flatMap, map, reduceByKey, saveAsTextFile
  15. https://spark.apache.org/examples.html Spark operations: textFile, flatMap, map, reduceByKey, saveAsTextFile. Stage boundaries
  16. https://spark.apache.org/examples.html Spark operations: textFile, flatMap, map, reduceByKey, saveAsTextFile. Stage boundaries. A wide transformation defines a new stage
  17. Anatomy of a Spark Job (High Performance Spark, Karau & Warren, O'Reilly): Spark Application = Spark Context / Spark Session object; Job = triggered by an action (e.g. collect, saveAsTextFile); Stage = bounded by wide transformations (sort, groupByKey); Task = computation to evaluate one partition (combines narrow transforms)
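To see where the stage boundary falls in the wordcount example, here is a cluster-free sketch that mimics each RDD operation in plain Python. No Spark is required; the variable names simply mirror the Spark API, and the input lines are made up:

```python
# Plain-Python sketch of the classic Spark wordcount (no cluster needed).
from collections import defaultdict
from itertools import chain

lines = ["to be or not to be", "to do is to be"]  # stands in for textFile(...)

# flatMap: one line -> many words (narrow transformation)
words = list(chain.from_iterable(line.split() for line in lines))

# map: word -> (word, 1) pairs (narrow transformation)
pairs = [(w, 1) for w in words]

# reduceByKey: a wide transformation. In Spark this triggers a shuffle,
# closing the first stage and opening a second one.
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))  # e.g. {'to': 4, 'be': 3, 'or': 1, 'not': 1, 'do': 1, 'is': 1}
```

In real Spark each narrow transformation stays within a task on one partition; only reduceByKey forces data movement between executors.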
  18. https://spark.apache.org/examples.html Spark operations: textFile, flatMap, map, reduceByKey, saveAsTextFile. Stage boundaries. Where are the tasks?
  19. Tasks run on executors (Apache Spark)
  20. Tasks run on executors (Apache Spark). How to avoid common task failures:
     - Use default retry & exponential backoff settings
     - Spark is tolerant to single / multi node failures
     - Spark 2.2 is tolerant to single disk failures even on non-RAID commodity hardware
     - Optimize the number of partitions
     - Beware data skew & dirty data
     - Etc.
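The retry-related settings the slide refers to live in spark-defaults.conf. A sketch of the relevant properties, shown here at their defaults (the values are Spark's shipped defaults, not tuning advice from the talk):

```
# spark-defaults.conf (defaults shown; tune per workload)
spark.task.maxFailures               4      # task retries before the stage fails
spark.stage.maxConsecutiveAttempts   4      # stage retries before the job fails
spark.speculation                    false  # re-launch slow tasks speculatively if true
spark.sql.shuffle.partitions         200    # partition count after wide transforms (Spark SQL)
```

Leaving the retry settings at their defaults is usually the right call; partition count is the knob that more often needs tuning.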
  21. https://spark.apache.org/examples.html Spark operations: reduceByKey. Stage boundaries. The Shuffle
  22. RDD re-use: Cache | Persist | Checkpoint | Local Checkpoint
     Storage: local mem | local mem or disk (MEM / DISK levels) | HDFS / S3 (specify dir) | local disk
     If exec is decommissioned, are writes available? No | No | Yes | No
     If job finishes, are writes available? No | No | Yes | No
     Preserve lineage graph? Yes | Yes | No | No
     Persist to improve speed, checkpoint to improve fault tolerance
  23. https://spark.apache.org/examples.html Spark operations: reduceByKey. Stage boundaries. The Shuffle. Persist to improve speed, checkpoint to improve fault tolerance
  24. https://spark.apache.org/examples.html Spark operations: saveAsTextFile. Stage boundaries. The Write
  25. https://spark.apache.org/examples.html Spark operations: saveAsTextFile. Stage boundaries. The Write:
     - Reading and writing many small files is not efficient
     - Writing a few large files is more efficient than writing thousands of small files
  26. https://spark.apache.org/examples.html Spark operations: saveAsTextFile. Stage boundaries. The Write, S3:
     - S3 partitions != HDFS partitions
     - S3 partitions != Spark partitions
     - S3 partitioning can slow your write
  27. https://spark.apache.org/examples.html Spark operations: saveAsTextFile. Stage boundaries. The Write, S3:
     - S3 partitions != HDFS partitions
     - S3 partitions != Spark partitions
     - S3 partitioning can slow your write
     - S3 partitioning depends on the first few characters of the bucket path
     - s3://mybucket/hash-myresultfile
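The "hash-myresultfile" pattern above can be sketched as a small helper that prepends a short hash so output keys spread across S3's internal partitions. This is a hypothetical helper (the name `prefixed_key` and the 4-character prefix length are illustrative assumptions, not from the talk):

```python
# Hypothetical sketch: spread S3 writes across key prefixes by hashing
# the filename, since S3 historically partitioned by the leading characters
# of the object key.
import hashlib

def prefixed_key(bucket: str, filename: str) -> str:
    # First 4 hex chars of an MD5 digest serve as the spreading prefix.
    prefix = hashlib.md5(filename.encode()).hexdigest()[:4]
    return f"s3://{bucket}/{prefix}-{filename}"

key = prefixed_key("mybucket", "myresultfile")
print(key)  # s3://mybucket/<4-hex-chars>-myresultfile
```

Note that newer S3 scales partitions automatically, so this matters most for very high request rates.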
  28. https://spark.apache.org/examples.html Common failures: FAILURE, FAILURE, FAILURE
  29. https://spark.apache.org/examples.html Common failures: FAILURE, FAILURE, FAILURE
  30. Where do I find metrics? Logs? Ganglia (windowing, dashboarding), Spark History Server
  31. More info: Site Reliability Engineering: How Google Runs Production Systems (book), High Performance Spark (book), Chaos Engineering
  32. abida@salesforce.com @anyabida1 Anya Bida, SRE at Salesforce
