
Validating Spark ML Jobs: Stopping Failures Before Production on Apache Spark @ Spark Summit 2019



Validating Spark ML jobs: stopping failures before production on Apache Spark, presented @ Spark Summit 2019.



  1. 1. @holdenkarau Validating Big Data & ML Pipelines With Apache Spark Stopping Failures Before Production Melinda Seckington
  2. 2. @holdenkarau Slides will be at: http://bit.ly/2L1zHdt CatLoversShow
  3. 3. @holdenkarau Holden: ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC & Committer ● Co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau ● Past Spark talk videos: http://bit.ly/holdenSparkVideos ● Direct talk feedback: http://bit.ly/holdenTalkFeedback ● Working on a book on Kubeflow (ML + Kubernetes): http://www.introductiontomlwithkubeflow.com/
  4. 4. @holdenkarau
  5. 5. @holdenkarau What is going to be covered: ● What validation is & why you should do it for your data pipelines ● How to make simple validation rules & our current limitations ● ML Validation - Guessing if our black box is “correct” ● Cute & scary pictures ○ I promise at least one cat ○ And at least one picture of my scooter club Andrew
  6. 6. @holdenkarau Who I think you wonderful humans are? ● Nice* people ● Like silly pictures ● Possibly familiar with Spark; if you're new, WELCOME! ● Want to make better software ○ (or models, or w/e) ● Or just want to make software good enough to not have to keep your resume up to date ● Open to the idea that pipeline validation can be explained with a scooter club that is definitely not a gang.
  7. 7. @holdenkarau Everything is awesome! Possibly you
  8. 8. @holdenkarau Tests are not perfect: see Motorcycles/Scooters/... ● They are not property checking ● It's just multiple choice ● You don't even need one to ride a scoot!
  9. 9. @holdenkarau Why don't we validate? ● We already tested our code ○ Riiiight? ● What could go wrong? Also, it's extra hard in distributed systems: ● Distributed metrics are hard ● Not much built in (and not very consistent) ● Not always deterministic ● Complicated production systems
  10. 10. @holdenkarau So why should you validate? ● tl;dr - Your tests probably aren’t perfect ● You want to know when you're aboard the failboat ● Our code will most likely fail at some point ○ Sometimes data sources fail in new & exciting ways (see “Call me Maybe”) ○ That jerk on that other floor changed the meaning of a field :( ○ Our tests won’t catch all of the corner cases that the real world finds ● We should try and minimize the impact ○ Avoid making potentially embarrassing recommendations ○ Save having to be woken up at 3am to do a roll-back ○ Specifying a few simple invariants isn’t all that hard ○ Repeating Holden’s mistakes is still not fun
  11. 11. @holdenkarau So why should you validate Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  12. 12. @holdenkarau So why should you validate Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
  13. 13. @holdenkarau What happens when we don’t This talk is being recorded so no company or rider names: ● Go home after an accident rather than checking on bones Or with computers: ● Breaking a feature that cost a few million dollars ● Every search result was a coffee shop ● Rabbit (“bunny”) versus rabbit (“queue”) versus rabbit (“health”) ● VA, BoA, etc. itsbruce
  14. 14. @holdenkarau Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
  15. 15. @holdenkarau Let's focus on validation some more: *Can be used during integration tests to further validate integration results
  16. 16. @holdenkarau
  17. 17. @holdenkarau So how do we validate our jobs? ● The idea is, at some point, you made software which worked. ○ If you haven't yet, you probably want to run it a few times and manually validate it ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can't do that every time; our pipelines are no longer write-once, run-once: they are often write-once, run-forever, debug-forever.
  18. 18. @holdenkarau How many people have something like this?
  val data = ...
  val parsed = data.flatMap { x =>
    try {
      Some(parse(x))
    } catch {
      case _: Exception => None // Whatever, it's JSON
    }
  }
  Lilithis
  19. 19. @holdenkarau But we need some data...
  val data = ...
  data.cache()
  val validData = data.filter(isValid)
  val badData = data.filter(x => !isValid(x))
  if (validData.count() < badData.count()) {
    // Ruh roh! Special business error handling goes here
  }
  ...
  Pager photo by Vitachao CC-SA 3
  20. 20. @holdenkarau Well that’s less fun :( ● Our optimizer can’t just magically chain everything together anymore ● My flatMap.map.map is fnur :( ● Now I’m blocking on a thing in the driver Sn.Ho
  21. 21. @holdenkarau Counters* to the rescue**! ● Both Beam & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ Shown in the UI; you can also register a listener from the spark-validator project ● We can add counters for things we care about ○ Invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters are your friends, but the kind of friends who steal your lunch money ** In a similar way to how regular expressions can solve problems…. Miguel Olaya
  22. 22. @holdenkarau So what does that look like?
  val parsed = data.flatMap { x =>
    try {
      val result = parse(x)
      happyCounter.add(1)
      Some(result)
    } catch {
      case _: Exception =>
        sadCounter.add(1)
        None // Whatever, it's JSON
    }
  }
  // Special business data logic (aka wordcount)
  // Much much later* business error logic goes here
  Pager photo by Vitachao CC-SA 3 Phoebe Baker
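For reference, a minimal sketch of how the counters above might be registered, assuming Spark 2.x's built-in accumulator API (the accumulator names are illustrative):

  // Register named accumulators so they show up in the Spark UI.
  val happyCounter = sc.longAccumulator("happy-records")
  val sadCounter = sc.longAccumulator("sad-records")

  // ... run the flatMap above, then force an action:
  parsed.count()
  // Accumulator values are only dependable after the action completes,
  // and can over-count if tasks are retried (see SPARK-12469).
  println(s"parsed=${happyCounter.value} rejected=${sadCounter.value}")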
  23. 23. @holdenkarau Ok but what about those *s ● Beam counters are implementation dependent ● Spark counters aren’t great for data properties ● etc. Miguel Olaya
  24. 24. @holdenkarau General rules for making validation rules ● According to a sad survey, most people only check execution time & record count ● spark-validator is still in early stages but an interesting proof of concept ○ I was probably a bit sleep deprived when I wrote it because looking at it… idk ○ I have a rewrite which is going through our open source releasing process. Maybe it will be released! Not a guarantee. ● Sometimes your rules will misfire and you'll need to manually approve a job ● Historical data & domain-specific solutions are good sources of rules ● Remember those property tests? ○ You should have them! Check out spark-testing-base ○ And you can use them as a basis for validation rules as well Photo by: Paul Schadler
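As an illustrative sketch (not the spark-validator API), a record-count rule checked against history could look something like this; previousCounts stands in for however you persist counts from earlier runs:

  // Hypothetical rule: flag the run if the record count falls outside
  // +/- 20% of the average of recent successful runs.
  def countLooksSane(current: Long, history: Seq[Long], tolerance: Double = 0.2): Boolean = {
    if (history.isEmpty) true // first run: nothing to compare against
    else {
      val avg = history.sum.toDouble / history.size
      math.abs(current - avg) <= tolerance * avg
    }
  }

  if (!countLooksSane(happyCounter.value, previousCounts)) {
    // Don't promote the output; page a human instead.
  }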
  25. 25. @holdenkarau Input Schema Validation ● Handling the “wrong” type of cat ● Many many different approaches ○ filter/flatMap stages ○ Working in Scala/Java? .as[T] ○ Manually specify your schema after doing inference the first time :p ● Unless you're working on mnist.csv there is a good chance your validation is going to be fuzzy (reject some records, accept others) ● How do we know if we’ve rejected too much? Bradley Gordon
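A small sketch of the Scala .as[T] approach from the list above, assuming a SparkSession named spark and a toy case class:

  import org.apache.spark.sql.Encoders
  import spark.implicits._

  case class Cat(name: String, age: Int)

  // Infer the schema once via the case class, then pin it so upstream
  // changes fail loudly instead of being silently re-inferred away.
  val schema = Encoders.product[Cat].schema
  val cats = spark.read.schema(schema).json("cats.json").as[Cat]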
  26. 26. @holdenkarau + You need to understand your domain, like bubbles
  27. 27. @holdenkarau So using names & logging & accs could be:
  rejectedCount = sc.accumulator(0)

  def loggedDivZero(x):
      import logging
      try:
          return [x / 0]
      except Exception as e:
          rejectedCount.add(1)
          logging.warning("Error found " + repr(e))
          return []

  transform1 = data.flatMap(loggedDivZero)
  transform2 = transform1.map(add1)
  transform2.count()
  print("Reject " + str(rejectedCount.value))
  28. 28. @holdenkarau % of data change ● Not just invalid records: if a field’s value changes everywhere it could still be “valid” but have a different meaning ○ Remember that example about almost recommending illegal content? ● Join and see the number of rows that differ on each side ● Expensive operation, but OK if your data changes slowly / at a constant-ish rate ○ Sometimes done as a separate parallel job ● Can also be used on output if applicable ○ You do have a table/file/as-applicable to roll back to, right?
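A rough sketch of the join-based comparison, with made-up paths, column names, and a made-up 5% threshold:

  // Compare today's output against yesterday's by key and measure drift.
  val today = spark.read.parquet("output/2019-04-25")
  val yesterday = spark.read.parquet("output/2019-04-24")

  val changed = today.join(yesterday, Seq("id"))
    .filter(today("value") =!= yesterday("value"))
    .count()

  val changeRate = changed.toDouble / yesterday.count()
  if (changeRate > 0.05) {
    // More than 5% of values moved: hold the release & roll back.
  }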
  29. 29. @holdenkarau Validation rules can be a separate stage(s) ● Sometimes data validation runs in parallel in a separate process ● Combined with counters/metrics from your job ● Can then be compared by a separate job that looks at the results and decides if the pipeline should continue
  30. 30. @holdenkarau TFDV: Magic* ● Counters, schema inference, anomaly detection, oh my!
  # Compute statistics over a new set of data
  new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
  # Compare how new data conforms to the schema
  anomalies = tfdv.validate_statistics(new_stats, schema)
  # Display anomalies inline
  tfdv.display_anomalies(anomalies)
  Details: https://medium.com/tensorflow/introducing-tensorflow-data-validation-data-understanding-validation-and-monitoring-at-scale-d38e3952c2f0
  31. 31. @holdenkarau TFDV: Magic* ● Not exactly in Spark (works with the Beam direct runner) ● Buuut we have the right tools to do the same computation in Spark Cats by moonwhiskers
  32. 32. @holdenkarau What can we learn from TFDV: ● Auto Schema Generation & Comparison ○ Spark SQL yay! ● We can compute summary statistics of your inputs & outputs ○ Spark SQL yay! ● If they change a lot "something" is on fire ● Anomaly detection: a few different spark libraries & talks here ○ Can help show you what might have gone wrong Tim Walker
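For the summary-statistics bullets, a minimal Spark SQL sketch (the stats path and runDate are illustrative):

  // Per-column count/mean/stddev/min/max for this run's input...
  val stats = df.describe()
  stats.show()
  // ...persisted so the next run can diff against it.
  stats.write.mode("overwrite").parquet(s"stats/$runDate")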
  33. 33. @holdenkarau Not just data changes: software too ● Things change! Yay! Often for the better. ○ Especially with handling edge cases like NA fields ○ If you don’t expect the results to change: side-by-side run + diff ● Excellent PyData London talk about how this can impact ML models ○ Done with sklearn; shows vast differences in CV results from changing only the version number Francesco
  34. 34. @holdenkarau Onto ML (or Beyond ETL :p) ● Some of the same principles work (yay!) ○ Schemas, invalid records, etc. ● Some new things to check ○ CV performance, feature normalization ranges ● Some things don’t really work ○ Output size probably isn’t that great a metric anymore ○ Eyeballing the results for override is a lot harder contraption
  35. 35. @holdenkarau Extra considerations for ML jobs: ● Harder to look at output size and say if it’s good ● We can look at the cross-validation performance ● Fixed test set performance ● Number of iterations / convergence rate ● Number of features selected / number of features changed in selection ● (If applicable) delta in model weights or delta in hyperparams Hsu Luke
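For the model-weight delta in the last bullet, a hypothetical sketch for linear models (oldModel, newModel, and maxDelta are all made up for illustration):

  // L2 distance between the coefficient vectors of two linear models.
  val oldW = oldModel.coefficients.toArray
  val newW = newModel.coefficients.toArray
  val delta = math.sqrt(oldW.zip(newW).map { case (a, b) => (a - b) * (a - b) }.sum)
  if (delta > maxDelta) {
    // Weights moved suspiciously far: flag for human review before deploying.
  }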
  36. 36. @holdenkarau Traditional theory (Models) ● Human decides it's time to “update their models” ● Human goes through a model update run-book ● Human does other work while their “big-data” job runs ● Human deploys X% new models ● Looks at graphs ● Presses deploy Andrew
  37. 37. @holdenkarau Traditional practice (Models) ● Human is cornered by stakeholders and forced to update models ● Spends a few hours trying to remember where the guide is ● Gives up and kind of wings it ● Comes back to a trained model ● Human deploys X% models ● Human reads reddit/hacker news/etc. ● Presses deploy Bruno Caimi
  38. 38. @holdenkarau New possible practice (sometimes) ● Computer kicks off job (probably at an hour boundary because *shrug*) to update model ● Workflow tool notices new model is available ● Computer deploys X% models ● Software looks at monitoring graphs, uses a statistical test to see if it’s bad ● Robot rolls it back & pager goes off ● Human presses override and deploys anyway Henrique Pinto
  39. 39. @holdenkarau Updating your model ● The real world changes ● Online learning (streaming) is super cool, but hard to version ○ Common: kappa-like architecture, then revert to a checkpoint ○ Slowly degrading models, oh my! ● Iterative batches: automatically train on new data, deploy model, and A/B test ● But A/B testing isn’t enough: bad data can result in wrong or even illegal results
  40. 40. @holdenkarau Cross-validation because saving a test set is effort ● Trains on X% of the data and tests on Y% ○ Multiple times, switching the samples ● org.apache.spark.ml.tuning has the tools for auto-fitting using CV ○ If you're going to use this for auto-tuning please, please save a test set ○ Otherwise your models will look awesome and perform like a Ford Pinto (or whatever a crappy car is here; maybe a Renault Reliant?) Jonathan Kotta
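A sketch of that org.apache.spark.ml.tuning flow, keeping the held-out test set the slide begs you to save (pipeline is your ml.Pipeline; the split ratio and fold count are illustrative):

  import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
  import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

  // Hold out a test set BEFORE any auto-tuning happens.
  val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42)

  val cv = new CrossValidator()
    .setEstimator(pipeline)
    .setEvaluator(new BinaryClassificationEvaluator())
    .setEstimatorParamMaps(new ParamGridBuilder().build())
    .setNumFolds(3)

  val model = cv.fit(train)
  // Score on data the tuner never saw; gate deployment on this number,
  // not on the rosy cross-validation metric.
  val testScore = new BinaryClassificationEvaluator().evaluate(model.transform(test))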
  41. 41. @holdenkarau False sense of security: ● A/B test please, even if CV says amazing ● Rank-based things can have training bias from previous orders ○ Non-displayed options: unlikely to be chosen ○ Sometimes we can find previous formulaic corrections ○ Sometimes we can “experimentally” determine it ● Other times we just hope it’s better than nothing ● Try and make sure your ML isn’t evil or re-encoding human biases, but stronger
  42. 42. @holdenkarau Some ending notes ● Your validation rules don’t have to be perfect ○ But they should be good enough that they alert infrequently ● You should have a way for the human operator to override. ● Just like tests, try and make your validation rules specific and actionable ○ “# of input rows changed” is not a great message; “table XYZ grew unexpectedly by Y%” is ● While you can use (some of) your tests as a basis for your rules, your rules need tests too ○ e.g. add junk records/pure noise and see if the rules reject it James Petts
  43. 43. @holdenkarau Related Links: ● https://github.com/holdenk/data-pipeline-validator ● Testing Spark Best Practices (Spark Summit 2014) ● https://www.tensorflow.org/tfx/data_validation/get_started ● Spark and Spark Streaming Unit Testing ● Making Spark Unit Testing With Spark Testing Base ● Testing strategy for Apache Spark jobs ● The BEAM programming guide Interested in OSS (especially Spark)? ● Check out my Twitch & Youtube for livestreams - http://twitch.tv/holdenkarau & https://www.youtube.com/user/holdenkarau Becky Lai
  44. 44. @holdenkarau Everything is broken! This (might be) you now. It's better to know when it's broken
  45. 45. @holdenkarau Learning Spark, Fast Data Processing with Spark (Out of Date), Fast Data Processing with Spark (2nd edition), Advanced Analytics with Spark, Spark in Action, High Performance Spark, Learning PySpark
  46. 46. @holdenkarau High Performance Spark! Available today, not a lot on testing and almost nothing on validation, but that should not stop you from buying several copies (if you have an expense account). Cats love it! Amazon sells it: http://bit.ly/hkHighPerfSpark :D
  47. 47. @holdenkarau Sign up for the mailing list @ http://www.distributedcomputing4kids.com
  48. 48. @holdenkarau Want to turn your code into "art"?
  49. 49. @holdenkarau Want to turn your failing code into "art"? https://haute.codes/ It doesn't use Spark* *yet
  50. 50. @holdenkarau And some upcoming talks: ● April ○ Strata London ● May ○ KiwiCoda Mania ○ KubeCon Barcelona ● June ○ Scala Days EU ○ Berlin Buzzwords
  51. 51. @holdenkarau Sparkling Pink Panda Scooter group photo by Kenzi. k thnx bye! (or questions…) If you want to fill out a survey: http://bit.ly/holdenTestingSpark Give feedback on this presentation: http://bit.ly/holdenTalkFeedback I'll be in the hallway and back tomorrow, or you can email me: holden@pigscanfly.ca
  52. 52. @holdenkarau Property generating libs: QuickCheck / ScalaCheck ● QuickCheck (Haskell) generates test data under a set of constraints ● The Scala version is ScalaCheck, supported by the two unit testing libraries for Spark ● sscheck (ScalaCheck for Spark) ○ Awesome people*, supports generating DStreams too! ● spark-testing-base ○ Also awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs *I assume (photo: tara hunt)
  53. 53. @holdenkarau With spark-testing-base
  test("map should not change number of elements") {
    forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
      rdd.map(_.length).count() == rdd.count()
    }
  }
