Validating Big Data Jobs—Stopping Failures Before Production on Apache Spark with Holden Karau
As big data jobs move from the proof-of-concept phase into powering real production services, we have to start considering what will happen when everything eventually goes wrong (such as recommending inappropriate products or other decisions taken on bad data). This talk will attempt to convince you that we will all eventually get aboard the failboat (especially with ~40% of respondents automatically deploying their Spark jobs' results to production), and it's important to automatically recognize when things have gone wrong so we can stop deployment before we have to update our resumes.

Figuring out when things have gone terribly wrong is trickier than it first appears, since we want to catch the errors before our users notice them (or, failing that, before CNN notices them). We will explore general techniques for validation, look at responses from people validating big data jobs in production environments, and libraries that can assist us in writing relative validation rules based on historical data.

For folks working in streaming, we will talk about the unique challenges of attempting to validate in a real-time system, and what we can do besides keeping an up-to-date resume on file for when things go wrong. To keep the talk interesting, real-world examples (with company names removed) will be presented, as well as several Creative Commons licensed cat pictures and an adorable panda GIF.

If you’ve seen Holden’s previous testing Spark talks, this can be viewed as a deep dive on the second half, focused on what else we need to do besides good testing practices to create production-quality pipelines. If you haven’t seen the testing talks, watch those on YouTube after you come see this one.


1. Validating Big Data & ML Pipelines (Apache Spark)
Now mostly “works”*
Melinda Seckington
2. Holden:
● My name is Holden Karau
● Preferred pronouns are she/her
● Developer Advocate at Google
● Apache Spark PMC, contributor to many others (including Airflow)
● Previously IBM, Alpine, Databricks, Google, Foursquare & Amazon
● Co-author of Learning Spark & High Performance Spark
● Twitter: @holdenkarau
● Slideshare: http://www.slideshare.net/hkarau
● Code review livestreams: https://www.twitch.tv/holdenkarau / https://www.youtube.com/user/holdenkarau
● Spark talk videos: http://bit.ly/holdenSparkVideos
● Talk feedback (if you are so inclined): http://bit.ly/holdenTalkFeedback
3. What is going to be covered:
● Why my employer cares about this stuff
● My assumptions about y’all
● A super brief look at property testing
● What validation is & why you should do it for your data pipelines
● How to make simple validation rules & our current limitations
● ML validation - guessing if our black box is “correct”
● Cute & scary pictures
○ I promise at least one cat
Andrew
4. Some of the reasons my employer cares*
● We have a hosted Spark/Hadoop solution (called Dataproc)
● We also have hosted pipeline management tools (based on Airflow, called Cloud Composer)
● Being good open source community members
*Probably; it’s not like I go to all of the meetings I’m invited to.
Khairil Zhafri
5. Who I think you wonderful humans are
● Nice* people
● Like silly pictures
● Possibly familiar with one of Scala, Java, or Python?
● Possibly familiar with Spark?
● Want to make better software
○ (or models, or w/e)
● Or just want to make software good enough to not have to keep your resume up to date
6. So why should you test?
● Makes you a better person
● Avoid making your users angry
● Save $s
○ Having an ML job fail in hour 26 to restart everything can be expensive...
● Waiting for our jobs to fail is a pretty long dev cycle
● Honestly you’re probably not watching this unless you agree
7. So why should you validate?
● tl;dr - Your tests probably aren’t perfect
● You want to know when you're aboard the failboat
● Our code will most likely fail at some point
○ Sometimes data sources fail in new & exciting ways (see “Call Me Maybe”)
○ That jerk on that other floor changed the meaning of a field :(
○ Our tests won’t catch all of the corner cases that the real world finds
● We should try and minimize the impact
○ Avoid making potentially embarrassing recommendations
○ Save having to be woken up at 3am to do a roll-back
○ Specifying a few simple invariants isn’t all that hard
○ Repeating Holden’s mistakes is still not fun
8. So why should you test & validate:
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
9. So why should you test & validate - cont
Results from: Testing with Spark survey http://bit.ly/holdenTestingSpark
10. Why don’t we test?
● It’s hard
○ Faking data, setting up integration tests
● Our tests can get too slow
○ Packaging and building Scala is already sad
● It takes a lot of time
○ and people always want everything done yesterday
○ or I just want to go home and see my partner
○ etc.
● Distributed systems are particularly hard
11. Why don’t we test? (continued)
12. Why don’t we validate?
● We already tested our code
○ Riiiight?
● What could go wrong?
Also extra hard in distributed systems:
● Distributed metrics are hard
● Not much built in (and not very consistent)
● Not always deterministic
● Complicated production systems
13. What happens when we don’t
● Personal stories go here
○ I have no comment about where these stories are from
This talk is being recorded, so we’ll leave it at:
● Negatively impacted the brand in difficult-to-quantify ways with words with multiple meanings
● Breaking a feature that cost a few million dollars
● Almost recommended illegal content (caught by a lucky manual review)
● Every search result was a coffee shop
itsbruce
14. Cat photo from http://galato901.deviantart.com/art/Cat-on-Work-Break-173043455
15. Where do folks get the data for pipeline tests?
● Most people generate data by hand
● If you have production data you can sample, you are lucky!
○ If possible, you can try and save it in the same format
● If our data is a bunch of Vectors or Doubles, Spark’s got tools :)
● Coming up with good test data can take a long time
● Important to test different distributions, input files, empty partitions, etc.
Lori Rielly
16. Property generating libs: QuickCheck / ScalaCheck
● QuickCheck (Haskell) generates test data under a set of constraints
● The Scala version is ScalaCheck - supported by the two unit testing libraries for Spark
● sscheck (ScalaCheck for Spark)
○ Awesome people*, supports generating DStreams too!
● spark-testing-base
○ Also awesome people*, generates more pathological (e.g. empty partitions etc.) RDDs
*I assume
tara hunt
17. With spark-testing-base & a million entries
test("map should not change number of elements") {
  implicit val generatorDrivenConfig =
    PropertyCheckConfig(minSize = 0, maxSize = 1000000)
  val property = forAll(RDDGenerator.genRDD[String](sc)) { rdd =>
    importantBusinessLogicFunction(rdd).count() == rdd.count()
  }
  check(property)
}
18. But that can get a bit slow for all of our tests
● Not all of your tests should need a cluster (or even a cluster simulator)
● If you are OK with not using lambdas everywhere, you can factor out that logic and test with traditional tools
● Or if you want to keep those lambdas - or verify the transformation logic without the overhead of running a local distributed system - you can try a library like kontextfrei
○ Don’t rely on this alone (but it can work well with something like ScalaCheck)
19. Let’s focus on validation some more:
*Can be used during integration tests to further validate integration results
20. So how do we validate our jobs?
● The idea is, at some point, you made software which worked.
● Maybe you manually tested and sampled your results
● Hopefully you did a lot of other checks too
● But we can’t do that every time; our pipelines are no longer write-once, run-once - they are often write-once, run-forever, debug-forever.
Photo by: Paul Schadler
21. How many people have something like this?
val data = ...
val parsed = data.flatMap(x =>
  try {
    Some(parse(x))
  } catch {
    case _ => None // Whatever, it's JSON
  }
)
Lilithis
22. But if we’re going to validate...
val data = ...
data.cache()
val validData = data.filter(isValid)
val badData = data.filter(x => !isValid(x))
if (validData.count() < badData.count()) {
  // Ruh Roh! Special business error handling goes here
}
...
Pager photo by Vitachao CC-SA 3
23. Well that’s less fun :(
● Our optimizer can’t just magically chain everything together anymore
● My flatMap.map.map is fnur :(
● Now I’m blocking on a thing in the driver
Sn.Ho
24. Counters* to the rescue**!
● Spark has built-in counters
○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc.
○ In the UI; can also register a listener from the spark-validator project
● We can add counters for things we care about
○ invalid records, users with no recommendations, etc.
○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option
● We can _pretend_ we still have nice functional code
*Counters/Accumulators are your friends, but the kind of friends who steal your lunch money
**In a similar way to how regular expressions can solve problems….
Miguel Olaya
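A minimal sketch of wiring up counters of our own (assuming an existing SparkContext sc and the Spark 2.x longAccumulator API; the accumulator names here are illustrative). Named accumulators show up in the Spark UI next to the built-in per-stage metrics:

// sc is an existing SparkContext; the names below are hypothetical.
val happyCounter = sc.longAccumulator("validRecords")
val sadCounter = sc.longAccumulator("invalidRecords")
val lonelyCounter = sc.longAccumulator("usersWithNoRecommendations")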
25. First counters free….
26. Just a little bit of code for the next ones….
val parsed = data.flatMap(x =>
  try {
    val result = Some(parse(x))
    happyCounter.add(1)
    result
  } catch {
    case _ =>
      sadCounter.add(1)
      None // Whatever, it's JSON
  }
)
// Special business data logic (aka wordcount)
// Much much later* business error logic goes here
Pager photo by Vitachao CC-SA 3
Phoebe Baker
27. Ok but what about those *s
● Turns out accumulators aren’t really great for tracking data properties
● Turns out sometimes for validation we really care about data properties
● But we can kind of fake it and hope
Miguel Olaya
28. General rules for making validation rules
● According to a sad survey, most people check execution time & record count
● spark-validator is still in early stages, but an interesting proof of concept
● Sometimes your rules will misfire and you’ll need to manually approve a job
● Remember those property tests? They could be validation rules
● Historical data
● Domain-specific solutions
Photo by: Paul Schadler
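For example, a relative rule against historical runs might be sketched like this (a hand-rolled sketch, not the spark-validator API; loadHistoricalCounts is a hypothetical helper for wherever you persist per-run counter values, and the 25% threshold is arbitrary):

// Hypothetical helper: record counts from the last few successful runs.
val history: Seq[Long] = loadHistoricalCounts("recordsRead", lastN = 7)
val mean = history.sum.toDouble / history.size

// Flag the run if today's count drifts more than 25% from the recent mean.
val current = recordsRead.value
if (math.abs(current - mean) > 0.25 * mean) {
  throw new Exception(
    s"recordsRead=$current is far from recent mean $mean - do not publish results")
}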
29. Turning property tests into validation rules*
● Yes, in theory they’re already “tested”, but...
● Common function to check accumulator value between validation & tests
● The real world can be fuzzier
Photo by: Paul Schadler
30. Input schema validation
● Handling the “wrong” type of cat
● Many many different approaches
○ filter/flatMap stages
○ Working in Scala/Java: pre-filter then .as[T]
○ Manually specify your schema after doing inference the first time :p
● Unless you’re working on mnist.csv, there is a good chance your validation is going to be fuzzy (reject some records, accept others)
● How do we know if we’ve rejected too much?
Bradley Gordon
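One sketch of the “manually specify your schema” approach for JSON input (assuming an existing SparkSession spark; the schema fields and path are illustrative):

import org.apache.spark.sql.types._

// Pin the schema instead of re-inferring it on every run, and keep
// corrupt rows around so we can count how much we rejected.
val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("catType", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)))

val raw = spark.read
  .schema(schema)
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .json("/path/to/input")

val bad = raw.filter(raw("_corrupt_record").isNotNull)
val ok = raw.filter(raw("_corrupt_record").isNull).drop("_corrupt_record")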
31. e.g. write our “rule” like:
val (ok, bad) = (sc.accumulator(0), sc.accumulator(0))
val records = input.flatMap { x =>
  if (isValid(x)) ok += 1 else bad += 1
  // Actual parse logic here
}
// An action (e.g. count, save, etc.)
if (bad.value > 0.1 * ok.value) {
  throw new Exception("bad data - do not use results")
  // Optional cleanup
}
// Mark as safe
P.S.: If you are interested in this, check out spark-validator (still early stages).
Found Animals Foundation
32. Validating records read matches our expectations:
val vc = new ValidationConf(tempPath, "1", true,
  List[ValidationRule](
    new AbsoluteSparkCounterValidationRule("recordsRead",
      Some(3000000), Some(10000000))))
val sqlCtx = new SQLContext(sc)
val v = Validation(sc, sqlCtx, vc)
// Business logic goes here
assert(v.validate(5) === true)
Photo by Dvortygirl
33. % of data change
● Not just invalid records: if a field’s value changes everywhere, it could still be “valid” but have a different meaning
○ Remember that example about almost recommending illegal content?
● Join and see the number of rows different on each side
● Expensive operation, but workable if your data changes slowly / at a constant-ish rate
○ Sometimes done as a separate parallel job
● Can also be used on output if applicable
○ You do have a table/file/as applicable to roll back to, right?
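A sketch of the join-and-compare idea between two runs (today and yesterday are illustrative DataFrames keyed by an id column, spark is an existing SparkSession, and the 5% threshold is arbitrary):

import spark.implicits._

// Count rows whose value for a field differs between two runs of the pipeline.
val changed = today.alias("t")
  .join(yesterday.alias("y"), "id")
  .filter($"t.field" =!= $"y.field")
  .count()

if (changed > 0.05 * yesterday.count()) {
  // Every record can be individually "valid" while the field's meaning shifted.
  throw new Exception("field drifted across too many rows - hold the release")
}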
34. Not just data changes: software too
● Things change! Yay! Often for the better.
○ Especially with handling edge cases like NA fields
○ Don’t expect the results to change - side-by-side run + diff
● Blue/green deployments aren’t just for microservices
○ Run your pipeline side-by-side and compare diffs when pushing a new version
○ In CI you can do this on smaller test batches
● Excellent PyData London talk about how this can impact ML models
Francesco
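For the side-by-side run + diff, Spark's except can give a quick structural diff (oldOutput and newOutput are stand-ins for the two pipeline versions' results):

// Rows the old version produced that the new one no longer does, and vice versa.
val regressions = oldOutput.except(newOutput)
val novelties = newOutput.except(oldOutput)

// On a small CI batch, both should usually be empty (or at least explainable).
println(s"removed: ${regressions.count()}, added: ${novelties.count()}")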
35. Onto ML (or Beyond ETL :p)
● Some of the same principles work (yay!)
○ Schemas, invalid records, etc.
● Some new things to check
○ CV performance, feature normalization ranges
● Some things don’t really work
○ Output size probably isn’t that great a metric anymore
○ Eyeballing the results for override is a lot harder
contraption
36. Traditional theory (Models)
● Human decides it's time to “update their models”
● Human goes through a model update run-book
● Human does other work while their “big-data” job runs
● Human deploys X% new models
● Looks at graphs
● Presses deploy
Andrew
37. Traditional practice (Models)
● Human is cornered by stakeholders and forced to update models
● Spends a few hours trying to remember where the guide is
● Gives up and kind of wings it
● Comes back to a trained model
● Human deploys X% models
● Human reads reddit/hacker news/etc.
● Presses deploy
Bruno Caimi
38. New possible practice (sometimes)
● Computer kicks off job (probably at an hour boundary because *shrug*) to update model
● Workflow tool notices new model is available
● Computer deploys X% models
● Software looks at monitoring graphs, uses statistical tests to see if it’s bad
● Robot rolls it back & pager goes off
● Human presses override and deploys anyways
Henrique Pinto
39. Extra considerations for ML jobs:
● Harder to look at output size and say if it’s good
● We can look at the cross-validation performance
● Fixed test set performance
● Number of iterations / convergence rate
● Number of features selected / number of features changed in selection
● (If applicable) delta in model weights or tree size or ...
Jennifer C.
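As one example, the model-weight delta check could be sketched like this (assuming spark.ml linear models; servingModel, candidateModel, and the threshold are placeholders you would tune from history):

import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.linalg.Vectors

// Compare a freshly trained model against the one currently serving.
def weightDelta(oldModel: LogisticRegressionModel,
                newModel: LogisticRegressionModel): Double =
  math.sqrt(Vectors.sqdist(oldModel.coefficients, newModel.coefficients))

// A sudden large jump in weights is worth a human look before deploying.
if (weightDelta(servingModel, candidateModel) > 10.0) {
  println("model weights moved a lot - flag for manual review")
}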
40. Cross-validation
because saving a test set is effort
● Trains on X% of the data and tests on Y%
○ Multiple times, switching the samples
● org.apache.spark.ml.tuning has the tools for auto-fitting using CV
○ If you’re going to use this for auto-tuning, please please save a test set
○ Otherwise your models will look awesome and perform like a Ford Pinto (or whatever a crappy car is here. Maybe a Renault Reliant?)
Jonathan Kotta
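A minimal CrossValidator sketch that still keeps a held-out test set (the estimator, evaluator, and param grid here are illustrative; data is a stand-in DataFrame with label and features columns):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Keep a real test set out of the tuning loop - CV alone will flatter the model.
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

val lr = new LogisticRegression()
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1))
  .build()

val cv = new CrossValidator()
  .setEstimator(lr)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(grid)
  .setNumFolds(3)

val model = cv.fit(train)
// Report performance on data neither the folds nor the grid search ever saw.
val heldOut = new BinaryClassificationEvaluator().evaluate(model.transform(test))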
41. False sense of security:
● A/B test please, even if CV says many many $s
● Rank-based things can have training bias with previous orders
● Non-displayed options: unlikely to be chosen
● Sometimes we can find previous formulaic corrections
● Sometimes we can “experimentally” determine
● Other times we just hope it’s better than nothing
● Try and make sure your ML isn’t evil or re-encoding human biases but stronger
42. Some ending notes
● Your validation rules don’t have to be perfect
○ But they should be good enough that they alert infrequently
○ Occasional overrides are OK
● Your validation rules can live in separate jobs
● Just like tests, try and make your validation rules specific and actionable
○ “Execution time changed” is not a great message - “table XYZ grew unexpectedly to Y%” is better
James Petts
43. Related packages
● spark-testing-base: https://github.com/holdenk/spark-testing-base
● sscheck: https://github.com/juanrh/sscheck
● spark-validator: https://github.com/holdenk/spark-validator *Proof of concept, do not actually use*
● spark-perf: https://github.com/databricks/spark-perf
● spark-integration-tests: https://github.com/databricks/spark-integration-tests
● scalacheck: https://www.scalacheck.org/
Becky Lai
44. Learning Spark
Fast Data Processing with Spark (Out of Date)
Fast Data Processing with Spark (2nd edition)
Advanced Analytics with Spark
Spark in Action
High Performance Spark
Learning PySpark
45. High Performance Spark!
Available today; not a lot on testing and almost nothing on validation, but that should not stop you from buying several copies (if you have an expense account).
Cats love it!
Amazon sells it: http://bit.ly/hkHighPerfSpark :D
46. Sign up for the mailing list @ http://www.distributedcomputing4kids.com
47. Cat wave photo by Quinn Dombrowski
k thnx bye! (or questions…)
If you want to fill out the survey: http://bit.ly/holdenTestingSpark
I will use the updated results & give the talk again the next time Spark adds a major feature.
Give feedback on this presentation: http://bit.ly/holdenTalkFeedback
Have questions? - sli.do: SL18 - Union Grand EF
I’ll be giving another talk tomorrow @ 4:20 PM on ML on Spark Error Messages*
48. The state of serving is generally a mess
● If it’s not ML models, it can be better
○ Reports for everyone!
○ Or database updates for everyone!
● Big challenge: when something goes wrong - how do I fix it?
○ Something will go wrong eventually - do you have an old snapshot you can roll back to quickly?
● One project which aims to improve this for ML is KubeFlow
○ Goal is unifying training & serving experiences
○ Despite the name, targeting more than just TensorFlow
○ Doesn’t work with Spark yet, but it’s on my PR list.
49. Updating your model
● The real world changes
● Online learning (streaming) is super cool, but hard to version
○ Common kappa-like arch and then revert to checkpoint
○ Slowly degrading models, oh my!
● Iterative batches: automatically train on new data, deploy model, and A/B test
● But A/B testing isn’t enough - bad data can result in wrong or even illegal results (ask me after a bud light lime)
Jennifer C.
50. Related talks & blog posts
● Testing Spark Best Practices (Spark Summit 2014)
● Every Day I’m Shuffling (Strata 2015) & slides
● Spark and Spark Streaming Unit Testing
● Making Spark Unit Testing With Spark Testing Base
● Testing strategy for Apache Spark jobs
● The BEAM programming guide
Interested in OSS (especially Spark)?
● Check out my Twitch & YouTube for livestreams - http://twitch.tv/holdenkarau & https://www.youtube.com/user/holdenkarau
Becky Lai
51. And including spark-testing-base up to Spark 2.3.1
sbt:
"com.holdenkarau" %% "spark-testing-base" % "2.3.1_0.10.0" % "test"
Maven:
<dependency>
  <groupId>com.holdenkarau</groupId>
  <artifactId>spark-testing-base_2.11</artifactId>
  <version>${spark.version}_0.10.0</version>
  <scope>test</scope>
</dependency>
Vladimir Pustovit
52. Other options for generating data:
● mapPartitions + Random + custom code
● RandomRDDs in MLlib
○ Uniform, Normal, Poisson, Exponential, Gamma, logNormal & Vector versions
○ Different type: implement the RandomDataGenerator interface
● Random
53. RandomRDDs
val zipRDD = RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
  .map(_.toInt.toString)
val valuesRDD = RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)
  .repartition(zipRDD.partitions.size)
val keyRDD = sc.parallelize(1L.to(rows), zipRDD.getNumPartitions)
keyRDD.zipPartitions(zipRDD, valuesRDD) { (i1, i2, i3) =>
  new Iterator[(Long, String, Vector)] {
    ...
54. Testing libraries:
● Spark unit testing
○ spark-testing-base - https://github.com/holdenk/spark-testing-base
○ sscheck - https://github.com/juanrh/sscheck
● Simplified unit testing (“business logic only”)
○ kontextfrei - https://github.com/dwestheide/kontextfrei *
● Integration testing
○ spark-integration-tests (Spark internals) - https://github.com/databricks/spark-integration-tests
● Performance
○ spark-perf (also for Spark internals) - https://github.com/databricks/spark-perf
● Spark job validation
○ spark-validator - https://github.com/holdenk/spark-validator *
*Early stage, work-in-progress, or proof of concept
Photo by Mike Mozart
55. Let’s talk about local mode
● It’s way better than you would expect*
● It does its best to try and catch serialization errors
● It’s still not the same as running on a “real” cluster
● Especially since if we were just in local mode, parallelize and collect might be fine
Photo by: Bev Sykes
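A sketch of a typical local-mode test fixture (plain SparkSession shown here; spark-testing-base's shared-context traits wire up the equivalent for you):

import org.apache.spark.sql.SparkSession

// local[2]: two worker threads - enough to surface some parallelism bugs
// that local[1] would hide, while still running inside the test JVM.
val spark = SparkSession.builder()
  .master("local[2]")
  .appName("unit-test")
  .getOrCreate()

try {
  // test body goes here
} finally {
  spark.stop()
}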
56. Options beyond local mode:
● Just point at your existing cluster (set master)
● Start one with your shell scripts & change the master
○ Really easy way to plug into existing integration testing
● spark-docker - hack in our own tests
● YarnMiniCluster
○ https://github.com/apache/spark/blob/master/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala
○ In Spark Testing Base, extend SharedMiniCluster
■ Not recommended until after SPARK-10812 (e.g. 1.5.2+ or 1.6+)
Photo by Richard Masoner
57. Integration testing - docker is awesome
● spark-docker, kafka-docker, etc.
○ Not always super up to date sadly - if you are on the last stable release, A-OK; if you build from master - sad pandas
● Or check out JuJu Charms (from Canonical) - https://jujucharms.com/
○ Makes it easy to deploy a bunch of docker containers together & configured in a reasonable way.
58. Setting up integration on Yarn/Mesos
● So lucky!
● You can write your tests in the same way as before - just read from your test data sources
● Missing a data source?
○ Can you sample it or fake it using the techniques from before?
○ If so - do that and save the result to your integration environment
○ If not… well, good luck
● Need streaming integration?
○ You will probably need a second Spark (or other) job to generate the test data
59. “Business logic” only test w/kontextfrei
import com.danielwestheide.kontextfrei.DCollectionOps

trait UsersByPopularityProperties[DColl[_]] extends BaseSpec[DColl] {
  import DCollectionOps.Imports._

  property("Each user appears only once") {
    forAll { starredEvents: List[RepoStarred] =>
      val result = logic.usersByPopularity(unit(starredEvents)).collect().toList
      result.distinct mustEqual result
    }
  }
  … (continued in example/src/test/scala/com/danielwestheide/kontextfrei/example/)
60. Generating Data with Spark
import org.apache.spark.mllib.random.RandomRDDs
...
RandomRDDs.exponentialRDD(sc, mean = 1000, size = rows)
RandomRDDs.normalVectorRDD(sc, numRows = rows, numCols = numCols)