
Powering TensorFlow with big data using Apache Beam, Flink, and Spark (CERN 2019)



Published in: Internet

  1. 1. @holdenkarau Powering TensorFlow with big data With Apache Beam, Flink & Spark bonus KF @holdenkarau
  2. 2. @holdenkarau Slides will be at: CatLoversShow
  3. 3. @holdenkarau Holden: ● Preferred pronouns are she/her ● Developer Advocate at Google ● Apache Spark PMC ● co-author of Learning Spark & High Performance Spark ● Twitter: @holdenkarau ● Slide share ● Code review livestreams ● Spark Talk Videos ● Talk feedback (if you are so inclined) ● Helping organize Data Track @ ITNEXT AMS - CFP Open!
  4. 4. @holdenkarau
  5. 5. @holdenkarau Who I think you wonderful humans are? ● Nice enough people ● Don’t mind pictures of cats ● Maybe somewhat familiar with Tensorflow? ● Maybe somewhat familiar with Beam or Spark or Flink? Lori Erickson
  6. 6. @holdenkarau What will be covered? ● Why we need big data for deep learning ● The state of Java/Python integration ● And why this matters for Tensorflow ● Tools to simplify this (TFT, TFMA, TFDV, etc.) ● Pipelining & validation Then choose your own demo or Q&A: ● TensorFlowOnSpark ● Tensorflow Transform on Apache Beam on {Apache Flink, Dataflow} ● Kubeflow w/Spark
  7. 7. Part of what led to the success of Spark ● Integrated different tools which traditionally required different systems ○ Mahout, Hive, etc. ● e.g. can use the same system to do ML and SQL *Often written in Python! [Diagram: Apache Spark core with Spark SQL, DataFrames & Datasets; Structured Streaming; Streaming; Spark ML; MLlib; Bagel & GraphX; GraphFrames; languages: Scala, Java, Python, & R] Paul Hudson
  8. 8. What is Spark? ● General purpose distributed system ○ With a really nice API including Python :) ● Apache project (one of the most active) ● Much faster than Hadoop Map/Reduce ● Good when too big for a single machine ● Built on top of two abstractions for distributed data: RDDs & Datasets
  9. 9. Why people come to Spark: Well this MapReduce job is going to take 16 hours - how long could it take to learn Spark? dougwoods
  10. 10. Why people come to Spark: My DataFrame won’t fit in memory on my cluster anymore, let alone my MacBook Pro :( Maybe this Spark business will solve that... brownpau
  11. 11. Why this all matters? cuatrok77
  12. 12. What’s the state of non-JVM big data? Most of the tools are built in the JVM, so how do we play together from Python? ● Pickling, Strings, JSON, XML, oh my! Over ● Unix pipes, Sockets, files, and mmapped files (sometimes in the same program) What about if we don’t want to copy the data all the time? ● Or standalone “pure”* re-implementations of everything ○ Reasonable option for things like Kafka where you would have the I/O regardless. ○ Also cool projects like dask (pure python) -- but hard to talk to existing ecosystem David Brown
  13. 13. @holdenkarau The "state" of TF + Big Data ● TensorFlowOnSpark w/basic Apache Arrow ○ Still needs more work ○ New scheduler, with improvements in Spark 3 ● Basic TF Transform on Apache Flink via Apache Beam ● New* Beam architecture allowing for better portability & handling dependencies (like Tensorflow) ● feed_dict + scheduler luck Vladimir Pustovit
  14. 14. @holdenkarau So why do I need to power DL w/Big Data? ● Deep learning is most effective with large sample sets for training ● You need to clean your large datasets ● You also (probably)* need some feature preparation ○ even if you’re looking at mnist.csv you probably have _some_ feature prep ● You need to transform your datasets into the formats your DL wants ● Even if you’re just trying to raise some VC money it's going to go a lot better if you add some keywords about a large proprietary dataset
  15. 15. @holdenkarau TensorFlow isn’t enough on its own ● Enter TFX & friends like Kubeflow ○ Current related TFX OSS components: TF.Transform, TF.Serving (with more coming) ● Alternative 1: Data prep in an "exportable" format and serve with Seldon ○ Yay extra RPCs? ● Alternative 2: piles of custom code re-created at serving time. ○ Yay job security? Jennifer C.
  16. 16. @holdenkarau How do I do feature prep? (old skool) ● Write custom preparation jobs in your favourite big data tool ○ I like Apache Spark, some folks like Apache Beam or Flink. ○ So long as it not ● Run it, train on the prepared data ● Rewrite your feature prep code to run at serving time ○ Error prone and sad
  17. 17. @holdenkarau Enter: TF.Transform ● For pre-processing of your data ○ e.g. where you spend 90% of your dev time anyways ● Integrates into serving time :D ● OSS ● Written in Python ● Runs on top of Apache Beam ○ Works really well on Dataflow ○ On master this can run on Flink, but has bugs currently. ○ Please don’t use this in production today unless you’re on GCP/Dataflow ○ Python 2 only for now Kathryn Yengel
  18. 18. @holdenkarau Defining a Transform processing function

          import tensorflow_transform as tft

          def preprocessing_fn(inputs):
              x = inputs['x']
              y = inputs['y']
              s = inputs['s']
              # Analyzers (mean, min/max, vocabulary) need a full pass over the data.
              x_centered = x - tft.mean(x)
              y_normalized = tft.scale_to_0_1(y)
              s_int = tft.string_to_int(s)
              return {
                  'x_centered': x_centered,
                  'y_normalized': y_normalized,
                  's_int': s_int}
  19. 19. @holdenkarau mean stddev normalize multiply quantiles bucketize Analyzers Reduce (full pass) Implemented as a distributed data pipeline Transforms Instance-to-instance (don’t change batch dimension) Pure TensorFlow
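The analyzer/transform split on this slide can be sketched in plain Python. This is only an illustration of the two-phase idea, not how tf.Transform is implemented: the function names `analyze` and `transform` and the sample rows are made up for the example, and the real analyzers run as a distributed Beam pipeline.

```python
from statistics import mean


def analyze(rows):
    """Analyzer phase: one full pass over the data to compute constants."""
    xs = [row["x"] for row in rows]
    ys = [row["y"] for row in rows]
    return {"x_mean": mean(xs), "y_min": min(ys), "y_max": max(ys)}


def transform(row, constants):
    """Transform phase: instance-to-instance, using only precomputed constants."""
    y_range = constants["y_max"] - constants["y_min"]
    return {
        "x_centered": row["x"] - constants["x_mean"],
        "y_normalized": (row["y"] - constants["y_min"]) / y_range if y_range else 0.0,
    }


# Hypothetical sample data.
rows = [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 20.0}, {"x": 3.0, "y": 30.0}]
constants = analyze(rows)
transformed = [transform(row, constants) for row in rows]
```

Because `transform` depends only on the constants, the same function can be applied one record at a time at serving time, which is the point of keeping the two phases apart.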
  20. 20. @holdenkarau Analyze normalize multiply bucketize constant tensors data mean stddev normalize multiply quantiles bucketize
  21. 21. @holdenkarau Scale to ... Bag of Words / N-Grams Bucketization Feature Crosses tft.ngrams tft.string_to_int tf.string_split tft.scale_to_z_score tft.apply_buckets tft.quantiles tft.string_to_int tf.string_join ... Some common use-cases...
  22. 22. @holdenkarau BEAM Beyond the JVM: Current release ● Works pretty well on Dataflow ● non-JVM BEAM on Apache Flink is relatively early stages ● tl;dr : uses grpc / protobuf ○ Similar to the common design but with more efficient representations (often) ● But exciting new plans to unify the runners and ease the support of different languages (called SDKS) ○ See Emma
  23. 23. @holdenkarau BEAM Beyond the JVM: Master + Experiments ● Common interface for setting up jobs ● Portability framework allows SDK harnesses in arbitrary languages to be kicked off ● Runners ship in their own docker containers (goodbye dependency hell, hello container hell) ○ Also for now rolling containers leaves something to be desired (e.g. edit docker file by hand) ● Hacked up Python SDK works with the new interface ● Go SDK talks to the new interface, still missing some features Nick
  24. 24. @holdenkarau BEAM Beyond the JVM: Master w/ experiments *ish *ish *ish Nick portability
  25. 25. @holdenkarau So what does that look like? Driver Worker 1 Docker grpc Worker K Docker grpc
  26. 26. @holdenkarau Sample of the chicago taxi data:

          for key in taxi.DENSE_FLOAT_FEATURE_KEYS:
              # Preserve this feature as a dense float, setting nan's to the mean.
              outputs[key] = transform.scale_to_z_score(inputs[key])

          for key in taxi.VOCAB_FEATURE_KEYS:
              # Build a vocabulary for this feature.
              outputs[key] = transform.string_to_int(
                  inputs[key], top_k=taxi.VOCAB_SIZE, num_oov_buckets=taxi.OOV_SIZE)

          for key in taxi.BUCKET_FEATURE_KEYS:
              outputs[key] = transform.bucketize(inputs[key], taxi.FEATURE_BUCKET_COUNT)
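To make the `top_k` and `num_oov_buckets` arguments more concrete, here is a rough pure-Python sketch of a frequency-based vocabulary with out-of-vocabulary buckets. The helper names, the tie-breaking rule, and the CRC32 hash are assumptions chosen for the example; tf.Transform's own ordering and hashing may differ.

```python
import zlib
from collections import Counter


def build_vocab(values, top_k):
    """Full pass: keep the top_k most frequent strings, most frequent first."""
    counts = Counter(values)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda value: (-counts[value], value))
    return {value: index for index, value in enumerate(ranked[:top_k])}


def string_to_id(value, vocab, num_oov_buckets):
    """Per-record lookup: known strings map to their index, others hash to an OOV bucket."""
    if value in vocab:
        return vocab[value]
    bucket = zlib.crc32(value.encode("utf-8")) % num_oov_buckets
    return len(vocab) + bucket


# Hypothetical feature values.
companies = ["acme", "acme", "acme", "zed", "zed", "solo"]
vocab = build_vocab(companies, top_k=2)
ids = [string_to_id(c, vocab, num_oov_buckets=3) for c in companies + ["unseen"]]
```

Strings outside the top `top_k` still get a stable integer, so a value first seen at serving time does not break the model input.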
  27. 27. @holdenkarau BEAM Beyond the JVM: The “future” E.g. not now *ish *ish *ish Nick portability *ish *ish
  28. 28. @holdenkarau This seems complicated, options? ● Spoiler: mostly it’s not better ○ Although it tends to be more finished ○ Sometimes it's different ● Different tradeoffs, maybe better for your use case but all tradeoffs Kate Neilan
  29. 29. @holdenkarau A quick detour into PySpark’s internals + + JSON TimOve
  30. 30. @holdenkarau PySpark ● The Python interface to Spark ● Same general technique used as the basis for the C#, R, Julia, etc. interfaces to Spark ● Fairly mature, integrates well-ish into the ecosystem, less a Pythonrific API ● Has some serious performance hurdles from the design
  31. 31. @holdenkarau So what does that look like? Driver py4j Worker 1 Worker K pipe pipe
  32. 32. @holdenkarau And in flink…. Driver custom Worker 1 Worker K mmap mmap
  33. 33. @holdenkarau So how does that impact Py[X] forall X in {Big Data}-{Native Python Big Data} ● Double serialization cost makes everything more expensive ● Python worker startup takes a bit of extra time ● Python memory isn’t controlled by the JVM - easy to go over container limits if deploying on YARN or similar ● Error messages make ~0 sense ● Dependency management makes limited sense ● features aren’t automatically exposed, but exposing them is normally simple
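The double serialization cost can be illustrated with a small single-process simulation. This is a sketch under loud assumptions: `run_python_udf_over_pipe` is a made-up helper, `pickle` stands in for whatever each side really uses, and real systems batch records instead of shipping them one at a time.

```python
import pickle


def run_python_udf_over_pipe(records, udf):
    """Simulate one hop to a Python worker and back, counting bytes copied."""
    bytes_copied = 0
    results = []
    for record in records:
        to_worker = pickle.dumps(record)           # serialize on the way in
        bytes_copied += len(to_worker)
        value = udf(pickle.loads(to_worker))       # deserialize, then run the UDF
        from_worker = pickle.dumps(value)          # serialize on the way out
        bytes_copied += len(from_worker)
        results.append(pickle.loads(from_worker))  # deserialize again
    return results, bytes_copied


# Hypothetical records and UDF.
records = [(i, i * 2) for i in range(1000)]
results, bytes_copied = run_python_udf_over_pipe(records, lambda pair: pair[0] + pair[1])
```

Every record is encoded and decoded twice just to add two integers, which is why trivially cheap Python functions can end up dominated by interchange overhead.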
  34. 34. @holdenkarau TensorFlowOnSpark, everyone loves mnist!

          cluster = TFCluster.run(sc, mnist_dist_dataset.map_fun, args,
                                  args.cluster_size, num_ps, args.tensorboard,
                                  TFCluster.InputMode.SPARK)
          if args.mode == "train":
              cluster.train(dataRDD, args.epochs)

      Lida
  35. 35. @holdenkarau The “future”*: faster interchange ● By future I mean availability today but running it in production is “adventurous” ● Unifying our cross-language experience ○ And not just “normal” languages, CUDA counts yo Tambako The Jaguar
  36. 36. @holdenkarau Andrew Skudder *Arrow: Spark 2.3 and beyond & GPUs & R & Python & …. * *
  37. 37. @holdenkarau What does the future look like?* *Source: *Vendor benchmark. Trust but verify.
  38. 38. @holdenkarau Arrow (a poorly drawn big data view) Logos trademarks of their respective projects Juha Kettunen *ish
  39. 39. @holdenkarau Rewriting your code because why not

          spark.catalog.registerFunction(
              "add", lambda x, y: x + y, IntegerType())

      =>

          add = pandas_udf(lambda x, y: x + y, IntegerType())

      Jennifer C.
  40. 40. @holdenkarau And we can do this in TFOnSpark*:

          unionRDD.foreachPartition(
              TFSparkNode.train(self.cluster_info, self.cluster_meta, qname))

      Will Transform Into something magical (aka fast but unreliable) on the next slide! Delaina Haslam
  41. 41. @holdenkarau Which becomes

          train_func = TFSparkNode.train(self.cluster_info, self.cluster_meta, qname)

          @pandas_udf("int")
          def do_train(inputSeries1, inputSeries2):
              # Sad hack for now
              modified_series = map(lambda x: (x[0], x[1]), zip(inputSeries1, inputSeries2))
              train_func(modified_series)
              return pandas.Series([0] * len(inputSeries1))

      ljmacphee
  42. 42. @holdenkarau And this now looks like: Logos trademarks of their respective projects Juha Kettunen *ish
  43. 43. @holdenkarau So how TF does this relate to TF? ● Tensorflow is in Python (kind of) ● At some point you want to get the data from your big data tool into Tensorflow ● Worst case: you can write out a bunch of files and read them back in ● Possibly better case: you use the things we talked about
  44. 44. Dask: a new beginning? ● Pure* python implementation ● Provides real enough DataFrame interface for distributed data ○ Much more like a pandas DataFrame than Spark’s DataFrames ● Also your standard-ish distributed collections ● Multiple backends ● Primary challenge: interacting with the rest of the big data ecosystem ○ Arrow & friends make this better, but it’s still a bit rough ● There is a proof-of-concept to bootstrap a dask cluster on Spark Lisa Zins
  45. 45. @holdenkarau Ok now what? ● Integrate this into your model serving pipeline of choice ○ Don’t have one or open to change? Check out TFMA which can directly serve it ● There’s a guide (it doesn’t show Flink because not released yet) but steps are similar ○ But you’re not using this in production today anyways? ○ Right? ● Automate your pipeline so you don't have to run it every week by hand ● Validate that your models aren't getting worse Nick Perla
  46. 46. @holdenkarau (Optionally): Putting it together with Kubeflow VIK hotels group "The Machine Learning Toolkit for Kubernetes" - Kubeflow Website
  47. 47. @holdenkarau Introducing* Kubeflow VIK hotels group
  48. 48. @holdenkarau Components Buffet argo automation chainer-job core credentials-pod-preset katib mpi-job mxnet-job openmpi pachyderm pytorch-job Seldon spark tf-serving Paul Harrison
  49. 49. @holdenkarau What are those pipelines? “Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.” - Directed Acyclic Graph (DAG) of “pipeline components” (read “docker containers”) each performing a function.
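The "DAG of pipeline components" idea can be sketched with the standard library alone. The step names below are hypothetical and this does not use the Kubeflow Pipelines SDK; it only shows how declaring dependencies determines a valid run order.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Each key is a pipeline step; the set holds the steps it depends on.
# All step names are made up for this example.
pipeline = {
    "validate_data": set(),
    "transform_features": {"validate_data"},
    "train_model": {"transform_features", "validate_data"},
    "analyze_model": {"train_model"},
    "serve_model": {"analyze_model"},
}

# static_order() yields every step after all of its dependencies.
run_order = list(TopologicalSorter(pipeline).static_order())
```

In a real pipeline each step would be a container, and the orchestrator would also run independent steps in parallel rather than in one flat sequence.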
  50. 50. @holdenkarau Building that pipeline?
  51. 51. @holdenkarau Running that pipeline
  52. 52. @holdenkarau Ok cool, but… we need to validate Results from: Testing with Spark survey
  53. 53. @holdenkarau
  54. 54. @holdenkarau So how do we validate our jobs? ● The idea is, at some point, you made software which worked. ○ If you don’t you probably want to run it a few times and manually validate it ● Maybe you manually tested and sampled your results ● Hopefully you did a lot of other checks too ● But we can’t do that every time, our pipelines are no longer write-once run-once they are often write-once, run forever, and debug-forever.
  55. 55. @holdenkarau Counters* to the rescue**! ● Both BEAM & Spark have their own counters ○ Per-stage bytes r/w, shuffle r/w, record r/w, execution time, etc. ○ In UI can also register a listener from spark validator project ● We can add counters for things we care about ○ invalid records, users with no recommendations, etc. ○ Accumulators have some challenges (see SPARK-12469 for progress) but are an interesting option ● We can _pretend_ we still have nice functional code *Counters are your friends, but the kind of friends who steal your lunch money ** In a similar way to how regular expressions can solve problems…. Miguel Olaya
  56. 56. @holdenkarau So what does that look like?

          val parsed = data.flatMap { x =>
            try {
              val result = parse(x)
              // Count the success before returning the parsed record.
              happyCounter.add(1)
              Some(result)
            } catch {
              case _: Exception =>
                sadCounter.add(1)
                None // What's it's JSON
            }
          }
          // Special business data logic (aka wordcount)
          // Much much later* business error logic goes here

      Pager photo by Vitachao CC-SA 3 Phoebe Baker
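The same counting pattern, sketched in single-process Python for anyone who wants to try it without a cluster. The `counters` object and `parse_or_skip` helper are invented for this example; in Spark or Beam you would use the framework's own accumulators or metrics, which come with the caveats mentioned on the previous slide.

```python
import json
from collections import Counter

counters = Counter()


def parse_or_skip(line):
    """Return a one-element list on success and an empty list on failure (flatMap style)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        counters["sad"] += 1
        return []
    counters["happy"] += 1
    return [record]


# Hypothetical input with two good and two bad records.
lines = ['{"user": 1}', "not json", '{"user": 2}', "{broken"]
parsed = [record for line in lines for record in parse_or_skip(line)]
```

The job keeps going on bad input, and the happy/sad counts are what a validation rule can inspect after the run.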
  57. 57. @holdenkarau General Rules for making Validation rules ● According to a sad survey most people check execution time & record count ● spark-validator is still in early stages but interesting proof of concept ○ I have an updated variant of it that is going through our OSS release process internally ● Sometimes your rules will misfire and you’ll need to manually approve a job ● Do you have property tests? Could be Validation rules ● Historical data ○ what did your counters look like yesterday ● Domain specific solutions ○ The best, but also the most work Photo by: Paul Schadler
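A historical-data rule of the "what did your counters look like yesterday" kind might look roughly like this. The function name, the counter names, and the 50% threshold are all assumptions made for the sketch; this is not spark-validator's API.

```python
def validate_against_history(today, yesterday, max_relative_change=0.5):
    """Return the counters whose relative change from yesterday exceeds the threshold."""
    failures = {}
    for name, previous in yesterday.items():
        current = today.get(name, 0)
        if previous == 0:
            # Any movement away from zero counts as an unbounded change.
            change = float("inf") if current != 0 else 0.0
        else:
            change = abs(current - previous) / previous
        if change > max_relative_change:
            failures[name] = change
    return failures


# Hypothetical counter snapshots from two runs.
yesterday = {"records_read": 1000, "invalid_records": 10}
today = {"records_read": 1100, "invalid_records": 40}
failures = validate_against_history(today, yesterday)
```

Here the 10% growth in records passes while the fourfold jump in invalid records is flagged, which is the case where someone should approve or reject the job by hand.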
  58. 58. @holdenkarau % of data change ● Not just invalid records, if a field’s value changes everywhere it could still be “valid” but have a different meaning ○ Remember that example about almost recommending illegal content? ● Join and see number of rows different on each side ● Expensive operation, but workable if your data changes slowly / at a constant-ish rate ○ Sometimes done as a separate parallel job ● Can also be used on output if applicable ○ You do have a table/file/as applicable to roll back to right?
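The "join and see number of rows different" check can be sketched on two in-memory snapshots keyed by ID. The `percent_changed` helper and the sample data are made up; on real data this would be a distributed full outer join rather than a dictionary lookup.

```python
_MISSING = object()  # sentinel so a missing key never equals a real value


def percent_changed(old_rows, new_rows):
    """Join two snapshots on key and return the percentage of keys whose value differs."""
    all_keys = old_rows.keys() | new_rows.keys()
    if not all_keys:
        return 0.0
    changed = sum(
        1
        for key in all_keys
        if old_rows.get(key, _MISSING) != new_rows.get(key, _MISSING)
    )
    return 100.0 * changed / len(all_keys)


# Hypothetical snapshots: one value changed, one key removed, one key added.
old = {"u1": "cat", "u2": "dog", "u3": "fish", "u4": "bird"}
new = {"u1": "cat", "u2": "DOG", "u3": "fish", "u5": "bird"}
pct = percent_changed(old, new)
```

Three of the five keys differ here, so the result is 60%; a validation rule would compare that figure against whatever rate of change is normal for the dataset.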
  59. 59. @holdenkarau TFDV: Magic* ● Counters, schema inference, anomaly detection, oh my!

          import tensorflow_data_validation as tfdv

          # Compute statistics over a new set of data
          new_stats = tfdv.generate_statistics_from_csv(NEW_DATA)
          # Compare how new data conforms to the schema
          anomalies = tfdv.validate_statistics(new_stats, schema)
          # Display anomalies inline
          tfdv.display_anomalies(anomalies)
  60. 60. @holdenkarau Not just data changes: Software too ● Things change! Yay! Often for the better. ○ Especially with handling edge cases like NA fields ○ Don’t expect the results to change - side-by-side run + diff ● Have an ML model? ○ Welcome to new params - or old params with different default values. ○ We’ll talk more about that later ● Excellent PyData London talk about how this can impact ML models ○ Done with sklearn shows vast differences in CVE results only changing the version number Francesco
  61. 61. @holdenkarau Optional Demos: (or early Q&A) ● Go on beam on Flink Wordcount ● Spark on Kubeflow? ● Tensorflow Transform on Beam on Flink ● TensorflowOnSpark ● Tensorflow Data Validation on Beam On Dataflow
  62. 62. @holdenkarau References ● TFMA + TFT example guide ● Apache Beam github repo (w/early alpha portable Flink support) ● TFMA Example fork for use w/Beam on Flink ● TensorFlowOnSpark ● Spark Deep Learning Pipelines ● flink-tensorflow ● TF.Transform ● Beam portability design ● Beam on Flink + portability R. Crap Mariner
  63. 63. @holdenkarau And some upcoming talks: ● April ○ Spark Summit ○ Strata London ● May ○ KiwiCoda Mania ○ KubeCon Barcelona ● June ○ Scala Days EU ○ Berlin Buzzwords ● July ○ OSCON Portland ○ Skills Matter meetup in London ● August ○ ScalaWorld
  64. 64. @holdenkarau k thnx bye :) Will tweet results “eventually” @holdenkarau Do you want more realistic benchmarks? Share your UDFs! Pssst: Have feedback on the presentation? Give me a shout if you feel comfortable doing so :) Give feedback on this presentation. I have some free books on Spark if anyone wants :) Q&A session this afternoon