
Flink Forward San Francisco 2019: TensorFlow Extended: An end-to-end machine learning platform for TensorFlow - Robert Crowe

As machine learning evolves from experimentation to serving production workloads, so does the need to effectively manage the end-to-end training and production workflow, including model management, versioning, and serving. TFX together with Apache Beam and Apache Flink unlocks new and exciting use cases. Clemens Mewald offers an overview of TensorFlow Extended (TFX), the end-to-end machine learning platform for TensorFlow that powers products across all of Alphabet. Many TFX components rely on the Beam SDK to define portable data processing workflows. This talk explores how the Apache Flink runner for the Apache Beam Python SDK enables TFX pipelines for production-ready machine learning workloads.

  1. 1. Robert Crowe
  2. 2. ML Code
  3. 3. The ML Code is only a small box in a much larger system: Configuration, Data Collection, Data Verification, Feature Extraction, Process Management Tools, Analysis Tools, Machine Resource Management, Serving Infrastructure, Monitoring
  4. 4. An ML pipeline is part of the solution to this problem: Data Ingestion, Data Analysis + Validation, Feature Engineering, Trainer, Model Evaluation and Validation, Serving, Logging, Tuner; Shared Utilities for Garbage Collection and Data Access Controls; Pipeline Storage; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization
  5. 5. TensorFlow Extended (TFX) is an end-to-end ML pipeline for TensorFlow: Data Ingestion, TensorFlow Data Validation, TensorFlow Transform, Estimator or Keras Model, TensorFlow Model Analysis, TensorFlow Serving, Logging, Tuner; Shared Utilities for Garbage Collection and Data Access Controls; Pipeline Storage; Shared Configuration Framework and Job Orchestration; Integrated Frontend for Job Management, Monitoring, Debugging, and Data/Model/Evaluation Visualization
  6. 6. Major Products and Alphabet Bets (incl. …)
  7. 7. Apache Beam:
        ● A unified batch and stream distributed processing API
        ● A set of SDK frontends: Java, Python, Go, Scala, SQL, …
        ● A set of runners which can execute Beam jobs on various backends: Local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
  8. 8. Provides a comprehensive portability framework for data processing pipelines, which allows you to write your pipeline once in your language of choice and run it with minimal effort on your execution engine of choice.
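        To make the portability point concrete, here is a minimal Beam Python sketch (not from the deck; names and options are illustrative): the pipeline body stays the same, and only the runner option changes when moving from local execution to, say, a Flink cluster (which typically also needs job-server/endpoint options).

          import apache_beam as beam
          from apache_beam.options.pipeline_options import PipelineOptions

          def run(runner='DirectRunner'):
            # Swapping in '--runner=FlinkRunner' (plus the Flink job-server
            # options) is, in principle, the only change needed to target Flink.
            options = PipelineOptions(['--runner=%s' % runner])
            with beam.Pipeline(options=options) as p:
              _ = (
                  p
                  | 'Create' >> beam.Create(['flink', 'beam', 'tfx'])
                  | 'Upper' >> beam.Map(str.upper)
                  | 'Print' >> beam.Map(print))

          if __name__ == '__main__':
            run()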
  9. 9. Data Ingestion, TensorFlow Data Validation, TensorFlow Transform, Estimator Model, TensorFlow Model Analysis, Honoring Validation Outcomes, TensorFlow Serving; as components: ExampleGen, StatisticsGen, SchemaGen, Example Validator, Transform, Trainer, Evaluator, Model Validator, Pusher, Model Server. Several of these components are powered by Beam.
  10. 10. A component, e.g. the Model Validator, is a packaged binary or container
  11. 11. Well-defined inputs and outputs: the Model Validator takes the Last Validated Model and the New (Candidate) Model and emits a Validation Outcome
  12. 12. Well-defined configuration: a Config is added alongside the Last Validated Model and the New (Candidate) Model; the Model Validator still emits a Validation Outcome
  13. 13. Context: the Metadata Store provides context for the Config, the Last Validated Model, the New (Candidate) Model, the Model Validator, and the Validation Outcome
  14. 14. With the Metadata Store and Config, the Trainer produces the New (Candidate) Model, the Model Validator emits a Validation Outcome, and the Pusher ships the model to the deployment targets: TensorFlow Serving, TensorFlow Lite, TensorFlow JS, TensorFlow Hub
  15. 15. Task-Aware Pipelines: Transform, Trainer
  16. 16. Task-Aware Pipelines: Transform, Trainer. Task- and Data-Aware Pipelines: Transform and Trainer plus Pipeline + Metadata Storage tracking Training Data, Input Data, Transformed Data, Trained Models, and Deployment
  17. 17. The Metadata Store holds type definitions of Artifacts and their Properties, e.g. Models, Data, Evaluation Metrics
  18. 18. ... plus Execution Records (Runs) of Components, e.g. Runtime Configuration, Inputs + Outputs
  19. 19. ... plus Lineage Tracking across all Executions, e.g. to recurse back to all inputs of a specific artifact
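        The Metadata Store behind this lineage tracking is the ML Metadata (MLMD) library. As a rough sketch of the idea (not code from the talk; the exact API surface depends on the ml-metadata version, and the type and property names here are made up for illustration), artifacts, executions, and the events linking them might be recorded like this:

          from ml_metadata.metadata_store import metadata_store
          from ml_metadata.proto import metadata_store_pb2

          connection_config = metadata_store_pb2.ConnectionConfig()
          connection_config.sqlite.filename_uri = '/tmp/mlmd.sqlite'
          connection_config.sqlite.connection_mode = 3  # READWRITE_OPENCREATE
          store = metadata_store.MetadataStore(connection_config)

          # Type definition of an Artifact and its Properties.
          model_type = metadata_store_pb2.ArtifactType()
          model_type.name = 'SavedModel'
          model_type.properties['version'] = metadata_store_pb2.INT
          model_type_id = store.put_artifact_type(model_type)

          # An Artifact of that type (e.g. a trained model on disk).
          model = metadata_store_pb2.Artifact()
          model.type_id = model_type_id
          model.uri = '/serving/model/1'
          model.properties['version'].int_value = 1
          [model_id] = store.put_artifacts([model])

          # An Execution Record (Run) of a component.
          trainer_type = metadata_store_pb2.ExecutionType()
          trainer_type.name = 'Trainer'
          trainer_type_id = store.put_execution_type(trainer_type)
          trainer_run = metadata_store_pb2.Execution()
          trainer_run.type_id = trainer_type_id
          [run_id] = store.put_executions([trainer_run])

          # Lineage: record that this run produced the model artifact.
          event = metadata_store_pb2.Event()
          event.artifact_id = model_id
          event.execution_id = run_id
          event.type = metadata_store_pb2.Event.OUTPUT
          store.put_events([event])

          # Later, recurse back from the artifact to the runs that touched it.
          lineage_events = store.get_events_by_artifact_ids([model_id])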
  20. 20. The model artifact that was created, as recorded in the Metadata Store
  21. 21. Use-cases enabled by lineage tracking
  22. 22. Use-cases enabled by lineage tracking Compare previous model runs
  23. 23. Use-cases enabled by lineage tracking Compare previous model runs Carry-over state from previous models
  24. 24. Use-cases enabled by lineage tracking Compare previous model runs Carry-over state from previous models Re-use previously computed outputs
  25. 25. Legend: a pipeline is a sequence of Components
  26. 26. Each Component has a Driver and a Publisher that read from and write to the Metadata Store
  27. 27. Each Component also has an Executor, which does the actual work
  28. 28. A TFX Config wires the Components (Driver, Executor, Publisher) together over the Metadata Store
  29. 29. Transform (and similar components): the Driver and Publisher talk to the Metadata Store, while the Executor runs on Beam (e.g. Flink or Dataflow)
  30. 30. Trainer: the Driver and Publisher talk to the Metadata Store, while the Executor runs TensorFlow
  31. 31. Pusher (and similar components): the Driver and Publisher talk to the Metadata Store
  32. 32. def create_pipeline():
            """Implements the Chicago taxi pipeline with TFX."""
            examples = csv_input(os.path.join(data_root, 'simple'))
            example_gen = CsvExampleGen(input_base=examples)
            statistics_gen = StatisticsGen(input_data=...)
            infer_schema = SchemaGen(stats=...)
            validate_stats = ExampleValidator(stats=..., schema=...)
            # Performs transformations and feature engineering in training and serving
            transform = Transform(
                input_data=example_gen.outputs.examples,
                schema=infer_schema.outputs.output,
                module_file=_taxi_module_file)
            trainer = Trainer(...)
            model_analyzer = Evaluator(examples=..., model_exports=...)
            model_validator = ModelValidator(examples=..., model=...)
            pusher = Pusher(model_export=..., model_blessing=..., serving_model_dir=...)
            return [example_gen, statistics_gen, infer_schema, validate_stats, transform,
                    trainer, model_analyzer, model_validator, pusher]

          pipeline = AirflowDAGRunner(_airflow_config).run(create_pipeline())
  33. 33. class Executor(base_executor.BaseExecutor):
            """Generic TFX statsgen executor."""
            ...
            def Do(...) -> None:
              """Computes stats for each split of input using tensorflow_data_validation."""
              ...
              with beam.Pipeline(argv=self._get_beam_pipeline_args()) as p:
                for split, instance in split_to_instance.items():
                  ...
                  output_path = os.path.join(output_uri, _DEFAULT_FILE_NAME)
                  _ = (
                      p
                      | 'ReadData.' + split >> beam.io.ReadFromTFRecord(file_pattern=input_uri)
                      | 'DecodeData.' + split >> tf_example_decoder.DecodeTFExample()
                      | 'GenerateStatistics.' + split >> stats_api.GenerateStatistics(stats_options)
                      | 'WriteStatsOutput.' + split >> beam.io.WriteToTFRecord(
                          output_path, shard_name_template='',
                          coder=beam.coders.ProtoCoder(
                              statistics_pb2.DatasetFeatureStatisticsList)))
                  tf.logging.info('Statistics written to {}.'.format(output_uri))
  34. 34. def preprocessing_fn(inputs):
            ...
            return outputs

          with beam.Pipeline() as pipeline:
            ...
            raw_data = (
                pipeline
                | 'ReadTrainData' >> beam.io.ReadFromText(train_data_file)
                | 'FixCommasTrainData' >> beam.Map(
                    lambda line: line.replace(', ', ','))
                | 'DecodeTrainData' >> MapAndFilterErrors(converter.decode))
            raw_dataset = (raw_data, raw_data_metadata)  # pair the data with its metadata
            transformed_dataset, transform_fn = (
                raw_dataset | tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
            ...
  35. 35. Legend: TFX Config, Metadata Store, Component (Driver and Publisher, Executor)
  36. 36. The same TFX Config, Metadata Store, and Components (Driver and Publisher, Executor) run on the Airflow Runtime, the Kubeflow Runtime, or your own runtime ...
  37. 37. Airflow, Kubeflow Pipelines
  38. 38. TFX: putting it all together. A TFX Config and the Metadata Store drive the pipeline on the Airflow Runtime or the Kubeflow Runtime; Training + Eval Data flow through ExampleGen, StatisticsGen, SchemaGen, Example Validator, Transform, Trainer, Evaluator, Model Validator, and Pusher, out to TensorFlow Serving, TensorFlow Hub, TensorFlow Lite, and TensorFlow JS
  39. 39. Component: ExampleGen
        Inputs and Outputs: Raw Data in (CSV, TF Record); Split TF Record Data out (Training, Eval)
        Configuration:
          examples = csv_input(os.path.join(data_root, 'simple'))
          example_gen = CsvExampleGen(input_base=examples)
  40. 40. Component: StatisticsGen
        Inputs and Outputs: Data (from ExampleGen) in; Statistics out. Visualization available.
        Configuration:
          statistics_gen = StatisticsGen(input_data=example_gen.outputs.examples)
  41. 41. Analyzing Data with TensorFlow Data Validation
  42. 42. Component: SchemaGen
        Inputs and Outputs: Statistics (from StatisticsGen) in; Schema out. Visualization available.
        Configuration:
          infer_schema = SchemaGen(stats=statistics_gen.outputs.output)
  43. 43. Component: ExampleValidator
        Inputs and Outputs: Statistics (from StatisticsGen) and Schema (from SchemaGen) in; Anomalies Report out. Visualization available.
        Configuration:
          validate_stats = ExampleValidator(
              stats=statistics_gen.outputs.output,
              schema=infer_schema.outputs.output)
  44. 44. Component: Transform
        Inputs and Outputs: Data (from ExampleGen), Schema (from SchemaGen), and user Code in; Transform Graph (to Trainer) and Transformed Data out
        Configuration:
          transform = Transform(
              input_data=example_gen.outputs.examples,
              schema=infer_schema.outputs.output,
              module_file=taxi_module_file)
        Code:
          for key in _DENSE_FLOAT_FEATURE_KEYS:
            outputs[_transformed_name(key)] = transform.scale_to_z_score(
                _fill_in_missing(inputs[key]))
          # ...
          outputs[_transformed_name(_LABEL_KEY)] = tf.where(
              tf.is_nan(taxi_fare),
              tf.cast(tf.zeros_like(taxi_fare), tf.int64),
              # Test if the tip was > 20% of the fare.
              tf.cast(
                  tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))),
                  tf.int64))
          # ...
  45. 45. Using TensorFlow Transform for Feature Engineering
  46. 46. Using TensorFlow Transform for Feature Engineering: the same transform graph is used in Training and Serving
  47. 47. Component: Trainer
        Inputs and Outputs: Data and Transform Graph (from Transform), Schema (from SchemaGen), and user Code in; Model(s) out (to Evaluator, Model Validator, Pusher)
        Code: Just TensorFlow :)
        Configuration:
          trainer = Trainer(
              module_file=taxi_module_file,
              transformed_examples=transform.outputs.transformed_examples,
              schema=infer_schema.outputs.output,
              transform_output=transform.outputs.transform_output,
              train_steps=10000,
              eval_steps=5000,
              warm_starting=True)
  48. 48. Component: Evaluator
        Inputs and Outputs: Data (from ExampleGen) and Model (from Trainer) in; Evaluation Metrics out. Visualization available.
        Configuration:
          model_analyzer = Evaluator(
              examples=example_gen.outputs.output,
              eval_spec=taxi_eval_spec,
              model_exports=trainer.outputs.output)
  49. 49. Component: ModelValidator
        Inputs and Outputs: Data (from ExampleGen) and Model (x2, from Trainer) in; Validation Outcome out
        Configuration:
          model_validator = ModelValidator(
              examples=example_gen.outputs.output,
              model=trainer.outputs.output,
              eval_spec=taxi_mv_spec)
        ● Configuration options
          ○ Validate using current eval data
          ○ “Next-day eval”: validate using unseen data
  50. 50. Component: Pusher
        Inputs and Outputs: Validation Outcome (from Model Validator) in; deployment options out
        Configuration:
          pusher = Pusher(
              model_export=trainer.outputs.output,
              model_blessing=model_validator.outputs.blessing,
              serving_model_dir=serving_model_dir)
        ● Block push on validation outcome
        ● Push destinations supported today
          ○ Filesystem (TensorFlow Lite, TensorFlow JS)
          ○ TensorFlow Serving
  51. 51. Apache Beam and Apache Flink
  52. 52. Apache Beam: Sum Per Key in each SDK
        Python: input | Sum.PerKey()
        Java: input.apply(Sum.integersPerKey())
        Go: stats.Sum(s, input)
        SQL: SELECT key, SUM(value) FROM input GROUP BY key
        Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams
  53. 53. PTransforms
        ● More transforms available in Java than Python
        ● Python can invoke Java transforms (coming soon)
          with self.create_pipeline() as p:
            res = (
                p
                | GenerateSequence(start=1, stop=10,
                                   expansion_service=expansion_address))
        GenerateSequence is written in Java
  54. 54. I/O
        ● More I/O available in Java than Python
        ● Python can invoke Java I/O (coming soon)
  55. 55. Beam Java I/O. Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems.
        File-based: FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO
        Messaging: RabbitMqIO, SqsIO, Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT
        Database: Apache Cassandra, Apache Hadoop Input/Output Format, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis
  56. 56. Per element: ParDo (Map, etc.). Every item is processed independently; stateless implementation.
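        A minimal ParDo sketch (illustrative, not from the deck): each element is handled on its own, so the runner is free to distribute elements across workers with no shared state.

          import apache_beam as beam

          class ParseRide(beam.DoFn):
            # Per-element processing: each CSV line is parsed independently.
            def process(self, line):
              fields = line.split(',')
              if len(fields) == 2:  # drop malformed rows
                yield {'trip_id': fields[0], 'fare': float(fields[1])}

          with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create(['t1,10.5', 't2,3.0', 'bad row'])
                | beam.ParDo(ParseRide())
                | beam.Map(print))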
  57. 57. Per key: Combine (Reduce, etc.). Items are grouped by some key and combined. Stateful streaming implementation, but your code doesn't work with state, just an associative & commutative function.
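        A minimal Combine sketch (illustrative, not from the deck): because the combining function is associative and commutative, the runner can keep partial per-key (and per-window) results in its own state; user code never touches that state.

          import apache_beam as beam

          with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create([('a', 1), ('b', 4), ('a', 2), ('a', 3)])
                | 'SumPerKey' >> beam.CombinePerKey(sum)  # ('a', 6), ('b', 4)
                | beam.Map(print))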
  58. 58. Event Time Windowing
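        Windows are assigned from each element's event timestamp (when the event happened), not from when it arrives. A minimal sketch (illustrative, not from the deck):

          import apache_beam as beam
          from apache_beam import window

          with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create([('rides', 1, 1546300800),   # 2019-01-01 00:00 UTC
                               ('rides', 1, 1546301700),   # 00:15
                               ('rides', 1, 1546304400)])  # 01:00
                | 'AddTimestamps' >> beam.Map(
                    lambda kv: window.TimestampedValue((kv[0], kv[1]), kv[2]))
                | 'HourlyWindows' >> beam.WindowInto(window.FixedWindows(60 * 60))
                | 'SumPerKey' >> beam.CombinePerKey(sum)
                | beam.Map(print))  # ('rides', 2) for the first hour, ('rides', 1) for the next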
  59. 59. Classic parallel I/O: non-parallel execution vs. "embarrassingly parallel" (idealized) vs. "embarrassingly parallel" (actual, on most systems), shown as workers over time
  60. 60. Beam's dynamic work rebalancing: workers over time without vs. with dynamic work rebalancing. Beam's APIs make this the default approach.
  61. 61. Beam's dynamic work rebalancing: a classic MapReduce job (read from Google Cloud Storage, GroupByKey, write to Google Cloud Storage) on 400 workers, with Dynamic Work Rebalancing disabled to demonstrate stragglers (X axis: time, total ~20 min.; Y axis: workers), versus the same job with Dynamic Work Rebalancing enabled by Beam’s Splittable DoFn (X axis: time, total ~15 min.; Y axis: workers). Savings!
  62. 62. Dataflow’s Liquid Sharding
        ● Monitors worker progress and identifies stragglers
        ● Asks stragglers to give away part of their unprocessed work (e.g., a sub-range of a file or a key range)
        ● Schedules new work items onto idle workers
        ● Repeats for the next stragglers
        The amount of work to give away is chosen so that the worker is expected to complete soon enough and stop being a straggler. Non-trivial to implement.
  63. 63. How does Beam map to Flink?
  64. 64. Beam’s Flink Runner: ParDo
        Beam: Element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. Bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side input access to additional PCollections.
        Flink Python Runner: Yes, fully supported. ParDo itself, as a per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.
  65. 65. Beam’s Flink Runner: GroupByKey
        Beam: Grouping of key-value pairs per key, window, and pane.
        Flink Python Runner: Yes, fully supported. Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes), the Flink runner uses the Beam code. This guarantees support for all windowing and triggering mechanisms.
  66. 66. Beam’s Flink Runner: Stateful Processing
        Beam: Allows fine-grained access to per-key, per-window persistent state and timers. Timers are integral to stateful processing. Necessary for certain use cases (e.g. high-volume windows which store large amounts of data but typically only access small portions of it; complex state machines; etc.) that are not easily or efficiently addressed via Combine or GroupByKey+ParDo.
        Flink Python Runner: Partially supported (non-merging windows). State is supported for non-merging windows. MapState fully supported.
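        For reference, a minimal sketch of Beam's Python stateful API (illustrative, not from the deck): a DoFn declares a state cell, and the runner keeps that state per key and per window.

          import apache_beam as beam
          from apache_beam.coders import VarIntCoder
          from apache_beam.transforms.userstate import BagStateSpec

          class RunningTotalPerKey(beam.DoFn):
            # Runner-managed bag state, partitioned per key and per window.
            VALUES = BagStateSpec('values', VarIntCoder())

            def process(self, element, values=beam.DoFn.StateParam(VALUES)):
              key, value = element
              values.add(value)
              yield key, sum(values.read())  # running total for this key

          with beam.Pipeline() as p:
            _ = (
                p
                | beam.Create([('a', 1), ('a', 2), ('b', 5)])
                | beam.ParDo(RunningTotalPerKey())
                | beam.Map(print))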
  67. 67. Beam’s Flink Runner: Splittable DoFn (SDF)
        Beam: Allows users to develop DoFns that process a single element in portions ("restrictions"), executed in parallel or sequentially. This supersedes the unbounded and bounded `Source` APIs by supporting all of their features on a per-element basis. See http://s.apache.org/splittable-do-fn. Design is in progress on achieving parity with the Source API regarding progress signals.
        Flink Python Runner: Not supported.
  68. 68. github.com/tensorflow/tfx tensorflow.org/tfx
