TFX (TensorFlow Extended) is an end-to-end machine learning platform for TensorFlow that addresses common challenges in production ML workflows. It provides reusable components such as ExampleGen, Transform, Trainer, and Pusher that together form a complete ML pipeline, and the components communicate with each other via a metadata store. TFX uses Apache Beam and Kubernetes for portability and scalability, with the goal of making ML workflows easier to build, operate, and monitor.
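The "components communicate via a metadata store" idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the TFX or ml-metadata API: each stage records the artifacts it produces in a shared store, and downstream stages look them up by name instead of being wired together directly.

```python
# Toy sketch (not the TFX API): components exchange artifacts
# through a shared metadata store rather than direct calls.
metadata_store = {}  # artifact name -> artifact payload

def example_gen(raw_rows):
    metadata_store["examples"] = raw_rows

def transform():
    examples = metadata_store["examples"]
    metadata_store["transformed"] = [r * 2 for r in examples]

def trainer():
    data = metadata_store["transformed"]
    metadata_store["model"] = sum(data) / len(data)  # "model" = the mean

example_gen([1, 2, 3])
transform()
trainer()
print(metadata_store["model"])  # 4.0
```

Because every input and output passes through the store, any component's run can later be audited or re-executed from its recorded inputs.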
4. [Diagram: high-level TFX platform architecture — Data Ingestion; Data Analysis + Validation; Feature Engineering; Trainer; Tuner; Model Evaluation and Validation; Serving; Logging. Underpinned by shared utilities for garbage collection and data access controls, pipeline storage, a shared configuration framework and job orchestration, and an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization.]
An ML pipeline is part of the solution to this problem
5. [Diagram: the same architecture with the TFX libraries filled in — Data Ingestion; TensorFlow Data Validation; TensorFlow Transform; an Estimator or Keras Model; TensorFlow Model Analysis; TensorFlow Serving; Logging; Tuner; plus the shared utilities, pipeline storage, configuration framework, job orchestration, and integrated frontend shown on the previous slide.]
TensorFlow Extended (TFX) is an end-to-end ML pipeline for TensorFlow.
8. ● A unified batch and stream distributed processing API
● A set of SDK frontends: Java, Python, Go, Scala, SQL, …
● A set of runners that can execute Beam jobs on various backends: local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
9. Beam provides a comprehensive portability framework for data processing pipelines, which allows you to write your pipeline once in your language of choice and run it with minimal effort on your execution engine of choice.
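The "write once, run on your engine of choice" idea can be sketched without Beam at all. In this toy (none of these names are Beam's API), the pipeline is a runner-agnostic list of transforms, and each runner decides how to execute it — serially or with a thread pool:

```python
# Toy sketch of runner portability: the pipeline is defined once,
# and interchangeable "runners" execute it differently.
from concurrent.futures import ThreadPoolExecutor

pipeline = [lambda x: x + 1, lambda x: x * x]  # defined once

def run_serial(pipeline, data):
    out = []
    for item in data:
        for fn in pipeline:
            item = fn(item)
        out.append(item)
    return out

def run_threaded(pipeline, data):
    def apply_all(item):
        for fn in pipeline:
            item = fn(item)
        return item
    with ThreadPoolExecutor() as pool:
        return list(pool.map(apply_all, data))

print(run_serial(pipeline, [1, 2, 3]))    # [4, 9, 16]
print(run_threaded(pipeline, [1, 2, 3]))  # [4, 9, 16]
```

Both runners produce identical results from the same pipeline definition, which is exactly the contract Beam's runners honor at much larger scale.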
10. [Diagram: the TFX pipeline with its components and the libraries behind them — ExampleGen (Data Ingestion); StatisticsGen, SchemaGen, and ExampleValidator (TensorFlow Data Validation); Transform (TensorFlow Transform); Trainer (Estimator Model); Evaluator (TensorFlow Model Analysis); ModelValidator (honoring validation outcomes); and Pusher feeding a Model Server (TensorFlow Serving). Several components are powered by Beam.]
15. [Diagram: interactions through the Metadata Store — the Trainer reads its config and the last validated model and produces a new (candidate) model; the ModelValidator compares the new model against the last validated model and records a validation outcome; the Pusher reads the new (candidate) model and its validation outcome and pushes it to a deployment target.]
Deployment targets:
TensorFlow Serving
TensorFlow Lite
TensorFlow JS
TensorFlow Hub
20. What the Metadata Store contains:
● Trained models
● Type definitions of artifacts and their properties (e.g., models, data, evaluation metrics)
● Execution records (runs) of components (e.g., runtime configuration, inputs + outputs)
21. Same as the previous slide, plus lineage tracking across all executions (e.g., to trace back to all inputs of a specific artifact).
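Lineage tracking falls out naturally once execution records are stored. A small sketch (hypothetical data model, not ml-metadata's API): each execution record maps an output artifact to the inputs that produced it, so we can recurse from any artifact back to every upstream input.

```python
# Toy lineage model: output artifact -> inputs of the execution
# that produced it.
executions = {
    "model": ["transformed_examples", "schema"],
    "transformed_examples": ["raw_examples", "schema"],
    "schema": ["statistics"],
    "statistics": ["raw_examples"],
}

def lineage(artifact, seen=None):
    """Return every upstream artifact a given artifact depends on."""
    seen = set() if seen is None else seen
    for parent in executions.get(artifact, []):
        if parent not in seen:
            seen.add(parent)
            lineage(parent, seen)
    return seen

print(sorted(lineage("model")))
# ['raw_examples', 'schema', 'statistics', 'transformed_examples']
```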
47. Component: ExampleGen

Configuration:

examples = csv_input(os.path.join(data_root, 'simple'))
example_gen = CsvExampleGen(input_base=examples)

Inputs and outputs: raw data in (e.g., CSV); TFRecord data out, split into training and eval sets.
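Conceptually, an ExampleGen-style component reads raw records and deterministically splits them into training and eval sets. A rough sketch with the standard library only — the hash-based 2:1 split below is an illustrative choice, not TFX's exact scheme:

```python
# Toy ingest-and-split: read CSV rows, then split deterministically
# by hashing each record so re-runs produce the same split.
import csv, hashlib, io

raw = "fare,tips\n10.0,3.0\n8.0,0.5\n20.0,5.0\n5.0,0.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

train, eval_ = [], []
for row in rows:
    digest = hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()
    # ~2/3 of hash buckets go to train, ~1/3 to eval.
    (train if int(digest, 16) % 3 < 2 else eval_).append(row)

print(len(train) + len(eval_))  # 4 (every row lands in exactly one split)
```

Hashing the record (rather than using `random`) keeps the split stable across pipeline re-runs, which matters for reproducibility.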
52. Component: Transform

Configuration:

transform = Transform(
    input_data=example_gen.outputs.examples,
    schema=infer_schema.outputs.output,
    module_file=taxi_module_file)

Code (in the module file):

for key in _DENSE_FLOAT_FEATURE_KEYS:
  outputs[_transformed_name(key)] = transform.scale_to_z_score(
      _fill_in_missing(inputs[key]))
# ...
outputs[_transformed_name(_LABEL_KEY)] = tf.where(
    tf.is_nan(taxi_fare),
    tf.cast(tf.zeros_like(taxi_fare), tf.int64),
    # Test if the tip was > 20% of the fare.
    tf.cast(
        tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))
# ...

Inputs and outputs: data (from ExampleGen) and a schema (from SchemaGen) in; a transform graph and transformed data out, consumed by the Trainer.
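What the preprocessing code above computes can be sketched in plain Python: z-score scaling for dense features, and a binary label that is 1 when the tip exceeds 20% of the fare (0 when the fare is missing). The sample values are made up for illustration:

```python
# Plain-Python sketch of the two transforms in the module file above.
import math

def scale_to_z_score(values):
    """Center to mean 0 and scale to unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

fares = [10.0, 8.0, 20.0, 5.0]
tips = [3.0, 0.5, 5.0, 0.0]

# Label: was the tip more than 20% of the fare? (0 if fare is NaN.)
labels = [
    0 if math.isnan(fare) else int(tip > 0.2 * fare)
    for fare, tip in zip(fares, tips)
]
print(labels)  # [1, 0, 1, 0]
```

The real component does the same math inside a TensorFlow graph, so the identical transformation is replayed at serving time with no training/serving skew.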
55. Component: Trainer

Configuration:

trainer = Trainer(
    module_file=taxi_module_file,
    transformed_examples=transform.outputs.transformed_examples,
    schema=infer_schema.outputs.output,
    transform_output=transform.outputs.transform_output,
    train_steps=10000,
    eval_steps=5000,
    warm_starting=True)

Code: just TensorFlow :)

Inputs and outputs: transformed data and the transform graph (from Transform) plus a schema (from SchemaGen) in; trained model(s) out, consumed by the Evaluator, ModelValidator, and Pusher.
56. Component: Evaluator

Configuration:

model_analyzer = Evaluator(
    examples=example_gen.outputs.examples,
    eval_spec=taxi_eval_spec,
    model_exports=trainer.outputs.output)

Inputs and outputs: data (from ExampleGen) and a model (from the Trainer) in; evaluation metrics out, with visualization.
57. Component: ModelValidator

Configuration:

model_validator = ModelValidator(
    examples=example_gen.outputs.examples,
    model=trainer.outputs.output,
    eval_spec=taxi_mv_spec)

Inputs and outputs: data (from ExampleGen) and two models (the candidate from the Trainer and the last blessed model) in; a validation outcome out.

● Configuration options
○ Validate using current eval data
○ “Next-day eval”: validate using unseen data
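The validation logic itself can be sketched simply (this is an illustrative toy, not the ModelValidator implementation): evaluate the candidate and the last blessed model on the same eval data, and bless the candidate only if it does not regress.

```python
# Toy model validation: bless the candidate only if it is at least
# as accurate as the currently blessed model on held-out data.
def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

eval_data = [(0, 0), (1, 1), (2, 0), (3, 1)]  # (input, label) pairs

blessed = lambda x: 0        # old model: always predicts 0 (50% here)
candidate = lambda x: x % 2  # new model: predicts parity (100% here)

outcome = accuracy(candidate, eval_data) >= accuracy(blessed, eval_data)
print(outcome)  # True -> the Pusher may deploy the candidate
```

"Next-day eval" is the same comparison run on data the candidate has never seen, which guards against the candidate merely having memorized its training window.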
60. Apache Beam: Sum Per Key across SDK frontends

Python: input | Sum.PerKey()
Java: input.apply(Sum.integersPerKey())
Go: stats.Sum(s, input)
SQL: SELECT key, SUM(value) FROM input GROUP BY key

Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams.
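All four frontends above express the same computation. What "Sum Per Key" means, sketched with a plain dictionary:

```python
# Group (key, value) pairs by key and sum the values per key —
# the semantics behind Sum.PerKey() in every Beam SDK.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

sums = defaultdict(int)
for key, value in pairs:
    sums[key] += value

print(dict(sums))  # {'a': 4, 'b': 6, 'c': 5}
```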
61. PTransforms
● More transforms are available in Java than in Python
● Python can invoke Java transforms (coming soon):

with self.create_pipeline() as p:
  res = (
      p | GenerateSequence(start=1, stop=10,
                           expansion_service=expansion_address))

GenerateSequence is written in Java.
62. I/O
● More I/O connectors are available in Java than in Python
● Python can invoke Java I/O (coming soon)
63. Beam Java I/O connectors by category

File-based: Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems. FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO.

Messaging: RabbitMqIO, SqsIO, Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT.

Database: Apache Cassandra, Apache Hadoop Input/Output Format, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis.
64. Per-element: ParDo (Map, etc.) — every item is processed independently, with a stateless implementation.
65. Per-key: Combine (Reduce, etc.) — items are grouped by some key and combined. The implementation is stateful streaming, but your code doesn't work with state; you supply just an associative & commutative function.
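Why must the combine function be associative and commutative? Because the runner is then free to compute partial results per shard, in any order, and merge them. A sketch of what happens under the hood:

```python
# The runner may split the data across workers however it likes,
# combine each shard independently, and merge the partials —
# valid only because addition is associative and commutative.
from functools import reduce
from operator import add

values = [3, 1, 4, 1, 5, 9, 2, 6]

shard_a, shard_b = values[:3], values[3:]   # arbitrary sharding
partial_a = reduce(add, shard_a)            # combined on worker A
partial_b = reduce(add, shard_b)            # combined on worker B

# Merging partials gives the same answer as a single sequential pass.
assert add(partial_a, partial_b) == reduce(add, values)
print(add(partial_a, partial_b))  # 31
```

Any function with these two properties (sum, max, set union, …) gets the same distributed treatment for free.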
67. Classic parallel I/O
[Figure: workers vs. time for three cases — non-parallel execution, idealized "embarrassingly parallel" execution, and actual "embarrassingly parallel" execution on most systems.]
68. Beam's dynamic work rebalancing
[Figure: workers vs. time without and with dynamic work rebalancing.]
Beam's APIs make this the default approach.
69. Beam's dynamic work rebalancing
[Figure, left: a classic MapReduce job (read from Google Cloud Storage, GroupByKey, write to Google Cloud Storage) on 400 workers, with dynamic work rebalancing disabled to demonstrate stragglers. X axis: time (total ~20 min.); Y axis: workers.]
[Figure, right: the same job with dynamic work rebalancing enabled by Beam’s Splittable DoFn. X axis: time (total ~15 min.); Y axis: workers. Savings!]
70. Dataflow’s Liquid Sharding
● Monitors worker progress and identifies stragglers
● Asks stragglers to give away part of their unprocessed work (e.g., a sub-range of a file or a key range)
● Schedules the new work items onto idle workers
● Repeats for the next stragglers

The amount of work to give away is chosen so that the worker is expected to complete soon enough to stop being a straggler. This is non-trivial to implement.
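The rebalancing loop above can be illustrated with a toy simulation (made-up numbers and a naive halving policy, not Dataflow's actual algorithm): a straggler gives away part of its unprocessed range to an idle worker.

```python
# Toy rebalancing step: the straggler splits its remaining byte
# range and hands the tail half to an idle worker.
remaining = {"w1": (0, 800), "w2": (800, 900), "w3": None}  # w3 is idle

start, stop = remaining["w1"]       # w1 is the straggler
mid = (start + stop) // 2           # naive policy: give away half
remaining["w1"] = (start, mid)
remaining["w3"] = (mid, stop)

print(remaining)
# {'w1': (0, 400), 'w2': (800, 900), 'w3': (400, 800)}
```

The real system picks the split point from progress estimates, so the shrunken range is expected to finish around the same time as everyone else's.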
72. Beam’s Flink Runner: ParDo

Beam ParDo: an element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. The bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side-input access to additional PCollections.

Flink Python Runner — Yes: fully supported. ParDo itself, as a per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.
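The bundle lifecycle described above can be sketched with an illustrative class (not Beam's actual DoFn API): the runner picks the bundle size, and the start/finish hooks run once per bundle rather than once per element.

```python
# Toy DoFn with bundle hooks, driven by a toy runner that decides
# how elements are bundled — user code never controls bundle size.
class CountingFn:
    def __init__(self):
        self.log = []
    def start_bundle(self):
        self.log.append("start")
    def process(self, element):
        self.log.append(element * 10)
    def finish_bundle(self):
        self.log.append("finish")

def run(fn, elements, bundle_size):
    for i in range(0, len(elements), bundle_size):
        fn.start_bundle()
        for e in elements[i:i + bundle_size]:
            fn.process(e)
        fn.finish_bundle()

fn = CountingFn()
run(fn, [1, 2, 3], bundle_size=2)
print(fn.log)  # ['start', 10, 20, 'finish', 'start', 30, 'finish']
```

Bundle hooks are where real DoFns amortize expensive work, such as opening a connection once per bundle instead of once per element.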
73. Beam’s Flink Runner: GroupByKey

Beam GroupByKey: grouping of key-value pairs per key, window, and pane.

Flink Python Runner — Yes: fully supported. Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes), the Flink runner uses the Beam code, which guarantees support for all windowing and triggering mechanisms.
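"Per key, window, and pane" means elements are grouped by (key, window), not by key alone. A sketch using fixed windows of 10 time units (the window size and events are illustrative):

```python
# Group (key, value, timestamp) events by key AND fixed window.
from collections import defaultdict

WINDOW = 10
events = [("a", 1, 3), ("a", 2, 7), ("a", 5, 12), ("b", 9, 4)]

groups = defaultdict(list)
for key, value, ts in events:
    window = ts // WINDOW            # which fixed window this falls in
    groups[(key, window)].append(value)

print(dict(groups))
# {('a', 0): [1, 2], ('a', 1): [5], ('b', 0): [9]}
```

Note that key "a" produces two separate groups because its events span two windows.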
74. Beam’s Flink Runner: Stateful Processing

Beam stateful processing: allows fine-grained access to per-key, per-window persistent state and timers. Timers are integral to stateful processing. Necessary for certain use cases (e.g., high-volume windows that store large amounts of data but typically access only small portions of it; complex state machines) that are not easily or efficiently addressed via Combine or GroupByKey+ParDo.

Flink Python Runner — Partially: non-merging windows. State is supported for non-merging windows; MapState is fully supported.
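Per-key state can be sketched in a few lines (illustrative, not Beam's state API): a stateful step keeps a running total per key that persists across elements of the stream.

```python
# Toy stateful processing: per-key state survives across elements,
# so each element can read and update its key's running total.
from collections import defaultdict

state = defaultdict(int)  # per-key persistent state

def stateful_process(key, value):
    state[key] += value           # read-modify-write the key's state
    return (key, state[key])      # emit the running total

outputs = [stateful_process(k, v)
           for k, v in [("a", 1), ("b", 5), ("a", 2), ("a", 3)]]
print(outputs)  # [('a', 1), ('b', 5), ('a', 3), ('a', 6)]
```

Timers add the other half of the model: a way to schedule a callback per key (e.g., "flush this key's state at window end") that plain ParDo cannot express.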
75. Beam’s Flink Runner: Splittable DoFn

Beam Splittable DoFn (SDF): allows users to develop DoFns that process a single element in portions ("restrictions"), executed in parallel or sequentially. This supersedes the unbounded and bounded `Source` APIs by supporting all of their features on a per-element basis. See http://s.apache.org/splittable-do-fn. Design is in progress on achieving parity with the Source API regarding progress signals.

Flink Python Runner — Not supported.
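The restriction idea behind SDF can be sketched simply (illustrative functions, not Beam's SDF API): a single element, such as a large file, carries a restriction describing the work it represents, and that restriction can be split into pieces processed independently — even in parallel.

```python
# Toy Splittable-DoFn: one element ("a big file") is split into
# restrictions (byte sub-ranges) that are processed independently.
element = {"name": "big_file", "size": 100}

def initial_restriction(element):
    return (0, element["size"])

def split(restriction, pieces):
    start, stop = restriction
    step = (stop - start) // pieces
    return [(start + i * step, start + (i + 1) * step)
            for i in range(pieces)]

def process(element, restriction):
    start, stop = restriction
    return f"read {element['name']}[{start}:{stop}]"

restrictions = split(initial_restriction(element), 4)
print([process(element, r) for r in restrictions])
# ['read big_file[0:25]', 'read big_file[25:50]',
#  'read big_file[50:75]', 'read big_file[75:100]']
```

Because splitting applies per element, the same mechanism that parallelizes one huge file also enables the dynamic work rebalancing shown on slide 69.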