TFX (TensorFlow Extended) is an end-to-end machine learning platform for TensorFlow that addresses common challenges in production ML workflows. It provides reusable components such as ExampleGen, Transform, Trainer, and Pusher that together form a complete ML pipeline, and the components communicate with each other via a metadata store. TFX uses Apache Beam and Kubernetes for portability and scalability, with the goal of making ML workflows easier to build, operate, and monitor.
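The "components communicate via a metadata store" idea can be sketched in a few lines of plain Python. This is an illustrative toy, not the TFX or ml-metadata API: each stage records the artifacts it produces in a shared store, and downstream stages look them up by name instead of being wired together directly.

```python
# Toy sketch (not the TFX API): components exchange artifacts
# through a shared metadata store rather than direct calls.
metadata_store = {}  # artifact name -> artifact payload

def example_gen(raw_rows):
    metadata_store["examples"] = raw_rows

def transform():
    examples = metadata_store["examples"]
    metadata_store["transformed"] = [r * 2 for r in examples]

def trainer():
    data = metadata_store["transformed"]
    metadata_store["model"] = sum(data) / len(data)  # "model" = the mean

example_gen([1, 2, 3])
transform()
trainer()
print(metadata_store["model"])  # 4.0
```

Because every input and output passes through the store, any component's run can later be audited or re-executed from its recorded inputs.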
4. [Diagram: high-level TFX platform architecture — Data Ingestion; Data Analysis + Validation; Feature Engineering; Trainer; Tuner; Model Evaluation and Validation; Serving; Logging. Underpinned by shared utilities for garbage collection and data access controls, pipeline storage, a shared configuration framework and job orchestration, and an integrated frontend for job management, monitoring, debugging, and data/model/evaluation visualization.]
An ML pipeline is part of the solution to this problem
5. [Diagram: the same architecture with the TFX libraries filled in — Data Ingestion; TensorFlow Data Validation; TensorFlow Transform; an Estimator or Keras Model; TensorFlow Model Analysis; TensorFlow Serving; Logging; Tuner; plus the shared utilities, pipeline storage, configuration framework, job orchestration, and integrated frontend shown on the previous slide.]
TensorFlow Extended (TFX) is an end-to-end ML pipeline for TensorFlow.
8. ● A unified batch and stream distributed processing API
● A set of SDK frontends: Java, Python, Go, Scala, SQL, …
● A set of runners that can execute Beam jobs on various backends: local, Apache Flink, Apache Spark, Apache Gearpump, Apache Samza, Apache Hadoop, Google Cloud Dataflow, …
9. Beam provides a comprehensive portability framework for data processing pipelines, which allows you to write your pipeline once in your language of choice and run it with minimal effort on your execution engine of choice.
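The "write once, run on your engine of choice" idea can be sketched without Beam at all. In this toy (none of these names are Beam's API), the pipeline is a runner-agnostic list of transforms, and each runner decides how to execute it — serially or with a thread pool:

```python
# Toy sketch of runner portability: the pipeline is defined once,
# and interchangeable "runners" execute it differently.
from concurrent.futures import ThreadPoolExecutor

pipeline = [lambda x: x + 1, lambda x: x * x]  # defined once

def run_serial(pipeline, data):
    out = []
    for item in data:
        for fn in pipeline:
            item = fn(item)
        out.append(item)
    return out

def run_threaded(pipeline, data):
    def apply_all(item):
        for fn in pipeline:
            item = fn(item)
        return item
    with ThreadPoolExecutor() as pool:
        return list(pool.map(apply_all, data))

print(run_serial(pipeline, [1, 2, 3]))    # [4, 9, 16]
print(run_threaded(pipeline, [1, 2, 3]))  # [4, 9, 16]
```

Both runners produce identical results from the same pipeline definition, which is exactly the contract Beam's runners honor at much larger scale.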
10. [Diagram: the TFX pipeline with its components and the libraries behind them — ExampleGen (Data Ingestion); StatisticsGen, SchemaGen, and ExampleValidator (TensorFlow Data Validation); Transform (TensorFlow Transform); Trainer (Estimator Model); Evaluator (TensorFlow Model Analysis); ModelValidator (honoring validation outcomes); and Pusher feeding a Model Server (TensorFlow Serving). Several components are powered by Beam.]
15. [Diagram: interactions through the Metadata Store — the Trainer reads its config and the last validated model and produces a new (candidate) model; the ModelValidator compares the new model against the last validated model and records a validation outcome; the Pusher reads the new (candidate) model and its validation outcome and pushes it to a deployment target.]
Deployment targets:
TensorFlow Serving
TensorFlow Lite
TensorFlow JS
TensorFlow Hub
20. What the Metadata Store contains:
● Trained models
● Type definitions of artifacts and their properties (e.g., models, data, evaluation metrics)
● Execution records (runs) of components (e.g., runtime configuration, inputs + outputs)
21. Same as the previous slide, plus lineage tracking across all executions (e.g., to trace back to all inputs of a specific artifact).
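Lineage tracking falls out naturally once execution records are stored. A small sketch (hypothetical data model, not ml-metadata's API): each execution record maps an output artifact to the inputs that produced it, so we can recurse from any artifact back to every upstream input.

```python
# Toy lineage model: output artifact -> inputs of the execution
# that produced it.
executions = {
    "model": ["transformed_examples", "schema"],
    "transformed_examples": ["raw_examples", "schema"],
    "schema": ["statistics"],
    "statistics": ["raw_examples"],
}

def lineage(artifact, seen=None):
    """Return every upstream artifact a given artifact depends on."""
    seen = set() if seen is None else seen
    for parent in executions.get(artifact, []):
        if parent not in seen:
            seen.add(parent)
            lineage(parent, seen)
    return seen

print(sorted(lineage("model")))
# ['raw_examples', 'schema', 'statistics', 'transformed_examples']
```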
47. Component: ExampleGen

Configuration:

examples = csv_input(os.path.join(data_root, 'simple'))
example_gen = CsvExampleGen(input_base=examples)

Inputs and outputs: raw data in (e.g., CSV); TFRecord data out, split into training and eval sets.
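Conceptually, an ExampleGen-style component reads raw records and deterministically splits them into training and eval sets. A rough sketch with the standard library only — the hash-based 2:1 split below is an illustrative choice, not TFX's exact scheme:

```python
# Toy ingest-and-split: read CSV rows, then split deterministically
# by hashing each record so re-runs produce the same split.
import csv, hashlib, io

raw = "fare,tips\n10.0,3.0\n8.0,0.5\n20.0,5.0\n5.0,0.0\n"
rows = list(csv.DictReader(io.StringIO(raw)))

train, eval_ = [], []
for row in rows:
    digest = hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()
    # ~2/3 of hash buckets go to train, ~1/3 to eval.
    (train if int(digest, 16) % 3 < 2 else eval_).append(row)

print(len(train) + len(eval_))  # 4 (every row lands in exactly one split)
```

Hashing the record (rather than using `random`) keeps the split stable across pipeline re-runs, which matters for reproducibility.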
52. Component: Transform

Configuration:

transform = Transform(
    input_data=example_gen.outputs.examples,
    schema=infer_schema.outputs.output,
    module_file=taxi_module_file)

Code (in the module file):

for key in _DENSE_FLOAT_FEATURE_KEYS:
  outputs[_transformed_name(key)] = transform.scale_to_z_score(
      _fill_in_missing(inputs[key]))
# ...
outputs[_transformed_name(_LABEL_KEY)] = tf.where(
    tf.is_nan(taxi_fare),
    tf.cast(tf.zeros_like(taxi_fare), tf.int64),
    # Test if the tip was > 20% of the fare.
    tf.cast(
        tf.greater(tips, tf.multiply(taxi_fare, tf.constant(0.2))), tf.int64))
# ...

Inputs and outputs: data (from ExampleGen) and a schema (from SchemaGen) in; a transform graph and transformed data out, consumed by the Trainer.
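What the preprocessing code above computes can be sketched in plain Python: z-score scaling for dense features, and a binary label that is 1 when the tip exceeds 20% of the fare (0 when the fare is missing). The sample values are made up for illustration:

```python
# Plain-Python sketch of the two transforms in the module file above.
import math

def scale_to_z_score(values):
    """Center to mean 0 and scale to unit variance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

fares = [10.0, 8.0, 20.0, 5.0]
tips = [3.0, 0.5, 5.0, 0.0]

# Label: was the tip more than 20% of the fare? (0 if fare is NaN.)
labels = [
    0 if math.isnan(fare) else int(tip > 0.2 * fare)
    for fare, tip in zip(fares, tips)
]
print(labels)  # [1, 0, 1, 0]
```

The real component does the same math inside a TensorFlow graph, so the identical transformation is replayed at serving time with no training/serving skew.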
55. Component: Trainer

Configuration:

trainer = Trainer(
    module_file=taxi_module_file,
    transformed_examples=transform.outputs.transformed_examples,
    schema=infer_schema.outputs.output,
    transform_output=transform.outputs.transform_output,
    train_steps=10000,
    eval_steps=5000,
    warm_starting=True)

Code: just TensorFlow :)

Inputs and outputs: transformed data and the transform graph (from Transform) plus a schema (from SchemaGen) in; trained model(s) out, consumed by the Evaluator, ModelValidator, and Pusher.
56. Component: Evaluator

Configuration:

model_analyzer = Evaluator(
    examples=example_gen.outputs.examples,
    eval_spec=taxi_eval_spec,
    model_exports=trainer.outputs.output)

Inputs and outputs: data (from ExampleGen) and a model (from the Trainer) in; evaluation metrics out, with visualization.
57. Component: ModelValidator

Configuration:

model_validator = ModelValidator(
    examples=example_gen.outputs.examples,
    model=trainer.outputs.output,
    eval_spec=taxi_mv_spec)

Inputs and outputs: data (from ExampleGen) and two models (the candidate from the Trainer and the last blessed model) in; a validation outcome out.

● Configuration options
○ Validate using current eval data
○ “Next-day eval”: validate using unseen data
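The validation logic itself can be sketched simply (this is an illustrative toy, not the ModelValidator implementation): evaluate the candidate and the last blessed model on the same eval data, and bless the candidate only if it does not regress.

```python
# Toy model validation: bless the candidate only if it is at least
# as accurate as the currently blessed model on held-out data.
def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

eval_data = [(0, 0), (1, 1), (2, 0), (3, 1)]  # (input, label) pairs

blessed = lambda x: 0        # old model: always predicts 0 (50% here)
candidate = lambda x: x % 2  # new model: predicts parity (100% here)

outcome = accuracy(candidate, eval_data) >= accuracy(blessed, eval_data)
print(outcome)  # True -> the Pusher may deploy the candidate
```

"Next-day eval" is the same comparison run on data the candidate has never seen, which guards against the candidate merely having memorized its training window.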
60. Apache Beam: Sum Per Key across SDK frontends

Python: input | Sum.PerKey()
Java: input.apply(Sum.integersPerKey())
Go: stats.Sum(s, input)
SQL: SELECT key, SUM(value) FROM input GROUP BY key

Runners: Cloud Dataflow, Apache Spark, Apache Flink, Apache Apex, Gearpump, Apache Samza, Apache Nemo (incubating), IBM Streams.
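All four frontends above express the same computation. What "Sum Per Key" means, sketched with a plain dictionary:

```python
# Group (key, value) pairs by key and sum the values per key —
# the semantics behind Sum.PerKey() in every Beam SDK.
from collections import defaultdict

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("c", 5)]

sums = defaultdict(int)
for key, value in pairs:
    sums[key] += value

print(dict(sums))  # {'a': 4, 'b': 6, 'c': 5}
```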
61. PTransforms
● More transforms are available in Java than in Python
● Python can invoke Java transforms (coming soon):

with self.create_pipeline() as p:
  res = (
      p | GenerateSequence(start=1, stop=10,
                           expansion_service=expansion_address))

GenerateSequence is written in Java.
62. I/O
● More I/O connectors are available in Java than in Python
● Python can invoke Java I/O (coming soon)
63. Beam Java I/O connectors by category

File-based: Beam Java supports Apache HDFS, Amazon S3, Google Cloud Storage, and local filesystems. FileIO (general-purpose reading, writing, and matching of files), AvroIO, TextIO, TFRecordIO, XmlIO, TikaIO, ParquetIO.

Messaging: RabbitMqIO, SqsIO, Amazon Kinesis, AMQP, Apache Kafka, Google Cloud Pub/Sub, JMS, MQTT.

Database: Apache Cassandra, Apache Hadoop Input/Output Format, Apache HBase, Apache Hive (HCatalog), Apache Kudu, Apache Solr, Elasticsearch (v2.x, v5.x, v6.x), Google BigQuery, Google Cloud Bigtable, Google Cloud Datastore, Google Cloud Spanner, JDBC, MongoDB, Redis.
64. Per-element: ParDo (Map, etc.) — every item is processed independently, with a stateless implementation.
65. Per-key: Combine (Reduce, etc.) — items are grouped by some key and combined. The implementation is stateful streaming, but your code doesn't work with state; you supply just an associative & commutative function.
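Why must the combine function be associative and commutative? Because the runner is then free to compute partial results per shard, in any order, and merge them. A sketch of what happens under the hood:

```python
# The runner may split the data across workers however it likes,
# combine each shard independently, and merge the partials —
# valid only because addition is associative and commutative.
from functools import reduce
from operator import add

values = [3, 1, 4, 1, 5, 9, 2, 6]

shard_a, shard_b = values[:3], values[3:]   # arbitrary sharding
partial_a = reduce(add, shard_a)            # combined on worker A
partial_b = reduce(add, shard_b)            # combined on worker B

# Merging partials gives the same answer as a single sequential pass.
assert add(partial_a, partial_b) == reduce(add, values)
print(add(partial_a, partial_b))  # 31
```

Any function with these two properties (sum, max, set union, …) gets the same distributed treatment for free.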
67. Classic parallel I/O
[Figure: workers vs. time for three cases — non-parallel execution, idealized "embarrassingly parallel" execution, and actual "embarrassingly parallel" execution on most systems.]
68. Beam's dynamic work rebalancing
[Figure: workers vs. time without and with dynamic work rebalancing.]
Beam's APIs make this the default approach.
69. Beam's dynamic work rebalancing
[Figure, left: a classic MapReduce job (read from Google Cloud Storage, GroupByKey, write to Google Cloud Storage) on 400 workers, with dynamic work rebalancing disabled to demonstrate stragglers. X axis: time (total ~20 min.); Y axis: workers.]
[Figure, right: the same job with dynamic work rebalancing enabled by Beam’s Splittable DoFn. X axis: time (total ~15 min.); Y axis: workers. Savings!]
70. Dataflow’s Liquid Sharding
● Monitors worker progress and identifies stragglers
● Asks stragglers to give away part of their unprocessed work (e.g., a sub-range of a file or a key range)
● Schedules the new work items onto idle workers
● Repeats for the next stragglers

The amount of work to give away is chosen so that the worker is expected to complete soon enough to stop being a straggler. This is non-trivial to implement.
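The rebalancing loop above can be illustrated with a toy simulation (made-up numbers and a naive halving policy, not Dataflow's actual algorithm): a straggler gives away part of its unprocessed range to an idle worker.

```python
# Toy rebalancing step: the straggler splits its remaining byte
# range and hands the tail half to an idle worker.
remaining = {"w1": (0, 800), "w2": (800, 900), "w3": None}  # w3 is idle

start, stop = remaining["w1"]       # w1 is the straggler
mid = (start + stop) // 2           # naive policy: give away half
remaining["w1"] = (start, mid)
remaining["w3"] = (mid, stop)

print(remaining)
# {'w1': (0, 400), 'w2': (800, 900), 'w3': (400, 800)}
```

The real system picks the split point from progress estimates, so the shrunken range is expected to finish around the same time as everyone else's.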
72. Beam’s Flink Runner: ParDo

Beam ParDo: an element-wise transformation parameterized by a chunk of user code. Elements are processed in bundles, with initialization and termination hooks. The bundle size is chosen by the runner and cannot be controlled by user code. ParDo processes a main input PCollection one element at a time, but provides side-input access to additional PCollections.

Flink Python Runner — Yes: fully supported. ParDo itself, as a per-element transformation with UDFs, is fully supported by Flink for both batch and streaming.
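The bundle lifecycle described above can be sketched with an illustrative class (not Beam's actual DoFn API): the runner picks the bundle size, and the start/finish hooks run once per bundle rather than once per element.

```python
# Toy DoFn with bundle hooks, driven by a toy runner that decides
# how elements are bundled — user code never controls bundle size.
class CountingFn:
    def __init__(self):
        self.log = []
    def start_bundle(self):
        self.log.append("start")
    def process(self, element):
        self.log.append(element * 10)
    def finish_bundle(self):
        self.log.append("finish")

def run(fn, elements, bundle_size):
    for i in range(0, len(elements), bundle_size):
        fn.start_bundle()
        for e in elements[i:i + bundle_size]:
            fn.process(e)
        fn.finish_bundle()

fn = CountingFn()
run(fn, [1, 2, 3], bundle_size=2)
print(fn.log)  # ['start', 10, 20, 'finish', 'start', 30, 'finish']
```

Bundle hooks are where real DoFns amortize expensive work, such as opening a connection once per bundle instead of once per element.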
73. Beam’s Flink Runner: GroupByKey

Beam GroupByKey: grouping of key-value pairs per key, window, and pane.

Flink Python Runner — Yes: fully supported. Uses Flink's keyBy for key grouping. When grouping by window in streaming (creating the panes), the Flink runner uses the Beam code, which guarantees support for all windowing and triggering mechanisms.
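"Per key, window, and pane" means elements are grouped by (key, window), not by key alone. A sketch using fixed windows of 10 time units (the window size and events are illustrative):

```python
# Group (key, value, timestamp) events by key AND fixed window.
from collections import defaultdict

WINDOW = 10
events = [("a", 1, 3), ("a", 2, 7), ("a", 5, 12), ("b", 9, 4)]

groups = defaultdict(list)
for key, value, ts in events:
    window = ts // WINDOW            # which fixed window this falls in
    groups[(key, window)].append(value)

print(dict(groups))
# {('a', 0): [1, 2], ('a', 1): [5], ('b', 0): [9]}
```

Note that key "a" produces two separate groups because its events span two windows.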
74. Beam’s Flink Runner: Stateful Processing

Beam stateful processing: allows fine-grained access to per-key, per-window persistent state and timers. Timers are integral to stateful processing. Necessary for certain use cases (e.g., high-volume windows that store large amounts of data but typically access only small portions of it; complex state machines) that are not easily or efficiently addressed via Combine or GroupByKey+ParDo.

Flink Python Runner — Partially: non-merging windows. State is supported for non-merging windows; MapState is fully supported.
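Per-key state can be sketched in a few lines (illustrative, not Beam's state API): a stateful step keeps a running total per key that persists across elements of the stream.

```python
# Toy stateful processing: per-key state survives across elements,
# so each element can read and update its key's running total.
from collections import defaultdict

state = defaultdict(int)  # per-key persistent state

def stateful_process(key, value):
    state[key] += value           # read-modify-write the key's state
    return (key, state[key])      # emit the running total

outputs = [stateful_process(k, v)
           for k, v in [("a", 1), ("b", 5), ("a", 2), ("a", 3)]]
print(outputs)  # [('a', 1), ('b', 5), ('a', 3), ('a', 6)]
```

Timers add the other half of the model: a way to schedule a callback per key (e.g., "flush this key's state at window end") that plain ParDo cannot express.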
75. Beam’s Flink Runner: Splittable DoFn

Beam Splittable DoFn (SDF): allows users to develop DoFns that process a single element in portions ("restrictions"), executed in parallel or sequentially. This supersedes the unbounded and bounded `Source` APIs by supporting all of their features on a per-element basis. See http://s.apache.org/splittable-do-fn. Design is in progress on achieving parity with the Source API regarding progress signals.

Flink Python Runner — Not supported.
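The restriction idea behind SDF can be sketched simply (illustrative functions, not Beam's SDF API): a single element, such as a large file, carries a restriction describing the work it represents, and that restriction can be split into pieces processed independently — even in parallel.

```python
# Toy Splittable-DoFn: one element ("a big file") is split into
# restrictions (byte sub-ranges) that are processed independently.
element = {"name": "big_file", "size": 100}

def initial_restriction(element):
    return (0, element["size"])

def split(restriction, pieces):
    start, stop = restriction
    step = (stop - start) // pieces
    return [(start + i * step, start + (i + 1) * step)
            for i in range(pieces)]

def process(element, restriction):
    start, stop = restriction
    return f"read {element['name']}[{start}:{stop}]"

restrictions = split(initial_restriction(element), 4)
print([process(element, r) for r in restrictions])
# ['read big_file[0:25]', 'read big_file[25:50]',
#  'read big_file[50:75]', 'read big_file[75:100]']
```

Because splitting applies per element, the same mechanism that parallelizes one huge file also enables the dynamic work rebalancing shown on slide 69.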