Google Cloud Dataflow
meets TensorFlow
Dataflow TensorFlow Datastore
H.Yoshikawa (@hayatoy)
GCPUG Shonan #12
25 Mar. 2017
Presenter
@hayatoy
GAE/Py (about 7~8 years?)
APAC
TensorFlow
Disclaimer
Google Cloud Dataflow
TensorFlow
DatastoreIO
Google Cloud Dataflow
Google Cloud Dataflow
A fully managed data-processing service in the MapReduce lineage
Worker instances are spun up on GCE behind the scenes
Pipelines can be written and deployed straight from a Jupyter Notebook
This talk uses the Python SDK of Google Cloud Dataflow
Prerequisites
Create GCP account
Enable billing
Enable Google Dataflow API
Installation
Apache Beam
$ git clone https://github.com/apache/beam.git
$ cd beam/sdks/python/
$ python setup.py sdist
$ cd dist/
$ pip install apache-beam-sdk-*.tar.gz
>> 0.7.0.dev0
google-cloud-dataflow
$ pip install google-cloud-dataflow
>> 0.5.5
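A quick way to confirm the SDK imported correctly (a minimal sketch; the printed version simply reflects whichever package you installed):

# Verify the Beam / Dataflow SDK can be imported and report its version.
import apache_beam as beam
print(getattr(beam, '__version__', 'unknown'))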
Installation
# gcloud-python
$ pip install gcloud
>> 0.18.3
# Application Default Credentials
$ gcloud beta auth application-default login
Required so the pipeline can read and write gs:// buckets
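To double-check that Application Default Credentials were picked up, a small sketch using the google-auth library (an assumption; not part of the talk's setup):

# Load Application Default Credentials and print the project they resolve to.
import google.auth

credentials, project = google.auth.default()
print(project)  # may be None if no default project was configured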
Pipeline
Pipeline
p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
p.run()
(the | and >> operator overloading may feel odd to Pythonistas)
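For reference, a self-contained version of the slide's example that runs locally (writing to a local path instead of gs://; wait_until_finish() is available in current Beam releases):

import apache_beam as beam

# Two in-memory elements, written out as text files with the prefix /tmp/hello.
p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
   | 'output' >> beam.io.WriteToText('/tmp/hello'))
p.run().wait_until_finish()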
Pipeline
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
Inside a Pipeline, the | operator chains transforms together
What flows from one step to the next is a PCollection (see the sketch below)
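As an illustration (not on the slides), every step consumes one PCollection and produces a new one, so intermediate results can be held in variables:

import apache_beam as beam

p = beam.Pipeline('DirectRunner')

words = p | 'input' >> beam.Create(['Hello', 'World'])        # PCollection of str
upper = words | 'upper' >> beam.Map(lambda w: w.upper())      # new PCollection
upper | 'output' >> beam.io.WriteToText('/tmp/hello_upper')   # text sink

p.run()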
PCollection
A PCollection is the dataset that flows through the pipeline
Its elements can be plain values or key-value pairs
It can be bounded (fixed size) or unbounded (streaming)
Creating a PCollection
In Memory
beam.Create(['Hello', 'World'])
Creates a PCollection with the two plaintext elements 'Hello' and 'World'
TextFile
beam.io.ReadFromText('gs://bucket/input.txt')
Wildcards (*) in the path are OK too
BigQuery
Table
beam.io.Read(
    beam.io.BigQuerySource(
        'clouddataflow-readonly:samples.weather_stations'))
Query
beam.io.Read(
    beam.io.BigQuerySource(
        query='SELECT year, mean_temp FROM samples.weather_stations'))
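A hedged sketch of the table source in a full pipeline; BigQuerySource yields each row as a dict keyed by column name (the output bucket is a placeholder, and running it needs GCP credentials and a project):

import apache_beam as beam

p = beam.Pipeline('DataflowRunner')  # or 'DirectRunner' for a small test
(p | 'read table' >> beam.io.Read(beam.io.BigQuerySource(
         'clouddataflow-readonly:samples.weather_stations'))
   | 'pick fields' >> beam.Map(
         lambda row: '%s,%s' % (row['year'], row['mean_temp']))
   | 'write' >> beam.io.WriteToText('gs://bucket/weather'))  # placeholder bucket
p.run()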
Datastore
(the Python SDK version is new; details later in the DatastoreIO section)
Dynamic Work Rebalancing
Dynamic Work Rebalancing
https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
TensorFlow
Q: Can TensorFlow run on Google Cloud Dataflow?
A: Yes, it can.
TensorFlow Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
generate params: creates a PCollection of hyperparameter sets
train: trains one TensorFlow model per parameter set
output: writes each model's accuracy to GCS
The slides don't show how params is built; a sketch follows, then the train step is highlighted
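One plausible construction of params (an assumption, using the keys that the train() function shown later expects) is a simple grid of hyperparameter dicts:

# Build the list of parameter dicts handed to beam.Create(params).
# Keys match what train() reads: hidden_units, dropout, steps.
params = []
for hidden_units in ([10, 20, 10], [20, 40, 20], [100, 50, 25]):
    for dropout in (0.0, 0.2, 0.5):
        for steps in (1000, 2000):
            params.append({'hidden_units': hidden_units,
                           'dropout': dropout,
                           'steps': steps})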
# (p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train) # ← this step
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
# )
Each element of the PCollection (one parameter set) is processed independently
Unbounded PCollections can be handled too, using Windowing
Work is spread across the Workers
Prediction works the same way: flow a PCollection of inputs through the pipeline
Inside the train function:
Plain TensorFlow code works as-is; nothing Dataflow-specific is needed inside train
def train(param):
    import json
    import uuid
    import tensorflow as tf
    from sklearn import cross_validation

    # Load the iris dataset and split it into train/test sets
    iris = tf.contrib.learn.datasets.base.load_iris()
    train_x, test_x, train_y, test_y = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0
    )
    # https://www.tensorflow.org/get_started/tflearn
    model_id = str(uuid.uuid4())  # unique id for this model's output directory
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=param['hidden_units'],
                                                dropout=param['dropout'],
                                                n_classes=3,
                                                model_dir='gs://{BUCKET}/models/%s' % model_id)
    classifier.fit(x=train_x,
                   y=train_y,
                   steps=param['steps'],
                   batch_size=50)
    result = classifier.evaluate(x=test_x, y=test_y)
    ret = {'accuracy': float(result['accuracy']),
           'loss': float(result['loss']),
           'model_id': model_id,
           'param': json.dumps(param)}
    return ret
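Before handing the whole grid to Dataflow, train() can be smoke-tested locally with a single parameter set (assuming TensorFlow and scikit-learn are installed and the {BUCKET} placeholder in model_dir is filled in):

# One local training run; prints the accuracy/loss dict returned by train().
param = {'hidden_units': [10, 20, 10], 'dropout': 0.2, 'steps': 200}
print(train(param))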
Dataflow
Dataflow
Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
(p | 'generate params' >> beam.Create(params)
# | 'train' >> beam.Map(train)
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
#)
beam.Create(params) turns the parameter list into a PCollection
Each element is then handed to train on some worker
Dataflow
Auto Scaling, Machine Type
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'
max_num_workers: upper limit on the number of workers
num_workers: initial number of workers (auto scaling adjusts within the limit)
disk_size_gb: disk size per worker in GB (the default is 250GB)
machine_type: any GCE machine type can be specified (see the GCE REST API reference for the list)
GPU workers are not available
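A minimal sketch of how the worker_options above might be wired into a pipeline. The module path below matches current Beam releases (apache_beam.options.pipeline_options); the 0.x SDK used in the talk kept these classes under a different path, and the project/bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
gcloud_options = options.view_as(GoogleCloudOptions)
gcloud_options.project = 'your-project-id'               # placeholder
gcloud_options.job_name = 'tf-train'
gcloud_options.staging_location = 'gs://bucket/staging'  # placeholder
gcloud_options.temp_location = 'gs://bucket/temp'        # placeholder
options.view_as(StandardOptions).runner = 'DataflowRunner'

worker_options = options.view_as(WorkerOptions)
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'

p = beam.Pipeline(options=options)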
Pipeline
Branch
You could branch with an if inside a single Transform, but
splitting the Pipeline itself into branches
makes the flow visible in the Dataflow Console graph
Branch
Branch
def split_branch(n, side=0):
    if n % 2 == side:
        yield n

pipe_0 = p | 'param' >> beam.Create(range(100))
branch1 = (pipe_0 | 'branch1' >>
           beam.FlatMap(split_branch, 0))
branch2 = (pipe_0 | 'branch2' >>
           beam.FlatMap(split_branch, 1))
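Not on the slides, but the two branches can later be merged back into one PCollection with beam.Flatten if needed:

# Merge the even and odd branches into a single PCollection again.
merged = (branch1, branch2) | 'merge' >> beam.Flatten()
merged | 'output' >> beam.io.WriteToText('gs://bucket/merged')  # placeholder bucket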
Dynamic Work Rebalancing
Struggle with Stragglers
2000
20 → 5 /
→ 5 DNN
15
ON/OFF OFF
DNN
Dataflow
DatastoreIO
DatastoreIO (Python SDK)
DatastoreIO is now available in the Python SDK of Dataflow (GA)
Entities are exchanged as Protobuf objects
← New!
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper, PropertyFilter
from gcloud.datastore.helpers import entity_from_protobuf
The old DatastoreSource / DatastoreSink are gone
→ replaced by ReadFromDatastore / WriteToDatastore (as of Beam v0.7.0)
Read from Datastore
Pipeline
(p | 'read from datastore' >>
       ReadFromDatastore(project=PROJECTID,
                         query=query)
...
Query
query = query_pb2.Query()
query.kind.add().name = 'Test'
datastore_helper.set_property_filter(
    query.filter, 'foo',
    PropertyFilter.EQUAL, 'lorem'
)
Equivalent GQL:
SELECT * FROM Test WHERE foo = 'lorem'
Entities arrive as Protobuf; convert them with the Datastore client library (entity_from_protobuf)
def csv_format(entity_pb):
    entity = entity_from_protobuf(entity_pb)
    columns = ['"%s"' % entity[k]
               for k in sorted(entity.keys())]
    return ','.join(columns)

p = beam.Pipeline(options=options)
(p | 'read from datastore' >>
       ReadFromDatastore(project=PROJECTID,
                         query=query)
   | 'format entity to csv' >>
       beam.Map(csv_format)
 ...
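The slide elides the sink; one way to finish the pipeline (a sketch, not from the slides, reusing options, PROJECTID, query and csv_format defined above) is to write the CSV lines to GCS and run it:

import apache_beam as beam

p = beam.Pipeline(options=options)
(p | 'read from datastore' >> ReadFromDatastore(project=PROJECTID, query=query)
   | 'format entity to csv' >> beam.Map(csv_format)
   | 'write to gcs' >> beam.io.WriteToText('gs://bucket/datastore_csv'))  # placeholder path
p.run()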
Write to Datastore
Pipeline
...
 | 'create entity' >>
     beam.Map(create_entity)
 | 'write to datastore' >>
     WriteToDatastore(project=PROJECTID))
Entities to be written must be built as Protobuf objects
def create_entity(param):
    import uuid
    entity = entity_pb2.Entity()
    datastore_helper.add_key_path(entity.key,
                                  'Test',
                                  str(uuid.uuid4()))
    datastore_helper.add_properties(entity,
                                    {"foo": u"hoge",
                                     "bar": u"fuga",
                                     "baz": 42})
    return entity
Demo
Questions?
Thank you!
bit.ly/gcp-dataflow
Qiita
http://qiita.com/hayatoy/items/2eb2bc9223dd6f5c91e0
Medium
https://medium.com/@hayatoy/training-multiple-models-of-tensorflow-using-dataflow-7a5a9efafe53#.yvrblb6r3

Google Cloud Dataflow meets TensorFlow