Google Cloud Dataflow
meets TensorFlow
Dataflow TensorFlow Datastore
H.Yoshikawa (@hayatoy)
GCPUG Shonan #12
25 Mar. 2017
Presenter
@hayatoy
GAE/Py (about 7~8 years?)
APAC
TensorFlow
Disclaimer
Google Cloud Dataflow
TensorFlow
DatastoreIO
Google Cloud Dataflow
Google Cloud Dataflow
A fully managed data-processing service in the MapReduce lineage
Worker instances are spun up on GCE behind the scenes
Pipelines can be written and deployed straight from a Jupyter Notebook
This talk uses the Python SDK of Google Cloud Dataflow
Prerequisites
Create GCP account
Enable billing
Enable Google Dataflow API
Installation
Apache Beam
$ git clone https://github.com/apache/beam.git
$ cd beam/sdks/python/
$ python setup.py sdist
$ cd dist/
$ pip install apache-beam-sdk-*.tar.gz
>> 0.7.0.dev0
google-cloud-dataflow
$ pip install google-cloud-dataflow
>> 0.5.5
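A quick way to confirm the SDK imported correctly (a minimal sketch; the printed version simply reflects whichever package you installed):

# Verify the Beam / Dataflow SDK can be imported and report its version.
import apache_beam as beam
print(getattr(beam, '__version__', 'unknown'))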
Installation
# gcloud-python
$ pip install gcloud
>> 0.18.3
# Application Default Credentials
$ gcloud beta auth application-default login
Required so the pipeline can read and write gs:// buckets
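To double-check that Application Default Credentials were picked up, a small sketch using the google-auth library (an assumption; not part of the talk's setup):

# Load Application Default Credentials and print the project they resolve to.
import google.auth

credentials, project = google.auth.default()
print(project)  # may be None if no default project was configured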
Pipeline
Pipeline
p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
p.run()
(the | and >> operator overloading may feel odd to Pythonistas)
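For reference, a self-contained version of the slide's example that runs locally (writing to a local path instead of gs://; wait_until_finish() is available in current Beam releases):

import apache_beam as beam

# Two in-memory elements, written out as text files with the prefix /tmp/hello.
p = beam.Pipeline('DirectRunner')
(p | 'input' >> beam.Create(['Hello', 'World'])
   | 'output' >> beam.io.WriteToText('/tmp/hello'))
p.run().wait_until_finish()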
Pipeline
(p | 'input' >> beam.Create(['Hello', 'World'])
| 'output' >> beam.io.WriteToText('gs://bucket/hello')
)
Inside a Pipeline, the | operator chains transforms together
What flows from one step to the next is a PCollection (see the sketch below)
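As an illustration (not on the slides), every step consumes one PCollection and produces a new one, so intermediate results can be held in variables:

import apache_beam as beam

p = beam.Pipeline('DirectRunner')

words = p | 'input' >> beam.Create(['Hello', 'World'])        # PCollection of str
upper = words | 'upper' >> beam.Map(lambda w: w.upper())      # new PCollection
upper | 'output' >> beam.io.WriteToText('/tmp/hello_upper')   # text sink

p.run()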
PCollection
A PCollection is the dataset that flows through the pipeline
Its elements can be plain values or key-value pairs
It can be bounded (fixed size) or unbounded (streaming)
Creating a PCollection
In Memory
beam.Create(['Hello', 'World'])
Creates a PCollection with the two plaintext elements 'Hello' and 'World'
TextFile
beam.io.ReadFromText('gs://bucket/input.txt')
Wildcards (*) in the path are OK too
BigQuery
Table
beam.io.Read(
    beam.io.BigQuerySource(
        'clouddataflow-readonly:samples.weather_stations'))
Query
beam.io.Read(
    beam.io.BigQuerySource(
        query='SELECT year, mean_temp FROM samples.weather_stations'))
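A hedged sketch of the table source in a full pipeline; BigQuerySource yields each row as a dict keyed by column name (the output bucket is a placeholder, and running it needs GCP credentials and a project):

import apache_beam as beam

p = beam.Pipeline('DataflowRunner')  # or 'DirectRunner' for a small test
(p | 'read table' >> beam.io.Read(beam.io.BigQuerySource(
         'clouddataflow-readonly:samples.weather_stations'))
   | 'pick fields' >> beam.Map(
         lambda row: '%s,%s' % (row['year'], row['mean_temp']))
   | 'write' >> beam.io.WriteToText('gs://bucket/weather'))  # placeholder bucket
p.run()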
Datastore
(the Python SDK version is new; details later in the DatastoreIO section)
Dynamic Work Rebalancing
Dynamic Work Rebalancing
https://cloud.google.com/blog/big-data/2016/05/no-shard-left-behind-dynamic-work-rebalancing-in-google-cloud-dataflow
TensorFlow
Q: Can TensorFlow run on Google Cloud Dataflow?
A: Yes, it can.
TensorFlow Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
generate params: creates a PCollection of hyperparameter sets
train: trains one TensorFlow model per parameter set
output: writes each model's accuracy to GCS
The slides don't show how params is built; a sketch follows, then the train step is highlighted
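One plausible construction of params (an assumption, using the keys that the train() function shown later expects) is a simple grid of hyperparameter dicts:

# Build the list of parameter dicts handed to beam.Create(params).
# Keys match what train() reads: hidden_units, dropout, steps.
params = []
for hidden_units in ([10, 20, 10], [20, 40, 20], [100, 50, 25]):
    for dropout in (0.0, 0.2, 0.5):
        for steps in (1000, 2000):
            params.append({'hidden_units': hidden_units,
                           'dropout': dropout,
                           'steps': steps})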
# (p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train) # ← this step
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
# )
Each element of the PCollection (one parameter set) is processed independently
Unbounded PCollections can be handled too, using Windowing
Work is spread across the Workers
Prediction works the same way: flow a PCollection of inputs through the pipeline
Inside the train function:
Plain TensorFlow code works as-is; nothing Dataflow-specific is needed inside train
def train(param):
    import json
    import uuid
    import tensorflow as tf
    from sklearn import cross_validation

    # Load the iris dataset and split it into train/test sets
    iris = tf.contrib.learn.datasets.base.load_iris()
    train_x, test_x, train_y, test_y = cross_validation.train_test_split(
        iris.data, iris.target, test_size=0.2, random_state=0
    )
    # https://www.tensorflow.org/get_started/tflearn
    model_id = str(uuid.uuid4())  # unique id for this model's output directory
    feature_columns = [tf.contrib.layers.real_valued_column("", dimension=4)]
    classifier = tf.contrib.learn.DNNClassifier(feature_columns=feature_columns,
                                                hidden_units=param['hidden_units'],
                                                dropout=param['dropout'],
                                                n_classes=3,
                                                model_dir='gs://{BUCKET}/models/%s' % model_id)
    classifier.fit(x=train_x,
                   y=train_y,
                   steps=param['steps'],
                   batch_size=50)
    result = classifier.evaluate(x=test_x, y=test_y)
    ret = {'accuracy': float(result['accuracy']),
           'loss': float(result['loss']),
           'model_id': model_id,
           'param': json.dumps(param)}
    return ret
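Before handing the whole grid to Dataflow, train() can be smoke-tested locally with a single parameter set (assuming TensorFlow and scikit-learn are installed and the {BUCKET} placeholder in model_dir is filled in):

# One local training run; prints the accuracy/loss dict returned by train().
param = {'hidden_units': [10, 20, 10], 'dropout': 0.2, 'steps': 200}
print(train(param))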
Dataflow
Dataflow
Pipeline
(p | 'generate params' >> beam.Create(params)
| 'train' >> beam.Map(train)
| 'output' >> beam.io.WriteToText('gs://bucket/acc')
)
(p | 'generate params' >> beam.Create(params)
# | 'train' >> beam.Map(train)
# | 'output' >> beam.io.WriteToText('gs://bucket/acc')
#)
beam.Create(params) turns the parameter list into a PCollection
Each element is then handed to train on some worker
Dataflow
Auto Scaling, Machine Type
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'
max_num_workers: upper limit on the number of workers
num_workers: initial number of workers (auto scaling adjusts within the limit)
disk_size_gb: disk size per worker in GB (the default is 250GB)
machine_type: any GCE machine type can be specified (see the GCE REST API reference for the list)
GPU workers are not available
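A minimal sketch of how the worker_options above might be wired into a pipeline. The module path below matches current Beam releases (apache_beam.options.pipeline_options); the 0.x SDK used in the talk kept these classes under a different path, and the project/bucket values are placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    GoogleCloudOptions, PipelineOptions, StandardOptions, WorkerOptions)

options = PipelineOptions()
gcloud_options = options.view_as(GoogleCloudOptions)
gcloud_options.project = 'your-project-id'               # placeholder
gcloud_options.job_name = 'tf-train'
gcloud_options.staging_location = 'gs://bucket/staging'  # placeholder
gcloud_options.temp_location = 'gs://bucket/temp'        # placeholder
options.view_as(StandardOptions).runner = 'DataflowRunner'

worker_options = options.view_as(WorkerOptions)
worker_options.max_num_workers = 10
worker_options.num_workers = 10
worker_options.disk_size_gb = 20
worker_options.machine_type = 'n1-standard-16'

p = beam.Pipeline(options=options)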
Pipeline
Branch
You could branch with an if inside a single Transform, but
splitting the Pipeline itself into branches
makes the flow visible in the Dataflow Console graph
Branch
Branch
def split_branch(n, side=0):
    if n % 2 == side:
        yield n

pipe_0 = p | 'param' >> beam.Create(range(100))
branch1 = (pipe_0 | 'branch1' >>
           beam.FlatMap(split_branch, 0))
branch2 = (pipe_0 | 'branch2' >>
           beam.FlatMap(split_branch, 1))
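Not on the slides, but the two branches can later be merged back into one PCollection with beam.Flatten if needed:

# Merge the even and odd branches into a single PCollection again.
merged = (branch1, branch2) | 'merge' >> beam.Flatten()
merged | 'output' >> beam.io.WriteToText('gs://bucket/merged')  # placeholder bucket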
Dynamic Work Rebalancing
Struggle with Stragglers
2000
20 → 5 /
→ 5 DNN
15
ON/OFF OFF
DNN
Dataflow
DatastoreIO
DatastoreIO (Python SDK)
DatastoreIO is now available in the Python SDK of Dataflow (GA)
Entities are exchanged as Protobuf objects
← New!
from apache_beam.io.gcp.datastore.v1.datastoreio import ReadFromDatastore
from apache_beam.io.gcp.datastore.v1.datastoreio import WriteToDatastore
from google.cloud.proto.datastore.v1 import entity_pb2
from google.cloud.proto.datastore.v1 import query_pb2
from googledatastore import helper as datastore_helper, PropertyFilter
from gcloud.datastore.helpers import entity_from_protobuf
The old DatastoreSource / DatastoreSink are gone
→ replaced by ReadFromDatastore / WriteToDatastore (as of Beam v0.7.0)
Read from Datastore
Pipeline
(p | 'read from datastore' >>
       ReadFromDatastore(project=PROJECTID,
                         query=query)
...
Query
query = query_pb2.Query()
query.kind.add().name = 'Test'
datastore_helper.set_property_filter(
    query.filter, 'foo',
    PropertyFilter.EQUAL, 'lorem'
)
Equivalent GQL:
SELECT * FROM Test WHERE foo = 'lorem'
Entities arrive as Protobuf; convert them with the Datastore client library (entity_from_protobuf)
def csv_format(entity_pb):
    entity = entity_from_protobuf(entity_pb)
    columns = ['"%s"' % entity[k]
               for k in sorted(entity.keys())]
    return ','.join(columns)

p = beam.Pipeline(options=options)
(p | 'read from datastore' >>
       ReadFromDatastore(project=PROJECTID,
                         query=query)
   | 'format entity to csv' >>
       beam.Map(csv_format)
 ...
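The slide elides the sink; one way to finish the pipeline (a sketch, not from the slides, reusing options, PROJECTID, query and csv_format defined above) is to write the CSV lines to GCS and run it:

import apache_beam as beam

p = beam.Pipeline(options=options)
(p | 'read from datastore' >> ReadFromDatastore(project=PROJECTID, query=query)
   | 'format entity to csv' >> beam.Map(csv_format)
   | 'write to gcs' >> beam.io.WriteToText('gs://bucket/datastore_csv'))  # placeholder path
p.run()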
Write to Datastore
Pipeline
...
 | 'create entity' >>
     beam.Map(create_entity)
 | 'write to datastore' >>
     WriteToDatastore(project=PROJECTID))
Entities to be written must be built as Protobuf objects
def create_entity(param):
    import uuid
    entity = entity_pb2.Entity()
    datastore_helper.add_key_path(entity.key,
                                  'Test',
                                  str(uuid.uuid4()))
    datastore_helper.add_properties(entity,
                                    {"foo": u"hoge",
                                     "bar": u"fuga",
                                     "baz": 42})
    return entity
Demo
Questions?
Thank you!
bit.ly/gcp-dataflow
Qiita
http://qiita.com/hayatoy/items/2eb2bc9223dd6f5c91e0
Medium
https://medium.com/@hayatoy/training-multiple-models-of-tensorflow-using-dataflow-7a5a9efafe53#.yvrblb6r3

Google Cloud Dataflow meets TensorFlow