Dataflow Shuffle Service
Drinks, Games, Infrastructure, and GCP, Vol. 6 (酒とゲームとインフラとGCP 第6回)
#gcpja
Jul 19th, 2017
Cloud Customer Engineer, Google
Yuta Hono | 寳野雄太
yutah@google.com @yutah_3

Yuta Hono | 寳野雄太
Cloud Customer Engineer, Google
● Proposing Google Cloud solutions
● Architecture design for customer projects and technical product support (mainly Google Cloud Platform)
Disclaimer: this presentation includes my personal performance test
results. They are not official performance benchmarks, just an example of the
performance.

ETL with Cloud Dataflow
Read:
p.apply(TextIO.Read.from("gs://…"))  [Batch]
p.apply(PubsubIO.Read.topic("input_topic"))  [Stream]
Transform:
p.apply(ParDo.of(new YourTransform()))
Write:
p.apply(BigQueryIO.Write.to("tablespec…"))  [Stream/Batch]
p.apply(TextIO.Write.to("gs://…"))  [Batch]
Parallel / distributed processing with your class in ParDo.
Runs in both streaming and batch modes.

Agenda
1. What is shuffle?
2. Dataflow Shuffle Service [Beta, as of Jul 19]

Shuffle
Map -> Shuffle (Sort) -> Reduce
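
The three phases can be sketched in a single process as a word count. This is only an illustration in plain Java, not Dataflow or MapReduce code; the class name MapShuffleReduceSketch and its method names are made up for this example, and a sorted TreeMap stands in for the distributed shuffle.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: a single-process model of Map -> Shuffle (Sort) -> Reduce.
class MapShuffleReduceSketch {

    // Map: emit one <word, 1> pair per word in each input line.
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(Map.entry(word, 1));
                }
            }
        }
        return pairs;
    }

    // Shuffle (Sort): bring all values for the same key together, keys in sorted order.
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum the grouped values for each key.
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("to be or not to be", "to shuffle");
        // prints {be=2, not=1, or=1, shuffle=1, to=3}
        System.out.println(reduce(shuffle(map(lines))));
    }
}
```

In a real distributed run the map and reduce steps execute on many workers, and the shuffle step is the part that moves data between them.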

https://research.google.com/archive/mapreduce-osdi04-slides/index-auto-0008.html
MapReduce - Execution

Cloud Dataflow Shuffle Service

Run Distributed Processing...
<user1, record> <user5, record> <user3, record> <user8, record> <user4, record> ...
<user5, record> <user5, record> <user2, record> <user2, record> <user7, record> ...
<user3, record> <user3, record> <user8, record> <user3, record> <user6, record> ...
<user2, record> <user1, record> <user5, record> <user8, record> <user4, record> ...

Shuffling
<user1, record> <user5, record> <user3, record> <user8, record> <user4, record> ...
<user5, record> <user5, record> <user2, record> <user2, record> <user7, record> ...
<user3, record> <user3, record> <user8, record> <user3, record> <user6, record> ...
<user2, record> <user1, record> <user5, record> <user8, record> <user4, record> ...

Shuffled
<user1, record> <user1, record> <user2, record> <user2, record> <user2, record> ...
<user3, record> <user3, record> <user3, record> <user3, record> <user4, record> <user4, record> ...
<user5, record> <user5, record> <user5, record> <user5, record> <user6, record> ...
<user7, record> <user8, record> <user8, record> <user8, record> ...
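
The property that matters in the shuffled state above is that every record with the same key ends up in the same place. A minimal sketch of one way to get that property follows, assuming hash partitioning (the slide itself appears to show keys split by range, and Dataflow's actual partitioning scheme is not described here). The class KeyPartitionSketch and the method partitionByKey are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative only: hash-partitioning keyed records so each key lands on one partition.
class KeyPartitionSketch {

    // Route every <key, value> record to partition floorMod(hash(key), numPartitions).
    static List<List<Map.Entry<String, String>>> partitionByKey(
            List<List<Map.Entry<String, String>>> workerOutputs, int numPartitions) {
        List<List<Map.Entry<String, String>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (List<Map.Entry<String, String>> output : workerOutputs) {
            for (Map.Entry<String, String> record : output) {
                // floorMod keeps the index non-negative even for negative hash codes.
                int target = Math.floorMod(record.getKey().hashCode(), numPartitions);
                partitions.get(target).add(record);
            }
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, String>>> workerOutputs = List.of(
                List.of(Map.entry("user1", "record"), Map.entry("user5", "record"),
                        Map.entry("user3", "record")),
                List.of(Map.entry("user5", "record"), Map.entry("user2", "record"),
                        Map.entry("user7", "record")),
                List.of(Map.entry("user3", "record"), Map.entry("user8", "record"),
                        Map.entry("user6", "record")));
        List<List<Map.Entry<String, String>>> partitions = partitionByKey(workerOutputs, 4);
        for (int i = 0; i < partitions.size(); i++) {
            System.out.println("partition " + i + ": " + partitions.get(i));
        }
    }
}
```

Which partition a given user lands on depends on the hash, but all records for that user always share one partition, which is what lets a downstream step process each key in one place.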

Batch Shuffling: Stages
1. Create shuffle dataset
2. Write to the shuffle dataset (from possibly several stages)
3. Close shuffle dataset
4. Read from the shuffle dataset (from possibly several stages)
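
The four stages can be modeled as a small in-memory object. This is a hypothetical sketch, not a Dataflow API: the class ShuffleDatasetSketch and its methods are invented for illustration, and the rule that reads are only allowed after close is an assumption drawn from the order of the stages listed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the four stages; not a Dataflow API.
class ShuffleDatasetSketch {
    private final TreeMap<String, List<String>> byKey = new TreeMap<>();
    private boolean closed = false;

    // Stage 1: create the shuffle dataset.
    static ShuffleDatasetSketch create() {
        return new ShuffleDatasetSketch();
    }

    // Stage 2: writers append records until the dataset is closed.
    void write(String key, String value) {
        if (closed) {
            throw new IllegalStateException("cannot write after close");
        }
        byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Stage 3: close the dataset; no more writes are accepted.
    void close() {
        closed = true;
    }

    // Stage 4: readers see values grouped by key (assumed: only after close).
    Map<String, List<String>> read() {
        if (!closed) {
            throw new IllegalStateException("cannot read before close");
        }
        return byKey;
    }

    public static void main(String[] args) {
        ShuffleDatasetSketch dataset = ShuffleDatasetSketch.create();
        dataset.write("user5", "r1");
        dataset.write("user2", "r2");
        dataset.write("user5", "r3");
        dataset.close();
        // prints {user2=[r2], user5=[r1, r3]}
        System.out.println(dataset.read());
    }
}
```

Separating the write and read phases with an explicit close is what lets readers assume that every value for a key is already present.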

Shuffle in Dataflow
[Diagram: five Workers, each paired with a PD (Persistent Disk), with write and read arrows between them]

New: Dataflow Shuffle Service
[Diagram: four Read/Write arrows connecting to the Dataflow Shuffle Service]

Ex. BigQuery Shuffle
https://cloud.google.com/blog/big-data/2016/08/in-memory-query-execution-in-google-bigquery

Example: Dataflow Shuffle Service
Joining 1 TB of data, comparing three configurations:
● Default configuration: worker-based shuffle on the n1-standard-1 machine type with Persistent Disk storage
● Tuned configuration: worker-based shuffle on the larger n1-standard-4 machine type with SSD storage
● The new Cloud Dataflow Shuffle
Result: the new Cloud Dataflow Shuffle was 40% faster than the tuned version
of the worker-based shuffle.

Try Dataflow Shuffle Service
Conditions:
● Export gdelt-bq:internetarchivebooks from BigQuery
○ approx. 180 GB
○ sharded into 182 files
● WordCount: previous (worker-based) shuffle vs. Shuffle Service
○ No code change
○ --maxNumWorkers=30
○ us-central1

Result:
[Figure: job results, Shuffle Service vs. Normal]

Try the Shuffle Service! (And give feedback)
Requirements:
● Dataflow SDK for Java version 1.9.x, or 2.0 (as of today)
● us-central1 (Iowa) region
● Batch mode
● No --zone parameter
Try:
--experiments=shuffle_mode=service
Costs (in Beta, subject to change):
● $0.0216 per GB-hour
○ The total bill for Dataflow pipelines is expected to be less than or
equal to the cost of Dataflow pipelines not using this option.

Feedback
● dataflow-feedback-shuffle@google.com (en) or
● yutah@google.com (ja)
Issues to report:
● The duration of the shuffle operation did not decrease
● The cost of your pipeline job (using the optimization feature) increased
without any reduction in processing time
It is our intent to continue improving the service-based Shuffle optimization
feature until its performance is better than that of the worker-based Shuffle.

Summary - Cloud Dataflow Shuffle Service
Scalable
Efficient
Fault-tolerant

Thank You!
yutah@google.com
https://cloud.google.com/
