2. Yuta Hono | 寳野 雄太
Cloud Customer Engineer, Google
yutah@google.com
@yutah_3
● Proposing Google Cloud solutions
● Architecture design for customer projects /
technical support for products
(mainly Google Cloud Platform)
Disclaimer: this presentation includes my personal performance test
results. They are not official performance tests, just an example of the
performance.
3. ETL with Cloud Dataflow
p.apply(
TextIO.Read.from("gs://…")
)
p.apply(
BigQueryIO.Write.to("tablespec…")
)
p.apply(
ParDo.of(new YourTransform())
)
Stream/Batch
Batch
Batch
Stream
p.apply(
PubsubIO.Read.topic("input_topic")
)
p.apply(
TextIO.Write.to("gs://…")
)
Parallel, distributed processing
with your class in ParDo;
runs in both streaming and batch modes.
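The ParDo step is the part you write yourself: a function applied to each element independently, which is what lets the service spread the work across workers. A minimal sketch of that idea follows, using only the Java standard library as a stand-in. It is not the Dataflow SDK API; `ParDoSketch`, `parDo`, and `YOUR_TRANSFORM` are hypothetical names, and parallel streams spread work across threads rather than machines.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.stream.Collectors;

// Stand-in for the element-wise step on the slide; NOT the Dataflow SDK API.
final class ParDoSketch {

    // Hypothetical counterpart of YourTransform: one element in, one element out.
    static final Function<String, String> YOUR_TRANSFORM =
            line -> line.trim().toUpperCase(Locale.ROOT);

    // Applies fn to every element independently, so the runtime is free to
    // process elements on different threads (Dataflow uses workers instead).
    static <I, O> List<O> parDo(List<I> input, Function<I, O> fn) {
        return input.parallelStream().map(fn).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(" alpha", "beta ", " gamma ");
        System.out.println(parDo(lines, YOUR_TRANSFORM)); // prints [ALPHA, BETA, GAMMA]
    }
}
```

Because the function never looks at any element other than the one it is given, the same transform can serve a bounded (batch) input or an unbounded (streaming) one.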
14. Batch Shuffling: Stages
1. Create shuffle dataset
2. Write to the shuffle dataset
(from possibly several stages)
3. Close shuffle dataset
4. Read from the shuffle dataset
(from possibly several stages)
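The four stages above can be sketched with a small in-memory model. This is only an illustration of the create / write / close / read lifecycle, not how the Dataflow service is implemented; `ShuffleDataset` and its methods are hypothetical names, and a real shuffle partitions data across machines and disks rather than a single map.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory model of the four stages; class and method names are hypothetical.
final class ShuffleDataset {
    private final Map<String, List<Integer>> byKey = new TreeMap<>();
    private boolean closed = false;

    // Stage 1: create the shuffle dataset.
    static ShuffleDataset create() {
        return new ShuffleDataset();
    }

    // Stage 2: writers append key/value records; several stages may write.
    void write(String key, int value) {
        if (closed) {
            throw new IllegalStateException("dataset is closed for writing");
        }
        byKey.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Stage 3: closing ends the write phase, so readers see a complete dataset.
    void close() {
        closed = true;
    }

    // Stage 4: readers get the values grouped by key; several stages may read.
    Map<String, List<Integer>> read() {
        if (!closed) {
            throw new IllegalStateException("dataset must be closed before reading");
        }
        return byKey;
    }

    // Word count expressed through the shuffle: emit (word, 1), group, then sum.
    static Map<String, Integer> wordCount(List<String> lines) {
        ShuffleDataset shuffle = ShuffleDataset.create();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                if (!word.isEmpty()) {
                    shuffle.write(word, 1);
                }
            }
        }
        shuffle.close();
        Map<String, Integer> counts = new TreeMap<>();
        shuffle.read().forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        // prints {be=2, not=1, or=1, to=2}
        System.out.println(wordCount(Arrays.asList("to be or", "not to be")));
    }
}
```

The close step is what separates the two halves: no reader starts until every writer has finished, which is why a grouping operation acts as a barrier in a batch pipeline.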
18. Example: Dataflow Shuffle Service
Joining 1TB of data
● The default configuration (worker-based
shuffle on n1-standard-1 machine type
using Persistent Disk storage)
● Tuned configuration (worker-based shuffle
on a larger n1-standard-4 machine type and
using SSD storage)
● The new Cloud Dataflow Shuffle
Result: Cloud Dataflow Shuffle was 40% faster than the tuned version
of the worker-based shuffle.
19. Try Dataflow Shuffle Service
Conditions:
● Export gdelt-bq:internetarchivebooks from BigQuery
○ approx. 180GB
○ sharded to 182 files
● Wordcount: previous (worker-based) shuffle vs. Shuffle Service
○ No code change
○ --maxNumWorkers=30
○ us-central1
23. Requirements:
● Dataflow SDK for Java version 1.9.x, or 2.0 (available as of today)
● us-central1 (Iowa) region
● Batch mode
● Do not specify the --zone parameter
Try:
--experiments=shuffle_mode=service
Costs (in Beta, subject to change):
● $0.0216 per GB-hour:
○ The total bill for Dataflow pipelines is expected to be less than or
equal to the cost of Dataflow pipelines not using this option.
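To get a feel for the GB-hour rate, here is a small arithmetic sketch. It assumes that GB-hours are simply the gigabytes of shuffle data held by the service multiplied by the hours they are held; the slide does not spell out how usage is metered, and the job size below is hypothetical. `ShuffleCostSketch` and `shuffleCost` are illustrative names.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

final class ShuffleCostSketch {
    // Beta price quoted on the slide, in US dollars per GB-hour.
    static final BigDecimal PRICE_PER_GB_HOUR = new BigDecimal("0.0216");

    // Assumption: GB-hours = gigabytes held by the service x hours they are held.
    static BigDecimal shuffleCost(BigDecimal gigabytes, BigDecimal hours) {
        return PRICE_PER_GB_HOUR.multiply(gigabytes).multiply(hours)
                .setScale(2, RoundingMode.HALF_UP);
    }

    public static void main(String[] args) {
        // Hypothetical job: 100 GB of shuffle data held for half an hour.
        // 100 x 0.5 = 50 GB-hours; 50 x 0.0216 = 1.08, so this prints $1.08
        System.out.println("$" + shuffleCost(new BigDecimal("100"), new BigDecimal("0.5")));
    }
}
```

This covers only the shuffle line item; the claim on the slide is about the total bill, which also depends on how much worker time the job uses.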
Try the Shuffle service! (And give us feedback.)
24. Feedback
● dataflow-feedback-shuffle@google.com (English) or
● yutah@google.com (Japanese)
Issues to report:
● The duration of the shuffle operation did not decrease
● The cost of your pipeline job (using the optimization feature) increased
without any reduction in processing time
It is our intent to continue improving the service-based Shuffle optimization
feature until its performance is better than that of the worker-based Shuffle.