Apache Beam and Google Cloud Dataflow - IDG - final
2,169 views
  1. Google Cloud Dataflow: the next generation of managed big data service, based on the Apache Beam programming model. Szabolcs Feczak, Cloud Solutions Engineer, Google. 9th Cloud & Data Center World 2016 - Korea IDG
  2. Goals: 1. You leave here understanding the fundamentals of the Apache Beam model and the Google Cloud Dataflow managed service. 2. We have some fun.
  3. Background and historical overview
  4. The trade-off quadrant of Big Data: Completeness, Speed, Cost Optimization, and Complexity, all bearing on Time to Answer.
  5. Lineage of big data systems (MapReduce, Hadoop, Flume, Storm, Spark, MillWheel, Flink, Apache Beam), compared in a feature matrix: Batch, Streaming, Pipelines, Unified API, No Lambda, Iterative, Interactive, Exactly Once, State, Timers, Auto-Awesome, Watermarks, Windowing, High-level API, Managed Service, Triggers, Open Source, Unified Engine, Optimizer.
  6. Deep dive, probing familiarity with the subject. Scale example: 1M devices, 16.6K events/sec, 43B events/month, 518B events/year.
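The event volumes on this slide follow from simple rate arithmetic; a quick back-of-the-envelope check (plain Python, not part of the deck):

```python
# Back-of-the-envelope check of the event volumes quoted on the slide.
events_per_sec = 16_600  # 16.6K events/sec from 1M devices

per_month = events_per_sec * 60 * 60 * 24 * 30   # ~30-day month
per_year = events_per_sec * 60 * 60 * 24 * 365

print(f"{per_month / 1e9:.0f}B events/month")  # 43B, matching the slide
print(f"{per_year / 1e9:.0f}B events/year")    # ~523B; the slide quotes 518B
```

The yearly figure lands slightly above the slide's 518B, which suggests the deck rounded the per-second rate down before annualizing.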
  7. Before Apache Beam: Batch (Accuracy, Simplicity, Savings) OR Stream (Speed, Sophistication, Scalability).
  8. After Apache Beam: Batch (Accuracy, Simplicity, Savings) AND Stream (Speed, Sophistication, Scalability). Balancing correctness, latency, and cost with a unified batch and streaming model.
  9. http://research.google.com/search.html?q=dataflow
  10. Apache Beam (incubating). The Dataflow submission to the Apache Incubator was accepted on February 1, 2016, and the resulting project is now called Apache Beam (http://incubator.apache.org/projects/beam.html). Software Development Kits: Java (https://github.com/GoogleCloudPlatform/DataflowJavaSDK), Python (alpha), Scala (/darkjh/scalaflow, /jhlch/scala-dataflow-dsl). Runners: Spark runner (/cloudera/spark-dataflow), Flink runner (/dataArtisans/flink-dataflow).
  11. Where might you use Apache Beam? ETL: movement, filtering, enrichment, shaping. Analysis: reduction, batch computation, continuous computation. Orchestration: composition, external orchestration, simulation.
  12. Why would you go with a managed service?
  13. Cloud Dataflow Managed Service advantages (GA since August 2015): you supply the user code & SDK; the GCP managed service provides the Work Manager, Job Manager, deploy & schedule, Monitoring UI, and progress & logs.
  14. Worker lifecycle management in the Cloud Dataflow Service: deploy, schedule & monitor, tear down.
  15. Challenge: cost optimization. Time & life never stop; data rates & schema are not static; scaling models are not static; non-elastic compute is wasteful and can create lag.
  16. Auto-scaling in the Cloud Dataflow Service: worker count follows load over the day (800 QPS at 10:00, 1200 QPS at 11:00, 5000 QPS at 12:00, 50 QPS at 13:00).
  17. Dynamic Work Rebalancing in the Cloud Dataflow Service: 100 mins. vs. 65 mins. for the same job.
  18. Graph optimization in the Cloud Dataflow Service: ParDo fusion (producer-consumer fusion, sibling fusion, intelligent fusion boundaries) and combiner lifting, e.g. partial aggregations before reduction, so that in a graph of A, GroupByKey, CombineValues, part of the combine runs before the GroupByKey and only partial results cross the shuffle. See http://research.google.com/search.html?q=flume%20java.
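Combiner lifting can be illustrated outside Beam: instead of shipping every element to the GroupByKey, each worker pre-combines its own shard and only one partial value per key crosses the shuffle. A plain-Python sketch (function names and data are illustrative, not Dataflow's implementation):

```python
from collections import defaultdict

def naive(shards):
    # Without lifting: every (key, value) pair crosses the shuffle,
    # then the combine runs after the GroupByKey.
    grouped = defaultdict(list)
    for shard in shards:
        for k, v in shard:
            grouped[k].append(v)
    return {k: sum(vs) for k, vs in grouped.items()}

def lifted(shards):
    # With combiner lifting: each shard is partially aggregated first,
    # so at most one value per (shard, key) crosses the shuffle.
    partials = []
    for shard in shards:
        acc = defaultdict(int)
        for k, v in shard:
            acc[k] += v
        partials.append(acc)
    out = defaultdict(int)
    for acc in partials:          # final combine of the partial sums
        for k, v in acc.items():
            out[k] += v
    return dict(out)

shards = [[("a", 1), ("b", 2), ("a", 3)], [("a", 4), ("b", 5)]]
assert naive(shards) == lifted(shards) == {"a": 8, "b": 7}
```

Both paths produce identical results; the lifted version simply moves work to before the (expensive) shuffle, which is why the service applies it automatically.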
  19. Deep dive into the programming model
  20. The Apache Beam logical model: What are you computing? Where in event time? When in processing time? How do refinements relate?
  21. What are you computing? A Pipeline represents a graph: nodes are data processing transformations, edges are data sets flowing through the pipeline, and the whole graph is optimized and executed as a unit for efficiency.
  22. What are you computing? PCollections. A PCollection is a collection of homogeneous data of a single type; it may be bounded or unbounded in size; each element has an implicit timestamp; PCollections are initially created from backing data stores.
  23. Challenge: completeness when processing continuous data. Events with an 8:00 event time can keep arriving throughout the day (9:00, 10:00, ... 14:00).
  24. What are you computing? PTransforms transform PCollections into other PCollections. They come in three flavors: element-wise (Map + Reduce = ParDo), aggregating (Combine, Join, Group), and composite.
  25. Composite PTransforms (Apache Beam SDK). Example subgraph for Count: Pair With Ones, then GroupByKey, then Sum Values. Define new PTransforms by building up subgraphs of existing transforms. Some utilities are included in the SDK (Count, RemoveDuplicates, Join, Min, Max, Sum, ...), and you can define your own (DoSomething, DoSomethingElse, etc.). Why bother? Code reuse and a better monitoring experience.
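The Count composite on this slide (Pair With Ones, GroupByKey, Sum Values) can be sketched as plain functions chained in the same order (stdlib-only and illustrative; this is not the Beam SDK):

```python
from itertools import groupby

def pair_with_ones(elements):
    # Each element becomes a (key, 1) pair.
    return [(e, 1) for e in elements]

def group_by_key(pairs):
    # Sort so equal keys are adjacent, then group them.
    keyed = sorted(pairs, key=lambda kv: kv[0])
    return [(k, [v for _, v in grp]) for k, grp in groupby(keyed, key=lambda kv: kv[0])]

def sum_values(grouped):
    return [(k, sum(vs)) for k, vs in grouped]

def count(elements):
    # The composite: the same three-stage subgraph as on the slide.
    return sum_values(group_by_key(pair_with_ones(elements)))

print(count(["cat", "dog", "cat"]))  # [('cat', 2), ('dog', 1)]
```

The point of the composite is exactly what the slide says: `count` is reusable and shows up as one named node in monitoring, even though it expands to three transforms underneath.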
  26.-27. Example: Computing Integer Sums (code shown on the slides).
  28. Where in event time? Windowing divides data into event-time-based finite chunks and is required when doing aggregations over unbounded data. Per-key examples on the slide: fixed windows, sliding windows, and sessions.
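Of the three window types, sessions are the least obvious: per key, events closer together than a gap duration merge into one window. A stdlib-only sketch of that merging rule (the function and gap value are illustrative, not Beam's implementation):

```python
def session_windows(timestamps, gap):
    """Merge event timestamps into sessions separated by at least `gap`."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][1] < gap:
            sessions[-1][1] = t          # close enough: extend current session
        else:
            sessions.append([t, t])      # too far apart: start a new session
    return [(start, end) for start, end in sessions]

# Events at minutes 1, 2, 10, 11, 12 with a 5-minute gap form two sessions.
print(session_windows([1, 2, 10, 11, 12], gap=5))  # [(1, 2), (10, 12)]
```

Fixed and sliding windows can be assigned from the timestamp alone; sessions are data-driven, which is why they need this merge step.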
  29. Example: Fixed 2-minute Windows
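Fixed-window assignment is just integer arithmetic on each element's event timestamp. A plain-Python sketch of per-key sums in fixed 2-minute windows (illustrative, not the Beam API):

```python
from collections import defaultdict

WINDOW = 120  # fixed 2-minute windows, in seconds

def windowed_sums(events):
    """events: (key, event_time_seconds, value) triples."""
    sums = defaultdict(int)
    for key, event_time, value in events:
        # Every timestamp maps to the start of its enclosing window.
        window_start = event_time - event_time % WINDOW
        sums[(key, window_start)] += value
    return dict(sums)

events = [("k", 10, 5), ("k", 115, 7), ("k", 130, 3)]
print(windowed_sums(events))  # {('k', 0): 12, ('k', 120): 3}
```

Note the grouping key is effectively (key, window): windowing turns one unbounded aggregation into many finite ones, which is what makes sums over unbounded data well-defined.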
  30. When in processing time? Triggers control when results are emitted, and are often relative to the watermark; the slide plots processing time against event time, with the watermark's lag shown as skew.
  31. Example: Triggering at the Watermark
  32. Example: Triggering for Speculative & Late Data
  33. How do refinements relate? How should multiple outputs per window accumulate? The appropriate choice depends on the consumer. For a window whose panes fire with elements 3 (speculative), 5 and 1 (at the watermark), and 2 (late):

      Mode               Speculative   Watermark   Late     Total observed
      Discarding         3             6           2        11
      Accumulating       3             9           11       23
      Acc. & Retracting  3             9, -3       11, -9   11
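Given panes of 3 (speculative), then 5 and 1 (at the watermark), then 2 (late), the three accumulation modes on this slide can be sketched directly (plain Python, illustrative):

```python
panes = [[3], [5, 1], [2]]   # speculative, at-the-watermark, late firings

def discarding(panes):
    # Each firing emits only the sum of the new elements.
    return [sum(p) for p in panes]

def accumulating(panes):
    # Each firing emits the running total so far.
    out, total = [], 0
    for p in panes:
        total += sum(p)
        out.append(total)
    return out

def accumulating_retracting(panes):
    # Each firing emits the new total plus a retraction of the previous one.
    out, total = [], 0
    for p in panes:
        prev, total = total, total + sum(p)
        out.append((total, -prev) if prev else (total,))
    return out

print(discarding(panes))               # [3, 6, 2]: downstream sums to 11
print(accumulating(panes))             # [3, 9, 11]: naive downstream sum double-counts (23)
print(accumulating_retracting(panes))  # [(3,), (9, -3), (11, -9)]: nets to 11
```

This is why the choice depends on the consumer: a consumer that sums everything it sees wants discarding or retractions, while a consumer that overwrites the previous value per window wants accumulating.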
  34. Example: Add Newest, Remove Previous
  35. Customizing What / Where / When / How: 1. classic batch; 2. batch with fixed windows; 3. streaming; 4. streaming with speculative + late data; 5. streaming with retractions.
  36. The key takeaway
  37. Optimizing your Time to Answer. Typical data processing means programming plus resource provisioning, performance tuning, monitoring, reliability, deployment & configuration, handling growing scale, and utilization improvements. Data processing with Cloud Dataflow is mostly just programming, leaving more time to dig into your data.
  38. How much more time? You do not just save on processing, but on code complexity and size as well! Source: https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison
  39. What do customers have to say about Google Cloud Dataflow?
  "We are utilizing Cloud Dataflow to overcome elasticity challenges with our current Hadoop cluster. Starting with some basic ETL workflow for BigQuery ingestion, we transitioned into full blown clickstream processing and analysis. This has helped us significantly improve performance of our overall system and reduce cost." (Sudhir Hasbe, Director of Software Engineering, Zulily.com)
  "The current iteration of Qubit's real-time data supply chain was heavily inspired by the ground-breaking stream processing concepts described in Google's MillWheel paper. Today we are happy to come full circle and build streaming pipelines on top of Cloud Dataflow, which has delivered on the promise of a highly-available and fault-tolerant data processing system with an incredibly powerful and expressive API." (Jibran Saithi, Lead Architect, Qubit)
  "We are very excited about the productivity benefits offered by Cloud Dataflow and Cloud Pub/Sub. It took half a day to rewrite something that had previously taken over six months to build using Spark." (Paul Clarke, Director of Technology, Ocado)
  "Boosting performance isn't the only thing we want to get from the new system. Our bet is that by using cloud-managed products we will have a much lower operational overhead. That in turn means we will have much more time to make Spotify's products better." (Igor Maravić, Software Engineer, Spotify)
  40. Demo Time!
  41. Let's build something - Demo! 1. Extract words from a Shakespeare corpus, count the occurrences of each word, and write sharded results as blobs into a key-value store (Cloud Storage). 2. Ingest the stream of Wikipedia edits (https://wikitech.wikimedia.org/wiki/Stream.wikimedia.org), create a pipeline and run a Dataflow job to extract the top 10 active editors and top 10 pages edited, then inspect the result set in our data warehouse (BigQuery).
  42. Thank You! cloud.google.com/dataflow | cloud.google.com/blog/big-data/ | cloud.google.com/solutions/articles#bigdata | cloud.google.com/newsletter | research.google.com
