Stream processing by default
Modern processing for Big Data, as offered by Google Cloud Dataflow and Flink
William Vambenepe, Lead Product Manager for Data Processing, Google Cloud Platform
Flink Forward 2015
1. Stream processing by default
   Modern processing for Big Data, as offered by Google Cloud Dataflow and Flink
   William Vambenepe, Lead Product Manager for Data Processing, Google Cloud Platform
   @vambenepe / vbp@google.com
2. Goals:
   ● Write interesting computations
   ● Run in both batch & streaming
   ● Use custom timestamps
   ● Handle late data
3. Agenda
   1. Data Shapes
   2. Google’s Data Processing Story
   3. The Dataflow Model
   4. Google Cloud Dataflow service
4. Data Shapes (section 1)
5. Data...
6. ...can be big...
7. ...really, really big... [daily datasets: Tuesday, Wednesday, Thursday]
8. ...maybe even infinitely big... [timeline from 1:00 to 14:00]
9. … with unknown delays. [timeline: events stamped 8:00 arriving at various later processing times]
10. Data Processing Tradeoffs: Completeness (1 + 1 = 2), Latency, Cost ($$$)
11. Requirements: Billing Pipeline [Completeness / Low Latency / Low Cost, each rated Important or Not Important]
12. Requirements: Live Cost Estimate Pipeline [same axes, rated Important or Not Important]
13. Requirements: Abuse Detection Pipeline [same axes, rated Important or Not Important]
14. Requirements: Abuse Detection Backfill Pipeline [same axes, rated Important or Not Important]
15. Google’s Data Processing Story (section 2)
16. Google’s Data-Related Papers, 2002–2016: GFS, MapReduce, Big Table, Dremel, Pregel, FlumeJava, Colossus, Spanner, MillWheel, Dataflow
17. MapReduce: Batch Processing [diagram: Map (Prepare) → Shuffle → Reduce (Produce)]
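The Map → Shuffle → Reduce flow on this slide can be sketched in plain Java. This is a toy word count over in-memory collections, not Google's MapReduce API; all names here are illustrative.

```java
import java.util.*;

// Toy sketch of the three MapReduce phases: Map emits (word, 1) pairs,
// Shuffle groups the pairs by key, Reduce sums the values per key.
public class WordCountSketch {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map phase: emit one (word, 1) pair per word.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines)
            for (String word : line.split("\\s+"))
                pairs.add(Map.entry(word, 1));

        // Shuffle phase: group emitted values by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());

        // Reduce phase: sum the grouped values for each key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((k, vs) -> counts.put(k, vs.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("a b a", "b a"))); // {a=3, b=2}
    }
}
```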
18. FlumeJava: Easy and Efficient MapReduce Pipelines
    ● Higher-level API with simple data processing abstractions.
      ○ Focus on what you want to do to your data, not what the underlying system supports.
    ● A graph of transformations is automatically transformed into an optimized series of MapReduces.
19. Batch Patterns: Creating Structured Data [MapReduce]
20. Batch Patterns: Repetitive Runs [MapReduce over Tuesday, Wednesday, Thursday]
21. Batch Patterns: Time Based Windows [MapReduce over Tuesday, hourly windows from [11:00 - 12:00) through [23:00 - 0:00)]
22. Batch Patterns: Sessions [MapReduce; per-user sessions for Jose, Lisa, Ingo, Asha, Cheryl, Ari spanning Tuesday–Wednesday]
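The per-user session grouping shown here can be sketched with a simple gap rule: sort one user's event timestamps and start a new session whenever the gap between consecutive events exceeds a threshold. This is an illustrative helper of my own, not FlumeJava code; timestamps are plain minute values.

```java
import java.util.*;

// Sketch of session assignment for a single user's events.
// A new session begins when the gap between consecutive timestamps
// (in minutes) exceeds the given threshold.
public class SessionSketch {
    public static List<List<Integer>> sessions(List<Integer> timestamps, int gapMinutes) {
        List<Integer> sorted = new ArrayList<>(timestamps);
        Collections.sort(sorted);
        List<List<Integer>> out = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int t : sorted) {
            // Close the current session if this event is too far from the last one.
            if (!current.isEmpty() && t - current.get(current.size() - 1) > gapMinutes) {
                out.add(current);
                current = new ArrayList<>();
            }
            current.add(t);
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        // Events at minutes 1, 2, 3 then 30, 31 with a 10-minute gap → two sessions.
        System.out.println(sessions(List.of(1, 2, 3, 30, 31), 10)); // [[1, 2, 3], [30, 31]]
    }
}
```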
23. MillWheel: Streaming Computations
    ● Framework for building low-latency data-processing applications
    ● User provides a DAG of computations to be performed
    ● System manages state and persistent flow of elements
24. Streaming Patterns: Element-wise transformations [processing-time axis, 8:00–14:00]
25. Streaming Patterns: Aggregating Time Based Windows [processing-time axis, 8:00–14:00]
26. Streaming Patterns: Event Time Based Windows [input on event-time axis, output on processing-time axis, 10:00–15:00]
27. Streaming Patterns: Session Windows [input on event-time axis, output on processing-time axis, 10:00–15:00]
28. Event-Time Skew [diagram: processing time vs. event time, with watermark]
    Watermarks describe event time progress: "No timestamp earlier than the watermark will be seen."
    Often heuristic-based.
    Too Slow? Results are delayed. Too Fast? Some data is late.
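One common watermark heuristic, consistent with the "often heuristic-based" note above, tracks the latest timestamp seen per input source and takes the minimum across sources. This is an illustration of the idea, not MillWheel's actual mechanism.

```java
// Sketch of a heuristic watermark: a lower bound on event timestamps
// still expected. The bound is the minimum of the latest timestamp seen
// on each input source, so one lagging source holds the watermark back.
public class WatermarkSketch {
    private final long[] latestPerSource;

    public WatermarkSketch(int numSources) {
        latestPerSource = new long[numSources]; // all sources start at time 0
    }

    public void observe(int source, long eventTime) {
        latestPerSource[source] = Math.max(latestPerSource[source], eventTime);
    }

    // "No timestamp earlier than the watermark will be seen" — a promise only
    // if sources deliver in order; otherwise data behind the watermark is late.
    public long watermark() {
        long min = Long.MAX_VALUE;
        for (long t : latestPerSource) min = Math.min(min, t);
        return min;
    }

    public static void main(String[] args) {
        WatermarkSketch wm = new WatermarkSketch(2);
        wm.observe(0, 100);
        wm.observe(1, 90);
        System.out.println(wm.watermark()); // 90: source 1 lags, holding it back
        wm.observe(1, 120);
        System.out.println(wm.watermark()); // 100
    }
}
```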
29. The Dataflow Model (section 3)
30. What are you computing? Where in event time? When in processing time? How do refinements relate?
31. What are you computing?
    • A Pipeline represents a graph of data processing transformations
    • PCollections flow through the pipeline
    • Optimized and executed as a unit for efficiency
32. What are you computing?
    • A PCollection<T> is a collection of data of type T
    • May be bounded or unbounded in size
    • Each element has an implicit timestamp
    • Initially created from backing data stores
33. What are you computing? PTransforms transform PCollections into other PCollections. [Element-Wise, Aggregating, Composite]
34. Example: Computing Integer Sums
35. Example: Computing Integer Sums (continued)
36. Where in Event Time?
    • Windowing divides data into event-time-based finite chunks.
    • Required when doing aggregations over unbounded data.
    [Fixed, Sliding, and Sessions windows illustrated across three keys]
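Window assignment for the fixed and sliding shapes above can be sketched as pure functions over a timestamp. These are illustrative helpers, not the Dataflow SDK's windowing API; times are non-negative minute values.

```java
import java.util.*;

// Sketch of event-time window assignment. Windows are identified by
// their start minute; sizes and periods are illustrative.
public class WindowAssign {
    // Fixed windows of the given size: each timestamp falls in exactly one window.
    public static long fixedStart(long t, long size) {
        return t - (t % size); // assumes t >= 0
    }

    // Sliding windows of the given size, starting every `period` minutes:
    // each timestamp falls in size/period overlapping windows.
    public static List<Long> slidingStarts(long t, long size, long period) {
        List<Long> starts = new ArrayList<>();
        long first = fixedStart(t, period);
        for (long s = first; s > t - size; s -= period)
            starts.add(0, s); // collect starts in ascending order
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(fixedStart(13, 2));       // 12: window [12, 14)
        System.out.println(slidingStarts(13, 4, 2)); // [10, 12]: windows [10,14) and [12,16)
    }
}
```

Session windows, by contrast, cannot be computed from a single timestamp: they depend on neighboring events, which is why they need the gap-based grouping sketched earlier.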
37. Example: Fixed 2-minute Windows
    PCollection<KV<String, Integer>> output = input
      .apply(Window.into(FixedWindows.of(Minutes(2))))
      .apply(Sum.integersPerKey());
38. Example: Fixed 2-minute Windows (continued)
39. When in Processing Time?
    • Triggers control when results are emitted.
    • Triggers are often relative to the watermark.
    [diagram: watermark on processing time vs. event time]
40. Example: Triggering at the Watermark
    PCollection<KV<String, Integer>> output = input
      .apply(Window.into(FixedWindows.of(Minutes(2)))
             .trigger(AtWatermark()))
      .apply(Sum.integersPerKey());
41. Example: Triggering at the Watermark (continued)
42. Example: Triggering for Speculative & Late Data
    PCollection<KV<String, Integer>> output = input
      .apply(Window.into(FixedWindows.of(Minutes(2)))
             .trigger(AtWatermark()
                      .withEarlyFirings(AtPeriod(Minutes(1)))
                      .withLateFirings(AtCount(1))))
      .apply(Sum.integersPerKey());
43. Example: Triggering for Speculative & Late Data (continued)
44. How do Refinements Relate?
    • How should multiple outputs per window accumulate?
    • Appropriate choice depends on consumer.

    Firing:            Speculative | Watermark | Late   | Total Observed
    Elements           3           | 5, 1      | 2      | 11
    Discarding         3           | 6         | 2      | 11
    Accumulating       3           | 9         | 11     | 23
    Acc. & Retracting  3           | 9, -3     | 11, -9 | 11
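The rows of this table can be reproduced mechanically. The sketch below (plain Java, not SDK code) feeds the slide's three panes — [3], then [5, 1] at the watermark, then the late [2] — through each accumulation mode:

```java
import java.util.*;

// Sketch of the three accumulation modes from the table. One window fires
// three times (speculative, at-watermark, late); each mode decides what a
// firing emits for a Sum aggregation.
public class AccumulationModes {
    private static int sum(List<Integer> pane) {
        return pane.stream().mapToInt(Integer::intValue).sum();
    }

    // Discarding: each firing emits only its own pane's sum.
    public static List<Integer> discarding(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        for (List<Integer> pane : panes) out.add(sum(pane));
        return out;
    }

    // Accumulating: each firing emits the running total so far.
    public static List<Integer> accumulating(List<List<Integer>> panes) {
        List<Integer> out = new ArrayList<>();
        int running = 0;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(running);
        }
        return out;
    }

    // Accumulating & retracting: emit the new total plus a retraction
    // (negation) of the previously emitted total.
    public static List<int[]> accumulatingAndRetracting(List<List<Integer>> panes) {
        List<int[]> out = new ArrayList<>();
        int running = 0, prev = 0;
        boolean first = true;
        for (List<Integer> pane : panes) {
            running += sum(pane);
            out.add(first ? new int[]{running} : new int[]{running, -prev});
            prev = running;
            first = false;
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> panes = List.of(List.of(3), List.of(5, 1), List.of(2));
        System.out.println(discarding(panes));   // [3, 6, 2]  — a downstream sum sees 11
        System.out.println(accumulating(panes)); // [3, 9, 11] — a naive downstream sum sees 23
        for (int[] firing : accumulatingAndRetracting(panes))
            System.out.print(Arrays.toString(firing) + " "); // [3] [9, -3] [11, -9]
    }
}
```

Note how the outputs match the table: discarding and retracting both let a downstream consumer recover the true total of 11, while plain accumulating double-counts unless the consumer keeps only the latest value.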
45. Example: Add Newest, Remove Previous
    PCollection<KV<String, Integer>> output = input
      .apply(Window.into(Sessions.withGapDuration(Minutes(1)))
             .trigger(AtWatermark()
                      .withEarlyFirings(AtPeriod(Minutes(1)))
                      .withLateFirings(AtCount(1)))
             .accumulatingAndRetracting())
      .apply(new Sum());
46. Example: Add Newest, Remove Previous (continued)
47. Customizing What / Where / When / How:
    1. Classic Batch
    2. Batch with Fixed Windows
    3. Streaming
    4. Streaming with Speculative + Late Data
    5. Streaming with Retractions
48. Dataflow improvements over Lambda architecture
    ● Low-latency, approximate results
    ● Complete, correct results as soon as possible
    ● One system: less to manage, fewer resources, one set of bugs
    ● Tools for explicit reasoning about time
    = Power + Flexibility + Clarity
    Never re-architect a working pipeline for operational reasons
49. Open Source SDKs
    ● Used to construct a Dataflow pipeline.
    ● Java available now. Python in the works.
    ● Pipelines can run…
      ○ On your development machine
      ○ On the Dataflow Service on Google Cloud Platform
      ○ On third party environments like Spark (batch only) or Flink (streaming coming soon)
50. Google Cloud Dataflow service (section 4)
51. Fully Managed Dataflow Service
    Runs the pipeline on Google Cloud Platform. Includes:
    ● Graph optimization: Modular code, efficient execution
    ● Smart Workers: Lifecycle management, Autoscaling, and Smart task rebalancing
    ● Easy Monitoring: Dataflow UI, Restful API and CLI, Integration with Cloud Logging, etc.
52. Cloud Dataflow as a No-op Cloud service [diagram: User Code & SDK → Managed Service (Job Manager, Graph optimization, Work Manager) on Google Cloud Platform, with Deploy & Schedule, Progress & Logs, and a Monitoring UI]
53. Cloud Dataflow is part of a broader data platform
    ● Capture: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub
    ● Store: Cloud Storage (files), Cloud Bigtable (NoSQL), BigQuery Storage (tables), Cloud Datastore
    ● Process (batch & stream): Cloud Dataflow, Cloud Dataproc, Flink via bdutil
    ● Analyze: BigQuery Analytics (SQL), Cloud Datalab, Cloud Monitoring, real-time analytics and alerts
54. Apache Flink on Google Cloud
    Great Flink perf on GCE, e.g. matrix factorization (ALS) on 40 instances with local SSD:
    http://data-artisans.com/computing-recommendations-at-extreme-scale-with-apache-flink/
    One-click deploy via bdutil:
    https://github.com/GoogleCloudPlatform/bdutil/tree/master/extensions/flink
55. Google Cloud Datalab [fresh off the press]
    ● Jupyter notebooks created in one click.
    ● Direct BigQuery integration.
    ● Automatically stored in git repo on GCP.
56. Learn More
    ● The Dataflow Model @VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf
    ● Dataflow SDK for Java: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
    ● Google Cloud Dataflow on Google Cloud Platform: http://cloud.google.com/dataflow (Free Trial!)
    ● Contact me: vbp@google.com or on Twitter @vambenepe