DataFlow & Beam
Gabe Hamilton
So you’ve built your perfect video game.
People all over the world are playing it.
Now for Billing, High Scores, etc
People are playing your game on servers all over the world.
It’s time to start crunching all your data for billing, high scores, error reports, etc.
The time that events happened is important.
You charge per minute played, with surge pricing!
Data often arrives late.
Networks have delays; servers go down and send their data hours later.
Google DataFlow?
Apache Beam?
Yes!
What we’re going to cover
What is Dataflow?
Start a demo!
DataFlow Code
Batches & Streaming
Event Time
What is Google DataFlow?
Distributed Streaming (and Batch) Data processing engine
Pulls in data from Sources
Writes data to Sinks
Spins up data processing nodes, pushes your code out to them
Like Hadoop, but it handles unbounded data? Yep, similar idea.
Up and running in 10 mins
1. https://cloud.google.com/dataflow/getting-started
a. create a project
b. add dataflow API
c. create a google storage bucket
d. gcloud auth login
2. git clone git@github.com:gabehamilton/DataflowGroovySDK-examples.git
3. gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME" (requires a JDK)
Let's see some code - config
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions)
options.setProject('myproject')
options.setStagingLocation("gs://aStagingBucket")
options.setNumWorkers(1000) // ← !!! default is 3
options.setStreaming(true);
Pipeline pipeline = Pipeline.create(options);
Let's see some code - pipeline
// Extract and sum username/score pairs from the event data.
pipeline
    .apply(TextIO.Read.from(options.getInput())) // Read events from a text file
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    .apply("SumByUser", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery(options.getTableName()));
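To make the parse-then-sum steps concrete, here is a rough single-machine sketch in plain Java of what that pipeline computes. It uses no Dataflow SDK calls. The class name UserScoreSketch and the line format user,team,score,timestampMillis are assumptions for illustration, not something the slides specify.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class UserScoreSketch {
    // Assumed (hypothetical) event line format: user,team,score,timestampMillis
    static Map<String, Integer> sumByUser(List<String> lines) {
        Map<String, Integer> totals = new TreeMap<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 4) {
                continue; // skip malformed lines instead of failing the whole run
            }
            try {
                int score = Integer.parseInt(parts[2].trim());
                // Add this score to the user's running total.
                totals.merge(parts[0].trim(), score, Integer::sum);
            } catch (NumberFormatException e) {
                // Score field was not a number: skip this line.
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "alice,red,10,1000",
            "bob,blue,7,2000",
            "alice,red,5,3000",
            "not-a-valid-line");
        System.out.println(sumByUser(lines)); // prints {alice=15, bob=7}
    }
}
```

The real pipeline does the same work, but spread across many workers and written to BigQuery instead of a local map.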
Components
PCollection - Parallel Collection
    Standard interface to a set of data.
    Can be a streaming data set.
PTransform - Parallel Transform
    Takes input, produces output.
ParDo - Parallel Do
    Your custom transform function.
Demo
Running Dataflow - staging files
Our code
Output
Dependencies
Code is staged in the staging bucket before getting pushed to Workers.
Staged files - detail
We don’t need no stinking Batches
Handles Batches!
Streaming!
Not real time streaming, unbounded data set streaming.
Continuously processing your:
User scores, billing, analytics
Risk, spam, and other deviations from the mean
Event Time
Dataflow lets you work in event time (when the event says it happened)
rather than processing time (when the event was received).
Allows out-of-order processing:
a plane full of mobile users just landed, turned their phones back on, and started delivering the past 2 hours of data.
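A small plain-Java sketch of the idea, with no Dataflow SDK involved: events are bucketed by the timestamp they carry, so the order in which they arrive does not change the result. The class and method names are made up for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class EventTimeSketch {
    /** Start of the fixed-size window (aligned to the epoch) that contains eventTime. */
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ms = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ms - Math.floorMod(ms, sizeMs));
    }

    /** Counts events per window, keyed by event time rather than arrival order. */
    static Map<Instant, Integer> countPerWindow(List<Instant> eventTimes, Duration size) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Instant t : eventTimes) {
            counts.merge(windowStart(t, size), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Listed in arrival order: the 17:55 event shows up after the 18:05 one.
        List<Instant> arrivals = Arrays.asList(
            Instant.parse("2016-05-01T17:10:00Z"),
            Instant.parse("2016-05-01T18:05:00Z"),
            Instant.parse("2016-05-01T17:55:00Z"));
        System.out.println(countPerWindow(arrivals, Duration.ofHours(1)));
        // prints {2016-05-01T17:00:00Z=2, 2016-05-01T18:00:00Z=1}
    }
}
```

Bucketing by processing time would have put the late 17:55 event into whatever hour it happened to arrive in.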
Features for working with Event Time
Windowing: hourly, session based.
Watermarks: all the data is in.
    Fixed: end of match, end of file.
    Heuristic: the data is probably in.
    Percentile: 90% of the data is in.
Triggers: emitting partial results.
Accumulations: ways of dealing with late data.
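As a toy illustration of a heuristic watermark, the plain-Java sketch below guesses "the data is probably in" as the largest event time seen so far minus an assumed maximum delay. Real runners compute watermarks from their sources, so treat this purely as a mental model. WatermarkSketch and its method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;

class WatermarkSketch {
    private final Duration assumedDelay; // assumption: events are rarely later than this
    private Instant maxEventTime = Instant.EPOCH;

    WatermarkSketch(Duration assumedDelay) {
        this.assumedDelay = assumedDelay;
    }

    /** Record the event time of an element we just received. */
    void observe(Instant eventTime) {
        if (eventTime.isAfter(maxEventTime)) {
            maxEventTime = eventTime;
        }
    }

    /** Heuristic guess: data older than this has probably all arrived. */
    Instant watermark() {
        return maxEventTime.minus(assumedDelay);
    }

    /** True once the watermark has reached the end of the window. */
    boolean isWindowComplete(Instant windowEnd) {
        return !watermark().isBefore(windowEnd);
    }

    /** True once the watermark is past the window end plus the allowed lateness. */
    boolean isTooLate(Instant windowEnd, Duration allowedLateness) {
        return watermark().isAfter(windowEnd.plus(allowedLateness));
    }

    public static void main(String[] args) {
        WatermarkSketch w = new WatermarkSketch(Duration.ofMinutes(5));
        Instant windowEnd = Instant.parse("2016-05-01T18:00:00Z");
        w.observe(Instant.parse("2016-05-01T18:10:00Z"));
        System.out.println(w.isWindowComplete(windowEnd));            // prints true
        System.out.println(w.isTooLate(windowEnd, Duration.ofHours(12))); // prints false
    }
}
```

A fixed watermark (end of match, end of file) needs no guessing; the heuristic and percentile kinds trade completeness for getting results sooner.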
Streaming Event Time Example
How many errors occurred between 5-6pm?
Window: Hourly (events per hour).
Trigger: Each minute. As we process data, update windows every minute.
Allowed Lateness: 12 hours. After 12 hours, discard any late data that arrives.
Accumulation: Discarding. After each firing the window's buffered data is thrown out, so the next result covers only what arrived since the previous firing.
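The difference between the two accumulation modes is easiest to see with numbers. In the Dataflow/Beam model, discarding mode emits only the data that arrived since the previous firing, so a consumer adds the panes up; accumulating mode re-emits the running total, so each new pane replaces the previous one. A plain-Java sketch, with hypothetical names and no SDK calls:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PaneModeSketch {
    /**
     * newCountsPerFiring.get(i) is the number of events that arrived for one
     * window between firing i-1 and firing i.
     * Discarding: each pane reports only those new events.
     */
    static List<Integer> discardingPanes(List<Integer> newCountsPerFiring) {
        return new ArrayList<>(newCountsPerFiring);
    }

    /** Accumulating: each pane reports everything seen so far for the window. */
    static List<Integer> accumulatingPanes(List<Integer> newCountsPerFiring) {
        List<Integer> panes = new ArrayList<>();
        int runningTotal = 0;
        for (int count : newCountsPerFiring) {
            runningTotal += count;
            panes.add(runningTotal);
        }
        return panes;
    }

    public static void main(String[] args) {
        // Errors seen for the 5-6pm window across three one-minute firings.
        List<Integer> arrivals = Arrays.asList(3, 2, 4);
        System.out.println(discardingPanes(arrivals));   // prints [3, 2, 4]
        System.out.println(accumulatingPanes(arrivals)); // prints [3, 5, 9]
    }
}
```

Either way the final answer for the window is 9 errors; the modes only differ in who keeps the running total.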
Code - streaming event time
.apply(Window.into(FixedWindows.of(ONE_HOUR)) // Duration.standardHours(1)
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
        .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
    .withAllowedLateness(TWELVE_HOURS)
    .discardingFiredPanes())
Demo 2
Streaming Fraud Detection
What is Apache Beam?
A standard for running pipelines on different engines
Direct PipelineRunner (i.e. local)
Dataflow PipelineRunner
Flink PipelineRunner
Spark PipelineRunner (new)
Apache
Beam
What to remember
Process lots of data
Out of order & Late data
On the cluster of your choice
Locally testable
Questions?
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers
Answers
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers Answers
Thanks!
Image credits:
http://fav.me/d80wco9 Game mashup
http://mrg.bz/UwguyD Red Beams
http://mrg.bz/ccBto0 Blue Beams
http://mrg.bz/QfHhyS Steel beam frame
http://mrg.bz/Dtcc1B Clock
