DataFlow & Beam

So you’ve built your perfect video game.
People all over the world are playing it.

Now for Billing, High Scores, etc
People are playing your game on servers all over the world.
It’s time to start crunching all your data for billing, high scores, error reports, etc.
The time that events happened is important.
You charge per minute played, with surge pricing!
Data often arrives late.
Networks have delays; servers go down and send their data hours later.

Google DataFlow?
Apache Beam?
Yes!

What we’re going to cover
What is Dataflow?
Start a demo!
DataFlow Code
Batches & Streaming
Event Time

What is Google DataFlow?
Distributed Streaming (and Batch) Data processing engine
Pulls in data from Sources
Writes data to Sinks
Spins up data processing nodes, pushes your code out to them
Like Hadoop, but handles unbounded data? Yep, similar idea.

Up and running in 10 mins
1. https://cloud.google.com/dataflow/getting-started
a. create a project
b. add dataflow API
c. create a google storage bucket
d. gcloud auth login
2. git clone git@github.com:gabehamilton/DataflowGroovySDK-examples.git
3. gradlew run -Pargs="project=PROJECT_NAME stagingLocation=BUCKET_NAME" (requires a JDK)

Let’s see some code - config
DataflowPipelineOptions options =
    PipelineOptionsFactory.create().as(DataflowPipelineOptions)
options.setProject('myproject')
options.setStagingLocation("gs://aStagingBucket")
options.setNumWorkers(1000) // ← !!! default is 3
options.setStreaming(true)
Pipeline pipeline = Pipeline.create(options)

Let’s see some code - pipeline
pipeline // Extract and sum username/score pairs from the event data.
    .apply(TextIO.Read.from(options.getInput())) // Read events from a text file
    .apply(ParDo.named("ParseGameEvent").of(new ParseEventFn()))
    .apply("SumByUser", new ExtractAndSumScore("user"))
    .apply("WriteUserScoreSums",
        new WriteToBigQuery(options.getTableName()))

Components
PCollection - Parallel Collection
Standard interface to a set of data.
Can be a streaming data set.
PTransform - Parallel Transform
takes Input, produces Output
ParDo - Parallel Do
Your custom Transform Function
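
To make the three roles concrete, here is a minimal plain-Java analogy. It does not use the Dataflow SDK at all: the class UserScoreSketch, the GameEvent type, and the "user,team,score" line format are assumptions made up for illustration. The list plays the part of a PCollection, parse() plays the part of the function you would hand to a ParDo (called once per element, may emit nothing), and sumByUser() plays the part of a PTransform (collection in, collection out).

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy only: no Dataflow SDK classes are used here.
final class UserScoreSketch {

    // Hypothetical event shape, parsed from a line like "user,team,score".
    static final class GameEvent {
        final String user;
        final int score;

        GameEvent(String user, int score) {
            this.user = user;
            this.score = score;
        }
    }

    // Role of the per-element function inside a ParDo:
    // one input element in, zero or one output elements out.
    static Optional<GameEvent> parse(String line) {
        String[] parts = line.split(",");
        if (parts.length != 3) {
            return Optional.empty(); // drop malformed lines
        }
        try {
            return Optional.of(new GameEvent(parts[0].trim(), Integer.parseInt(parts[2].trim())));
        } catch (NumberFormatException e) {
            return Optional.empty(); // drop lines with a non-numeric score
        }
    }

    // Role of a PTransform: takes a collection, produces a collection.
    static Map<String, Integer> sumByUser(List<String> lines) {
        return lines.stream()
                .map(UserScoreSketch::parse)
                .filter(Optional::isPresent)
                .map(Optional::get)
                .collect(Collectors.groupingBy(e -> e.user, TreeMap::new,
                        Collectors.summingInt(e -> e.score)));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("ann,red,10", "bob,blue,7", "garbage", "ann,red,5");
        System.out.println(sumByUser(lines)); // prints {ann=15, bob=7}
    }
}
```

The difference in the real thing is that the collection may never end and the per-element function runs on many workers at once, which is why it is written as a standalone function object rather than a loop.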

Running Dataflow - staging files
Our code
Output
Dependencies
Code is staged in the staging bucket before getting pushed to Workers.

We don’t need no stinking Batches
Handles Batches!
Streaming!
Not real-time streaming, but streaming of an unbounded data set.
Continuously processing your
User Scores, Billing, Analytics
Risk, Spam, and other deviations from the mean

Event Time
Dataflow lets you work in event time (when the event says it happened)
rather than processing time (when the event was received).
Allows out-of-order processing:
a plane full of mobile users just landed, turned their phones back on, and started delivering the past 2 hours of data.
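
The plane scenario can be sketched in plain Java, without the Dataflow SDK. Everything here is an illustrative assumption: the class EventTimeSketch, the Event type with its two timestamps, and the sample dates. The sketch buckets the same three events into hours twice, once by event time and once by processing (arrival) time.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only: no Dataflow SDK classes are used.
final class EventTimeSketch {

    // Hypothetical event carrying both timestamps so the two views can be compared.
    static final class Event {
        final Instant eventTime;   // when the event says it happened
        final Instant arrivalTime; // when the event was received

        Event(String eventTime, String arrivalTime) {
            this.eventTime = Instant.parse(eventTime);
            this.arrivalTime = Instant.parse(arrivalTime);
        }
    }

    // Start of the one-hour bucket (UTC) that contains t.
    static Instant hourStart(Instant t) {
        long hourMillis = Duration.ofHours(1).toMillis();
        long millis = t.toEpochMilli();
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, hourMillis));
    }

    // Count events per hour, keyed either by event time or by arrival time.
    static Map<Instant, Integer> countPerHour(List<Event> events, boolean byEventTime) {
        Map<Instant, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            Instant bucket = hourStart(byEventTime ? e.eventTime : e.arrivalTime);
            counts.merge(bucket, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
                // played and reported almost immediately
                new Event("2016-05-01T19:01:00Z", "2016-05-01T19:02:00Z"),
                // played in the air, delivered only after landing
                new Event("2016-05-01T17:10:00Z", "2016-05-01T19:05:00Z"),
                new Event("2016-05-01T17:40:00Z", "2016-05-01T19:05:30Z"));
        System.out.println("by event time:      " + countPerHour(events, true));
        System.out.println("by processing time: " + countPerHour(events, false));
    }
}
```

By event time the 17:00 hour gets two events and the 19:00 hour gets one; by processing time all three land in the 19:00 hour, which is the wrong answer if you bill per minute played.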

Features for working with Event Time
Windowing: hourly, session based
Watermarks: all the data is in
  Fixed: end of match, end of file
  Heuristic: the data is probably in
  Percentile: 90% of the data is in
Triggers: emitting partial results
Accumulations: ways of dealing with late data
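
How a watermark and an allowed-lateness setting decide the fate of an arriving element can be sketched in a few lines of plain Java. This is a simplified model and not the SDK's implementation: the class LatenessSketch, the Status enum, and classify() are hypothetical names, and exact boundary handling at the window edge is glossed over.

```java
import java.time.Duration;
import java.time.Instant;

// Simplified model of watermark-based lateness; not the Dataflow SDK's actual logic.
final class LatenessSketch {

    enum Status { ON_TIME, LATE, DROPPED }

    // windowEnd: end of the window the element belongs to (by its event time).
    // watermark: the engine's current estimate of "all data up to here is in".
    // allowedLateness: how long after the window end late data is still accepted.
    static Status classify(Instant windowEnd, Instant watermark, Duration allowedLateness) {
        if (!watermark.isAfter(windowEnd)) {
            return Status.ON_TIME; // window is not yet considered complete
        }
        if (!watermark.isAfter(windowEnd.plus(allowedLateness))) {
            return Status.LATE;    // window was complete, but is still kept around
        }
        return Status.DROPPED;     // window state has been thrown away
    }

    public static void main(String[] args) {
        Instant windowEnd = Instant.parse("2016-05-01T18:00:00Z"); // the 5-6pm window
        Duration lateness = Duration.ofHours(12);
        System.out.println(classify(windowEnd, Instant.parse("2016-05-01T17:30:00Z"), lateness)); // ON_TIME
        System.out.println(classify(windowEnd, Instant.parse("2016-05-01T20:00:00Z"), lateness)); // LATE
        System.out.println(classify(windowEnd, Instant.parse("2016-05-02T06:30:00Z"), lateness)); // DROPPED
    }
}
```

Triggers then decide when a window emits results in each of these phases, for example early partial results while the status is still ON_TIME and updated results when LATE data shows up.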

Streaming Event Time Example
Window: Hourly (events per hour)
  How many errors occurred between 5-6pm?
Trigger: Each minute
  As we process data, update windows every minute.
Allowed Lateness: 12 hours
  After 12 hours, discard any late data that arrives.
Accumulation: Discarding
  After each update, throw away what the window has collected so far, so each result covers only the data that arrived since the previous update.
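
The accumulation setting is easiest to see with numbers. In the Beam model, discarding mode emits only what arrived since the previous firing, so a consumer has to add the results together, while accumulating mode emits the running total, so each result replaces the previous one. The plain-Java sketch below (hypothetical class PaneAccumulationSketch, no SDK involved) shows the results emitted by three successive firings of one window under each mode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of discarding vs accumulating results; no SDK classes used.
final class PaneAccumulationSketch {

    // newErrorsPerFiring.get(i) is the number of errors that arrived for the window
    // between firing i-1 and firing i of the trigger.
    static List<Integer> emittedResults(List<Integer> newErrorsPerFiring, boolean accumulating) {
        List<Integer> results = new ArrayList<>();
        int runningTotal = 0;
        for (int newErrors : newErrorsPerFiring) {
            runningTotal += newErrors;
            // Discarding: emit only the new data. Accumulating: emit everything so far.
            results.add(accumulating ? runningTotal : newErrors);
        }
        return results;
    }

    public static void main(String[] args) {
        List<Integer> arrivals = Arrays.asList(3, 2, 4);
        System.out.println(emittedResults(arrivals, false)); // prints [3, 2, 4]
        System.out.println(emittedResults(arrivals, true));  // prints [3, 5, 9]
    }
}
```

With discarding, the answer to "how many errors between 5-6pm" is the sum of the emitted results (3 + 2 + 4 = 9); with accumulating, it is simply the latest result.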

Code - streaming event time
.apply(Window.into(FixedWindows.of(ONE_HOUR)) // ONE_HOUR = Duration.standardHours(1)
    // AfterWatermark.pastEndOfWindow() is the base trigger the early/late firings attach to
    .triggering(AfterWatermark.pastEndOfWindow()
        .withEarlyFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(ONE_MINUTE))
        .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(TEN_MINUTES)))
    .withAllowedLateness(TWELVE_HOURS)
    .discardingFiredPanes())

Demo 2
Streaming
Fraud
Detection

What is Apache Beam?
A standard for running pipelines on different engines
Direct PipelineRunner (i.e. local)
Dataflow PipelineRunner
Flink PipelineRunner
Spark PipelineRunner (new)

What to remember
Process lots of data
Out of order & Late data
On the cluster of your choice
Locally testable

Questions?
Answers Answers Answers Answers Answers Answers
Answers Answers Answers Answers Answers
Answers

Thanks!
Image credits:
http://fav.me/d80wco9 Game mashup
http://mrg.bz/UwguyD Red Beams
http://mrg.bz/ccBto0 Blue Beams
http://mrg.bz/QfHhyS Steel beam frame
http://mrg.bz/Dtcc1B Clock
