Gcp dataflow

Google Cloud Dataflow
Igor Roiter, CTO @ DataZone
Welcome!

About CloudZone
End-to-end Cloud Solutions
• Migration
• Security
• DevOps
• Big Data
• DR & more
Full Service Package
• Consulting
• Managed Services
• Professional Services
Years of Experience
Largest partner in Israel
Our Goal: To ensure our customers adopt the most advanced technologies at a minimal cost

What We Do…
FinOps DevOps Well Architected
• Architecture schematics
• TCO
• SOW
Full Service Package
• Consulting
• Managed Services
• Professional Services
Continuous Integration and Deployment

Our Professional Services
• Cloud Migration
• DevOps
• Disaster Recovery
• Big Data & BI Analytics
• Dev & Test
• Security
• Cost Optimization
• Cloud Extension

Customer Success Unit (FinOps)
Design for Maximal Cost Reduction
Leverage the Power of the Cloud
Find and Eliminate Waste
Implement Governance Policies
Helping you save thousands on your monthly bill!
• Billing Analytics
• RI Utilization
• Data Analytics
• Automate Optimization
• Asset Management
• Consumption Management
Cloud Resource Management Tool

Hybrid Cloud Solutions
• Architecture & deployment of Private Hybrid clouds based on VMware vRA or OpenStack
• Configuration Management: Chef, Puppet, Ansible & Salt
• Public Cloud DevOps & Automation
• OpenShift, Cloud Foundry based, Kubernetes, DCOS and more.
Cloud Containers DevOps

DataZone – Advanced Data Solutions
DataZone is a leading trusted advisor and integrator for data and search applications
• Troubleshooting & Tuning
• Technological Evaluation
• Training Services
• Architecture Review
• Cost Management
• End-to-End Implementations
• Infrastructure Support / DevOps

About Me
• Working as a data specialist for the last 10 years
• SQL Server DBA in a past life
• Software developer & architect
• CTO @ DataZone since 2015

Agenda
Processing Data Types
Big Data processing trade-offs
What is Apache Beam
Google Cloud Dataflow managed service
Pricing

Data…
...really, really big...
[Figure: data accumulating day after day: Tuesday, Wednesday, Thursday]

Data…
...maybe even infinitely big...
[Figure: timeline with hourly ticks from 1:00 to 14:00]

Data…
…with unknown delays.
[Figure: timeline from 8:00 to 14:00 with several events labeled 8:00]

The trade-offs of Big Data processing
Completeness Latency Cost
Data Processing Tradeoffs

Processing Data Types
Batch Processing – Designed for finite datasets
MapReduce: SELECT + GROUP BY
(distributed input dataset)
Map (SELECT)
Shuffle (GROUP BY)
Reduce (SELECT)
(distributed output dataset)
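The three MapReduce stages can be sketched in a few lines of plain Python. This is an illustrative single-machine sketch, not a distributed engine; the record fields `user` and `score` are hypothetical.

```python
from collections import defaultdict

def map_phase(records):
    # Map (SELECT): project each record to a (key, value) pair.
    for rec in records:
        yield rec["user"], rec["score"]

def shuffle_phase(pairs):
    # Shuffle (GROUP BY): bring all values for the same key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each group into a single output value.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical finite input dataset.
records = [
    {"user": "ann", "score": 3},
    {"user": "bob", "score": 5},
    {"user": "ann", "score": 4},
]
totals = reduce_phase(shuffle_phase(map_phase(records)))
```

Because the input is finite, the shuffle can wait until every record has been mapped, which is exactly what an unbounded source does not allow.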

Time-Bound Data processing
Event time - the time at which events actually occurred.
Processing time - the time at which events are observed in the system.
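A minimal sketch of the difference between the two time domains, assuming each record carries both timestamps. The field names and the sample events are hypothetical.

```python
from datetime import datetime

# Hypothetical events: when each one happened vs. when the system saw it.
events = [
    {"id": "a",
     "event_time": datetime(2020, 1, 1, 8, 0),
     "processing_time": datetime(2020, 1, 1, 8, 1)},
    {"id": "b",
     "event_time": datetime(2020, 1, 1, 8, 0),
     "processing_time": datetime(2020, 1, 1, 10, 0)},
]

def skew_minutes(event):
    """Delay between occurrence and observation, in minutes."""
    delta = event["processing_time"] - event["event_time"]
    return delta.total_seconds() / 60

skews = {e["id"]: skew_minutes(e) for e in events}
```

Both events occurred at 8:00, yet one is observed a minute later and the other two hours later. That gap is the skew a pipeline has to cope with.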

Batch Processing of unbounded data – Patterns
Fixed window batch by processing time
• Fixed windows of X units of time divide the data into smaller finite groups for processing
• More suitable for use cases in which old data is irrelevant and only the current state matters
• Data correctness issues occur because we window by processing time instead of event time
• Often we must wait an arbitrary amount of time before closing the window, and thus delay all next
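The correctness issue can be shown with a small sketch that counts the same events twice, once per window of processing time and once per window of event time. The events and the 60-minute window size are assumptions chosen for illustration.

```python
from collections import Counter

WINDOW_MINUTES = 60

def window_start(minute):
    # Start of the fixed window that contains the given minute.
    return minute - minute % WINDOW_MINUTES

# Hypothetical (event_minute, processing_minute) pairs, minutes since midnight.
# The third event happened at 8:20 but was only observed at 10:10.
events = [(480, 481), (490, 495), (500, 610), (545, 550)]

by_event_time = Counter(window_start(e) for e, _ in events)
by_processing_time = Counter(window_start(p) for _, p in events)
```

Windowing by event time puts three events in the 8:00 window. Windowing by processing time puts only two there and reports the delayed one in the 10:00 window, where it did not occur.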

Batch Processing of unbounded data – Patterns
Session window batch processing
• Sessions are defined as periods of activity.
• Sessions are usually terminated by inactivity or by a closing event.
• Main issue with batch processing of sessions: when calculating sessions using a typical batch engine, you often end up with sessions that are split across windows, as indicated by the red marks in the diagram below.
• The number of splits may be reduced by increasing the window size, but at the cost of increased latency
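The session-splitting problem can be reproduced with a short sketch. It assumes sessions end after a fixed inactivity gap; the timestamps, the 10-minute gap and the batch sizes are hypothetical.

```python
def sessionize(timestamps, gap):
    """Group timestamps into sessions separated by more than `gap` of inactivity."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= gap:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

def sessionize_in_batches(timestamps, gap, batch_size):
    """Sessionize each fixed batch window independently, as a batch engine would."""
    batches = {}
    for t in timestamps:
        batches.setdefault(t // batch_size, []).append(t)
    result = []
    for key in sorted(batches):
        result.extend(sessionize(batches[key], gap))
    return result

activity = [50, 55, 58, 62, 65, 130]  # minutes, one user
whole = sessionize(activity, gap=10)
split = sessionize_in_batches(activity, gap=10, batch_size=60)
wider = sessionize_in_batches(activity, gap=10, batch_size=120)
```

With 60-minute batches the first session is cut in two at the batch boundary. With 120-minute batches it stays whole, but results are available only after twice the wait.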

Streaming Data - Designed with infinite data sets in mind
Challenges when dealing with unbounded data sets
• Highly unordered with respect to event times, meaning you need some sort of time-based shuffle in your pipeline if you want to analyze the data in the context in which they occurred
• Varying event time skew, meaning you can’t just assume you’ll always see most of the data for a given event time X within some constant epsilon of time Y
Challenge: completeness when processing continuous data
[Figure: timeline from 8:00 to 14:00 with several events labeled 8:00]

Streaming processing of unbounded data – patterns
Time-agnostic
• Time-agnostic processing is used in cases where time is essentially irrelevant, meaning that all relevant logic is data driven
• A basic example of time-agnostic processing is filtering: imagine you’re processing Web traffic logs and you want to filter out all traffic that didn’t originate from a specific domain
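A sketch of such a filter, assuming each log record carries a `referrer` URL. The field name and the sample records are hypothetical.

```python
from urllib.parse import urlparse

def from_domain(records, domain):
    """Yield only records whose referrer host is `domain` or one of its subdomains."""
    for rec in records:
        host = urlparse(rec["referrer"]).hostname or ""
        if host == domain or host.endswith("." + domain):
            yield rec

# Hypothetical log records; the generator works equally on an endless stream.
logs = [
    {"referrer": "https://example.com/a"},
    {"referrer": "https://other.org/"},
    {"referrer": "https://www.example.com/b"},
]
kept = list(from_domain(logs, "example.com"))
```

Each record is judged on its own content, so no windowing, buffering or notion of time is needed.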

Streaming processing of unbounded data – patterns
Windowing
• Types: Fixed windows, Sliding windows, Sessions.
• Streaming engines make it possible to window by event time, although this requires buffering for late data.
• New windowing tools are introduced for late data completeness: watermarks and triggers
• Watermarks – A watermark is a notion of input completeness with respect to event times. A watermark with a value of time X makes the statement: “all input data with event times less than X have been observed.” As such, watermarks act as a metric of progress when observing an unbounded data source with no known end.
• Triggers – A trigger is a mechanism for declaring when the output for a window should be materialized relative to some external signal
• Accumulation – A set of rules that defines how window results are influenced by the use of triggers
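How a watermark can drive a trigger is sketched below in plain Python. This is a toy model under stated assumptions, not the Beam or Dataflow implementation: the class `FixedWindowSum` and its methods are hypothetical, and the watermark is supplied by the caller rather than estimated from the source.

```python
class FixedWindowSum:
    """Sums values per fixed event-time window and fires when the watermark passes."""

    def __init__(self, size):
        self.size = size
        self.sums = {}       # window start -> running sum
        self.fired = set()   # windows already emitted on time

    def add(self, event_time, value):
        """Buffer a value; return True if it arrived after its window fired (late data)."""
        start = event_time - event_time % self.size
        self.sums[start] = self.sums.get(start, 0) + value
        return start in self.fired

    def advance_watermark(self, watermark):
        """Fire every window whose end is at or before the watermark."""
        results = []
        for start in sorted(self.sums):
            if start + self.size <= watermark and start not in self.fired:
                self.fired.add(start)
                results.append((start, self.sums[start]))
        return results

w = FixedWindowSum(size=60)
w.add(10, 5)
w.add(30, 1)
too_early = w.advance_watermark(50)   # window [0, 60) not yet complete
on_time = w.advance_watermark(60)     # watermark reached the window end
is_late = w.add(45, 2)                # event time 45 shows up after firing
```

The late element changes the stored sum for a window that has already produced output. Deciding what to emit next for that window is the job of the accumulation rules.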

Hybrid approach
State of the art until recently: Lambda Architecture
Historical events -> Periodic batch processing -> Exact historical model
Continuous updates -> Stream processing system -> Approximate real-time model

The trade-offs of Big Data processing
Streaming or Batch?
Completeness Latency Cost
Why not both?

Apache Beam
Before Apache Beam
Batch OR Stream
Accuracy OR Speed
Simplicity OR Sophistication
Savings OR Scalability
After Apache Beam
Batch AND Stream
Accuracy AND Speed
Simplicity AND Sophistication
Savings AND Scalability
Balancing correctness, latency and cost with a unified
batch and streaming model

What is Apache Beam
Apache Beam is a big data processing standard created by Google in 2016.
It provides a unified DSL to process both batch and stream data, and can be
executed on popular platforms like Spark, Flink, and of course Google’s
commercial product Dataflow. Beam’s model is based on previous works
known as FlumeJava and MillWheel, and addresses data processing tasks
like ETL, analysis, and stream processing. Currently it provides SDKs in
two languages, Java and Python. This article will introduce how to use
Python to write Beam applications.

Apache Beam – Basic framework concepts
Pipeline
• A Pipeline represents a graph of data processing transformations
PCollection
• Immutable collection
• Could be bounded or unbounded
• Created by:
• a backing data store
• a transformation from other PCollection
• generated
• PCollection<KV<K,V>>
Transform
• Combine small operations into bigger transforms
• Standard transform (COUNT, SUM,...)
• Reuse and Monitoring

A fully-managed cloud service and programming model for batch and streaming big data processing on Google Cloud.
• Fully managed Runner service
• Out of the box API support for all Google Cloud data sources/destinations – Pub/Sub, Google Cloud Storage, BigQuery, Bigtable, Spanner, etc.
• Monitoring and UI – Stackdriver is used for real-time logs of the whole process
• Automated Resource Management
• Dynamic Work Rebalancing

Data processing with Google Cloud Dataflow
What are you computing?
How does the event time affect computation?
How does the processing time affect latency?
How do results get refined?

What are you computing?
• A Pipeline represents a graph of data processing transformations
• PCollections flow through the pipeline
• Optimized and executed as a unit for efficiency

[Figure: three windowing strategies applied to keyed data (Key 1, Key 2, Key 3): Fixed, Sliding, Sessions]
Aggregating According to Event Time
• Windowing divides data into event-time-based finite chunks.
• Required when doing aggregations over unbounded data.
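Window assignment for the fixed and sliding strategies reduces to simple arithmetic on the event timestamp. The sketch below uses integer timestamps and half-open windows [start, end); the function names are hypothetical and are not Beam API calls.

```python
def fixed_window(t, size):
    """Return the single fixed window [start, end) that contains event time t."""
    start = t - t % size
    return (start, start + size)

def sliding_windows(t, size, period):
    """Return every sliding window [start, end) containing t.

    Windows are `size` long and a new one starts every `period` units.
    """
    last_start = t - t % period
    # Walk back one period at a time while the window still covers t.
    starts = range(last_start, t - size, -period)
    return sorted((s, s + size) for s in starts)

one = fixed_window(125, size=60)
many = sliding_windows(125, size=60, period=30)
```

An element falls into exactly one fixed window but into size / period sliding windows, two in this example. That is why sliding windows multiply the aggregation work.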
What Where When How

What Where When How
When in Processing Time?
• Triggers control when results are emitted.
• Triggers are often relative to the watermark.
[Figure: processing time plotted against event time, showing the watermark]

What Where When How
How are Results refined?
• How should multiple outputs per window accumulate?
• Appropriate choice depends on consumer.
Firing          Elements   Discarding   Accumulating   Acc. & Retracting
Speculative     3          3            3              3
Watermark       5, 1       6            9              9, -3
Late            2          2            11             11, -9
Total Observed  11         11           23             11
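The numbers in the table can be reproduced with a short sketch. The three firings (speculative, watermark, late) carry the elements 3, then 5 and 1, then 2; the function names are hypothetical and only model what a consumer would see under each mode.

```python
def discarding(panes):
    # Each firing reports only the elements that arrived since the last firing.
    return [sum(pane) for pane in panes]

def accumulating(panes):
    # Each firing reports the running total for the window so far.
    outputs, total = [], 0
    for pane in panes:
        total += sum(pane)
        outputs.append(total)
    return outputs

def accumulating_and_retracting(panes):
    # Each firing reports the new total plus a retraction of the previous one.
    outputs, previous = [], None
    for total in accumulating(panes):
        outputs.append((total,) if previous is None else (total, -previous))
        previous = total
    return outputs

panes = [[3], [5, 1], [2]]  # speculative, watermark, late

d = discarding(panes)
a = accumulating(panes)
r = accumulating_and_retracting(panes)
observed = (sum(d), sum(a), sum(sum(pair) for pair in r))
```

A consumer that simply adds up everything it receives gets the correct 11 under discarding or retracting, but 23 under plain accumulating. This is why the right mode depends on the consumer.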

Pricing
• Usage of the Cloud Dataflow service is billed per minute on a per job basis. Each Dataflow job will use at least one Cloud Dataflow worker. The Cloud Dataflow service provides two worker types: batch and streaming. There are separate service charges for batch and streaming mode.
• Batch worker defaults: 1 vCPU, 3.75GB memory, 250GB PD.
• Streaming worker defaults: 4 vCPU, 15GB memory, 420GB PD.
• Currently there is no support for preemptible machines
• The prices below are correct for Northern Virginia at presentation time
Dataflow Worker Type   vCPU (per Hour)   Memory (per GB per Hour)   Local storage - Persistent Disk (per GB per Hour)   Local storage - SSD based (per GB per Hour)
Batch                  $0.0599200        $0.0038060                 $0.0000594                                          $0.0003278
Streaming              $0.0738300        $0.0038060                 $0.0000594                                          $0.0003278
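As a rough worked example, the hourly cost of one default worker follows from the rates and default shapes listed above. The sketch assumes the default disk is Persistent Disk rather than SSD and ignores any other charges a job may incur; the dictionary and function names are hypothetical.

```python
# Rates from the table above (USD, at presentation time).
RATES = {
    "batch": {"vcpu": 0.0599200, "memory_gb": 0.0038060, "pd_gb": 0.0000594},
    "streaming": {"vcpu": 0.0738300, "memory_gb": 0.0038060, "pd_gb": 0.0000594},
}

# Default worker shapes from the bullets above (Persistent Disk assumed).
DEFAULTS = {
    "batch": {"vcpu": 1, "memory_gb": 3.75, "pd_gb": 250},
    "streaming": {"vcpu": 4, "memory_gb": 15, "pd_gb": 420},
}

def hourly_worker_cost(worker_type):
    """Cost in USD of running one default worker of the given type for one hour."""
    rates, shape = RATES[worker_type], DEFAULTS[worker_type]
    return sum(rates[resource] * shape[resource] for resource in rates)

batch_cost = hourly_worker_cost("batch")
streaming_cost = hourly_worker_cost("streaming")
```

Under these assumptions a default batch worker comes to roughly $0.089 per hour and a default streaming worker to roughly $0.377 per hour. Most of the difference comes from the larger default machine rather than the higher vCPU rate.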

Igor Roiter, igorro@datazone.io
Thank you!
