Apache Beam @ GCPUG.TW Flink.TW 20161006

Apache Beam in Data Pipeline
Randy Huang 
2016/10/06

Who am I
• Data Architect @ VMFive
• Fluentd/Embulk fan

Overview
• Define Data Pipeline
• Architecture
• How to write Beam
• Demo

Data Pipeline
Input Algorithm Output

Data Pipeline’s world is chaos

Goal
• Provide an abstraction layer between data-processing code and the execution runtime.
• Batch processing and streaming jobs in one world.
• The Beam SDK opens the door to write once, run anywhere.*
* on-premise and non-Google cloud
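
To make the "batch and streaming in one world" goal concrete, here is a minimal sketch in plain Python. It is a stdlib-only toy, not the Beam API: the processing logic is written once against an iterable, then fed a bounded list (batch) and an unbounded generator (streaming). The names `parse_and_square` and `endless_source` are hypothetical.

```python
import itertools
from typing import Iterable, Iterator


def parse_and_square(values: Iterable[str]) -> Iterator[int]:
    """Processing logic written once: parse each record, keep evens, square them."""
    for raw in values:
        n = int(raw)
        if n % 2 == 0:
            yield n * n


def endless_source() -> Iterator[str]:
    """Hypothetical unbounded source: emits "0", "1", "2", ... forever."""
    for i in itertools.count():
        yield str(i)


# Batch: a bounded input that is fully consumed.
batch_result = list(parse_and_square(["1", "2", "3", "4"]))

# Streaming: an unbounded input consumed lazily; take only the first three results.
stream_result = list(itertools.islice(parse_and_square(endless_source()), 3))
```

The same function serves both cases because it never assumes the input ends. Real runners add what this toy leaves out, such as windowing, distribution, and fault tolerance.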

Supported Runners
• Google Cloud Dataflow (Blocking/Non-Blocking)
• Apache Flink 1.1.2
• Apache Spark 1.6.2 Hadoop 2.2.0 Kafka 0.8.2.1

Architecture
• Pipelines
• Translators
• Runners
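
One way to picture how the three pieces fit together: a pipeline is only a description, a translator turns that description into something a specific runtime can execute, and a runner executes it. The sketch below is a stdlib-only toy under that assumption, not Beam's actual translator code; `translate_eager` and `translate_lazy` are hypothetical names standing in for two different runtimes.

```python
from typing import Any, Callable, Iterable, Iterator, List, Tuple

# A pipeline is only a description: an ordered list of (kind, function) steps.
Step = Tuple[str, Callable[[Any], Any]]
pipeline: List[Step] = [
    ("map", lambda x: x * 10),
    ("filter", lambda x: x > 10),
]


def translate_eager(steps: List[Step]) -> Callable[[Iterable[Any]], List[Any]]:
    """Hypothetical translator for a batch-style runtime: builds list-based code."""
    def run(data: Iterable[Any]) -> List[Any]:
        out = list(data)
        for kind, fn in steps:
            if kind == "map":
                out = [fn(x) for x in out]
            elif kind == "filter":
                out = [x for x in out if fn(x)]
            else:
                raise ValueError(f"unknown step kind: {kind}")
        return out
    return run


def translate_lazy(steps: List[Step]) -> Callable[[Iterable[Any]], Iterator[Any]]:
    """Hypothetical translator for a streaming-style runtime: builds a lazy chain."""
    def run(data: Iterable[Any]) -> Iterator[Any]:
        stream: Iterator[Any] = iter(data)
        for kind, fn in steps:
            if kind == "map":
                stream = map(fn, stream)
            elif kind == "filter":
                stream = filter(fn, stream)
            else:
                raise ValueError(f"unknown step kind: {kind}")
        return stream
    return run


# "Choosing a runner" means choosing which translated job to execute.
eager_output = translate_eager(pipeline)([1, 2, 3])
lazy_output = list(translate_lazy(pipeline)([1, 2, 3]))
```

Both translated jobs produce the same result from the same pipeline description; only the execution strategy differs.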

Programming tips / Flink
• Use the Flink DataStream API in Java and Scala
• Use the Beam API directly in Java (and soon
Python) with the Flink runner

SDK
• Four parts:
• Pipeline : Streaming & Batch Processing
• PCollection
• Transform
• I/O : Source & Sink
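
The four parts can be sketched with a few lines of plain Python. This is a toy model for intuition only, not the Beam SDK: the class `PCollection` and the helpers `read_lines`, `write_to`, `non_empty`, and `to_upper` are hypothetical stand-ins, and a Python list plays the role of the sink.

```python
from typing import Any, Callable, Iterable, List, Tuple


class PCollection:
    """Toy stand-in for a PCollection: an immutable bag of elements."""

    def __init__(self, elements: Iterable[Any]) -> None:
        self._elements: Tuple[Any, ...] = tuple(elements)

    def apply(self, transform: Callable[[Tuple[Any, ...]], Iterable[Any]]) -> "PCollection":
        # A transform never mutates its input; it produces a new PCollection.
        return PCollection(transform(self._elements))

    def elements(self) -> Tuple[Any, ...]:
        return self._elements


def read_lines(text: str) -> PCollection:
    """Toy source: turns raw text into a PCollection of lines."""
    return PCollection(text.splitlines())


def write_to(sink: List[Any], pcoll: PCollection) -> None:
    """Toy sink: appends every element to a plain list."""
    sink.extend(pcoll.elements())


def non_empty(elements: Iterable[str]) -> Iterable[str]:
    """Transform: drop empty lines."""
    return (e for e in elements if e)


def to_upper(elements: Iterable[str]) -> Iterable[str]:
    """Transform: upper-case every line."""
    return (e.upper() for e in elements)


# The pipeline is the chain: source -> transforms -> sink.
lines = read_lines("beam\n\nflink\nspark")
shouted = lines.apply(non_empty).apply(to_upper)
output: List[str] = []
write_to(output, shouted)
```

Because each `apply` returns a new collection, `lines` is unchanged after the transforms run, so the same collection can safely feed several branches.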

For Flink users
• We encourage users to use either the Beam or the Flink API to implement their Flink jobs for stream data processing.
• But the native Flink API offers:
• a backwards-compatible API
• built-in libraries (e.g., CEP and upcoming SQL)
• key-value state (with the ability to query that state in the future)
http://data-artisans.com/why-apache-beam/

Demo
• GDELT project
• EventCount by Location
Pipeline
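
The demo code itself is not in the slides. As a rough idea of what "EventCount by Location" computes, the stdlib-only sketch below extracts a location key from each event and counts events per key. The records are hypothetical and heavily simplified; the real GDELT schema has many more columns, and this is not the Beam pipeline used in the demo.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical, simplified event records: (event_id, location).
events = [
    ("e1", "Taipei"),
    ("e2", "Tokyo"),
    ("e3", "Taipei"),
    ("e4", "Seoul"),
    ("e5", "Taipei"),
]


def count_by_location(records: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Extract the location key from each record, then count events per key."""
    return dict(Counter(location for _, location in records))


counts = count_by_location(events)
```

In a distributed runner the same two steps (key extraction, per-key count) would run in parallel across workers rather than in a single process.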

Recap
• Write a general data pipeline, then choose your runner

Next…
• New runners and SDKs (Python still in development)
• DSL

Other things
• BigQuery has DML support!!! https://goo.gl/lcZQVZ
• Data Studio Beta is available in Taiwan
• Embulk
• Fluentd v0.14.6 - 2016/09/07
