GOOGLE CLOUD DATAFLOW
&
APACHE FLINK
IVAN FERNANDEZ PEREA
GOOGLE CLOUD DATAFLOW
DEFINITION
“A fully-managed cloud service and programming
model for batch and streaming big data
processing”
• Main features
– Fully Managed
– Unified Programming Model
– Integrated & Open Source
GOOGLE CLOUD DATAFLOW USE
CASES
• Both batch and streaming data processing
• ETL (extract, transform, load) approach
• Excels at high-volume computation
• High parallelism factor (“embarrassingly parallel”)
• Cost effective
DATAFLOW PROGRAMMING MODEL
• Designed to simplify the mechanics of large-scale data
processing
• It creates an optimized job to be executed as a unit by one of
the Cloud Dataflow runner services
• You can focus on the logical composition of your data
processing job, rather than the physical orchestration of
parallel processing
GOOGLE CLOUD DATAFLOW
COMPONENTS
• Two main components:
– A set of SDKs used to define data processing jobs:
• Unified programming model. “One size fits all” approach
• Data programming model (pipelines, collections, transformations, sources and sinks)
– A managed service that ties together Google Cloud Platform products: Google Compute
Engine, Google Cloud Storage, BigQuery, …
DATAFLOW SDKS
 Each pipeline is an independent entity that reads some input data, transforms it, and generates
some output data. A pipeline represents a directed graph of data processing transformations
 Simple data representation
 Specialized collections called PCollections
 A PCollection can represent a data set of unbounded size
 PCollections are the inputs and the outputs for each step in your pipeline
 Dataflow provides abstractions to manipulate data
 Transformations over data are known as PTransforms
 Transformations can be linear or not
 I/O APIs for a variety of data formats like text or Avro files, BigQuery tables, Google Pub/Sub, …
 Dataflow SDK for Java available on GitHub.
https://github.com/GoogleCloudPlatform/DataflowJavaSDK
PIPELINE DESIGN PRINCIPLES
• Some questions before building your pipeline:
– Where is your input data stored?  Read transformations
– What does your data look like?  It defines your PCollections
– What do you want to do with your data?  Core or pre-written transforms
– What does your output data look like, and where should it go?  Write transformations
PIPELINE SHAPES
Linear Pipeline Branching Pipeline
PIPELINE EXAMPLE
public static interface Options extends PipelineOptions {
...
}
public static void main(String[] args) {
// Parse and validate command-line flags,
// then create pipeline, passing it a user-defined Options object.
Options options = PipelineOptionsFactory.fromArgs(args)
.withValidation()
.as(Options.class);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.Read.from(input)) // SDK-provided PTransform for reading text data
.apply(new CountWords()) // User-written subclass of PTransform for counting words
.apply(TextIO.Write.to(output)); // SDK-provided PTransform for writing text data
p.run();
}
PIPELINE COLLECTIONS:
PCOLLECTIONS
• PCollection represents a potentially large, immutable “bag” of same-type elements
• A PCollection can be of any type; elements are encoded using the Dataflow SDK’s built-in
data encoding or a coder you provide.
• PCollection requirements
– Immutable. Once created, you cannot add, remove or change individual objects.
– Does not support random access
– A PCollection belongs to one Pipeline (collections cannot be shared):
• Bounded vs unbounded collections
– It depends on your source dataset.
– Bounded collections can be processed using batch jobs
– Unbounded collections must be processed using streaming jobs (Windowing and
Timestamps)
PIPELINE COLLECTIONS EXAMPLE
• A collection created from individual lines of text
// Create a Java Collection, in this case a List of Strings.
static final List<String> LINES = Arrays.asList(
"To be, or not to be: that is the question: ",
"Whether 'tis nobler in the mind to suffer ",
"The slings and arrows of outrageous fortune, ",
"Or to take arms against a sea of troubles, ");
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of()) // create the PCollection
PIPELINE COLLECTIONS TYPES
• Bounded PCollections. Represent a fixed data set, from data sources/sinks such as:
– TextIO
– BigQueryIO
– DataStoreIO
– Custom data sources using the Custom Source/Sink API
• Unbounded PCollections. Represent a continuously updating data set, from streaming data
sources/sinks such as:
– PubsubIO
– BigQueryIO (only as a sink)
• Each element in a PCollection has an associated timestamp. NOTE: not every source assigns
timestamps (e.g., TextIO)
• Unbounded PCollections are processed as finite logical windows (Windowing). Windowing
can also be applied to Bounded PCollections as a global window.
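• A minimal sketch contrasting a bounded read with an unbounded read (the bucket, project and topic names are placeholders):
// Bounded PCollection read from text files in Cloud Storage (batch).
PCollection<String> lines =
p.apply(TextIO.Read.named("ReadLines").from("gs://my_bucket/path/to/input-*.txt"));
// Unbounded PCollection read from a Cloud Pub/Sub topic (streaming).
PCollection<String> events =
p.apply(PubsubIO.Read.named("ReadEvents").topic("/topics/my-project/my-topic"));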
PCOLLECTIONS WINDOWING
• Subdivide PCollection processing according to the timestamp.
• Uses Triggers to determine when to close each finite window as unbounded data
arrives.
• Windowing functions
– Fixed Time Windows
– Sliding Time Windows. Two variables: window size and period.
– Per-Session Windows. Defined by when actions are performed (e.g., mouse interactions)
– Single Global Window. The default window.
– Other windowing functions, such as calendar-based windows, are found in
com.google.cloud.dataflow.sdk.transforms.windowing
• Time Skew, Data Lag, and Late Data. Because each element is marked with a timestamp, it
can be known whether data arrives with some lag.
PCOLLECTIONS WINDOWING II
• Adding Timestamp
PCollection<LogEntry> unstampedLogs = ...;
PCollection<LogEntry> stampedLogs =
unstampedLogs.apply(ParDo.of(new DoFn<LogEntry, LogEntry>() {
public void processElement(ProcessContext c) {
// Extract the timestamp from log entry we're currently processing.
Instant logTimeStamp = extractTimeStampFromLogEntry(c.element());
// Use outputWithTimestamp to emit the log entry with timestamp attached.
c.outputWithTimestamp(c.element(), logTimeStamp);
}
}));
• Time Skew and Late Data
PCollection<String> items = ...;
PCollection<String> fixed_windowed_items = items.apply(
Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
.withAllowedLateness(Duration.standardDays(2)));
• Sliding window
PCollection<String> items = ...;
PCollection<String> sliding_windowed_items = items.apply(
Window.<String>into(SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardSeconds(30))));
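• Per-session windows (a sketch; the 10-minute gap duration is an arbitrary placeholder)
PCollection<String> items = ...;
PCollection<String> session_windowed_items = items.apply(
Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(10))));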
PCOLLECTION SLIDING TIME
WINDOWS
PIPELINE TRANSFORMS:
PTRANSFORMS
• Represents the processing logic of a pipeline operation as a function object.
• Processing operations
– Mathematical computations on data
– Converting data from one format to another
– Grouping data together
– Filtering data
– Combining data elements into single values
• PTransforms requirements
– Serializable
– Thread-compatible. Each function instance is accessed by a single thread on a worker instance.
– Idempotent functions are recommended: for any given input, a function provides the same output
• How does it work? Call the apply method on the PCollection
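• For example, applying a ParDo (using the ExtractWordsFn from the CountWords example two slides below); each apply returns a new, immutable PCollection:
PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));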
PIPELINE TRANSFORMS TYPES
• Core transformations
– ParDo for generic parallel processing
– GroupByKey for Key-Grouping Key/Value pairs
– Combine for combining collections or grouped values
– Flatten for merging collections
• Composite transforms
– Built from multiple sub-transforms in a modular way
– Examples: the Count and Top composite transforms
• Pre-Written Transforms
– Processing logic such as combining, splitting, manipulating and performing statistical analysis is
already written.
– They are found in the com.google.cloud.dataflow.sdk.transforms package
• Root Transforms for Reading and Writing Data
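• A sketch of the other core transforms (collection names and key/value types are illustrative):
PCollection<KV<String, Integer>> scores = ...;
// GroupByKey: collect all values that share a key.
PCollection<KV<String, Iterable<Integer>>> grouped =
scores.apply(GroupByKey.<String, Integer>create());
// Combine: reduce the values of each key to a single result (here, a sum).
PCollection<KV<String, Integer>> totals = scores.apply(Sum.<String>integersPerKey());
// Flatten: merge several PCollections of the same type into one.
PCollection<String> merged =
PCollectionList.of(pc1).and(pc2).apply(Flatten.<String>pCollections());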
PIPELINE TRANSFORM EXAMPLE
• A composite transform that counts words
static class CountWords
extends PTransform<PCollection<String>, PCollection<String>> {
@Override
public PCollection<String> apply(PCollection<String> lines) {
PCollection<String> words = lines.apply(
ParDo
.named("ExtractWords")
.of(new ExtractWordsFn()));
PCollection<KV<String, Integer>> wordCounts =
words.apply(Count.<String>perElement());
PCollection<String> results = wordCounts.apply(
ParDo
.named("FormatCounts")
.of(new DoFn<KV<String, Integer>, String>() {
@Override
public void processElement(ProcessContext c) {
c.output(c.element().getKey() + ": " + c.element().getValue());
}
}));
return results;
}
}
PIPELINE I/O
• We need to read/write data from external sources like Google Cloud Storage or BigQuery tables
• Read/Write transforms are applied to sources to gather data
• Read/Write data from Cloud Storage
p.apply(AvroIO.Read.named("ReadFromAvro")
.from("gs://my_bucket/path/to/records-*.avro")
.withSchema(schema));
records.apply(AvroIO.Write.named("WriteToAvro")
.to("gs://my_bucket/path/to/numbers")
.withSchema(schema)
.withSuffix(".avro"));
 Read and write Transforms in the Dataflow SDKs
 Text files
 BigQuery tables
 Avro files
 Pub/Sub
 Custom I/O sources and sinks can be created.
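• Reading text and writing to BigQuery follow the same pattern (a sketch; the path, table spec and schema are placeholders):
PCollection<String> lines = p.apply(TextIO.Read.named("ReadLines")
.from("gs://my_bucket/path/to/input-*.txt"));
PCollection<TableRow> rows = ...;
rows.apply(BigQueryIO.Write.named("WriteToBigQuery")
.to("my-project:my_dataset.my_table")
.withSchema(tableSchema));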
GETTING STARTED
LOG INTO GOOGLE DEV CONSOLE ENVIRONMENT SETUP
• JDK 1.7 or higher
• Install the Google Cloud SDK. The gcloud tool is
required to run examples in the Dataflow SDK.
https://cloud.google.com/sdk/?hl=es#nix
• Download SDK examples from GitHub.
https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples
• Enable Billing (Free for 60 days/$300)
• Enable Services & APIs
• Create a project for the example
• More info:
– https://cloud.google.com/dataflow/getting-started?hl=es#DevEnv
RUN LOCALLY
• Run dataflow SDK Wordcount example locally
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--inputFile=/home/ubuntu/.bashrc --output=/tmp/output/"
RUN IN THE CLOUD - CREATE A
PROJECT
RUN IN THE CLOUD - INSTALL GOOGLE
CLOUD SDK
curl https://sdk.cloud.google.com | bash
gcloud init
RUN IN THE CLOUD - EXECUTE
WORDCOUNT
• Compile & execute Wordcount examples in the cloud:
mvn compile exec:java \
-Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
-Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
--stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
--runner=BlockingDataflowPipelineRunner"
– Project is the id of the project you just created
– StagingLocation is a Google Cloud Storage location of the following form:
gs://bucket/path/to/staging/directory
– Runner associates your code with a specific Dataflow pipeline runner
– Note: in Europe you can only open a Google Cloud Platform account for business
(commercial) use.
MANAGE YOUR POM
• The Google Cloud Dataflow artifact needs to be added to your POM:
<dependency>
<groupId>com.google.cloud.dataflow</groupId>
<artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
<version>${project.version}</version>
</dependency>
• Google services that are also used in the project need to be added. E.g., BigQuery:
<dependency>
<groupId>com.google.apis</groupId>
<artifactId>google-api-services-bigquery</artifactId>
<!-- If updating version, please update the javadoc offlineLink -->
<version>v2-rev238-1.20.0</version>
</dependency>
GOOGLE CLOUD PLATFORM
• Google Compute Engine VMs, to provide job workers
• Google Cloud Storage, for reading and writing data
• Google BigQuery, for reading and writing data
APACHE FLINK
“Apache Flink is an open source platform for
distributed stream and batch data processing.”
• Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for
distributed computations over data streams.
• Flink includes several APIs for creating applications that use the Flink engine:
– DataSet API for static data embedded in Java, Scala, and Python,
– DataStream API for unbounded streams embedded in Java and Scala, and
– Table API with a SQL-like expression language embedded in Java and Scala.
• Flink also bundles libraries for domain-specific use cases:
– Machine Learning library, and
– Gelly, a graph processing API and library.
APACHE FLINK: BACKGROUND
• 2010: "Stratosphere: Information Management on the Cloud" (funded by the German
Research Foundation (DFG)) was started as a collaboration of Technical University
Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
• March 2014: Flink, a fork of Stratosphere, entered the Apache Incubator.
• December 2014: Flink was accepted as an Apache top-level project
APACHE FLINK COMPONENTS
APACHE FLINK FEATURES
• Streaming first (Kappa approach)
– High Performance & Low latency with little configuration
– Flows (events) vs batches
– Exactly-once Semantics for Stateful Computations
– Continuous Streaming Model with Flow Control and long-lived operators (no need to launch new tasks as in Spark,
‘similar’ to Storm)
– Fault-tolerance via Lightweight Distributed Snapshots
• One runtime for Streaming and Batch Processing
– Batch processing runs as special case of streaming
– Own memory management (Spark Tungsten project goal)
– Iterations and Delta iterations
– Program optimizer
• APIs and Libraries
– Batch Processing Applications (DataSet API)
– Streaming Applications (DataStream API)
– Library Ecosystem: Machine Learning, Graph Analytics and Relational Data Processing.
APACHE FLINK FEATURES
• DataSet: abstract representation of a finite immutable collection of data of the same
type that may contain duplicates.
• DataStream: a possibly unbounded immutable collection of data items of the same
type
• Transformation: data transformations transform one or more DataSets/DataStreams into
a new DataSet/DataStream
– Common: Map, FlatMap, MapPartition, Filter, Reduce, union
– DataSets: aggregate, join, cogroup
– DataStreams: window* transformations (Window, Window Reduce, …)
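• A small sketch of both APIs in Java (assuming Java 8 lambdas; collection contents, host and port are illustrative):
// DataSet (batch): filter and reduce a finite collection.
ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Integer> numbers = batchEnv.fromElements(1, 2, 3, 4, 5);
DataSet<Integer> sumOfEvens = numbers
.filter(n -> n % 2 == 0)
.reduce((a, b) -> a + b);
// DataStream (streaming): the same style of transformation over an unbounded stream.
StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = streamEnv.socketTextStream("localhost", 9999);
DataStream<String> nonEmpty = text.filter(line -> !line.isEmpty());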
APACHE FLINK DATA SOURCES AND
SINKS
• Data Sources
– File-based
• readTextFile(path), readTextFileWithValue(path), readFile(path), …
– Socket-based
• socketTextStream (streaming)
– Collection-based
• fromCollection(Seq), fromCollection(iterator), fromElements(elements: _*)
– Custom.
• addSource from Kafka, …
• Data Sinks (similar to Spark actions):
– writeAsText()
– writeAsCsv()
– print() / printToErr()
– write()
– writeToSocket()
– addSink (e.g., to Kafka)
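• Putting a few of them together (a sketch; host, port, output path and job name are placeholders):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Collection-based source, useful for testing.
DataStream<String> words = env.fromElements("to", "be", "or", "not", "to", "be");
// Socket-based source.
DataStream<String> lines = env.socketTextStream("localhost", 9999);
// File and stdout sinks.
lines.writeAsText("/tmp/lines.txt");
words.print();
env.execute("sources-and-sinks-sketch");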
ENGINE COMPARISON
APACHE FLINK PROCESS MODEL
• Processes
– JobManager: coordinator of the Flink system
– TaskManagers: workers that execute parts of the parallel programs.
APACHE FLINK EXECUTION MODEL
• As a software stack, Flink is a layered system. The different layers of the stack build on
top of each other and raise the abstraction level of the program representations they
accept:
– The runtime layer receives a program in the form of a JobGraph. A JobGraph is a generic
parallel data flow with arbitrary tasks that consume and produce data streams.
– Both the DataStream API and the DataSet API generate JobGraphs through separate
compilation processes. The DataSet API uses an optimizer to determine the optimal plan for
the program, while the DataStream API uses a stream builder.
– The JobGraph is lazily executed according to a variety of deployment options available in
Flink (e.g., local, remote, YARN, etc)
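• A sketch of this laziness in the DataSet API (paths and job name are placeholders): transformations only build the plan; execute() compiles it into a JobGraph and runs it.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = env.readTextFile("/path/to/input");
DataSet<String> upper = text.map(String::toUpperCase); // still only part of the plan
upper.writeAsText("/path/to/output");
env.execute("lazy-jobgraph-sketch"); // nothing runs before this call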
THE 8 REQUIREMENTS OF REAL-TIME
STREAM PROCESSING
• Coined by Michael Stonebraker and others in
http://cs.brown.edu/~ugur/8rulesSigRec.pdf
– Pipelining: Flink is built upon pipelining
– Replay: Flink acknowledges batches of records
– Operator state: flows pass by different operators
– State backup: Flink operators can keep state
– High-level language(s): Java, Scala, Python (beta)
– Integration with static sources
– High availability
FLINK STREAMING NOTES
• Hybrid runtime architecture
– Intermediate results are a handle to the data produced by an operator.
– They coordinate the “handshake” between the data producer and the consumer.
• Current DataStream API has support for flexible windows
• Apache SAMOA on Flink for Machine Learning on streams
• Google Dataflow (stream functionality upcoming)
• Table API (window definition upcoming)
FLINK STREAMING NOTES II
• Flink supports different types of streaming windows.
– Instant event-at-a-time
– Arrival time windows
– Event time windows
K. Tzoumas & S. Ewen – Flink Forward Keynote
http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote?qid=ced740f4-8af3-4bc7-8d7c-388eb26f463f&v=qf1&b=&from_search=5
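• A keyed time-window sketch in the DataStream API (the window helpers shown are an assumption about this Flink generation; stream contents are illustrative):
DataStream<Tuple2<String, Integer>> wordCounts = ...;
DataStream<Tuple2<String, Integer>> windowedCounts = wordCounts
.keyBy(0) // partition by the word
.timeWindow(Time.of(1, TimeUnit.MINUTES)) // tumbling one-minute windows
.sum(1); // aggregate the counts per window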
GETTING STARTED (LOCALLY)
• Download Apache Flink latest release and unzip it. https://flink.apache.org/downloads.html
– You don’t need to install Hadoop beforehand unless you want to use HDFS.
• Run a JobManager.
– $FLINK_HOME/bin/start-local.sh
• Run an example code (e.g. WordCount)
– $FLINK_HOME/bin/flink run ./examples/WordCount.jar /path/input_data /path/output_data
• Setup Guide. https://ci.apache.org/projects/flink/flink-docs-release-0.10/quickstart/setup_quickstart.html
• If you want to develop with Flink you need to add its dependencies to your build tool. E.g., Maven:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>0.10.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>0.10.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>0.10.0</version>
</dependency>
WORDCOUNT EXAMPLE
SCALA
val env =
ExecutionEnvironment.getExecutionEnvironment
// get input data
val text = env.readTextFile("/path/to/file")
val counts = text.flatMap { _.toLowerCase.split("\\W+")
filter { _.nonEmpty } }
.map { (_, 1) }
.groupBy(0)
.sum(1)
counts.writeAsCsv(outputPath, "\n", " ")
JAVA
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> text = env.readTextFile("/path/to/file");
// split up the lines in pairs (2-tuples) containing: (word, 1)
DataSet<Tuple2<String, Integer>> counts = text.flatMap(new Tokenizer())
// group by the tuple field "0" and sum up tuple field "1"
.groupBy(0)
.sum(1);
counts.writeAsCsv(outputPath, "\n", " ");
// User-defined functions
public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>>
{
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
// normalize and split the line
String[] tokens = value.toLowerCase().split("\\W+");
// emit the pairs
for (String token : tokens) {
if (token.length() > 0) {
out.collect(new Tuple2<String, Integer>(token, 1));
}
}
}
}
DEPLOYING
• Local
• Cluster (standalone)
• YARN
• Google Cloud
• Flink on Tez
• JobManager High Availability
More Related Content

What's hot

Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...HostedbyConfluent
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalSub Szabolcs Feczak
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetOwen O'Malley
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiFlink Forward
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeDatabricks
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream ProcessingSuneel Marthi
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeFlink Forward
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...HostedbyConfluent
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBill Liu
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache icebergAlluxio, Inc.
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufVerverica
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewenconfluent
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupStephan Ewen
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019confluent
 

What's hot (20)

Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
Streaming Data Lakes using Kafka Connect + Apache Hudi | Vinoth Chandar, Apac...
 
Apache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - finalApache Beam and Google Cloud Dataflow - IDG - final
Apache Beam and Google Cloud Dataflow - IDG - final
 
File Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & ParquetFile Format Benchmarks - Avro, JSON, ORC, & Parquet
File Format Benchmarks - Avro, JSON, ORC, & Parquet
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and HudiHow to build a streaming Lakehouse with Flink, Kafka, and Hudi
How to build a streaming Lakehouse with Flink, Kafka, and Hudi
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Introduction to Apache Beam
Introduction to Apache BeamIntroduction to Apache Beam
Introduction to Apache Beam
 
Apache Flink Stream Processing
Apache Flink Stream ProcessingApache Flink Stream Processing
Apache Flink Stream Processing
 
Autoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive ModeAutoscaling Flink with Reactive Mode
Autoscaling Flink with Reactive Mode
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Building large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudiBuilding large scale transactional data lake using apache hudi
Building large scale transactional data lake using apache hudi
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Building an open data platform with apache iceberg
Building an open data platform with apache icebergBuilding an open data platform with apache iceberg
Building an open data platform with apache iceberg
 
Gcp dataflow
Gcp dataflowGcp dataflow
Gcp dataflow
 
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin KnaufWebinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
Webinar: 99 Ways to Enrich Streaming Data with Apache Flink - Konstantin Knauf
 
Unified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache FlinkUnified Stream and Batch Processing with Apache Flink
Unified Stream and Batch Processing with Apache Flink
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan EwenAdvanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
Advanced Streaming Analytics with Apache Flink and Apache Kafka, Stephan Ewen
 
Apache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink MeetupApache Flink @ NYC Flink Meetup
Apache Flink @ NYC Flink Meetup
 
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 20190-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
0-60: Tesla's Streaming Data Platform ( Jesse Yates, Tesla) Kafka Summit SF 2019
 

Viewers also liked

Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_MeetupPaolo Platter
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...Flink Forward
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingRobert Metzger
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsDataWorks Summit/Hadoop Summit
 

Viewers also liked (6)

OAuth
OAuthOAuth
OAuth
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_Meetup
 
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
William Vambenepe – Google Cloud Dataflow and Flink , Stream Processing by De...
 
Click-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer CheckpointingClick-Through Example for Flink’s KafkaConsumer Checkpointing
Click-Through Example for Flink’s KafkaConsumer Checkpointing
 
Performance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data PlatformsPerformance Comparison of Streaming Big Data Platforms
Performance Comparison of Streaming Big Data Platforms
 
Weld Strata talk
Weld Strata talkWeld Strata talk
Weld Strata talk
 

Similar to Google cloud Dataflow & Apache Flink

CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)Ortus Solutions, Corp
 
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...Ortus Solutions, Corp
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Chester Chen
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamImre Nagi
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured StreamingKnoldus Inc.
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIDatabricks
 
OWB11gR2 - Extending ETL
OWB11gR2 - Extending ETL OWB11gR2 - Extending ETL
OWB11gR2 - Extending ETL Suraj Bang
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Anyscale
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkFlink Forward
 
Developing real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and KafkaDeveloping real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and Kafkamarius_bogoevici
 
Towards sql for streams
Towards sql for streamsTowards sql for streams
Towards sql for streamsRadu Tudoran
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Guido Schmutz
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkDatabricks
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffJAX London
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Databricks
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformChester Chen
 

Similar to Google cloud Dataflow & Apache Flink (20)

CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)CBStreams - Java Streams for ColdFusion (CFML)
CBStreams - Java Streams for ColdFusion (CFML)
 
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
ITB2019 CBStreams : Accelerate your Functional Programming with the power of ...
 
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
Analytics Metrics delivery and ML Feature visualization: Evolution of Data Pl...
 
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache BeamGDG Jakarta Meetup - Streaming Analytics With Apache Beam
GDG Jakarta Meetup - Streaming Analytics With Apache Beam
 
Data Science on Google Cloud Platform
Data Science on Google Cloud PlatformData Science on Google Cloud Platform
Data Science on Google Cloud Platform
 
Introduction to Structured Streaming
Introduction to Structured StreamingIntroduction to Structured Streaming
Introduction to Structured Streaming
 
Writing Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark APIWriting Continuous Applications with Structured Streaming PySpark API
Writing Continuous Applications with Structured Streaming PySpark API
 
OWB11gR2 - Extending ETL
OWB11gR2 - Extending ETL OWB11gR2 - Extending ETL
OWB11gR2 - Extending ETL
 
Practical OData
Practical ODataPractical OData
Practical OData
 
Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0Continuous Application with Structured Streaming 2.0
Continuous Application with Structured Streaming 2.0
 
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache FlinkMaximilian Michels – Google Cloud Dataflow on Top of Apache Flink
Maximilian Michels – Google Cloud Dataflow on Top of Apache Flink
 
Developing real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and KafkaDeveloping real-time data pipelines with Spring and Kafka
Developing real-time data pipelines with Spring and Kafka
 
Towards sql for streams
Towards sql for streamsTowards sql for streams
Towards sql for streams
 
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
Spark (Structured) Streaming vs. Kafka Streams - two stream processing platfo...
 
Writing Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySparkWriting Continuous Applications with Structured Streaming in PySpark
Writing Continuous Applications with Structured Streaming in PySpark
 
cb streams - gavin pickin
cb streams - gavin pickincb streams - gavin pickin
cb streams - gavin pickin
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
Spring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard WolffSpring Day | Spring and Scala | Eberhard Wolff
Spring Day | Spring and Scala | Eberhard Wolff
 
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...Writing Continuous Applications with Structured Streaming Python APIs in Apac...
Writing Continuous Applications with Structured Streaming Python APIs in Apac...
 
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platformSf big analytics_2018_04_18: Evolution of the GoPro's data platform
Sf big analytics_2018_04_18: Evolution of the GoPro's data platform
 

Recently uploaded

Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service AvailableVastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Availablegargpaaro
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdfkhraisr
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRajesh Mondal
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...HyderabadDolls
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...kumargunjan9515
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowgargpaaro
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...gajnagarg
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...HyderabadDolls
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...gajnagarg
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...nirzagarg
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...nirzagarg
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangeThinkInnovation
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 

Recently uploaded (20)

Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service AvailableVastral Call Girls Book Now 7737669865 Top Class Escort Service Available
Vastral Call Girls Book Now 7737669865 Top Class Escort Service Available
 
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Rohtak [ 7014168258 ] Call Me For Genuine Models We...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf20240412-SmartCityIndex-2024-Full-Report.pdf
20240412-SmartCityIndex-2024-Full-Report.pdf
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
Kalyani ? Call Girl in Kolkata | Service-oriented sexy call girls 8005736733 ...
 
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...Top Call Girls in Balaghat  9332606886Call Girls Advance Cash On Delivery Ser...
Top Call Girls in Balaghat 9332606886Call Girls Advance Cash On Delivery Ser...
 
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Indore [ 7014168258 ] Call Me For Genuine Models We...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Nandurbar [ 7014168258 ] Call Me For Genuine Models...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With OrangePredicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
Predicting HDB Resale Prices - Conducting Linear Regression Analysis With Orange
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 

Google cloud Dataflow & Apache Flink

  • 1. GOOGLE CLOUD DATAFLOW & APACHE FLINK I V A N F E R N A N D E Z P E R E A
  • 2. GOOGLE CLOUD DATAFLOW DEFINITION “A fully-managed cloud service and programming model for batch and streaming big data processing” • Main features – Fully Managed – Unified Programming Model – Integrated & Open Source
  • 3. GOOGLE CLOUD DATAFLOW USE CASES • Both batch and streaming data processing • ETL (extract, transform, load) approach • Excels and high volume computation • High parallelism factor (“embarrassingly parallel”) • Cost effective
  • 4. DATAFLOW PROGRAMMING MODEL • Designed to simplify the mechanics of large-scale data processing • It creates an optimized job to be executed as a unit by one of the Cloud Dataflow runner services • You can focus on the logical composition of your data processing job, rather than the physical orchestratrion of parallel processing
  • 5. GOOGLE CLOUD DATAFLOW COMPONENTS • Two main components: – A set of SDK used to define data processing jobs: • Unified programming model. “One fits all” approach • Data programming model (pipelines, collection, transformation, sources and sinks) – A Google Cloud Platform managed service that ties together with the Google Cloud Platform, Google Compute Engine, Google Cloud Storage, Big Query, ….
  • 6. DATAFLOW SDKS  Each pipeline is an indepedent entity that reads some input data, transform it, and generates some output data. A pipeline represents a directed graph of data processing transformation  Simple data representation.  Specialized collections called Pcollection  Pcollection can represent unlimited size dataset  Pcollections are the inputs and the ouputs for each step in your pipeline  Dataflow provides abstractions to manipulate data  Transformation over data are known as Ptransform  Transformations can be linear or not  I/O APIs for a variety of data formats like text or Avro files, Big Query table, Google Pub/Sub, …  Dataflow SDK for Java available on Github. https://github.com/GoogleCloudPlatform/DataflowJavaSDK
  • 7. PIPELINE DESIGN PRINCIPLES • Some question before building your Pipeline: – Where is your input data stored?  Read transformations – What does your data look like?  It defines your Pcollections – What do you want to do with your data?  Core or pre-written transforms – What does your output data look like, and where should it go?  Write transformations
  • 8. PIPELINE SHAPES Linear Pipeline Branching Pipeline
  • 9. PIPELINE EXAMPLE public static interface Options extends PipelineOptions { ... } public static void main(String[] args) { // Parse and validate command-line flags, // then create pipeline, passing it a user-defined Options object. Options options = PipelineOptionsFactory.fromArgs(args) .withValidation() .as(Options.class); Pipeline p = Pipeline.create(options); p.apply(TextIO.Read.from(input)) // SDK-provided PTransform for reading text data .apply(new CountWords()) // User-written subclass of PTransform for counting words .apply(TextIO.Write.to(output)); // SDK-provided PTransform for writing text data p.run(); }
  • 10. PIPELINE COLLECTIONS: PCOLLECTIONS • PCollection represents a potentially large, immutable “bag” of same-type elements • A PCollection can be of any type and it will be encoded based on the Dataflow SDK Data encoding or on your own. • PCollection requirements – Immutable. Once created, you cannot add, remove or change individual objects. – Does not support random access – A PCollection belongs to one Pipeline (collections cannot be shared): • Bounded vs unbounded collections – It depends on your source dataset. – Bounded collections can be processed using batch jobs – Unbounded collections must be processed using streaming jobs (Windowing and Timestamps)
  • 11. PIPELINE COLLECTIONS EXAMPLE • A collection created from individual lines of text // Create a Java Collection, in this case a List of Strings. static final List<String> LINES = Arrays.asList( "To be, or not to be: that is the question: ", "Whether 'tis nobler in the mind to suffer ", "The slings and arrows of outrageous fortune, ", "Or to take arms against a sea of troubles, "); PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); p.apply(Create.of(LINES)).setCoder(StringUtf8Coder.of()) // create the PCollection
  • 12. PIPELINE COLLECTIONS TYPES • Bounded PCollections. It represents a fixed data set from data sources/sinks as: – TextIO – BigQueryIO – DataStoreIO – Custom data sources using the Custom Source/Sink API • Unbounded PCollections. It represents a continuously updating data set, or streaming data sources/sinks as: – PubsubIO – BigQueryIO (only as a sink) • Each element in a PCollection has an associated timestamp. NOTE: it doesn’t happen for all sources(e.g, TextIO) • Unbounded PCollections are processed as finite logical windows (Windowing). Windowing can also be applied to Bounded PCollections as a global window.
  • 13. PCOLLECTIONS WINDOWING • Subdivide PCollection processing according to the timestamp. • Uses Triggers to determine when to close each finite window as unbounded data arrives. • Windowing functions – Fixed Time Windows – Sliding Time Windows. Two variables, windows size and period. – Per-Session Windows. It relates to when actions are perform (e.g, mouse interactions) – Single Global Window. By default window. – Other windowing function as Calendar-based are found in com.google.cloud.dataflow.sdk.transforms.windowing • Time Skew, Data Lag, and Late Data. As each element is marked with a Timestamp it can be known if data arrives with some lag.
  • 14. PCOLLECTIONS WINDOWING II • Adding Timestamp PCollection<LogEntry> unstampedLogs = ...; PCollection<LogEntry> stampedLogs = unstampedLogs.apply(ParDo.of(new DoFn<LogEntry, LogEntry>() { public void processElement(ProcessContext c) { // Extract the timestamp from log entry we're currently processing. Instant logTimeStamp = extractTimeStampFromLogEntry(c.element()); // Use outputWithTimestamp to emit the log entry with timestamp attached. c.outputWithTimestamp(c.element(), logTimeStamp); } })); • Time Skew and Late Data PCollection<String> items = ...; PCollection<String> fixed_windowed_items = items.apply( Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES)) .withAllowedLateness(Duration.standardDays(2))); • Sliding window PCollection<String> items = ...; PCollection<String> sliding_windowed_items = items.apply( Window.<String>into(SlidingWindows.of(Duration.standardMinutes(60)).every(Duration.standardSeconds(30))));
  • 16. PIPELINE TRANSFORMS: PTRANSFORMS • Represent a processing operation logic in a pipeline as a function object. • Processing operations – Mathematical computations on data – Converting data from one format to another – Grouping data together – Filtering data – Combining data elements into single values • PTransforms requirements – Serializable – Thread-compatible. Functions are going to be accessed by a single thread on a worker instance. – Idempotent functions are recommended: for any given input, a function provides the same ouput • How it works?. Call the apply method over the PCollection
  • 17. PIPELINE TRANSFORMS TYPES • Core transformation – ParDo for generic parallel processing – GroupByKey for Key-Grouping Key/Value pairs – Combine for combining collections or grouped values – Flatten for merging collections • Composite transform – Built for multiple sub-transform in a modular way – Examples, Count and Top composite transform. • Pre-Written Transform – Proccessing logic as combining, splitting, manipulating and performing statistical analysis is already written. – They are found in the com.google.cloud.dataflow.sdk.transforms package • Root Transforms for Reading and Writing Data
  • 18. PIPELINE TRANSFORM EXAMPLE • A composite transform that count words static class CountWords extends PTransform<PCollection<String>, PCollection<String>> { @Override public PCollection<String> apply(PCollection<String> lines) { PCollection<String> words = lines.apply( ParDo .named("ExtractWords") .of(new ExtractWordsFn())); PCollection<KV<String, Integer>> wordCounts = words.apply(Count.<String>perElement()); PCollection<String> results = wordCounts.apply( ParDo .named("FormatCounts") .of(new DoFn<KV<String, Integer>, String>() { @Override public void processElement() { output(element().getKey() + ": " + element().getValue()); } })); return results; } }
PIPELINE I/O
• We need to read/write data from external sources such as Google Cloud Storage or BigQuery tables.
• Read/Write transforms are applied to sources and sinks to gather and store data.
• Read/write data from Cloud Storage:
  p.apply(AvroIO.Read.named("ReadFromAvro")
      .from("gs://my_bucket/path/to/records-*.avro")
      .withSchema(schema));
  records.apply(AvroIO.Write.named("WriteToAvro")
      .to("gs://my_bucket/path/to/numbers")
      .withSchema(schema)
      .withSuffix(".avro"));
• Read and write transforms in the Dataflow SDKs:
  – Text files
  – BigQuery tables
  – Avro files
  – Pub/Sub
  – Custom I/O sources and sinks can be created.
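For the BigQuery case, a minimal read/write sketch with BigQueryIO; the read uses the public weather_stations sample table, while the destination table name is a placeholder and results is assumed to be a PCollection<TableRow> produced earlier in the pipeline:

  // Read: each row arrives as a TableRow keyed by column name.
  PCollection<TableRow> weather = p.apply(
      BigQueryIO.Read.named("ReadWeather")
          .from("clouddataflow-readonly:samples.weather_stations"));

  // Write: a schema is required in case the destination table has to be created.
  TableSchema schema = new TableSchema().setFields(Arrays.asList(
      new TableFieldSchema().setName("month").setType("INTEGER"),
      new TableFieldSchema().setName("mean_temp").setType("FLOAT")));

  results.apply(BigQueryIO.Write.named("WriteResults")
      .to("my-project:my_dataset.my_table")
      .withSchema(schema)
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));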
GETTING STARTED
• Log into the Google Developers Console
• Environment setup:
  – JDK 1.7 or higher
  – Install the Google Cloud SDK. The gcloud tool is required to run the examples in the Dataflow SDK. https://cloud.google.com/sdk/?hl=es#nix
  – Download the SDK examples from GitHub. https://github.com/GoogleCloudPlatform/DataflowJavaSDK-examples
• Enable billing (free for 60 days / $300)
• Enable Services & APIs
• Create a project for the example
• More info: https://cloud.google.com/dataflow/getting-started?hl=es#DevEnv
RUN LOCALLY
• Run the Dataflow SDK WordCount example locally:
  mvn compile exec:java \
      -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
      -Dexec.args="--inputFile=/home/ubuntu/.bashrc --output=/tmp/output/"
RUN IN THE CLOUD – CREATE A PROJECT
RUN IN THE CLOUD – INSTALL GOOGLE CLOUD SDK
  curl https://sdk.cloud.google.com | bash
  gcloud init
RUN IN THE CLOUD – EXECUTE WORDCOUNT
• Compile & execute the WordCount example in the cloud:
  mvn compile exec:java \
      -Dexec.mainClass=com.google.cloud.dataflow.examples.WordCount \
      -Dexec.args="--project=<YOUR CLOUD PLATFORM PROJECT ID> \
                   --stagingLocation=<YOUR CLOUD STORAGE LOCATION> \
                   --runner=BlockingDataflowPipelineRunner"
  – project is the ID of the project you just created
  – stagingLocation is a Google Cloud Storage location of the form gs://bucket/path/to/staging/directory
  – runner associates your code with a specific Dataflow pipeline runner
  – Note: in Europe, a Google Cloud Platform account can only be opened for business (for-profit) use.
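The same parameters can also be set programmatically instead of on the command line; a minimal sketch assuming the SDK's DataflowPipelineOptions, with placeholder project and bucket names:

  // Placeholder project ID and staging bucket, for illustration only.
  DataflowPipelineOptions options =
      PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
  options.setProject("my-project-id");
  options.setStagingLocation("gs://my_bucket/staging");
  // BlockingDataflowPipelineRunner submits the job to the service and waits for it to finish.
  options.setRunner(BlockingDataflowPipelineRunner.class);

  Pipeline p = Pipeline.create(options);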
MANAGE YOUR POM
• The Google Cloud Dataflow artifact needs to be added to your POM:
  <dependency>
    <groupId>com.google.cloud.dataflow</groupId>
    <artifactId>google-cloud-dataflow-java-sdk-all</artifactId>
    <version>${project.version}</version>
  </dependency>
• Google services that are also used in the project need to be added as well, e.g. BigQuery:
  <dependency>
    <groupId>com.google.apis</groupId>
    <artifactId>google-api-services-bigquery</artifactId>
    <!-- If updating version, please update the javadoc offlineLink -->
    <version>v2-rev238-1.20.0</version>
  </dependency>
GOOGLE CLOUD PLATFORM
• Google Compute Engine VMs, to provide job workers
• Google Cloud Storage, for reading and writing data
• Google BigQuery, for reading and writing data
APACHE FLINK
“Apache Flink is an open source platform for distributed stream and batch data processing.”
• Flink’s core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams.
• Flink includes several APIs for creating applications that use the Flink engine:
  – DataSet API for static data, embedded in Java, Scala, and Python
  – DataStream API for unbounded streams, embedded in Java and Scala
  – Table API with a SQL-like expression language, embedded in Java and Scala
• Flink also bundles libraries for domain-specific use cases:
  – A Machine Learning library
  – Gelly, a graph processing API and library
APACHE FLINK: BACKGROUND
• 2010: "Stratosphere: Information Management on the Cloud" (funded by the German Research Foundation, DFG) started as a collaboration of Technical University Berlin, Humboldt-Universität zu Berlin, and Hasso-Plattner-Institut Potsdam.
• March 2014: Flink, a fork of Stratosphere, became an Apache Incubator project.
• December 2014: Flink was accepted as an Apache top-level project.
APACHE FLINK FEATURES
• Streaming first (Kappa approach)
  – High performance & low latency with little configuration
  – Flows (events) vs batches
  – Exactly-once semantics for stateful computations
  – Continuous streaming model with flow control and long-lived operators (no need to launch new tasks as in Spark, ‘similar’ to Storm)
  – Fault tolerance via lightweight distributed snapshots
• One runtime for streaming and batch processing
  – Batch processing runs as a special case of streaming
  – Own memory management (the goal of Spark’s Tungsten project)
  – Iterations and delta iterations
  – Program optimizer
• APIs and libraries
  – Batch processing applications (DataSet API)
  – Streaming applications (DataStream API)
  – Library ecosystem: machine learning, graph analytics and relational data processing
APACHE FLINK FEATURES II
• DataSet: abstract representation of a finite, immutable collection of data of the same type that may contain duplicates.
• DataStream: a possibly unbounded, immutable collection of data items of the same type.
• Transformation: data transformations turn one or more DataSets/DataStreams into a new DataSet/DataStream.
  – Common: Map, FlatMap, MapPartition, Filter, Reduce, Union
  – DataSet only: Aggregate, Join, CoGroup
  – DataStream only: window* transformations (Window, Window Reduce, …)
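As an illustration (not part of the original deck), a small, self-contained Java sketch chaining two common transformations on a DataSet; the input elements are arbitrary:

  import org.apache.flink.api.common.functions.FilterFunction;
  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.DataSet;
  import org.apache.flink.api.java.ExecutionEnvironment;

  public class TransformationSketch {
    public static void main(String[] args) throws Exception {
      ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

      // A small, finite DataSet built from local elements.
      DataSet<Integer> numbers = env.fromElements(1, 2, 3, 4, 5, 6);

      // Each transformation returns a new DataSet; the original is never mutated.
      DataSet<Integer> doubledEvens = numbers
          .filter(new FilterFunction<Integer>() {
            @Override
            public boolean filter(Integer n) { return n % 2 == 0; }  // keep even numbers
          })
          .map(new MapFunction<Integer, Integer>() {
            @Override
            public Integer map(Integer n) { return n * 2; }          // double them
          });

      doubledEvens.print();  // print() triggers execution of the data flow
    }
  }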
APACHE FLINK DATA SOURCES AND SINKS
• Data sources
  – File-based: readTextFile(path), readTextFileWithValue(path), readFile(path), …
  – Socket-based: socketTextStream (streaming)
  – Collection-based: fromCollection(Seq), fromCollection(Iterator), fromElements(elements: _*)
  – Custom: addSource, e.g. for Kafka, …
• Data sinks (similar to Spark actions):
  – writeAsText()
  – writeAsCsv()
  – print() / printToErr()
  – write()
  – writeToSocket()
  – addSink, e.g. for Kafka
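A minimal streaming sketch combining a socket-based source with a file-based sink; the host, port and output path are placeholders:

  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

  public class SocketToFileSketch {
    public static void main(String[] args) throws Exception {
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

      // Socket-based source: newline-separated text read from a local socket
      // (e.g. fed with "nc -lk 9999").
      DataStream<String> lines = env.socketTextStream("localhost", 9999);

      // File-based sink: each record is written as a line of text.
      lines.writeAsText("/tmp/flink-socket-output");

      // Streaming jobs only start running when execute() is called.
      env.execute("Socket to file sketch");
    }
  }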
APACHE FLINK PROCESS MODEL
• Processes:
  – JobManager: the coordinator of the Flink system
  – TaskManagers: the workers that execute parts of the parallel programs
APACHE FLINK EXECUTION MODEL
• As a software stack, Flink is a layered system. The different layers of the stack build on top of each other and raise the abstraction level of the program representations they accept:
  – The runtime layer receives a program in the form of a JobGraph. A JobGraph is a generic parallel data flow with arbitrary tasks that consume and produce data streams.
  – Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes. The DataSet API uses an optimizer to determine the optimal plan for the program, while the DataStream API uses a stream builder.
  – The JobGraph is executed lazily on one of the deployment options available in Flink (e.g. local, remote, YARN, etc.).
THE 8 REQUIREMENTS OF REAL-TIME STREAM PROCESSING
• Coined by Michael Stonebraker and others in http://cs.brown.edu/~ugur/8rulesSigRec.pdf
  – Pipelining: Flink is built upon pipelining
  – Replay: Flink acknowledges batches of records
  – Operator state: flows pass through different operators
  – State backup: Flink operators can keep state
  – High-level language(s): Java, Scala, Python (beta)
  – Integration with static sources
  – High availability
FLINK STREAMING NOTES
• Hybrid runtime architecture
  – Intermediate results are a handle to the data produced by an operator.
  – They coordinate the “handshake” between the data producer and the consumer.
• The current DataStream API supports flexible windows
• Apache SAMOA on Flink for machine learning on streams
• Google Dataflow (streaming functionality upcoming)
• Table API (window definition upcoming)
FLINK STREAMING NOTES II
• Flink supports different streaming windowing schemes:
  – Instant event-at-a-time processing
  – Arrival-time windows
  – Event-time windows
• K. Tzoumas & S. Ewen – Flink Forward keynote: http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote
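As an illustration of arrival-time (processing-time) windows, here is a keyed tumbling-window word count. This sketch follows the Flink 1.x DataStream API, so method names such as timeWindow may differ slightly from the 0.10 release referenced on the next slide, and the socket source is a placeholder:

  import org.apache.flink.api.common.functions.MapFunction;
  import org.apache.flink.api.java.tuple.Tuple2;
  import org.apache.flink.streaming.api.datastream.DataStream;
  import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
  import org.apache.flink.streaming.api.windowing.time.Time;

  public class WindowedCountSketch {
    public static void main(String[] args) throws Exception {
      StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

      DataStream<Tuple2<String, Integer>> counts = env
          .socketTextStream("localhost", 9999)            // one word per line (placeholder source)
          .map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String word) {
              return new Tuple2<String, Integer>(word, 1); // pair each word with a count of 1
            }
          })
          .keyBy(0)                                        // key by the word (tuple field 0)
          .timeWindow(Time.seconds(30))                    // tumbling 30-second arrival-time windows
          .sum(1);                                         // sum the counts within each window

      counts.print();
      env.execute("Windowed count sketch");
    }
  }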
GETTING STARTED (LOCALLY)
• Download the latest Apache Flink release and unzip it. https://flink.apache.org/downloads.html
  – You don’t need to install Hadoop beforehand, unless you want to use HDFS.
• Start a local JobManager:
  $FLINK_HOME/bin/start-local.sh
• Run an example job (e.g. WordCount):
  $FLINK_HOME/bin/flink run ./examples/WordCount.jar /path/input_data /path/output_data
• Setup guide: https://ci.apache.org/projects/flink/flink-docs-release-0.10/quickstart/setup_quickstart.html
• If you want to develop with Flink, add the dependencies to your build tool, e.g. Maven:
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-java</artifactId>
    <version>0.10.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java</artifactId>
    <version>0.10.0</version>
  </dependency>
  <dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-clients</artifactId>
    <version>0.10.0</version>
  </dependency>
WORDCOUNT EXAMPLE
SCALA
  val env = ExecutionEnvironment.getExecutionEnvironment
  // get input data
  val text = env.readTextFile("/path/to/file")
  val counts = text.flatMap { _.toLowerCase.split("\\W+") filter { _.nonEmpty } }
    .map { (_, 1) }
    .groupBy(0)
    .sum(1)
  counts.writeAsCsv(outputPath, "\n", " ")
JAVA
  ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
  DataSet<String> text = env.readTextFile("/path/to/file");
  DataSet<Tuple2<String, Integer>> counts =
      // split up the lines in pairs (2-tuples) containing: (word, 1)
      text.flatMap(new Tokenizer())
          // group by the tuple field "0" and sum up tuple field "1"
          .groupBy(0)
          .sum(1);
  counts.writeAsCsv(outputPath, "\n", " ");

  // User-defined function
  public static class Tokenizer implements FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String value, Collector<Tuple2<String, Integer>> out) {
      // normalize and split the line
      String[] tokens = value.toLowerCase().split("\\W+");
      // emit the pairs
      for (String token : tokens) {
        if (token.length() > 0) {
          out.collect(new Tuple2<String, Integer>(token, 1));
        }
      }
    }
  }
DEPLOYING
• Local
• Cluster (standalone)
• YARN
• Google Cloud
• Flink on Tez
• JobManager High Availability