Introduction to Apache
Flink
A stream based big data processing engine
https://github.com/phatak-dev/flink-examples
● Madhukara Phatak
● Big data consultant and
trainer at datamantra.io
● Consults on Hadoop, Spark
and Scala
● www.madhukaraphatak.com
Agenda
● Evolution of big data frameworks to platform
● Shortcomings of the RDD abstraction
● Streaming dataflow as the abstraction layer
● Flink introduction
● Flink history
● Flink vs Spark
● Flink examples
Big data frameworks
● Big data started as multiple frameworks, each focused on
a specific problem
● Every framework defined its own abstraction
● Though combining multiple frameworks is powerful in
theory, putting them together is no fun
● Combining the right set of frameworks and making them
work on the same platform became a non-trivial task
● Distributions like Cloudera and Hortonworks solved these
problems to some level
Dawn of big data platforms
● The time of big data processing frameworks is over; it’s
time for big data platforms
● Advances in big data increasingly show that people
want a single platform for their big data needs
● A single platform normally evolves faster than a
distribution of multiple frameworks
● Spark pioneered the platform approach, and others are
following its lead
Framework vs Platform
● Each framework is free to define its own abstraction, but a
platform should define a single abstraction
● Frameworks often embrace multiple runtimes, whereas a
platform embraces a single runtime
● Platforms normally support libraries rather than
frameworks
● Frameworks can be on different versions; normally all
libraries on a platform will be on a single version
Abstractions in frameworks
● Map/Reduce provides the file as its abstraction, as it’s a
framework focused on batch processing
● Storm provides the realtime stream as its abstraction, as
it’s focused on realtime processing
● Hive provides the table as its abstraction, as it’s focused
on structured data analysis
● Every framework defines its own abstraction, which
defines its capabilities and limitations
Abstraction in platform
● As we move to platforms, the abstraction provided by the
platform becomes crucial
● The abstraction provided by the platform should be
generic enough for all frameworks/libraries built on top of it
● Requirements of a platform abstraction:
○ Ability to support different kinds of libraries/
frameworks on the platform
○ Ability to evolve with new requirements
○ Ability to harness advances in hardware
Abstraction in Spark platform
● RDD is the single abstraction on which all other APIs
are built
● RDD allowed Spark to evolve into a platform rather than
just another framework
● RDD currently supports
○ Batch processing
○ Structured data analysis using the DataFrame abstraction
○ Streaming
○ Iterative processing - ML
RDD is the assembly of big data?
● RDD as a platform abstraction was much better than any
framework abstraction that came before it
● But as the platform idea becomes commonplace, the
question has been raised: is RDD the assembly of big data?
● Databricks is convinced it is, but others disagree
● RDD is starting to show some cracks in its perfect
image, particularly in
○ Streaming
○ Iterative processing
RDD in streaming
● RDD is essentially a map/reduce, file-based abstraction
that sits in memory
● RDD’s file-like behaviour makes it difficult to adapt to
streaming latencies
● This is the reason why Spark Streaming is a near-realtime,
not a realtime, system
● This holds the platform back from APIs like custom
windows, e.g. windows based on record count or on
event time
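As an illustration of what a count-based window means, here is a minimal plain-Scala sketch (no Spark or Flink APIs; `countWindows` is a hypothetical helper): it groups a stream of records purely by count, independent of time.

```scala
// Illustrative only: a count-based window (every `size` records forms a
// window), which purely time-based micro-batching cannot express directly.
object CountWindowSketch {
  def countWindows[A](records: Iterator[A], size: Int): Iterator[Seq[A]] =
    records.grouped(size).map(_.toSeq)

  def main(args: Array[String]): Unit =
    // windows: (1,2,3), (4,5,6), (7)
    countWindows(Iterator(1, 2, 3, 4, 5, 6, 7), 3)
      .foreach(w => println(w.mkString(",")))
}
```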
Spark streaming
● RDD is an abstraction based on partitions
● To satisfy the RDD abstraction, Spark Streaming inserts blocking
operations, like block boundaries and partition boundaries, into the
streaming data
● Normally it’s good practice to avoid blocking operations on a stream,
but in Spark Streaming we can’t avoid them
● Due to this, the dashboard in our example has to wait for all records in
the partition to be processed before it sees even the first result
[Diagram: records R1, R2, R3 enter from the input stream and cross a block boundary into a partition; a map operation emits M1, M2, M3, and the dashboard sees results only after the partition boundary]
Spark streaming in spikes
[Diagram: during a spike, records R1 to R5 land in a single block; the map operation must emit M1 to M5 before the partition boundary is crossed, so the dashboard waits even longer]
● In case of spikes, as Spark Streaming is purely time based, the data in a
single block increases
● As blocks grow, partition size increases, which makes the blocking time
longer
● So again it puts more pressure on the processing system and increases
latency in downstream systems
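The effect of a spike on time-based batching can be sketched with a little arithmetic. This is an illustrative plain-Scala simulation (the rates and the `batchSizes` helper are made up): with a fixed batch interval, batch size is just the sum of the arrival rates inside the window, so a spike inflates the batch directly.

```scala
// Sketch: fixed-interval (time-based) batching under a spike.
// A spike in event rate directly inflates batch size, and hence
// blocking/processing time downstream.
object SpikeSketch {
  // events per second over 6 seconds; a spike in seconds 4-5
  val ratePerSecond: Seq[Int] = Seq(10, 10, 10, 50, 50, 10)

  // batch interval = `intervalSec` seconds => batch size is the
  // sum of the per-second rates falling into each interval
  def batchSizes(rates: Seq[Int], intervalSec: Int): Seq[Int] =
    rates.grouped(intervalSec).map(_.sum).toSeq

  def main(args: Array[String]): Unit =
    println(batchSizes(ratePerSecond, 2)) // List(20, 60, 60)
}
```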
Spark iterative processing
RDD in iterative processing
● The Spark runtime does not support iteration at the runtime
level
● RDD caching combined with an external loop approximates
iterative processing
● To support iterative processing, the platform should
support some kind of cycle in its dataflow, which is
currently a DAG
● Advanced algorithms such as matrix factorization are
therefore very hard to implement in Spark
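A minimal sketch of the caching-plus-looping pattern, in plain Scala (the `iterate` helper is hypothetical, not a Spark API): the driver runs the loop, and the engine sees each pass as a fresh job over cached data rather than a native cycle in the dataflow.

```scala
// Sketch of how Spark approximates iteration: an external driver loop
// repeatedly transforms a (cached) dataset; each pass is a fresh DAG,
// not a cycle inside the engine's dataflow.
object DriverLoopSketch {
  def iterate[A](data: Seq[A], steps: Int)(step: Seq[A] => Seq[A]): Seq[A] = {
    var current = data        // in Spark this would be a cached RDD
    for (_ <- 1 to steps)
      current = step(current) // each pass: a new job over the cached data
    current
  }

  def main(args: Array[String]): Unit =
    println(iterate(Seq(1.0, 2.0, 3.0), 3)(_.map(_ / 2))) // halve 3 times
}
```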
Stream as the abstraction
● A stream is a sequence of data elements made
available over time
● A stream can be thought of as items on a conveyor belt,
processed one at a time rather than in large batches
● Streams can be unbounded (streaming applications) or
bounded (batch processing)
● Streams are becoming the new abstraction for building
data pipelines
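The bounded/unbounded distinction can be sketched in plain Scala using iterators (illustrative only, not a big data API): an unbounded iterator models a streaming source, and a finite prefix of it is a batch.

```scala
// Sketch: bounded and unbounded cases share one stream abstraction.
// An unbounded iterator models a streaming source; a finite prefix
// of the same source is a batch.
object StreamAbstractionSketch {
  def unbounded: Iterator[Int] = Iterator.from(1)  // conceptually infinite
  def bounded: Iterator[Int]   = unbounded.take(5) // a batch: finite prefix

  def main(args: Array[String]): Unit =
    println(bounded.map(_ * 2).toList) // List(2, 4, 6, 8, 10)
}
```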
Streams in last two years
● Many projects have started to make the stream the new
abstraction layer for data processing
● Some of them are
○ Reactive streams like akka-streams, akka-http
○ Java 8 streams
○ RxJava, etc.
● Streams have been picking up steam for a few years now
Why stream abstraction?
● The stream is one of the most generic interfaces out there
● Flexible in representing different dataflow scenarios
● Streams are stateless by nature
● Streams are normally lazy and easily parallelizable
● Streams work well with functional programming
● Streams can model different data sources/sinks
○ A file can be treated as a stream of bytes
● Streams are normally good at exploiting system
resources
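Laziness is easy to demonstrate with plain Scala iterators (an illustrative sketch, not a big data API): the mapped pipeline below does no work until elements are actually demanded, which is what lets stream stages compose without materializing intermediates.

```scala
// Sketch of stream laziness: the map below runs only when elements
// are demanded, so pipelines compose without intermediate collections.
object LazySketch {
  var evaluations = 0 // counts how many times the map function runs
  def pipeline: Iterator[Int] =
    Iterator.from(1).map { x => evaluations += 1; x * x }

  def main(args: Array[String]): Unit = {
    val p = pipeline     // nothing evaluated yet
    println(p.take(3).toList) // List(1, 4, 9)
    println(evaluations) // only the demanded elements were computed
  }
}
```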
Why not streams?
● Long-running pipelined streams are hard to reason
about
● As streams are stateless by default, fault tolerance
becomes harder
● Streams change the way we look at data in a normal
program
● As multiple things may run in parallel, debugging
becomes tricky
Stream abstraction in big data
● The stream is the new abstraction layer people are
exploring in big data
● With the right implementation, a stream can support both
streaming and batch applications much more effectively
than existing abstractions
● Batch on streaming is a new way of looking at processing,
rather than treating streaming as a special case of
batch
● Batch can be faster on a streaming engine than on a
dedicated batch processor
Apache Flink
● Flink’s core is a streaming dataflow engine that
provides data distribution, communication, and fault
tolerance for distributed computations over data
streams
● Flink provides
○ DataSet API - for bounded streams
○ DataStream API - for unbounded streams
● Flink embraces the stream as the abstraction to implement
its dataflow
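The "batch is a bounded stream" idea can be sketched in plain Scala without Flink APIs (`pipeline` is a hypothetical helper): the same streaming-style transformation serves a finite source and an infinite one; only the source differs.

```scala
// Sketch: one streaming-style pipeline, two kinds of source.
// A bounded source gives batch semantics; an unbounded one, streaming.
object BoundedUnboundedSketch {
  def pipeline(source: Iterator[String]): Iterator[Int] =
    source.map(_.length).filter(_ > 3)

  def main(args: Array[String]): Unit = {
    // bounded ("DataSet-like"): finite input, finite output
    println(pipeline(Iterator("a", "word", "stream")).toList) // List(4, 6)
    // unbounded ("DataStream-like"): same code over an infinite source
    println(pipeline(Iterator.continually("tick")).take(2).toList) // List(4, 4)
  }
}
```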
Flink stack
Flink history
● The Stratosphere project started at Technical University
Berlin in 2009
● Entered the Apache Incubator in March 2014
● Became a top-level project in Dec 2014
● Started as a stream engine for batch processing
● Started supporting streaming a few versions ago
● data Artisans is a company founded by the core Flink team
● The complete history is in [1]
Spark vs Flink
● Spark was the first open source big data platform; Flink
is following its lead
● They share many ideas and frameworks
○ The same collection-like API
○ The same frameworks, like Akka and YARN, underneath
○ A full-fledged DSL in Scala
● It is very easy to port code from Spark to Flink, as they
have very similar APIs. Ex: WordCount.scala
● But there are many differences in the abstractions and
the runtime
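The shared collection-like API can be illustrated with plain Scala collections (names here are illustrative, not Spark or Flink APIs): the flatMap/group/count shape below is the same one a WordCount takes in either engine's DSL.

```scala
// The collection-like shape both engines share, sketched with plain
// Scala collections: split lines into words, group, and count.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("\\s+"))          // lines -> words
      .groupBy(identity)                 // word -> occurrences
      .map { case (w, ws) => (w, ws.size) } // word -> count

  def main(args: Array[String]): Unit =
    println(wordCount(Seq("to be or not to be")))
}
```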
Abstraction
● Spark: RDD; Flink: Dataflow (JobGraph)
● Spark: RDD API for batch and DStream API for streaming; Flink: DataSet API for batch and DataStream API for streaming
● Spark: RDD does not go through an optimizer, DataFrame is the API for that; Flink: all APIs go through the optimizer, Dataset is the DataFrame equivalent
● Spark: can combine RDD and DStream; Flink: cannot combine DataSet and DataStream
● Spark: streaming over batch; Flink: batch over streaming
Memory management
● Spark: until 1.5, used the Java memory layout to manage cached data; Flink: custom memory management from day one
● Spark: many OOM errors due to inaccurate approximation of Java heap memory; Flink: fewer OOMs, as memory is pre-allocated and controlled manually
● Spark: RDD does not use the optimizer, so it can’t benefit from Tungsten; Flink: all APIs go through the optimizer, so all can depend on the controlled memory management
● Spark: no native support for off-heap memory; Flink: native off-heap memory support since version 0.10
● Spark: supports explicit data caching for interactive applications; Flink: no explicit data caching yet
Streaming
● Spark: near-realtime / micro-batch; Flink: realtime / event based
● Spark: supports only a single concept of time, i.e. processing time; Flink: supports multiple concepts of time, like event time and processing time
● Spark: supports windows based on processing time; Flink: supports windows based on time, record count, triggers, etc.
● Spark: great support for combining historical data and streams; Flink: limited support for combining historical data and streams
● Spark: supports most of the data streams used in the real world; Flink: limited stream support, restricted to Kafka as of now
Batch processing
● Spark: the runtime is a dedicated batch processor; Flink: uses a different optimizer to run batch on a streaming system
● Spark: all operators are blocking, i.e. partitioned, to run in a flow; Flink: even some batch operators are streamed, transparently to the program
● Spark: uses RDD recomputation for fault tolerance; Flink: uses Chandy-Lamport barrier checkpoints for fault tolerance
● As Spark has a dedicated batch processor, it is expected
to do better than Flink
● Not really: a stream pipeline can deliver better
performance even in batch [2]
Structured Data analysis
● Spark: excellent support for structured data using the DataSource API; Flink: only supports Hadoop InputFormat based APIs, for both structured and unstructured data
● Spark: supports a DataFrame DSL and a SQL interface for data analysis; Flink: limited DSL support using the Table API, and no SQL support yet
● Spark: integrates nicely with Hive; Flink: no Hive integration yet
● Spark: support for statically typed DataFrame transformations is being added in 1.6; Flink: supports statically typed structured data analysis using the Dataset API
● Spark: no DataFrame support for streaming yet; Flink: Table API is supported for streaming
Maturity
● Spark is already 5 years old in the Apache community,
whereas Flink is around 2 years old
● Spark is already at version 1.6, whereas Flink is yet to hit
1.0
● Spark has a great ecosystem and is mature compared
to Flink at this point in time
● Materials to learn and understand from are far better for
Spark than for Flink
● Very few companies are using Flink in production as of
now
FAQ
● Is this all real? Are you saying Flink is taking over
Spark?
● Can Spark adopt the new ideas proposed by Flink?
● How easy is it to move a project from Spark to Flink?
● Do you have any customers who are evaluating Flink as
of now?
● Are there any resources to learn more about it?
References
1. http://www.slideshare.net/stephanewen1/flink-history-roadmap-and-vision
2. http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink
References
● http://flink.apache.org/
● http://flink.apache.org/faq.html
● http://flink-forward.org/
● http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
● https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
Introduction to concurrent programming with akka actors
 
Functional programming in Scala
Functional programming in ScalaFunctional programming in Scala
Functional programming in Scala
 

Recently uploaded

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
oz8q3jxlp
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
ewymefz
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
benishzehra469
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
NABLAS株式会社
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
nscud
 

Recently uploaded (20)

一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
一比一原版(Deakin毕业证书)迪肯大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单一比一原版(BU毕业证)波士顿大学毕业证成绩单
一比一原版(BU毕业证)波士顿大学毕业证成绩单
 
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
Empowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptxEmpowering Data Analytics Ecosystem.pptx
Empowering Data Analytics Ecosystem.pptx
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
【社内勉強会資料_Octo: An Open-Source Generalist Robot Policy】
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
一比一原版(CBU毕业证)卡普顿大学毕业证成绩单
 

Introduction to Apache Flink

  • 1. Introduction to Apache Flink
A stream-based big data processing engine
https://github.com/phatak-dev/flink-examples
  • 2. ● Madhukara Phatak
● Big data consultant and trainer at datamantra.io
● Consults in Hadoop, Spark and Scala
● www.madhukaraphatak.com
  • 3. Agenda
● Evolution of big data frameworks to platform
● Shortcomings of the RDD abstraction
● Streaming dataflow as the abstraction layer
● Flink introduction
● Flink history
● Flink vs Spark
● Flink examples
  • 4. Big data frameworks
● Big data started as multiple frameworks focused on specific problems
● Every framework defined its own abstraction
● Though combining multiple frameworks is powerful in theory, putting them together is no fun
● Combining the right set of frameworks and making them work on the same platform became a non-trivial task
● Distributions like Cloudera and Hortonworks solved these problems to some extent
  • 5. Dawn of big data platforms
● The time of big data processing frameworks is over; it's time for big data platforms
● More and more advances in big data show that people want to use a single platform for their big data needs
● A single platform normally evolves faster compared to a distribution of multiple frameworks
● Spark pioneered the platform approach and others are following its lead
  • 6. Framework vs Platform
● Each framework is free to define its own abstraction, but a platform should define a single abstraction
● Frameworks often embrace multiple runtimes, whereas a platform embraces a single runtime
● Platforms normally support libraries rather than frameworks
● Frameworks can be on different versions; normally all libraries on a platform will be on a single version
  • 7. Abstractions in frameworks
● Map/Reduce provides the file as its abstraction, as it's a framework focused on batch processing
● Storm provides the realtime stream as its abstraction, as it's focused on real time processing
● Hive provides the table as its abstraction, as it's focused on structured data analysis
● Every framework defines its own abstraction, which determines its capabilities and limitations
  • 8. Abstraction in platform
● As we move to platforms, the abstraction provided by the platform becomes crucial
● The abstraction provided by a platform should be generic enough for all frameworks/libraries built on top of it
● Requirements of a platform abstraction:
○ Ability to support different kinds of libraries/frameworks on the platform
○ Ability to evolve with new requirements
○ Ability to harness advances in hardware
  • 9. Abstraction in Spark platform
● RDD is the single abstraction on which all other APIs are built
● RDD allowed Spark to evolve into a platform rather than just another framework
● RDD currently supports:
○ Batch processing
○ Structured data analysis using the DF abstraction
○ Streaming
○ Iterative processing - ML
  • 10. RDD is the assembly of big data?
● RDD as a platform abstraction was much better than any framework abstraction that came before it
● But as the platform idea becomes commonplace, the question has been raised: is RDD the assembly of big data?
● Databricks is convinced it is, but others disagree
● RDD seems to be showing some cracks in its perfect image, particularly in:
○ Streaming
○ Iterative processing
  • 11. RDD in streaming
● RDD is essentially a map/reduce, file-based abstraction that sits in memory
● RDD's file-like behaviour makes it difficult to adapt to streaming latencies
● This is the reason why Spark Streaming is a near-realtime, not a realtime, system
● This is holding the platform back from APIs like custom windows, e.g. windows based on record count or event time
  • 12. Spark streaming
● RDD is an abstraction based on partitions
● To satisfy the RDD abstraction, Spark Streaming inserts blocking operations like block boundaries and partition boundaries into the streaming data
● Normally it's good practice to avoid any blocking operation on a stream, but in Spark Streaming we can't avoid them
● Due to this, the dashboard in our example has to wait for all the records in the partition to be processed before it can see even the first result
[Diagram: input stream of records R1-R3 passes through a map operation (M1-M3), block boundary, and partition boundary before reaching the dashboard]
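The blocking behaviour described above can be sketched in plain Scala (no Spark APIs involved; the record names and the "M" prefix mirror the diagram and are purely illustrative): a per-record pipeline exposes its first result immediately, while a micro-batched pipeline exposes nothing until the whole block has been mapped.

```scala
// Illustrative sketch of micro-batch latency, assuming three records
// R1..R3 and a map operation that prefixes "M". No Spark APIs used.
object MicroBatchSketch extends App {
  val records = List("R1", "R2", "R3")

  // True streaming: each record is mapped and visible one at a time.
  val perRecord = records.iterator.map(r => "M" + r)
  val firstStreamed = perRecord.next() // available after a single record

  // Micro-batch: the block boundary closes first, then the whole
  // partition is mapped; only then can a dashboard see any result.
  val block = records                    // block collected over an interval
  val results = block.map(r => "M" + r)  // map runs over the full partition
  val firstBatched = results.head        // visible only after the full map

  println(firstStreamed) // MR1
  println(firstBatched)  // MR1 - same value, later in wall-clock time
}
```

Both paths compute the same values; the difference the slide is making is purely about when the first one becomes visible downstream.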
  • 13. Spark streaming in spikes
[Diagram: input stream of records R1-R5 passes through a map operation (M1-M5), block boundary, and partition boundary before reaching the dashboard]
● In case of spikes, as Spark Streaming is purely time-based, the amount of data in a single block increases
● As blocks grow, the partition size increases, which lengthens the blocking time
● So it puts more pressure on the processing system and increases latency in downstream systems
  • 15. RDD in iterative processing
● The Spark runtime does not support iteration at the runtime level
● RDD caching with explicit looping approximates the effect of iterative processing
● To support iterative processing, the platform should support some kind of cycle in its dataflow, which is currently a DAG
● So advanced matrix factorization algorithms are very hard to implement in Spark
  • 16. Stream as the abstraction
● A stream is a sequence of data elements made available over time
● A stream can be thought of as items on a conveyor belt being processed one at a time rather than in large batches
● Streams can be unbounded (streaming applications) or bounded (batch processing)
● Streams are becoming the new abstraction for building data pipelines
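As a minimal illustration of bounded vs unbounded streams (plain Scala 2.13, no Flink involved), `LazyList` models both cases: an infinite stream and a finite one run through the same lazy pipeline, and only how they are consumed differs.

```scala
// Sketch: bounded and unbounded data behind one stream abstraction.
// LazyList is lazy, so building the infinite stream is safe.
object StreamAbstractionSketch extends App {
  val unbounded: LazyList[Int] = LazyList.from(1)         // "streaming" input
  val bounded: LazyList[Int]   = LazyList(1, 2, 3, 4, 5)  // "batch" input

  // The same pipeline definition serves both kinds of input.
  def pipeline(in: LazyList[Int]): LazyList[Int] =
    in.map(_ * 2).filter(_ % 3 == 0)

  println(pipeline(bounded).toList)            // List(6)
  println(pipeline(unbounded).take(3).toList)  // List(6, 12, 18)
}
```

This is the sense in which batch is "just" a bounded stream: the bounded input terminates on its own, while the unbounded one needs an explicit `take` (or a window, in a real engine) to produce a finite result.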
  • 17. Streams in last two years
● There have been many projects started to make the stream the new abstraction layer for data processing
● Some of them are:
○ Reactive streams like akka-streams, akka-http
○ Java 8 streams
○ RxJava etc.
● Streams have been picking up steam for the last few years
  • 18. Why stream abstraction?
● The stream is one of the most generic interfaces out there
● Flexible in representing different data flow scenarios
● Streams are stateless by nature
● Streams are normally lazy and easily parallelizable
● Streams work well with functional programming
● Streams can model different data sources/sinks
○ A file can be treated as a stream of bytes
● Streams are normally good at exploiting system resources
  • 19. Why not streams?
● Long-running pipelined streams are hard to reason about
● As streams are stateless by default, fault tolerance becomes harder
● Streams change the way we look at data in a normal program
● As there may be multiple things running in parallel, debugging becomes tricky
  • 20. Stream abstraction in big data
● The stream is the new abstraction layer people are exploring in big data
● With the right implementation, streams can support both streaming and batch applications much more effectively than existing abstractions
● Batch on streaming is a new way of looking at processing, rather than treating streaming as a special case of batch
● Batch on a streaming engine can be faster than dedicated batch processing
  • 21. Apache Flink
● Flink's core is a streaming dataflow engine that provides data distribution, communication, and fault tolerance for distributed computations over data streams
● Flink provides:
○ DataSet API - for bounded streams
○ DataStream API - for unbounded streams
● Flink embraces the stream as the abstraction to implement its dataflow
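As a hedged sketch of the bounded-stream side (assuming Flink's Scala API of that era, e.g. the flink-scala artifact, on the classpath; this is illustrative and will not run without those dependencies), the classic batch WordCount on the DataSet API looks roughly like this:

```scala
// Sketch of Flink's DataSet (bounded stream) API; requires the
// flink-scala dependency and a Flink runtime, so illustrative only.
import org.apache.flink.api.scala._

object FlinkBatchWordCount {
  def main(args: Array[String]): Unit = {
    val env = ExecutionEnvironment.getExecutionEnvironment
    val text = env.fromElements("to be or not to be")
    val counts = text
      .flatMap(_.toLowerCase.split("\\s+")) // split lines into words
      .map((_, 1))                          // pair each word with a count
      .groupBy(0)                           // group on the word field
      .sum(1)                               // sum the count field
    counts.print()
  }
}
```

An unbounded job is structured the same way against a StreamExecutionEnvironment with the DataStream API; windows replace the terminating aggregation.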
  • 23. Flink history
● The Stratosphere project started at the Technical University of Berlin in 2009
● Incubated into Apache in March 2014
● Became a top-level project in December 2014
● Started as a streaming engine for batch processing
● Started to support streaming a few versions back
● Data Artisans is a company founded by the core Flink team
● The complete history is here [1]
  • 24. Spark vs Flink
● Spark was the first open source big data platform; Flink is following its lead
● They share many ideas and frameworks:
○ Same collection-like API
○ Same frameworks like Akka and YARN underneath
○ Full-fledged DSL in Scala
● It is very easy to port code from Spark to Flink as they have very similar APIs. Ex: WordCount.scala
● But there are many differences in the abstractions and runtime
  • 25. Abstraction
Spark | Flink
RDD | Dataflow (JobGraph)
RDD API for batch and DStream API for streaming | DataSet API for batch and DataStream API for streaming
RDD does not go through an optimizer; Dataframe is the API for that | All APIs go through the optimizer; Dataset is the Dataframe equivalent
Can combine RDD and DStream | Cannot combine DataSet and DataStream
Streaming over batch | Batch over streaming
  • 26. Memory management
Spark | Flink
Till 1.5, Java memory layout to manage cached data | Custom memory management from day one
Many OOM errors due to inaccurate approximation of Java heap memory | Fewer OOMs as memory is pre-allocated and controlled manually
RDD does not use the optimizer, so it can't benefit from Tungsten | All APIs go through the optimizer, so all can depend upon the controlled memory management
No native support for off-heap memory | Native support for off-heap memory from version 0.10
Supports explicit data caching for interactive applications | No explicit data caching yet
  • 27. Streaming
Spark | Flink
Near realtime / micro-batch | Realtime / event based
Only supports a single concept of time, i.e. processing time | Supports multiple concepts of time, like event time and processing time
Supports windows based on processing time | Supports windows based on time, record count, triggers, etc.
Great support for combining historical data and streams | Limited support for combining historical data and streams
Supports most of the data streams used in the real world | Limited support for streams, limited to Kafka as of now
  • 28. Batch processing
Spark | Flink
The Spark runtime is a dedicated batch processor | Flink uses a different optimizer to run batch on a streaming system
All operators are blocking, aka partitioned, to run in a flow | Even some batch operators are streamed, as it's transparent to the program
Uses RDD recomputation for fault tolerance | Uses the Chandy-Lamport algorithm's barrier checks for fault tolerance
● As Spark has dedicated batch processing, it's expected to do better compared to Flink
● Not really. A stream pipeline can result in better performance even in batch [2]
  • 29. Structured data analysis
Spark | Flink
Excellent support for structured data using the Datasource API | Only supports Hadoop InputFormat-based APIs for both structured and unstructured data
Supports a Dataframe DSL and a SQL interface for data analysis | Limited DSL support using the Table API and no SQL support yet
Integrates nicely with Hive | No Hive integration yet
Support for statically typed dataframe transformations is being added in 1.6 | Supports statically typed structured data analysis using the Dataset API
No Dataframe support for streaming yet | The Table API is supported for streaming
  • 30. Maturity
● Spark is already 5 years old in the Apache community, whereas Flink is around 2 years old
● Spark is already at version 1.6, whereas Flink is yet to hit 1.0
● Spark has great ecosystem support and is mature compared to Flink at this point in time
● Materials to learn and understand from are far better for Spark compared to Flink
● Very few companies are using Flink in production as of now
  • 31. FAQ
● Is this all real? Are you saying Flink is taking over Spark?
● Can Spark adopt the new ideas proposed by Flink?
● How easy is it to move a project from Spark to Flink?
● Do you have any customers who are evaluating Flink as of now?
● Are there any resources to learn more about it?
  • 33. References
● http://flink.apache.org/
● http://flink.apache.org/faq.html
● http://flink-forward.org/
● http://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html
● https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks