Presented By:
Kundan Kumar
Software Consultant
An introduction to
Apache Flink: 4G of
Big Data
KnolX Etiquettes
Lack of etiquette and manners is a huge turn-off.
Punctuality
Respect KnolX session timings. Please do not
join a session more than 5 minutes after
its start time.
Feedback
Make sure to submit constructive
feedback for all sessions, as it is
very helpful for the presenter.
Mute
Stay on mute unless you have
questions or concerns.
Avoid Disturbance
Avoid unwanted chit-chat during
the session.
Agenda
01 Big Data evolution
02 Introduction to Flink
03 Features of Flink
04 Architecture of Flink
05 Anatomy of a Flink program
06 Demo
Big Data Evolution
Problems with Big Data:
● Storing huge and exponentially growing datasets.
● Processing huge datasets that have complex structure.
● The 3 V's of Big Data: Volume, Variety, Velocity.
Big Data Evolution (continued)
● In the early 2000s, the Big Data era started with multiple frameworks, each focusing on
a specific Big Data problem.
Big Data Evolution (continued)
● A unified platform that can, on its own, handle the various Big Data problems:
➢ Batch processing
➢ Stream processing
➢ Graph processing
➢ Iterative processing
● A unified platform must have the following characteristics to solve Big
Data problems:
➢ Distributed/parallel computation
➢ Fault tolerance
➢ Ease of use (developer-friendly APIs)
➢ Powerful predefined operators/functions (such as join and filter)
➢ Speed
Apache Spark (3G Big Data Framework)
● Spark is a fast cluster computing engine, advertised as up to 100 times faster than
Hadoop MapReduce when running applications in memory.
● Apache Spark is best known for its in-memory computing capabilities, which
deliver high-speed processing.
➢ Problems
● Processes data streams in micro-batches rather than event by event in real time.
● High throughput but medium latency in some use cases.
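The latency cost of micro-batching can be illustrated with a small sketch. This is plain Python, not Spark or Flink code. It assumes a deliberately simplified model in which an event is only processed when its batch closes and processing itself takes no time. The function name `micro_batch_latencies` and the 1000 ms batch interval are illustrative assumptions.

```python
from typing import List


def micro_batch_latencies(arrivals_ms: List[int], interval_ms: int) -> List[int]:
    """Waiting time of each event until its micro-batch closes.

    Simplified model (an assumption): batches close at multiples of
    `interval_ms`, and processing time is ignored.
    """
    latencies = []
    for t in arrivals_ms:
        # The batch that contains t closes at the next multiple of the interval.
        batch_close = (t // interval_ms + 1) * interval_ms
        latencies.append(batch_close - t)
    return latencies


def per_event_latencies(arrivals_ms: List[int]) -> List[int]:
    """In the same model, event-at-a-time processing adds no waiting time."""
    return [0 for _ in arrivals_ms]


arrivals = [100, 450, 900, 1200]
print(micro_batch_latencies(arrivals, 1000))  # [900, 550, 100, 800]
print(per_event_latencies(arrivals))          # [0, 0, 0, 0]
```

In this model an event that arrives just after a batch boundary waits almost a full interval, which is one way to read the "medium latency" point above.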
Introduction to Flink
● Apache Flink is a Big Data framework and distributed processing engine for
stateful computations over unbounded and bounded data streams.
● Flink is based on the streaming-first principle: it is a true stream
processing engine and treats batch processing as a special case of
streaming.
● Flink has been designed to run in all common cluster environments and to perform
computations at in-memory speed and at any scale.
Source → Transformations → Sink
➢ A Flink application may consume real-time data from streaming sources such as
message queues or distributed logs, like Apache Kafka or Kinesis.
➢ Flink can also consume bounded, historical data from a variety of data sources.
➢ The result streams produced by a Flink application can be sent to a wide
variety of systems that are connected as sinks.
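The source, transformation, and sink stages described above can be sketched in plain Python. This is not the Flink API. The names `source`, `to_word_pairs`, `CollectSink`, and `run_pipeline` are hypothetical and exist only for this illustration. The same pipeline accepts a finite list (a bounded stream) or a generator that never ends (an unbounded stream), which loosely mirrors the idea that batch is a special case of streaming.

```python
from typing import Iterable, Iterator, List, Tuple


def source(lines: Iterable[str]) -> Iterator[str]:
    """Source: emits records one at a time from any iterable."""
    for line in lines:
        yield line


def to_word_pairs(records: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Transformation: splits each line into (word, 1) pairs."""
    for line in records:
        for word in line.lower().split():
            yield word, 1


class CollectSink:
    """Sink: stores results in memory (a stand-in for an external system)."""

    def __init__(self) -> None:
        self.results: List[Tuple[str, int]] = []

    def write(self, record: Tuple[str, int]) -> None:
        self.results.append(record)


def run_pipeline(lines: Iterable[str], sink: CollectSink) -> None:
    """Wires source -> transformation -> sink and pulls records through."""
    for record in to_word_pairs(source(lines)):
        sink.write(record)


sink = CollectSink()
run_pipeline(["Hello Flink", "hello"], sink)
print(sink.results)  # [('hello', 1), ('flink', 1), ('hello', 1)]
```

Because every stage is a generator, each record flows through the whole pipeline before the next one is read, rather than being collected into a batch first.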
➢ Programs in Flink are inherently parallel and distributed.
➢ During execution, a stream has one or more stream partitions, and each
operator has one or more operator subtasks.
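The split of a stream into partitions, one per operator subtask, can be sketched as follows. This is plain Python, not Flink code. Routing by `crc32(key) % parallelism` is an assumption made for illustration and is not Flink's actual key assignment scheme. The names `subtask_for` and `partition_stream` are hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Tuple


def subtask_for(key: str, parallelism: int) -> int:
    """Picks a subtask index for a key using a deterministic hash."""
    return zlib.crc32(key.encode("utf-8")) % parallelism


def partition_stream(
    events: Iterable[Tuple[str, int]], parallelism: int
) -> Dict[int, List[Tuple[str, int]]]:
    """Routes each (key, value) event to the partition of its subtask."""
    partitions: Dict[int, List[Tuple[str, int]]] = {
        index: [] for index in range(parallelism)
    }
    for key, value in events:
        partitions[subtask_for(key, parallelism)].append((key, value))
    return partitions


events = [("user-a", 1), ("user-b", 5), ("user-a", 2), ("user-c", 7)]
for index, partition in partition_stream(events, parallelism=3).items():
    print(index, partition)
```

All events with the same key land in the same partition, so a single subtask sees every event for that key, in arrival order.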
➢ Flink facilitates stateful operations.
➢ The handling of the current event can depend on the accumulated effect of all the events
that came before it.
➢ The set of parallel instances of a stateful operator is effectively a sharded
key-value store. Each parallel instance is responsible for handling events for a
specific group of keys, and the state for those keys is kept locally.
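The "sharded key-value store" view of a stateful operator can be sketched in plain Python. This is not the Flink state API. The classes `CountSubtask` and `KeyedCounter` are hypothetical, and routing keys with `crc32` modulo the parallelism is an illustrative assumption. Each parallel instance keeps a local dictionary holding state only for the keys routed to it.

```python
import zlib
from typing import Dict, List, Tuple


class CountSubtask:
    """One parallel instance: keeps a running count per key in local state."""

    def __init__(self) -> None:
        self.state: Dict[str, int] = {}

    def process(self, key: str) -> Tuple[str, int]:
        # The result depends on every earlier event seen for this key.
        self.state[key] = self.state.get(key, 0) + 1
        return key, self.state[key]


class KeyedCounter:
    """Stateful operator: routes each key to the subtask that owns it."""

    def __init__(self, parallelism: int) -> None:
        self.subtasks: List[CountSubtask] = [
            CountSubtask() for _ in range(parallelism)
        ]

    def process(self, key: str) -> Tuple[str, int]:
        index = zlib.crc32(key.encode("utf-8")) % len(self.subtasks)
        return self.subtasks[index].process(key)


counter = KeyedCounter(parallelism=2)
print([counter.process(key) for key in ["a", "b", "a"]])
# [('a', 1), ('b', 1), ('a', 2)]
```

Since a key is always routed to the same instance, its state never has to be shared or looked up remotely, which is the point of keeping state local.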
Flink Architecture
➢ The Flink 1.x architecture consists of several components, such as deployment,
core processing, and APIs.
➢ Flink has a layered architecture, and each component is part of a
specific layer.
➢ Each layer is built on top of the one below it, which gives clear abstraction.
Flink's Distributed Execution
➢ Flink is based on a master-slave architecture.
➢ Several processes take part in the execution of a Flink program, namely the
Job Manager, the Task Manager, and the Job Client.
Flink Task Manager
Flink Features
➢ High performance
➢ Exactly-once stateful computation
➢ Fault tolerance
➢ Memory management
➢ Optimizer
➢ Unified platform for stream and batch
➢ Rich Libraries
Basic Anatomy of a Flink Program
DEMO
Q/A
Thank You!
