
Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

This introductory-level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in open source. With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4G (4th generation) of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing. In this talk, you will learn: 1. What the Apache Flink stack is and how it fits into the Big Data ecosystem. 2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment. 3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark. 4. Who is using Apache Flink. 5. Where to learn more about Apache Flink.

Apache Flink: What, How, Why, Who, Where?
By @SlimBaltagi
Director of Big Data Engineering
Capital One
New York City (NYC) Apache Flink Meetup
Civic Hall, NYC
February 2nd, 2016
Agenda
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
II. How does Apache Flink integrate with Hadoop and other open source tools?
III. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark?
IV. Who is using Apache Flink?
V. Where to learn more about Apache Flink?
I. What is the Apache Flink stack and how does it fit into the Big Data ecosystem?
1. What is Apache Flink?
2. What is the Flink execution engine?
3. What are the Flink APIs?
4. What are the Flink domain-specific libraries?
5. What is the Flink architecture?
6. What is the Flink programming model?
7. What are the Flink tools?
1. What is Apache Flink?
1.1 Apache project with a cool logo!
1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework
1.3 Project with a unique vision and philosophy
1.4 Only hybrid (real-time streaming + batch) engine supporting many use cases
1.5 Major contributor to the movement towards unification of streaming and batch
1.6 The 4G of Big Data analytics frameworks
1.1 Apache project with a cool logo!
• Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data analytics.
• Apache Flink has its origins in a research project called Stratosphere, the idea for which was conceived in late 2008 by Professor Volker Markl of the Technische Universität Berlin in Germany.
• Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top-Level Project (TLP) in December 2014.
• dataArtisans (data-artisans.com) is a German start-up company based in Berlin that is leading the development of Apache Flink.
1.1 Apache project with a cool logo
• Squirrel is an animal: this reflects the harmony with other animals in the Hadoop ecosystem (Zoo): elephant, pig, python, camel, …
• A squirrel is swift and agile: this reflects the meaning of the word ‘Flink’, German for “nimble, swift, speedy”.
• Red color of the body: this reflects the roots of the project at German universities, in harmony with red squirrels in Germany.
• Colorful tail: this reflects an open source project, as the colors match those of the feather symbolizing the Apache Software Foundation.

Recommended

Overview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksOverview of Apache Fink: the 4 G of Big Data Analytics Frameworks
Overview of Apache Fink: the 4 G of Big Data Analytics FrameworksSlim Baltagi
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkSlim Baltagi
 
Apache Fink 1.0: A New Era for Real-World Streaming Analytics
Apache Fink 1.0: A New Era  for Real-World Streaming AnalyticsApache Fink 1.0: A New Era  for Real-World Streaming Analytics
Apache Fink 1.0: A New Era for Real-World Streaming AnalyticsSlim Baltagi
 
Slim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkSlim Baltagi – Flink vs. Spark
Slim Baltagi – Flink vs. SparkFlink Forward
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaFlink Forward
 
Apache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuApache Flink Crash Course by Slim Baltagi and Srini Palthepu
Apache Flink Crash Course by Slim Baltagi and Srini PalthepuSlim Baltagi
 
Overview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksOverview of Apache Flink: the 4G of Big Data Analytics Frameworks
Overview of Apache Flink: the 4G of Big Data Analytics FrameworksDataWorks Summit/Hadoop Summit
 

More Related Content

What's hot

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksSlim Baltagi
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Flink Forward
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streamsJoey Echeverria
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesPaco Nathan
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingTill Rohrmann
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopJim Dowling
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamDataWorks Summit/Hadoop Summit
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormJungtaek Lim
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksJamie Grier
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDataWorks Summit
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...DataWorks Summit/Hadoop Summit
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JFlink Forward
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveDataWorks Summit/Hadoop Summit
 

What's hot (20)

Why apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics FrameworksWhy apache Flink is the 4G of Big Data Analytics Frameworks
Why apache Flink is the 4G of Big Data Analytics Frameworks
 
LinkedIn
LinkedInLinkedIn
LinkedIn
 
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
Marc Schwering – Using Flink with MongoDB to enhance relevancy in personaliza...
 
Embeddable data transformation for real time streams
Embeddable data transformation for real time streamsEmbeddable data transformation for real time streams
Embeddable data transformation for real time streams
 
Strata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case StudiesStrata EU 2014: Spark Streaming Case Studies
Strata EU 2014: Spark Streaming Case Studies
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
From Device to Data Center to Insights
From Device to Data Center to InsightsFrom Device to Data Center to Insights
From Device to Data Center to Insights
 
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/HadoopHopsworks - Self-Service Spark/Flink/Kafka/Hadoop
Hopsworks - Self-Service Spark/Flink/Kafka/Hadoop
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 
Unified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache BeamUnified, Efficient, and Portable Data Processing with Apache Beam
Unified, Efficient, and Portable Data Processing with Apache Beam
 
Introduction to Apache NiFi And Storm
Introduction to Apache NiFi And StormIntroduction to Apache NiFi And Storm
Introduction to Apache NiFi And Storm
 
Migrating pipelines into Docker
Migrating pipelines into DockerMigrating pipelines into Docker
Migrating pipelines into Docker
 
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR BenchmarksExtending the Yahoo Streaming Benchmark + MapR Benchmarks
Extending the Yahoo Streaming Benchmark + MapR Benchmarks
 
Debunking Common Myths in Stream Processing
Debunking Common Myths in Stream ProcessingDebunking Common Myths in Stream Processing
Debunking Common Myths in Stream Processing
 
Cooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython NotebookCooperative Data Exploration with iPython Notebook
Cooperative Data Exploration with iPython Notebook
 
Design Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data AnalyticsDesign Patterns For Real Time Streaming Data Analytics
Design Patterns For Real Time Streaming Data Analytics
 
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
End to End Processing of 3.7 Million Telemetry Events per Second using Lambda...
 
Suneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4JSuneel Marthi - Deep Learning with Apache Flink and DL4J
Suneel Marthi - Deep Learning with Apache Flink and DL4J
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on HiveFaster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
Faster, Faster, Faster: The True Story of a Mobile Analytics Data Mart on Hive
 

Viewers also liked

Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberXiang Fu
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Slim Baltagi
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopSlim Baltagi
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamVerverica
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiSlim Baltagi
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry confluent
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitSlim Baltagi
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsSlim Baltagi
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaSlim Baltagi
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsSlim Baltagi
 
Flink Case Study: OKKAM
Flink Case Study: OKKAMFlink Case Study: OKKAM
Flink Case Study: OKKAMFlink Forward
 
Flink Case Study: Amadeus
Flink Case Study: AmadeusFlink Case Study: Amadeus
Flink Case Study: AmadeusFlink Forward
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital OneFlink Forward
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastoreKishore Gopalakrishna
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Carol Smith
 

Viewers also liked (17)

Flink vs. Spark
Flink vs. SparkFlink vs. Spark
Flink vs. Spark
 
Pinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ UberPinot: Near Realtime Analytics @ Uber
Pinot: Near Realtime Analytics @ Uber
 
Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink Step-by-Step Introduction to Apache Flink
Step-by-Step Introduction to Apache Flink
 
Building a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise HadoopBuilding a Modern Data Architecture with Enterprise Hadoop
Building a Modern Data Architecture with Enterprise Hadoop
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim BaltagiHadoop or Spark: is it an either-or proposition? By Slim Baltagi
Hadoop or Spark: is it an either-or proposition? By Slim Baltagi
 
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
Fundamentals of Stream Processing with Apache Beam, Tyler Akidau, Frances Perry
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summitAnalysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
Analysis-of-Major-Trends-in-big-data-analytics-slim-baltagi-hadoop-summit
 
Kafka Streams for Java enthusiasts
Kafka Streams for Java enthusiastsKafka Streams for Java enthusiasts
Kafka Streams for Java enthusiasts
 
Building Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache KafkaBuilding Streaming Data Applications Using Apache Kafka
Building Streaming Data Applications Using Apache Kafka
 
Apache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming AnalyticsApache Flink: Real-World Use Cases for Streaming Analytics
Apache Flink: Real-World Use Cases for Streaming Analytics
 
Flink Case Study: OKKAM
Flink Case Study: OKKAMFlink Case Study: OKKAM
Flink Case Study: OKKAM
 
Flink Case Study: Amadeus
Flink Case Study: AmadeusFlink Case Study: Amadeus
Flink Case Study: Amadeus
 
Flink Case Study: Capital One
Flink Case Study: Capital OneFlink Case Study: Capital One
Flink Case Study: Capital One
 
Pinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastorePinot: Realtime Distributed OLAP datastore
Pinot: Realtime Distributed OLAP datastore
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
 

Similar to Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksSlim Baltagi
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkSlim Baltagi
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Petr Novotný
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Alluxio, Inc.
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache FlinkAljoscha Krettek
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Jason Dai
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big dealeduarderwee
 
ApacheCon 2021 Apache Deep Learning 302
ApacheCon 2021   Apache Deep Learning 302ApacheCon 2021   Apache Deep Learning 302
ApacheCon 2021 Apache Deep Learning 302Timothy Spann
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingPaco Nathan
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamJ On The Beach
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaTimothy Spann
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overviewMartin Zapletal
 

Similar to Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi (20)

Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics FrameworksOverview of Apache Fink: The 4G of Big Data Analytics Frameworks
Overview of Apache Fink: The 4G of Big Data Analytics Frameworks
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
Available platforms for Big Data 2.0
Available platforms for Big Data 2.0Available platforms for Big Data 2.0
Available platforms for Big Data 2.0
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Analysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data AnalyticsAnalysis of Major Trends in Big Data Analytics
Analysis of Major Trends in Big Data Analytics
 
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
Rise of Intermediate APIs - Beam and Alluxio at Alluxio Meetup 2016
 
Robust stream processing with Apache Flink
Robust stream processing with Apache FlinkRobust stream processing with Apache Flink
Robust stream processing with Apache Flink
 
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
Build Deep Learning Applications for Big Data Platforms (CVPR 2018 tutorial)
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
Big data or big deal
Big data or big dealBig data or big deal
Big data or big deal
 
spark_v1_2
spark_v1_2spark_v1_2
spark_v1_2
 
Started with-apache-spark
Started with-apache-sparkStarted with-apache-spark
Started with-apache-spark
 
ApacheCon 2021 Apache Deep Learning 302
ApacheCon 2021   Apache Deep Learning 302ApacheCon 2021   Apache Deep Learning 302
ApacheCon 2021 Apache Deep Learning 302
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Realizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache BeamRealizing the promise of portability with Apache Beam
Realizing the promise of portability with Apache Beam
 
Real time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafkaReal time stock processing with apache nifi, apache flink and apache kafka
Real time stock processing with apache nifi, apache flink and apache kafka
 
Data streaming
Data streamingData streaming
Data streaming
 
Apache spark - History and market overview
Apache spark - History and market overviewApache spark - History and market overview
Apache spark - History and market overview
 

More from Slim Baltagi

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?Slim Baltagi
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiSlim Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesSlim Baltagi
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeSlim Baltagi
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiSlim Baltagi
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Slim Baltagi
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceSlim Baltagi
 

More from Slim Baltagi (7)

How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?How to select a modern data warehouse and get the most out of it?
How to select a modern data warehouse and get the most out of it?
 
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-BaltagiModern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
Modern-Data-Warehouses-In-The-Cloud-Use-Cases-Slim-Baltagi
 
Modern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetesModern big data and machine learning in the era of cloud, docker and kubernetes
Modern big data and machine learning in the era of cloud, docker and kubernetes
 
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision TreeApache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
Apache Kafka vs RabbitMQ: Fit For Purpose / Decision Tree
 
Apache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim BaltagiApache Flink community Update for March 2016 - Slim Baltagi
Apache Flink community Update for March 2016 - Slim Baltagi
 
Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities Big Data at CME Group: Challenges and Opportunities
Big Data at CME Group: Challenges and Opportunities
 
A Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to FinanceA Big Data Journey: Bringing Open Source to Finance
A Big Data Journey: Bringing Open Source to Finance
 

Apache-Flink-What-How-Why-Who-Where-by-Slim-Baltagi

  • 1. Apache Flink: What, How, Why, Who, Where? By @SlimBaltagi, Director of Big Data Engineering, Capital One. New York City (NYC) Apache Flink Meetup, Civic Hall, NYC, February 2nd, 2016
  • 2. Agenda I. What is Apache Flink stack and how it fits into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop and other open source tools? III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? IV. Who is using Apache Flink? V. Where to learn more about Apache Flink?
  • 3. I. What is Apache Flink stack and how it fits into the Big Data ecosystem? 1. What is Apache Flink? 2. What is Flink Execution Engine? 3. What are Flink APIs? 4. What are Flink Domain Specific Libraries? 5. What is Flink Architecture? 6. What is Flink Programming Model? 7. What are Flink tools?
  • 4. 1. What is Apache Flink? 1.1 Apache project with a cool logo! 1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework 1.3 Project with a unique vision and philosophy 1.4 Only hybrid (Real-Time Streaming + Batch) engine supporting many use cases 1.5 Major contributor to the movement of unification of streaming and batch 1.6 The 4G of Big Data Analytics frameworks
  • 5. 1.1 Apache project with a cool logo! Apache Flink, like Apache Hadoop and Apache Spark, is a community-driven open source framework for distributed Big Data Analytics. Apache Flink has its origins in a research project called Stratosphere, the idea for which was conceived in late 2008 by Professor Volker Markl of the Technische Universität Berlin in Germany. Flink joined the Apache Incubator in April 2014 and graduated as an Apache Top Level Project (TLP) in December 2014. dataArtisans (data-artisans.com), a German start-up based in Berlin, is leading the development of Apache Flink.
  • 6. 1.1 Apache project with a cool logo. Squirrel is an animal: this reflects the harmony with other animals in the Hadoop ecosystem (Zoo): elephant, pig, python, camel, … A squirrel is swift and agile: this reflects the meaning of the word ‘Flink’, German for “nimble, swift, speedy”. Red color of the body: this reflects the roots of the project at German universities, in harmony with red squirrels in Germany. Colorful tail: this reflects an open source project, as the colors match those of the feather symbolizing the Apache Software Foundation.
  • 7. 1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework. What is a typical Big Data Analytics Stack: Hadoop, Spark, Flink, …?
  • 8. 1.2 Project that evolved the concept of a multi-purpose Big Data analytics framework Apache Flink, written in Java and Scala, consists of: 1. Big data processing engine: distributed and scalable streaming dataflow engine 2. Several APIs in Java/Scala/Python: • DataSet API – Batch processing • DataStream API – Real-Time streaming analytics 3. Domain-Specific Libraries: • FlinkML: Machine Learning Library for Flink • Gelly: Graph Library for Flink • Table: Relational Queries • FlinkCEP: Complex Event Processing for Flink
  • 9. What is Apache Flink stack? [Stack diagram] APIs & LIBRARIES: DataSet (Java/Scala/Python) for Batch Processing and DataStream (Java/Scala) for Stream Processing; libraries and compatibility layers shown on top: Gelly, Table, FlinkML, Hadoop M/R, SAMOA, Google Dataflow, Dataflow (WiP), MRQL, Cascading, Zeppelin, Storm, Gelly-Stream. SYSTEM: Batch Optimizer and Stream Builder on top of the Runtime (Distributed Streaming Dataflow). DEPLOY: Local (Single JVM, Embedded, Docker); Cluster (Standalone, YARN, Mesos (WIP)); Cloud (Google’s GCE, Amazon’s EC2, IBM Docker Cloud, …). STORAGE: Files (Local, HDFS, S3, Azure Storage, Tachyon); Databases (MongoDB, HBase, SQL, …); Streams (Flume, Kafka, RabbitMQ, …)
  • 10. 1.3 Project with a unique vision and philosophy Apache Flink’s original vision was getting the best from both worlds: MPP Database Technology and Hadoop MapReduce Technology. Draws on concepts from MPP Database Technology: • Declarativity • Query optimization • Efficient parallel in-memory and out-of-core algorithms. Draws on concepts from Hadoop MapReduce Technology: • Massive scale-out • User Defined Functions • Complex data types • Schema on read. Adds: • Real-Time Streaming • Iterations • Memory Management • Advanced Dataflows • General APIs
  • 11. 1.3 Project with a unique vision and philosophy All streaming all the time: execute everything as streams, including batch! Write like a programming language, execute like a database. Relieve the user of much of the pain of: • manually tuning memory assignment to intermediate operators • dealing with physical execution concepts (e.g., choosing between broadcast and partitioned joins, reusing partitions).
  • 12. 1.3 Project with a unique vision and philosophy Little configuration required: • Requires no memory thresholds to configure – Flink manages its own memory • Requires no complicated network configurations – Pipelining engine requires much less memory for data exchange • Requires no serializers to be configured – Flink handles its own type extraction and data representation. Little tuning required: programs can be adjusted to data automatically – Flink’s optimizer can choose execution strategies automatically.
  • 13. 1.3 Project with a unique vision and philosophy Support for many file systems: • Flink is file system agnostic. BYOS: Bring Your Own Storage. Support for many deployment options: • Flink is agnostic to the underlying cluster infrastructure. BYOC: Bring Your Own Cluster. Be a good citizen of the Hadoop ecosystem: • Good integration with YARN. Preserve your investment in your legacy Big Data applications: run your legacy code on Flink’s powerful engine using the Hadoop and Storm compatibility layers and the Cascading adapter.
  • 14. 1.3 Project with a unique vision and philosophy Native support of many use cases on top of the same streaming engine: • Batch • Real-Time streaming • Machine learning • Graph processing • Relational queries. Support building complex data pipelines leveraging native libraries without the need to combine and manage external ones.
  • 15. 1.4 The only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases: Real-Time stream processing, Machine Learning at scale, Graph Analysis, Batch Processing
  • 16. 1.5 Major contributor to the movement of unification of streaming and batch The Dataflow proposal for incubation has been renamed to Apache Beam (for the combination of Batch and Stream) https://wiki.apache.org/incubator/BeamProposal Apache Beam was accepted into the Apache Incubator on February 1st, 2016 http://incubator.apache.org/projects/beam.html Dataflow/Beam & Spark: A Programming Model Comparison, February 3rd, 2016 https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison By Tyler Akidau & Frances Perry, Software Engineers, Apache Beam Committers
  • 17. 1.5 Major contributor to the movement of unification of streaming and batch Apache Flink includes DataFlow on Flink http://data-artisans.com/dataflow-proposed-as-apache-incubator-project/ Keynotes of the Flink Forward 2015 conference: • Keynote on October 12th, 2015 by Kostas Tzoumas and Stephan Ewen of dataArtisans http://www.slideshare.net/FlinkForward/k-tzoumas-s-ewen-flink-forward-keynote/ • Keynote on October 13th, 2015 by William Vambenepe of Google http://www.slideshare.net/FlinkForward/william-vambenepe-google-cloud-dataflow-and-flink-stream-processing-by-default
  • 18. 1.6 The 4G of Big Data Analytics frameworks Apache Flink is not YABDAF (Yet Another Big Data Analytics Framework)! Flink brings many technical innovations and a unique vision and philosophy that distinguish it from: other multi-purpose Big Data analytics frameworks such as Apache Hadoop and Apache Spark; single-purpose Big Data Analytics frameworks such as Apache Storm. Apache Flink is the 4G (4th Generation) of Big Data Analytics frameworks, succeeding Apache Spark.
  • 19. Apache Flink as the 4G of Big Data Analytics. 1st Generation (1G), MapReduce: Batch. 2nd Generation (2G), Directed Acyclic Graph (DAG) Dataflows: Batch, Interactive. 3rd Generation (3G), RDD (Resilient Distributed Datasets): Batch, Interactive, Near-Real Time Streaming, Iterative processing. 4th Generation (4G), Cyclic Dataflows: Hybrid (Streaming + Batch), Interactive, Real-Time Streaming, Native Iterative processing.
  • 20. How Big Data Analytics engines evolved? The evolution of Massive-Scale Data Processing, Tyler Akidau, Google. Strata + Hadoop World, Singapore, December 2, 2015. Slides: https://docs.google.com/presentation/d/10vs2PnjynYMtDpwFsqmSePtMnfJirCkXcHZ1SkwDg-s/present?slide=id.g63ca2a7cd_0_527 The world beyond batch: Streaming 101, Tyler Akidau, Google, August 5, 2015 http://radar.oreilly.com/2015/08/the-world-beyond-batch-streaming-101.html Streaming 102, Tyler Akidau, Google, January 20, 2016 https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 It covers topics like event-time vs. processing-time, windowing, watermarks, triggers, and accumulation.
  • 21. 2. What is Flink Execution Engine? The core of Flink is a distributed and scalable streaming dataflow engine with some unique features: 1. True streaming capabilities: execute everything as streams 2. Versatile: the engine can run existing MapReduce, Cascading, Storm and Google DataFlow applications 3. Native iterative execution: allows some cyclic dataflows 4. Handling of mutable state 5. Custom memory manager: operates on managed memory 6. Cost-Based Optimizer: for both batch and stream processing
  • 22. 3. Flink APIs 3.1 DataSet API for static data - Java, Scala, and Python 3.2 DataStream API for unbounded real-time streams - Java and Scala
  • 23. 3.1 DataSet API – Batch processing
      case class Word (word: String, frequency: Int)
      DataSet API (batch): WordCount
      val env = ExecutionEnvironment.getExecutionEnvironment()
      val lines: DataSet[String] = env.readTextFile(...)
      lines.flatMap {line => line.split(" ")
          .map(word => Word(word,1))}
        .groupBy("word").sum("frequency")
        .print()
      env.execute()
      DataStream API (streaming): Window WordCount
      val env = StreamExecutionEnvironment.getExecutionEnvironment()
      val lines: DataStream[String] = env.fromSocketStream(...)
      lines.flatMap {line => line.split(" ")
          .map(word => Word(word,1))}
        .window(Time.of(5,SECONDS)).every(Time.of(1,SECONDS))
        .keyBy("word").sum("frequency")
        .print()
      env.execute()
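The streaming WordCount on slide 23 counts words over a 5-second window that is re-evaluated every second. The following plain-Java sketch models only that sliding-window arithmetic; it does not use the Flink API. The `Line` type, the second-based timestamps and the half-open window bounds are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class SlidingWordCountSketch {
    /** One input line with an arrival time in seconds (hypothetical input model). */
    static final class Line {
        final long timestampSec;
        final String text;

        Line(long timestampSec, String text) {
            this.timestampSec = timestampSec;
            this.text = text;
        }
    }

    /** Counts words in lines whose timestamp lies in [windowEndSec - sizeSec, windowEndSec). */
    static Map<String, Integer> countWindow(List<Line> lines, long windowEndSec, long sizeSec) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Line line : lines) {
            boolean inWindow = line.timestampSec >= windowEndSec - sizeSec
                    && line.timestampSec < windowEndSec;
            if (!inWindow) {
                continue;
            }
            for (String word : line.text.split(" ")) {
                if (!word.isEmpty()) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Evaluates one window per slide step, keyed by the window end time. */
    static Map<Long, Map<String, Integer>> slidingCounts(
            List<Line> lines, long sizeSec, long slideSec, long lastEndSec) {
        Map<Long, Map<String, Integer>> results = new TreeMap<>();
        for (long end = slideSec; end <= lastEndSec; end += slideSec) {
            results.put(end, countWindow(lines, end, sizeSec));
        }
        return results;
    }

    public static void main(String[] args) {
        List<Line> lines = new ArrayList<>();
        lines.add(new Line(0, "to be or not"));
        lines.add(new Line(3, "to be"));
        lines.add(new Line(6, "be"));
        // Window of 5 seconds, evaluated every 1 second, as on the slide.
        System.out.println(slidingCounts(lines, 5, 1, 8));
    }
}
```

Because the window is longer than the slide step, each line contributes to several consecutive results, which is the behavior the `window(...).every(...)` pair on the slide expresses.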
  • 24. 3.2 DataStream API – Real-Time Streaming Analytics Flink Streaming provides a high-throughput, low-latency, stateful stream processing system with rich windowing semantics. Streaming fault tolerance provides exactly-once processing guarantees for Flink streaming programs that analyze streaming sources persisted by Apache Kafka. Flink Streaming provides native support for iterative stream processing. Data streams can be transformed and modified using high-level functions similar to the ones provided by the batch processing API.
  • 25. 3.2 DataStream API – Real-Time Streaming Analytics Flink, being based on a pipelined (streaming) execution engine akin to parallel database systems, makes it possible to: • implement true streaming & batch • integrate streaming operations with rich windowing semantics seamlessly • process streaming operations in a pipelined way with lower latency than micro-batch architectures and without the complexity of lambda architectures. It has built-in connectors to many data sources like Flume, Kafka, Twitter, RabbitMQ, etc.
  • 26. 3.2 DataStream API – Real-Time Streaming Analytics Apache Flink: streaming done right. Till Rohrmann. January 31, 2016 https://fosdem.org/2016/schedule/event/hpc_bigdata_flink_streaming/ Web resources about stream processing with Apache Flink at the Flink Knowledge Base http://sparkbigdata.com/component/tags/tag/49-flink-streaming
  • 27. 4. Flink Domain Specific Libraries 4.1 FlinkML – Machine Learning Library 4.2 Table – Relational queries 4.3 Gelly – Graph Analytics for Flink 4.4 FlinkCEP: Complex Event Processing for Flink
  • 28. 4.1 FlinkML - Machine Learning Library FlinkML is the Machine Learning (ML) library for Flink. It is written in Scala and was added in March 2015. FlinkML aims to provide: • an intuitive API • scalable ML algorithms • tools that help minimize glue code in end-to-end ML applications. FlinkML will allow data scientists to: • test their models locally using subsets of data • use the same code to run their algorithms at a much larger scale in a cluster setting.
  • 29. 4.1 FlinkML FlinkML’s unique features are: 1. Exploiting the in-memory data streaming nature of Flink. 2. Natively executing iterative processing algorithms, which are common in Machine Learning. 3. Streaming ML designed specifically for data streams. FlinkML: Large-scale machine learning with Apache Flink, Theodore Vasiloudis. October 21, 2015 Slides: https://sics.app.box.com/s/044omad6200pchyh7ptbyxkwvcvaiowu Video: https://www.youtube.com/watch?v=k29qoCm4c_k&feature=youtu.be Check more FlinkML web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/51-29
  • 30. 4.2 Table – Relational Queries The Table API, written in Scala, allows specifying operations using SQL-like expressions instead of manipulating DataSet or DataStream. The Table API can be used in both batch programs (on structured data sets) and streaming programs (on structured data streams). http://ci.apache.org/projects/flink/flink-docs-master/libs/table.html Flink Table web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/52-flink-table
  • 31. 4.2 Table API – Relational Queries Table API (queries)
      val customers = env.readCsvFile(…).as('id, 'mktSegment)
        .filter("mktSegment = AUTOMOBILE")
      val orders = env.readCsvFile(…)
        .filter( o => dateFormat.parse(o.orderDate).before(date) )
        .as("orderId, custId, orderDate, shipPrio")
      val items = orders
        .join(customers).where("custId = id")
        .join(lineitems).where("orderId = id")
        .select("orderId, orderDate, shipPrio, extdPrice * (Literal(1.0f) - discount) as revenue")
      val result = items
        .groupBy("orderId, orderDate, shipPrio")
        .select("orderId, revenue.sum, orderDate, shipPrio")
  • 32. 4.3 Gelly – Graph Analytics for Flink Gelly is Flink's large-scale graph processing API, available in Java and Scala, which leverages Flink's efficient delta iterations to map various graph processing models (vertex-centric and gather-sum-apply) to dataflows. Gelly provides: • A set of methods and utilities to create, transform and modify graphs • A library of graph algorithms which aims to simplify the development of graph analysis applications • Iterative graph algorithms are executed leveraging mutable state
  • 33. 4.3 Gelly – Graph Analytics for Flink Gelly allows Flink users to perform end-to-end data analysis, without having to build complex pipelines and combine different systems. It can be seamlessly combined with Flink's DataSet API, which means that pre-processing, graph creation, graph analysis and post-processing can be done in the same application. Gelly documentation https://ci.apache.org/projects/flink/flink-docs-master/libs/gelly_guide.html Introducing Gelly: Graph Processing with Apache Flink http://flink.apache.org/news/2015/08/24/introducing-flink-gelly.html Check out more Gelly web resources at the Apache Flink Knowledge Base: http://sparkbigdata.com/component/tags/tag/50-gelly
  • 34. 4.3 Gelly – Graph Analytics for Flink Single-pass Graph Streaming Analytics with Apache Flink. Vasia Kalavri & Paris Carbone. January 31, 2016 FOSDEM'16. Brussels, BELGIUM. • Talk description: https://fosdem.org/2016/schedule/event/graph_processing_apache_flink/ • Slides: http://www.slideshare.net/vkalavri/gellystream-singlepass-graph-streaming-analytics-with-apache-flink Gelly free training! http://www.slideshare.net/FlinkForward/vasia-kalavri-training-gelly-school http://gellyschool.com/
  • 35. 4.4 FlinkCEP: Complex Event Processing for Flink FlinkCEP is the complex event processing library for Flink. It allows you to easily detect complex event patterns in a stream of endless data. Complex events can then be constructed from matching sequences. This gives you the opportunity to quickly get hold of what’s really important in your data. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/libs/cep.html
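To make "detecting a complex event pattern" on slide 35 concrete, here is a minimal plain-Java sketch of one such pattern: an event of one type followed by an event of another type within a time bound. It is not the FlinkCEP API. The `Event` type, the event names and the matching policy (each pending first event is paired with the next second event, and the input is assumed to be ordered by timestamp) are assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class FollowedByPatternSketch {
    /** A typed event with a timestamp in seconds (hypothetical event model). */
    static final class Event {
        final String type;
        final long timestampSec;

        Event(String type, long timestampSec) {
            this.type = type;
            this.timestampSec = timestampSec;
        }
    }

    /**
     * Scans a timestamp-ordered stream and reports "first -> second" matches where the
     * second event arrives at most withinSec seconds after the first one.
     * Each match is rendered as "startTs->endTs".
     */
    static List<String> followedBy(List<Event> stream, String first, String second, long withinSec) {
        List<String> matches = new ArrayList<>();
        Deque<Event> pending = new ArrayDeque<>(); // partial matches waiting for the second event
        for (Event e : stream) {
            // Discard partial matches that can no longer complete in time.
            while (!pending.isEmpty() && e.timestampSec - pending.peekFirst().timestampSec > withinSec) {
                pending.removeFirst();
            }
            if (e.type.equals(first)) {
                pending.addLast(e);
            } else if (e.type.equals(second)) {
                // Complete every partial match that is still alive.
                while (!pending.isEmpty()) {
                    Event start = pending.removeFirst();
                    matches.add(start.timestampSec + "->" + e.timestampSec);
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<Event> stream = new ArrayList<>();
        stream.add(new Event("login_failed", 1));
        stream.add(new Event("page_view", 2));
        stream.add(new Event("password_reset", 4));
        System.out.println(followedBy(stream, "login_failed", "password_reset", 5));
    }
}
```

The queue of pending events is the state a pattern matcher has to keep between events; a library such as FlinkCEP manages that state and the time bound on the user's behalf.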
  • 36. 5. What is Flink Architecture? Flink implements the Kappa Architecture: run batch programs on a streaming system. References about the Kappa Architecture: • Questioning the Lambda Architecture - Jay Kreps, July 2nd, 2014 http://radar.oreilly.com/2014/07/questioning-the-lambda-architecture.html • Turning the database inside out with Apache Samza - Martin Kleppmann, March 4th, 2015 o http://www.youtube.com/watch?v=fU9hR3kiOK0 (VIDEO) o http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html (TRANSCRIPT) o http://blog.confluent.io/2015/03/04/turning-the-database-inside-out-with-apache-samza/ (BLOG)
  • 37. 5. What is Flink Architecture? 5.1 Client 5.2 Master (Job Manager) 5.3 Worker (Task Manager)
  • 38. 5.1 Client Type extraction. Optimize: in all APIs, not just SQL queries as in Spark. Construct job Dataflow graph. Pass job Dataflow graph to job manager. Retrieve job results.
      case class Path (from: Long, to: Long)
      val tc = edges.iterate(10) { paths: DataSet[Path] =>
        val next = paths
          .join(edges).where("to").equalTo("from") {
            (path, edge) => Path(path.from, edge.to)
          }
          .union(paths)
          .distinct()
        next
      }
      [Diagram: the Client (Type extraction, Optimizer) turns the program into a dataflow graph and passes it to the Job Manager. Example plan: Data Source orders.tbl, Filter, Map, DataSource lineitem.tbl, Join (Hybrid Hash: build HT, probe; hash-part [0], hash-part [0]), GroupRed (sort), forward]
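The program on slide 38 computes a transitive closure with a bounded iteration: in each round, known paths are joined with the edges, merged with the existing paths and de-duplicated. The plain-Java sketch below mirrors that logic on in-memory sets so the steps can be followed without a cluster. It is not Flink code, and the early exit once no new path appears is an addition of this sketch (the slide always runs 10 rounds; the result is the same).

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

final class TransitiveClosureSketch {
    /** A directed path from one vertex to another, mirroring the Path case class on the slide. */
    static final class Path {
        final long from;
        final long to;

        Path(long from, long to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Path)) {
                return false;
            }
            Path other = (Path) o;
            return from == other.from && to == other.to;
        }

        @Override
        public int hashCode() {
            return Objects.hash(from, to);
        }
    }

    /** Extends paths by one edge per round, for at most maxIterations rounds. */
    static Set<Path> transitiveClosure(Set<Path> edges, int maxIterations) {
        Set<Path> paths = new HashSet<>(edges);
        for (int i = 0; i < maxIterations; i++) {
            // Starting from a copy of paths plays the role of union(paths) and distinct().
            Set<Path> next = new HashSet<>(paths);
            for (Path path : paths) {
                for (Path edge : edges) {
                    if (path.to == edge.from) { // join condition: where("to").equalTo("from")
                        next.add(new Path(path.from, edge.to));
                    }
                }
            }
            if (next.equals(paths)) {
                break; // fixpoint reached; further rounds would add nothing
            }
            paths = next;
        }
        return paths;
    }

    public static void main(String[] args) {
        Set<Path> edges = new HashSet<>();
        edges.add(new Path(1, 2));
        edges.add(new Path(2, 3));
        edges.add(new Path(3, 4));
        System.out.println(transitiveClosure(edges, 10).size());
    }
}
```

The nested loop here is a naive join; on the slide, choosing the physical join strategy is exactly the part left to the optimizer running in the client.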
  • 39. 5.2 Job Manager (JM) with High Availability Parallelization: Create Execution Graph. Scheduling: Assign tasks to task managers. State tracking: Supervise the execution. [Diagram: the Job Manager distributes parallel instances of the dataflow graph (Data Source orders.tbl, Filter, Map, DataSource lineitem.tbl, Join with Hybrid Hash build HT and probe, hash-part [0], GroupRed sort, forward) to four Task Managers]
  • 40. 5.3 Task Manager (TM) Operations are split up into tasks depending on the specified parallelism. Each parallel instance of an operation runs in a separate task slot. The scheduler may run several tasks from different operators in one task slot. [Diagram: Task Managers, each with task slots]
  • 41. 6. What is Flink Programming Model? DataSet and DataStream as programming abstractions are the foundation for user programs and higher layers. Flink extends the MapReduce model with new operators that represent many common data analysis tasks more naturally and efficiently. All operators will start working in memory and gracefully go out of core under memory pressure.
  • 42. 6.1 DataSet DataSet: abstraction for distributed data and the central notion of the batch programming API. Files and other data sources are read into DataSets: • DataSet<String> text = env.readTextFile(…) Transformations on DataSets produce DataSets: • DataSet<String> first = text.map(…) DataSets are printed to files or on stdout: • first.writeAsCsv(…) Computation is specified as a sequence of lazily evaluated transformations. Execution is triggered with env.execute()
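Slide 42 states that transformations are evaluated lazily and only run once execution is triggered. The sketch below illustrates the same idea with `java.util.stream`, which is only an analogy and not the DataSet API: building the pipeline calls the map function zero times, and the terminal operation plays the role that `env.execute()` plays on the slide. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class LazyPipelineSketch {
    /**
     * Returns {calls before execution, calls after execution} of the map function,
     * showing that declaring a transformation does not run it.
     */
    static int[] mapCallsBeforeAndAfter(List<String> lines) {
        AtomicInteger calls = new AtomicInteger();

        // Declaring the transformation only records what should happen.
        Stream<String> plan = lines.stream().map(line -> {
            calls.incrementAndGet();
            return line.toUpperCase();
        });
        int before = calls.get();

        // The terminal operation triggers the actual computation.
        List<String> result = plan.collect(Collectors.toList());
        int after = calls.get();

        return new int[] {before, after, result.size()};
    }

    public static void main(String[] args) {
        int[] counts = mapCallsBeforeAndAfter(Arrays.asList("to be", "or not", "to be"));
        System.out.println("before=" + counts[0] + " after=" + counts[1]);
    }
}
```

Deferring execution is what gives an optimizer the chance to see the whole program before choosing how to run it.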
  • 43. 6.1 DataSet Used for Batch Processing [Diagram: a Source feeds a Data Set, an Operation transforms it into another Data Set, which is written to a Sink. Example: Map and Reduce operation]
  • 44. 6.2 DataStream Real-time event streams [Diagram: a Source feeds a Data Stream, an Operation transforms it into another Data Stream, which is written to a Sink] Example: Stream from a live stock feed. Stock Feed (Name, Price): Microsoft 124, Google 516, Apple 235, … Operations shown: Alert if Microsoft > 120; Write event to database; Sum every 10 seconds; Alert if sum > 10000
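The stock feed example on slide 44 combines a per-event rule (alert if Microsoft > 120) with a windowed rule (sum every 10 seconds, alert if the sum exceeds a threshold). The plain-Java sketch below works through both rules on a small in-memory list. It is not the DataStream API; the `Tick` type, the second-based timestamps and the alignment of windows to multiples of the window length are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class StockFeedSketch {
    /** One price update from the feed (hypothetical record layout). */
    static final class Tick {
        final long timestampSec;
        final String name;
        final double price;

        Tick(long timestampSec, String name, double price) {
            this.timestampSec = timestampSec;
            this.name = name;
            this.price = price;
        }
    }

    /** Per-event rule: one alert for every tick of the given stock above the threshold. */
    static List<String> priceAlerts(List<Tick> ticks, String name, double threshold) {
        List<String> alerts = new ArrayList<>();
        for (Tick tick : ticks) {
            if (tick.name.equals(name) && tick.price > threshold) {
                alerts.add(name + " at " + tick.price + " exceeds " + threshold);
            }
        }
        return alerts;
    }

    /** Windowed rule, step 1: sum of prices per tumbling window, keyed by window start. */
    static Map<Long, Double> tumblingSums(List<Tick> ticks, long windowSec) {
        Map<Long, Double> sums = new TreeMap<>();
        for (Tick tick : ticks) {
            long windowStart = (tick.timestampSec / windowSec) * windowSec; // non-negative timestamps assumed
            sums.merge(windowStart, tick.price, Double::sum);
        }
        return sums;
    }

    /** Windowed rule, step 2: window starts whose sum exceeds the threshold. */
    static List<Long> sumAlerts(Map<Long, Double> sums, double threshold) {
        List<Long> windows = new ArrayList<>();
        for (Map.Entry<Long, Double> entry : sums.entrySet()) {
            if (entry.getValue() > threshold) {
                windows.add(entry.getKey());
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        List<Tick> ticks = new ArrayList<>();
        ticks.add(new Tick(0, "Microsoft", 124));
        ticks.add(new Tick(1, "Google", 516));
        ticks.add(new Tick(2, "Apple", 235));
        System.out.println(priceAlerts(ticks, "Microsoft", 120));
        System.out.println(sumAlerts(tumblingSums(ticks, 10), 10000));
    }
}
```

A streaming engine evaluates these rules continuously as ticks arrive, whereas this sketch processes a finished list; the window arithmetic is the same in both cases.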
  • 45. 7. What are Apache Flink tools? 7.1 Command-Line Interface (CLI) 7.2 Web Submission Client 7.3 Job Manager Web Interface 7.4 Interactive Scala Shell 7.5 Zeppelin Notebook
  • 46. 7.1 Command-Line Interface (CLI) Flink provides a CLI to run programs that are packaged as JAR files, and control their execution. bin/flink has 4 major actions: • run #runs a program. • info #displays information about a program. • list #lists scheduled and running jobs. • cancel #cancels a running job. Example: ./bin/flink info ./examples/KMeans.jar See CLI usage and related examples: https://ci.apache.org/projects/flink/flink-docs-master/apis/cli.html
  • 47. 7.2 Web Submission Client
  • 48. 7.2 Web Submission Client Flink provides a web interface to: • Upload programs • Execute programs • Inspect their execution plans • Showcase programs • Debug execution plans • Demonstrate the system as a whole. The web interface runs on port 8080 by default. To specify a custom port, set the webclient.port property in the ./conf/flink-conf.yaml configuration file.
  • 49. 7.3 Job Manager Web Interface Overall system status. Job execution details. Task Manager resource utilization.
  • 50. 7.3 Job Manager Web Interface The JobManager web frontend allows you to: • Track the progress of a Flink program; all status changes are also logged to the JobManager’s log file. • Figure out why a program failed: it displays the exceptions of failed tasks and lets you identify which parallel task failed first and caused the other tasks to cancel the execution.
  • 51. 7.4 Interactive Scala Shell bin/start-scala-shell.sh --host localhost --port 6123
  • 52. 7.4 Interactive Scala Shell Flink comes with an Interactive Scala Shell, a REPL (Read Evaluate Print Loop): ./bin/start-scala-shell.sh • Interactive queries • Lets you explore data quickly • It can be used in a local setup as well as in a cluster setup. • The Flink Shell comes with command history and auto completion. • Complete Scala API available • So far only batch mode is supported. There is a plan to add streaming in the future: https://ci.apache.org/projects/flink/flink-docs-master/scala_shell.html
  • 54. 7.5 Zeppelin Notebook Web-based interactive computation environment. Collaborative data analytics and visualization tool. Combines rich text, execution code, plots and rich media. Exploratory data science. Saving and replaying of written code. Storytelling.
  • 55. Agenda I. What is Apache Flink stack and how it fits into the Big Data ecosystem? II. How Apache Flink integrates with Hadoop and other open source tools? III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? IV. Who is using Apache Flink? V. Where to learn more about Apache Flink? 55
  • 56. II. How Apache Flink integrates with Hadoop and other open source tools? Service Open Source Tool Storage/Servi ng Layer Data Formats Data Ingestion Services Resource Management 56
  • 57. II. How Apache Flink integrates with Hadoop and other open source tools? Flink integrates well with other open source tools for data input and output as well as deployment. Flink allows to run legacy Big Data applications: MapReduce, Cascading and Storm applications Flink integrates with other open source tools 1. Data Input / Output 2. Deployment 3. Legacy Big Data applications 4. Other tools 57
  • 58. 1. Data Input / Output HDFS to read and write. Secure HDFS support Reuse data types (that implement Writables interface) Amazon S3 Microsoft Azure Storage MapR-FS Flink + Tachyon http://tachyon-project.org/ Running Apache Flink on Tachyon http://tachyon-project.org/Running- Flink-on-Tachyon.html  Flink + XtreemFS http://www.xtreemfs.org/ 58
  • 59. 1. Data Input / Output  Crunching Parquet Files with Apache Flink https://medium.com/@istanbul_techie/crunching-parquet-files-with-apache-flink- 200bec90d8a7 Here are some examples of how to read/write data from/to HBase: https://github.com/apache/flink/tree/master/flink-staging/flink- hbase/src/test/java/org/apache/flink/addons/hbase/example Using MongoDB with Flink: http://flink.apache.org/news/2014/01/28/querying_mongodb.html https://github.com/m4rcsch/flink-mongodb-example 59
  • 60. 1. Data Input / Output Apache Kafka, a system that provides durability and pub/sub functionality for data streams. Kafka + Flink: A practical, how-to guide. Robert Metzger and Kostas Tzoumas, September 2, 2015 http://data-artisans.com/kafka-flink-a-practical-how-to/ https://www.youtube.com/watch?v=7RPQUsy4qOM Click-Through Example for Flink’s KafkaConsumer Checkpointing. Robert Metzger, September 2, 2015. http://www.slideshare.net/robertmetzger1/clickthrough-example-for-flinks-kafkaconsumer-checkpointing MapR Streams (proprietary alternative to Kafka that is compatible with Apache Kafka 0.9 API) provides out of the box integration with Apache 60
  • 61. 1. Data Input / Output Using Apache NiFi with Flink: • Flink and NiFi: Two Stars in the Apache Big Data Constellation. Matthew Ring, January 19, 2016 http://www.slideshare.net/mring33/flink-and-nifi-two-stars-in-the-apache-big-data-constellation • Integration of Apache Flink and Apache NiFi. Bryan Bende, February 4, 2016 http://www.slideshare.net/BryanBende/integrating-nifi-and-flink Using Elasticsearch with Flink: https://www.elastic.co/ Building real-time dashboard applications with Apache Flink, Elasticsearch, and Kibana. By Fabian Hueske, December 7, 2015. https://www.elastic.co/blog/building-real-time-dashboard-applications-with-apache-flink-elasticsearch-and-kibana 61
  • 62. 2. Deployment Deploy inside of Hadoop via YARN • YARN Setup http://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html • YARN Configuration http://ci.apache.org/projects/flink/flink-docs-master/setup/config.html#yarn Apache Flink cluster deployment on Docker using Docker-Compose by Simon Laws from IBM. Talk at Flink Forward in Berlin on October 12, 2015. Slides: http://www.slideshare.net/FlinkForward/simon-laws-apache-flink-cluster-deployment-on-docker-and-dockercompose Video recording (40:49): https://www.youtube.com/watch?v=CaObaAv9tLE 62
  • 63. 3. Legacy Big Data applications Flink’s MapReduce compatibility layer lets you: • run legacy Hadoop MapReduce jobs • reuse Hadoop input and output formats • reuse functions like Map and Reduce. References: • Documentation: https://ci.apache.org/projects/flink/flink-docs-release-0.7/hadoop_compatibility.html • Hadoop Compatibility in Flink by Fabian Hueske, November 18, 2014 http://flink.apache.org/news/2014/11/18/hadoop-compatibility.html • Apache Flink - Hadoop MapReduce Compatibility. Fabian Hueske, January 29, 2015 http://www.slideshare.net/fhueske/flink-hadoopcompat20150128 63
  • 64. 3. Legacy Big Data applications Cascading on Flink lets you port existing Cascading-MapReduce applications to Apache Flink with virtually no code changes. http://www.cascading.org/cascading-flink/ Expected advantages are a performance boost and lower resource consumption. References: • Cascading on Apache Flink, Fabian Hueske, data Artisans. Flink Forward 2015. October 12, 2015 • http://www.slideshare.net/FlinkForward/fabian-hueske-training-cascading-on-flink • https://www.youtube.com/watch?v=G7JlpARrFkU • Cascading connector for Apache Flink. Code on GitHub https://github.com/dataArtisans/cascading-flink • Running Scalding jobs on Apache Flink, Ian Hummel, December 20, 2015 http://themodernlife.github.io/scala/hadoop/hdfs/sclading/flink/streaming/realtime/2015/12/20/running-scalding-jobs-on-apache-flink/ 64
  • 65. 3. Legacy Big Data applications Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm: • Execute existing Storm topologies using Flink as the underlying engine. • Reuse legacy application code (bolts and spouts) inside Flink programs. https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/storm_compatibility.html A Tale of Squirrels and Storms. Matthias J. Sax, October 13, 2015. Flink Forward 2015 http://www.slideshare.net/FlinkForward/matthias-j-sax-a-tale-of-squirrels-and-storms https://www.youtube.com/watch?v=aGQQkO83Ong Storm Compatibility in Apache Flink: How to run existing Storm topologies on Flink. Matthias J. Sax, December 11, 2015 http://flink.apache.org/news/2015/12/11/storm-compatibility.html 65
  • 66. 4. Other tools Ambari service for Apache Flink: install, configure, manage Apache Flink on HDP, November 17, 2015 https://community.hortonworks.com/repos/4122/ambari-service-for-apache-flink.html Exploring Apache Flink with HDP https://community.hortonworks.com/articles/2659/exploring-apache-flink-with-hdp.html Apache Flink + Apache SAMOA for Machine Learning on streams http://samoa.incubator.apache.org/ Flink integrates with Zeppelin http://zeppelin.incubator.apache.org/ http://www.slideshare.net/FlinkForward/moon-soo-lee-data-science-lifecycle-with-apache-flink-and-apache-zeppelin Flink + Apache MRQL http://mrql.incubator.apache.org 66
  • 67. 4. Other tools Google Cloud Dataflow (GA on August 12, 2015) is a fully-managed cloud service and a unified programming model for batch and streaming big data processing. https://cloud.google.com/dataflow/ (Try it FREE) Flink-Dataflow is a Google Cloud Dataflow SDK Runner for Apache Flink. It enables you to run Dataflow programs with Flink as an execution engine. References: Google Cloud Dataflow on top of Apache Flink, Maximilian Michels, data Artisans. Flink Forward conference, October 12, 2015 Slides: http://www.slideshare.net/FlinkForward/maximilian-michels-google-cloud-dataflow-on-top-of-apache-flink Video recording: https://www.youtube.com/watch?v=K3ugWmHb7CE 67
  • 69. III. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark? 1. Why Flink is an alternative to Hadoop MapReduce? 2. Why Flink is an alternative to Apache Storm? 3. Why Flink is an alternative to Apache Spark? 4. What are the benchmarking results against Flink? 69
  • 70. 2. Why Flink is an alternative to Hadoop MapReduce? 1. Flink offers cyclic dataflows compared to the two-stage, disk-based MapReduce paradigm. 2. Flink's application programming interface (API) is easier to use than Hadoop’s MapReduce API. 3. Flink is easier to test than MapReduce. 4. Flink can leverage in-memory processing, data streaming and iteration operators for faster data processing. 5. Flink can work on file systems other than Hadoop. 70
  • 71. 2. Why Flink is an alternative to Hadoop MapReduce? 6. Flink lets users work in a unified framework and build a single data workflow that combines, for example, streaming, batch, SQL and machine learning. 7. Flink can analyze real-time streaming data. 8. Flink can process graphs using its own Gelly library. 9. Flink can use Machine Learning algorithms from its own FlinkML library. 10. Flink supports interactive queries and iterative algorithms, which are not well served by Hadoop MapReduce. 71
  • 72. 2. Why Flink is an alternative to Hadoop MapReduce? 11. Flink extends the MapReduce model with new operators: join, cross, union, iterate, delta iterate, cogroup, … [Diagram: MapReduce as Input → Map → Reduce → Output, next to a Flink dataflow in which DataSets flow from Input through Map, Reduce and Join operators to Output.] 72
  • 73. 3. Why Flink is an alternative to Storm? 1. Higher-level and easier to use API 2. Lower latency • Thanks to its pipelined engine 3. Exactly-once processing guarantees • Variation of the Chandy-Lamport algorithm 4. Higher throughput • Controllable checkpointing overhead 5. Flink separates application logic from recovery • Checkpointing interval is just a configuration parameter 73
  • 74. 3. Why Flink is an alternative to Storm? 6. More lightweight fault tolerance strategy 7. Stateful operators 8. Native support for iterative stream processing. 9. Flink also supports batch processing 10. Flink offers Storm compatibility • Flink is compatible with Apache Storm interfaces and therefore allows reusing code that was implemented for Storm. https://ci.apache.org/projects/flink/flink-docs-master/apis/storm_compatibility.html 74
  • 75. 3. Why Flink is an alternative to Storm? Extending the Yahoo! Streaming Benchmark, by Jamie Grier. February 2nd, 2016 http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ Code at GitHub: https://github.com/dataArtisans/yahoo-streaming-benchmark Results show that Flink has much better throughput than Storm and better fault-tolerance guarantees: exactly-once. High-throughput, low-latency, and exactly-once stream processing with Apache Flink. The evolution of fault-tolerant streaming architectures and their performance. Kostas Tzoumas, August 5th 2015 http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/ 75
  • 76. 4. Why Flink is an alternative to Spark? 4.1 True Low latency streaming engine • Spark’s micro-batches aren’t good enough! • Unified batch and real-time streaming in a single engine • The streaming model of Flink is based on the Dataflow model similar to Google Dataflow 4.2 Unique windowing features not available in Spark • support for event time • out of order streams • a mechanism to define custom windows based on window assigners and triggers. 76
  • 77. 4. Why Flink is an alternative to Spark? 4.3 Native closed-loop iteration operators • make graph and machine learning applications run much faster 4.4 Custom memory manager • no more frequent Out Of Memory errors! • Flink’s own type extraction component • Flink’s own serialization component 4.5 Automatic Cost Based Optimizer • little re-configuration and little maintenance when the cluster characteristics change and the data evolves over time 77
  • 78. 4. Why Flink is an alternative to Apache Spark? 4.6 Little configuration required 4.7 Little tuning required 4.8 Flink has better performance 78
  • 79. 4.1 True low latency streaming engine Some claim that 95% of streaming use cases can be handled with micro-batches!? Really!!! Spark’s micro-batching isn’t good enough for many time-critical applications that need to process large streams of live data and provide results in real time. Below are several use cases, taken from real industrial situations, where batch or micro-batch processing is not appropriate. References: • MapR Streams FAQ https://www.mapr.com/mapr-streams-faq#question12 • Apache Spark vs. Apache Flink, January 13, 2015. Whiteboard walkthrough by Balaji Narasimhalu from MapR https://www.youtube.com/watch?v=Dzx-iE6RN4w 79
  • 80. 4.1 True low latency streaming engine Financial Services – Real-time fraud detection. – Real-time mobile notifications. Healthcare – Smart hospitals - collect data and readings from hospital devices (vitals, IVs, MRI, etc.) and analyze and alert in real time. – Biometrics - collect and analyze data from patient devices that collect vitals while outside of care facilities. Ad Tech – Real-time user targeting based on segment and preferences. Oil & Gas – Real-time monitoring of pumps/rigs. 80
  • 81. 4.1 True low latency streaming engine Retail – Build an intelligent supply chain by placing sensors or RFID tags on items to alert if items aren’t in the right place, or proactively order more if supply is low. – Smart logistics with real-time end-to-end tracking of delivery trucks. Telecommunications – Real-time antenna optimization based on user location data. – Real-time charging and billing based on customer usage, ability to populate up-to-date usage dashboards for users. – Mobile offers. – Optimized advertising for video/audio content based on what users are consuming. 81
  • 82. 4.1 True low latency streaming engine “I would consider stream data analysis to be a major unique selling proposition for Flink. Due to its pipelined architecture Flink is a perfect match for big data stream processing in the Apache stack.” – Volker Markl Ref.: On Apache Flink. Interview with Volker Markl, June 24th 2015 http://www.odbms.org/blog/2015/06/on-apache-flink-interview-with-volker-markl/ Apache Flink uses streams for all workloads: streaming, SQL, micro-batch and batch. Batch is just treated as a finite set of streamed data. This makes Flink the most sophisticated distributed open source Big Data processing engine. 82
  • 83. 4.2 Unique windowing features not available in Spark Streaming Besides arrival time, support for event time or a mixture of both for out of order streams. Custom windows based on window assigners and triggers. How Apache Flink enables new streaming applications. Part I: The power of event time and out of order stream processing. December 9, 2015 by Stephan Ewen and Kostas Tzoumas http://data-artisans.com/how-apache-flink-enables-new-streaming-applications-part-1/ How Apache Flink enables new streaming applications. Part II: State and versioning. February 3, 2016 by Ufuk Celebi and Kostas Tzoumas http://data-artisans.com/how-apache-flink-enables-new-streaming-applications/ 83
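To make the event-time idea concrete, here is a minimal plain-Java sketch of a tumbling window assigner. It does not use the Flink API: the class name, method names and sample timestamps are all hypothetical, and triggers and watermarks are deliberately left out. It only shows that when windows are assigned by event timestamp, out-of-order arrival does not change which window an event belongs to.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch (no Flink dependency) of a tumbling event-time window assigner. */
class EventTimeWindowSketch {

    /** Start of the tumbling window that contains the given (non-negative) event timestamp. */
    static long windowStart(long eventTimeMillis, long windowSizeMillis) {
        return eventTimeMillis - (eventTimeMillis % windowSizeMillis);
    }

    /** Counts events per window, keyed by window start, whatever order they arrive in. */
    static Map<Long, Integer> countPerWindow(long[] eventTimesInArrivalOrder, long windowSizeMillis) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (long eventTime : eventTimesInArrivalOrder) {
            counts.merge(windowStart(eventTime, windowSizeMillis), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical timestamps in arrival order; 3000 and 9000 arrive after later events.
        long[] arrivals = {1000L, 4000L, 12000L, 3000L, 11000L, 9000L};
        System.out.println(countPerWindow(arrivals, 5000L));
    }
}
```

A real engine also has to decide when a window is complete enough to emit, which is the job of triggers and is not modeled in this sketch.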
  • 84. 4.2 Unique windowing features not available in Spark Streaming Flink 0.10: A significant step forward in open source stream processing. November 17, 2015. By Fabian Hueske and Kostas Tzoumas http://data-artisans.com/flink-0-10-a-significant-step-forward-in-open-source-stream-processing/ Dataflow/Beam & Spark: A Programming Model Comparison. February 3, 2016. By Tyler Akidau & Frances Perry, Software Engineers, Apache Beam Committers https://cloud.google.com/dataflow/blog/dataflow-beam-and-spark-comparison 84
  • 85. 4.3 Iteration Operators Why Iterations? Many Machine Learning and Graph processing algorithms need iterations! For example: Machine Learning Algorithms • Clustering (K-Means, Canopy, …) • Gradient descent (Logistic Regression, Matrix Factorization) Graph Processing Algorithms • PageRank, LineRank • Path algorithms on graphs (shortest paths, centralities, …) • Graph communities / dense sub-components • Inference (Belief propagation) 85
  • 86. 4.3 Iteration Operators Flink's API offers two dedicated iteration operations: Iterate and Delta Iterate. Flink executes programs with iterations as cyclic data flows: a data flow program (and all its operators) is scheduled just once. In each iteration, the step function consumes the entire input (the result of the previous iteration, or the initial data set), and computes the next version of the partial solution. 86
  • 87. 4.3 Iteration Operators Delta iterations run only on the parts of the data that are changing and can significantly speed up many machine learning and graph algorithms, because the work in each iteration decreases as the number of iterations goes up. Documentation on iterations with Apache Flink http://ci.apache.org/projects/flink/flink-docs-master/apis/iterations.html 87
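The shrinking-work effect of delta iterations can be sketched in plain Java, without the Flink API. The example below is an illustration only, with hypothetical class and method names: it computes connected components by label propagation, keeps a solution set (the current label of each vertex) and a workset (the vertices whose label changed in the previous superstep), and records the workset size of every superstep.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Plain-Java sketch of a delta (workset) iteration: connected components by label propagation. */
class DeltaIterationSketch {

    /** Builds an undirected adjacency map from edge pairs. */
    static Map<Integer, List<Integer>> graph(int[][] edges) {
        Map<Integer, List<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new ArrayList<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new ArrayList<>()).add(e[0]);
        }
        return adj;
    }

    /**
     * Returns the component label (smallest vertex id in the component) of each vertex.
     * worksetSizes receives the number of vertices processed in each superstep.
     */
    static Map<Integer, Integer> connectedComponents(
            Map<Integer, List<Integer>> adj, List<Integer> worksetSizes) {
        // Solution set: current label per vertex, initially the vertex's own id.
        Map<Integer, Integer> labels = new HashMap<>();
        for (Integer v : adj.keySet()) {
            labels.put(v, v);
        }
        // Workset: vertices whose label changed in the previous superstep.
        Set<Integer> workset = new HashSet<>(adj.keySet());
        while (!workset.isEmpty()) {
            worksetSizes.add(workset.size());
            // Delta: improved labels found in this superstep, applied only at its end.
            Map<Integer, Integer> delta = new HashMap<>();
            for (Integer v : workset) {
                int label = labels.get(v);
                for (Integer neighbor : adj.get(v)) {
                    if (label < labels.get(neighbor)) {
                        delta.merge(neighbor, label, Integer::min);
                    }
                }
            }
            labels.putAll(delta);                      // update the solution set
            workset = new HashSet<>(delta.keySet());   // only changed vertices are revisited
        }
        return labels;
    }

    public static void main(String[] args) {
        int[][] edges = {{1, 2}, {2, 3}, {3, 4}, {5, 6}};
        List<Integer> sizes = new ArrayList<>();
        System.out.println("labels = " + connectedComponents(graph(edges), sizes));
        System.out.println("workset size per superstep = " + sizes);
    }
}
```

On this small hypothetical graph the workset shrinks with every superstep, whereas a bulk iteration would touch all six vertices each time.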
  • 88. 4.3 Iteration Operators Non-native iterations in Hadoop and Spark are implemented as regular for-loops outside the system: for (int i = 0; i < maxIterations; i++) { // Execute MapReduce job } [Diagram: a client submitting one Step job after another.] 88
  • 89. 4.3 Iteration Operators Although Spark caches data across iterations, it still needs to schedule and execute a new set of tasks for each iteration. In Spark, looping is driver-based: • Loop outside the system, in the driver program • Iterative program looks like many independent jobs In Flink, iterations are built in: • Dataflow with feedback edges • System is iteration-aware and can optimize the job Spinning Fast Iterative Data Flows, Ewen et al., 2012: http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf The Apache Flink model for incremental iterative dataflow processing. 89
  • 90. 4.4 Custom Memory Manager Features: C++ style memory management inside the JVM. User data stored in serialized byte arrays in the JVM. Memory is allocated, de-allocated, and used strictly through an internal buffer pool implementation. Advantages: 1. Flink will not throw an OOM exception on you. 2. Reduced Garbage Collection (GC) 3. Very efficient disk spilling and network transfers 4. No need for runtime tuning 5. More reliable and stable performance 90
  • 91. 4.4 Custom Memory Manager Flink contains its own memory management stack. To do that, Flink contains its own type extraction and serialization components. Example user type: public class WC { public String word; public int count; } [Diagram: the JVM heap split into unmanaged memory (user code objects), managed memory (a pool of memory pages used for sorting, hashing and caching) and network buffers (shuffles/broadcasts).] 91
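The idea of keeping records as serialized bytes in a fixed pool of pages can be sketched in plain Java. This is an illustration under stated assumptions, not Flink's actual memory manager: the class names, the 64-byte page size and the record layout (length, UTF-8 bytes, count) are all hypothetical. The point is that a full page or an exhausted pool is reported to the caller, who could then spill to disk, instead of growing the heap.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Deque;

/** Plain-Java sketch of storing (word, count) records as bytes in pre-allocated pages. */
class MemoryPagesSketch {
    static final int PAGE_SIZE = 64; // hypothetical and tiny on purpose

    /** Fixed pool of pages; no page is allocated after construction. */
    static final class PagePool {
        private final Deque<ByteBuffer> free = new ArrayDeque<>();

        PagePool(int pages) {
            for (int i = 0; i < pages; i++) {
                free.push(ByteBuffer.allocate(PAGE_SIZE));
            }
        }

        /** Returns a free page, or null when the pool is exhausted. */
        ByteBuffer request() {
            return free.poll();
        }

        /** Hands a page back to the pool for reuse. */
        void release(ByteBuffer page) {
            page.clear();
            free.push(page);
        }
    }

    /** Appends a record as [int length][UTF-8 bytes][int count]; returns false if the page is full. */
    static boolean write(ByteBuffer page, String word, int count) {
        byte[] utf8 = word.getBytes(StandardCharsets.UTF_8);
        if (page.remaining() < Integer.BYTES + utf8.length + Integer.BYTES) {
            return false;
        }
        page.putInt(utf8.length).put(utf8).putInt(count);
        return true;
    }

    /** Reads the word of the record that starts at the given byte offset. */
    static String readWord(ByteBuffer page, int offset) {
        byte[] utf8 = new byte[page.getInt(offset)];
        ByteBuffer view = page.duplicate();
        view.position(offset + Integer.BYTES);
        view.get(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    /** Reads only the count field of the record at the given offset, without decoding the word. */
    static int readCount(ByteBuffer page, int offset) {
        int length = page.getInt(offset);
        return page.getInt(offset + Integer.BYTES + length);
    }

    public static void main(String[] args) {
        PagePool pool = new PagePool(2);
        ByteBuffer page = pool.request();
        write(page, "flink", 3);
        System.out.println(readWord(page, 0) + " -> " + readCount(page, 0));
        pool.release(page);
    }
}
```

Because a field such as the count can be read straight from the bytes, operations like sorting or hashing can work on the binary form without creating one Java object per record.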
  • 92. 4.4 Custom Memory Manager Flink provides an off-heap option for its memory management component. References: • Peeking into Apache Flink's Engine Room, by Fabian Hueske, March 13, 2015 http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html • Juggling with Bits and Bytes, by Fabian Hueske, May 11, 2015 https://flink.apache.org/news/2015/05/11/Juggling-with-Bits-and-Bytes.html • Memory Management (Batch API), by Stephan Ewen, May 16, 2015 https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=53741525 92
  • 93. 4.4 Custom Memory Manager Compared to Flink, Spark is catching up with its Project Tungsten for memory management and binary processing: manage memory explicitly and eliminate the overhead of the JVM object model and garbage collection. April 28, 2015 https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html It seems that Spark is adopting something similar to Flink, and the initial Tungsten announcement read almost like Flink documentation!! 93
  • 94. 4.5 Built-in Cost-Based Optimizer Apache Flink comes with an optimizer that is independent of the actual programming interface. It chooses a fitting execution strategy depending on the inputs and operations. Example: the "Join" operator will choose between partitioning and broadcasting the data, as well as between running a sort-merge-join or a hybrid hash join algorithm. This helps you focus on your application logic rather than parallel execution. Quick introduction to the Optimizer: section 6 of the paper ‘The Stratosphere platform for big data analytics’ http://stratosphere.eu/assets/papers/2014-VLDBJ_Stratosphere_Overview.pdf 94
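The partition-versus-broadcast choice for a join can be illustrated with a deliberately simplified cost model in plain Java. This is not Flink's optimizer: the class, the enum and the cost formulas are assumptions made for the sketch. It compares only the bytes shipped over the network: broadcasting sends the smaller input to every parallel task, while repartitioning ships both inputs once.

```java
/** Toy cost model (an assumption for illustration, not Flink's optimizer) for shipping join inputs. */
class JoinStrategySketch {

    enum ShipStrategy { BROADCAST_SMALLER_INPUT, REPARTITION_BOTH_INPUTS }

    /** Picks the strategy that ships fewer bytes under the simplified model described above. */
    static ShipStrategy choose(long leftBytes, long rightBytes, int parallelism) {
        // Broadcasting copies the smaller input to each of the parallel join tasks.
        long broadcastCost = Math.min(leftBytes, rightBytes) * parallelism;
        // Hash-repartitioning sends every record of both inputs across the network once.
        long repartitionCost = leftBytes + rightBytes;
        return broadcastCost < repartitionCost
                ? ShipStrategy.BROADCAST_SMALLER_INPUT
                : ShipStrategy.REPARTITION_BOTH_INPUTS;
    }

    public static void main(String[] args) {
        // Hypothetical sizes: a 1 KB lookup table joined with 1 GB of events on 100 tasks.
        System.out.println(choose(1_000L, 1_000_000_000L, 100));
        // Two large inputs of 500 MB and 1 GB on the same 100 tasks.
        System.out.println(choose(500_000_000L, 1_000_000_000L, 100));
    }
}
```

With these hypothetical sizes the tiny lookup table is broadcast and the two large inputs are repartitioned. A real optimizer weighs many more factors, such as existing partitioning and sort order and the choice of local join algorithm.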
  • 95. 4.5 Built-in Cost-Based Optimizer What is Automatic Optimization? The system's built-in optimizer takes care of finding the best way to execute the program in any environment. [Diagram: the same program gets a different execution plan (A, B, C) in each of three situations: run locally on a data sample on the laptop, run on large files on the cluster, run a month later after the data evolved. Optimizer choices shown: Hash vs. Sort, Partition vs. Broadcast, Caching, Reusing partition/sort.] 95
  • 96. 4.5 Built-in Cost-Based Optimizer In contrast to Flink’s built-in automatic optimization, Spark jobs have to be manually optimized and adapted to specific datasets because you need to manually control partitioning and caching if you want to get it right. Spark SQL uses the Catalyst optimizer that supports both rule-based and cost-based optimization. References: • Spark SQL: Relational Data Processing in Spark http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf • Deep Dive into Spark SQL’s Catalyst Optimizer https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html 96
  • 97. 4.6 Little configuration required Flink requires no memory thresholds to configure • Flink manages its own memory Flink requires no complicated network configurations • Pipelining engine requires much less memory for data exchange Flink requires no serializers to be configured • Flink handles its own type extraction and data representation 97
  • 98. 4.7 Little tuning required Flink programs can be adjusted to data automatically • Flink’s optimizer can choose execution strategies automatically According to Mike Olson, Chief Strategy Officer of Cloudera Inc.: “Spark is too knobby — it has too many tuning parameters, and they need constant adjustment as workloads, data volumes, user counts change.” Reference: http://vision.cloudera.com/one-platform/ Tuning Spark Streaming for Throughput, by Gerard Maas from Virdata. December 22, 2014 http://www.virdata.com/tuning-spark/ Spark Tuning: http://spark.apache.org/docs/latest/tuning.html 98
  • 99. 4.8 Flink has better performance Why does Flink provide better performance? • Custom memory manager • Native closed-loop iteration operators make graph and machine learning applications run much faster. • The built-in automatic optimizer, for example through more efficient join processing • Pipelining data to the next operator in Flink is more efficient than in Spark. Reference: • A comparative performance evaluation of Flink, Dongwon Kim, POSTECH. October 12, 2015 http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink 99
  • 100. 5. What are the benchmarking results against Flink? I am maintaining a list of resources related to benchmarks against Flink: http://sparkbigdata.com/102-spark-blog-slim-baltagi/14-results-of-a-benchmark-between-apache-flink-and-apache-spark A couple of resources worth mentioning: • A comparative performance evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward October 13, 2015 http://www.slideshare.net/FlinkForward/dongwon-kim-a-comparative-performance-evaluation-of-flink • Benchmarking Streaming Computation Engines at Yahoo, December 16, 2015 http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at Code at GitHub: https://github.com/yahoo/streaming-benchmarks 100
  • 102. IV. Who is using Apache Flink? You might like what you have seen so far about Apache Flink and still be reluctant to give it a try! You might wonder: is anybody using Flink in a pre-production or production environment? I asked our friend ‘Google’ this question and came up with a short list in the next slide! I also heard more about who is using Flink in production at the Flink Forward conference on October 12-13, 2015 in Berlin, Germany! http://flink-forward.org/ 102
  • 103. IV. Who is using Apache Flink? How companies are using Flink as presented at Flink Forward 2015. Kostas Tzoumas and Stephan Ewen. http://www.slideshare.net/stephanewen1/flink-use-cases-bay-area-meetup-october-2015 Powered by Flink page: https://cwiki.apache.org/confluence/display/FLINK/Powered+by+Flink 103
  • 104. IV. Who is using Apache Flink? 6 Apache Flink Case Studies from the 2015 Flink Forward conference http://sparkbigdata.com/102-spark-blog-slim-baltagi/21-6-apache-flink-case-studies-from-the-2015-flinkforward-conference Mine the Apache Flink User mailing list to discover more! Gradoop: Scalable Graph Analytics with Apache Flink • Gradoop project page http://dbs.uni-leipzig.de/en/research/projects/gradoop • Gradoop: Scalable Graph Analytics with Apache Flink @ FOSDEM 2016. January 31, 2016 http://www.slideshare.net/s1ck/gradoop-scalable-graph-analytics-with-apache-flink-fosdem-2016 104
  • 105. IV. Who is using Apache Flink? PROTEUS http://www.proteus-bigdata.com/ is a European Union funded research project to improve Apache Flink and mainly to develop two libraries (visualization and online machine learning) on top of the Flink core. PROTEUS: Scalable Online Machine Learning by Rubén Casado at Big Data Spain 2015 • Video: https://www.youtube.com/watch?v=EIH7HLyqhfE • Slides: http://www.slideshare.net/Datadopter/proteus-h2020-big-data 105
  • 106. IV. Who is using Apache Flink? Twitter held its hack week and the winner was a Flink-based streaming project! December 18, 2015 • Extending the Yahoo! Streaming Benchmark and Winning Twitter Hack-Week with Apache Flink. Posted on February 2, 2016 by Jamie Grier http://data-artisans.com/extending-the-yahoo-streaming-benchmark/ Yahoo ran benchmarks to compare the performance of their use case implemented on Apache Storm against Spark Streaming and Flink. Results posted on December 18, 2015 http://yahooeng.tumblr.com/post/135321837876/benchmarking-streaming-computation-engines-at 106
  • 108. V. Where to learn more about Apache Flink? 1. What is Flink 2016 roadmap? 2. How to get started quickly with Apache Flink? 3. Where to find more resources about Apache Flink? 4. How to contribute to Apache Flink? 5. What are some Key Takeaways? 108
  • 109. 1. What is Flink 2016 roadmap? SQL/StreamSQL and Table API. CEP Library: Complex Event Processing library for the analysis of complex patterns such as correlations and sequence detection from multiple sources https://github.com/apache/flink/pull/1557 January 28, 2015. Dynamic Scaling: Runtime scaling for DataStream programs. Managed memory for streaming operators. Support for Apache Mesos https://issues.apache.org/jira/browse/FLINK-1984 Security: Over-the-wire encryption of RPC (Akka) and data transfers (Netty). Additional streaming connectors: Cassandra, Kinesis 109
  • 110. 1. What is Flink 2016 roadmap? Expose more runtime metrics: Throughput / Latencies, Backpressure monitoring, Spilling / Out of Core. Making YARN resources dynamic. DataStream API enhancements. DataSet API enhancements. References: • Apache Flink Roadmap Draft, December 2015 https://docs.google.com/document/d/1ExmtVpeVVT3TIhO1JoBpC5JKXm-778DAD7eqw5GANwE/edit • What’s next? Roadmap 2016. Robert Metzger, January 26, 2016. Berlin Apache Flink Meetup. http://www.slideshare.net/robertmetzger1/january-2016-flink-community-update-roadmap-2016/9 110
  • 111. 2. How to get started quickly with Apache Flink? Step-By-Step Introduction to Apache Flink http://www.slideshare.net/sbaltagi/stepbystep-introduction-to-apache-flink Implementing BigPetStore with Apache Flink http://www.slideshare.net/MrtonBalassi/implementing-bigpetstore-with-apache-flink Apache Flink Crash Course http://www.slideshare.net/sbaltagi/apache-flinkcrashcoursebyslimbaltagiandsrinipalthepu Free training from data Artisans http://dataartisans.github.io/flink-training/ All talks at the Flink Forward 2015 http://sparkbigdata.com/102-spark-blog-slim-baltagi/22-all-talks-of-the-2015-flink-forward-conference 111
  • 112. 3. Where to find more resources about Flink? Flink at the Apache Software Foundation: flink.apache.org/ data-artisans.com @ApacheFlink, #ApacheFlink, #Flink apache-flink.meetup.com github.com/apache/flink user@flink.apache.org dev@flink.apache.org Flink Knowledge Base http://sparkbigdata.com/component/tags/tag/27-flink 112
  • 113. 4. How to contribute to Apache Flink? Contributions to the Flink project can be in the form of: • Code • Tests • Documentation • Community participation: discussions, questions, meetups, … How to contribute guide (also contains a list of simple “starter issues”): http://flink.apache.org/how-to-contribute.html 113
  • 114. 5. What are some key takeaways? 1. Although most of the current buzz is about Spark, Flink offers the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine natively supporting many use cases. 2. With the upcoming release of Apache Flink 1.0, I foresee more adoption especially in use cases with Real-Time stream processing and also fast iterative machine learning or graph processing. 3. I foresee Flink embedded in major Hadoop distributions and supported! 4. Apache Spark and Apache Flink will both have their sweet spots despite their “Me Too Syndrome”! 114
  • 116. Thanks! • To all of you for attending! • To Bloomberg for sponsoring this event. • To data Artisans for allowing me to use some of their materials for my slide deck. • To Capital One for giving me time to prepare and give this talk. • Yes, we are hiring for our New York City offices and our other locations! http://jobs.capitalone.com • Drop me a note at sbaltagi@gmail.com if you’re interested. 116