Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Stephan Ewen - Experiences Running Flink at Very Large Scale (Ververica)
This talk shares experiences from deploying and tuning Flink stream processing applications at very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk explains which aspects currently render a job particularly demanding, shows how to configure and tune a large-scale Flink job, and outlines what the Flink community is working on to make the out-of-the-box experience as smooth as possible. We will, for example, dive into analyzing and tuning checkpointing, selecting and configuring state backends, understanding common bottlenecks, and understanding and configuring network parameters.
Introduction to Apache Flink - Fast and Reliable Big Data Processing (Till Rohrmann)
This presentation introduces Apache Flink, a massively parallel data processing engine currently undergoing the incubation process at the Apache Software Foundation. Flink's programming primitives are presented, and it is shown how easily a distributed PageRank algorithm can be implemented with Flink. Intriguing features such as dedicated memory management, Hadoop compatibility, streaming and automatic optimisation make it a unique system in the world of Big Data processing.
Apache Iceberg - A Table Format for Huge Analytic Datasets (Alluxio, Inc.)
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Watch this talk here: https://www.confluent.io/online-talks/apache-kafka-architecture-and-fundamentals-explained-on-demand
This session explains Apache Kafka’s internal design and architecture. Companies like LinkedIn are now sending more than 1 trillion messages per day to Apache Kafka. Learn about the underlying design in Kafka that leads to such high throughput.
This talk provides a comprehensive overview of Kafka architecture and internal functions, including:
-Topics, partitions and segments
-The commit log and streams
-Brokers and broker replication
-Producer basics
-Consumers, consumer groups and offsets
This session is part 2 of 4 in our Fundamentals for Apache Kafka series.
Cluster computing frameworks such as Hadoop or Spark are tremendously beneficial in processing and deriving insights from data. However, long query latencies make these frameworks sub-optimal choices to power interactive applications. Organizations frequently rely on dedicated query layers, such as relational databases and key/value stores, for faster query latencies, but these technologies suffer many drawbacks for analytic use cases. In this session, we discuss using Druid for analytics and why the architecture is well suited to power analytic applications.
User-facing applications are replacing traditional reporting interfaces as the preferred means for organizations to derive value from their datasets. In order to provide an interactive user experience, user interactions with analytic applications must complete on the order of milliseconds. To meet these needs, organizations often struggle with selecting a proper serving layer. Many serving layers are selected because of their general popularity without understanding the possible architectural limitations.
Druid is an analytics data store designed for analytic (OLAP) queries on event data. It draws inspiration from Google’s Dremel, Google’s PowerDrill, and search infrastructure. Many enterprises are switching to Druid for analytics, and we will cover why the technology is a good fit for its intended use cases.
Speaker
Nishant Bangarwa, Software Engineer, Hortonworks
Automate Your Kafka Cluster with Kubernetes Custom Resources (Confluent)
(Sam Obeid, Shopify) Kafka Summit SF 2018
At Shopify we manage multiple Apache Kafka clusters in multiple locations on Google's cloud platform. We deploy our Kafka clusters as Kubernetes StatefulSets, and we use other K8s workloads to implement different tasks. Automating critical and repetitive operational tasks is one of our top priorities.
In this talk we'll discuss how we leveraged Kubernetes Custom Resources and Controllers to automate some of the key cluster operational tasks, detect cluster configuration changes, and react to those changes with the required actions. We will go through actual examples we implemented at Shopify, how we solved the problem of cluster discovery, and how we automated topic creation across different clusters with zero human intervention and safety controls.
Flink Streaming is the real-time data processing framework of Apache Flink. It provides high-level functional APIs in Scala and Java backed by a high-performance true-streaming runtime.
In this session, we discussed the end-to-end working of Apache Airflow, focusing on the "why, what and how". It covers DAG creation and implementation, the architecture, and the pros and cons. It also shows how a DAG is created for scheduling a job and the steps required to build a DAG using a Python script, and it closes with a working demo.
The Parquet Format and Performance Optimization Opportunities (Databricks)
The Parquet format is one of the most widely used columnar storage formats in the Spark ecosystem. Given that I/O is expensive and that the storage layer is the entry point for any query execution, understanding the intricacies of your storage format is important for optimizing your workloads.
As an introduction, we will provide context around the format, covering the basics of structured data formats and the underlying physical data storage model alternatives (row-wise, columnar and hybrid). Given this context, we will dive deeper into specifics of the Parquet format: representation on disk, physical data organization (row-groups, column-chunks and pages) and encoding schemes. Now equipped with sufficient background knowledge, we will discuss several performance optimization opportunities with respect to the format: dictionary encoding, page compression, predicate pushdown (min/max skipping), dictionary filtering and partitioning schemes. We will learn how to combat the evil that is ‘many small files’, and will discuss the open-source Delta Lake format in relation to this and Parquet in general.
This talk serves both as an approachable refresher on columnar storage as well as a guide on how to leverage the Parquet format for speeding up analytical workloads in Spark using tangible tips and tricks.
We want to present multiple anti-patterns utilizing Redis in unconventional ways to get the maximum out of Apache Spark. All examples presented are tried and tested in production at scale at Adobe. The most common integration is spark-redis, which interfaces with Redis as a DataFrame backing store or as an upstream for Structured Streaming. We deviate from the common use cases to explore where Redis can plug gaps while scaling out high-throughput applications in Spark.
Niche 1: Long-Running Spark Batch Job - Dispatch New Jobs by Polling a Redis Queue
-Why? Custom queries on top of a table; we load the data once and query N times
-Why not Structured Streaming
-Working solution using Redis
Niche 2: Distributed Counters
-Problems with Spark Accumulators
-Utilizing Redis Hashes as distributed counters
-Precautions for retries and speculative execution
-Pipelining to improve performance
ORC files were originally introduced in Hive but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive's ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to see only the rows and columns they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will discuss how to format your data and which options to use to maximize your read performance, including when and how to use ORC's schema evolution, bloom filters, and predicate push-down. It will also show how to use the tools to translate ORC files into human-readable formats, such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
Introduction to KSQL: Streaming SQL for Apache Kafka® (Confluent)
Join Tom Green, Solution Engineer at Confluent for this Lunch and Learn talk covering KSQL. Confluent KSQL is the streaming SQL engine that enables real-time data processing against Apache Kafka®. It provides an easy-to-use, yet powerful interactive SQL interface for stream processing on Kafka, without the need to write code in a programming language such as Java or Python. KSQL is scalable, elastic, fault-tolerant, and it supports a wide range of streaming operations, including data filtering, transformations, aggregations, joins, windowing, and sessionization.
By attending one of these sessions, you will learn:
-How to query streams, using SQL, without writing code.
-How KSQL provides automated scalability and out-of-the-box high availability for streaming queries
-How KSQL can be used to join streams of data from different sources
-The differences between Streams and Tables in Apache Kafka
Extending Flink SQL for Stream Processing Use Cases (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real-time applications. Flink SQL provides a SQL interface that implements standard SQL. While standard SQL provides a perfect interface for batch processing, in a stream processing context it can result in ambiguity and complex syntax. As an example, consider these three types of streams: append-only streams, retract streams and upsert streams. Using standard SQL, we would represent all of these streams as Tables, alongside the Table concept from batch processing. Such overloading of concepts can result in ambiguous SQL statements in a streaming context. In this talk, we will present extensions to Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only syntactic sugar, and users can continue to use Flink SQL as-is if they desire.
by
Hojjat Jafarpour
Apache Flink is an open-source platform: a streaming dataflow engine that provides communication, fault tolerance, and data distribution for distributed computations over data streams. Flink is a top-level Apache project. It is a scalable data analytics framework that is fully compatible with Hadoop and can execute both stream processing and batch processing easily.
A Thorough Comparison of Delta Lake, Iceberg and Hudi (Databricks)
Recently, a set of modern table formats such as Delta Lake, Hudi and Iceberg has sprung up. Along with the Hive Metastore, these table formats are trying to solve long-standing problems of the traditional data lake with declared features like ACID, schema evolution, upsert, time travel, and incremental consumption.
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Meetup) (Timo Walther)
Apache Flink is a distributed, stateful stream processor. It features exactly-once state consistency, sophisticated event-time support, high throughput and low latency processing, and APIs at different levels of abstraction (Java, Scala, SQL). In my talk, I'll give an introduction to Apache Flink, its features and discuss the use cases it solves. I'll explain why batch is just a special case of stream processing, how its community evolves Flink into a truly unified stream and batch processor and what this means for its users.
https://www.meetup.com/de-DE/Bangalore-Apache-Kafka-Group/events/265285812/
https://www.youtube.com/watch?v=Ych5bbmDIoA&list=PLvkUPePDi9sa27SG9eGNXH25cfUeo_WY9&index=2
Apache Flink is a popular stream computing framework for real-time stream computing. Many stream compute algorithms require trailing data in order to compute the intended result. One example is computing the number of user logins in the last 7 days. This creates a dilemma where the results of the stream program are incomplete until the runtime of the program exceeds 7 days. The alternative is to bootstrap the program using historic data to seed the state before shifting to use real-time data.
This talk will discuss alternatives to bootstrap programs in Flink. Some alternatives rely on technologies exogenous to the stream program, such as enhancements to the pub/sub layer, that are more generally applicable to other stream compute engines. Other alternatives include enhancements to Flink source implementations. Lyft is exploring another alternative using orchestration of multiple Flink programs. The talk will cover why Lyft pursued this alternative and future directions to further enhance bootstrapping support in Flink.
Speaker
Gregory Fee, Principal Engineer, Lyft
Unified Stateful Big Data Processing in Apache Beam (incubating) (Aljoscha Krettek)
Apache Beam lets you process unbounded, out-of-order, global-scale data with portable high-level pipelines, but not all use cases are pipelines of simple “map” and “combine” operations. Aljoscha Krettek introduces Beam’s new State API, which brings scalability and consistency to fine-grained stateful processing while interoperating with Beam’s other features such as consistent event-time windowing and windowed side inputs—all while remaining portable to any Beam runner, including Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow. Aljoscha covers the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.
Examples of new use cases unlocked by Beam’s new mutable state and timers include:
* Microservice-like streaming applications such as new user account verification and digital ordering
* Complex aggregations that cannot easily be expressed as an efficient associative combiner
* Output based on customized conditions, such as limiting to only “significant” changes in a learned model (resulting in potentially large cost savings in subsequent processing)
* Fine control over retrieval and storage of intermediate values during aggregation
* Reading from and writing to external systems with detailed management of the nature and size of requests
Aljoscha Krettek - Portable Stateful Big Data Processing in Apache Beam (Ververica)
Apache Beam's new State API brings scalability and consistency to fine-grained stateful processing while remaining portable to any Beam runner. Aljoscha Krettek introduces the new state and timer features in Beam and shows how to use them to express common real-world use cases in a backend-agnostic manner.
Apache Big Data 2016: Next Gen Big Data Analytics with Apache Apex (Apache Apex)
Apache Apex is a next-gen big data analytics platform. Originally developed at DataTorrent, it comes with a powerful stream processing engine, a rich set of functional building blocks, and an easy-to-use API for developers building real-time and batch applications. Apex runs natively on YARN and HDFS and is used in production in various industries. You will learn about the Apex architecture, including its unique features for scalability, fault tolerance and processing guarantees, its programming model, and use cases.
http://apachebigdata2016.sched.org/event/6M0L/next-gen-big-data-analytics-with-apache-apex-thomas-weise-datatorrent
Flink Forward San Francisco 2018: Gregory Fee - "Bootstrapping State in Apache Flink" (Flink Forward)
Rise of the Machines: Continuous Delivery at SEEK - YOW! Night Summary Slides (DiUS)
The virtues of continuous delivery are widely understood and accepted by organisations which value fast feedback cycles, reduced risk through incremental delivery of smaller changes and the ability to respond quickly to external factors. Furthermore if microservices are part of your architecture, then the ability to rapidly deploy multiple components of a system become increasingly important.
The foundations of scripting, automation and more recently containers made *nix-based systems the first target for automated deployments and subsequently continuous delivery. With the advent of some new tooling and a bit of courage these principles can now be applied to more heterogeneous environments including those from Redmond.
Using their backgrounds in automating large-scale ruby and java-based deployments, Warner and Matt embarked on a journey with SEEK to increase their agility by enabling continuous delivery – typically multiple times per day. This is their story.
Streaming of data has become the need of the hour. But do we really know how streaming works? What are its benefits? Where and how do we stream data correctly in a big data architecture? How do we process the streamed data efficiently? What challenges do we face when we move from batch processing to stream processing? What are stateful and stateless stream processing? Which one should we opt for, and when? Let us address all these queries!
Intro to Apache Apex - Next Gen Platform for Ingest and Transform (Apache Apex)
Introduction to Apache Apex - The next generation native Hadoop platform. This talk will cover details about how Apache Apex can be used as a powerful and versatile platform for big data processing. Common usage of Apache Apex includes big data ingestion, streaming analytics, ETL, fast batch alerts, real-time actions, threat detection, etc.
Bio:
Pramod Immaneni is an Apache Apex PMC member and senior architect at DataTorrent, where he works on Apache Apex and specializes in big data platforms and applications. Prior to DataTorrent, he was a co-founder and CTO of Leaf Networks LLC, eventually acquired by Netgear Inc., where he built products in the core networking space and was granted patents in peer-to-peer VPNs.
2. Contents
❏ What is Flink?
❏ Why Flink?
❏ Flink ecosystem
❏ Sources and sinks
❏ Flink program structure
❏ DataSet API and DataStream API
❏ Windowing in Flink
❏ Fault tolerance & availability
❏ Use cases
❏ Limitations
3. Big Data
● We define big data as a huge amount of data.
● It comes in structured, semi-structured and unstructured formats.
● This led to the emergence of certain tools & frameworks that help in the storage and processing of data.
● The processing part splits into: 1. Batch processing 2. Stream processing
6. Stream processing challenges
➢ Data is unbounded, meaning there is no start and no end
➢ New data arrives at unpredictable and inconsistent intervals
➢ Data can be out of order, with different timestamps
➢ The latency factor impacts the accuracy of results
7. What is Apache Flink?
A distributed stream processing framework
8. Why Flink?
➢ Scalable
➢ High throughput
➢ Low latency
➢ Flexible windowing
➢ Exactly-once semantics
➢ High availability
➢ Fault tolerant
9. More about Flink
● Flink works on the Kappa architecture.
● The key idea of this architecture is to handle both batch and real-time data through a single stream processing engine.
● Hence, Flink treats all data as a stream and processes the data in real time.
● Batch data is a special case of streaming here.
14. Storage/streaming
● Flink is just a computation engine; it doesn't come with a storage system.
● It can read and write data from different storage systems as well as streaming systems:
➢ Apache Kafka (source/sink)
➢ Apache Cassandra (source/sink)
➢ Elasticsearch (sink)
➢ Hadoop FileSystem (sink)
➢ RabbitMQ (source/sink)
➢ Amazon Kinesis Streams (source/sink)
➢ Twitter Streaming API (source)
➢ Google PubSub (source/sink)
15. Predefined sources/sinks
Sources:
➢ File-based
○ readTextFile(path)
○ readFile(fileInputFormat, path)
➢ Socket-based
○ socketTextStream
➢ Collection-based (java.util.Collection)
○ fromCollection(Collection)
○ Elements must be of the same type
➢ Custom (previous slide)
○ addSource(...)
Sinks:
➢ writeAsText() - text output format
➢ writeAsCsv() - CSV output format
➢ print() - standard output
➢ writeUsingOutputFormat() - custom file output
➢ writeToSocket
➢ Custom (previous slide)
○ addSink(...)
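To make these calls concrete, here is a minimal sketch (Flink 1.x DataStream API, Java) wiring a file-based source and a socket-based source to the print() and writeAsText() sinks listed above; the paths, host and port are placeholder values, not part of the original slides.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceSinkSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // File-based source: read each line of a text file as a String record.
        DataStream<String> fileLines = env.readTextFile("/tmp/input.txt");

        // Socket-based source: read newline-separated records from a TCP socket.
        DataStream<String> socketLines = env.socketTextStream("localhost", 9999);

        // Predefined sinks: standard output and a text file.
        fileLines.print();
        socketLines.writeAsText("/tmp/output");

        env.execute("predefined sources and sinks");
    }
}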
16. Some more information
✘ As you can see, Flink is a layered system. Different layers are built on top of each other, raising the abstraction level of the program.
✘ The runtime layer receives a program in the form of a JobGraph. This is the layer where the JobManager sits.
✘ Both the DataStream API and the DataSet API generate JobGraphs through separate compilation processes.
18. Flink APIs
DataSet API
It is used to perform batch operations on data collected over a period. DataSets are created from sources like local files, transformations like filtering, mapping, aggregating and joining are applied, and the result is written to sinks.
DataStream API
It is used for handling data in a continuous stream. Operations like filtering, mapping, windowing and aggregating can be applied on the data stream, and the result can be written to sinks.
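As a rough illustration of the split, the following sketch (Flink 1.x, Java) builds one tiny job with each API; fromElements stands in for real sources so the example stays self-contained.

import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchVsStreamSketch {
    public static void main(String[] args) throws Exception {
        // DataSet API: bounded data, batch-style execution.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements(1, 2, 3, 4)
                .filter(n -> n % 2 == 0)   // transformation
                .print();                  // sink (triggers batch execution)

        // DataStream API: unbounded (or bounded) data, record-at-a-time.
        StreamExecutionEnvironment streamEnv =
                StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.fromElements(1, 2, 3, 4)
                 .filter(n -> n > 2)       // transformation
                 .print();                 // sink
        streamEnv.execute("datastream example");
    }
}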
23. Windows in Flink
● Windows are the heart of processing infinite streams.
● Windows split the stream into fixed-size buckets over which we can apply computations.
● Windowing is a mechanism to take snapshots of the stream based on time or other variables.
● Windowing is needed when we want some kind of aggregation over the incoming data.
24. More info...
● Most window operations are best applied on a keyed data stream.
● The basic structure of a windowed transformation is as follows.
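That structure is keyBy -> window(assigner) -> reduce/aggregate -> sink. A minimal sketch (Flink 1.x, Java) using a word count over 2-minute tumbling processing-time windows; the socket source, port and window size are illustrative, not from the original slides.

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           // emit one (word, 1) record per incoming word
           .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
               for (String word : line.split("\\s+")) {
                   out.collect(Tuple2.of(word, 1));
               }
           })
           // lambdas lose generic types to erasure, so declare the output type
           .returns(Types.TUPLE(Types.STRING, Types.INT))
           .keyBy(t -> t.f0)                                          // partition by word
           .window(TumblingProcessingTimeWindows.of(Time.minutes(2))) // window assigner
           .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))            // per-window aggregation
           .print();                                                  // sink

        env.execute("windowed word count");
    }
}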
25. Window assigners...
A window assigner defines how elements are assigned to windows.
❏ Tumbling windows
❏ Sliding windows
❏ Session windows
❏ Global windows
(A consolidated code sketch of these assigners follows slide 33.)
26. Time in Flink
➢ Processing time - assigns elements based on the current clock of the machines.
➢ Event time - assigns windows based on the timestamps of the elements.
➢ Ingestion time - assigns wall-clock timestamps as elements arrive in the system, and processes them with event-time semantics based on the attached timestamps.
27. Tumbling windows
✘ Here the window size is specified and elements are assigned to non-overlapping windows.
✘ E.g. if the window size is specified as 2 minutes, all elements arriving within 2 minutes would come into one window for processing.
✘ It is useful for dividing the stream into multiple discrete batches.
29. Sliding windows
➢ Here windows can overlap, unlike tumbling windows.
➢ The overlap is user-specified, and one element can be assigned to multiple windows.
➢ E.g. a 10-minute window with a slide of 5 minutes.
31. Session windows
➢ These are applicable in cases where window boundaries need to be adjusted as per the incoming data.
➢ Here we specify a session gap, or inactivity period, before marking a session closed.
➢ Refer to the next slide for an example.
33. Global windows
➢ Sub-division of elements into windows is not done. Instead, all elements for a key are assigned to one single global window.
➢ There is no end to this window, so triggers should be used if any computation needs to be done.
➢ Flink also has count windows. For example, a tumbling count window of 100 will add 100 elements to the window and will evaluate upon receiving the 100th element.
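Pulling the preceding slides together, here is a sketch (Flink 1.x, Java) of how each window assigner attaches to a keyed stream. Processing-time variants are used so it compiles without watermark setup; the durations and counts are arbitrary placeholders, and the snippet only builds the topology (add a window function, a sink and env.execute() to actually run it).

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.GlobalWindows;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.assigners.SlidingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.triggers.CountTrigger;

public class WindowAssignerSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        KeyedStream<Tuple2<String, Integer>, String> keyed =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2))
                   .keyBy(t -> t.f0);

        // Tumbling: fixed-size, non-overlapping 2-minute buckets.
        keyed.window(TumblingProcessingTimeWindows.of(Time.minutes(2)));

        // Sliding: 10-minute windows started every 5 minutes; they overlap,
        // so one element can land in two windows.
        keyed.window(SlidingProcessingTimeWindows.of(Time.minutes(10), Time.minutes(5)));

        // Session: a window is closed after a 30-second inactivity gap.
        keyed.window(ProcessingTimeSessionWindows.withGap(Time.seconds(30)));

        // Global: one never-ending window per key; an explicit trigger is
        // required for any computation to fire.
        keyed.window(GlobalWindows.create()).trigger(CountTrigger.of(100));

        // Count window: shorthand that evaluates upon receiving the 100th element.
        keyed.countWindow(100);
    }
}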
35. Evictors & triggers
➢ A trigger determines when a window is ready for processing. Except for global windows, all windows come with a default trigger.
➢ We can define a custom trigger which tells when to fire elements for processing.
➢ Evictors are used to evict some elements from the window. After the window is completed, it is purged.
36. Fault tolerance
➢ Before diving in, let's define stateful computing. Most computational tasks require stateful data.
➢ We already came across this in previous slides.
➢ In the word count example, the word count is an output that accumulates new arrivals into the existing count; it is a stateful variable (see the sketch below).
➢ Here, fault tolerance is achieved with checkpointing. Flink maintains state locally, per task.
➢ State is periodically checkpointed to durable storage. A checkpoint is a consistent snapshot of the state of all tasks.
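A sketch of that stateful word count, holding the running count in Flink-managed keyed state (ValueState) so it is captured by checkpoints. The class and state names are illustrative (Flink 1.x, Java).

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class CountWords
        extends RichFlatMapFunction<String, Tuple2<String, Integer>> {

    private transient ValueState<Integer> count; // one value per key (word)

    @Override
    public void open(Configuration parameters) {
        count = getRuntimeContext().getState(
                new ValueStateDescriptor<>("wordCount", Integer.class));
    }

    @Override
    public void flatMap(String word, Collector<Tuple2<String, Integer>> out)
            throws Exception {
        Integer current = count.value();          // null on first access per key
        int updated = (current == null ? 0 : current) + 1;
        count.update(updated);                    // accumulate into the existing count
        out.collect(Tuple2.of(word, updated));
    }
}
// Usage on a keyed stream: words.keyBy(w -> w).flatMap(new CountWords())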
37. More on checkpointing...
➢ Flink backs up state at a certain interval.
➢ In case of a failure, Flink recovers all tasks to the state of the last checkpoint and restarts the tasks from that checkpoint.
➢ The application continues as if the failure never happened.
➢ Flink guarantees exactly-once semantics, which means all data is processed exactly once regardless of any failures.
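Enabling this is a one-liner on the environment; a minimal sketch (Flink 1.x, Java) with an illustrative 10-second interval:

import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a consistent snapshot of all task state every 10 seconds.
        env.enableCheckpointing(10_000L);

        // Exactly-once is the default mode; stated explicitly here for clarity.
        env.getCheckpointConfig()
           .setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
    }
}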
38. Stopping and resuming jobs...
➢ Running jobs must be stopped before upgrading a component and resumed after the upgrade.
➢ There are two modes for resuming jobs.
➢ Savepoint - unlike a checkpoint, which is periodically triggered by the system, a savepoint is triggered by running commands.
➢ Even the storage format differs from that of checkpoints.
➢ Externalized checkpoint - an extension of the existing internal checkpoint. After the internal checkpoint is stored, it is additionally written to a specified directory.
39. Savepoints vs checkpoints...
➢ The primary purpose of checkpoints is to provide a recovery mechanism in case of unexpected job failures. A checkpoint's lifecycle is managed by Flink, and checkpoints are dropped after the job is terminated by the user. They are lightweight.
➢ Savepoints are created, owned and deleted by the user. Their use case is planned, manual backup and resume. They are a bit more expensive. They are mostly used for Flink version upgrades, changing the job graph, and changing parallelism.
40. More on checkpoints...
➢ We already know state is persisted via checkpoints to avoid data loss and to recover consistently. How the state is stored internally, and how it is persisted upon checkpoints, depends on the state backend.
➢ Three state backends are available:
○ MemoryStateBackend
○ FsStateBackend
○ RocksDBStateBackend
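A sketch of choosing among the three backends (Flink 1.x API, Java); the HDFS checkpoint URI is a placeholder.

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StateBackendSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Default: MemoryStateBackend - working state on the TaskManager heap,
        // checkpoint snapshots held in JobManager memory; small state only.

        // FsStateBackend: working state on the TaskManager heap, checkpoints
        // persisted to a durable filesystem such as HDFS or S3.
        env.setStateBackend(new FsStateBackend("hdfs://namenode:8020/flink/checkpoints"));

        // RocksDBStateBackend (flink-statebackend-rocksdb dependency): working
        // state in embedded RocksDB on disk, for state larger than memory:
        // env.setStateBackend(new RocksDBStateBackend("hdfs://namenode:8020/flink/checkpoints"));
    }
}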
41. High availability for a standalone cluster
➢ The JobManager coordinates the Flink deployment and is responsible for scheduling jobs and resource management.
➢ However, there is only one JobManager per Flink cluster, creating a single point of failure.
➢ With high availability there is a single leading JobManager at any time and multiple standby JobManagers ready to take over leadership in case the leader fails.
➢ Flink uses ZooKeeper for distributed coordination between all running instances of the JobManager and for leader election.
43. Limitations
➢ Maturity and community support in the industry are lower than for its predecessors, but growing fast.
➢ Flink's batch side has seen little adoption; Flink is popular mainly for streaming.
44. Use cases
❏ Event-driven applications
❏ Data analytics applications
❏ Data pipeline applications
48. Examples
Event-driven applications
➢ Fraud detection
➢ Anomaly detection
➢ Rule-based alerting
➢ Business process monitoring
Data analytics applications
➢ Quality monitoring of telecom networks
➢ Analysis of live data
➢ Analysis of product updates
➢ Large-scale graph analysis
Data pipeline applications
➢ Continuous ETL in e-commerce
➢ Real-time search index building in e-commerce
49. Use cases cont'd...
Bouygues Telecom
For real-time event processing and analytics over billions of messages per day, and real-time customer insights.
King - Candy Crush creator
Real-time analytics on 30 billion events generated every day across 200 countries.
Zalando - e-commerce
Uses Flink for real-time process monitoring, stream processing, and business process monitoring.
Otto - largest online retailer
For user-agent identification and identifying search sessions via stream processing.
Alibaba Group
For online recommendations of products based on user activity. They developed Blink on top of Flink for this.
Capital One - commercial banking
Monitors customer data in real time to detect customer issues, with event correlation and anomaly identification.