Flink provides concise summaries of key points:
1) After submitting a Flink job, the client creates and submits the job graph to the JobManager, which then creates an execution graph and deploys tasks across TaskManagers for parallel execution.
2) The batch optimizer chooses optimal execution plans by evaluating physical execution strategies like join algorithms and data shipping approaches to minimize data shuffling and network usage.
3) Flink iterations are optimized by having the runtime directly handle caching, state maintenance, and pushing work out of loops to avoid scheduling overhead between iterations. Delta iterations further improve efficiency by only updating changed elements in each iteration.
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
An in-depth look at Apache Flink’s Streaming Dataflow Engine. Flink executes data streaming programs directly as streams with low latency and flexible user-defined state and models batch programs as streaming programs on finite data streams.
The slides cover the general design of the runtime and show how the engine is able to support diverse features and workloads without compromising on performance or usability.
Flink Forward, Berlin
October 13, 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Till Rohrmann
Talk which I gave at the first Apache Flink Meetup in Paris on the 29th of October.
It gives an introduction into Apache Flink's streaming and batch API. Furthermore, it is explained how Flink jobs are deployed. Flink's checkpointing mechanism is presented which gives exactly-once processing guarantees.
Apache Flink Internals: Stream & Batch Processing in One System – Apache Flin...ucelebi
An in-depth look at Apache Flink’s Streaming Dataflow Engine. Flink executes data streaming programs directly as streams with low latency and flexible user-defined state and models batch programs as streaming programs on finite data streams.
The slides cover the general design of the runtime and show how the engine is able to support diverse features and workloads without compromising on performance or usability.
Flink Forward, Berlin
October 13, 2015
Streaming Data Flow with Apache Flink @ Paris Flink Meetup 2015Till Rohrmann
Talk which I gave at the first Apache Flink Meetup in Paris on the 29th of October.
It gives an introduction into Apache Flink's streaming and batch API. Furthermore, it is explained how Flink jobs are deployed. Flink's checkpointing mechanism is presented which gives exactly-once processing guarantees.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
Apache Flink is a general-purpose platform for batch and streaming distributed data processing. This talk describes how Flink’s powerful APIs, iterative operators and other unique features make it a competitive alternative for large-scale graph processing as well. We take a close look at how one can elegantly express graph analysis tasks, using common Flink operators and how different graph processing models, like vertex-centric, can be easily mapped to Flink dataflows. Next, we get a sneak preview into Flink's upcoming Graph API, Gelly, which further simplifies graph application development in Flink. Finally, we show how to perform end-to-end data analysis, mixing common Flink operators and Gelly, without having to build complex pipelines and combine different systems. We go through a step-by-step example, demonstrating how to perform loading, transformation, filtering, graph creation and analysis, with a single Flink program.
During the past few years R has become an important language for data analysis, data representation and visualization. R is a very expressive language which combines functional and dynamic aspects, with laziness and object oriented programming. However the default R implementation is neither fast nor distributed, both features crucial for "big data" processing.
Here, FastR-Flink compiler is presented, a compiler based on Oracle's R implementation FastR with support for some operations of Apache Flink, a Java/Scala framework for distributed data processing. The Apache Flink constructs such as map, reduce or filter are integrated at the compiler level to allow the execution of distributed stream and batch data processing applications directly from the R programming language.
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
SQL is undoubtedly the most widely used language for data analytics. It is declarative and can be optimized and efficiently executed by most query processors. Therefore the community has made effort to add relational APIs to Apache Flink, a standard SQL API and a language-integrated Table API. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. Since Flink supports both stream and batch processing and many use cases require both kinds of processing, we aim for a unified relational layer. In this talk we will look at the current API capabilities, find out what's under the hood of Flink’s relational APIs, and give an outlook for future features such as dynamic tables, Flink's way how streams are converted into tables and vice versa leveraging the stream-table duality.
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
Large-scale graph processing with Apache Flink @GraphDevroom FOSDEM'15Vasia Kalavri
Apache Flink is a general-purpose platform for batch and streaming distributed data processing. This talk describes how Flink’s powerful APIs, iterative operators and other unique features make it a competitive alternative for large-scale graph processing as well. We take a close look at how one can elegantly express graph analysis tasks, using common Flink operators and how different graph processing models, like vertex-centric, can be easily mapped to Flink dataflows. Next, we get a sneak preview into Flink's upcoming Graph API, Gelly, which further simplifies graph application development in Flink. Finally, we show how to perform end-to-end data analysis, mixing common Flink operators and Gelly, without having to build complex pipelines and combine different systems. We go through a step-by-step example, demonstrating how to perform loading, transformation, filtering, graph creation and analysis, with a single Flink program.
During the past few years R has become an important language for data analysis, data representation and visualization. R is a very expressive language which combines functional and dynamic aspects, with laziness and object oriented programming. However the default R implementation is neither fast nor distributed, both features crucial for "big data" processing.
Here, FastR-Flink compiler is presented, a compiler based on Oracle's R implementation FastR with support for some operations of Apache Flink, a Java/Scala framework for distributed data processing. The Apache Flink constructs such as map, reduce or filter are integrated at the compiler level to allow the execution of distributed stream and batch data processing applications directly from the R programming language.
Flink Forward SF 2017: Timo Walther - Table & SQL API – unified APIs for bat...Flink Forward
SQL is undoubtedly the most widely used language for data analytics. It is declarative and can be optimized and efficiently executed by most query processors. Therefore the community has made effort to add relational APIs to Apache Flink, a standard SQL API and a language-integrated Table API. Both APIs are semantically compatible and share the same optimization and execution path based on Apache Calcite. Since Flink supports both stream and batch processing and many use cases require both kinds of processing, we aim for a unified relational layer. In this talk we will look at the current API capabilities, find out what's under the hood of Flink’s relational APIs, and give an outlook for future features such as dynamic tables, Flink's way how streams are converted into tables and vice versa leveraging the stream-table duality.
Machine Learning with Apache Flink at Stockholm Machine Learning GroupTill Rohrmann
This presentation presents Apache Flink's approach to scalable machine learning: Composable machine learning pipelines, consisting of transformers and learners, and distributed linear algebra.
The presentation was held at the Machine Learning Stockholm group on the 23rd of March 2015.
Flink vs. Spark: this is the slide deck of my talk at the 2015 Flink Forward conference in Berlin, Germany, on October 12, 2015. In this talk, we tried to compare Apache Flink vs. Apache Spark with focus on real-time stream processing. Your feedback and comments are much appreciated.
This presentation held in at Inovex GmbH in Munich in November 2015 was about a general introduction of the streaming space, an overview of Flink and use cases of production users as presented at Flink Forward.
My talk from SICS Data Science Day, describing FlinkML, the Machine Learning library for Apache Flink.
I talk about our approach to large-scale machine learning and how we utilize state-of-the-art algorithms to ensure FlinkML is a truly scalable library.
You can watch a video of the talk here: https://youtu.be/k29qoCm4c_k
Graphs as Streams: Rethinking Graph Processing in the Streaming EraVasia Kalavri
Streaming is the latest hot topic in the big data world. We want to process data immediately and continuously. Modern stream processors have matured significantly and offer exceptional features, including sub-second latencies, high throughput, fault-tolerance, and seamless integration with various data sources and sinks.
Many sources of streaming data consist of related or connected events: user interactions in a social network, web page clicks, movie ratings, product purchases. These connected events can be naturally represented as edges in an evolving graph.
In this talk I will explain how we can leverage a powerful stream processor, such as Apache Flink, and academic research of the past two decades, to build graph streaming applications. I will describe how we can model graphs as streams and how we can compute graph properties without storing and managing the graph state. I will introduce useful graph summary data structures and show how they allow us to build graph algorithms in the streaming model, such as connected components, bipartiteness detection, and distance estimation.
Kapacitor - Real Time Data Processing EnginePrashant Vats
Kapacitor is a native data processing engine.Kapacitor is a native data processing engine.It can process both stream and batch data from InfluxDB.It lets you plug in your own custom logic or user-defined functions to process alerts with dynamic thresholds. Key Kapacitor Capabilities
-Alerting
-ETL (Extraction, Transformation and Loading)
-Action Oriented
-Streaming Analytics
-Anomaly Detection
Kapacitor uses a DSL (Domain Specific Language) called TICKscript to define tasks.
This webinar explains why PISA chips are inevitable, provides overview of machine architecture of such switches, presents a brief primer on the P4 language with sample programs for a variety of networks and demonstrates a powerful network diagnostics application implemented in P4.
Programmability in SDNs is confined to the network control plane. The forwarding plane is still largely dictated by fixed-function switching chips. Our goal is to change that, and to allow programmers to define how packets are to be processed all the way down to the wire.
This is made possible by a new generation of high-performance forwarding chips. At the high-end, PISA (Protocol-Independent Switch Architecture) chips promise multi-Tb/s of packet processing. At the mid- and low-end of the performance spectrum, CPUs, GPUs, FPGAs, and NPUs already offer great flexibility with performance of a few tens to hundreds of Gb/s.
In addition to programmable forwarding chips, we also need a high-level language to dictate the forwarding behavior in a target independent fashion. "P4" (www.p4.org) is such a language. In P4, the programer declares how packets are to be processed, and a compiler generates a configuration for a PISA chip, or a programmable target in general. For example, the programmer might program the switch to be a top-of-rack switch, a firewall, or a load-balancer; and might add features to run automatic diagnostics and novel congestion control algorithms.
Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark WuFlink Forward
During last two major versions (1.9 & 1.10), Apache Flink community spent lots of effort to improve the architecture for further unified batch & streaming processing. One example for that is Flink SQL added the ability to support multiple SQL planners under the same API. This talk will first discuss the motivation behind these movements, but more importantly will have a deep dive into Flink SQL. The presentation shows the unified architecture to handle streaming and batch queries and explain how Flink translates queries into the relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. Besides, this talk will also describe the lifetime of a query in detail, how optimizer improve the plan based on relational node patterns, how Flink leverages binary data format for its basic data structure, and how does certain operator works. This would give audience better understanding of Flink SQL internals.
Apache Flink: Better, Faster & Uncut - Piotr Nowojski, data ArtisansEvention
This talk will start with brief introduction to streaming processing and Flink itself. Next, we will take a look at some of the most interesting recent improvements in Flink such as incremental checkpointing,
end-to-end exactly-once processing guarantee and network latency optimizations. We’ll discuss real problems that Flink’s users were facing and how they were addressed by the community and dataArtisans.
Apache Spark 2.0: A Deep Dive Into Structured Streaming - by Tathagata Das Databricks
“In Spark 2.0, we have extended DataFrames and Datasets to handle real time streaming data. This not only provides a single programming abstraction for batch and streaming data, it also brings support for event-time based processing, out-or-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “Continuous Applications”.” - T.D.
Databricks Blog: "Structured Streaming In Apache Spark 2.0: A new high-level API for streaming"
https://databricks.com/blog/2016/07/28/structured-streaming-in-apache-spark.html
// About the Presenter //
Tathagata Das is an Apache Spark Committer and a member of the PMC. He’s the lead developer behind Spark Streaming, and is currently employed at Databricks. Before Databricks, you could find him at the AMPLab of UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.
Follow T.D. on -
Twitter: https://twitter.com/tathadas
LinkedIn: https://www.linkedin.com/in/tathadas
Title
Hands-on Learning with KubeFlow + Keras/TensorFlow 2.0 + TF Extended (TFX) + Kubernetes + PyTorch + XGBoost + Airflow + MLflow + Spark + Jupyter + TPU
Video
https://youtu.be/vaB4IM6ySD0
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, and Airflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most-widely used pipeline orchestration framework in machine learning.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Setup ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Reproduce Model Training with TFX Metadata Store and Pachyderm
12. Deploy the Model to Production with TensorFlow Serving and Istio
13. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
Related Links
1. PipelineAI Home: https://pipeline.ai
2. PipelineAI Community Edition: http://community.pipeline.ai
3. PipelineAI GitHub: https://github.com/PipelineAI/pipeline
4. Advanced Spark and TensorFlow Meetup (SF-based, Global Reach): https://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup
5. YouTube Videos: https://youtube.pipeline.ai
6. SlideShare Presentations: https://slideshare.pipeline.ai
7. Slack Support: https://joinslack.pipeline.ai
8. Web Support and Knowledge Base: https://support.pipeline.ai
9. Email Support: support@pipeline.ai
Cooperative Task Execution for Apache SparkDatabricks
Apache Spark has enabled a vast assortment of users to express batch, streaming, and machine learning computations, using a mixture of programming paradigms and interfaces. Lately, we observe that different jobs are often implemented as part of the same application to share application logic, state, or to interact with each other. Examples include online machine learning, real-time data transformation and serving, low-latency event monitoring and reporting. Although the recent addition of Structured Streaming to Spark provides the programming interface to enable such unified applications over bounded and unbounded data, the underlying execution engine was not designed to efficiently support jobs with different requirements (i.e., latency vs. throughput) as part of the same runtime. It therefore becomes particularly challenging to schedule such jobs to efficiently utilize the cluster resources while respecting their requirements in terms of task response times. Scheduling policies such as FAIR could alleviate the problem by prioritizing critical tasks, but the challenge remains, as there is no way to guarantee no queuing delays. Even though preemption by task killing could minimize queuing, it would also require task resubmission and loss of progress, leading to wasted cluster resources. In this talk, we present Neptune, a new cooperative task execution model for Spark with fine-grained control over resources such as CPU time. Neptune utilizes Scala coroutines as a lightweight mechanism to suspend task execution with sub-millisecond latency and introduces new scheduling policies that respect diverse task requirements while efficiently sharing the same runtime. Users can directly use Neptune for their continuous applications as it supports all existing DataFrame, DataSet, and RDD operators. We present an implementation of the execution model as part of Spark 2.4.0 and describe the observed performance benefits from running a number of streaming and machine learning workloads on an Azure cluster.
Speaker: Konstantinos Karanasos
Video: https://youtu.be/T0L0JxDaPkc
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
Description
In this workshop, we build real-world machine learning pipelines using TensorFlow Extended (TFX), KubeFlow, Airflow, and MLflow.
Described in the 2017 paper, TFX is used internally by thousands of Google data scientists and engineers across every major product line within Google.
KubeFlow is a modern, end-to-end pipeline orchestration framework that embraces the latest AI best practices including hyper-parameter tuning, distributed model training, and model tracking.
Airflow is the most-widely used pipeline orchestration framework in machine learning and data engineering.
MLflow is a lightweight experiment-tracking system recently open-sourced by Databricks, the creators of Apache Spark. MLflow supports Python, Java/Scala, and R - and offers native support for TensorFlow, Keras, and Scikit-Learn.
Pre-requisites
Modern browser - and that's it!
Every attendee will receive a cloud instance
Nothing will be installed on your local laptop
Everything can be downloaded at the end of the workshop
Location
Online Workshop
The link will be sent a few hours before the start of the workshop.
Only registered users will receive the link.
If you do not receive the link a few hours before the start of the workshop, please send your Eventbrite registration confirmation to support@pipeline.ai for help.
Agenda
1. Create a Kubernetes cluster
2. Install KubeFlow, Airflow, TFX, and Jupyter
3. Setup ML Training Pipelines with KubeFlow and Airflow
4. Transform Data with TFX Transform
5. Validate Training Data with TFX Data Validation
6. Train Models with Jupyter, Keras/TensorFlow 2.0, PyTorch, XGBoost, and KubeFlow
7. Run a Notebook Directly on Kubernetes Cluster with KubeFlow
8. Analyze Models using TFX Model Analysis and Jupyter
9. Perform Hyper-Parameter Tuning with KubeFlow
10. Select the Best Model using KubeFlow Experiment Tracking
11. Run Multiple Experiments with MLflow Experiment Tracking
12. Reproduce Model Training with TFX Metadata Store
13. Deploy the Model to Production with TensorFlow Serving and Istio
14. Save and Download your Workspace
Key Takeaways
Attendees will gain experience training, analyzing, and serving real-world Keras/TensorFlow 2.0 models in production using model frameworks and open-source tools.
RSVP Here: https://www.eventbrite.com/e/full-day-workshop-kubeflow-kerastensorflow-20-tf-extended-tfx-kubernetes-pytorch-xgboost-airflow-tickets-63362929227
https://youtu.be/T0L0JxDaPkc
Near real-time anomaly detection at Lyftmarkgrover
Near real-time anomaly detection at Lyft, by Mark Grover and Thomas Weise at Strata NY 2018.
https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/69155
Data Analytics and Simulation in Parallel with MATLAB*Intel® Software
This talk covers the current parallel capabilities in MATLAB*. Learn about its parallel language and distributed and tall arrays. Interact with GPUs both on the desktop and in the cluster. Combine this information into an interesting algorithmic framework for data analysis and simulation.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Jeffrey Haguewood
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Ramesh Iyer
In today's fast-changing business world, Companies that adapt and embrace new ideas often need help to keep up with the competition. However, fostering a culture of innovation takes much work. It takes vision, leadership and willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis slides at 30.5.2024 DASA Connect conference. We discuss about what is testing, then what is agile testing and finally what is Testing in DevOps. Finally we had lovely workshop with the participants trying to find out different ways to think about quality and testing in different parts of the DevOps infinity loop.
UiPath Test Automation using UiPath Test Suite series, part 3DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
The Art of the Pitch: WordPress Relationships and SalesLaura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if sometime changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Epistemic Interaction - tuning interfaces to provide information for AI supportAlan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
1. Apache Flink
Deep Dive
Vasia Kalavri
Flink Committer & KTH PhD student
vasia@apache.org
1st Apache Flink Meetup Stockholm
May 11, 2015
2. Flink Internals
● Job Life-Cycle
○ what happens after you submit a Flink job?
● The Batch Optimizer
○ how are execution plans chosen?
● Delta Iterations
○ how are Flink iterations special for Graph and ML
apps?
2
4. The Flink Stack
Python
Gelly
Table
FlinkML
SAMOA
Batch Optimizer
DataSet (Java/Scala) DataStream (Java/Scala)Hadoop
M/R
Flink Runtime
Local Remote Yarn Tez Embedded
Dataflow
*current Flink master + few PRs
Streaming Optimizer
4
5. DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
.flatMap((str, out) -> {
for (String token : value.split("W")) {
out.collect(new Tuple2(token, 1));
})
.groupBy(0).aggregate(SUM, 1);
1
3
2
Program Life-Cycle
4
5
6. Task
Manager
Job
Manager
Task
Manager
Flink Client &
Optimizer
DataSet<String> text = env.readTextFile(input);
DataSet<Tuple2<String, Integer>> result = text
.flatMap((str, out) -> {
for (String token : value.split("W")) {
out.collect(new Tuple2(token, 1));
})
.groupBy(0).aggregate(SUM, 1);
O Romeo,
Romeo,
wherefore art
thou Romeo?
O, 1
Romeo, 3
wherefore, 1
art, 1
thou, 1
6
Nor arm, nor
face, nor any
other part
nor, 3
arm, 1
face, 1,
any, 1,
other, 1
part, 1
creates and submits
the job graph
creates the execution
graph and deploys tasks
execute tasks and send
status updates
7. Input First SecondX Y
Operator X Operator Y
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<String> input = env.readTextFile(input);
DataSet<String> first = input.filter (str -> str.contains(“Apache Flink“));
DataSet<String> second = first.filter (str -> str.length() > 40);
second.print()
env.execute();
Series of Transformations
7
8. DataSet Abstraction
Think of it as a collection of data elements that can be
produced/recovered in several ways:
… like a Java collection
… like an RDD
… perhaps it is never fully materialized (because the program does not
need it to)
… implicitly updated in an iteration
→ this is transparent to the user
8
10. Romeo,
Romeo,
where art
thou Romeo?
Load Log
Search
for str1
Search
for str2
Search
for str3
Grep 1
Grep 2
Grep 3
Stage 1:
Create/cache Log
Subsequent stages:
Grep log for matches
Caching in-memory
and disk if needed
Staged (batch) execution
10
11. Romeo,
Romeo,
where art
thou Romeo?
Load Log
Search
for str1
Search
for str2
Search
for str3
Grep 1
Grep 2
Grep 3
001100110011001100110011
Stage 1:
Deploy and start operators
Data transfer in-memory
and disk if needed
Note: Log
DataSet is
never
“created”!
Pipelined execution
11
18. ● Evaluates physical execution strategies
○ e.g. hash-join vs. sort-merge join
● Chooses data shipping strategies
○ e.g. broadcast vs. partition
● Reuses partitioning and sort orders
● Decides to cache loop-invariant data in
iterations
Optimization Examples
18
19. case class PageVisit(url: String, ip: String, userId: Long)
case class User(id: Long, name: String, email: String, country: String)
// get your data from somewhere
val visits: DataSet[PageVisit] = ...
val users: DataSet[User] = ...
// filter the users data set
val germanUsers = users.filter((u) => u.country.equals("de"))
// join data sets
val germanVisits: DataSet[(PageVisit, User)] =
// equi-join condition (PageVisit.userId = User.id)
visits.join(germanUsers).where("userId").equalTo("id")
Example: Distributed Joins
The join operator needs to
create all the pairs of
elements from the two
inputs, for which the join
condition evaluates to true
19
20. Example: Distributed Joins
● Ship Strategy: The input data is distributed across all
parallel instances that participate in the join
● Local Strategy: Each parallel instance performs a join
algorithm on its local partition
For both steps, there are multiple valid strategies which are
favorable in different situations.
20
21. Repartition-Repartition Strategy
Partitions both inputs
using the same
partitioning function.
All elements that share
the same join key are
shipped to the same
parallel instance and can
be locally joined.
21
22. Broadcast-Forward Strategy
Sends one complete data
set to each parallel
instance that holds a
partition of the other data.
The other Dataset
remains local and is not
shipped at all.
22
23. The optimizer will compute cost estimates for execution
plans and will pick the “cheapest” plan:
● amount of data shipped over the the network
● if the data of one input is already partitioned
R-R Cost: Full shuffle of both data sets over the network
B-F Cost: Depends on the size of the dataset that is
broadcasted and the number of parallel instances
Read more: http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html
How does the Optimizer choose?
23
25. ● for/while loop in client submits one job per
iteration step
● Data reuse by caching in memory and/or disk
Step Step Step Step Step
Client
Iterate by unrolling
25
26. Native Iterations
● the runtime is aware of the iterative execution
● no scheduling overhead between iterations
● caching and state maintenance are handled automatically
Caching Loop-invariant DataPushing work
“out of the loop”
Maintain state as index
26
27. Flink Iteration Operators
Iterate IterateDelta
Input
Iterative
Update Function
Result
Replace
Workset
Iterative
Update Function
Result
Solution Set
State
27
28. Delta Iteration
● Not all the elements of the state are updated
in each iteration.
● The elements that require an update, are
stored in the workset.
● The step function is applied only to the
workset elements.
28
29. Partition a graph into components by iteratively
propagating the min vertex ID among neighbors
Example: Connected Components
29
33. Read the documentation and our blog posts!
● Memory Management
● Serialization and Type Extraction
● Streaming Optimizations
● Fault-Tolerance
Want to learn more?
33
34. Apache Flink
Deep Dive
Vasia Kalavri
Flink Committer & KTH PhD student
vasia@apache.org
1st Apache Flink Meetup Stockholm
May 11, 2015