Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin


This talk shows how we can use Apache Flink and Apache Zeppelin to do interactive data analysis. The examples show the usage of FlinkML to solve a linear regression and classification problem.

Till Rohrmann
Flink PMC member
trohrmann@apache.org
@stsffap
Interactive Data Analysis
with Apache Flink

Exploratory Data Analysis
§  Visualize data
§  Calculate main characteristics
§  Understand data and find possibly new hypotheses

Read-Evaluate-Print Loop
§  New Scala shell offers REPL
§  Interactive queries
§  Lets you explore data quickly

Problems
§  No visualization
§  No saving or replaying of written code
§  No assistance → Bad IDE

Notebooks
§  Web-based interactive computation environment
§  Combines rich text, execution code, plots and rich media
§  Storytelling

Apache Zeppelin
§  Web-based REPL with pluggable interpreters
§  Since 2014 in the Apache Incubator
§  Supported interpreters:
•  Flink
•  Spark
•  Python
•  Markdown
•  Many more …

Word Count with Zeppelin
§  Find the 10 most frequent words with more than 4 letters in the King James Version of the Bible.
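
The word count task above can be sketched outside of Flink as follows. This is a minimal plain-Python illustration of the computation only, not the Flink or Zeppelin code shown in the talk; the function name `top_words` and the short sample text are assumptions made up for this example.

```python
import re
from collections import Counter


def top_words(text, n=10, min_letters=5):
    """Return the n most frequent words that have at least min_letters letters.

    "More than 4 letters" is expressed here as min_letters=5.
    """
    # Tokenize: lowercase the text and keep runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    # Filter by length, then count occurrences.
    counts = Counter(w for w in words if len(w) >= min_letters)
    # most_common sorts by descending count.
    return counts.most_common(n)


# Hypothetical stand-in for the full text, which would normally be read from a file.
sample = (
    "In the beginning God created the heaven and the earth. "
    "And the earth was without form, and void"
)
print(top_words(sample, n=3))
```

In a notebook the same three steps (tokenize, filter by length, count and sort) would run against the full text rather than a short local string.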

Linear regression
§  Let’s predict the influence of advertisement spending on sales
§  Input data set: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv
§  Features:
•  TV advertisement money
•  Radio advertisement money
•  Newspaper advertisement money
§  Response:
•  Sales
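
To make the regression setup concrete, here is a minimal plain-Python sketch of fitting sales against the three spending features with ordinary least squares. It is not FlinkML and not the code from the talk: the function names are hypothetical, the solver (normal equations with Gauss-Jordan elimination) is chosen only because it fits in a few lines, and the five data rows are synthetic values invented for illustration, not rows from Advertising.csv.

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w_1, ..., w_k]. Assumes X has full column rank.
    """
    # Prepend a constant 1.0 to every row so the first weight is the intercept.
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(x[i] * x[j] for x in X) for j in range(k)]
        + [sum(x[i] * y for x, y in zip(X, targets))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(k):
            if r != col:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]


def predict(weights, row):
    """Apply the fitted model to one (tv, radio, newspaper) row."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], row))


# Synthetic (tv, radio, newspaper) spending, generated from
# sales = 3 + 0.05 * tv + 0.2 * radio + 0.0 * newspaper.
features = [(0, 0, 0), (100, 0, 0), (0, 10, 0), (0, 0, 10), (100, 10, 10)]
sales = [3.0, 8.0, 5.0, 3.0, 10.0]

weights = fit_linear_regression(features, sales)
print(weights)
print(predict(weights, (50, 5, 5)))
```

A near-zero weight for one feature, as for newspaper in this synthetic data, is the kind of "influence" reading the slide refers to.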

Classification
§  Let’s build a classifier for insult detection
§  Kaggle challenge: https://www.kaggle.com/c/detecting-insults-in-social-commentary
§  Label: 1 – Insult, 0 – No insult
§  Feature: Comment text
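
The label and feature layout above can be illustrated with a tiny text classifier. The slides do not say which model was used, so this sketch swaps in a bag-of-words perceptron purely for illustration; the function names and the four training comments are invented and do not come from the Kaggle data set.

```python
from collections import defaultdict


def tokenize(comment):
    """Lowercase the comment and split on whitespace (bag-of-words feature)."""
    return comment.lower().split()


def score(model, comment):
    weights, bias = model
    return bias + sum(weights[w] for w in tokenize(comment))


def predict(model, comment):
    """Return 1 (insult) if the score is positive, else 0 (no insult)."""
    return 1 if score(model, comment) > 0 else 0


def train_perceptron(examples, epochs=10):
    """Train a perceptron: on each mistake, shift word weights toward the true label."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for comment, label in examples:
            if predict((weights, bias), comment) != label:
                step = 1.0 if label == 1 else -1.0
                for w in tokenize(comment):
                    weights[w] += step
                bias += step
    return weights, bias


# Invented toy data: (comment text, label) with 1 = insult, 0 = no insult.
training = [
    ("you are an idiot", 1),
    ("what a stupid fool", 1),
    ("thanks for the great article", 0),
    ("i agree with this point", 0),
]

model = train_perceptron(training)
print(predict(model, "you stupid idiot"))
print(predict(model, "great point thanks"))
```

With real comments the vocabulary is far larger and noisier, so a sketch like this only shows the shape of the pipeline: text in, word features, a 0/1 label out.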

Conclusion
§  Interactive data analysis is really easy with Apache Flink
§  Apache Zeppelin is a great interactive notebook
§  Zeppelin and Flink play well together to solve machine learning tasks and more





More from Till Rohrmann

Future of Apache Flink Deployments: Containers, Kubernetes and More - Flink F...

Till Rohrmann

Container technology experiences an ever increasing adoption throughout many industries. Not only does this technology make your applications portable across different machines and operating systems, it also allows to scale applications in a matter of seconds. Moreover, it significantly simplifies and speeds up deployments which decreases development and operation costs. Consequently, more and more Flink deployments run in containerized environments which poses new challenges for Flink. In this talk, we will take a look at Flink's current and future container support which will make it a first class citizen of the container world. First of all, we will explain how the new reactive execution mode will solve the problem of seamless application scaling and how it blends in with any environment. Complementary to the reactive mode, the active execution mode demonstrates its strengths when it comes to changing workloads such as batch jobs. Last but not least, we will take a look beyond Flink's own nose and investigate how Flink can be used together with Kubernetes operators or data Artisans' Application Manager. We will conclude the talk with a short demo of Flink's native Kubernetes support and giving an outlook on future developments in the container realm.

Apache flink 1.7 and Beyond

Till Rohrmann

The streaming space is evolving at an ever increasing pace. This trend is also reflected in Apache Flink whose latest major release included again many new features. For streaming practitioners it is essential to learn about Flink's newest capabilities because often they enable completely new use cases and applications. In this talk, I want to give a brief overview about Apache Flink and its latest feature additions, including the integration of CEP with streaming SQL, proper support for state evolution, temporal joins and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture about Flink's current and future capabilities.

Elastic Streams at Scale @ Flink Forward 2018 Berlin

Till Rohrmann

One of the big operational challenges when running streaming applications is to cope with varying workloads. Variations, e.g. daily cycles, seasonal spikes or sudden events, require that allocated resources are constantly adapted. Otherwise, service quality deteriorates or money is wasted. Apache Flink 1.5 includes a lot of enhancements to support full resource elasticity on cluster management frameworks such as Apache Mesos. With the latest version, it is now possible to build elastic applications which can be programmatically scaled up or down in order to react to changing workloads. In this talk, we will discuss recent improvements to Flink's deployment model which also enables full resource elasticity. In particular, we will discuss how Flink leverages cluster management frameworks, e.g. Mesos, and already-introduced features like scalable state to support elastic streaming applications. We will conclude the presentation with a short demo showing how a stateful Flink application can be rescaled on top of Mesos.

Scaling stream data pipelines with Pravega and Apache Flink

Till Rohrmann

Extracting insights out of continuously generated data requires a stream processor with powerful data analytics features such as Apache Flink. A stream data pipeline with Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such stream data pipelines is coping with the variations in the workload. Daily cycles and seasonal spikes might require the provisioning of the application to adapt accordingly. Pravega has a feature called stream scaling, which enables the capacity offered for the ingestion of events of a stream to grow and shrink over time according to workload. Such a feature is useful when the application downstream has the ability of accommodating such changes and also scale its provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.

Modern Stream Processing With Apache Flink @ GOTO Berlin 2017

Till Rohrmann

In our fast-moving world it becomes more and more important for companies to gain near real-time insights from their data to make faster decisions. These insights not only provide a competitive edge over one's rivals but also enable a company to create completely new services and products. Amongst others, predictive user interfaces and online recommendations can be implemented when being able to process large amounts of data in real-time. Apache Flink, one of the most advanced open source distributed stream processing platforms, allows you to extract business intelligence from your data in near real-time. With Apache Flink it is possible to process billions of messages with millisecond latency. Moreover, its expressive APIs allow you to quickly solve your problems, ranging from classical analytical workloads to distributed event-driven applications. In this talk, I will introduce Apache Flink and explain how it enables users to develop distributed applications and process analytical workloads alike. Starting with Flink’s basic concepts of fault-tolerance, statefulness and event-time aware processing, we will take a look at the different APIs and what they allow us to do. The talk will be concluded by demonstrating how we can use Flink’s higher level abstractions such as FlinkCEP and StreamSQL to do declarative stream processing.

Apache Flink Meets Apache Mesos And DC/OS @ Mesos Meetup Berlin

Till Rohrmann

Apache Mesos allows operators to run distributed applications across an entire datacenter and is attracting ever increasing interest. As much as distributed applications see increased use enabled by Mesos, Mesos also sees increasing use due to a growing ecosystem of well integrated applications. One of the latest additions to the Mesos family is Apache Flink. Flink is one of the most popular open source systems for real-time high scale data processing and allows users to deal with low-latency streaming analytical workloads on Mesos. In this talk we explain the challenges solved while integrating Flink with Mesos, including how Flink’s distributed architecture can be modeled as a Mesos framework, and how Flink was integrated with Fenzo. Next, we describe how Flink was packaged to easily run on DC/OS.

Apache Flink® Meets Apache Mesos® and DC/OS

Till Rohrmann

From Apache Flink® 1.3 to 1.4

Till Rohrmann

With Flink 1.3 being released, the Flink community is already working towards the upcoming release 1.4. Given Flink's high development pace, which manifested in Flink 1.3 being one of the feature-wise biggest releases in its recent history, it becomes more and more difficult to keep track of all development threads. Moreover, it requires more effort to learn about newly added features and which value they provide for your application. In this talk, I want to present and explain some of Flink's latest features, including incremental checkpointing, fine grained recovery, side outputs and many more. Furthermore, I want to put them in perspective with respect to Flink's future direction by giving some insights into ongoing development threads in the community. Thereby, I intend to give attendees a better picture about Flink's current and future capabilities.

Apache Flink and More @ MesosCon Asia 2017

Till Rohrmann

Apache Mesos allows operators to run distributed applications across an entire datacenter and is attracting ever increasing interest. As much as distributed applications see increased use enabled by Mesos, Mesos also sees increasing use due to a growing ecosystem of well-integrated applications. One of the latest additions to the Mesos family is Apache Flink. Flink is one of the most popular open source systems for real-time high scale data processing and allows users to deal with low-latency streaming analytical workloads on Mesos. In this talk, we explain the challenges solved while integrating Flink with Mesos, including how Flink’s distributed architecture can be modeled as a Mesos framework, and how Flink was integrated with Fenzo. Next, we describe how Flink was packaged to easily run on DC/OS.

Redesigning Apache Flink's Distributed Architecture @ Flink Forward 2017

Till Rohrmann

As stream processing engines become more and more popular and are used in different environments, the demand to support different deployment scenarios increases. Depending on the user's infrastructure, a stream processor might be run on a bare metal cluster in standalone mode, deployed via Apache Yarn and Mesos, or run in a containerized environment. In order to fulfill the requirements of different deployment options and to provide enough flexibility for the future, the Flink community has recently started to redesign Flink's distributed architecture. This talk will explain the limitations of the old architecture and how they are solved with the new design. We will present the new building blocks of a Flink cluster and demonstrate, using the example of Flink's Mesos and Docker support, how they can be combined to run Flink nearly everywhere.

Gilbert: Declarative Sparse Linear Algebra on Massively Parallel Dataflow Sys...

Till Rohrmann

In recent years, the generated and collected data is increasing at an almost exponential rate. At the same time, the data’s value has been identified in terms of insights that can be provided. However, retrieving the value requires powerful analysis tools, since valuable insights are buried deep in large amounts of noise. Unfortunately, analytic capacities did not scale well with the growing data. Many existing tools run only on a single computer and are limited in terms of data size by its memory. A very promising solution to deal with large-scale data is scaling systems and exploiting parallelism. In this presentation, we propose Gilbert, a distributed sparse linear algebra system, to decrease the imminent lack of analytic capacities. Gilbert offers a MATLAB-like programming language for linear algebra programs, which are automatically executed in parallel. Transparent parallelization is achieved by compiling the linear algebra operations first into an intermediate representation. This language-independent form enables high-level algebraic optimizations. Different optimization strategies are evaluated and the best one is chosen by a cost-based optimizer. The optimized result is then transformed into a suitable format for parallel execution. Gilbert generates execution plans for Apache Spark and Apache Flink, two massively parallel dataflow systems. Distributed matrices are represented by square blocks to guarantee a well-balanced trade-off between data parallelism and data granularity. An exhaustive evaluation indicates that Gilbert is able to process varying amounts of data exceeding the memory of a single computer on clusters of different sizes. Two well known machine learning (ML) algorithms, namely PageRank and Gaussian non-negative matrix factorization (GNMF), are implemented with Gilbert. The performance of these algorithms is compared to optimized implementations based on Spark and Flink. 
Even though Gilbert is not as fast as the optimized algorithms, it simplifies the development process significantly due to its high-level programming abstraction.

Interactive Data Analysis with Apache Flink @ Flink Meetup in Berlin

1. Till Rohrmann Flink PMC member trohrmann@apache.org @stsffap Interactive Data Analysis with Apache Flink

2. Data Analysis

3. Exploratory Data Analysis • Visualize data • Calculate main characteristics • Understand data and find possibly new hypotheses

4. Data Analysts

5. Read-Evaluate-Print Loop • New Scala shell offers REPL • Interactive queries • Lets you explore data quickly

6. Scala Shell

7. Simple Scala Shell Example

8. Problems • No visualization • No saving or replaying of written code • No assistance → Bad IDE

9. Notebooks • Web-based interactive computation environment • Combines rich text, execution code, plots and rich media • Storytelling

10. Apache Zeppelin • Web-based REPL with pluggable interpreters • Since 2014 in the Apache Incubator • Supported interpreters: Flink, Spark, Python, Markdown, and many more

11. Word Count with Zeppelin • Find the 10 most frequent words with more than 4 letters in the King James version of the Bible.
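
The word count task can be sketched in a few lines. The following is a plain Python illustration of the logic only, not the Flink or Zeppelin code shown in the talk; the function name `top_long_words`, its parameters, and the sample text are assumptions made up for this example.

```python
import re
from collections import Counter


def top_long_words(text, min_letters=5, top_n=10):
    """Return the top_n most frequent words with at least min_letters letters.

    "More than 4 letters" corresponds to min_letters=5.
    """
    # Lowercase the text and split it into runs of ASCII letters.
    words = re.findall(r"[a-z]+", text.lower())
    # Keep only the long words and count their occurrences.
    counts = Counter(word for word in words if len(word) >= min_letters)
    # most_common sorts by descending count.
    return counts.most_common(top_n)


# Made-up sample text; the talk uses the full King James Bible instead.
sample = "alpha alpha alphabet beta gamma gamma gamma"
print(top_long_words(sample, top_n=2))
```

In a distributed setting the same steps map onto a tokenize, filter, group-and-count, and sort pipeline; the sketch above only mirrors that structure on a single machine.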

16. Linear regression • Let’s predict the influence of advertisement spending on sales • Input data set: http://www-bcf.usc.edu/~gareth/ISL/Advertising.csv • Features: TV advertisement money, Radio advertisement money, Newspaper advertisement money • Response: Sales
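
To make the regression setup concrete, here is a minimal plain Python sketch of fitting a linear model with three features and one response by batch gradient descent. It is not the FlinkML code from the talk: the function names, the step size, the iteration count, and the tiny synthetic data set (generated from sales = 2·TV + 3·radio + 0·newspaper + 1) are all assumptions chosen for illustration, and no values from Advertising.csv are used.

```python
def fit_linear_regression(rows, targets, step_size=0.1, iterations=5000):
    """Fit y ~ w.x + b by batch gradient descent on the squared error."""
    n = len(rows)
    n_features = len(rows[0])
    weights = [0.0] * n_features
    intercept = 0.0
    for _ in range(iterations):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, targets):
            # Residual of the current model on this sample.
            error = sum(w * xi for w, xi in zip(weights, x)) + intercept - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        # Step against the averaged gradient.
        weights = [w - step_size * g / n for w, g in zip(weights, grad_w)]
        intercept -= step_size * grad_b / n
    return weights, intercept


def predict(weights, intercept, x):
    """Predict the response for one feature vector."""
    return sum(w * xi for w, xi in zip(weights, x)) + intercept


# Synthetic (TV, radio, newspaper) spending and the resulting sales,
# generated from sales = 2*TV + 3*radio + 0*newspaper + 1.
rows = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (0, 0, 1),
        (1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]
targets = [1, 3, 4, 1, 6, 3, 4, 6]

weights, intercept = fit_linear_regression(rows, targets)
print(weights, intercept)
```

A weight near zero for a feature, as for newspaper spending in this synthetic data, is how such a model would signal that the feature has little influence on the response.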

26. Classification • Let’s build a classifier for insult detection • Kaggle challenge: https://www.kaggle.com/c/detecting-insults-in-social-commentary • Label: 1 – Insult, 0 – No insult • Feature: Comment text

29. Conclusion • Interactive data analysis is really easy with Apache Flink • Apache Zeppelin is a great interactive notebook • Zeppelin and Flink play well together to solve machine learning tasks and more

31. flink.apache.org @ApacheFlink
