Data Intensive Applications with Apache Flink

Sam Putnam [Deep Learning]

http://flink-forward.org/kb_sessions/streaming-ml-with-flink/

As continuous big data processing gains popularity, there is a growing need to move much of the distributed machine learning functionality to a streaming backend. The most common use case is to give streaming predictions based on a model learnt in batch; however, in some cases it is also beneficial to update the model on the fly. It is not uncommon for streaming learners to need different algorithms than their batch counterparts. The talk discusses the common use cases and the pitfalls of the streaming ML transition through the example of recommender systems. It also offers a dive into the implementation of a Scala library augmenting FlinkML with streaming predictors.

Machine Learning Pipelines

jeykottalam

This document discusses machine learning pipelines and introduces Evan Sparks' presentation on building image classification pipelines. It provides an overview of feature extraction techniques used in computer vision like normalization, patch extraction, convolution, rectification and pooling. These techniques are used to transform images into feature vectors that can be input to linear classifiers. The document encourages building simple, intermediate and advanced image classification pipelines using these techniques to qualitatively and quantitatively compare their effectiveness.

Machine learning model to production

Georg Heiler

This document discusses moving machine learning models from prototype to production. It outlines some common problems with the current workflow where moving to production often requires redevelopment from scratch. Some proposed solutions include using notebooks as APIs and developing analytics that are accessed via an API. It also discusses different data science platforms and architectures for building end-to-end machine learning systems, focusing on flexibility, security, testing and scalability for production environments. The document recommends a custom backend integrated with Spark via APIs as the best approach for the current project.

Deploying Enterprise Deep Learning Masterclass Preview - Enterprise Deep Lea...

This document summarizes Sam Putnam's presentation on deploying enterprise deep learning. It discusses how deep learning was used to analyze housing price data and predict prices. A neural network model was built using TensorFlow that improved on previous linear regression and gradient boosted decision tree models. The presentation provides an overview of deep learning concepts like neural networks, activation functions, and model architectures for different data types. It emphasizes real-world considerations for developing and deploying deep learning models in a production setting.

From Pipelines to Refineries: scaling big data applications with Tim Hunter

Big data tools are challenging to combine into a larger application: ironically, big data applications themselves do not tend to scale very well. These issues of integration and data management are only magnified by increasingly large volumes of data. Apache Spark provides strong building blocks for batch processes, streams and ad-hoc interactive analysis. However, users face challenges when putting together a single coherent pipeline that could involve hundreds of transformation steps, especially when confronted by the need for rapid iteration. This talk explores these issues through the lens of functional programming. It presents an experimental framework that provides full-pipeline guarantees by introducing more laziness to Apache Spark. This framework allows transformations to be seamlessly composed and alleviates common issues, thanks to whole program checks, auto-caching, and aggressive computation parallelization and reuse.

AutoML Toolkit – Deep Dive

The AutoML Toolkit provides tools to simplify machine learning tasks. It features techniques for feature engineering like feature interaction that combines features to gain additional predictive power. It also addresses class imbalance issues through techniques like K-Sampling, a distributed version of SMOTE oversampling that generates synthetic samples for the minority class. The toolkit uses genetic algorithms to automatically tune machine learning models for optimal performance. An upcoming roadmap includes additional tools for stacked ensembles, improved genetic search algorithms, statistical analysis of features, and visualizations.
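The SMOTE idea mentioned above can be illustrated with a minimal single-machine sketch. It is not the toolkit's K-Sampling implementation (which is distributed); the function name `smote_like_samples` and its parameters are hypothetical, chosen only for this illustration. Each synthetic point is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours.

```python
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(p, base)))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


# Toy minority class with two features per sample (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like_samples(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority points, it always stays inside the region the minority class already occupies.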

Building Large Scale Machine Learning Applications with Pipelines-(Evan Spark...

KeystoneML is a software framework for building scalable machine learning pipelines. It provides tools for data loading, feature extraction, model training, and evaluation that work across multiple domains like computer vision, NLP, and speech. Pipelines built with KeystoneML can achieve state-of-the-art results on large datasets using modest computing resources. The framework is open source and available on GitHub.

Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models

Apache Spark has rapidly become a key tool for data scientists to explore, understand and transform massive datasets and to build and train advanced machine learning models. The question then becomes: how do I deploy these models to a production environment? How do I embed what I have learned into customer-facing data applications? In this webinar, we will discuss best practices from Databricks on how our customers productionize machine learning models, do a deep dive with actual customer case studies, and show live tutorials of a few example architectures and code in Python, Scala, Java and SQL.

Automated Hyperparameter Tuning, Scaling and Tracking

Automated Machine Learning (AutoML) has received significant interest recently. We believe that the right automation would bring significant value and dramatically shorten time-to-value for data science teams. Databricks is automating the Data Science and Machine Learning process through a combination of product offerings, partnerships, and custom solutions. This talk will focus on how Databricks can help automate hyperparameter tuning. For both traditional Machine Learning and modern Deep Learning, tuning hyperparameters can dramatically increase model performance and improve training times. However, tuning can be a complex and expensive process. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, and Bayesian optimization). We will then discuss open source tools that implement each of these techniques, helping to automate the search over hyperparameters. Finally, we will discuss and demo improvements we built for these tools in Databricks, including integration with MLflow:
- Apache PySpark MLlib integration with MLflow for automatically tracking tuning
- Hyperopt integration with Apache Spark to distribute tuning and with MLflow for automatic tracking

Recording and notebooks will be provided after the webinar so that you can practice at your own pace.

Presenters

Joseph Bradley, Software Engineer, Databricks
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.

Yifan Cao, Senior Product Manager, Databricks
Yifan Cao is a Senior Product Manager at Databricks. His product area spans ML/DL algorithms and Databricks Runtime for Machine Learning. Prior to Databricks, Yifan worked on two Machine Learning products, applying NLP to find metadata and applying machine learning to predict equipment failures. He helped build the products from the ground up to multiple millions of dollars in ARR. Yifan started his career as a researcher in quantum computing. He received his B.S. from UC Berkeley and his master's degree from MIT.

Experimental Design for Distributed Machine Learning with Myles Baker

This document discusses experimental design for distributed machine learning models. It outlines common problems in machine learning modeling like selecting the best algorithm and evaluating a model's expected generalization error. It describes steps in a machine learning study like collecting data, building models, and designing experiments. The goal of experimentation is to understand how model factors affect outcomes and obtain statistically significant conclusions. Techniques discussed for analyzing distributed model outputs include precision-recall curves, confusion matrices, and hypothesis testing methods like the chi-squared test and McNemar's test. The document emphasizes that experimental design for distributed learning poses new challenges around data characteristics, computational complexity, and reproducing results across models.
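As a concrete illustration of one of the hypothesis tests named above, the following minimal sketch computes the McNemar statistic (with the usual continuity correction) for two classifiers evaluated on the same test set. The function name and the toy labels are assumptions made for this example and do not come from the talk.

```python
def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar statistic for two classifiers
    scored on the same examples (hypothetical helper)."""
    only_a_right = 0  # A correct, B wrong
    only_b_right = 0  # A wrong, B correct
    for truth, a, b in zip(y_true, pred_a, pred_b):
        if a == truth and b != truth:
            only_a_right += 1
        elif a != truth and b == truth:
            only_b_right += 1
    discordant = only_a_right + only_b_right
    if discordant == 0:
        return 0.0  # the two models never disagree on correctness
    return (abs(only_a_right - only_b_right) - 1) ** 2 / discordant


# Toy data: 20 examples, all with true label 1 (made-up values).
y_true = [1] * 20
pred_a = [1] * 12 + [0] * 8
pred_b = [0] * 10 + [1] * 10
stat = mcnemar_statistic(y_true, pred_a, pred_b)
```

Under the null hypothesis that both models have the same error rate, the statistic approximately follows a chi-squared distribution with one degree of freedom, so a value above 3.841 would reject the null at the 5% level. Only the examples on which the two models disagree contribute to the test.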

Data Science Salon: A Journey of Deploying a Data Science Engine to Production

Formulatedby

Presented by Mostafa Madjipour, Senior Data Scientist at Time Inc.
Next DSS NYC Event 👉 https://datascience.salon/newyork/
Next DSS LA Event 👉 https://datascience.salon/la/

Reducing the gap between R&D and production is still a challenge for data science and machine learning engineering groups in many companies. Typically, data scientists develop the data-driven models in a research-oriented programming environment (such as R and Python). Next, the data/machine learning engineers rewrite the code (typically in another programming language) in a way that is easy to integrate with production services. This process has some disadvantages: 1) it is time-consuming; 2) it slows the impact of the data science team on the business; 3) code rewriting is prone to errors. A possible solution to overcome these disadvantages would be to implement a deployment strategy that easily embeds or transforms the model created by data scientists. Packages such as jPMML, MLeap, PFA, and PMML, among others, have been developed for this purpose. In this talk we review some of the mentioned packages, motivated by a project at Time Inc. The project involves development of a near real-time recommender system, which includes a predictor engine paired with a set of business rules.

Scalable Automatic Machine Learning in H2O

Abstract: In recent years, the demand for machine learning experts has outpaced the supply, despite the surge of people entering the field. To address this gap, there have been big strides in the development of user-friendly machine learning software that can be used by non-experts. Although H2O and other tools have made it easier for practitioners to train and deploy machine learning models at scale, there is still a fair bit of knowledge and background in data science that is required to produce high-performing machine learning models. Deep neural networks, in particular, are notoriously difficult for a non-expert to tune properly. In this presentation, we provide an overview of the field of "Automatic Machine Learning" and introduce the new AutoML functionality in H2O. H2O's AutoML provides an easy-to-use interface which automates the process of training a large, comprehensive selection of candidate models and a stacked ensemble model which, in most cases, will be the top performing model in the AutoML Leaderboard. H2O AutoML is available in all the H2O interfaces including the h2o R package, Python module and the Flow web GUI. We will also provide simple code examples to get you started using AutoML.

Erin’s Bio: Erin is a Statistician and Machine Learning Scientist at H2O.ai. She is the main author of H2O Ensemble. Before joining H2O, she was the Principal Data Scientist at Wise.io and Marvin Mobile Security (acquired by Veracode in 2012) and the founder of DataScientific, Inc. Erin received her Ph.D. in Biostatistics with a Designated Emphasis in Computational Science and Engineering from University of California, Berkeley. Her research focuses on ensemble machine learning, learning from imbalanced binary-outcome data, influence curve based variance estimation and statistical computing. She also holds a B.S. and M.A. in Mathematics.

Balancing Automation and Explanation in Machine Learning

For a machine learning application to be successful, it is not enough to give highly accurate predictions: customers also want to know why the model has made that prediction, so they can compare it against their intuition and (hopefully) gain trust in the model. However, there is a trade-off between model accuracy and explainability: for example, the more complex your feature transformations become, the harder it is to explain what the resulting features mean to the end customer. With the right system design, though, this doesn't have to be a binary choice between these two goals. It is possible to combine complex, even automatic, feature engineering with highly accurate models and explanations. We will describe how we are using lineage tracing to solve this issue at Salesforce Einstein, allowing good model explanations to coexist with automatic feature engineering and model selection. By building this into TransmogrifAI, an open source AutoML library that extends Spark MLlib, it is easy to ensure a consistent level of transparency in all of our ML applications. As model explanations are provided out of the box, data scientists don't need to re-invent the wheel when model explanations need to be surfaced.

MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...

MLeap is a machine learning platform that enables data scientists and engineers to collaborate using a single environment. It allows machine learning models trained using Spark to be deployed to production APIs without dependencies on Spark. MLeap addresses the common problems of data scientists and engineers having to re-write data pipelines and model code for production. It provides core machine learning components, a runtime for executing models without Spark, and tools for converting Spark models to the MLeap format. A demo is shown training and deploying models to an API in under 5 minutes.

Machine Learning With Spark

Shivaji Dutta

This document provides an overview of machine learning concepts and techniques using Apache Spark. It begins with introducing machine learning and describing supervised and unsupervised learning. Then it discusses Spark and how it can be used for large-scale machine learning tasks through its MLlib library and GraphX API. Several examples of machine learning applications are presented, such as classification, regression, clustering, and graph analytics. The document concludes with demonstrating machine learning algorithms in Spark.

Using H2O AutoML for Kaggle Competitions

The document discusses H2O AutoML, a tool for automated machine learning. It begins by showing some top Kagglers who have used AutoML to achieve good results with less effort. It then provides an overview of what AutoML automates in the model building process like preprocessing, training, tuning, stacking ensembles. AutoML is suitable for novice users who want automation as well as experts who want to save time on routine tasks. The document explains the interface and shows the grid search, stacking and cross-validation process done behind the scenes to build accurate models with less work.

Strata parallel m-ml-ops_sept_2017

Nisha Talagala

Machine Learning in Production

The era of big data generation is upon us. Devices ranging from sensors to robots and sophisticated applications are generating increasing amounts of rich data (time series, text, images, sound, video, etc.). For such data to benefit a business’s bottom line, insights must be extracted, a process that increasingly requires machine learning (ML) and deep learning (DL) approaches deployed in production use cases. Production ML is complicated by several challenges, including the need for two very distinct skill sets (operations and data science) to collaborate, the inherent complexity and uniqueness of ML itself when compared to other apps, and the varied array of analytic engines that need to be combined for a practical deployment, often across physically distributed infrastructure. Nisha Talagala shares solutions and techniques for effectively managing machine learning and deep learning in production with popular analytic engines such as Apache Spark, TensorFlow, and Apache Flink.

Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...

MLconf

Recommendations for Building Machine Learning Software: Building a real system that uses machine learning can be difficult, both in terms of the algorithmic and the engineering challenges involved. In this talk, I will focus on the engineering side and discuss some of the practical lessons we’ve learned from years of developing the machine learning systems that power Netflix. I will go over what it takes to get machine learning working in a real-life feedback loop with our users and how that imposes different requirements and a different focus than doing machine learning only within a lab environment. This involves lessons around challenges such as where to place algorithmic components, how to handle distribution and parallelism, what kinds of modularity are useful, how to support both production and experimentation, and how to test machine learning systems.

Spark Summit EU talk by Reza Karimi

The document discusses mentorship modeling based on authorship graphs from Scopus data. It describes building features from co-authorship and correspondence graphs using Spark, validating predictions via crowdsourcing, and visualizing mentorship subgraphs. Key points include normalizing authorship data, aggregating node and edge features, applying pairwise mentorship models, obtaining training data via email campaigns, and using D3.js to interactively display mentorship subgraphs in Spark applications.

Advanced Hyperparameter Optimization for Deep Learning with MLflow

Building on the "Best Practices for Hyperparameter Tuning with MLflow" talk, we will present advanced topics in HPO for deep learning, including early stopping, multi-metric optimization, and robust optimization. We will then discuss implementations using open source tools. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze the performance of our models.

Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark

Interested in learning how Showtime is leveraging the power of Spark to transform a traditional premium cable network into a data-savvy analytical competitor? The growth in our over-the-top (OTT) streaming subscription business has led to an abundance of user-level data not previously available. To capitalize on this opportunity, we have been building and evolving our unified platform which allows data scientists and business analysts to tap into this rich behavioral data to support our business goals. We will share how our small team of data scientists is creating meaningful features which capture the nuanced relationships between users and content; productionizing machine learning models; and leveraging MLflow to optimize the runtime of our pipelines, track the accuracy of our models, and log the quality of our data over time. From data wrangling and exploration to machine learning and automation, we are augmenting our data supply chain by constantly rolling out new capabilities and analytical products to help the organization better understand our subscribers, our content, and our path forward to a data-driven future. Authors: Josh McNutt, Keria Bermudez-Hernandez

Machine Learning In Production

Samir Bessalah

This document discusses challenges in running machine learning applications in production environments. It notes that while Kaggle competitions focus on accuracy, real-world applications require balancing accuracy with interpretability, speed and infrastructure constraints. It also emphasizes that machine learning in production is as much a software and systems problem as a modeling problem. Key aspects that are discussed include flexible and scalable deployment architectures, model versioning, packaging and serving, online evaluation and experiments, and ensuring reproducibility of results.

Spark Summit EU talk by Oscar Castaneda

The document discusses transforming Excel spreadsheets into Spark DataFrames by automatically translating Excel formulas into Spark code. It presents a program transformation pipeline that takes Excel formulas, parses them using a grammar and parser to generate a parse tree, and then generates Spark code from the parse tree. Key aspects covered include using an existing grammar and parser called XLParser to parse Excel formulas, treating Excel as a domain-specific language, and generating code by writing a pretty printer for the target Spark language. The talk concludes with a demonstration of the code generation approach.

SparkApplicationDevMadeEasy_Spark_Summit_2015

Lance Co Ting Keh

The document discusses Spark application development and common problems that can occur. It introduces Unravel Data, a startup that aims to help developers visualize Spark job data, optimize performance through automated analysis and diagnoses, and strategize to prevent problems and meet goals. Key points include discussing common issues like failures, wrong results, poor performance, and resource problems; the difficulty of debugging using logs alone; and demonstrating Unravel's platform to address these challenges.

Splice Machine's use of Apache Spark and MLflow

Splice Machine is an ANSI-SQL Relational Database Management System (RDBMS) on Apache Spark. It has proven low-latency transactional processing (OLTP) as well as analytical processing (OLAP) at petabyte scale. It uses Spark for all analytical computations and leverages HBase for persistence. This talk highlights a new Native Spark Datasource, which enables seamless data movement between Spark DataFrames and Splice Machine tables without serialization and deserialization. This Spark Datasource makes machine learning libraries such as MLlib native to the Splice RDBMS. Splice Machine has now integrated MLflow into its data platform, creating a flexible Data Science Workbench with an RDBMS at its core. The transactional capabilities of Splice Machine, integrated with the plethora of DataFrame-compatible libraries and MLflow capabilities, manage a complete, real-time workflow of data-to-insights-to-action. In this presentation we will demonstrate Splice Machine's Data Science Workbench and how it leverages Spark and MLflow to create powerful, full-cycle machine learning capabilities on an integrated platform, from transactional updates to data wrangling, experimentation, and deployment, and back again.

Apache Spark's MLlib's Past Trajectory and new Directions

Anass Bensrhir - Senior Data Scientist

- MLlib has rapidly developed over the past 5 years, growing from a few initial algorithms to over 50 algorithms and featurizers today.
- It has shifted focus from just adding algorithms to improving existing algorithms and infrastructure like DataFrame integration.
- This allows for scalable machine learning workflows on big data from small laptop datasets to large clusters, with seamless integration between SQL, DataFrames, streaming, and other Spark components.
- Going forward, areas of focus include continued improvements to scalability, enhancing core algorithms, extending APIs to support custom algorithms, and building out a standard library of machine learning components.

Deploying Machine Learning Models to Production

Machine learning techniques are powerful, but building and deploying such models for production use requires a lot of care and expertise. Many books, articles, and best practices have been written and discussed on machine learning techniques and feature engineering, but putting those techniques to use in a production environment is usually forgotten and underestimated. The aim of this talk is to shed some light on current machine learning deployment practices and to go into detail on how to deploy sustainable machine learning pipelines.

Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP

Ververica

Pattern matching over event streams is increasingly being employed in many areas including financial services and click stream analysis. Flink, as a true stream processing engine, emerges as a natural candidate for these use cases. In this talk, we will present FlinkCEP, a library for Complex Event Processing (CEP) based on Flink. At the conceptual level, we will see the different patterns the library can support, we will present the main building blocks we implemented to support them, and we will discuss possible future additions that will further enhance the coverage of the library. At the practical level, we will show how the integration of FlinkCEP with Flink allows the former to take advantage of Flink's rich ecosystem (e.g. connectors) and its stream processing capabilities, such as support for event-time processing, exactly-once state semantics, fault-tolerance, savepoints and high throughput.

Machine Learning with Apache Flink at Stockholm Machine Learning Group

Till Rohrmann

What's hot

Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models

Anyscale

Automated Hyperparameter Tuning, Scaling and Tracking

Experimental Design for Distributed Machine Learning with Myles Baker

Data Science Salon: A Journey of Deploying a Data Science Engine to Production

Formulatedby

Scalable Automatic Machine Learning in H2O

Balancing Automation and Explanation in Machine Learning

MLLeap, or How to Productionize Data Science Workflows Using Spark by Mikha...

Machine Learning With Spark

Shivaji Dutta

Using H2O AutoML for Kaggle Competitions

Strata parallel m-ml-ops_sept_2017

Nisha Talagala

Justin Basilico, Research/ Engineering Manager at Netflix at MLconf SF - 11/1...

MLconf

Spark Summit EU talk by Reza Karimi

Advanced Hyperparameter Optimization for Deep Learning with MLflow

Data-Driven Transformation: Leveraging Big Data at Showtime with Apache Spark

Machine Learning In Production

Samir Bessalah

Spark Summit EU talk by Oscar Castaneda

SparkApplicationDevMadeEasy_Spark_Summit_2015

Lance Co Ting Keh

Splice Machine's use of Apache Spark and MLflow

Apache Spark's MLlib's Past Trajectory and new Directions

Anass Bensrhir - Senior Data Scientist

Deploying Machine Learning Models to Production

Viewers also liked

Kostas Kloudas - Complex Event Processing with Flink: the state of FlinkCEP

Ververica

Machine Learning with Apache Flink at Stockholm Machine Learning Group

Till Rohrmann

Flink Forward Berlin 2017: Kostas Kloudas - Complex Event Processing with Fli...

Flink Forward SF 2017: Eron Wright - Introducing Flink Tensorflow

This session will introduce a new open-source project, Flink TensorFlow, that enables Flink programs to operate on data using TensorFlow machine learning models. Applications include real-time image processing, NLP, and anomaly detection. The session will:
- Introduce TensorFlow and describe its component model which allows for model reuse across environments
- Demonstrate how to use TensorFlow models in Flink ML and Flink Streaming environments
- Present a roadmap and provide opportunities to contribute

Flink Forward Berlin 2017: Dongwon Kim - Predictive Maintenance with Apache F...

SK telecom shares our experience of using Flink in building a solution for Predictive Maintenance (PdM). Our PdM solution named metatron PdM consists of (1) a Deep Neural Network (DNN)-based prediction model for precise prediction, and (2) a Flink-based runtime system which applies the model to a sliding window on sensor data streams. Efficient handling of multi-sensor streaming data for real-time prediction of equipment condition is a critical component of our product. In this talk, we first show why we choose Flink as a core engine for our streaming use case in which we generate real-time predictions using DNNs trained with Keras on top of TensorFlow and Theano. In addition, we present a comparative study of methods to exploit learning models on JVM such as directly using Python libraries on CPython embedded in JVM, using TensorFlow Java API (including Flink TensorFlow), and making RPC calls to TensorFlow Serving. We then explain how we implement the runtime system using Flink DataStream API, especially with event time, various window mechanisms, timestamp and watermark, custom source and sink, and checkpointing. Lastly, we present how we use the official Flink Docker image for solution delivery and the Flink metric system for monitoring and management of our solution. We hope our use case sets a good example of building a DNN-based streaming solution using Flink.

Electricity price forecasting with Recurrent Neural Networks

Taegyun Jeon

This document discusses using recurrent neural networks (RNNs) for electricity price forecasting with TensorFlow. It begins with an introduction to the speaker, Taegyun Jeon from GIST. The document then provides an overview of RNNs and their implementation in TensorFlow. It describes two case studies - using an RNN to predict a sine function and using one to forecast electricity prices. The document concludes with information on running and evaluating the RNN graph and a question and answer section.

Feature Engineering

HJ van Veen

Similar to Data Intensive Applications with Apache Flink

Apache Fink 1.0: A New Era for Real-World Streaming Analytics

DataWorks Summit/Hadoop Summit

Apache Flink - Overview and Use cases of a Distributed Dataflow System (at pr...

Stephan Ewen

Apache Spark vs Apache Flink

AKASH SIHAG

This document compares Apache Spark and Apache Flink. Both are open-source platforms for distributed data processing. Spark was created in 2009 at UC Berkeley and donated to the Apache Foundation in 2013. It uses resilient distributed datasets (RDDs) and lazy evaluation. Flink was started in 2010 as a collaboration between universities in Germany and became an Apache project in 2014. It uses cyclic data flows and supports both batch and stream processing. While Spark is currently more mature with more components and community support, Flink claims to be faster for stream and batch processing. Overall, both platforms continue to evolve and improve.

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks

This document provides an overview of Apache Flink and discusses why it is suitable for real-world streaming analytics. The document contains an agenda that covers how Flink is a multi-purpose big data analytics framework, why streaming analytics are emerging, why Flink is suitable for real-world streaming analytics, novel use cases enabled by Flink, who is using Flink, and where to go from here. Key points include Flink innovations like custom memory management, its DataSet API, rich windowing semantics, and native iterative processing. Flink's streaming features that make it suitable for real-world use include its pipelined processing engine, stream abstraction, performance, windowing support, fault tolerance, and integration with Hadoop.

Overview of Apache Flink: the 4G of Big Data Analytics Frameworks

Slides of my talk at the Hadoop Summit Europe in Dublin, Ireland, on April 13th, 2016. The talk introduces Apache Flink as both a multi-purpose Big Data analytics framework and a real-world streaming analytics framework. It focuses on Flink's key differentiators and its suitability for streaming analytics use cases. It also shows how Flink enables novel use cases such as distributed CEP (Complex Event Processing) and querying its state, in which Flink behaves like a key-value data store.

Overview of Apache Flink: The 4G of Big Data Analytics Frameworks

Portable Streaming Pipelines with Apache Beam

confluent

1) Apache Beam is an open source unified model for defining both batch and streaming data processing pipelines. It allows writing pipelines once that can run on multiple distributed processing backends. 2) The Beam model separates the data processing logic from runtime requirements. It defines concepts like processing time vs event time to allow portability across batch and streaming runners. 3) Beam supports extensible IO connectors and aims to allow pipelines written in one language to run on different runtimes through language-specific SDKs. Currently, Java and Python SDKs can run on backends like Apache Spark, Flink, and Google Cloud Dataflow.

Present and future of unified, portable, and efficient data processing with A...

DataWorks Summit

The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In a way, Apache Beam is a glue that can connect the big data ecosystem together; it enables users to "run any data processing pipeline anywhere." This talk will briefly cover the capabilities of the Beam model for data processing and discuss its architecture, including the portability model. We’ll focus on the present state of the community and the current status of the Beam ecosystem. We’ll cover the state of the art in data processing and discuss where Beam is going next, including completion of the portability framework and Streaming SQL. Finally, we’ll discuss areas of improvement and how anybody can join us on the path of creating the glue that interconnects the big data ecosystem. Speaker: Davor Bonaci, Apache Software Foundation; Simbly, V.P. of Apache Beam; Founder/CEO at Operiant

Realizing the promise of portability with Apache Beam

J On The Beach

The world of big data involves an ever-changing field of players. Much as SQL stands as a lingua franca for declarative data analysis, Apache Beam (incubating) aims to provide a portable standard for expressing robust, out-of-order data processing pipelines in a variety of languages across a variety of platforms. In this talk, I will briefly cover the capabilities of the Beam model for data processing and integration with IOs, as well as the current state of the Beam ecosystem; discuss the benefits Beam provides regarding portability and ease of use; demo the same Beam pipeline running on multiple runners in multiple deployment scenarios (e.g. Apache Flink on Google Cloud, Apache Spark on AWS, Apache Apex on-premise); and give a glimpse of some of the challenges Beam aims to address in the future.

Portable batch and streaming pipelines with Apache Beam (Big Data Application...

Malo Denielou

Apache Beam is a top-level Apache project that aims to provide a unified API for efficient and portable data processing pipelines. Beam handles both batch and streaming use cases and neatly separates properties of the data from runtime characteristics, allowing pipelines to be portable across multiple runtimes, both open-source (e.g., Apache Flink, Apache Spark, Apache Apex, ...) and proprietary (e.g., Google Cloud Dataflow). This talk will cover the basics of Apache Beam, describe the main concepts of the programming model, and talk about the current state of the project (new Python support, first stable version). We'll illustrate the concepts with a use case running on several runners.

Flink in action

Artem Semenenko

Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...

Provectus

Near real-time anomaly detection at Lyft

markgrover

Introduction to Apache Flink

datamantra

This document provides an overview and introduction to Apache Flink, a stream-based big data processing engine. It discusses the evolution of big data frameworks to platforms and the shortcomings of Spark's RDD abstraction for streaming workloads. The document then introduces Flink, covering its history, key differences from Spark like its use of streaming as the core abstraction, and examples of using Flink for batch and stream processing.

Unified Batch and Real-Time Stream Processing Using Apache Flink