Adventures in Scaling from Zero to 5 Billion Data Points per Day
At Flink Forward San Francisco 2018, our team at Comcast presented the operationalized streaming ML framework that had just gone into production. This year, in just a few short months, we scaled a Customer Experience use case from an initial trickle of volume to processing over 5 billion data points per day. The use case helps diagnose potential issues with High Speed Data service and provides recommendations for resolving those issues as quickly and as cost-effectively as possible.
As with any solution that grows quickly, our platform faced challenges, bottlenecks, and technology limits, forcing us to quickly adapt and evolve our approach to handle 50,000+ data points per second.
We will introduce the problems, approaches, solutions, and lessons we learned along the way, including: The Trigger and Diagnosis Problem, The REST Problem, The “Feature Store” Problem, The “Customer State” Problem, The Savepoint Problem, The HA Problem, The Volume Problem, and of course The Really High Volume Feature Store Problem #2.
Flink Forward San Francisco 2019: Flink Powered Customer Experience: Scaling ... (Flink Forward)
Flink Powered Customer Experience: Scaling from 5 Billion down to One
What turns a simple interaction into a great customer experience? How do we transform a digital interaction into a personalized conversation? Over the past year our team scaled a Customer Experience use case to process over 5 Billion data points per day using Apache Flink. This presentation will show how Flink helps deliver a personalized, contextual interaction by scaling the solution down to one…the customer.
Extending Flink State Serialization for Better Performance and Smaller Checkp... (Flink Forward)
Operations with Flink state are a common source of performance issues for a typical stateful stream processing application. One tiny mistake can easily make your job spend most of its precious CPU time on serialization and inflate the checkpoint size to the sky. In this talk we’ll focus on Flink’s serialization framework and common problems happening around it:
* Is the Kryo fallback really that expensive from the CPU and state size perspective?
* How to plug your own or existing serializers (like protobuf) into Flink (see the sketch below).
* Using Scala sealed traits without Kryo fallback.
* Using custom integer variable-length encoding and delta encoding for primitive arrays to further reduce the state size.
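A minimal sketch of the serializer-plugging idea mentioned above, assuming the chill-protobuf dependency is on the classpath; `MyEvent` stands in for a hypothetical protobuf-generated class:

```java
import com.twitter.chill.protobuf.ProtobufSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializerSetup {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // MyEvent is a placeholder for your own protobuf-generated class.
        // Registering it with chill-protobuf's serializer keeps Flink from
        // falling back to generic, reflection-based Kryo handling.
        env.getConfig().registerTypeWithKryoSerializer(
                MyEvent.class, ProtobufSerializer.class);
    }
}
```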
Introducing Arc: A Common Intermediate Language for Unified Batch and Stream... (Flink Forward)
Today's end-to-end data pipelines need to combine many diverse workloads such as machine learning, relational operations, stream dataflows, tensor transformations, and graphs. For each of these workload types there exist several frontends (e.g., DataFrames/SQL, Beam, Keras) based on different programming languages, as well as different runtimes (e.g., Spark, Flink, Tensorflow) that target a particular frontend and possibly a hardware architecture (e.g., GPUs). Putting all the pieces of a data pipeline together leads to excessive data materialisation, type conversions, and hardware utilisation, as well as mismatches of processing guarantees.
Our research group at RISE and KTH in Sweden has created Arc, an intermediate language that bridges the gap between any frontend and a dataflow runtime (e.g., Flink) through a set of fundamental building blocks for expressing data pipelines. Arc incorporates Flink- and Beam-inspired stream semantics such as windows, state, and out-of-order processing, as well as concepts found in batch computation models. With Arc, we can cross-compile and optimise diverse tasks written in any programming language into a unified dataflow program. Arc programs can run efficiently on various hardware backends and allow seamless, distributed execution on dataflow runtimes. To that end, we showcase Arcon, a concept runtime built in Rust that can execute Arc programs natively, and present a minimal set of extensions to make Flink an Arc-ready runtime.
Dave Klein, Confluent, Developer Advocate
Apache Kafka is the core of an amazing ecosystem of tools and frameworks that enable us to get more value from our data. In this session we'll have a gentle introduction to Apache Kafka and a survey of some of the more popular components in the Kafka ecosystem.
https://www.meetup.com/KafkaBayArea/events/276592389/
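For readers new to Kafka, a minimal producer sketch in Java gives a feel for the core API; the broker address and topic name are assumptions for the example:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class HelloKafka {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Send a single record to an assumed "greetings" topic; the
        // try-with-resources block flushes and closes the producer.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("greetings", "key", "hello, kafka"));
        }
    }
}
```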
Crossing the Streams: Event Streaming with Kafka Streams
Viktor Gamov, Developer Advocate, Confluent
https://www.meetup.com/Lille-Kafka/events/270888493/
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation.
Storm often coexists in Big Data architectures with Hadoop. We will talk about different approaches to this interoperability between the systems, their benefits & trade-offs, and a new open source library available for high throughput use.
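To make Storm's primitives concrete, here is a hedged sketch of wiring a topology; `SentenceSpout` and `WordCountBolt` are hypothetical stand-ins for your own spout and bolt implementations:

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // SentenceSpout and WordCountBolt are placeholders for your own
        // spout/bolt implementations.
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // fieldsGrouping routes tuples with the same "word" value to the
        // same bolt task, keeping per-word counts consistent.
        builder.setBolt("counts", new WordCountBolt(), 4)
               .fieldsGrouping("sentences", new Fields("word"));

        try (LocalCluster cluster = new LocalCluster()) {
            cluster.submitTopology("word-count", new Config(), builder.createTopology());
            Thread.sleep(10_000); // let the local topology run briefly
        }
    }
}
```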
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Storm is a distributed realtime computation system. Similar to how Hadoop provides a set of general primitives for doing batch processing, Storm provides a set of general primitives for doing realtime computation. Storm is simple, can be used with any programming language, and is a lot of fun to use!
Using Kafka to integrate DWH and Cloud Based big data systems (confluent)
Mic Hussey, Senior Systems Engineer, Confluent
Using Kafka to integrate DWH and Cloud Based big data systems
https://www.meetup.com/Stockholm-Apache-Kafka-Meetup-by-Confluent/events/268636234/
Distributed Real-Time Stream Processing: Why and How 2.0 (Petr Zapletal)
The demand for stream processing is increasing rapidly these days. Immense amounts of data have to be processed fast from a rapidly growing set of disparate data sources. This pushes the limits of traditional data processing infrastructures. These stream-based applications include trading, social networks, the Internet of Things, system monitoring, and many other examples.
In this talk we are going to discuss various state-of-the-art open-source distributed streaming frameworks, their similarities and differences, implementation trade-offs, and their intended use-cases. Apart from that, I’m going to speak about Fast Data, the theory of streaming, framework evaluation, and so on. My goal is to provide a comprehensive overview of modern streaming frameworks and to help fellow developers pick the best possible one for their particular use-case.
Slides of a QCon London 2016 talk on how stream processing is used within Uber's Marketplace system to solve a wide range of problems, including but not limited to realtime indexing and querying of geospatial time series, aggregation and computation over streaming data, and extracting patterns from data streams. In addition, it touches upon various time-series analyses and predictions. The underlying systems utilize many open source technologies such as Apache Kafka, Samza and Spark streaming.
James Burkhart explains how Uber supports millions of analytical queries daily across real-time data with Apollo. James covers the architectural decisions and lessons learned building an exactly-once ingest pipeline storing raw events across in-memory row storage and on-disk columnar storage and a custom metalanguage and query layer leveraging partial OLAP result set caching and query canonicalization. Putting all the pieces together provides thousands of Uber employees with subsecond p95 latency analytical queries spanning hundreds of millions of recent events.
The Sparkling Water project brings integration of H2O platform with Spark Ecosystem.
In this hands-on session we will show how to start with Sparkling Water: launching H2O services on top of Spark, loading data with the help of H2O, manipulating data with the Spark and H2O APIs, and finally, preparing and using DeepLearning and GBM models.
The presentation also demonstrates using R to access results prepared with the help of Sparkling Water.
What is SamzaSQL, and what might I use it for? Does this mean that Samza is turning into a database? What is a query optimizer, and what can it do for my streaming queries?
How does Apache Calcite parse, validate and optimize streaming SQL queries? How is relational algebra extended to handle streaming?
Photon Technical Deep Dive: How to Think Vectorized (Databricks)
Photon is a new vectorized execution engine powering Databricks written from scratch in C++. In this deep dive, I will introduce you to the basic building blocks of a vectorized engine by walking you through the evaluation of an example query with code snippets. You will learn about expression evaluation, compute kernels, runtime adaptivity, filter evaluation, and vectorized operations against hash tables.
This presentation is about Scalding, with a focus on the programming model compared to Hadoop and Cascading. I gave this presentation for the group http://www.meetup.com/riviera-scala-clojure
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident (Julian Hyde)
A talk given at Hadoop Summit 2016, Dublin.
The internet of things (IoT) generates data at an unprecedented rate and requires results at low latency. Only streaming technologies can keep up. But IoT applications must be integrated with existing applications, such as Tableau, whose lingua franca is SQL. Samza and Storm are adding support for standard SQL with extensions for streaming, using Calcite for parsing and planning. We show, with examples, how you would accomplish typical tasks in Kafka/Samza or Storm/Trident using SQL. Streaming SQL allows you to work at a higher level. For example, many queries need to combine streams with historic or reference data in databases or HDFS, or recent data in memory. We show how to define multiple data source systems, and how to write queries that compute joins or aggregations on streams.
How to Improve the Observability of Apache Cassandra and Kafka applications... (Paul Brebner)
As distributed cloud applications grow more complex, dynamic, and massively scalable, “observability” becomes more critical. Observability is the practice of using metrics, monitoring, and distributed tracing to understand how a system works. We’ll explore two complementary open source technologies: Prometheus for monitoring application metrics, and OpenTracing and Jaeger for distributed tracing. We’ll discover how they improve the observability of an Anomaly Detection application deployed on AWS Kubernetes and using Instaclustr managed Apache Cassandra and Kafka clusters.
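As a small illustration of the Prometheus side, here is a hedged Java sketch using the Prometheus simpleclient; the metric name and port are illustrative, not from the talk:

```java
import io.prometheus.client.Counter;
import io.prometheus.client.exporter.HTTPServer;

public class MetricsDemo {
    // Counts how many anomaly checks the application has run;
    // the metric name is made up for this example.
    static final Counter checks = Counter.build()
            .name("anomaly_checks_total")
            .help("Total number of anomaly checks performed.")
            .register();

    public static void main(String[] args) throws Exception {
        // Expose all registered metrics at http://localhost:8080/metrics
        // for the Prometheus server to scrape.
        HTTPServer server = new HTTPServer(8080);
        checks.inc();
    }
}
```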
Streaming is necessary to handle data rates and latency, but SQL is unquestionably the lingua franca of data. Is it possible to combine SQL with streaming, and if so, what does the resulting language look like? Apache Calcite is extending SQL to include streaming, and Apache Apex is using Calcite to support streaming SQL. In this talk, Julian Hyde describes streaming SQL in detail and shows how you can use streaming SQL in your application. He also describes how Calcite’s planner optimizes queries for throughput and latency.
Julian Hyde gave this talk at Apex Big Data World, Mountain View, on April 4, 2017.
LendingClub RealTime BigData Platform with Oracle GoldenGate (Rajit Saha)
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speaker :
Rajit Saha
Vengata Guruswami
Oracle Goldengate for Big Data - LendingClub Implementation (Vengata Guruswamy)
This slide deck covers the LendingClub use case for implementing real-time analytics using Oracle GoldenGate for Big Data. It covers architecture, implementation, and troubleshooting steps.
Going Native: Leveraging the New JSON Native Datatype in Oracle 21c (Jim Czuprynski)
Need to incorporate JSON documents into existing Oracle database applications? The new native JSON datatype introduced in Oracle 21c makes it simple to store, access, traverse, and filter the complex data often found within JSON documents, often without any application code changes.
How to solve complex business requirements with Oracle Data Integrator? (Gurcan Orhan)
Business requirements are hard to implement, develop, and operate, and they change constantly. In this session attendees will see concrete examples of turning unstructured data into structured meaning, writing complex queries without typing anything, adding function-based joins, implementing CTAS (Create Table As Select) and IAS (Insert As Select) methods, simplifying business rules, and writing optimized queries to decrease operational and development costs and speed up loads.
In this presentation, see how you can solve some complex business requirements with Oracle Data Integrator's flexibility and ease-of-use features.
SRV405 Deep Dive Amazon Redshift & Redshift Spectrum at Cardinal Health (Amazon Web Services)
Get a technical deep dive into Amazon Redshift and Redshift Spectrum. Learn best practices for taking advantage of Amazon Redshift’s columnar technology and parallel processing capabilities, to improve overall database performance. This session will explain how to migrate from existing data warehouses, create an optimized schema, efficiently load data, use workload management and use Redshift Spectrum to query data directly in Amazon S3. The session will feature Jeff Battisti, Director Global Cloud BI&A Medical IT at Cardinal Health, and Greg Cantwell, Senior Consultant, Business Metrics / Analytics, who will provide lessons learned and best practices, from creating a new data warehouse to supporting Global Sales & Financial reporting in over 60 countries with Amazon Redshift.
A story about developing an application for an online store, persisting all the data as JSON.
Gives an overview of JSON functionality in Oracle Database 19c.
IBM Informix® Dynamic Server (IDS) 11 has new features to enable IDS applications to work with, store and search XML. This session discusses various scenarios and illustrates how to use and exploit these features.
Declarative Multilingual Information Extraction with SystemT (Laura Chiticariu)
Information extraction (IE), the task of extracting structured information from unstructured or semi-structured data, is increasingly important to a wide array of enterprise applications, ranging from Business Intelligence to Data-as-a-Service.
In the first part of the talk, we give an overview of SystemT, a declarative IE system designed and developed to address the requirements driven by modern applications: scalability, expressivity, and transparency. SystemT is based on the basic principle underlying relational database technology: complete separation of specification from execution. SystemT uses a declarative language for expressing NLP algorithms called AQL, and an optimizer that generates high-performance algebraic execution plans for AQL rules. It makes IE orders of magnitude more scalable and easier to use, maintain, and customize. Today, SystemT ships with multiple products across 4 IBM Software Brands and is being taught in universities. Our ongoing research and development efforts focus on making SystemT more usable for both technical and business users, and on continuing to enhance its core functionality based on natural language processing, machine learning, and database technology.
In the second part of the talk we present POLYGLOT, a multilingual semantic role labeling system capable of semantically parsing sentences in 9 different languages from 4 different language groups. The key feature of the system is that it treats the semantic labels of the English Proposition Bank as “universal semantic labels”: Given a sentence in any of the supported languages, POLYGLOT will predict appropriate English PropBank frame and role annotation. We illustrate how these universal semantic labels can be used within SystemT to create information extractors that immediately work across different languages. In addition, we illustrate how we automatically generate Proposition Banks for new languages in order to enable multilingual SRL and discuss some challenges of crosslingual semantics.
[Velocity Conf 2017 NY] How Twitter built a framework to improve infrastructu... (Vinu Charanya)
Twitter is powered by thousands of microservices running on an internal cloud platform consisting of a suite of multitenant platform services that offer compute, storage, messaging, monitoring, etc. as a service. These platforms have thousands of tenants and run atop hundreds of thousands of servers both on-premises and in the public cloud. The scale of diversity in Twitter’s multitenant infrastructure services makes it extremely difficult to effectively forecast capacity, compute resource utilization, and cost and drive efficiency.
Vinu Charanya explains how she and her team are building a system that captures, defines, provisions, meters, and charges infrastructure resources, redefining how systems are built atop Twitter infrastructure. The infrastructure resources include primitive bare metal servers and VMs in the public cloud and abstract resources offered by multitenant services such as a compute platform (powered by Apache Aurora and Mesos), storage (Manhattan for key-value, cache, RDBMS), and observability. Along the way, Vinu shares how Twitter used this data to better plan capacity and drive a cultural change in engineering that helped improve overall resource utilization and led to significant savings in infrastructure spend.
Using SQL-MapReduce for Advanced Analytics (Teradata Aster)
Industry analyst Rick van der Lans explains how Aster Data's patent-pending SQL-MapReduce programming framework makes new types of analytic queries possible. The main benefits he outlines are: Parallelization of complex operations; Simplification of queries; Predictable query performance; Efficient data access; and Linear scalability.
Check out Rick's complimentary research report at http://www.asterdata.com/ar_SQL-MapReduce_for_Advanced_Analytics/, in which he provides a very clear technical explanation of SQL-MapReduce and its analytic application use cases.
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how... (Flink Forward)
SQL is the lingua franca of data processing, and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink’s SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers. In our talk, we will discuss why you should and how you can (not being Alibaba or Uber) leverage the simplicity and power of SQL on Flink. We will start by exploring the use cases that Flink SQL was designed for and present real-world problems that it can solve. In particular, you will learn why unified batch and stream processing is important and what it means to run SQL queries on streams of data. After we have explored why you should use Flink SQL, we will show how you can leverage its full potential. The Flink community has recently been working on a service that integrates a query interface, (external) table catalogs, and result-serving functionality for static, appending, and updating result sets. We will discuss the design and feature set of this query service and how it can be used for exploratory batch and streaming queries, ETL pipelines, and live-updating query results that serve applications such as real-time dashboards. The talk concludes with a brief demo of a client running queries against the service.
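To ground the premise, here is a minimal, hedged Flink SQL sketch (using the newer TableEnvironment API rather than the query service described in the talk); the schema and the datagen source are made up for the example:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class FlinkSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.inStreamingMode());

        // A datagen-backed table stands in for a real source such as a
        // Kafka topic; the schema is invented for this example.
        tEnv.executeSql(
                "CREATE TABLE clicks (" +
                "  user_name STRING," +
                "  url STRING" +
                ") WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // A continuous query over the stream; the identical statement
        // would also run on a bounded table in batch mode.
        tEnv.executeSql(
                "SELECT user_name, COUNT(url) AS cnt " +
                "FROM clicks GROUP BY user_name")
            .print();
    }
}
```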
Microservices Manchester: How Lagom Helps to Build Real World Microservice Sy... (OpenCredo)
Markus Eisele's slides from his talk at Microservices Manchester, “How Lagom Helps to Build Real World Microservice Systems,” which provide an introduction to Lightbend's Lagom Framework (lightbend.com/lagom).
Markus Eisele is a Developer Advocate for Lightbend.
www.lightbend.com
Building a fully managed stream processing platform on Flink at scale for Lin... (Flink Forward)
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems like Search, Espresso (our internal document store), feature management, and more. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in Flink (Flink Forward)
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
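One common mitigation for the data/key skew discussed above is key salting with a two-stage aggregation. The following is a hedged sketch, not the speakers' solution; the bucket count, window size, and types are illustrative:

```java
import java.util.concurrent.ThreadLocalRandom;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SaltedAggregation {
    private static final int SALT_BUCKETS = 16; // illustrative fan-out

    public static DataStream<Tuple2<String, Long>> countPerKey(DataStream<String> keys) {
        // Stage 1: spread each hot key over SALT_BUCKETS sub-keys so no
        // single subtask receives all of that key's traffic.
        DataStream<Tuple2<String, Long>> partial = keys
                .map(k -> Tuple2.of(
                        k + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS), 1L))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));

        // Stage 2: strip the salt and combine the per-bucket partial counts.
        return partial
                .map(t -> Tuple2.of(t.f0.substring(0, t.f0.lastIndexOf('#')), t.f1))
                .returns(Types.TUPLE(Types.STRING, Types.LONG))
                .keyBy(t -> t.f0)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));
    }
}
```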
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i... (Flink Forward)
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by
Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... (Flink Forward)
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
Introducing the Apache Flink Kubernetes Operator (Flink Forward)
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes based Flink deployments the community has been working on a Kubernetes native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by
Thomas Weise
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by
Robert Metzger
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli... (Flink Forward)
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by
Mason Chen
One sink to rule them all: Introducing the new Async Sink (Flink Forward)
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at least once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by
Steffen Hausmann & Danny Cranmer
Tuning Apache Kafka Connectors for Flink.pptx (Flink Forward)
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we’ll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We’ll inspect the params that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves since they could cause more harm than benefit. Moreover, we’ll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we’ll discuss the Kafka sink. After browsing the available options we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by
Olena Babenko
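As one concrete example of the knobs discussed, here is a hedged sketch of building a Flink KafkaSource with partition discovery enabled; the broker address, topic, and interval are assumptions:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TunedKafkaSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")   // assumed address
                .setTopics("events")                  // assumed topic
                .setGroupId("tuning-demo")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // Pick up newly added partitions once a minute instead of never.
                .setProperty("partition.discovery.interval.ms", "60000")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka").print();
        env.execute("tuned kafka source");
    }
}
```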
Flink powered stream processing platform at Pinterest (Flink Forward)
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for pinners. At Pinterest, we adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest now supports 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by
Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by
David Moravek
Where is my bottleneck? Performance troubleshooting in Flink (Flink Forward)
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We’ll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by
Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production Deployment (Flink Forward)
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by
James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent time. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by
David Anderson
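A minimal, hedged sketch of the hybrid pattern the talk describes, handing a DataStream to the Table API and getting it back as a changelog stream; the element values are made up:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import static org.apache.flink.table.api.Expressions.$;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // DataStream side: a few made-up elements stand in for a real source.
        DataStream<String> words = env.fromElements("flink", "table", "flink");

        // Hand the stream to the Table API (an atomic type becomes column f0)
        // and aggregate relationally.
        Table counts = tEnv.fromDataStream(words)
                .groupBy($("f0"))
                .select($("f0").as("word"), $("f0").count().as("cnt"));

        // Return to the DataStream API as a changelog stream of updates.
        tEnv.toChangelogStream(counts).print();
        env.execute("hybrid pipeline");
    }
}
```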
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's TableAPI and Catalog to help users to interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of use Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by
Sijie Guo & Neng Lu
Dynamic Rule-based Real-time Market Data Alerts (Flink Forward)
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by
Ajay Vyasapeetam & Madhuri Jain
Exactly-Once Financial Data Processing at Scale with Flink and Pinot (Flink Forward)
Flink Forward San Francisco 2022.
At Stripe we have created a complete end to end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once power from Flink, Kafka, and Pinot together. The pipeline provides exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillion level rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by
Xiang Zhang & Pratyush Sharma & Xiaoman Dong
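For illustration, here is a hedged sketch of the kind of exactly-once Flink Kafka sink such a pipeline relies on (not Stripe's actual code); the broker, topic, and timeout values are assumptions:

```java
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;

public class ExactlyOnceSink {
    public static KafkaSink<String> build() {
        Properties props = new Properties();
        // The broker-side transaction timeout must exceed the maximum
        // checkpoint interval plus expected recovery time.
        props.setProperty("transaction.timeout.ms", "900000");

        return KafkaSink.<String>builder()
                .setBootstrapServers("broker:9092")   // assumed address
                .setKafkaProducerConfig(props)
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("ledger-events")    // assumed topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                // Two-phase-commit transactions tied to Flink checkpoints.
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("ledger-sink")
                .build();
    }
}
```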
Processing Semantically-Ordered Streams in Financial Services (Flink Forward)
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by
Patrick Lucas
Tame the small files problem and optimize data layout for streaming ingestion... (Flink Forward)
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion to Iceberg tables can suffer from two problems: (1) the small-files problem, which can hurt read performance, and (2) poor data clustering, which can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by
Gang Ye & Steven Wu
Batch Processing at Scale with Flink & Iceberg (Flink Forward)
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by
Andreas Hailu
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Transcript: Selling digital books in 2024: Insights from industry leaders - T... (BookNet Canada)
The publishing industry has been selling digital audiobooks and ebooks for over a decade and has found its groove. What’s changed? What has stayed the same? Where do we go from here? Join a group of leading sales peers from across the industry for a conversation about the lessons learned since the popularization of digital books, best practices, digital book supply chain management, and more.
Link to video recording: https://bnctechforum.ca/sessions/selling-digital-books-in-2024-insights-from-industry-leaders/
Presented by BookNet Canada on May 28, 2024, with support from the Department of Canadian Heritage.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... (Ramesh Iyer)
In today's fast-changing business world, companies that fail to adapt and embrace new ideas often struggle to keep up with the competition. Fostering a culture of innovation takes real work: vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
Elevating Tactical DDD Patterns Through Object Calisthenics (Dorra BARTAGUIZ)
After immersing yourself in the blue book and its red counterpart, attending DDD-focused conferences, and applying tactical patterns, you're left with a crucial question: How do I ensure my design is effective? Tactical patterns within Domain-Driven Design (DDD) serve as guiding principles for creating clear and manageable domain models. However, achieving success with these patterns requires additional guidance. Interestingly, we've observed that a set of constraints initially designed for training purposes remarkably aligns with effective pattern implementation, offering a more ‘mechanical’ approach. Let's explore together how Object Calisthenics can elevate the design of your tactical DDD patterns, offering concrete help for those venturing into DDD for the first time!
A tale of scale & speed: How the US Navy is enabling software delivery from l... (sonjaschweigert1)
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
UiPath Test Automation using UiPath Test Suite series, part 4 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
By Design, not by Accident - Agile Venture Bolzano 2024
Flink Forward San Francisco 2019: Adventures in Scaling from Zero to 5 Billion Data Points per Day - Dave Torok
1. ADVENTURES IN SCALING FROM ZERO TO 5 BILLION DATA POINTS PER DAY
Dave Torok
Distinguished Architect
Comcast Corporation
2 April, 2019
Flink Forward – San Francisco 2019
2.
COMCAST CUSTOMER RELATIONSHIPS
30.3 MILLION OVERALL CUSTOMER RELATIONSHIPS AT 2018 YEAR END
25.1 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMERS AT 2018 YEAR END
1.2 MILLION RESIDENTIAL HIGH-SPEED INTERNET CUSTOMER NET ADDITIONS IN 2018
3.
DELIVER THE ULTIMATE CUSTOMER EXPERIENCE
IS THE CUSTOMER HAVING A GOOD EXPERIENCE FOR HIGH SPEED DATA (HSD) SERVICE?
IF THE CUSTOMER ENGAGES US DIGITALLY, CAN WE OFFER A SELF-SERVICE SOLUTION?
IF THERE IS AN ISSUE, CAN WE OFFER OUR AGENTS AND TECHNICIANS A DIAGNOSIS TO HELP SOLVE THE PROBLEM MORE QUICKLY?
REDUCE THE TIME TO RESOLVE ISSUES
REDUCE COST TO THE BUSINESS AND THE CUSTOMER
6.
NINE ADVENTURES IN SCALING
THE TRIGGER AND DIAGNOSIS PROBLEM
THE FEATURE STORE PROBLEM
THE CHECKPOINT PROBLEM
THE REST PROBLEM
THE VOLUME PROBLEM
THE HA PROBLEM
THE INEFFICIENT OBJECT HANDLING PROBLEM
THE CUSTOMER STATE PROBLEM
THE REALLY HIGH VOLUME AND FEATURE STORE PROBLEM #2
7.
ML FRAMEWORK ARCHITECTURE – 2018
"EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE" – SF FLINK FORWARD 2018
(Architecture diagram: Requesting Applications call into the ML Framework through Model Execution Request and Model Prediction REST services; inside, Feature Check Async IO → Feature Assembly → Model Execution (Model Execute Async IO) → Prediction Store Sink; the Online Feature Store, Model/Feature Metadata, and Prediction / Outcome Store sit behind Feature Store and Prediction Store REST services; Feature Creation Pipelines populate the stores from Feature Sources / Data and Services.)
9.
MAY 2018 – INITIAL VOLUMES
INDICATOR #1: 9 MILLION
INDICATOR #2: 166 MILLION
INDICATOR #3: 1.2 MILLION
INDICATOR #4: 2,500 (SMALL TRIAL)
TOTAL: 175 MILLION / DAY
Comcast collects, stores, and uses all data in accordance with our privacy disclosures to users and applicable laws.
10.
175 MILLION PREDICTIONS?
(Diagram: 175 million data points flowing into an unknown "??????" stage ahead of the Prediction / Diagnosis Output Store.)
11.–14.
INTRODUCE A "TRIGGER" CONDITION
(Build-up across four slides: per-customer streams of mostly GREEN data points with occasional REDs; DETECT STATE CHANGE when a value flips between GREEN and RED; SEND FOR DIAGNOSIS only on such a state change.)
15.
INTRODUCE FLINK LOCAL STATE – STATE MANAGER AND TRIGGER
(Diagram: Customer Experience Data Point Producers send 175 million data points into the "State Flow and Trigger" job, which keeps Flink local state and emits only "a few million" triggers to the Customer Experience Diagnosis Platform and the Prediction / Diagnosis Output Store.)
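In code, the trigger on these slides amounts to keyed Flink state plus a transition check. A minimal sketch, assuming (entityId, status) tuples keyed by entity id; the class name and status strings are illustrative, not the production code:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits an (entityId, status) tuple only when the status differs from the last
// one seen for that key: GREEN->GREEN is absorbed, GREEN->RED (or RED->GREEN)
// is sent on for diagnosis.
public class StateChangeTrigger
        extends KeyedProcessFunction<String, Tuple2<String, String>, Tuple2<String, String>> {

    private transient ValueState<String> lastStatus;

    @Override
    public void open(Configuration parameters) {
        lastStatus = getRuntimeContext().getState(
                new ValueStateDescriptor<>("lastStatus", String.class));
    }

    @Override
    public void processElement(Tuple2<String, String> dataPoint,
                               Context ctx,
                               Collector<Tuple2<String, String>> out) throws Exception {
        String previous = lastStatus.value();
        if (previous == null || !previous.equals(dataPoint.f1)) {
            lastStatus.update(dataPoint.f1);   // record the new state
            out.collect(dataPoint);            // state change: trigger diagnosis
        }
        // unchanged state: no output, which is what turns 175M points
        // into "a few million" triggers
    }
}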
17.
ML FRAMEWORK ARCHITECTURE – 2018
"EMBEDDING FLINK THROUGHOUT AN OPERATIONALIZED STREAMING ML LIFECYCLE" – SF FLINK FORWARD 2018
(Same architecture diagram as slide 7, repeated for context.)
18.
INITIAL PLATFORM ARCHITECTURE
(Diagram: Customer Experience Data Point Producers feed the use case's "State Flow and Model Execution Trigger" job, which keeps Flink local state; the ML Framework runs Feature Getter Async IO → Feature Assembly → Feature Sink and Model Execution (Model Execute Async IO) → Prediction Store Sink; the Online Feature Store, Model/Feature Metadata, and Prediction / Outcome Store sit behind Feature Store, Prediction Store, Model Execution Request, and Model Prediction REST services used by the Requesting Application.)
19.
ORIGINAL ML FRAMEWORK ARCHITECTURE
EVERY COMMUNICATION WAS A REST SERVICE (HTTP REQUEST / RESPONSE)
ML FRAMEWORK ISOLATED FROM THE DATA POINT USE CASE BY DESIGN
FLINK ASYNC + REST IS DIFFICULT TO SCALE ELASTICALLY
20.
TUNING ASYNC I/O – LATENCY AND VOLUME
MAX THROUGHPUT PER SECOND = (SLOTS) * (THREADS) * (1000 / TASK DURATION MSEC)
(12 SLOTS) * (20 THREADS) * (1000 / 300 MSEC) = 800 / SECOND
APACHE HTTP CLIENT / EXECUTOR THREADS
  setMaxTotal
  setDefaultMaxPerRoute (THREADS)
FLINK ASYNC MAX OUTSTANDING
  RichAsyncFunction "Capacity"
  AsyncDataStream.unorderedWait() with max capacity parameter
RATE THROTTLING
  if (numberOfRequestsPerPeriod > 0) {
      int numberOfProcesses = getRuntimeContext()
              .getNumberOfParallelSubtasks();
      numberOfRequestsPerPeriodPerProcess =
              Math.max(1, numberOfRequestsPerPeriod / numberOfProcesses);
  }
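Putting those knobs together, a sketch of the wiring. FeatureGetter, FeatureRequest, and FeatureResponse are illustrative names, and the pool sizes and capacity are examples; the HTTP client belongs inside the async function's open() so it is not serialized with the job graph:

import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.impl.conn.PoolingHttpClientConnectionManager;

// Inside FeatureGetter.open(): size the Apache HttpClient pool.
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(240);               // total pooled connections
cm.setDefaultMaxPerRoute(20);      // the THREADS term in the formula above
CloseableHttpClient httpClient =
        HttpClients.custom().setConnectionManager(cm).build();

// In the job: cap outstanding async requests with the capacity parameter.
DataStream<FeatureResponse> responses = AsyncDataStream.unorderedWait(
        requests,                  // DataStream<FeatureRequest>
        new FeatureGetter(),       // RichAsyncFunction issuing the HTTP calls
        60, TimeUnit.SECONDS,      // per-request timeout
        100);                      // "capacity": max in-flight requests per subtask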
21.
REPLACE REST CALLS WITH KAFKA (I)
(Diagram: the use case's "State Flow and Model Execution Trigger" (Flink local state) now reaches the ML Framework over Kafka rather than REST; inside, Feature Getter Async IO → Feature Assembly → Feature Sink (JDBCAppendTableSink) and Model Execution → Prediction Store Sink (JDBCAppendTableSink) write the Online Feature Store and Prediction / Outcome Store, which Requesting Applications still read through the Feature Store and Prediction Store REST services.)
22.
REPLACE REST CALLS WITH KAFKA (II)
(Diagram: the pipeline is now split into independent Flink flows, each with its own local state: State Flow and Trigger, Feature Store Manager, a Feature Assembly → Feature Sink Flow (JDBCAppendTableSink), and Model Execution → Prediction Store Sink (JDBCAppendTableSink), fed by the Customer Experience Data Point Producers and backed by the Online Feature Store, Model/Feature Metadata, and Prediction / Outcome Store.)
24.–27.
HAS THIS EVER HAPPENED TO YOU?
(Slides 24–27 annotate the same snippet step by step; all annotations combined:)
New "JOLT" JSON Transformer (but don't cache it)
New ObjectMapper() (but don't cache it)
Parse JSON string into Map
Load and parse "Transform" JSON from the resource path (but don't cache it)
Serialize transformed Map back into string result
Then let's make a NEW JSONObject from that string
Go deep into the path to get the value
Cast exception? STATUS_INVALID

String transformedJson =
    new JoltJsonTransformer()
        .transformJson(myJsonValue, pTool.getRequired(SPEC_PATH));
JSONObject inputPayload = new JSONObject(transformedJson);
try {
    aType = inputPayload.getJSONObject("info")
        .getJSONObject("aType")
        .getString("string");
} catch (Exception e) {
    // ignore, invalid object
    aType = STATUS_INVALID;
}
28.
SLIGHTLY MORE EFFICIENT SOLUTION
Cache our new ObjectMapper()
Parse JSON string into Map ONCE
Lightweight Map Traversal utility
Java Stream syntax: Optional.ofNullable / Optional.orElse

if (mapper == null)
    mapper = new ObjectMapper();
Map<?, ?> resultMap =
    JsonUtils.readValue(mapper, (String) result, Map.class);
MapTreeWalker mapWalker = new MapTreeWalker(resultMap);
final String aType =
    mapWalker.step("info")
        .step("aSchemaInfo")
        .step("aType")
        .step("string")
        .<String>get()
        .orElse(STATUS_INVALID);
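MapTreeWalker is the team's internal utility; roughly, a null-safe step-down over a parsed Map could look like this (a sketch under that assumption, not the original implementation):

import java.util.Map;
import java.util.Optional;

public class MapTreeWalker {
    private final Object current;

    public MapTreeWalker(Object root) { this.current = root; }

    // Descend one level if the current node is a Map containing the key;
    // otherwise carry a null forward so the whole chain stays safe.
    public MapTreeWalker step(String key) {
        Object next = (current instanceof Map)
                ? ((Map<?, ?>) current).get(key)
                : null;
        return new MapTreeWalker(next);
    }

    // Terminal get: empty Optional if any step along the way missed.
    @SuppressWarnings("unchecked")
    public <T> Optional<T> get() {
        return Optional.ofNullable((T) current);
    }
}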
29.
FLINK OBJECT EFFICIENCIES
REDUCE OBJECT CREATION AND GARBAGE COLLECTION
BE CAREFUL OF SIDE EFFECTS WITHIN OPERATORS

StreamExecutionEnvironment env;
env.getConfig().enableObjectReuse();

SEMANTIC ANNOTATIONS (example below)
@NonForwardedFields
@ForwardedFields
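For reference: enableObjectReuse() lets Flink hand the same object instance to chained operators instead of copying it, which is why side effects inside operators become dangerous; the semantic annotations (from the DataSet API's FunctionAnnotation) tell the optimizer which fields pass through untouched. A small sketch, with illustrative class names:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.functions.FunctionAnnotation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ObjectReuseExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.getConfig().enableObjectReuse();   // reuse objects between chained operators
    }

    // DataSet-API style semantic annotation: field f0 (the key) is forwarded
    // unchanged, letting the optimizer avoid unnecessary work on it.
    @FunctionAnnotation.ForwardedFields("f0")
    public static class Increment
            implements MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {
        @Override
        public Tuple2<String, Integer> map(Tuple2<String, Integer> in) {
            return new Tuple2<>(in.f0, in.f1 + 1);   // f0 untouched, f1 modified
        }
    }
}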
31.
WHAT IS A FEATURE STORE?
A "FEATURE" IS A DATA POINT USED AS INPUT TO AN ML MODEL
WE NEED TO:
• STORE ALL THE INPUT DATA POINTS
• SNAPSHOT AT ANY MOMENT IN TIME
• ASSOCIATE WITH A DIAGNOSIS (MODEL OUTPUT)
• HAVE ACCESS TO THE DATA POINTS (API FOR OTHER APPS)
• STORE ALL DATA POINTS FOR ML MODEL TRAINING
(Diagram: the Feature Creation Pipeline appends to the Historical Feature Store, which feeds the Model Training phase, and overwrites the Online Feature Store, which serves the Prediction phase.)
32.
FIRST TRY: AWS AURORA RDBMS
"HEY, IT SHOULD GET US UP TO 10,000 / SECOND"
…THE UPSERT PROBLEM: DIDN'T WORK MUCH PAST 3,000 / SECOND
…SOON OUR INPUT RATE WAS 20,000 / SECOND
33.
MITIGATION: STORE ONLY AT DIAGNOSIS TIME
STOPPED STORING ALL RAW DATA POINTS
ONLY STORE DATA POINTS ALONGSIDE A DIAGNOSIS
CON: ONLY A SMALL % OF DATA POINTS
CON: DATA NOT CURRENT, ONLY AS OF THE MOST RECENT DIAGNOSIS
CON: LIMITED USEFULNESS
(Diagram: Model Execution writes both a Prediction Store Sink to the Prediction / Outcome Store and a Feature Sink to the Feature Store.)
34.
OPTIMIZING THE WRITE PATH
WHICH FLINK FLOW WRITES TO THE FEATURE STORE?
FEATURE STORE NOT NEEDED FOR TRIGGER
FEATURE STORE NOT NEEDED FOR MODEL EXECUTION
SEPARATE "TRANSFORM AND SINK" FLOW
BEFORE: (the Feature Manager flow writes directly to the Online Feature Store behind the Feature Store REST Service)
AFTER: (the upstream flow feeds the Feature Manager's Flink local state and Model Execution; a dedicated Feature Sink Flow owns the writes to the Online Feature Store)
35.
ALTERNATIVE DATA STORES
CASSANDRA: COST AND COMPLEXITY?
TIME SERIES DB: DRUID, INFLUX
REDIS: STORE ONLY LATEST DATA POINTS?
FLINK QUERYABLE STATE: PRODUCTION HARDENING? READ VOLUME?
36.
REDIS
WE ONLY NEED < 7 DAYS OF DATA AND ONLY THE LATEST VALUE, SO USE REDIS AS THE KV STORE
ACCESSIBLE BY OTHER CONSUMERS OR VIA A SERVICE
COULDN'T PUT MORE THAN 8,000 / SECOND FOR A 'REASONABLE' CLUSTER SIZE
FLINK SINK CONNECTOR IS NOT CLUSTER-ENABLED: ONE OBJECT PUT PER CALL
JEDIS VS. LETTUCE FRAMEWORK AND PIPELINING (see the sketch below)
SOME OPTIMIZATION IF ABLE TO BATCH BY PRE-COMPUTING REDIS SHARDS
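On the Lettuce pipelining point: a sketch of manual flush batching. setAutoFlushCommands/flushCommands are real Lettuce calls; the URI, the batch map of latest values, and the timeout are illustrative:

import io.lettuce.core.LettuceFutures;
import io.lettuce.core.RedisClient;
import io.lettuce.core.RedisFuture;
import io.lettuce.core.api.StatefulRedisConnection;
import io.lettuce.core.api.async.RedisAsyncCommands;
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

RedisClient client = RedisClient.create("redis://localhost:6379");
StatefulRedisConnection<String, String> conn = client.connect();
RedisAsyncCommands<String, String> cmds = conn.async();

cmds.setAutoFlushCommands(false);            // queue commands client-side
List<RedisFuture<String>> futures = new ArrayList<>();
for (Map.Entry<String, String> e : batch.entrySet()) {   // batch: Map<String, String> of latest values
    futures.add(cmds.set(e.getKey(), e.getValue()));
}
cmds.flushCommands();                        // ship the whole batch in one write
LettuceFutures.awaitAll(Duration.ofSeconds(5),
        futures.toArray(new RedisFuture[0]));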
37.
FLINK QUERYABLE STATE
REUSE OUR FEATURE REQUEST/RESPONSE FRAMEWORK
SCALABLE, FAST
OUR READ LOAD IS LOW (10 / SECOND)
EXPENSIVE (RELATIVELY)
NOT A "PRODUCTION" CAPABILITY
NOT A "DATABASE" TECHNOLOGY
38.
QUERYABLE STATE – ONLINE FEATURE STORE
(Diagram: two paths from the Customer Experience Data Point Producers. ONLINE FEATURE STORE PATH: an upstream flow maintains the Online Feature Store in Flink local state, exposed through a Feature REST Service acting as a Queryable State Client for Requesting Applications. MODEL EXECUTION PATH: an upstream flow feeds the Feature Manager's Flink local state and Model Execution.)
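Reading the online feature store through Queryable State looks roughly like this. A sketch: the state name, host, port, and key variables are illustrative, but getKvState is the actual client API, and the descriptor must match the one the job registered via stateDescriptor.setQueryable(...):

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.queryablestate.client.QueryableStateClient;
import java.util.concurrent.CompletableFuture;

// Connects to a task manager's queryable state proxy (constructor throws UnknownHostException).
QueryableStateClient client = new QueryableStateClient("tm-host", 9069);

// Must match the descriptor the job registered as queryable.
ValueStateDescriptor<String> descriptor =
        new ValueStateDescriptor<>("online-features", String.class);

CompletableFuture<ValueState<String>> future = client.getKvState(
        JobID.fromHexString(jobIdHex),   // the running Flink job to query
        "online-features",               // registered queryable state name
        entityId,                        // key: the customer/entity id
        BasicTypeInfo.STRING_TYPE_INFO,  // key type information
        descriptor);

String featureJson = future.get().value();  // blocks; the Feature REST Service wraps this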
47.
MODEL EXECUTION ORCHESTRATION
MODEL EXECUTION BACKUP PROBLEM: GENERAL SLOWNESS BACKS UP THE MODEL EXECUTION PIPELINE
(Diagram: Feature Getter → Feature Assembly (5-minute global window) → Model Execution AsyncDataStream (60-second timeout) → Model Execution REST Service.)
51.
KEEPING TRACK OF DATA POINT STATE
25+ MILLION HIGH SPEED INTERNET CUSTOMERS, 10+ DATA POINT TYPES
TWO FLINK STATES TO MANAGE:
1. QUERYABLE STATE FEATURE STORE
   UP TO 15KB (JSON) PER DATA POINT
   10KB * 25 MILLION = 250 GB; 1KB * 25 MILLION = 25 GB
   APPROXIMATE TOTAL = 1 TB
2. RED / GREEN "TRIGGER" (sketch below)
   MAPSTATE / KEYBY (ENTITY ID): [ENTITY ID] → (DATA POINT TYPE, STATE)
   25 MILLION CUSTOMERS, 10+ DATA POINT TYPES
   AVG 33 BYTES PER ENTITY ID, ~1 GB TOTAL
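The red/green trigger state described above maps naturally onto Flink's MapState, keyed by entity id, with one small entry per data point type (the descriptor and variable names here are illustrative):

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;

// One MapState per keyed entity: data point type -> last RED/GREEN status.
MapStateDescriptor<String, String> triggerDesc =
        new MapStateDescriptor<>("dataPointStatus", String.class, String.class);

// In a keyed function's open():
MapState<String, String> statusByType = getRuntimeContext().getMapState(triggerDesc);

// Per element: tens of bytes per entity across 10+ types, ~1 GB for 25M customers.
statusByType.put(dataPointType, status);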
53.
CHECKPOINT TUNING
FEATURE MANAGER: UP TO 1 TB OF STATE
2.5 MINUTES WRITING A CHECKPOINT EVERY 10 MINUTES
TURNED OFF FOR A WHILE! (RATIONALE: IT WOULD ONLY TAKE A FEW HOURS TO RECOVER ALL STATE)
INCREMENTAL CHECKPOINTING HELPED (example below)
LONG ALIGNMENT TIMES AT HIGH PARALLELISM
REDUCED THE SIZE OF STATE FOR THE 4- AND 24-HOUR WINDOWS USING A FEATURE LOOKUP TABLE
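A sketch of those checkpoint settings; the S3 path is illustrative, and the boolean constructor argument is what enables RocksDB incremental checkpoints:

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Incremental checkpoints: upload only RocksDB SST deltas, not the full 1 TB.
        env.setStateBackend(new RocksDBStateBackend("s3://my-bucket/checkpoints", true));
        env.enableCheckpointing(10 * 60 * 1000);   // every 10 minutes, as on the slide
    }
}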
55.
AMAZON EMR AND HIGH AVAILABILITY
EMR FLINK CLUSTER: SINGLE AVAILABILITY ZONE (AZ), SINGLE MASTER NODE: NOT HA
(Diagram: one AWS region with EMR Cluster 1 in AZ1 and EMR Cluster 2 in AZ2, each with a single master (M) and several task nodes (T), both checkpointing to S3.)
56.
AMAZON AWS EMR
SESSIONS KEPT DYING: THE TMP DIR IS PERIODICALLY CLEARED, WHICH BREAKS THE 'BLOB STORE' ON LONG-RUNNING JOBS
SOLUTION: CONFIGURE THESE TO POINT OUTSIDE /TMP (example below):
jobmanager.web.tmpdir
blob.storage.directory
EMR'S FLINK VERSION LAGS THE APACHE RELEASE BY 1-2 MONTHS
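Illustratively, the flink-conf.yaml entries might point at a directory that survives EMR's /tmp cleanup (these paths are examples, not the ones used in production):

jobmanager.web.tmpdir: /var/flink/web-tmp
blob.storage.directory: /var/flink/blob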
57.
TECHNOLOGY OPTIONS
FLINK ON KUBERNETES
AMAZON EKS (ELASTIC KUBERNETES SERVICE)
SELF-MANAGED KUBERNETES ON EC2
DATA ARTISANS / VERVERICA PLATFORM
Bonus: reduced cost and overhead by having fewer master nodes; less overhead for smaller (low-parallelism) flows
60.
DATA POINT VOLUME MAY–NOVEMBER 2018
(Chart: OCTOBER 2018 – 1 BILLION)
61.
DATA POINT VOLUME MAY–NOVEMBER 2018
(Chart: OCTOBER 2018 – 1 BILLION; NOVEMBER 2018 – 2 BILLION)
62.
LET'S ADD ANOTHER 3 BILLION!
15 MILLION DEVICES, 3 BILLION DATA POINTS
63.
SOLUTION APPROACH
1. ISOLATE TRIGGER FOR HIGH-VOLUME DATA POINT TYPES
2. ISOLATE ONLINE FEATURE STORE FOR HIGH-VOLUME DATA POINT TYPES
3. ISOLATE DEDICATED FEATURE STORAGE FLOW TO HIGH-SPEED FEATURE STORE
4. SEPARATE HISTORICAL FEATURE STORAGE (S3 WRITER) FROM REAL-TIME MODEL EXECUTION FLOWS
64.
1: SEPARATE HIGH-VOLUME TRIGGER FLOWS
(Diagram: each high-volume data point producer gets its own "High Volume State / Trigger" flow with its own Flink local state, alongside the original State Flow and Trigger; 3 billion data points in become roughly 50 million out toward the Online Feature Store and Model Execution.)
65.
2: FEDERATED ONLINE FEATURE STORE
(Diagram: each high-volume data point producer feeds its own Online Feature Store flow (1, 2, …), each with its own Flink local state; together with the original Online Feature Store and the Model/Feature Metadata they form a FEDERATED FEATURE STORE behind the Feature REST Service (Queryable State Client) used by Requesting Applications.)
66.
3+4: HISTORICAL FEATURE STORE
(Diagram: the High Volume State / Trigger flow feeds a dedicated Feature Producer → Feature Sink path writing to the Historical Feature Store (S3), kept separate from the Online Feature Store that serves the Feature REST Service (Queryable State Client), Requesting Applications, and Model Execution.)
67.
CURRENT ARCHITECTURE
(Diagram: Customer Experience Data Point Producers feed the State Flow and Trigger, while each high-volume data point producer has its own High Volume Feature Producer / Feature Sink / State Trigger flow, all on Flink local state. Triggers fan into the Feature Manager flows (Feature Manager → Feature Assembly) and on to Model Execution → Prediction Sink; a Feature Sink writes the Historical Feature Store; the Federated Online Feature Store, the Prediction / Outcome Store, and the Prediction Store and Feature Store REST services serve consumers.)
68.
WHERE ARE WE TODAY
INDICATORS: 725+ MILLION DATA POINTS PER DAY
HIGH-VOLUME INDICATORS: 6 BILLION PER DAY
TOTAL: ~7 BILLION DATA POINTS PER DAY
69.
WHERE ARE WE TODAY – FLINK CLUSTERS
CLUSTERS: 14
INSTANCES: 150
VCPU: 1,100
RAM: 5.8 TB
70.
FUTURE WORK
EXPAND CUSTOMER EXPERIENCE INDICATORS TO OTHER COMCAST PRODUCTS
IMPROVED ML MODELS BASED ON RESULTS AND CONTINUOUS RETRAINING
MODULAR MODEL EXECUTION VIA KUBEFLOW AND GRPC CALLS