Like many industries, banking is undergoing a fundamental change driven by the software revolution. Banks no longer compete only on interest rates and the best traders; these days, customer experience and the best engineers are the focus. In this changing world, banks compete with new start-ups, the so-called fintechs, and with large platform organisations such as Google, Facebook and Apple. At ING, we believe that staying ahead of the game means changing how we interact with our customers: no longer the traditional model of waiting for customers to come to the bank through our website or apps, but actively reaching out to customers with information that is relevant to them, in order to make their financial lives frictionless. Many of these changes are driven by reacting to all events that are relevant to the customer, using streaming analytics to reach out within milliseconds after an event occurs. Apache Flink is key for ING to achieve this. This presentation addresses how ING approaches the challenge, the role that Apache Flink plays, and the consequences regulations have on how we work with open source in general, and with Apache Flink (and data Artisans) in particular. This keynote takes place at Kino 3.
SingleStore & Kafka: Better Together to Power Modern Real-Time Data Architect...HostedbyConfluent
To remain competitive, organizations need to democratize access to fast analytics, not only to gain real-time insights on their business but also to power smart apps that need to react in the moment. In this session, you will learn how Kafka and SingleStore enable a modern yet simple data architecture to analyze both fast-paced incoming data and large historical datasets. In particular, you will understand why SingleStore is well suited to process data streams coming from Kafka.
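As a rough illustration of the pairing this session describes, the sketch below creates a SingleStore pipeline that continuously loads a Kafka topic into a table. The broker address, database, table and topic names are hypothetical, the DDL is paraphrased from SingleStore's documented pipeline syntax rather than taken from this session, and the connection uses the MySQL wire protocol that SingleStore speaks.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KafkaToSingleStore {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; SingleStore is compatible with MySQL drivers.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://singlestore-host:3306/analytics", "user", "password");
             Statement stmt = conn.createStatement()) {
            // A SingleStore pipeline subscribes to a Kafka topic and loads it continuously.
            stmt.execute(
                "CREATE PIPELINE clicks_pipeline AS " +
                "LOAD DATA KAFKA 'kafka-broker:9092/clicks' " +
                "INTO TABLE clicks");
            // Once started, the pipeline keeps ingesting new Kafka records into the table.
            stmt.execute("START PIPELINE clicks_pipeline");
        }
    }
}
```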
KSQL-ops! Running ksqlDB in the Wild (Simon Aubury, ThoughtWorks) Kafka Summi...confluent
Since its release in 2018, KSQL has grown from an interesting curiosity into ksqlDB - a production-grade streaming system. What does it look like to run KSQL in the enterprise? How has the promise of Kafka Streams with a SQL dialect worked out in the wild?
Let's explore stream processing with ksqlDB in the enterprise: how it is used for rapid prototyping and for taking an idea to production; how flexible scripting helps teams with error discovery and system introspection; and how extended teams can use KSQL as a stepping stone for building and sharing real-time scoring and streaming insights.
This session will cover production deployments of ksqlDB in banking, finance, transport and insurance. What can go wrong, and what can go right. See how teams embrace the technology to solve stream processing challenges.
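To give a flavor of the rapid prototyping described above, here is a minimal sketch using the ksqlDB Java client; the server address, topic and schema are hypothetical and this is not code from the talk.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class KsqlPrototype {
    public static void main(String[] args) throws Exception {
        // Hypothetical host/port; assumes a running ksqlDB server and a 'payments' topic.
        ClientOptions options = ClientOptions.create().setHost("ksqldb-server").setPort(8088);
        Client client = Client.create(options);

        // Declare a stream over an existing Kafka topic ...
        client.executeStatement(
            "CREATE STREAM payments (id VARCHAR KEY, amount DOUBLE, region VARCHAR) " +
            "WITH (KAFKA_TOPIC='payments', VALUE_FORMAT='JSON');").get();

        // ... and derive a continuously maintained aggregate from it.
        client.executeStatement(
            "CREATE TABLE payments_per_region AS " +
            "SELECT region, COUNT(*) AS cnt, SUM(amount) AS total " +
            "FROM payments GROUP BY region EMIT CHANGES;").get();

        client.close();
    }
}
```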
Scaling stream data pipelines with Pravega and Apache Flink - Till Rohrmann
Extracting insights out of continuously generated data requires a stream processor with powerful data analytics features, such as Apache Flink. A stream data pipeline with Flink typically includes a storage component to ingest and serve the data. Pravega is a stream store that ingests and stores stream data permanently, making the data available for tail, catch-up, and historical reads. One important challenge for such stream data pipelines is coping with variations in the workload. Daily cycles and seasonal spikes might require the provisioning of the application to adapt accordingly. Pravega has a feature called stream scaling, which enables the capacity offered for the ingestion of events of a stream to grow and shrink over time according to the workload. Such a feature is useful when the downstream application is able to accommodate such changes and scale its own provisioning accordingly. In this presentation, we introduce stream scaling in Pravega and show how Flink jobs leverage this feature to rescale stateful jobs according to variations in the workload.
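For concreteness, here is a minimal sketch of creating a Pravega stream with an auto-scaling policy via the Pravega Java client; the controller URI, scope, stream name and rate targets are hypothetical choices, not values from the talk.

```java
import java.net.URI;

import io.pravega.client.admin.StreamManager;
import io.pravega.client.stream.ScalingPolicy;
import io.pravega.client.stream.StreamConfiguration;

public class ScalableStream {
    public static void main(String[] args) {
        // Hypothetical controller endpoint and names.
        try (StreamManager streamManager = StreamManager.create(URI.create("tcp://controller:9090"))) {
            streamManager.createScope("analytics");
            // Auto-scaling policy: split or merge stream segments to target roughly
            // 1000 events/s per segment, scaling by a factor of 2, with at least 3 segments.
            StreamConfiguration config = StreamConfiguration.builder()
                    .scalingPolicy(ScalingPolicy.byEventRate(1000, 2, 3))
                    .build();
            streamManager.createStream("analytics", "events", config);
        }
    }
}
```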
Flink Forward Berlin 2018: Aljoscha Krettek & Till Rohrmann - Keynote: "A Yea...Flink Forward
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities have introduced to the stream processing landscape, such as Streaming SQL, Time-Versioned Tables, cluster-library duality, language portability, etc.
We will also take a sneak peek into our crystal ball and present what the Flink community is working on next.
Flattening the Curve with Kafka (Rishi Tarar, Northrop Grumman Corp.) Kafka S...confluent
Responding to a global pandemic presents a unique set of technical and public health challenges. The real challenge is gathering data that arrives via many data streams in a variety of formats, where the outcome influences the real world and impacts everyone. The Centers for Disease Control and Prevention's CELR (COVID Electronic Lab Reporting) program was established to rapidly aggregate, validate, transform, and distribute laboratory testing data submitted by public health departments and other partners. Confluent Kafka with Kafka Streams and Connect plays a critical role - sketched below, after the list - in meeting the program's objectives to:
- Track the threat of the COVID-19 virus
- Provide comprehensive data for local, state, and federal response
- Better understand locations with an increase in incidence
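The sketch below illustrates the validate-and-transform pattern with Kafka Streams in spirit only; the topic names and the trivial validation rule are hypothetical stand-ins for the CELR program's actual logic.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LabReportPipeline {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> reports = builder.stream("lab-reports-raw");

        // Validate: route malformed records to a dead-letter topic for review.
        reports.filter((key, value) -> value == null || !value.contains("\"testResult\""))
               .to("lab-reports-invalid");

        // Transform: normalize valid records and publish them downstream.
        reports.filter((key, value) -> value != null && value.contains("\"testResult\""))
               .mapValues(String::trim)
               .to("lab-reports-normalized");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "celr-style-pipeline");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```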
http://flink-forward.org/kb_sessions/flink-and-beam-current-state-roadmap/
It is no secret that the Dataflow model, which evolved from Google’s MapReduce, Flume, and MillWheel, has been a major influence on Apache Flink’s streaming API. The essentials of this model are captured in Apache Beam. Beam provides the Dataflow API with the option to deploy to various backends (e.g. Flink, Spark). In this talk we will examine the current state of the Flink Runner. Beam’s Runners manage the translation of the Beam API into the backend API. The Beam project itself has made an effort to summarize the capabilities of each Runner to provide an overview of the supported API concepts. Of all the open source backends, Flink is currently the Runner that supports the most features. We will look at the supported Beam features and their counterparts in Flink. Further, we will look at potential improvements and upcoming features of the Flink Runner.
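A minimal sketch of what running Beam on the Flink Runner looks like; the input and output paths are hypothetical. Note that only the runner option changes when switching backends:

```java
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamOnFlink {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        // The only Flink-specific line: swap in another runner class to change backends.
        options.setRunner(FlinkRunner.class);

        Pipeline p = Pipeline.create(options);
        p.apply("Read", TextIO.read().from("input.txt"))          // hypothetical path
         .apply("CountLines", Count.perElement())
         .apply("Format", MapElements.into(TypeDescriptors.strings())
                 .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply("Write", TextIO.write().to("counts"));            // hypothetical path

        p.run().waitUntilFinish();
    }
}
```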
Apache Flink 101 - the rise of stream processing and beyond - Bowen Li
Apache Flink is one of the most popular and widely adopted stream processing frameworks, powering real-time stream computations at extremely large scale at companies like Uber, Lyft, AWS, Alibaba, Pinterest, Splunk, and Yelp.
In this talk, we will go over use cases and basic (yet hard to achieve!) requirements of stream processing, and how Flink fills the gaps and stands out with some of its unique core building blocks, like pipelined execution, native event time support, state support, and fault tolerance.
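To make those building blocks concrete, here is a minimal sketch of a Flink job that combines event time, watermarks, windows, and checkpointing; the sensor event type and values are hypothetical.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class EventTimeDemo {
    // Hypothetical event type; public fields + no-arg constructor keep it a Flink POJO.
    public static class SensorReading {
        public String sensorId;
        public long timestampMillis;
        public double value;
        public SensorReading() {}
        public SensorReading(String id, long ts, double v) {
            this.sensorId = id; this.timestampMillis = ts; this.value = v;
        }
        @Override public String toString() { return sensorId + "@" + timestampMillis + "=" + value; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Fault tolerance: periodic checkpoints make operator state recoverable.
        env.enableCheckpointing(10_000);

        env.fromElements(
                new SensorReading("s1", 1_000L, 20.0),
                new SensorReading("s1", 61_000L, 21.5))
           // Native event time: timestamps come from the data; watermarks tolerate
           // up to 5 seconds of out-of-order arrival.
           .assignTimestampsAndWatermarks(
                WatermarkStrategy.<SensorReading>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((r, ts) -> r.timestampMillis))
           // Keyed state: each sensor's window contents are managed by Flink.
           .keyBy(r -> r.sensorId)
           .window(TumblingEventTimeWindows.of(Time.minutes(1)))
           .reduce((a, b) -> a.value >= b.value ? a : b) // per-minute maximum
           .print();

        env.execute("event-time-demo");
    }
}
```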
We will also take a look at how Flink is going beyond stream processing into areas like unified data processing, enterprise integration, AI/machine learning (especially online ML), and serverless computing, and at the distinct value Flink brings to each.
SPEAKER: Bowen Li
SPEAKER BIO: Bowen is a committer of Apache Flink, a senior engineer at Alibaba, and the host of the Seattle Flink Meetup.
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...confluent
PayPal currently processes tens of billions of signals per day from different sources in batch and streaming mode. Our data processing platform powers these different analytical needs and use cases, not just at PayPal but also at adjacencies like Venmo, Hyperwallet and iZettle. End users of this platform demand access to data insights with as much flexibility as possible and with low processing latency.
One such use case is our Switchboard (data de-multiplexer) platform, where we process approximately 20 billion events daily and provide data to different teams and platforms within PayPal, as well as to platforms outside PayPal, for further insights. When we started building this platform, Kafka was just another asynchronous message processing system for us, but we have seen it evolve to a point where it adds value not just for event processing but also for platform resiliency and scalability.
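The following sketch shows one generic way to express such payload-based de-multiplexing with Kafka Streams' per-record topic routing; the topics and routing rule are hypothetical and not PayPal's actual Switchboard implementation.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class SwitchboardStyleDemux {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("signals-ingress")
               // De-multiplex: choose the destination topic per record from its payload.
               .to((key, value, ctx) -> value.contains("\"source\":\"venmo\"")
                       ? "signals-venmo"
                       : "signals-paypal");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "switchboard-style-demux");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```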
Takeaway for the audience: most people work with and have knowledge about data. With this talk I want to present information that is relevant and meaningful to the audience - information and examples that will make it easier for attendees to understand our complex system and, hopefully, offer practical takeaways for applying Kafka to similar problems at hand.
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uber - confluent
Speaker: Yupeng Fu, Staff Engineer, Uber
High availability and reliability are important requirements for Uber services: services must tolerate datacenter failures in one region and fail over to another. In this talk, we will present the active-active Apache Kafka® setup at Uber and how it facilitates disaster recovery across regions for Uber services. In particular, we will highlight the key components, including topic replication, topic aggregation, and offset sync, and then walk through several use cases of the disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the planned future work.
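Uber's replication stack is largely in-house, but the offset-sync idea can be illustrated with stock Kafka tooling: MirrorMaker 2 ships an offset-translation utility that a failing-over consumer can use to resume in another region. Everything below (cluster alias, group id, addresses) is hypothetical and assumes MirrorMaker 2 is replicating from the remote region into the local cluster.

```java
import java.time.Duration;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class RegionFailover {
    public static void main(String[] args) throws Exception {
        // Connection properties for the LOCAL (failover target) cluster.
        Map<String, Object> props = Map.of("bootstrap.servers", "kafka-us-east:9092");

        // Translate the consumer group's committed offsets from the remote region
        // into offsets that are valid on the local cluster.
        Map<TopicPartition, OffsetAndMetadata> localOffsets =
                RemoteClusterUtils.translateOffsets(
                        props, "us-west", "rides-consumer-group", Duration.ofSeconds(30));

        // A failing-over consumer would now seek() to these offsets and resume.
        localOffsets.forEach((tp, om) ->
                System.out.println(tp + " -> " + om.offset()));
    }
}
```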
Yupeng Fu is a staff engineer in Uber Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Flink Forward
http://flink-forward.org/kb_sessions/flink-in-zalandos-world-of-microservices/
In this talk we present Zalando’s microservices architecture, introduce Saiki – our next generation data integration and distribution platform on AWS and show how we employ stream processing with Apache Flink for near-real time business intelligence.
Zalando is one of the largest online fashion retailers in Europe. In order to secure our future growth and remain competitive in this dynamic market, we are transitioning from a monolithic to a microservices architecture and from a hierarchical to an agile organization.
We first have a look at how business intelligence processes have been working inside Zalando for the last years and present our current approach – Saiki. It is a scalable, cloud-based data integration and distribution infrastructure that makes data from our many microservices readily available for analytical teams.
We no longer live in a world of static data sets, but are instead confronted with endless streams of events that constantly inform us about relevant happenings from all over the enterprise. Processing these event streams enables us to do near-real-time business intelligence. In this context we evaluated Apache Flink vs. Apache Spark in order to choose the right stream processing framework. Given our requirements, we decided to use Flink as part of our technology stack, alongside Kafka and Elasticsearch.
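A minimal sketch of the entry point of such a pipeline, using Flink's current Kafka connector (a newer API than what was available at the time of this talk); the topic and group names are hypothetical.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NearRealTimeBi {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical topic of order events coming off the central event bus.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")
                .setTopics("order-events")
                .setGroupId("bi-monitoring")
                .setStartingOffsets(OffsetsInitializer.latest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "orders")
           // On-the-fly analysis, e.g. counting orders; results could go to Elasticsearch.
           .map(event -> 1L)
           .print();

        env.execute("near-real-time-bi");
    }
}
```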
With these technologies we are currently working on two use cases: a near real-time business process monitoring solution and streaming ETL.
Monitoring our business processes enables us to check whether the Zalando platform is technically working. It also helps us analyze data streams on the fly, e.g. order velocities and delivery velocities, and to monitor service level agreements.
On the other hand, streaming ETL is used to offload our relational data warehouse, which struggles with increasingly high loads. In addition, it reduces latency and facilitates platform scalability.
Finally, we have an outlook on our future use cases, e.g. near-real time sales and price monitoring. Another aspect to be addressed is to lower the entry barrier of stream processing for our colleagues coming from a relational database background.
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...Flink Forward
One of the main characteristics of a good streaming pipeline is correctness of event time processing. The real challenges appear when such a pipeline must be resilient to different types of failures. In this talk, we describe how Criteo runs Flink on one of the biggest Yarn clusters in Europe and processes 100k messages per second to account for the revenue of our platform within a delay of 5 minutes. The real-time revenue monitoring system keeps discrepancies under 1% and minimizes business impact in case of revenue anomalies.
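The theme of event-time correctness under lateness can be sketched with Flink's windowing primitives: allowed lateness keeps windows open for stragglers, and side outputs capture anything later still. The revenue schema, timestamps and thresholds below are hypothetical, not Criteo's actual job.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

public class RevenueAccounting {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        final OutputTag<Tuple2<String, Double>> lateTag =
                new OutputTag<Tuple2<String, Double>>("late-revenue") {};

        SingleOutputStreamOperator<Tuple2<String, Double>> revenue = env
                // Hypothetical (campaignId, amount) events; toy timestamps for the demo.
                .fromElements(Tuple2.of("c1", 0.5), Tuple2.of("c1", 0.7))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy
                                .<Tuple2<String, Double>>forBoundedOutOfOrderness(Duration.ofSeconds(30))
                                .withTimestampAssigner((e, ts) -> (long) (e.f1 * 1000)))
                .keyBy(t -> t.f0)
                .window(TumblingEventTimeWindows.of(Time.minutes(5)))
                .allowedLateness(Time.minutes(1))   // keep windows open for stragglers
                .sideOutputLateData(lateTag)        // never silently drop late events
                .sum(1);

        revenue.print();
        // Later-than-allowed events land here and can be reconciled separately
        // instead of silently distorting the revenue totals.
        revenue.getSideOutput(lateTag).print();

        env.execute("revenue-accounting");
    }
}
```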
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward
CloudStream is a fully managed service in Huawei Cloud. It supports several features, such as on-demand billing, easy-to-use Stream SQL in an online SQL editor, real-time testing of Stream SQL, multi-tenancy, security isolation, and more. We chose Apache Flink as the streaming compute platform. Inside a CloudStream cluster, Flink jobs can run on Yarn, Mesos, or Kubernetes. We have also extended Apache Flink to meet the needs of IoT scenarios, and Flink's reliability has been specifically tested in cooperation with universities. We continuously improve the infrastructure around CloudStream, including open source projects and cloud services. CloudStream differs from other real-time analysis cloud services, and we will also share its architecture, principles, and development process.
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward
The increasing number of available data sources in today's application stacks created a demand to continuously capture and process data from various sources to quickly turn high volume streams of raw data into actionable insights. Apache Flink addresses many of the challenges faced in this domain as it's specifically tailored to distributed computations over streams. While Flink provides all the necessary capabilities to process streaming data, provisioning and maintaining a Flink cluster still requires considerable effort and expertise. We will discuss how cloud services can remove most of the burden of running the clusters underlying your Flink jobs and explain how to build a real-time processing pipeline on top of AWS by integrating Flink with Amazon Kinesis and Amazon EMR. We will furthermore illustrate how to leverage the reliable, scalable, and elastic nature of the AWS cloud to effectively create and operate your real-time processing pipeline with little operational overhead.
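A minimal sketch of the Kinesis-to-Flink integration the talk covers, using the Flink Kinesis connector; the stream name and region are hypothetical, and credentials are assumed to come from the default AWS provider chain.

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer;
import org.apache.flink.streaming.connectors.kinesis.config.ConsumerConfigConstants;

public class KinesisToFlink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000); // checkpoints also track Kinesis shard positions

        Properties consumerConfig = new Properties();
        consumerConfig.setProperty(ConsumerConfigConstants.AWS_REGION, "eu-west-1");
        consumerConfig.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");

        // Hypothetical stream name; the consumer reads all shards in parallel.
        env.addSource(new FlinkKinesisConsumer<>(
                        "clickstream", new SimpleStringSchema(), consumerConfig))
           .map(String::toUpperCase) // placeholder transformation
           .print();

        env.execute("kinesis-pipeline");
    }
}
```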
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real-time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real-time key-value store inside Pinot that can sustain half a million QPS with all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward
In Zalando's microservice architecture, each service continuously generates streams of events for the purposes of inter-service communication or data integration. Some of these events describe business processes, e.g. a customer has placed an order or a parcel has been shipped. Out of this, the need to materialize event streams from the central event bus into persistent cloud storage evolved. The temporarily persisted data is then integrated into our relational data warehouse. In this talk we present a materialization engine backed by Apache Flink. We show how we employ Flink’s RESTful API, custom accumulators and stoppable sources to provide another API abstraction layer for deploying, monitoring and controlling our materialization jobs. Our jobs compact event streams depending on event properties and transform their complex JSON structures into flat files for easier integration into the data warehouse.
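As a small taste of driving Flink over its RESTful API, the sketch below polls the JobManager's /jobs/overview endpoint; the JobManager address is hypothetical, and the actual Zalando tooling is of course richer than this.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkJobMonitor {
    public static void main(String[] args) throws Exception {
        // Hypothetical JobManager address; /jobs/overview lists all jobs with status.
        HttpClient http = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://jobmanager:8081/jobs/overview"))
                .GET()
                .build();

        HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
        // A deployment tool can parse this JSON to decide whether to (re)deploy,
        // stop, or alert on a materialization job.
        System.out.println(response.body());
    }
}
```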
Taking a look under the hood of Apache Flink's relational APIs - Fabian Hueske
Apache Flink features two APIs based on relational algebra: a SQL interface and the so-called Table API, a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk takes a look under the hood of Flink’s relational APIs. The presentation shows the unified architecture for handling streaming and batch queries and explains how Flink translates queries from both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, the slides discuss potential improvements and give an outlook on future extensions and features.
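A minimal sketch of the duality the talk examines: the same aggregation expressed once in SQL and once in the Table API, both of which Flink translates into the same logical plan. The datagen-backed table is a hypothetical stand-in for a real source.

```java
import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

public class RelationalDuality {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical source; in practice this would be a Kafka- or file-backed table.
        tEnv.executeSql(
                "CREATE TABLE orders (product STRING, amount INT) WITH ('connector' = 'datagen')");

        // The same aggregation, once as SQL ...
        Table viaSql = tEnv.sqlQuery(
                "SELECT product, SUM(amount) AS total FROM orders GROUP BY product");

        // ... and once via the LINQ-style Table API.
        Table viaTableApi = tEnv.from("orders")
                .groupBy($("product"))
                .select($("product"), $("amount").sum().as("total"));

        // Both queries are translated to the same logical plan and optimized by Calcite.
        System.out.println(viaSql.explain());
        System.out.println(viaTableApi.explain());
    }
}
```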
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. It is heavily used at companies such as LinkedIn, Uber, and Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per second from Kafka, builds indexes in real time, and serves 100K+ queries per second while ensuring latency SLAs of milliseconds to sub-second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges: since Kafka maintains offsets per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing, and restarting the failed node required a lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
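The consistency problem can be sketched with plain Kafka consumers: because committed offsets live per consumer group, two groups replicating the same topic drift apart independently. The topic and group names below are hypothetical.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IndependentGroups {
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId);
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        // Two replicas of the same data, implemented as two consumer groups.
        // Kafka tracks a committed offset PER GROUP, so after a poll each group may
        // sit at a different position in the topic, and the replicas can diverge.
        try (KafkaConsumer<String, String> replicaA = consumerFor("pinot-replica-a");
             KafkaConsumer<String, String> replicaB = consumerFor("pinot-replica-b")) {
            replicaA.subscribe(List.of("events"));
            replicaB.subscribe(List.of("events"));
            replicaA.poll(Duration.ofSeconds(1));
            replicaB.poll(Duration.ofSeconds(1));
            replicaA.commitSync(); // commits only group A's position
            replicaB.commitSync(); // group B's position is tracked separately
        }
    }
}
```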
Taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offsets across multiple consumer groups. This allowed us to guarantee consistent data across all replicas and enabled us to copy data from another consumer group when adding nodes, recovering from node failures, or increasing replication.
In this talk, we will deep dive into the various challenges faced and the considerations that went into this design, and learn what makes Pinot resilient to failures in both Kafka brokers and Pinot components. We will introduce the new concept of "lockstep" sequencing, where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs and sustaining high ingestion throughput.
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...confluent
(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud-first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases. Data is ingested from a wide variety of sources and can move effortlessly between an on-premises datacenter, AWS and Google Cloud. The DataHub has evolved continuously over time to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding of new data sources is relatively painless, and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...Flink Forward
In this talk we walk you through our operational journey of building and operating large-scale production streaming applications at King. Our Flink streaming jobs consume over 40 billion events every day, maintaining over 10 TB of user state with Flink's strong consistency guarantees. We will focus on the evolution of our in-house stream processing platform and the different challenges we have faced at different times in its 1.5 years in production. Topics we'll cover:
- Overview of our platform
- Cluster environment and deployment procedures
- Flink job management (configuration, versioning, etc.)
- Monitoring and debugging practices
- Some Flink features that we find super amazing
New Approaches for Fraud Detection on Apache Kafka and KSQL - confluent
Speakers: Dale Kim, Sr. Director, Products/Solutions, Arcadia Data + Chong Yan, Solutions Architect, Confluent
When it comes to corporate fraud, early detection is integral to mitigating and preventing drastic damage.
Modern streaming data technologies like Apache Kafka® and Confluent KSQL, the streaming SQL engine for Apache Kafka, can help companies catch and detect fraud in real time instead of after the fact. Kafka is ideal for managing fast, incoming data points, and KSQL provides the de facto standard for reading that data. Combine this with Arcadia Data visualizations designed for modern data types, and you have a powerful foundation for combating fraud. A sketch of a typical detection query follows the list below.
You will learn:
- Why traditional batch-driven approaches to fraud detection are insufficient today
- Why Apache Kafka is widely used for real-time fraud detection
- How KSQL and real-time visualizations open more opportunities for searching for fraud
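As a sketch of the kind of KSQL the session points at, the statement below flags cards with unusually many transactions in a short window, submitted via the ksqlDB Java client; the server address, stream name and threshold are hypothetical.

```java
import io.confluent.ksql.api.client.Client;
import io.confluent.ksql.api.client.ClientOptions;

public class FraudDetection {
    public static void main(String[] args) throws Exception {
        // Hypothetical server; assumes a 'transactions' stream registered over a Kafka topic.
        Client client = Client.create(ClientOptions.create().setHost("ksqldb-server").setPort(8088));

        // Flag any card with more than 3 transactions inside a 30-second window.
        client.executeStatement(
            "CREATE TABLE possible_fraud AS " +
            "SELECT card_number, COUNT(*) AS txn_count " +
            "FROM transactions WINDOW TUMBLING (SIZE 30 SECONDS) " +
            "GROUP BY card_number HAVING COUNT(*) > 3 EMIT CHANGES;").get();

        client.close();
    }
}
```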
Time Series Analysis Using an Event Streaming Platform - Dr. Mirko Kämpf
Advanced time series analysis (TSA) requires specialized data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and for protecting safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation, we will look at typical data aggregation patterns and investigate how to apply analysis algorithms in the cloud. Finally, we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent Cloud.
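One of the simplest aggregation patterns mentioned, per-series statistics over fixed windows, can be sketched with Kafka Streams; the topic, serdes, and window size below are hypothetical choices.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedSeriesStats {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        // Hypothetical topic of time series readings keyed by series id.
        builder.stream("sensor-readings", Consumed.with(Serdes.String(), Serdes.Double()))
               .groupByKey()
               // One statistic per series per minute; mean or variance work the same
               // way with a richer aggregate type instead of count().
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1)))
               .count()
               .toStream()
               .foreach((windowedKey, count) ->
                       System.out.println(windowedKey + " -> " + count));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "tsa-window-stats");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```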
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...HostedbyConfluent
Several different frameworks have been developed to draw data from Kafka and maintain standard SQL over continually changing data. This provides an easy way to query and transform data - now accessible by orders of magnitude more users.
At the same time, using Standard SQL against changing data is a new pattern for many engineers and analysts. While the language hasn’t changed, we’re still in the early stages of understanding the power of SQL over Kafka - and in some interesting ways, this new pattern introduces some exciting new idioms.
In this session, we’ll start with some basic use cases of how Standard SQL can be effectively used over events in Kafka - including how these SQL engines can help teams that are brand new to streaming data get started. From there, we’ll cover a series of more advanced functions and their implications, including:
- WHERE clauses that contain time change the validity intervals of your data; you can programmatically introduce and retract records based on their payloads!
- LATERAL joins turn streams of query arguments into query results; they will automatically share their query plans and resources!
- GROUP BY aggregations can be applied to ever-growing data collections; reduce data that wouldn't even fit in a database in the first place.
We'll review in-production examples where each of these cases lets unmodified Standard SQL, run and maintained over data streams in Kafka, provide the functionality of bespoke stream processors. A sketch using one such engine follows.
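The talk is engine-agnostic; as one concrete instance, the sketch below uses Flink SQL to declare a Kafka-backed table and maintain an ordinary GROUP BY over it continuously. The topic, schema and broker address are hypothetical.

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlOverKafka {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Hypothetical Kafka-backed table; the engine maintains queries over it
        // as new records arrive on the topic.
        tEnv.executeSql(
            "CREATE TABLE pageviews (" +
            "  url STRING, ts TIMESTAMP(3)," +
            "  WATERMARK FOR ts AS ts - INTERVAL '5' SECOND" +
            ") WITH (" +
            "  'connector' = 'kafka', 'topic' = 'pageviews'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset', 'format' = 'json')");

        // An ordinary GROUP BY, continuously maintained over the unbounded stream:
        // the result is reduced as data flows in, never stored in full.
        tEnv.executeSql(
            "SELECT url, COUNT(*) AS views FROM pageviews GROUP BY url").print();
    }
}
```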
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward
SQL is the lingua franca of data processing and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink’s SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers. In our talk, we will discuss why you should and how you can (without being Alibaba or Uber) leverage the simplicity and power of SQL on Flink. We will start by exploring the use cases that Flink SQL was designed for and present real-world problems that it can solve. In particular, you will learn why unified batch and stream processing is important and what it means to run SQL queries on streams of data. Having explored why you should use Flink SQL, we will show how you can leverage its full potential. Recently, the Flink community has been working on a service that integrates a query interface, (external) table catalogs, and result-serving functionality for static, appending, and updating result sets. We will discuss the design and feature set of this query service and how it can be used for exploratory batch and streaming queries, ETL pipelines, and live-updating query results that serve applications, such as real-time dashboards. The talk concludes with a brief demo of a client running queries against the service.
Modern ETL Pipelines with Change Data Capture - Databricks
In this talk we’ll present how at GetYourGuide we’ve built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
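In production Debezium typically runs inside Kafka Connect and writes change events straight to Kafka; the embedded-engine sketch below is just a compact way to show binlog-driven CDC. All connection details are hypothetical.

```java
import java.util.Properties;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

public class BinlogTail {
    public static void main(String[] args) {
        // Hypothetical MySQL connection; offsets and schema history go to local files.
        Properties props = new Properties();
        props.setProperty("name", "inventory-cdc");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        props.setProperty("database.hostname", "mysql");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "debezium");
        props.setProperty("database.password", "secret");
        props.setProperty("database.server.id", "85744");
        props.setProperty("database.server.name", "inventory");
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
        props.setProperty("database.history", "io.debezium.relational.history.FileDatabaseHistory");
        props.setProperty("database.history.file.filename", "/tmp/dbhistory.dat");

        // Every committed row change (insert/update/delete) arrives as a JSON
        // change event read from the binlog.
        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
                .using(props)
                .notifying(record -> System.out.println(record.value()))
                .build();
        Executors.newSingleThreadExecutor().execute(engine);
    }
}
```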
We’ll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
Stream processing still evolves and changes at a speed that can make it hard to keep up with the developments. Being at the forefront of stream processing technology, the evolution of Apache Flink has mirrored many of these developments and continues to do so.
We will take you on a journey through the major milestones of stream processing technology in past years, diving into the latest additions that Apache Flink and other communities introduced to the stream processing landscape, such as Streamng SQL, Time Versioned Tables, cluster-library-duality, language portability, etc.
We will take a sneak peek into our crystal ball and present in what the Flink community is working on next.
Disaster Recovery for Multi-Region Apache Kafka Ecosystems at Uberconfluent
Speaker: Yupeng Fu, Staff Engineer, Uber
High availability and reliability are important requirements to Uber services, and the services shall tolerate datacenter failures in a region and fail over to another region. In this talk, we will present the active-active Apache Kafka® at Uber and how it facilitates disaster discovery across regions for Uber services. In particular, we will highlight the key components including topic replication, topic aggregation, offsets sync and then walk through several use cases of their disaster recovery strategy using active-active Kafka. Lastly, we will present several interesting challenges and the future work planned.
Yupeng Fu is a staff engineer in Uber Data Org leading the streaming data platform. Previously, he worked at Alluxio and Palantir, building distributed data analysis and storage platforms. Yupeng holds a B.S. and an M.S. from Tsinghua University and did his Ph.D. research on databases at UCSD.
Javier Lopez_Mihail Vieru - Flink in Zalando's World of Microservices - Flink...Flink Forward
http://flink-forward.org/kb_sessions/flink-in-zalandos-world-of-microservices/
In this talk we present Zalando’s microservices architecture, introduce Saiki – our next generation data integration and distribution platform on AWS and show how we employ stream processing with Apache Flink for near-real time business intelligence.
Zalando is one of the largest online fashion retailers in Europe. In order to secure our future growth and remain competitive in this dynamic market, we are transitioning from a monolithic to a microservices architecture and from a hierarchical to an agile organization.
We first have a look at how business intelligence processes have been working inside Zalando for the last years and present our current approach – Saiki. It is a scalable, cloud-based data integration and distribution infrastructure that makes data from our many microservices readily available for analytical teams.
We no longer live in a world of static data sets, but are instead confronted with endless streams of events that constantly inform us about relevant happenings from all over the enterprise. The processing of these event streams enables us to do near-real time business intelligence. In this context we have evaluated Apache Flink vs. Apache Spark in order to choose the right stream processing framework. Given our requirements, we decided to use Flink as part of our technology stack, alongside with Kafka and Elasticsearch.
With these technologies we are currently working on two use cases: a near real-time business process monitoring solution and streaming ETL.
Monitoring our business processes enables us to check if technically the Zalando platform works. It also helps us analyze data streams on the fly, e.g. order velocities, delivery velocities and to control service level agreements.
On the other hand, streaming ETL is used to relinquish resources from our relational data warehouse, as it struggles with increasingly high loads. In addition to that, it also reduces the latency and facilitates the platform scalability.
Finally, we have an outlook on our future use cases, e.g. near-real time sales and price monitoring. Another aspect to be addressed is to lower the entry barrier of stream processing for our colleagues coming from a relational database background.
Flink Forward Berlin 2018: Oleksandr Nitavskyi - "Data lossless event time st...Flink Forward
One of the main characteristics of the good streaming pipeline is correctness for event time processing. Real challenges become when such pipeline should be resilient to different types of failures. In this talk, we describe how Criteo runs Flink on one of the biggest Yarn clusters in Europe and computes 100k messages per second to acknowledge revenue of our platform within the delay of 5 minutes. Real-time revenue monitoring system calculates data under 1% of discrepancies and minimizes business impact in case of revenue anomalies.
Flink Forward San Francisco 2018: - Jinkui Shi and Radu Tudoran "Flink real-t...Flink Forward
CloudStream service is a Full Management Service in Huawei Cloud. Support several features, such as On-Demand Billing, easy-to-use Stream SQL in online SQL editor, test Stream SQL in real-time style, Multi-tenant, security isolation and so on. We choose Apache Flink as streaming compute platform. Inside of CloudStream Cluster, Flink job can run on Yarn, Mesos, Kubernetes. We also have extended Apache Flink to meet IoT scenario needs. There are specialized tests on Flink reliability with college cooperation. Finally continuously improve the infrastructure around CS including open source projects and cloud services. CloudStream is different with any other real-time analysis cloud service. The development process can also be shared at architecture and principles.
Flink Forward Berlin 2017: Steffen Hausmann - Build a Real-time Stream Proces...Flink Forward
The increasing number of available data sources in today's application stacks created a demand to continuously capture and process data from various sources to quickly turn high volume streams of raw data into actionable insights. Apache Flink addresses many of the challenges faced in this domain as it's specifically tailored to distributed computations over streams. While Flink provides all the necessary capabilities to process streaming data, provisioning and maintaining a Flink cluster still requires considerable effort and expertise. We will discuss how cloud services can remove most of the burden of running the clusters underlying your Flink jobs and explain how to build a real-time processing pipeline on top of AWS by integrating Flink with Amazon Kinesis and Amazon EMR. We will furthermore illustrate how to leverage the reliable, scalable, and elastic nature of the AWS cloud to effectively create and operate your real-time processing pipeline with little operational overhead.
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ...HostedbyConfluent
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
aFurthermore, it is financially critical to get up-to-date, accurate analytics over all records. Due to the changing nature of real time transactions, it is impossible to pre-compute the analytics as a fixed time series. We have overcome the challenge by creating a real time key-value store inside Pinot that can sustain half million QPS with all the financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Flink Forward Berlin 2017: Mihail Vieru - A Materialization Engine for Data I...Flink Forward
In Zalando's microservice architecture, each service continuously generates streams of events for the purposes of inter-service communication or data integration. Some of these events describe business processes, e.g. a customer has placed an order or a parcel has been shipped. Out of this, the need to materialize event streams from the central event bus into persistent cloud storage evolved. The temporarily persisted data is then integrated into our relational data warehouse. In this talk we present a materialization engine backed by Apache Flink. We show how we employ Flink’s RESTful API, custom accumulators and stoppable sources to provide another API abstraction layer for deploying, monitoring and controlling our materialization jobs. Our jobs compact event streams depending on event properties and transform their complex JSON structures into flat files for easier integration into the data warehouse.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
Apache Flink features two APIs which are based on relational algebra, a SQL interface and the so-called Table API, which is a LINQ-style API available for Scala and Java. Relational APIs are interesting because they are easy to use and queries can be automatically optimized and translated into efficient runtime code. Flink offers both APIs for streaming and batch data sources. This talk takes a look under the hood of Flink’s relational APIs. The presentation shows the unified architecture to handle streaming and batch queries and explain how Flink translates queries of both APIs into the same representation, leverages Apache Calcite to optimize them, and generates runtime code for efficient execution. Finally, the slides discuss potential improvements and give an outlook for future extensions and features.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. This is heavily used at companies such as LinkedIn, Uber, Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per sec from Kafka, builds indexes in real-time and serves 100K+ queries per second while ensuring latency SLA of millisecond to sub second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka Consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges - since Kafka maintains offset per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node needed lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and considerations that went into this design, and learn what makes Pinot resilient to failures both in Kafka Brokers and Pinot Components. We will introduce the new concept of ""lockstep"" sequencing where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs, and also withstanding high throughput and ingestion.
Bringing Streaming Data To The Masses: Lowering The “Cost Of Admission” For Y...confluent
(Bob Lehmann, Bayer) Kafka Summit SF 2018
You’ve built your streaming data platform. The early adopters are “all in” and have developed producers, consumers and stream processing apps for a number of use cases. A large percentage of the enterprise, however, has expressed interest but hasn’t made the leap. Why?
In 2014, Bayer Crop Science (formerly Monsanto) adopted a cloud first strategy and started a multi-year transition to the cloud. A Kafka-based cross-datacenter DataHub was created to facilitate this migration and to drive the shift to real-time stream processing. The DataHub has seen strong enterprise adoption and supports a myriad of use cases. Data is ingested from a wide variety of sources and the data can move effortlessly between an on premise datacenter, AWS and Google Cloud. The DataHub has evolved continuously over time to meet the current and anticipated needs of our internal customers. The “cost of admission” for the platform has been lowered dramatically over time via our DataHub Portal and technologies such as Kafka Connect, Kubernetes and Presto. Most operations are now self-service, onboarding of new data sources is relatively painless and stream processing via KSQL and other technologies is being incorporated into the core DataHub platform.
In this talk, Bob Lehmann will describe the origins and evolution of the Enterprise DataHub with an emphasis on steps that were taken to drive user adoption. Bob will also talk about integrations between the DataHub and other key data platforms at Bayer, lessons learned and the future direction for streaming data and stream processing at Bayer.
Flink Forward Berlin 2017: Gyula Fora - Building and operating large-scale st...Flink Forward
In this talk we walk you through our operational journey for building and operating large-scale production streaming applications at King. Our Flink streaming jobs consume over 40 billion events every day maintaining over 10 TBs of user state with Flink's strong consistency guarantees. We will focus on the evolution of our in-house stream processing platform, and the different challenges we have faced at different times in it's 1.5 years of production lifeline. Topics we'll cover: - Overview of our platform - Cluster environment, and deployment procedures - Flink job management (configuration, versioning, etc.) - Monitoring and debugging practices - Some Flink features that we find super amazing.
New Approaches for Fraud Detection on Apache Kafka and KSQLconfluent
Speakers: Dale Kim, Sr. Director, Products/Solutions, Arcadia Data + Chong Yan, Solutions Architect, Confluent
When it comes to corporate fraud, early detection is integral to mitigating and preventing drastic damage.
Modern streaming data technologies like Apache Kafka® and Confluent KSQL, the streaming SQL engine for Apache Kafka, can help companies catch and detect fraud in real time instead of after the fact. Kafka is ideal for managing fast, incoming data points, and KSQL provides the de facto standard for reading that data. Combine this with Arcadia Data visualizations designed for modern data types, and you have a powerful foundation for combating fraud.
You will learn:
-Why traditional batch-driven approaches to fraud detection are insufficient today
-Why Apache Kafka is widely used for real-time fraud detection
-How KSQL and real-time visualizations open more opportunities for searching for fraud
Time Series Analysis Using an Event Streaming PlatformDr. Mirko Kämpf
Advanced time series analysis (TSA) requires very special data preparation procedures to convert raw data into useful and compatible formats.
In this presentation you will see some typical processing patterns for time series based research, from simple statistics to reconstruction of correlation networks.
The first case is relevant for anomaly detection and to protect safety.
Reconstruction of graphs from time series data is a very useful technique to better understand complex systems like supply chains, material flows in factories, information flows within organizations, and especially in medical research.
With this motivation we will look at typical data aggregation patterns. We investigate how to apply analysis algorithms in the cloud. Finally we discuss a simple reference architecture for TSA on top of the Confluent Platform or Confluent cloud.
How to use Standard SQL over Kafka: From the basics to advanced use cases | F...HostedbyConfluent
Several different frameworks have been developed to draw data from Kafka and maintain standard SQL over continually changing data. This provides an easy way to query and transform data - now accessible by orders of magnitude more users.
At the same time, using Standard SQL against changing data is a new pattern for many engineers and analysts. While the language hasn’t changed, we’re still in the early stages of understanding the power of SQL over Kafka - and in some interesting ways, this new pattern introduces some exciting new idioms.
In this session, we’ll start with some basic use cases of how Standard SQL can be effectively used over events in Kafka- including how these SQL engines can help teams that are brand new to streaming data get started. From there, we’ll cover a series of more advanced functions and their implications, including:
- WHERE clauses that contain time change the validity intervals of your data; you can programmatically introduce and retract records based on their payloads!
- LATERAL joins turn streams of query arguments into query results; they will automatically share their query plans and resources!
- GROUP BY aggregations can be applied to ever-growing data collections; reduce data that wouldn't even fit in a database in the first place.
We'll review in-production examples where each of these cases make unmodified Standard SQL, run and maintain over data streams in Kafka, and provide the functionality of bespoke stream processors.
Flink Forward San Francisco 2018: Fabian Hueske & Timo Walther - "Why and how...Flink Forward
SQL is the lingua franca of data processing and everybody working with data knows SQL. Apache Flink provides SQL support for querying and processing batch and streaming data. Flink’s SQL support powers large-scale production systems at Alibaba, Huawei, and Uber. Based on Flink SQL, these companies have built systems for their internal users as well as publicly offered services for paying customers. In our talk, we will discuss why you should and how you can (not being Alibaba or Uber) leverage the simplicity and power of SQL on Flink. We will start exploring the use cases that Flink SQL was designed for and present real-world problems that it can solve. In particular, you will learn why unified batch and stream processing is important and what it means to run SQL queries on streams of data. After we explored why you should use Flink SQL, we will show how you can leverage its full potential. Since recently, the Flink community is working on a service that integrates a query interface, (external) table catalogs, and result serving functionality for static, appending, and updating result sets. We will discuss the design and feature set of this query service and how it can be used for exploratory batch and streaming queries, ETL pipelines, and live updating query results that serve applications, such as real-time dashboards. The talk concludes with a brief demo of a client running queries against the service.
Modern ETL Pipelines with Change Data CaptureDatabricks
In this talk we'll present how at GetYourGuide we've built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow, which can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture layer, which reads database binlogs and streams these changes directly to Kafka. As having data once a day is not enough anymore for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
We'll walk the audience through the steps we followed to architect and develop such a solution using Databricks to reduce operation time. By building this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data, and protecting our nights of sleep.
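For readers new to Debezium, the following sketch shows roughly what consuming its change events from Kafka looks like in plain Java. The topic name follows Debezium's server.schema.table convention from its tutorial, and the JSON envelope fields (payload, op, after) are the connector's defaults; treat the details as assumptions, not GetYourGuide's actual code.

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class DebeziumChangeReader {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "etl-cdc-reader");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        ObjectMapper mapper = new ObjectMapper();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Topic name follows Debezium's server.schema.table convention.
            consumer.subscribe(List.of("dbserver1.inventory.customers"));
            while (true) {
                ConsumerRecords<String, String> records =
                        consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> rec : records) {
                    if (rec.value() == null) continue; // tombstone record
                    JsonNode payload = mapper.readTree(rec.value()).path("payload");
                    // op is "c" (insert), "u" (update), "d" (delete), "r" (snapshot read)
                    String op = payload.path("op").asText();
                    JsonNode after = payload.path("after"); // row state after the change
                    System.out.println(op + " -> " + after);
                }
            }
        }
    }
}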
Digital Transformation Mindset - More Than Just Technologyconfluent
Many enterprises faced with siloed, batch-oriented, legacy systems struggle to compete in this new digital-first world. Adhering to the 'If it's not broken, don't fix it' mentality leaves the door wide open for native digital challengers to grow and succeed. To stay competitive, your organization must respond in real time to every customer experience, transaction, sale, and market movement. But how do you get there? First, you must change your mindset.
As streaming platforms become central to data strategies, companies both small and large are re-thinking their enterprise architecture with real-time context at the forefront. Monoliths are evolving into microservices. Datacenters are moving to the cloud. What was once a ‘batch’ mindset is quickly being replaced with stream processing as the demands of the business impose real-time requirements on technology leaders.
Join Argyle, in partnership with Confluent, in our 2018 CIO Virtual Event: The Digital Transformation Mindset – More Than Just Technology. During the webinar we’ll learn how leading companies across industries rely on a streaming platform to make event-driven architectures central to:
• How data strategies and IT initiatives are improving digital customer experiences
• How executives are reducing risk with real-time monitoring and anomaly detection
• How organizations are increasing operational agility with microservices and IoT architectures
To deal with customers who expect a seamless omnichannel experience, increased regulation, and the speed with which innovative fintechs enter the market, ING has formulated a customer-centric strategy based on data and analytics.
Last year we talked about the new architecture ING developed, the ING Data Lake, and about how, in parallel, the Hadoop-based Big Data paradigm appeared within ING and was mapped onto the Data Lake architecture to make sure Hadoop is leveraged to the maximum.
This year we want to tell you how the international working group helped realize the advanced analytics pattern on the ING private cloud, without prior management approval.
This presentation will discuss the community strategy, how to stay under the radar, how to surface when the actual content is strong enough to force change, open issues, and the private cloud challenges ING is dealing with. Join us on this ride from community idea through architecture to private cloud implementation, with some organizational challenges along the way.
Digital revolution is disrupting businesses like never before! The ability to extract actionable insight from large amounts of disparate data has become the determining factor of competitive advantage! Every day, new business models are created around data, forcing the incumbents to reinvent themselves to stay relevant. Consumer-facing businesses felt this pressure early on, but eventually every business needs to be data driven. But what is the best strategy to address this digital disruption? Our experience says core data infrastructure modernization is the logical starting point! In this session, we will share trends, strategies and our experience on rejuvenating the data integration landscape to address digital disruptions.
Flink Forward Berlin 2017: Bas Geerdink, Martijn Visser - Fast Data at ING - ...Flink Forward
ING is using Apache Flink for creating streaming analytics ('fast data') solutions. We created a platform with Flink and Kafka that offers high throughput and low latency, ideally suited for complex and demanding use cases in the international bank such as customer notifications and fraud detection. These use cases require fast data processing and a business rules engine and/or machine learning evaluation system. Integrating these components together in an always-on, distributed architecture can be challenging. In this talk, we'll start with a brief overview of the use cases. You'll learn why ING chose Flink for these use cases, and see the architecture of the streaming data platform in depth. Finally, we'll share some lessons learned and useful insights for organizations who embark on a similar journey.
Data is both our most valuable asset and our biggest ongoing challenge. As data grows in volume, variety and complexity, across applications, clouds and siloed systems, traditional ways of working with data no longer work.
Unlike traditional databases, which arrange data in rows, columns and tables, Neo4j has a flexible structure defined by stored relationships between data records.
We'll discuss the primary use cases for graph databases, explore the properties of Neo4j that make those use cases possible, look into the visualisation of graphs, and introduce how to write queries.
Webinar, 23 July 2020
We are an IT consulting company providing services to clients across geographies in Data Engineering, AI/ML, Cloud & DevOps, Platform Engineering, and Process Hyperautomation.
Agile Mumbai 2022
Real-Time Insights and AI for better Products, Customer experience and Resilient Platform
Balvinder Kaur
Principal Consultant, Thoughtworks
Sushant Joshi
Product Manager, Thoughtworks
Using Kafka in Your Organization with Real-Time User Insights for a Customer ...confluent
(Chris Maier + Steven Royster, West Monroe Partners) Kafka Summit SF 2018
The value of real-time data is growing as an increasing number of companies look to provide a comprehensive experience for their customers. Utilizing Kafka in key facets of your organization will yield greater customer satisfaction and promote a better understanding of user interactions. As data streaming is becoming more prevalent in a wide variety of industries, companies are seeking to modernize their tech stacks by employing the extensible, scalable infrastructure afforded by Kafka.
Over the past few months, we have successfully developed a containerized Kafka implementation at a major healthcare provider. In addition, we created producers to publish messages to the Kafka cluster and consumers to receive them on the other end. By capturing a plethora of data around customer activity, we created opportunities for the business to act upon real-time metrics in order to provide an improved customer experience.
In this talk, we will cover the user-related data sources we connected to Kafka, the reasons we chose them, and how the insights gained from each source can be leveraged in your business. You will walk out understanding how capturing a wide variety of customer activity data can create opportunities for the business to act on real-time metrics in order to provide an improved customer experience.
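As an illustrative producer (not the project's actual code), the sketch below publishes a customer-activity event to Kafka, keyed by user id so that each user's events stay ordered within a partition.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;

public class ActivityEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all"); // don't lose activity events on broker failover

        // Hypothetical topic and payload; keying by user id keeps each
        // user's activity ordered within its partition.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>(
                    "customer-activity",
                    "user-42",
                    "{\"event\":\"page_view\",\"page\":\"/claims\",\"ts\":1538000000}"));
        }
    }
}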
Building a fully managed stream processing platform on Flink at scale for Lin...Flink Forward
Apache Flink is a distributed stream processing framework that allows users to process and analyze data in real-time. At LinkedIn, we developed a fully managed stream processing platform on Flink running on K8s to power hundreds of stream processing pipelines in production. This platform is the backbone for other infra systems such as Search, Espresso (our internal document store), and feature management. We provide a rich authoring and testing environment which allows users to create, test, and deploy their streaming jobs in a self-serve fashion within minutes. Users can focus on their business logic, leaving the Flink platform to take care of management aspects such as split deployment, resource provisioning, auto-scaling, job monitoring, alerting, failure recovery and much more. In this talk, we will introduce the overall platform architecture, highlight the unique value propositions that it brings to stream processing at LinkedIn and share the experiences and lessons we have learned.
Evening out the uneven: dealing with skew in FlinkFlink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by Jun Qin & Karl Friedrich
“Alexa, be quiet!”: End-to-end near-real time model building and evaluation i...Flink Forward
Flink Forward San Francisco 2022.
To improve Amazon Alexa experiences and support machine learning inference at scale, we built an automated end-to-end solution for incremental model building or fine-tuning machine learning models through continuous learning, continual learning, and/or semi-supervised active learning. Customer privacy is our top concern at Alexa, and as we build solutions, we face unique challenges when operating at scale such as supporting multiple applications with tens of thousands of transactions per second with several dependencies including near-real time inference endpoints at low latencies. Apache Flink helps us transform and discover metrics in near-real time in our solution. In this talk, we will cover the challenges that we faced, how we scale the infrastructure to meet the needs of ML teams across Alexa, and go into how we enable specific use cases that use Apache Flink on Amazon Kinesis Data Analytics to improve Alexa experiences to delight our customers while preserving their privacy.
by Aansh Shah
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ...Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs may not be avoidable with the current API, e.g., for efficient event-time stream-sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by Nico Kruber
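To make the cost concrete, here is a hedged sketch of the kind of workaround the abstract alludes to: event-time sorting with MapState plus timers. Names are illustrative; with RocksDB, every update re-serializes a whole list, which is exactly what the new primitive is designed to avoid.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

// Sorts events per key by event time, using today's workaround:
// a MapState keyed by timestamp plus one event-time timer per timestamp.
public class EventTimeSorter extends KeyedProcessFunction<String, String, String> {

    private transient MapState<Long, List<String>> buffer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("sort-buffer", Types.LONG, Types.LIST(Types.STRING)));
    }

    @Override
    public void processElement(String event, Context ctx, Collector<String> out) throws Exception {
        long ts = ctx.timestamp(); // assumes timestamps/watermarks are assigned upstream
        List<String> atTs = buffer.get(ts);
        if (atTs == null) {
            atTs = new ArrayList<>();
            // First event for this timestamp: fire once the watermark passes it.
            ctx.timerService().registerEventTimeTimer(ts);
        }
        atTs.add(event);
        buffer.put(ts, atTs); // the full list is re-serialized on every update
    }

    @Override
    public void onTimer(long ts, OnTimerContext ctx, Collector<String> out) throws Exception {
        for (String event : buffer.get(ts)) {
            out.collect(event); // emitted in event-time order
        }
        buffer.remove(ts);
    }
}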
Introducing the Apache Flink Kubernetes OperatorFlink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to manage Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use-cases through in-depth examples.
by Thomas Weise
Flink Forward San Francisco 2022.
Resource Elasticity is a frequently requested feature in Apache Flink: Users want to be able to easily adjust their clusters to changing workloads for resource efficiency and cost saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, later releases added more improvements to make the feature production ready. In this talk, we’ll explain scenarios to deploy Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, and also potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented, and if time permits, conclude with a short demo.
by Robert Metzger
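A minimal sketch of a job intended for Reactive Mode follows. Note that the mode is normally enabled on a standalone application cluster via the scheduler-mode: reactive setting; the programmatic configuration here only makes the knob visible, and the job itself is hypothetical.

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.configuration.SchedulerExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ReactiveModeJob {
    public static void main(String[] args) throws Exception {
        // Equivalent to scheduler-mode: reactive in the cluster configuration;
        // Reactive Mode requires a standalone application cluster.
        Configuration conf = new Configuration();
        conf.set(JobManagerOptions.SCHEDULER_MODE, SchedulerExecutionMode.REACTIVE);

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(conf);

        // The job declares no fixed parallelism; Reactive Mode rescales it
        // to use whatever TaskManagers are currently available.
        env.fromSequence(0, Long.MAX_VALUE)
           .map(n -> n * 2)
           .print();

        env.execute("reactive-mode-demo");
    }
}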
Dynamically Scaling Data Streams across Multiple Kafka Clusters with Zero Fli...Flink Forward
Flink Forward San Francisco 2022.
Flink consumers read from Kafka as a scalable, high throughput, and low latency data source. However, there are challenges in scaling out data streams where migration and multiple Kafka clusters are required. Thus, we introduced a new Kafka source to read sharded data across multiple Kafka clusters in a way that conforms well with elastic, dynamic, and reliable infrastructure. In this presentation, we will present the source design and how the solution increases application availability while reducing maintenance toil. Furthermore, we will describe how we extended the existing KafkaSource to provide mechanisms to read logical streams located on multiple clusters, to dynamically adapt to infrastructure changes, and to perform transparent cluster migrations and failover.
by Mason Chen
One sink to rule them all: Introducing the new Async SinkFlink Forward
Flink Forward San Francisco 2022.
Next time you want to integrate with a new destination for a demo, concept or production application, the Async Sink framework will bootstrap development, allowing you to move quickly without compromise. In Flink 1.15 we introduced the Async Sink base (FLIP-171), with the goal to encapsulate common logic and allow developers to focus on the key integration code. The new framework handles things like request batching, buffering records, applying backpressure, retry strategies, and at-least-once semantics. It allows you to focus on your business logic, rather than spending time integrating with your downstream consumers. During the session we will dive deep into the internals to uncover how it works, why it was designed this way, and how to use it. We will code up a new sink from scratch and demonstrate how to quickly push data to a destination. At the end of this talk you will be ready to start implementing your own Flink sink using the new Async Sink framework.
by Steffen Hausmann & Danny Cranmer
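The division of labour can be illustrated with a toy class; this is deliberately not Flink's real API, just a sketch of the idea: the base handles buffering, batching, backpressure and retries, while the connector author supplies only the submit-one-batch step.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Toy stand-in, NOT Flink's real API: shows the division of labour in the
// Async Sink framework. The base handles buffering, batching, backpressure
// and (here, naive) retries; only submitBatch is integration-specific.
public class ToyAsyncSink<T> {
    private final BlockingQueue<T> buffer;       // bounded: a full buffer backpressures writers
    private final int maxBatchSize;
    private final Consumer<List<T>> submitBatch; // e.g. one bulk PUT to the destination

    public ToyAsyncSink(int maxBuffered, int maxBatchSize, Consumer<List<T>> submitBatch) {
        this.buffer = new LinkedBlockingQueue<>(maxBuffered);
        this.maxBatchSize = maxBatchSize;
        this.submitBatch = submitBatch;
    }

    public void write(T element) throws InterruptedException {
        buffer.put(element); // blocks when the buffer is full: upstream slows down
    }

    public void flush() {
        List<T> batch = new ArrayList<>(maxBatchSize);
        while (buffer.drainTo(batch, maxBatchSize) > 0) {
            try {
                submitBatch.accept(batch); // at-least-once: re-sent on failure
            } catch (RuntimeException e) {
                batch.forEach(buffer::offer); // naive retry: requeue and stop this flush
                return;
            }
            batch = new ArrayList<>(maxBatchSize);
        }
    }
}

Flink's actual base class layers byte-based batch limits, time-based flushing, and checkpoint integration on top of the same idea.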
Tuning Apache Kafka Connectors for Flink.pptxFlink Forward
Flink Forward San Francisco 2022.
In normal situations, the default Kafka consumer and producer configuration options work well. But we all know life is not all roses and rainbows, and in this session we'll explore a few knobs that can save the day in atypical scenarios. First, we'll take a detailed look at the parameters available when reading from Kafka. We'll inspect the parameters that help us quickly spot an application lock or crash, the ones that can significantly improve performance, and the ones to touch with gloves on, since they could cause more harm than benefit. Moreover, we'll explore the partitioning options and discuss when diverging from the default strategy is needed. Next, we'll discuss the Kafka sink. After browsing the available options we'll dive deep into understanding how to approach use cases like sinking enormous records, managing spikes, and handling small but frequent updates. If you want to understand how to make your application survive when the sky is dark, this session is for you!
by Olena Babenko
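A few of the knobs in question, with illustrative values only; the right numbers depend on record sizes, latency budget, and broker capacity. The property keys are standard Kafka client options, plus one Flink KafkaSource property.

import java.util.Properties;

public class KafkaTuning {
    // Consumer-side knobs: trade fetch batching for latency.
    static Properties consumerProps() {
        Properties p = new Properties();
        p.put("max.poll.records", "2000");      // larger polls for throughput
        p.put("fetch.min.bytes", "65536");      // let the broker batch fetches
        p.put("fetch.max.wait.ms", "100");      // ...but cap the added latency
        p.put("partition.discovery.interval.ms", "60000"); // Flink KafkaSource: pick up new partitions
        return p;
    }

    // Producer-side knobs: batching, compression, and record-size limits.
    static Properties producerProps() {
        Properties p = new Properties();
        p.put("batch.size", "262144");          // bigger batches, fewer requests
        p.put("linger.ms", "50");               // wait briefly to fill batches
        p.put("compression.type", "lz4");       // cheap CPU for big network savings
        p.put("max.request.size", "5242880");   // allow "enormous records" (5 MiB)
        return p;
    }
}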
Flink powered stream processing platform at PinterestFlink Forward
Flink Forward San Francisco 2022.
Pinterest is a visual discovery engine that serves over 433MM users. Stream processing allows us to unlock value from realtime data for Pinners. At Pinterest, we adopted Flink as the unified stream processing engine. In this talk, we will share our journey in building a stream processing platform with Flink and how we onboarded critical use cases to the platform. Pinterest has supported 90+ near-realtime streaming applications. We will cover the problem statement, how we evaluated potential solutions, and our decision to build the framework.
by Rainie Li & Kanchi Masalia
Flink Forward San Francisco 2022.
This talk will take you on the long journey of Apache Flink into the cloud-native era. It started all the way from where Hadoop and YARN were the standard way of deploying and operating data applications.
We're going to deep dive into the cloud-native set of principles and how they map to the Apache Flink internals and recent improvements. We'll cover fast checkpointing, fault tolerance, resource elasticity, minimal infrastructure dependencies, industry-standard tooling, ease of deployment and declarative APIs.
After this talk you'll get a broader understanding of the operational requirements for a modern streaming application and where the current limits are.
by David Moravek
Where is my bottleneck? Performance troubleshooting in FlinkFlink Forward
Flink Forward San Francisco 2022.
In this talk, we will cover various topics around performance issues that can arise when running a Flink job and how to troubleshoot them. We'll start with the basics, like understanding what the job is doing and what backpressure is. Next, we will see how to identify bottlenecks and which tools or metrics can be helpful in the process. Finally, we will also discuss potential performance issues during the checkpointing or recovery process, as well as some tips and Flink features that can speed up checkpointing and recovery times.
by Piotr Nowojski
Using the New Apache Flink Kubernetes Operator in a Production DeploymentFlink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, using the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we provide:
- A brief overview of Kubernetes operators and their benefits
- An introduction to the five levels of the operator maturity model
- An introduction to the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Dockerfile modifications you can make to swap out UBI images and Java of the underlying Flink Operator container
- Enhancements we're making in versioning/upgradeability/stability and security
- A demo of the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Lessons learned
- Q&A
by James Busche & Ted Chang
Flink Forward San Francisco 2022.
The Table API is one of the most actively developed components of Flink in recent times. Inspired by databases and SQL, it encapsulates concepts many developers are familiar with. It can be used with both bounded and unbounded streams in a unified way. But from afar it can be difficult to keep track of what this API is capable of and how it relates to Flink's other APIs. In this talk, we will explore the current state of the Table API. We will show how it can be used as a batch processor, a changelog processor, or a streaming ETL tool with many built-in functions and operators for deduplicating, joining, and aggregating data. By comparing it to the DataStream API we will highlight differences and elaborate on when to use which API. We will demonstrate hybrid pipelines in which both APIs interact with one another and contribute their unique strengths. Finally, we will take a look at some of the most recent additions as a first step to stateful upgrades.
by David Anderson
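A hedged sketch of such a hybrid pipeline: start in the DataStream API, aggregate declaratively in the Table API, and return to a changelog stream. Names and data are illustrative.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

import static org.apache.flink.table.api.Expressions.$;

public class HybridPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Start in the DataStream API...
        DataStream<String> words = env.fromElements("flink", "table", "flink");

        // ...switch to the Table API for declarative aggregation...
        Table counts = tEnv.fromDataStream(words).as("word")
                .groupBy($("word"))
                .select($("word"), $("word").count().as("cnt"));

        // ...and back to a changelog DataStream (insert/update rows).
        DataStream<Row> changelog = tEnv.toChangelogStream(counts);
        changelog.print();

        env.execute("hybrid-table-datastream");
    }
}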
Flink Forward San Francisco 2022.
Based on the new Flink-Pulsar connector, we implemented Flink's Table API and Catalog to help users interact with the Pulsar cluster via Flink SQL easily. We would like to go through the design and implementation of the SQL connector in the following aspects:
1. Two different modes of using Pulsar as a metadata store
2. Data format transformation and management
3. SQL semantics support within Pulsar context
by Sijie Guo & Neng Lu
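A speculative sketch of what using the connector from Flink SQL might look like; the option keys are assumptions based on the talk's description and should be checked against the connector version you actually use.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class PulsarSqlExample {
    public static void main(String[] args) {
        TableEnvironment tEnv =
                TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Option keys are assumptions; consult the Flink-Pulsar connector docs.
        tEnv.executeSql(
                "CREATE TABLE user_events (" +
                "  user_id STRING," +
                "  action  STRING" +
                ") WITH (" +
                "  'connector' = 'pulsar'," +
                "  'topics' = 'persistent://public/default/user-events'," +
                "  'service-url' = 'pulsar://localhost:6650'," +
                "  'admin-url' = 'http://localhost:8080'," +
                "  'format' = 'json')");

        tEnv.executeSql(
                "SELECT action, COUNT(*) AS cnt " +
                "FROM user_events GROUP BY action").print();
    }
}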
Dynamic Rule-based Real-time Market Data AlertsFlink Forward
Flink Forward San Francisco 2022.
At Bloomberg, we deal with high volumes of real-time market data. Our clients expect to be notified of any anomalies in this market data, which may indicate volatile movements in the markets, notable trades, forthcoming events, or system failures. The parameters for these alerts are always evolving and our clients can update them dynamically. In this talk, we'll cover how we utilized the open source Apache Flink and Siddhi SQL projects to build a distributed, scalable, low-latency and dynamic rule-based, real-time alerting system to solve our clients' needs. We'll also cover the lessons we learned along our journey.
by Ajay Vyasapeetam & Madhuri Jain
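The generic Flink building block for dynamic rules is broadcast state. The sketch below is a simplified stand-in, not Bloomberg's Flink-plus-Siddhi implementation: threshold updates are broadcast to all market-data subtasks, which evaluate every price against the latest rule.

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class DynamicThresholdAlerts {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Double> prices = env.fromElements(101.0, 250.0, 99.5); // stand-in market data
        DataStream<Double> thresholds = env.fromElements(200.0);          // stand-in rule updates

        MapStateDescriptor<String, Double> ruleState =
                new MapStateDescriptor<>("rules", Types.STRING, Types.DOUBLE);
        BroadcastStream<Double> ruleBroadcast = thresholds.broadcast(ruleState);

        prices.connect(ruleBroadcast)
              .process(new BroadcastProcessFunction<Double, Double, String>() {
                  @Override
                  public void processElement(Double price, ReadOnlyContext ctx,
                                             Collector<String> out) throws Exception {
                      Double limit = ctx.getBroadcastState(ruleState).get("limit");
                      if (limit != null && price > limit) {
                          out.collect("ALERT: price " + price + " above " + limit);
                      }
                  }

                  @Override
                  public void processBroadcastElement(Double newLimit, Context ctx,
                                                      Collector<String> out) throws Exception {
                      // Rule updates arrive on the broadcast side and update all subtasks.
                      ctx.getBroadcastState(ruleState).put("limit", newLimit);
                  }
              })
              .print();

        env.execute("dynamic-rule-alerts");
    }
}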
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
Flink Forward San Francisco 2022.
At Stripe we have created a complete end-to-end exactly-once processing pipeline to process financial data at scale, by combining the exactly-once capabilities of Flink, Kafka, and Pinot. The pipeline provides an exactly-once guarantee, end-to-end latency within a minute, deduplication against hundreds of billions of keys, and sub-second query latency against the whole dataset with trillions of rows. In this session we will discuss the technical challenges of designing, optimizing, and operating the whole pipeline, including Flink, Kafka, and Pinot. We will also share our lessons learned and the benefits gained from exactly-once processing.
by Xiang Zhang & Pratyush Sharma & Xiaoman Dong
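On the Flink side of such a pipeline, the guarantee hinges on exactly-once checkpointing plus a transactional Kafka sink. A minimal sketch with hypothetical topic names follows; it is not Stripe's actual code.

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ExactlyOncePipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Kafka transactions commit on checkpoint completion, so the
        // checkpoint interval bounds end-to-end latency.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("settled-transactions")   // hypothetical topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
                .setTransactionalIdPrefix("fin-pipeline")   // must be unique per job
                .build();

        env.fromElements("{\"txn\":1}", "{\"txn\":2}").sinkTo(sink);
        env.execute("exactly-once-financial-pipeline");
    }
}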
Processing Semantically-Ordered Streams in Financial ServicesFlink Forward
Flink Forward San Francisco 2022.
What if my data is already in order? Stream Processing has given us an elegant and powerful solution for running analytic queries and logic over high volumes of continuously arriving data. However, in both Apache Flink and Apache Beam, the notion of time-ordering is baked in at a very low level, making it difficult to express computations that are interested in a semantic-, rather than time-ordering of the data. In financial services, what often matters the most about the data moving between systems is not when the data was created, but in what order, to the extent that many institutions engineer a global sequencing over all data entering and produced by their systems to achieve complete determinism. How, then, can financial institutions and others best employ Stream Processing on streams of data that are already ordered? I will cover various techniques that can make this work, as well as seek input from the community on how Flink might be improved to better support these use-cases.
by Patrick Lucas
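One generic way to process by sequence rather than by time is to park early arrivals in keyed state and release them as soon as the gap fills. The sketch below shows that technique (not the speaker's implementation) with an illustrative message type.

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Re-establishes a per-key sequence ordering without reference to time:
// out-of-sequence records are parked in state and released once the gap fills.
public class SequenceOrderer extends KeyedProcessFunction<String, SequenceOrderer.Msg, SequenceOrderer.Msg> {

    public static class Msg {
        public long seq;     // position in the global sequencing
        public String body;
    }

    private transient ValueState<Long> nextSeq;
    private transient MapState<Long, Msg> parked;

    @Override
    public void open(Configuration parameters) {
        nextSeq = getRuntimeContext().getState(
                new ValueStateDescriptor<>("next-seq", Types.LONG));
        parked = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("parked", Types.LONG, Types.POJO(Msg.class)));
    }

    @Override
    public void processElement(Msg msg, Context ctx, Collector<Msg> out) throws Exception {
        long expected = nextSeq.value() == null ? 0L : nextSeq.value();
        if (msg.seq != expected) {
            parked.put(msg.seq, msg); // arrived early; hold until the gap fills
            return;
        }
        out.collect(msg);
        expected++;
        // Drain any consecutive successors that were parked earlier.
        Msg next;
        while ((next = parked.get(expected)) != null) {
            out.collect(next);
            parked.remove(expected);
            expected++;
        }
        nextSeq.update(expected);
    }
}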
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
Flink Forward San Francisco 2022.
In modern data platform architectures, stream processing engines such as Apache Flink are used to ingest continuous streams of data into data lakes such as Apache Iceberg. Streaming ingestion into Iceberg tables can suffer from two problems: (1) a small-files problem that can hurt read performance and (2) poor data clustering that can make file pruning less effective. To address those two problems, we propose adding a shuffling stage to the Flink Iceberg streaming writer. The shuffling stage can intelligently group data via bin packing or range partition. This can reduce the number of concurrent files that every task writes. It can also improve data clustering. In this talk, we will explain the motivations in detail and dive into the design of the shuffling stage. We will also share the evaluation results that demonstrate the effectiveness of smart shuffling.
by Gang Ye & Steven Wu
Batch Processing at Scale with Flink & IcebergFlink Forward
Flink Forward San Francisco 2022.
Goldman Sachs's Data Lake platform serves as the firm's centralized data platform, ingesting 140K (and growing!) batches per day of Datasets of varying shape and size. Powered by Flink and using metadata configured by platform users, ingestion applications are generated dynamically at runtime to extract, transform, and load data into centralized storage where it is then exported to warehousing solutions such as Sybase IQ, Snowflake, and Amazon Redshift. Data Latency is one of many key considerations as producers and consumers have their own commitments to satisfy. Consumers range from people/systems issuing queries, to applications using engines like Spark, Hive, and Presto to transform data into refined Datasets. Apache Iceberg allows our applications to not only benefit from consistency guarantees important when running on eventually consistent storage like S3, but also allows us the opportunity to improve our batch processing patterns with its scalability-focused features.
by Andreas Hailu
Explore our comprehensive data analysis project presentation on predicting product ad campaign performance. Learn how data-driven insights can optimize your marketing strategies and enhance campaign effectiveness. Perfect for professionals and students looking to understand the power of data analysis in advertising. For more details, visit: https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated, synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. The marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer friction through streaming analytics
1. Taking away customer friction through streaming analytics
Flink Forward Berlin
Ferd Scheepers, Chief Information Architect, ING
How ING uses data in real time with Apache Flink to enable its data-driven journey
Sept. 2017
2. The world of ING: the best global bank in the world, according to Global Finance magazine. Market leaders in the Benelux; growth markets; Commercial Banking; challengers. Customers: 37 million private, corporate and institutional customers. Countries: more than 40, in Europe, Asia, Australia, North and South America. Employees: 52,000.
3. Purpose: empowering people to stay a step ahead in life and in business. Customer promise: clear and easy; anytime, anywhere; empower; keep getting better. Strategic priorities: 1. earn the primary relationship; 2. develop analytics skills to understand our customers better; 3. increase the pace of innovation to serve changing customer needs; 4. think beyond traditional banking to develop new services and business models. Enablers: simplify & streamline, operational excellence, performance culture, lending capabilities. All aimed at creating a differentiating customer experience.
4. Trends in the banking landscape continue to evolve; our strategy is there to adapt and come out on top. The forces at play: customer behaviour, competitive landscape, technology, fintech, society and regulation. People's time is precious; they don't want to spend it on finance. Products have become commoditised; the only way to differentiate is through the experience. Regulatory uncertainty continues. Digitalisation is erasing borders. The ability to leverage new technologies will define future competitive advantage.
5. “ING is an IT company with a banking license” – Ralph Hamers
6. In 2017, a few companies are making huge amounts of money with data. The data world is maturing. The Economist: “The world’s most valuable resource is no longer oil, it’s data”
7. Five years ago, everybody needed data scientists; now we need data engineers. We have moved beyond experimentation alone.
8. Our data-driven journey started around 5 years ago, even before the CEO vision. Enter the ING Data Lake architecture.
[Slide shows the ING Data Lake reference architecture: information sources (systems of record, distribution, fulfilment and generic services layers, third-party feeds, new and internal sources) feed an information ingestion layer over the enterprise service bus; data lake repositories hold deposited and harvested data across staging areas, deep data and an information warehouse, alongside shared operational data, asset and activity hubs, a catalog and information views; an information broker, operational governance hub and code hub support information integration and governance; decision-model management deploys real-time decision models; streaming analytics turns events into decisions and notification events; self-service data access, data marts, search and reporting serve line-of-business insight plus ad-hoc discovery and analytics; inter-lake exchange connects other data lakes; and a governance, risk and compliance platform receives events to evaluate and serves compliance reporting.]
9. In 2015, we started to work on the Touch Point Architecture, loosely based on platform thinking in the automotive industry.
10. We identified the essential components of the new modular bank, focussing on the customer and customer interaction.
Authentication/Security provides the components and their interactions that facilitate a shared security architecture, covering authentication, security tokens, authorisation and risk-engine interaction. Authentication/Security is based on the principles of 'stateless design', 'use of industry standards', 'channel-agnostic authentication means' and 'means-agnostic business APIs'.
Notification brings the bank to its customers. A push mechanism shifts ING towards becoming a proactive bank rather than a reactive bank. We will provide meaningful information to our customers at all times. We know the customer extremely well and add value by using proactive notifications.
Standardised Global Party, Product & Arrangement Management (PPAM) defines a common data exchange model and a set of interaction patterns that support globally reusable components with regards to parties, products, agreements and mandates. All PPAM processes, exchange models and data models are aligned in a standardised Global PPAM. Global PPAM has a standard interaction pattern and a single API, with no boundaries between segments and countries.
Next to authentication, there is a specific authorisation workflow to request the explicit consent of a customer (by 'signing') for executing actions like a payment or changing an agreement.
Customer Context is about what we already know about the customer (previous locations, arrangements, his or her financial position, etc.), about the environmental context (world news, news about ING, what happens in my region) and what we know about the current customer's touchpoint (location, device, time). Events are about a specific touchpoint, and as such influence the current interaction, but can also update person context as well as environment context for future analytics.
12. Offering our customers actionable insights is the foundation of the new banking experience: relevant, actionable, real-time, personal.
13. To deliver on the promise of TPA, we combined some workstreams, and started to look at solutions already in place.
14. Except for Coral, none of the existing solutions really suited the architecture, and Coral turned out to be hard to develop.
[Slide diagram: mobile and web touchpoints and services feed an event bus; a streaming-computing layer exposes REST, STREAM and OLTP APIs next to a data store; a separate big-data (not real-time) environment draws on other data sources, lakes, systems of record and services for data science and scenario testing. Requirements: a low-latency data store, streaming computing on the event bus, and in-memory big data computing.]
15. As we started looking at patterns for streaming analytics, they always looked alike. Fraud and customer use cases are identical.
[Slide diagram: producers emit raw events; a 'what is it' stage (CEP, filtering, enriching) turns them into relevant events; a 'what to do with it' stage (rules, model scoring, artificial intelligence, machine learning) produces the outcome.]
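This common shape translates almost directly into a Flink job. Below is a minimal, illustrative sketch of the filter, enrich and score stages, with toy string events standing in for real producers, context lookups and models; none of it is ING's actual code.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RawToRelevantPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for raw events arriving from producers via the event bus.
        DataStream<String> rawEvents = env.fromElements(
                "login:user1", "payment:user2:5000", "ping", "payment:user3:120");

        rawEvents
            // "What is it": filter raw events down to relevant events...
            .filter(e -> e.startsWith("payment:"))
            // ...and enrich them (stand-in for a customer-context lookup).
            .map(e -> e + ":channel=mobile")
            // "What to do with it": rules or model scoring decide the outcome.
            .map(e -> e.contains(":5000") ? "ALERT " + e : "OK " + e)
            .print();

        env.execute("raw-event-to-outcome");
    }
}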
16. To re-use all the analytics, and to share state, we decided on one global platform for all streaming analytics in ING. And we did a beauty contest for the technology choice.
17. The result is an event-based architecture, with Apache Kafka and Apache Flink at the heart of it. Delivered as one platform.
18. The operating model for all TPA platforms is global: one tribe, one product owner, a platform squad, and potentially satellite squads (e.g. a Flink platform squad and a SAS platform squad, with feature squads and satellites around them).
Organization: the platform team builds the core; satellite teams across different regions do first-line support; feature teams build on top of the platform.
Code base: maintained by the platform team.
Deployment: handled by the platform team; multiple instances in different regions under the same lifecycle management.
Use case jobs: built and maintained by the feature teams.
Support/monitoring: the platform squad and satellite squads both perform monitoring; first-line support is done by the satellite squads, second-line support by the platform squad.
19. Multiple startups have declared ETL dead, and that streaming will take over. Is batch just a slower version of real time?
20. But I have a different opinion on where we are in this journey. – Ferd Scheepers, Chief Information Architect, ING
21. Choosing how to address the needs of the enterprise will determine who will succeed in this space.
22. ING is championing an Open Metadata and Governance initiative that will allow metadata to be captured when the data is created, to move with the data, and to be augmented and processed by any of the vendor tools.
23. All enterprises have a heterogeneous data landscape, and the need for managing metadata across technology boundaries.
24. The Open Metadata and Governance initiative will allow easy implementation of metadata in any platform. The common base automates the collection, management and use of metadata across an enterprise: an enterprise catalog of data resources that are transparently assessed, governed and used in order to deliver maximum value to the enterprise. It rests on an open set of APIs, an open set of metadata types, and an open set of exchange protocols, with frameworks for governance, access and discovery.
25. We will deliver through Apache Atlas both a metadata highway and an implementation that can be used by all.
[Slide diagram: open and unified metadata; metadata repositories (Apache Atlas, IBM, Flink) connect through the Open Metadata Repository Service (OMRS) and are exposed via the Open Metadata Access Service (OMAS); these components, defined and being developed by the Open Metadata & Governance project, together form the metadata highway.]
26. Before I sign off, there is more from ING tomorrow:
1. Tink, a temporal graph analytics library for Apache Flink. 13-9 at 14:00 – Palais Atelier
2. Fast Data at ING – building a streaming data platform with Flink and Kafka. 13-9 at 15:20 – Kesselhaus
Drop me a note at Ferd.Scheepers@ing.com or @Ferdscheepers