Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes - Aditi Verma & Ramesh Shanmugam


At Branch, we process more than 12 billion events per day, and store and aggregate terabytes of data daily. We use Apache Flink for processing, transforming and aggregating events, and Parquet as the data storage format. This talk covers our challenges with scaling our warehouse, namely:
● How did we scale our Flink-Parquet warehouse to handle a 3x increase in traffic?
● How do we ensure exactly-once, event-time based, fault-tolerant processing of events?

We also give an overview of deploying and scaling our streaming warehouse:
● How we scaled our Parquet warehouse by tuning memory
● Running on a Kubernetes cluster for resource management
● How we migrated our streaming jobs from Mesos to Kubernetes with no disruption
● Our challenges and learnings along the way


Scaling Warehouse with Flink, Parquet & Kubernetes
Aditi Verma & Ramesh Shanmugam

Aditi Verma
Sr Software Engineer
@aditiverma89 <averma@branch.io>
Ramesh Shanmugam
Sr Data Engineer
@rameshs01 <rshanmugam@branch.io>

Agenda
● Background
● Moving data with Flink @ Branch
● Scale & Performance
● Flink on Kubernetes
● Auto Scaling & Failure Recovery

12B requests per day (+70% y/y)
3B user sessions per day
50 TB of data per day
200K events per second
60+ Flink pipelines
5+ Kubernetes clusters

Moving data with Flink @ Branch
“Life is 10% what happens to you and 90% how you react to it.”
― Charles R. Swindoll
Receive information
Process it
React to it
FAST!!

State Backend
- Relatively small state backend
- File system backed state

Parquet
- Higher compression
- Read heavy data set: ingested to Druid and Presto (3M+ queries/day)
- Avro data format
- Memory intensive writes

Writing Parquet with Flink
Two approaches:
a) Close the file with checkpointing
b) Bucketing file sink
i) Configured with a custom event-time bucketer, Parquet writer and batch size
ii) Files are rolled with a timeout of 10 min within a bucket
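
The bucketing approach above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the sink Branch runs: the function names, the `D=.../H=...` directory layout (the slides only show `H=10`-style hour buckets) and measuring the 10 min timeout from the moment the file was opened are all assumptions.

```python
from datetime import datetime, timezone

# The 10 min roll timeout mentioned on the slide, in seconds.
ROLL_TIMEOUT_SECONDS = 10 * 60


def bucket_for(event_time_ms: int) -> str:
    """Map an event timestamp (epoch millis, UTC) to an hourly bucket path.

    The date directory is a hypothetical addition so that the same hour
    on different days does not share a bucket.
    """
    ts = datetime.fromtimestamp(event_time_ms / 1000, tz=timezone.utc)
    return ts.strftime("D=%Y-%m-%d/H=%H")


def should_roll(opened_at_s: float, now_s: float, rows: int, batch_size: int) -> bool:
    """Roll the in-progress file when the batch is full or the timeout expired."""
    timed_out = (now_s - opened_at_s) >= ROLL_TIMEOUT_SECONDS
    return rows >= batch_size or timed_out
```

Bucketing by the event's own timestamp rather than by arrival time means a late event still lands in the hour it belongs to, which is why several buckets can be open at once.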

Performance and Scale
- 100% traffic increase each year
- Higher parallelism impacts application performance and state size
- Kafka partitions < Flink parallelism requires rebalance on the input stream
- Task manager timeouts
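
To see why fewer Kafka partitions than source subtasks calls for a rebalance, the sketch below models partition assignment as a simple round-robin. That is a simplification for illustration and not Flink's actual assignment logic; the function names are hypothetical.

```python
from typing import Dict, List


def assign_partitions(num_partitions: int, parallelism: int) -> Dict[int, List[int]]:
    """Spread partitions over source subtasks round-robin (a simplification)."""
    assignment: Dict[int, List[int]] = {s: [] for s in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment


def idle_subtasks(assignment: Dict[int, List[int]]) -> List[int]:
    """Subtasks that received no partition and would therefore read nothing."""
    return [s for s, parts in assignment.items() if not parts]


if __name__ == "__main__":
    # 4 partitions read with parallelism 6: subtasks 4 and 5 stay idle.
    print(idle_subtasks(assign_partitions(4, 6)))
```

Without a rebalance after the source, the idle subtasks and everything chained to them do no work; redistributing records after the source lets all downstream subtasks share the load, at the cost of extra network traffic.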

Analyzing memory usage
❖ Network Buffers
❖ Memory Segments
❖ User code
❖ Memory and GC stats
❖ JVM parameters
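
As a rough aid for the network buffer item, the sketch below estimates how much memory a task manager sets aside for network buffers. It assumes the fraction/min/max scheme of Flink 1.x releases from that period (`taskmanager.network.memory.fraction`, `.min`, `.max` and `taskmanager.memory.segment-size`). The default values used here are assumptions and should be checked against the documentation for the Flink version in use.

```python
def network_memory_bytes(
    jvm_memory_bytes: int,
    fraction: float = 0.1,        # assumed default fraction
    min_bytes: int = 64 << 20,    # assumed default minimum, 64 MiB
    max_bytes: int = 1 << 30,     # assumed default maximum, 1 GiB
) -> int:
    """Network memory = fraction of JVM memory, clamped to [min, max]."""
    return int(min(max(jvm_memory_bytes * fraction, min_bytes), max_bytes))


def num_network_buffers(network_bytes: int, segment_size: int = 32 * 1024) -> int:
    """Number of buffers (memory segments) that fit into the network memory."""
    return network_bytes // segment_size
```

Raising parallelism adds network channels, and each channel needs buffers, so this estimate is one of the first numbers to revisit when a job is scaled up.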

Containerizing Flink - Mesos
● Longer start-up time on Mesos
● Moved to containerizing Flink application on Kubernetes
● Kubernetes is resource oriented, declarative

Flink on Kubernetes @ Branch
● Single job per cluster
● Docker image
○ Flink image - task manager + job manager
○ Job launcher - custom launcher + job jar
● Job launcher
○ Application jar
○ Uploads jar
● Config map - flink-conf.yaml
○ jobmanager.rpc.address

Auto Scaling
● When and how much to scale
○ Auto - job launcher
● Scale
○ Replica Set
● Flink job with new parallelism
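
The slides do not say how the job launcher decides when and how much to scale. One possible rule, shown purely as an assumption, derives a target parallelism from the observed event rate and a measured per-subtask capacity. All names, bounds and thresholds below are hypothetical.

```python
def target_parallelism(
    events_per_sec: int,
    per_subtask_capacity: int,
    headroom_pct: int = 20,   # hypothetical safety margin
    min_p: int = 1,
    max_p: int = 64,
) -> int:
    """Parallelism needed for the load plus headroom, clamped to [min_p, max_p]."""
    numerator = events_per_sec * (100 + headroom_pct)
    denominator = 100 * per_subtask_capacity
    needed = -(-numerator // denominator)  # integer ceiling division
    return max(min_p, min(max_p, needed))


def should_rescale(current: int, target: int, tolerance: float = 0.25) -> bool:
    """Only rescale when the change is large enough to justify a restart."""
    return abs(target - current) / current > tolerance
```

With a rule of this kind, the launcher would resize the task manager replica set and resubmit the Flink job with the new parallelism, as the slide outlines. The tolerance is there because every rescale means restarting the job.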

Savepoint Failure
● Reasons
○ Truncation
○ Schema mismatch
○ HDFS outage

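Savepoint Structure
[Diagram: job "foo" running in Flink along a timeline labelled 10, 11, 12, 13 writes hourly output buckets H=10/*.parquet, H=11/*.parquet and H=12/*.in-progress; the savepoint directory for foo holds Run-id 1 > Job-id 1 > CP-1, CP-2]
● Path layout: job/run-id/flink-job-id/cp-x
● Run id - incremental number
● Job id - Flink job name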
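
The `job/run-id/flink-job-id/cp-x` layout can be expressed as a pair of small helpers. This is a sketch only: the `run-` and `cp-` prefixes and the function names are assumptions, since the slide gives the order of the path components but not their exact spelling.

```python
from typing import Tuple


def checkpoint_path(job: str, run_id: int, flink_job_id: str, checkpoint: int) -> str:
    """Build job/run-id/flink-job-id/cp-x for one checkpoint."""
    return f"{job}/run-{run_id}/{flink_job_id}/cp-{checkpoint}"


def parse_checkpoint_path(path: str) -> Tuple[str, int, str, int]:
    """Split a path produced by checkpoint_path back into its four parts."""
    job, run, flink_job_id, cp = path.split("/")
    # The numeric part follows the last "-" in "run-N" and "cp-N".
    return job, int(run.split("-")[-1]), flink_job_id, int(cp.split("-")[-1])
```

Because the run id is an incrementing number, sorting by run id and then by checkpoint number gives the order in which state was written.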

Savepoint failure recovery
[Diagram: the same job and output buckets (H=10/*.parquet, H=11/*.parquet, H=12/*.in-progress); the savepoint directory foo now holds Run-id 1 > Job-id x > CP-1, CP-2 and a new Run-id 2 > Job-id x > CP-x]
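
The deck does not spell out the recovery procedure. One plausible reading of the diagram, offered here only as an assumption, is that a restart picks the newest checkpoint that can still be restored and continues under the next run id. The path format (`job/run-N/flink-job-id/cp-N`) and the `is_restorable` callback are hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple


def _order(path: str) -> Tuple[int, int]:
    """Sort key: (run id, checkpoint number) taken from the path."""
    _, run, _, cp = path.split("/")
    return int(run.split("-")[-1]), int(cp.split("-")[-1])


def latest_checkpoint(
    paths: Iterable[str],
    is_restorable: Callable[[str], bool] = lambda p: True,
) -> Optional[str]:
    """Newest checkpoint that passes the restorability check, else None."""
    for path in sorted(paths, key=_order, reverse=True):
        if is_restorable(path):
            return path
    return None


def next_run_id(paths: Iterable[str]) -> int:
    """Run ids are incremental, so the next run is the highest seen plus one."""
    return max((_order(p)[0] for p in paths), default=0) + 1
```

A check such as `is_restorable` is where truncated files or a schema mismatch would be detected, so the restart can fall back to an older checkpoint instead of failing again.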

Auto Recovery does not work?
● Continuous monitoring and proper alerts
● Start job from latest offset
● Have a different backfill route

Next Steps
● Parquet memory consumption (when too many buckets are open)
○ Window + RocksDB => Parquet
○ Two-stage process
● Row-oriented streaming
● Batch job to convert to columnar
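
The two-stage idea can be illustrated with a toy conversion: the streaming stage appends rows as they arrive, and a later batch step pivots them into columns. This plain-Python sketch only shows the shape of the data in each stage. It does not use Parquet itself, and the function name is hypothetical.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]], fields: List[str]) -> Dict[str, List[Any]]:
    """Pivot row-oriented records into one list per field; missing values become None."""
    columns: Dict[str, List[Any]] = {name: [] for name in fields}
    for row in rows:
        for name in fields:
            columns[name].append(row.get(name))
    return columns


if __name__ == "__main__":
    streamed = [{"id": 1, "os": "ios"}, {"id": 2}]
    print(rows_to_columns(streamed, ["id", "os"]))
```

A columnar writer has to buffer a whole group of rows per open file before it can lay out the columns, which is why many open buckets cost so much memory. Writing rows first and converting in batch moves that buffering out of the streaming job.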

Container technology experiences an ever increasing adoption throughout many industries. Not only does this technology make your applications portable across different machines and operating systems, it also allows to scale applications in a matter of seconds. Moreover, it significantly simplifies and speeds up deployments which decreases development and operation costs. Consequently, more and more Flink deployments run in containerized environments which poses new challenges for Flink. In this talk, we will take a look at Flink's current and future container support which will make it a first class citizen of the container world. First of all, we will explain how the new reactive execution mode will solve the problem of seamless application scaling and how it blends in with any environment. Complementary to the reactive mode, the active execution mode demonstrates its strengths when it comes to changing workloads such as batch jobs. Last but not least, we will take a look beyond Flink's own nose and investigate how Flink can be used together with Kubernetes operators or data Artisans' Application Manager. We will conclude the talk with a short demo of Flink's native Kubernetes support and giving an outlook on future developments in the container realm.

Flink Forward San Francisco 2019: Developing and operating real-time applicat...

Flink Forward

Developing and operating real-time applications with Oceanus The Tencent Data team (data.qq.com) is responsible to build a reliable and scalable infrastructure for various products at Tencent including Tencent Games, Tencent Video and WeChat. We have to process over 1.6 trillion events per day and the peak throughput reaches 210 million per second. Such a large scale exposes great challenges in the developing and operating of real-time applications. In this talk, we will give an introduction to Oceanus, a platform built at Tencent to facilitate the development of real-time applications. We will introduce typical user cases and how applications are developed, deployed and monitored with Oceanus. We will also introduce the management of Flink jobs in Oceanus, together with some experience gained in our operating practice. Finally, we will discuss some efforts we have done to enhance the reliability and efficiency, e.g. master failover and local keyed streams.

Flink Forward San Francisco 2019: Moving from Lambda and Kappa Architectures ...

Flink Forward

Moving from Lambda and Kappa Architectures to Kappa+ at Uber Kappa+ is a new approach developed at Uber to overcome the limitations of the Lambda and Kappa architectures. Whether your realtime infrastructure processes data at Uber scale (well over a trillion messages daily) or only a fraction of that, chances are you will need to reprocess old data at some point. There can be many reasons for this. Perhaps a bug fix in the realtime code needs to be retroactively applied (aka backfill), or there is a need to train realtime machine learning models on last few months of data before bringing the models online. Kafka's data retention is limited in practice and generally insufficient for such needs. So data must be processed from archives. Aside from addressing such situations, enabling efficient stream processing on archived as well as realtime data also broadens the applicability of stream processing. This talk introduces the Kappa+ architecture which enables the reuse of streaming realtime logic (stateful and stateless) to efficiently process any amounts of historic data without requiring it to be in Kafka. We shall discuss the complexities involved in such kind of processing and the specific techniques employed in Kappa+ to tackle them.

Flink Forward San Francisco 2019: The Trade Desk's Year in Flink - Jonathan ...

Flink Forward

The Trade Desk's Year in Flink At advertising technology leader, The Trade Desk, we built a data pipeline with three distinct large-scale products using Flink. The keynote gives you a peek into our journey, the lessons learned and offers some hard-won tips from the trenches. Most jobs were surprisingly simple to build. However, we'll deep-dive into one particularly challenging Flink job where we learned how to synchronise/union multiple streams, both in terms of asymmetric throughput and differing lateness/time.

Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...

Flink Forward

Over 137 million members worldwide are enjoying TV series, feature films across a wide variety of genres and languages on Netflix. It leads to petabyte scale of user behavior data. At Netflix, our client logging platform collects and processes this data to empower recommendations, personalization and many other services to enhance user experience. Built with Apache Flink, this platform processes 100s of billion events and a petabyte data per day, 2.5 million events/sec in sub milliseconds latency. The processing involves a series of data transformations such as decryption and data enrichment of customer, geo, device information using microservices based lookups. The transformed and enriched data is further used by multiple data consumers for a variety of applications such as improving user-experience with A/B tests, tracking application performance metrics, tuning algorithms. This causes redundant reads of the dataset by multiple batch jobs and incurs heavy processing costs. To avoid this, we have developed a config driven, centralized, managed platform, on top of Apache Flink, that reads this data once and routes it to multiple streams based on dynamic configuration. This has resulted in improved computation efficiency, reduced costs and reduced operational overhead. Stream processing at scale while ensuring that the production systems are scalable and cost-efficient brings interesting challenges. In this talk, we will share about how we leverage Apache Flink to achieve this, the challenges we faced and our learnings while running one of the largest Flink application at Netflix.

Streaming your Lyft Ride Prices - Flink Forward SF 2019

Thomas Weise

At Lyft we dynamically price our rides with a combination of various data sources, machine learning models, and streaming infrastructure for low latency, reliability and scalability. Dynamic pricing allows us to quickly adapt to real world changes and be fair to drivers (by say raising rates when there's a lot of demand) and fair to passengers (by let’s say offering to return 10 mins later for a cheaper rate). The streaming platform powers pricing by bringing together the best of two worlds using Apache Beam; ML algorithms in Python and Apache Flink as the streaming engine. https://sf-2019.flink-forward.org/conference-program#streaming-your-lyft-ride-prices

Flink Forward San Francisco 2019: Apache Beam portability in the times of rea...

Flink Forward

Apache Beam was open sourced by the big data team at Google in 2016, and has become an active community with participants from all over. Beam is a framework to define data processing workflows and run them on various runners (Flink included). In this talk, I will talk about some cool things you can do with Beam + Flink such as running pipelines written in Go and Python; then I’ll mention some cool tools in the Beam ecosystem. Finally, we’ll wrap up with some cool things we expect to be able to do soon - and how you can get involved.

Flink Forward San Francisco 2019: Building production Flink jobs with Airstre...

Flink Forward

AirStream is a realtime stream computation framework that supports Flink as one of its processing engines. It allows engineers and data scientists at Airbnb to easily leverage Flink to build real time data pipelines and feedback loops. Multiple mission critical applications have been built on top of it. In this talk, we will start with an overview of AirStream, and describe how we have designed Airstream to leverage SQL support in Flink to allow users to easily build real time data pipelines. We will go over a few production use cases such as building a user activity profiler and building user identity mapping in realtime. We will also cover how we have integrated Airstream into the data infrastructure ecosystem at Airbnb through easily configurable connectors such as Kafka and Hive that allow users to easily leverage these components in their pipelines.

Elastic Data Processing with Apache Flink and Apache Pulsar More and more applications are using Flink for low-latency data processing. Flink unifies batch and stream processing using one computation engine. However in reality, in order to really unify batch and stream processing, it requires a data system offers one unified data representation for both batch and streaming data. Nowadays, streaming data is typically stored in a log storage or messaging system, while batch data is stored in distributed filesystem and object stores. That means that data scientists still need write two different computing jobs to access same data stored in different data systems. Apache Pulsar is the next generation messaging and streaming data system. It was originally built at Yahoo, and has graduated from Apache Incubator and become a Top-Level-Project. Pulsar separates messaging serving and data storage into two layers. Such layered architecture provides high throughput and low-latency while ensuring high availability and scalability. Pulsar’s segment centric storage design along with layered architecture makes Pulsar a perfect unbounded streaming data system, which can well fit into Flink’s computation model. In this talk, Sijie Guo from Apache Pulsar PMC, will introduce Pulsar and its layered architecture and segment-centric storage, detailing how this architecture can well integrate with Flink to provide elastic unified batch and stream processing.

Flink Forward San Francisco 2018: Steven Wu - "Scaling Flink in Cloud"

Flink Forward

Over 109 million subscribers are enjoying more than 125 million hours of TV shows and movies per day on Netflix. This leads to massive amount of data flowing through our data ingestion pipeline to improve service and user experience. They are powering various data analytic cases like personalization, operational insight, fraud detection. At the heart of this massive data ingestion pipeline is a self-serve stream processing platform that processes 3 trillion events and 12 PB of data every day. We have recently migrated this stream processing platform from Samza to Flink. In this talk, we will share the challenges and issues that we run into when running Flink at scale in cloud. We will dive deep into the troubleshooting techniques and lessons learned.

Flink Forward Berlin 2017: Andreas Kunft - Efficiently executing R Dataframes...

Flink Forward

This document discusses providing an R dataframe abstraction for efficient distributed computation on Apache Flink. The goals are to provide a natural API for R and achieve performance comparable to Flink's native dataflow. The approach represents R dataframes as Flink data sets and compiles R functions into the native execution plan where possible. For user-defined R functions, they are evaluated within worker tasks using a just-in-time compiler. This allows executing R code within the same Java virtual machine as Flink for good performance, even on a single node. Results show it can achieve native Flink performance even for functions containing R code.

Flink Forward Berlin 2017: Zohar Mizrahi - Python Streaming API

Flink Forward

In this talk, we describe the design and implementation of the Python Streaming API support that has been submitted for inclusion in mainline Flink. Python is one of the most popular programming languages for data analysis. Its readability emphasizes development productivity and as a scripting language, it does not require a compilation nor complex development environment setup. Flink already has support for Python APIs for batch programming and unfortunately, the mechanism used to support batch programs (i.e., DataSet APIs) do does not work for Streaming API. We describe the limitations with the batch implementation and provide insights into how we solved this using Jython. We will walk through some of the examples programs using the new Python API and compare programmability and performance with the Java and Scala streaming APIs.

Virtual Flink Forward 2020: Build your next-generation stream platform based ...

Flink Forward

As organizations are getting better at capturing streaming data and the data velocity and volume are ever-increasing, the traditional messaging queues or log storage systems are suffering from scalability or operational and maintenance problems. Apache Pulsar is a multi-tenant, high-performance distributed pub-sub messaging system. Pulsar includes multiple features, such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, very low publishing and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper. In this talk, I will use one of the most popular stream processing engines, Apache Flink, as an example, to share our experience in building a stream processing and storage stack. Some of the traits are: * How to ensure end-to-end exactly-once semantics based on Pulsar's durable and replayable storage as well as Pulsar transaction. * How to implement Pulsar topics as infinite tables based on Pulsar's schema. * How to efficiently store stream states in Pulsar based on Pulsar's layered storage API. * A usage scenario that chaining all functionalities in the streaming platform.

Flink Forward Berlin 2017: Piotr Wawrzyniak - Extending Apache Flink stream p...

Flink Forward

Many stream processing applications can benefit from or need to rely on the prediction made with machine learning (ML) methods. In this presentation, new features of Apache Samoa are presented with a real data processing scenario. These features make Apache SAMOA fully accessible for Apache Flink users: (1) the data stream processed within Apache Flink is forwarded to Apache Samoa stream mining engine to perform predictions with stream-oriented ML models, (2) ML models evolve after every labelled instance and, at the same time, new predictions are sent back to Apache Flink. In both cases, Apache Kafka is used for data exchange. Hence, Apache Samoa is used as stream mining engine, provided with input data from, and sending predictions to Apache Flink. During the presentation, real life aspects are illustrated with code examples, such as input and prediction stream integration and monitoring latency of data processing and stream mining.

Flink Connector Development Tips & Tricks

Eron Wright

Flink Forward Berlin 2017: Patrick Lucas - Flink in Containerland

Flink Forward

Apache Flink, a powerful distributed stateful stream processing framework, is an especially good fit for deployment on a containerization platform: its storage requirement is primarily external (e.g. HDFS or S3), clusters often share the lifetime of the jobs that run on them, and the flexibility of allocating resources on such a platform allows for scaling jobs up and down as necessary. In this talk I will give a brief introduction to Apache Flink, then describe the journey to making it a first-class citizen of the container world. I will cover my experience preparing to publish the “official repository” of Flink images on Docker Hub, the challenges of fitting a Flink deployment in a Kubernetes-shaped box, and the rough edges of Flink itself that were exposed by this process.

Flink Forward SF 2017: Cliff Resnick & Seth Wiesman - From Zero to Streami...

Flink Forward

Apache Flink provides powerful stream processing capabilities which can allow organizations to move directly from batch to real time analytics, skipping the lambda architecture entirely. However, getting to production is not always as simple as rewriting your job in a new API, but requires rethinking your application design with a stream first mindset. This talk will cover MediaMath’s journey in rebuilding its reporting infrastructure using Apache Flink. We will discuss high level architectural designs when building an extensible reporting platform as well as deep dive into specific technical hurdles. Topics will include managing a Flink cluster on EC2 spot instances, reconciling Flink’s consistency model with S3’s, handling massive data skew as well as tools and techniques for building performant, fault tolerant streaming applications.

Flink Forward San Francisco 2019: Real-time Processing with Flink for Machine...

Flink Forward

eal-time Processing with Flink for Machine Learning at Netflix Machine learning plays a critical role in providing a great Netflix member experience. It is used to drive many parts of the site including video recommendations, search results ranking, and selection of artwork images. Providing high-fidelity, near real-time data is increasingly important for these machine learning pipelines, especially as multi-armed bandit and reinforcement learning techniques, in addition to more ""traditional"" supervised learning, become more prevalent. With access to this data, models are able to converge more quickly, features can be updated more frequently, and analysis can be done in a more timely manner. In this talk, we will focus on the practical details of leveraging Flink to process trillions of events per day, work with the time dimension, and manage large and frequently-changing state. We will discuss different processing schemes and dataflows, scalability and resiliency challenges we tackled, operational considerations, and instrumentation we added for monitoring job health in production.

Flink Forward Berlin 2017: Aljoscha Krettek - Talk Python to me: Stream Proce...

Flink Forward

Flink is a great stream processor, Python is a great programming language, Apache Beam is a great programming model and portability layer. Using all three together is a great idea! We will demo and discuss writing Beam Python pipelines and running them on Flink. We will cover Beam's portability vision that led here, what you need to know about how Beam Python pipelines are executed on Flink, and where Beam's portability framework is headed next (hint: Python pipelines reading from non-Python connectors)

Virtual Flink Forward 2020: Autoscaling Flink at Netflix - Timothy Farkas

Flink Forward

Keystone Data Pipeline manages several thousand Flink pipelines, with variable workloads. These pipelines are simple routers which consume from Kafka and write to one of three sinks. In order to alleviate our operational overhead, we’ve implemented autoscaling for our routers. Autoscaling has reduced our resource usage by 25% - 45% (varying by region and time), and has reduced our on call burden. This talk will take an in depth look at the mathematics, algorithms, and infrastructure details for implementing autoscaling of simple pipelines at scale. It will also discuss future work for autoscaling complex pipelines.

Flink Forward Berlin 2017: Jörg Schad, Till Rohrmann - Apache Flink meets Apa...

Flink Forward

Apache Mesos allows operators to run distributed applications across an entire datacenter and is attracting ever increasing interest. As much as distributed applications see increased use enabled by Mesos, Mesos also sees increasing use due to a growing ecosystem of well integrated applications. One of the latest additions to the Mesos family is Apache Flink. Flink is one of the most popular open source systems for real-time high scale data processing and allows users to deal with low-latency streaming analytical workloads on Mesos. In this talk we explain the challenges solved while integrating Flink with Mesos, including how Flink’s distributed architecture can be modeled as a Mesos framework, and how Flink was integrated with Fenzo. Next, we describe how Flink was packaged to easily run on DC/OS.

Marton Balassi – Stateful Stream Processing

Flink Forward

Flink Forward San Francisco 2019: High cardinality data stream processing wit...

Flink Forward

High cardinality data stream processing with large states At Klaviyo, we process more than a billion events daily with spikes as high as 75,000/s on peak days. The workload is growing exponentially year over year. We migrated our legacy event processing pipeline from Python to Flink in 2018 and gained a tremendous amount of performance. At the same time, we reduced our AWS EC2 instances from over 100 nodes to 15. Due to the nature of multi-tenancy and diverse dataset for over a billion user profiles, we spent a lot of engineering effort solving performance bottlenecks specific to handling high cardinality data streams in Flink. In this talk, we would like to share the learnings on using keyed operator states, windowing on high cardinality data, and making Flink production ready. We will also share the journey of moving from a 99% Python culture to Java.

Unify Enterprise Data Processing System Platform Level Integration of Flink a...

Flink Forward

In this talk, I will present how Flink enables enterprise customers to unify their data processing systems by using Flink to query Hive data. Unification of streaming and batch is a main theme for Flink. Since 1.9.0, we have integrated Flink with Hive in a platform level. I will talk about: - what features we have released so far, and what they enable our customers to do - best practices to use Flink with Hive - what is the latest development status of Flink-Hive integration at the time of Flink Forward Berlin (Oct 2019), and what to look for in the next release (probably 1.11)

Flink Forward Berlin 2017: Dominik Bruhn - Deploying Flink Jobs as Docker Con...

Flink Forward

Flink Forward San Francisco 2019: Building Financial Identity Platform using ...

Flink Forward

Building Financial Identity Platform using Apache Flink To power financial prosperity around the world, Intuit needs to create personalized product experience and new data centric products. Some of these use cases include Enabling 360 Customer View for Personalization and Targeting, building Ecosystem for Data Exchange between internal and 3rd party and personalize financial offerings, creating platform for Personalized security experience based on risk factors of people and devices. Unlike workflow centric products (for example, tax processing, accounting transactions), these use cases are often information-intensive and require real-time access to a large amount of connected data associated with people, organizations and things they own. To achieve this, we have created a platform called Unified Profile Service utilizing Flink. This platform is intended to provide the strategic data asset of a trusted, real-time, unified and connected view of people, organization and things they own. We have abstracted re-usable components such as sources, sinks, transformations etc and created a template. Utilizing this template our Product teams are able to rapidly test domain specific transformations and computations by creating and deploying Flink Jobs. This platform is running in production on AWS EMR, powering multiple use cases, ingesting and processing billions of events per day. In this talk, we will be discussing the design details of this Platform built leveraging Flink and Flink APIs as well as challenges faced along the way. We will begin by talking about the various components of the pipeline such as Identity Stitching, Entity Resolution, Reconciliation and Data Persistence. We will then dig in to the technical details of how we abstracted away these common components and created a template. We will also talk about how we update Consumer’s Financial Identity Graph in real-time through custom built AWS Dynamodb and Neptune Sink using Flink’s Connector API. 
Finally we will touch on lessons learnt along the way as we deployed the platform in production and offer advice on things to avoid as well as how to take things to the next level.

Flink Forward Berlin 2017: Stephan Ewen - The State of Flink and how to adopt...

Flink Forward

Data stream processing has redefined how many of us build data pipelines. Apache Flink is one of the systems at the forefront of that development: With its versatile APIs (event-time streaming, Stream SQL, events/state) and powerful execution model, Flink has been part of re-defining what stream processing can do. By now, Apache Flink powers some of the largest data stream processing pipelines in open source data stream processing. In this keynote, we will look at the evolution of Stream Processing and Apache Flink during the last year, and what we believe will be the next wave of stream processing applications. We show how the Flink community and users evolved, what use cases are coming up, and how new and upcoming features in Flink are making new types of applications possible. We will also discuss common challenges that companies are facing when adopting stream processing, and how we can help companies to rapidly adopt and roll out stream processing company-wide.

Virtual Flink Forward 2020: A deep dive into Flink SQL - Jark Wu

Flink Forward

During last two major versions (1.9 & 1.10), Apache Flink community spent lots of effort to improve the architecture for further unified batch & streaming processing. One example for that is Flink SQL added the ability to support multiple SQL planners under the same API. This talk will first discuss the motivation behind these movements, but more importantly will have a deep dive into Flink SQL. The presentation shows the unified architecture to handle streaming and batch queries and explain how Flink translates queries into the relational expressions, leverages Apache Calcite to optimize them, and generates efficient runtime code for execution. Besides, this talk will also describe the lifetime of a query in detail, how optimizer improve the plan based on relational node patterns, how Flink leverages binary data format for its basic data structure, and how does certain operator works. This would give audience better understanding of Flink SQL internals.

Stephan Ewen - Experiences running Flink at Very Large Scale

Ververica

This talk shares experiences from deploying and tuning Flink steam processing applications for very large scale. We share lessons learned from users, contributors, and our own experiments about running demanding streaming jobs at scale. The talk will explain what aspects currently render a job as particularly demanding, show how to configure and tune a large scale Flink job, and outline what the Flink community is working on to make the out-of-the-box for experience as smooth as possible. We will, for example, dive into - analyzing and tuning checkpointing - selecting and configuring state backends - understanding common bottlenecks - understanding and configuring network parameters

Flink Forward SF 2017: Stephan Ewen - Experiences running Flink at Very Large...

Flink Forward

Flink Forward San Francisco 2019: Scaling a real-time streaming warehouse with Apache Flink, Parquet and Kubernetes - Aditi Verma & Ramesh Shanmugam

1. Scaling Warehouse with Flink, Parquet & Kubernetes Aditi Verma & Ramesh Shanmugam

2. Aditi Verma Sr Software Engineer @aditiverma89 <averma@branch.io> Ramesh Shanmugam Sr Data Engineer @rameshs01 <rshanmugam@branch.io>

3. Agenda
● Background
● Moving data with Flink @ Branch
● Scale & Performance
● Flink on Kubernetes
● Auto Scaling & Failure Recovery

4.
12B requests per day (+70% y/y)
3B user sessions per day
50 TB of data per day
200K events per second
60+ Flink pipelines
5+ Kubernetes clusters

6. Moving data with Flink @ Branch
“Life is 10% what happens to you and 90% how you react to it.” ― Charles R. Swindoll
Receive information
Process it
React to it
FAST!!

7. Flink @ Branch

8. State Backend
- Relatively small state backend
- File system backed state

9. Parquet
- Higher compression
- Read heavy data set: ingested to Druid and Presto (3M+ queries/day)
- Avro data format
- Memory intensive writes

10. Writing parquet with Flink Two approaches: 1) Close the file with checkpointing

11. Writing parquet with Flink
Two approaches:
a) Close the file with checkpointing
b) Bucketing file sink
   i) Configured with custom event-time bucketer, parquet writer and batch size
   ii) Files are rolled out with a timeout of 10 min within a bucket
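The bucketing approach on this slide can be illustrated with a small sketch. This is not Branch's bucketer and not Flink's sink API: `bucket_for`, `should_roll`, and the `dt=` path prefix are hypothetical names chosen for illustration. Only the hourly `H=` directory naming (which appears on the savepoint slides), the batch size, and the 10 minute timeout come from the deck, and treating the timeout as "time since the file was opened" is one possible reading of the slide.

```python
from datetime import datetime, timezone

# From the slide: files are rolled out with a timeout of 10 min within a bucket.
ROLL_TIMEOUT_SECONDS = 10 * 60


def bucket_for(event_time_ms: int) -> str:
    """Map an event timestamp (epoch milliseconds, UTC) to an hourly bucket path.

    The bucket is derived from the event's own timestamp, not from the wall
    clock at processing time, so late events still land in the hour they
    belong to. The "dt=" prefix is an assumption; "H=" follows the deck.
    """
    ts = datetime.fromtimestamp(event_time_ms // 1000, tz=timezone.utc)
    return ts.strftime("dt=%Y-%m-%d/H=%H")


def should_roll(opened_at_s: float, now_s: float, buffered: int, batch_size: int) -> bool:
    """Roll the current file when the batch is full or the timeout has elapsed."""
    timed_out = (now_s - opened_at_s) >= ROLL_TIMEOUT_SECONDS
    return buffered >= batch_size or timed_out
```

Deriving the bucket from event time keeps each hourly directory complete for downstream readers, at the cost of holding several buckets open at once while late data is still arriving.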

12. Performance and Scale
- 100% traffic increase each year
- Higher parallelism impacts application performance and state size
- Kafka partitions < Flink parallelism requires rebalance on the input stream
- Task manager timeouts
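The rebalance point can be made concrete with a toy model. The sketch below assumes a simplified assignment in which partition `p` is read by source subtask `p % parallelism`; Flink's actual Kafka partition assignment may differ in detail, and `assign_partitions`, `idle_subtasks`, and `rebalance` are hypothetical helpers, not Flink APIs. It only illustrates why some subtasks receive no input when there are fewer partitions than subtasks, and how a round-robin redistribution evens out the load.

```python
def assign_partitions(num_partitions: int, parallelism: int) -> dict:
    """Simplified model: partition p is consumed by source subtask p % parallelism."""
    assignment = {subtask: [] for subtask in range(parallelism)}
    for partition in range(num_partitions):
        assignment[partition % parallelism].append(partition)
    return assignment


def idle_subtasks(assignment: dict) -> list:
    """Subtasks that own no partition and therefore never receive input."""
    return [subtask for subtask, parts in assignment.items() if not parts]


def rebalance(records: list, parallelism: int) -> list:
    """Round-robin redistribution of records across all downstream subtasks."""
    out = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        out[i % parallelism].append(record)
    return out


# With 4 partitions and parallelism 8, half of the source subtasks are idle.
idle = idle_subtasks(assign_partitions(4, 8))
```

In this model the idle subtasks contribute nothing until records are redistributed, which is the role a rebalance step on the input stream plays, at the price of an extra network shuffle.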

13.

14. Analyzing memory usage
❖ Network Buffers
❖ Memory Segments
❖ User code
❖ Memory and GC stats
❖ JVM parameters

15. Containerizing Flink - Mesos
● Longer start-up time on Mesos
● Moved to containerizing Flink application on Kubernetes
● Kubernetes is resource oriented, declarative

16. Kubernetes Terms

17.

18. Flink on Kubernetes @ Branch
● Single job per cluster
● Docker image
  ○ flink image - Task manager + job manager
  ○ Job launcher - custom launcher + job jar
● Job launcher
  ○ Application jar
  ○ Uploads jar
● Config map - flink config.xml
  ○ jobmanager.rpc.address

19. Auto Scaling
● When & how much to scale
  ○ Auto - Joblauncher
● Scale
  ○ Replica Set
● Flink job with new parallelism

20. Failure Recovery

21. Job / Task Manager Goes Down?

22. Job / Task Manager Goes Down?

23. Savepoint Failure
● Reasons
  ○ Truncation
  ○ Schema mismatch
  ○ HDFS outage

24. Savepoint Structure
[Diagram labels: Foo, Flink; 10, 11, 12, 13; H=10/*.parquet, H=11/*.parquet, H=12/*.in-progress; Sfoo; Run-id 1; CP-1, CP-2; Job-id 1]
● job/run-id/flink-job-id/cp-x
● Run id - incremental number
● Job id - Flink job name
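The `job/run-id/flink-job-id/cp-x` layout can be sketched as a pair of helpers. The slide gives only the ordering of the path segments; the literal `run-` and `cp-` prefixes, and the function names `checkpoint_path` and `latest_checkpoint`, are assumptions made for this illustration.

```python
from pathlib import PurePosixPath


def checkpoint_path(job: str, run_id: int, flink_job_id: str, cp: int) -> str:
    """Build a path following the slide's job/run-id/flink-job-id/cp-x layout.

    The "run-" and "cp-" prefixes are hypothetical; the slide does not spell
    out the exact directory names.
    """
    return str(PurePosixPath(job) / f"run-{run_id}" / flink_job_id / f"cp-{cp}")


def latest_checkpoint(paths: list) -> str:
    """Pick the path with the highest (run id, checkpoint number) pair."""
    def sort_key(path: str) -> tuple:
        parts = PurePosixPath(path).parts
        run = int(parts[-3].split("-")[-1])  # "run-2" -> 2
        cp = int(parts[-1].split("-")[-1])   # "cp-7"  -> 7
        return (run, cp)
    return max(paths, key=sort_key)


example = checkpoint_path("foo", 1, "job-x", 2)
```

Because the run id is an incrementing number, comparing on the run id first and the checkpoint number second means a restarted job's checkpoints always rank above those of the run it replaced.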

25. Savepoint failure recovery
[Diagram labels: Foo, Flink; 10, 11, 12, 13; H=10/*.parquet, H=11/*.parquet, H=12/*.in-progress; foo; Run-id 1; CP-1, CP-2; Job-id x; Run-id 2; CP-x; Job-id x]

26. Auto Recovery does not work?
● Continuous monitoring and proper alerts
● Start job from latest offset
● Have a different backfill route

27. Next Steps…
● Parquet memory consumption (when too many buckets open)
  ○ Window + RocksDB => Parquet
  ○ Two stage process
● Row oriented streaming
● Batch to convert to columnar

28. Q & A
