Across industries, the focus has been shifting from big data to fast data, driven in part by the deluge of high-velocity data streams and the need for instant data-driven insights. In response, a proliferation of messaging and streaming frameworks has emerged that enterprises use to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack with Apache Pulsar and Apache Heron at Twitter scale
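As a concrete example of the one-pass algorithms covered in the tutorial, reservoir sampling maintains a uniform random sample over an unbounded stream in a single pass with bounded memory. This is a minimal generic sketch, not code from the tutorial itself:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k / (i + 1),
            # which keeps every item seen so far equally likely to survive.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory stays O(k) no matter how long the stream is, which is the defining property of one-pass streaming algorithms.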
Serverless Streaming Architectures and Algorithms for the Enterprise - Arun Kejariwal
In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS, one writes and maintains the code (e.g., the functions) for serverless compute; in contrast, under BaaS, the platform provides the functionality and manages the operational complexity behind it. Serverless provides a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors.
Early use cases of serverless include data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is now being used in streaming contexts such as real-time bidding, fraud detection, and intrusion detection. Serverless is arguably a natural fit for extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations.
We walk the audience through the landscape of streaming systems at each stage of an end-to-end data processing pipeline: messaging, compute, and storage. We survey the inception and growth of the serverless paradigm. Further, we dive deep into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
Building intelligence into a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns (event processing, machine learning, and analytics) for different use cases and highlight the trade-offs. We also present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.
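To make the Pulsar functions model concrete: in its simplest Python form, a Pulsar function is a module-level `process` callable that the platform invokes once per input message. The sketch below shows a noise-filtering function of the kind described above; the field names and the filtering rule are illustrative assumptions, not details from the talk:

```python
import json

def process(input):
    """Serverless filter sketch: drop noisy readings, forward the rest.

    In a Pulsar deployment this module would be submitted as a function and
    `input` would be the message payload; here it is plain Python so the
    logic can be exercised locally.
    """
    event = json.loads(input)
    # Treat readings outside a plausible range as noise and emit nothing.
    # The [0, 100] bound is a hypothetical example threshold.
    if not (0 <= event.get("value", -1) <= 100):
        return None
    return json.dumps({"sensor": event["sensor"], "value": event["value"]})
```

Returning `None` suppresses output, so the function acts as a stream filter between an input and output topic.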
Using the New Apache Flink Kubernetes Operator in a Production Deployment - Flink Forward
Flink Forward San Francisco 2022.
Running natively on Kubernetes, the new Apache Flink Kubernetes Operator is a great way to deploy and manage Flink application and session deployments. In this presentation, we:
- Give a brief overview of Kubernetes operators and their benefits
- Introduce the five levels of the operator maturity model
- Introduce the newly released Apache Flink Kubernetes Operator and FlinkDeployment CRs
- Show Dockerfile modifications you can make to swap out UBI images and Java in the underlying Flink Operator container
- Describe enhancements we're making in versioning, upgradeability, stability, and security
- Demo the Apache Flink Operator in action, with a technical preview of an upcoming product using the Flink Kubernetes Operator
- Share lessons learned, followed by Q&A
by
James Busche & Ted Chang
Making Apache Spark Better with Delta Lake - Databricks
Delta Lake is an open-source storage layer that brings reliability to data lakes. It offers ACID transactions and scalable metadata handling, and it unifies streaming and batch data processing. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
In this talk, we will cover:
* What data quality problems Delta helps address
* How to convert your existing application to Delta Lake
* How the Delta Lake transaction protocol works internally
* The Delta Lake roadmap for the next few releases
* How to get involved!
Grafana Loki: like Prometheus, but for Logs - Marco Pracucci
Loki is a horizontally scalable, highly available log aggregation system inspired by Prometheus. It is designed to be very cost-effective and easy to operate, because it does not index the contents of the logs, only a set of labels for each log stream.
In this talk, we will introduce Loki, its architecture, and its design trade-offs in an approachable way. We’ll cover both Loki and Promtail, the agent used to scrape local logs and push them to Loki, including the Prometheus-style service discovery used to dynamically discover logs and attach metadata from applications running in a Kubernetes cluster.
Finally, we’ll show how to query logs with Grafana using LogQL - the Loki query language - and the latest Grafana features to easily build dashboards mixing metrics and logs.
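Loki's central trade-off (index the labels, grep the contents) can be sketched in a few lines of Python. The class and method names here are illustrative, not Loki's actual API; a call like `query({"app": "api"}, contains="500")` plays the role of the LogQL expression `{app="api"} |= "500"`:

```python
class TinyLoki:
    """Toy model of Loki's design: only label sets are indexed."""

    def __init__(self):
        # frozenset of (label, value) pairs -> list of raw log lines
        self.streams = {}

    def push(self, labels, line):
        """Append a log line to the stream identified by its label set."""
        self.streams.setdefault(frozenset(labels.items()), []).append(line)

    def query(self, selector, contains=None):
        """Select streams by label match (the cheap, indexed part),
        then brute-force scan the matching lines (the grep part)."""
        out = []
        for labels, lines in self.streams.items():
            if set(selector.items()) <= labels:
                out.extend(l for l in lines if contains is None or contains in l)
        return out
```

Because only the (small) label sets are indexed, ingestion stays cheap; the cost is paid at query time by scanning the selected streams.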
Pulsar is used by a portfolio of products at Splunk for stream processing of different types of data, including metrics and logs. In this talk, Karthik Ramasamy will share how Splunk helped a flagship customer scale a Pulsar deployment to handle 10 PB/day in a single cluster. He will talk about the journey, the challenges faced, and the trade-offs made to scale Pulsar and operate it reliably and stably in Google Cloud Platform (GCP).
Near real-time statistical modeling and anomaly detection using Flink! - Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet. The most important aspect of our platform is detecting outages and anomalies that have the potential to seriously impact customer applications and user experience, so automatic detection of such events at the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient, low-latency data pipelines in production using Flink, we decided to take it up a notch: we leveraged Flink to build statistical models in near real time and apply them to the incoming stream of events to detect anomalies. In this session we will deep dive into the design and discuss pitfalls and learnings from developing our real-time platform, which leverages Debezium, Kafka, Flink, ElastiCache, and DynamoDB to process events at scale.
by
Kunal Umrigar & Balint Kurnasz
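A minimal version of the kind of near-real-time statistical model this talk describes is a rolling z-score detector: maintain a window of recent values and flag points far from the window's mean. This generic sketch is an assumption about the general approach, not ThousandEyes' actual model:

```python
from collections import deque
import math

class ZScoreDetector:
    """Flag values far from the rolling mean of recent observations."""

    def __init__(self, window=50, threshold=3.0):
        self.window = deque(maxlen=window)  # bounded history of recent values
        self.threshold = threshold          # z-score beyond which we alert

    def observe(self, x):
        anomalous = False
        if len(self.window) >= 10:  # require some history before judging
            mean = sum(self.window) / len(self.window)
            var = sum((v - mean) ** 2 for v in self.window) / len(self.window)
            std = math.sqrt(var) or 1e-9  # avoid division by zero
            anomalous = abs(x - mean) / std > self.threshold
        self.window.append(x)
        return anomalous
```

In a streaming job the window would live in keyed state (one detector per monitored entity), with each incoming event scored against the model before updating it.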
Native Support of Prometheus Monitoring in Apache Spark 3.0 - Databricks
Every production environment requires monitoring and alerting, and Apache Spark has a configurable metrics system that allows users to report Spark metrics to a variety of sinks. Prometheus is a popular open-source monitoring and alerting toolkit that is commonly used together with Apache Spark.
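For reference, Spark 3.0 added a native Prometheus endpoint alongside the existing metrics sinks. The fragment below reflects the options documented for that release; verify the exact property names against the Spark monitoring docs for your version:

```properties
# spark-defaults.conf: expose executor metrics in Prometheus format
# at /metrics/executors/prometheus on the driver UI (Spark 3.0+)
spark.ui.prometheus.enabled  true

# metrics.properties: register the PrometheusServlet sink for all instances
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
```

With these set, a Prometheus server can scrape Spark directly instead of going through an exporter sidecar.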
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... - Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.
How Orange Financial combat financial frauds over 50M transactions a day usin... - StreamNative
You will learn how Orange Financial combats financial fraud across over 50M transactions a day using Apache Pulsar. This presentation was shared at the Strata Data Conference in New York, September 2019.
Prometheus - Intro, CNCF, TSDB, PromQL, Grafana - Sridhar Kumar N
https://www.youtube.com/playlist?list=PLAiEy9H6ItrKC5PbH7KiELiSEIKv3tuov
- What is Prometheus?
- Difference between Nagios and Prometheus
- Architecture
- Alertmanager
- Time series DB
- PromQL (Prometheus Query Language)
- Live Demo
- Grafana
Introducing BinarySortedMultiMap - A new Flink state primitive to boost your ... - Flink Forward
Flink Forward San Francisco 2022.
Probably everyone who has written stateful Apache Flink applications has used one of the fault-tolerant keyed state primitives ValueState, ListState, and MapState. With RocksDB, however, retrieving and updating items comes at an increased cost that you should be aware of. Sometimes, these costs cannot be avoided with the current API, e.g., for efficient event-time stream sorting or streaming joins where you need to iterate one or two buffered streams in the right order. With FLIP-220, we are introducing a new state primitive: BinarySortedMultiMapState. This new form of state allows you to (a) efficiently store lists of values for a user-provided key, and (b) iterate keyed state in a well-defined sort order. Both features can be backed efficiently by RocksDB with a 2x performance improvement over the current workarounds. This talk will go into the details of the new API and its implementation, present how to use it in your application, and talk about the process of getting it into Flink.
by
Nico Kruber
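The contract the talk describes (per-key value lists, iterable in key order) can be emulated in plain Python to illustrate the idea. This is a user-space sketch of the concept, not the Flink API; think of the keys as event timestamps in the stream-sorting use case:

```python
import bisect

class SortedMultiMap:
    """Illustrative model of a sorted multimap: lists of values per key,
    with iteration in ascending key order (e.g., keys as event timestamps)."""

    def __init__(self):
        self._keys = []    # kept sorted at all times
        self._values = {}  # key -> list of values for that key

    def add(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert key at its sorted position
            self._values[key] = []
        self._values[key].append(value)

    def items_in_order(self):
        """Yield (key, values) pairs in ascending key order."""
        for k in self._keys:
            yield k, self._values[k]
```

The point of the native Flink primitive is that RocksDB's key ordering provides this sorted iteration for free, instead of maintaining the order in user code as above.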
Flink Forward San Francisco 2022.
Resource elasticity is a frequently requested feature in Apache Flink: users want to be able to easily adjust their clusters to changing workloads for resource-efficiency and cost-saving reasons. In Flink 1.13, the initial implementation of Reactive Mode was introduced, and later releases added more improvements to make the feature production-ready. In this talk, we’ll explain scenarios for deploying Reactive Mode to various environments to achieve autoscaling and resource elasticity. We’ll discuss the constraints to consider when planning to use this feature, as well as potential improvements from the Flink roadmap. For those interested in the internals of Flink, we’ll also briefly explain how the feature is implemented and, if time permits, conclude with a short demo.
by
Robert Metzger
Beam + Pulsar: Powerful Stream Processing at Scale - Pulsar Summit SF 2022 - StreamNative
Pulsar Summit San Francisco is the event dedicated to Apache Pulsar. This one-day, action-packed event will include 5 keynotes, 12 breakout sessions, and 1 amazing happy hour. Speakers are from top companies, including Google, AWS, Databricks, Onehouse, StarTree, Intel, ScyllaDB, and more! It’s the perfect opportunity to network with Pulsar thought leaders in person.
Join developers, architects, data engineers, DevOps professionals, and anyone who wants to learn about messaging and event streaming for this one-day, in-person event. Pulsar Summit San Francisco brings the Apache Pulsar Community together to share best practices and discuss the future of streaming technologies.
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ... - StreamNative
Lakehouses are quickly growing in popularity as a new approach to data platform architecture, bringing some of the long-established benefits of the OLTP world to OLAP, including transactions, record-level updates/deletes, and change streaming. In this talk, we will discuss Apache Hudi and how it unlocks the possibility of building your own fully open-source lakehouse featuring a rich set of integrations with existing technologies, including Apache Pulsar. In this session, we will:
- Explain what lakehouses are and why they are needed
- Describe what Apache Hudi is and how it works
- Provide a use case and demo that applies Apache Hudi’s DeltaStreamer tool to ingest data from Apache Pulsar
Building Cloud-Native App Series - Part 11 of 11
Microservices Architecture Series
Service Mesh - Observability
- Zipkin
- Prometheus
- Grafana
- Kiali
Evening out the uneven: dealing with skew in Flink - Flink Forward
Flink Forward San Francisco 2022.
When running Flink jobs, skew is a common problem that results in wasted resources and limited scalability. In the past years, we have helped our customers and users solve various skew-related issues in their Flink jobs or clusters. In this talk, we will present the different types of skew that users often run into: data skew, key skew, event time skew, state skew, and scheduling skew, and discuss solutions for each of them. We hope this will serve as a guideline to help you reduce skew in your Flink environment.
by
Jun Qin & Karl Friedrich
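One standard mitigation for the key skew mentioned above is two-stage aggregation with key salting: hot keys are spread across several partial aggregators and the partial results are merged by the original key afterwards. The sketch below is a generic illustration of the technique, not code from the talk:

```python
import random
from collections import Counter

def two_stage_count(events, hot_keys, fan_out=4, rng=None):
    """Count events per key, salting known hot keys so no single
    partition receives all of their traffic."""
    rng = rng or random.Random()
    partial = Counter()
    for key in events:
        # Stage 1: append a random salt to hot keys, spreading their
        # records over `fan_out` logical partitions.
        salt = rng.randrange(fan_out) if key in hot_keys else 0
        partial[(key, salt)] += 1
    final = Counter()
    for (key, _salt), n in partial.items():
        # Stage 2: merge partial counts back by the original key.
        final[key] += n
    return final
```

In a Flink job the two stages would be separate keyed operators (keyBy the salted key, then keyBy the original key), trading one extra shuffle for even load.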
Modern cloud-native applications are incredibly complex systems. Keeping these systems healthy and meeting SLAs for our customers is crucial for long-term success. In this session, we will dive into the three pillars of observability (metrics, logs, and tracing), the foundation of successful troubleshooting in distributed systems. You'll learn the gotchas and pitfalls of rolling out the OpenTelemetry stack on Kubernetes to effectively collect all your signals without worrying about vendor lock-in. Additionally, we will replace parts of the Prometheus stack, scraping metrics with the OpenTelemetry collector and operator.
Apache Iceberg - A Table Format for Huge Analytic Datasets - Alluxio, Inc.
Data Orchestration Summit
www.alluxio.io/data-orchestration-summit-2019
November 7, 2019
Apache Iceberg - A Table Format for Huge Analytic Datasets
Speaker:
Ryan Blue, Netflix
For more Alluxio events: https://www.alluxio.io/events/
Introducing the Apache Flink Kubernetes Operator - Flink Forward
Flink Forward San Francisco 2022.
The Apache Flink Kubernetes Operator provides a consistent approach to managing Flink applications automatically, without any human interaction, by extending the Kubernetes API. Given the increasing adoption of Kubernetes-based Flink deployments, the community has been working on a Kubernetes-native solution as part of Flink that can benefit from the rich experience of community members and ultimately make Flink easier to adopt. In this talk we give a technical introduction to the Flink Kubernetes Operator and demonstrate the core features and use cases through in-depth examples.
by
Thomas Weise
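A FlinkDeployment custom resource, as managed by the operator, looks roughly like this. Field names follow the operator's v1beta1 API; treat the image tag, resource sizes, and jar path as placeholder assumptions and check the operator docs for your version:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example        # placeholder name
spec:
  image: flink:1.15          # placeholder Flink image tag
  flinkVersion: v1_15
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "2"
  serviceAccount: flink
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```

Applying this manifest is all that is needed; the operator reconciles the declared state into a running JobManager, TaskManagers, and job submission.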
Streaming Pipelines in Kubernetes Using Apache Pulsar, Heron and BookKeeper - Karthik Ramasamy
Enterprises are increasingly building applications to take advantage of real-time workloads, and Kubernetes is fast becoming the de facto scheduler and orchestration system for distributed systems. In this session, Karthik Ramasamy will show how easy it is to deploy Apache Pulsar for queuing and streaming and Twitter Heron for stream processing using Kubernetes, and how end-to-end data pipelines are built. Pulsar and Heron are at the core of Streamlio's end-to-end real-time platform, which is ideally suited to enterprises building next-generation real-time pipelines.
Real Time Processing Using Twitter Heron by Karthik RamasamyData Con LA
Abstract: Today's enterprises are not only producing data in high volume but also at high velocity. With velocity comes the need to process the data in real time. To meet these real-time needs, we developed and deployed Heron, the next generation streaming engine, at Twitter. Heron processes billions and billions of events per day at Twitter and has been in production for nearly 3 years. Heron provides unparalleled performance at large scale and has been successfully meeting Twitter's strict performance requirements for various streaming and IoT applications. Heron is an open source project with several major contributors from various institutions. As part of the project, we identified and implemented several optimizations that improved throughput by an additional 5x and reduced latency by 50-60%. In this talk, we will describe Heron in detail, including how detailed profiling revealed performance bottlenecks such as multiple serializations/deserializations and immutable data structures. After mitigating these costs, we were able to show much higher throughput and latencies as low as 12ms.
"Traffic Speed Control System in the Cloud using Machine Learning" by Albert ...DevClub_lv
This talk will be about the difficulties they have met along the road, and how they overcame some of them. It will cover both the technical and the management side.
Alberts Bušinskis (Garuda Dev Giri) - Yoga & Meditation teacher, IT enthusiast, software developer. Many years ago he left a very good job in Switzerland and went to the Indian Himalayas to practice Yoga. After realizing that software development is not so different from Yoga, he came back to his homeland, trying to be helpful and inspiring.
Big Data & Machine Learning Pipelines: A Tale of Lambdas, Kappas and PancakesOsama Khan
Data Lake, Business Intelligence, Enterprise Data Warehouse, Big Data Pipeline, Online Machine Learning, Lambda Architecture, Streaming, Spark, Kafka, Storm, Flink, Hadoop, Mesos and SMACK stack are some of the things you hear about when you want to dive into building a data pipeline. The Big Data Landscape cannot fit on a single screen as seen in the presentation. This is in addition to all the Big Data & Machine Learning offerings AWS has been introducing over the past few years which address many pain points highlighted by the various communities and help you get up and running faster.
The objective of this talk is to provide the audience with a framework which helps them define their pipeline problems, isolate components and pick the right tools for the right job.
We will talk about:
1. A consistent definition of BIG in big data
2. The lineage of fundamental tools in the ecosystem
3. First principles of a big data pipeline based on the lambda (not lambda functions) and kappa architectures
4. Distinguishing between big data and online machine learning pipelines
5. Technology choices based on first principles, open source solutions and AWS offerings
6. Demo: Serverless, Managed Big Data Pipeline and real-time dashboard on AWS (orchestrated via Terraform)
Presented at: https://www.meetup.com/Vancouver-Amazon-Web-Services-User-Group/events/245946651/
Data pipelines are hard to build and maintain. This is due to the complexity of the big data open source ecosystem, which has numerous software projects, each specializing in solving one piece of the puzzle. In this talk, we will focus on three key open source projects - Apache Pulsar, Apache Heron and Apache BookKeeper - and how they are integrated to make it easy to build data pipelines.
EDA Meets Data Engineering – What's the Big Deal?confluent
Presenter: Guru Sattanathan, Systems Engineer, Confluent
Event-driven architectures have been around for many years, much like Apache Kafka®, which was first open sourced in 2011. The reality is that the true potential of Kafka is only being realised now. Kafka is becoming the central nervous system of many of today's enterprises. It is bringing a profound paradigm shift to the way we think about enterprise IT. What has changed in Kafka to enable this paradigm shift? How is it more than just a message broker, and how are enterprises using it today? This session will explore these key questions.
Sydney: https://content.deloitte.com.au/20200221-tel-event-tech-community-syd-registration
Melbourne: https://content.deloitte.com.au/20200221-tel-event-tech-community-mel-registration
Big Data Day LA 2016/ Big Data Track - Twitter Heron @ Scale - Karthik Ramasa...Data Con LA
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. Twitter designed and deployed a new streaming system called Heron. Heron has been in production nearly 2 years and is widely used by several teams for diverse use cases. This talk looks at Twitter's operating experiences and challenges of running Heron at scale and the approaches taken to solve those challenges.
My perspective on the evolution of big data from the viewpoint of a distributed systems researcher and engineer - the background of how it got started, the scale-out paradigm, industry use cases, the open source development paradigm, and interesting future challenges.
4 TeraGrid Sites Have Focal Points:
SDSC – The Data Place
Large-scale and high-performance data analysis/handling
Every Cluster Node is Directly Attached to SAN
NCSA – The Compute Place
Large-scale, Large Flops computation
Argonne – The Viz place
Scalable Viz walls
Caltech – The Applications place
Data and flops for applications – Especially some of the GriPhyN Apps
Specific machine configurations reflect this
Seeks to give you enough information about how NoSQL databases work to let you answer the question: "Does a NoSQL database make sense for my next project?"
Introduction to Artificial Intelligence. Not complex, and should be relatively easy to follow. Be aware that, due to its high-level nature (and no voice-over), some care should be taken with the simplified examples used.
The engineering teams within Splunk have been using several technologies - Kinesis, SQS, RabbitMQ and Apache Kafka - for enterprise-wide messaging for the past few years, but have recently made the decision to pivot toward Apache Pulsar, migrating both existing use cases and embedding it into new cloud-native service offerings such as the Splunk Data Stream Processor (DSP).
Data processing use cases, from transformation to analytics, perform tasks that require various combinations of queuing, streaming & lightweight processing steps. Until now, supporting all of those needs has required different systems for each task--stream processing engines, messaging queuing middleware, & streaming messaging systems. That has led to increased complexity for development & operations.
In this session, we'll discuss the need to unify these capabilities in a single system and how Apache Pulsar was designed to address that. Apache Pulsar is a next generation distributed pub-sub system that was developed and deployed at Yahoo. Streamlio's Karthik Ramasamy will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute.
Presented at SF Big Analytics Meetup
Online event processing applications often require the ability to ingest, store, dispatch and process events. Until now, supporting all of these needs has required a different system for each task - stream processing engines, message queuing middleware, and pub/sub messaging systems. This has led to unnecessary complexity in both development and operations, raising the barrier to adoption in enterprises. In this talk, Karthik will outline the need to unify these capabilities in a single system and make them easy to develop and operate at scale. Karthik will delve into how Apache Pulsar was designed to address this need with an elegant architecture. Apache Pulsar is a next generation distributed pub-sub system that was originally developed and deployed at Yahoo and is running in production at more than 100 companies. Karthik will explain how the architecture and design of Pulsar provide the flexibility to support developers and applications needing any combination of queuing, messaging, streaming and lightweight compute for events. Furthermore, he will present real-life use cases showing how Apache Pulsar is used for event processing, ranging from data processing tasks to web processing applications.
Creating Data Fabric for #IOT with Apache PulsarKarthik Ramasamy
In this presentation, we go into detail about how a distributed data fabric using Apache Pulsar can be created in an IoT environment - acquiring, filtering, moving and distributing data from a myriad of devices to the cloud.
Linked In Stream Processing Meetup - Apache PulsarKarthik Ramasamy
Apache Pulsar is the next generation messaging system that uses a fundamentally different architecture to achieve durability, performance, scalability, efficiency, multi-tenancy and geo replication.
Twitter's Real Time Stack - Processing Billions of Events Using Distributed L...Karthik Ramasamy
Twitter generates billions and billions of events per day. Analyzing these events in real time presents a massive challenge. In order to meet this challenge, Twitter designed an end-to-end real-time stack consisting of DistributedLog, a distributed and replicated messaging system, and Heron, the streaming system for real-time computation. DistributedLog is a replicated log service that is built on top of Apache BookKeeper, providing infinite, ordered, append-only streams that can be used for building robust real-time systems. It is the foundation of Twitter's publish-subscribe system. Twitter Heron is the next generation streaming system built from the ground up to address our scalability and reliability needs. Both systems have been in production for nearly two years and are widely used at Twitter in a range of diverse applications such as the search ingestion pipeline, ad analytics, image classification and more. These slides describe Heron and DistributedLog in detail, covering a few use cases in-depth and sharing the operating experiences and challenges of running large-scale real-time systems at scale.
This paper describes the use of Storm at Twitter. Storm is a realtime fault-tolerant and distributed stream data processing system. Storm is currently being used to run various critical computations in Twitter at scale, and in real time. This paper describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. It also describes how queries (aka topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures. Storm is under active development at Twitter, and we also present some potential directions for future work.
Storm is a realtime fault-tolerant and distributed stream data processing system. Storm is used to run various critical computations in Twitter at scale, and in real time. This talk was presented at the SIGMOD 2014 conference, which accepted our paper Storm@Twitter. It describes the architecture of Storm and its methods for distributed scale-out and fault-tolerance. This presentation also describes how queries (aka topologies) are executed in Storm, and presents some operational stories based on running Storm at Twitter. We also present results from an empirical evaluation demonstrating the resilience of Storm in dealing with machine failures.
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank, are often run over Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand to grow and supply to evolve, facilitated by institutional investment rotating out of offices and into work from home (“WFH”), alongside the ever-expanding need for data storage as global internet usage expands, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
4. 4
Increasingly Connected World
Internet of Things
30 B connected devices by 2020
Health Care
153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data
40% of digital universe by 2020
Connected Vehicles
Data transferred per vehicle per month
4 MB -> 5 GB
Digital Assistants (Predictive Analytics)
$2B (2012) -> $6.5B (2019) [1]
Siri/Cortana/Google Now
Augmented/Virtual Reality
$150B by 2020 [2]
Oculus/HoloLens/Magic Leap
7. 7
The New Era: Streaming Data/Fast Data
Events are analyzed and processed in real-time as they arrive
Decisions are timely, contextual and based on fresh data
Decision latency is eliminated
Data in motion
13. 13
Real Time Use Cases
Algorithmic trading
Online fraud detection
Geo fencing
Proximity/location tracking
Intrusion detection systems
Traffic management
Real time recommendations
Churn detection
Internet of things
Social media/data analytics
Gaming data feed
24. 24
Pulsar Producer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Producer producer = client.createProducer(
    "persistent://my-property/us-west/my-namespace/my-topic");
// handles retries in case of failure
producer.send("my-message".getBytes());
// Async version:
producer.sendAsync("my-message".getBytes()).thenRun(() -> {
    // Message was persisted
});
25. 25
Pulsar Consumer
PulsarClient client = PulsarClient.create(
    "http://broker.usw.example.com:8080");
Consumer consumer = client.subscribe(
    "persistent://my-property/us-west/my-namespace/my-topic",
    "my-subscription-name");
while (true) {
    // Wait for a message
    Message msg = consumer.receive();
    // getData() returns byte[], so decode it before printing
    System.out.println("Received message: " + new String(msg.getData()));
    // Acknowledge the message so that it can be deleted by the broker
    consumer.acknowledge(msg);
}
28. 28
Broker Failure Recovery
• Topic is reassigned to an available broker based on load
• Can reconstruct the previous state consistently
• No data needs to be copied
• Failover handled transparently by client library
29. 29
Bookie Failure Recovery
• After a write failure, BookKeeper will immediately switch writes to a new bookie, within the same segment.
30. 30
Bookie Failure Recovery
• In the background, starts a many-to-many recovery process to regain the configured replication factor.
31. 31
Pulsar Architecture
Geo Replication
GEO REPLICATION
Asynchronous replication
Integrated in the broker message flow
Simple configuration to add/remove regions
(Diagram: topic T1 replicated across Data Center A, Data Center B and Data Center C; producers P1, P2 and P3 publish to their local copy of the topic, while consumers C1 and C2 consume through subscription S1 in their local regions.)
32. 32
Pulsar Use Cases - Message Queue
(Diagram: online events are published to a topic and consumed by Workers 1-3, decoupling online and offline processing.)
MESSAGE QUEUES
Decouple online or background processing
High availability
Reliable data transport
Notifications
Long running tasks
Low latency
33. 33
Pulsar Use Cases - Feedback System
(Diagram: serving systems publish feedback to an event topic; a controller consumes it and propagates state updates back to the serving systems through a second topic.)
FEEDBACK SYSTEM
Coordinate a large number of machines
Propagate states
Examples: state propagation, personalization, ad-systems
42. 42
Heron Groupings
Shuffle Grouping: random distribution of tuples
Fields Grouping: group tuples by a field or multiple fields
All Grouping: replicates tuples to all tasks
Global Grouping: send the entire stream to one task
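As a rough illustration (not the Heron API itself), each grouping can be viewed as a rule mapping a tuple to one or more downstream task indices; the class and method names below are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of how the four groupings route a tuple to
// downstream task indices in the range 0..numTasks-1.
class Groupings {
    // Shuffle grouping: pick a random task, spreading load evenly.
    static int shuffle(int numTasks) {
        return ThreadLocalRandom.current().nextInt(numTasks);
    }

    // Fields grouping: hash the grouping field, so the same field
    // value always lands on the same task.
    static int fields(Object fieldValue, int numTasks) {
        return Math.floorMod(fieldValue.hashCode(), numTasks);
    }

    // All grouping: replicate the tuple to every task.
    static int[] all(int numTasks) {
        int[] targets = new int[numTasks];
        for (int i = 0; i < numTasks; i++) targets[i] = i;
        return targets;
    }

    // Global grouping: the entire stream goes to a single task.
    static int global() {
        return 0;
    }
}
```

Fields grouping is what makes stateful operations like word count correct: every occurrence of the same word is routed to the same counting task.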
44. 44
Writing Heron Topologies
Procedural - Low Level API: directly write your spouts and bolts
Functional - Mid Level API: use of maps, flat maps, transform, windows
Declarative - SQL (coming): use of a declarative language - specify what you want, the system will figure it out
45. 45
Streamlet Functional API
Builder.newBuilder()
    .newSource(() -> StreamletUtils.randomFromList(SENTENCES))
    .flatMap(sentence -> Arrays.asList(sentence.toLowerCase().split("\\s+")))
    .reduceByKeyAndWindow(word -> word, word -> 1,
        WindowConfig.TumblingCountWindow(50),
        (x, y) -> x + y);
64. 64
Reacting to Failures - Brokers
Brokers don’t have durable state
Easily replaceable
Topics are immediately reassigned to healthy brokers
Expanding capacity
Simply add a new broker node
If other brokers are overloaded, traffic will be automatically reassigned
Load manager
Monitors traffic load on all brokers (CPU, memory, network, topics)
Initially places topics on the least loaded brokers
Reassigns topics when a broker is overloaded
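The placement rule in the last bullet can be sketched in a few lines; the class names and the scalar load metric here are illustrative, not Pulsar's actual load manager API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of least-loaded topic placement: each broker reports a load
// score, and a new topic is assigned to the broker with the lowest score.
class LoadManagerSketch {
    private final Map<String, Double> brokerLoad = new HashMap<>();

    // Record (or refresh) the reported load for a broker.
    void reportLoad(String broker, double load) {
        brokerLoad.put(broker, load);
    }

    // Place a new topic on the least loaded broker; null if none known.
    String placeTopic() {
        String best = null;
        double min = Double.MAX_VALUE;
        for (Map.Entry<String, Double> e : brokerLoad.entrySet()) {
            if (e.getValue() < min) {
                min = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }
}
```

Reassignment on overload follows the same idea: when a broker's score crosses a threshold, its topics become candidates for placement again.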
65. 65
Reacting to Failures - Bookies
When a bookie fails, brokers will immediately continue on other bookies
The Auto-Recovery mechanism will re-establish the replication factor in the background
If a bookie keeps giving errors or timeouts, it will be “quarantined”
Not considered for new ledgers for some period of time
71. 71
Heron Happy Facts :)
No more pages during midnight for the Heron team
Very rare incidents for Heron customer teams
Easy to debug during incidents for quick turnaround
Reduced resource utilization, saving cost
79. 79
Self Regulating Streaming Systems
Tuning: the manual, time-consuming and error-prone task of tuning various system knobs to achieve SLOs
SLO Maintenance: maintaining SLOs in the face of unpredictable load variations and hardware or software performance degradation
Self Regulating Streaming Systems: systems that adjust themselves to environmental changes and continue to produce results
80. 80
Self Regulating Streaming Systems
Self tuning: several tuning knobs and a time-consuming tuning phase; the system should take as input an SLO and automatically configure the knobs
Self stabilizing: stream jobs are long running and load variations are common; the system should react to external shocks and automatically reconfigure itself
Self healing: system performance can be affected by hardware or software delivering degraded quality of service; the system should identify internal faults and attempt to recover from them
81. 81
Enter Dhalion
Dhalion is a policy-based framework integrated into Heron. Dhalion periodically executes well-specified policies that optimize execution based on some objective. We created policies that dynamically provision resources in the presence of load variations and auto-tune streaming applications so that a throughput SLO is met.
82. 82
Dhalion Policy Framework
(Diagram: metrics feed Symptom Detectors 1..N in the symptom detection stage; the detected symptoms flow into Diagnosers 1..M, which generate Diagnoses 1..M; resolver selection then picks among Resolvers 1..M, and the chosen resolver is invoked.)
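The policy pipeline (metrics → symptom detection → diagnosis generation → resolver selection and invocation) can be sketched as plain interfaces and a single evaluation pass; all names, metric keys and thresholds below are illustrative, not Dhalion's actual API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of one evaluation cycle of a Dhalion-style policy.
class DhalionPolicySketch {
    interface SymptomDetector { List<String> detect(Map<String, Double> metrics); }
    interface Diagnoser { String diagnose(List<String> symptoms); } // null if no match
    interface Resolver { String resolve(); }

    static String runOnce(Map<String, Double> metrics,
                          List<SymptomDetector> detectors,
                          List<Diagnoser> diagnosers,
                          Map<String, Resolver> resolvers) {
        // 1. Symptom detection: every detector inspects the metrics.
        List<String> symptoms = new ArrayList<>();
        for (SymptomDetector d : detectors) symptoms.addAll(d.detect(metrics));
        // 2. Diagnosis generation + 3. resolver selection/invocation.
        for (Diagnoser d : diagnosers) {
            String diagnosis = d.diagnose(symptoms);
            if (diagnosis != null && resolvers.containsKey(diagnosis)) {
                return resolvers.get(diagnosis).resolve();
            }
        }
        return "no-op"; // nothing detected: leave the topology alone
    }
}
```

A scaling policy, for instance, would pair a backpressure detector with an "under-provisioned" diagnoser and a resolver that increases parallelism.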
92. 92
DATA SKETCHES
Early work
Counting [Morris, 1977]
Membership [Bloom, 1970]
Median of a sequence [Munro and Paterson, 1980]
Frequent Elements [Misra and Gries, 1982]
Counting distinct elements [Flajolet and Martin, 1985]
The space complexity of approximating the frequency moments [Alon et al. 1996]
Computing on Data Streams [Henzinger et al. 1998]
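The membership line refers to Bloom's 1970 filter; a minimal Java sketch (the double-hashing scheme used to derive the k probe positions is an illustrative choice, not prescribed by the paper):

```java
import java.util.BitSet;

// Minimal Bloom filter: k hash positions over an m-bit array.
// May return false positives, never false negatives.
class BloomFilter {
    private final BitSet bits;
    private final int m; // number of bits
    private final int k; // number of hash functions

    BloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive the i-th probe position from two base hashes (double hashing).
    private int hash(String item, int i) {
        int h1 = item.hashCode();
        int h2 = (h1 >>> 16) | (h1 << 16);
        return Math.floorMod(h1 + i * h2, m);
    }

    void add(String item) {
        for (int i = 0; i < k; i++) bits.set(hash(item, i));
    }

    boolean mightContain(String item) {
        for (int i = 0; i < k; i++) {
            if (!bits.get(hash(item, i))) return false;
        }
        return true;
    }
}
```

With roughly 10 bits per inserted element and k around 7, the false positive rate stays near 1%, which is why the structure shows up throughout streaming systems.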
121. Count-min Sketch [1]
A two-dimensional array count with w columns and d rows
Each entry of the array is initially zero
d hash functions h1, …, hd : {1, …, n} → {1, …, w}, chosen uniformly at random from a pairwise independent family
Update
For a new element i, for each row j and k = hj(i), increment the kth column by one
Point query
The estimate for element i is the minimum over rows j of count[j, hj(i)]
Parameters
Given accuracy targets (ε, δ), set w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
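The update and point-query rules above can be sketched compactly in Java; the pairwise independent hash family ((a·x + b) mod p) mod w used here is one standard choice:

```java
import java.util.Random;

// Count-min sketch: width w = ceil(e/eps), depth d = ceil(ln(1/delta)).
// With nonnegative updates, estimate() never underestimates the true count.
class CountMinSketch {
    private static final long P = 2147483647L; // Mersenne prime 2^31 - 1
    private final int w, d;
    private final long[][] count;
    private final long[] a, b;

    CountMinSketch(double eps, double delta, long seed) {
        this.w = (int) Math.ceil(Math.E / eps);
        this.d = (int) Math.ceil(Math.log(1.0 / delta));
        this.count = new long[d][w];
        this.a = new long[d];
        this.b = new long[d];
        Random rnd = new Random(seed);
        for (int j = 0; j < d; j++) {
            a[j] = 1 + rnd.nextInt((int) P - 1);
            b[j] = rnd.nextInt((int) P);
        }
    }

    private int hash(int j, long x) {
        return (int) (((a[j] * x + b[j]) % P) % w);
    }

    // Update: for each row j, increment column h_j(item).
    void add(long item, long amount) {
        for (int j = 0; j < d; j++) count[j][hash(j, item)] += amount;
    }

    // Point query: minimum over rows; overestimates by at most eps*N
    // with probability at least 1 - delta.
    long estimate(long item) {
        long min = Long.MAX_VALUE;
        for (int j = 0; j < d; j++) min = Math.min(min, count[j][hash(j, item)]);
        return min;
    }
}
```

For ε = δ = 0.01 this needs only a 5 × 272 array of counters, regardless of stream length.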
Frequent Elements
125. 125
A very rich history - over 150 years
Anomalies are contextual in nature
“DISCORDANT observations may be defined as those which present the appearance of differing in respect of their law of frequency from other observations with which they are combined. In the treatment of such observations there is great diversity between authorities; but this discordance of methods may be reduced by the following reflection. Different methods are adapted to different hypotheses about the cause of a discordant observation; and different hypotheses are true, or appropriate, according as the subject-matter, or the degree of accuracy required, is different.”
F. Y. Edgeworth, “On Discordant Observations”, 1887.
Anomaly Detection
128. 128
Anomaly Detection
ROBUST MEASURES
MEDIAN
MAD [1] - Median Absolute Deviation
MCD [2] - Minimum Covariance Determinant
MVEE [3,4] - Minimum Volume Enclosing Ellipsoid
[1] P. J. Rousseeuw and C. Croux, “Alternatives to the Median Absolute Deviation”, 1993.
[2] http://onlinelibrary.wiley.com/wol1/doi/10.1002/wics.61/abstract
[3] P. J. Rousseeuw and A. M. Leroy, “Robust Regression and Outlier Detection”, 1987.
[4] M. J. Todd and E. A. Yıldırım, “On Khachiyan's algorithm for the computation of minimum-volume enclosing ellipsoids”, 2007.
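Median and MAD are simple to compute directly; below is a small sketch with a robust z-score anomaly rule, where the 3.0 cutoff is an illustrative choice rather than anything prescribed by the references:

```java
import java.util.Arrays;

// Median and MAD (Median Absolute Deviation), plus a robust
// outlier test: flag x when |x - median| / MAD exceeds a cutoff.
class RobustAnomaly {
    static double median(double[] xs) {
        double[] s = xs.clone();
        Arrays.sort(s);
        int n = s.length;
        return n % 2 == 1 ? s[n / 2] : (s[n / 2 - 1] + s[n / 2]) / 2.0;
    }

    static double mad(double[] xs) {
        double med = median(xs);
        double[] dev = new double[xs.length];
        for (int i = 0; i < xs.length; i++) dev[i] = Math.abs(xs[i] - med);
        return median(dev);
    }

    static boolean isAnomaly(double x, double[] xs) {
        double med = median(xs), m = mad(xs);
        if (m == 0) return x != med; // degenerate: all deviations zero
        return Math.abs(x - med) / m > 3.0;
    }
}
```

Unlike mean and standard deviation, both statistics tolerate up to 50% contaminated points, which is exactly why they are listed as robust measures.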
143. 143
Apache BookKeeper - Stream Storage
A storage system for log streams
Replicated, durable storage of log streams
Provides a fast tailing/streaming facility
Optimized for immutable data
Low-latency durability
Simple repeatable read consistency
High write and read availability
153. 153
What do we really mean by Compute?
Compute Unification
The aim is to react to events as they happen, in real-time
Where do events happen/arrive? On the message bus
What's a reaction? An action/transformation/function
161. 161
What's needed: Stream-Native Compute
Insight
Simplest possible API: method/procedure/function
Multi-language API: scale developers
Message bus native concepts: input/output/log as topics
Flexible runtime: simple standalone applications vs system managed applications
168. 168
Pulsar Functions
WordCount Topology
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.PulsarFunction;

public class CounterFunction implements PulsarFunction<String, Void> {
    @Override
    public Void process(String input, Context context) throws Exception {
        // split on whitespace; note that split(".") would treat "." as a
        // regex matching any character and yield no words
        for (String word : input.split("\\s+")) {
            context.incrCounter(word, 1);
        }
        return null;
    }
}
169. 169
Pulsar Functions
Running as a standalone application
bin/pulsar-admin functions localrun \
  --input persistent://sample/standalone/ns1/test_input \
  --output persistent://sample/standalone/ns1/test_result \
  --className org.mycompany.ExclamationFunction \
  --jar myjar.jar
Runs as a standalone process
Run as many instances as you want; the framework automatically balances data
Run and manage via Mesos/K8s/Nomad/your favorite tool
179. 179
BookKeeper - Use Cases
Combine messaging and storage
Stream storage combines the functionality of streaming and storage
WAL (Write-Ahead Log), Message Store, Object Store, Snapshots, Stream Processing
180. 180
Readings
FOCS’00: Clustering Data Streams
SIGMOD’02: Querying and mining data streams: You only get one look
SIAM Journal of Computing’09: Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams
PODS’02: Models and Issues in Data Stream Systems
SIGMOD’07: Statistical Analysis of Sketch Estimators
PODS’10: An optimal algorithm for the distinct elements problem
181. 181
Readings
SODA’10: Coresets and Sketches for high dimensional subspace approximation problems
SIGMOD’16: Time Adaptive Sketches (Ada-Sketches) for Summarizing Data Streams
SOSR’17: Heavy-Hitter Detection Entirely in the Data Plane
PODS’12: Graph Sketches: Sparsification, Spanners, and Subgraphs
Arxiv’16: Coresets and Sketches
ACM Queue’17: Data Sketching: The approximate approach is often faster and more efficient
182.
183. 183
GET IN TOUCH
Contact Us
@arun_kejariwal
@kramasamy, @sanjeerk
@sijieg, @merlimat
@nlu90
karthik@streamlio.io
arun_kejariwal@acm.org
184.
185. Enjoy the Presentation
The End