Pinot is a real-time OLAP data store that can support multiple analytics use cases like interactive dashboards, site facing queries, and anomaly detection in a single system. It achieves this through features like configurable indexes, dynamic query planning and execution, smart data partitioning and routing, and pre-materialized indexes like star-trees that optimize for latency and throughput across different workloads. The document discusses Pinot's architecture and optimizations that enable it to meet the performance requirements of these different use cases.
Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | ... (HostedbyConfluent)
At Stripe, we operate a general ledger modeled as double-entry bookkeeping for all financial transactions. Warehousing such data is challenging due to its high volume and high cardinality of unique accounts.
Furthermore, it is financially critical to get up-to-date, accurate analytics over all records. Because real-time transactions are constantly changing, it is impossible to pre-compute the analytics as a fixed time series. We overcame this challenge by building a real-time key-value store inside Pinot that can sustain half a million QPS across all financial transactions.
We will talk about the details of our solution and the interesting technical challenges faced.
Two of the most frequently asked questions about Pinot’s history are “Why did LinkedIn build Pinot?” and “How is it different from Druid, Elasticsearch, or Kylin?”. In this talk, we will go over the use cases that motivated us to build Pinot and how it has changed the analytics landscape at LinkedIn, Uber, and other companies.
Introducing new features in Apache Pinot. In this talk, we will go over indexing support in Pinot, the recently added text indexing feature, SQL support, and cloud readiness.
Building real-time analytics applications using Pinot: A LinkedIn case study (Kishore Gopalakrishna)
LinkedIn is the most advantageous social networking tool available to job seekers and business professionals today, with 610+ million members creating millions of posts, videos, and articles that generate tens of millions of shares, comments, and likes per day. LinkedIn has leveraged this activity data to build rich, interactive, user-facing analytics applications like “Who Viewed My Profile”, Talent Insights, Ad Analytics, and Publisher Analytics, among others. These applications are all powered by Pinot, as are internal dashboards and the anomaly detection and root-cause analysis platform ThirdEye. This talk will present how Pinot has become the de facto solution for serving analytic queries in milliseconds, ad-hoc reporting, and monitoring and anomaly detection on multidimensional data.
Real-time Analytics with Trino and Apache Pinot (Xiang Fu)
Trino summit 2021:
Overview of the Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support with the power of Apache Pinot's real-time analytics, giving you the best of both worlds.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ... (HostedbyConfluent)
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. It is heavily used at companies such as LinkedIn, Uber, and Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per second from Kafka, builds indexes in real time, and serves 100K+ queries per second while ensuring latency SLAs in the millisecond to sub-second range.
In the first implementation, we used the Consumer Group feature to manage offsets and checkpoints across multiple Kafka consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy for maintaining the SLA under a high query workload. But this model posed other challenges: since Kafka maintains offsets per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node required a lot of manual operations to ensure data was consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
Taking inspiration from the Kafka consumer group implementation, we redesigned real-time consumption in Pinot to maintain a consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas, and enabled us to copy data from another consumer group during node addition, node failure, or replication increases.
In this talk, we will deep dive into the various challenges and considerations that went into this design, and learn what makes Pinot resilient to failures in both Kafka brokers and Pinot components. We will introduce the new concept of "lockstep" sequencing, where multiple consumer groups synchronize checkpoints periodically to maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs and withstanding high throughput and ingestion rates.
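The lockstep idea above can be sketched in a few lines. This is an illustrative simulation, not Pinot's actual implementation; the `ReplicaConsumer` class and `lockstep_checkpoint` coordinator are hypothetical names introduced here to show why agreeing on a common offset makes all replicas identical.

```python
# Minimal sketch of "lockstep" checkpointing across replica consumers.
# All names here are hypothetical; this is not Pinot's API.

class ReplicaConsumer:
    def __init__(self, name):
        self.name = name
        self.consumed = []   # events applied to this replica so far
        self.offset = 0      # next offset to read from the shared log

    def poll(self, log, upto):
        """Consume events from the shared log up to a coordinated offset."""
        while self.offset < upto:
            self.consumed.append(log[self.offset])
            self.offset += 1

def lockstep_checkpoint(replicas, log):
    """Advance every replica to the same offset before committing, so each
    copy of a segment contains exactly the same rows."""
    target = len(log)        # coordinator picks one common end offset
    for r in replicas:
        r.poll(log, target)
    return target            # checkpoint shared by all replicas

log = ["e1", "e2", "e3"]
replicas = [ReplicaConsumer("r1"), ReplicaConsumer("r2")]
ckpt = lockstep_checkpoint(replicas, log)
assert all(r.consumed == log for r in replicas)
```

Because every replica commits at the same offset, a new consumer group can bootstrap by copying another replica's data, which is the property the redesign relies on during node addition and failure recovery.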
Building a Real-Time Analytics Application with Apache Pulsar and Apache Pinot (Altinity Ltd)
While the demands for real-time analytics are growing in leaps and bounds, analytics software must rely on streaming platforms to ingest high volumes of data traveling down the pipeline at lightning speed. We will take a look at two powerful open source Apache platforms, Pulsar and Pinot, that work hand in hand to deliver analytical results that bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Pinot is a real-time distributed OLAP datastore used at LinkedIn to deliver scalable real-time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021 (StreamNative)
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL, with results that can be joined with other data sources. This plugin will soon be renamed to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename, and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap... (Flink Forward)
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings, along with its unique vision and philosophy, it is considered the 4th generation of Big Data analytics frameworks, providing the only hybrid (real-time streaming + batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning, and graph processing.
In this talk, you will learn about:
1. What is the Apache Flink stack, and how does it fit into the Big Data ecosystem?
2. How does Apache Flink integrate with Hadoop and other open source tools for data input, output, and deployment?
3. Why is Apache Flink an alternative to Apache Hadoop MapReduce, Apache Storm, and Apache Spark?
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Real-time Analytics with Presto and Apache Pinot (Xiang Fu)
PrestoCon 2021
Most analytics products today focus either on ad-hoc analytics, which requires query flexibility without guaranteed latency, or on low-latency analytics with limited query capability.
In this talk, we will explore how to get the best of both worlds using Apache Pinot and Presto:
1. How people do analytics today, trading off latency against flexibility: a comparison of analytics on raw data vs. pre-joined/pre-cubed datasets.
2. An introduction to Apache Pinot as a column store for fast real-time analytics, and the Presto Pinot Connector to cover the entire landscape.
3. A deep dive into the Presto Pinot Connector to see how it performs predicate and aggregation pushdown.
4. Benchmark results for the Presto Pinot Connector.
How Uber scaled its Real Time Infrastructure to Trillion events per day (DataWorks Summit)
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ... (HostedbyConfluent)
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after messages are produced to Kafka.
One challenge we faced was updating existing data in Pinot with the changelog in Kafka, to deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross bookings with corrected ride fares, and restaurant owners can analyze their UberEats orders with the latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We needed to make architectural changes in how data is distributed via Kafka among the server nodes, and in how it is indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end, and how we added this ability to Pinot without any performance degradation.
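The role of partition-by-key in this design can be sketched simply. This is an illustrative model, not Pinot's implementation; the helper names are hypothetical. Because all records with the same primary key hash to the same partition, a single server sees every version of a row in order and can keep only the latest one.

```python
# Sketch of upsert semantics on top of a partition-by-key stream
# (hypothetical helpers, not Pinot code).
def partition_for(key, n_partitions):
    """Kafka-style routing: the same key always lands in the same partition."""
    return hash(key) % n_partitions

def apply_upserts(events, n_partitions=4):
    """Each partition keeps only the latest record per primary key, which is
    possible precisely because no key is split across partitions."""
    partitions = [dict() for _ in range(n_partitions)]
    for key, value in events:
        partitions[partition_for(key, n_partitions)][key] = value
    merged = {}
    for p in partitions:
        merged.update(p)          # keys are disjoint across partitions
    return merged

events = [("ride-1", {"fare": 10}), ("ride-1", {"fare": 12})]  # fare corrected
assert apply_upserts(events)["ride-1"] == {"fare": 12}
```

If the two versions of "ride-1" had landed on different servers, neither could know which was current; key-based routing is what makes the per-key "latest wins" rule a purely local decision.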
Extending Flink SQL for stream processing use cases (Flink Forward)
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real-time applications. Flink SQL provides a SQL interface that implements standard SQL. While standard SQL provides a perfect interface for batch processing, in a stream processing context it can result in ambiguity and complex syntax. As an example, consider these three types of streams: append-only streams, retract streams, and upsert streams. Using standard SQL, we would represent all of them as a Table, the same concept used in batch processing. Such overloading of concepts can make SQL statements ambiguous in a streaming context. In this talk, we will present extensions to Flink SQL that simplify SQL statements in the context of stream processing, and show how they work in a Flink application using different use cases. These extensions are only syntactic sugar, and users can continue to use Flink SQL as is if they prefer.
by
Hojjat Jafarpour
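The difference between the three stream types mentioned in the abstract can be shown with a small changelog interpreter. This is a generic illustration in Python, not Flink's API: the point is that materializing each kind of stream into a table requires different rules, which is exactly the ambiguity that arises when all three are written as a plain Table.

```python
# Generic sketch of materializing a changelog stream into a table
# (illustrative; not Flink code). "+" covers append and upsert entries,
# "-" covers retractions.
def materialize(changes):
    table = {}
    for kind, key, value in changes:
        if kind == "+":              # append/upsert: insert or overwrite by key
            table[key] = value
        elif kind == "-":            # retract: withdraw a previously emitted row
            table.pop(key, None)
    return table

# An append-only stream never contains "-" entries; a retract stream pairs
# each correction with an explicit "-"; an upsert stream just overwrites.
stream = [("+", "k1", 1), ("-", "k1", 1), ("+", "k1", 2)]
assert materialize(stream) == {"k1": 2}
```

Standard SQL gives no way to say which of these interpretations a given query's output should follow, which motivates the syntax extensions the talk proposes.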
Near real-time statistical modeling and anomaly detection using Flink! (Flink Forward)
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet. The most important aspect of our platform is detecting outages and anomalies that could seriously impact customer applications and user experience. Automatically detecting such events with the lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient, low-latency data pipelines in production using Flink, we decided to take it up a notch: we leveraged Flink to build statistical models in near real time and apply them to the incoming stream of events to detect anomalies! In this session we will deep dive into the design, and discuss pitfalls and learnings from developing our real-time platform, which leverages Debezium, Kafka, Flink, ElastiCache, and DynamoDB to process events at scale.
by
Kunal Umrigar & Balint Kurnasz
Mario Molina, Software Engineer
CDC systems are typically used to identify changes in data sources, and to capture and replicate those changes to other systems. Companies use CDC to sync data across systems, migrate to the cloud, or apply stream processing, among other uses.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Lambda architecture is a popular technique where records are processed by a batch system and a streaming system in parallel. The results are then combined at query time to provide a complete answer. Strict latency requirements for processing both old and recently generated events made this architecture popular. Its key downside is the development and operational overhead of managing two different systems.
There have been attempts to unify batch and streaming into a single system in the past, but organizations have not been very successful in those attempts. With the advent of Delta Lake, however, we are seeing a lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture the Delta Architecture.
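The query-time merge that defines the Lambda architecture can be sketched in a few lines. This is a minimal illustration, assuming a key-value view from each layer; the view names are hypothetical.

```python
# Minimal sketch of the query-time merge in a Lambda architecture.
# The batch layer holds accurate-but-stale aggregates; the speed layer holds
# counts only for events that arrived after the last batch run.
def query(key, batch_view, speed_view):
    """Combine both layers' partial answers into one complete answer."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

batch_view = {"clicks": 1000}   # recomputed nightly over the full history
speed_view = {"clicks": 42}     # streaming counts since the last batch job
assert query("clicks", batch_view, speed_view) == 1042
```

Every serving query has to perform this merge, and both views must be kept in sync with the same business logic; that duplicated logic is the operational overhead the Delta Architecture aims to remove.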
Personalizing "The Netflix Experience" with Deep Learning (Anoop Deoras)
These are the slides from my talk presented at the AI Next Con conference in Seattle in January 2019. Here I talk in a bit more detail about the intuition behind collaborative filtering and go deeper into the details of non-linear deep-learned models.
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at Night (ScyllaDB)
Opera chose Scylla over Cassandra to sync the data of millions of browsers to a back-end data repository. The results of the migration, and further optimizations they made in their stack, helped Opera achieve better latency and throughput and lower resource usage, beyond their expectations.
Attend this session to learn how to:
- Migrate your data in a sane way, without any downtime
- Connect a Python+Django web app to Scylla, and use intranode sharding to improve your application
Impatience is a Virtue: Revisiting Disorder in High-Performance Log Analytics (Badrish Chandramouli)
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
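The Patience-sort idea underlying the abstract's approach can be sketched briefly. This is a simplified illustration, not the paper's Impatience sort: split an almost-sorted stream into sorted runs, then merge the runs with a k-way heap merge. Nearly sorted input produces very few runs, which is what makes the approach cheap for streams disordered only by network delays and retries.

```python
import heapq

# Simplified run-based Patience sort (illustrative, not the Trill implementation).
def patience_runs(stream):
    """Greedily assign each element to the first run it can extend; elements
    that fit nowhere start a new run. Each run stays sorted by construction."""
    runs = []
    for x in stream:
        for run in runs:
            if run[-1] <= x:      # extend the first run whose tail fits
                run.append(x)
                break
        else:
            runs.append([x])      # no run fits: open a new one
    return runs

def patience_sort(stream):
    """k-way merge of the sorted runs yields the fully sorted output."""
    return list(heapq.merge(*patience_runs(stream)))

almost_sorted = [1, 2, 4, 3, 5, 7, 6, 8]   # mild disorder, as from the network
assert patience_sort(almost_sorted) == sorted(almost_sorted)
```

On this input only two runs are created, so the merge heap stays tiny; a fully random input would degrade toward many runs, matching the paper's observation that the technique shines on "almost sorted" streams.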
Real-time Analytics with Trino and Apache PinotXiang Fu
Trino summit 2021:
Overview of Trino Pinot Connector, which bridges the flexibility of Trino's full SQL support to the power of Apache Pinot's realtime analytics, giving you the best of both worlds.
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
We built Apache Pinot - a real-time distributed OLAP datastore - for low-latency analytics at scale. This is heavily used at companies such as LinkedIn, Uber, Slack, where Kafka serves as the backbone for capturing vast amounts of data. Pinot ingests millions of events per sec from Kafka, builds indexes in real-time and serves 100K+ queries per second while ensuring latency SLA of millisecond to sub second.
In the first implementation, we used the Consumer Group feature to manage the offsets and checkpoints across multiple Kafka Consumers. However, to achieve fault tolerance and scalability, we had to run multiple consumer groups for the same topic. This was our initial strategy to maintain the SLA at high query workload. But this model posed other challenges - since Kafka maintains offset per consumer group, achieving data consistency across multiple consumer groups was not possible. Also, a failure of a single node in a consumer group meant the entire consumer group was unavailable for query processing. Restarting the failed node needed lot of manual operations to ensure data is consumed exactly once. This resulted in management overhead and inefficient hardware utilization.
While taking inspiration from the Kafka consumer group implementation, we redesigned the real-time consumption in Pinot to maintain consistent offset across multiple consumer groups. This allowed us to guarantee consistent data across all replicas. This enabled us to copy data from another consumer group during node addition, node failure or increasing the replication group.
In this talk, we will deep dive into the various challenges faced and considerations that went into this design, and learn what makes Pinot resilient to failures both in Kafka Brokers and Pinot Components. We will introduce the new concept of ""lockstep"" sequencing where multiple consumer groups can synchronize checkpoints periodically and maintain consistency. We'll describe how we achieve this while maintaining strict freshness SLAs, and also withstanding high throughput and ingestion.
Building a Real-Time Analytics Application with Apache Pulsar and Apache PinotAltinity Ltd
Building a Real-Time Analytics Application with
Apache Pulsar and Apache Pinot
While the demands for real-time analytics are growing in leaps and bounds, the analytics software must rely on streaming platforms for ingesting high volumes of data that's traveling in lightning speed down the pipeline. We will take a look at 2 powerful open source Apache platforms: Pulsar and Pinot, that work hand-in-hand together to deliver the analytical results which bring great value to your systems.
Presenters: Mary Grygleski - Streaming Developer Advocate &
Mark Needham - Developer Relations Engineer at StarTree
Note: This webinar will be recorded and later posted on our Webinar page (https://altinity.com/webinarspage/) or Altinity official Youtube channel (https://www.youtube.com/@Altinity).
Pinot is a realtime distributed OLAP datastore, which is used at LinkedIn to deliver scalable real time analytics with low latency. It can ingest data from offline data sources (such as Hadoop and flat files) as well as online sources (such as Kafka). Pinot is designed to scale horizontally.
Trino: A Ludicrously Fast Query Engine - Pulsar Summit NA 2021StreamNative
You may be familiar with the Presto plugin used to run fast interactive queries over Pulsar using ANSI SQL and can be joined with other data sources. This plugin will soon get a rename to align with the rename of the PrestoSQL project to Trino. What is the purpose of this rename and what does it mean for those using the Presto plugin? We cover the history of the community shift from PrestoDB to PrestoSQL, as well as, the future plans for the Pulsar community to donate this plugin to the Trino project. One of the connector maintainers will then demo the connector and show what is possible when using Trino and Pulsar!
Squirreling Away $640 Billion: How Stripe Leverages Flink for Change Data Cap...Flink Forward
Flink Forward San Francisco 2022.
Being in the payments space, Stripe requires strict correctness and freshness guarantees. We rely on Flink as the natural solution for delivering on this in support of our Change Data Capture (CDC) infrastructure. We heavily rely on CDC as a tool for capturing data change streams from our databases without critically impacting database reliability, scalability, and maintainability. Data derived from these streams is used broadly across the business and powers many of our critical financial reporting systems totalling over $640 Billion in payment volume annually. We use many components of Flink’s flexible DataStream API to perform aggregations and abstract away the complexities of stream processing from our downstreams. In this talk, we’ll walk through our experience from the very beginning to what we have in production today. We’ll share stories around the technical details and trade-offs we encountered along the way.
by
Jeff Chao
This introductory level talk is about Apache Flink: a multi-purpose Big Data analytics framework leading a movement towards the unification of batch and stream processing in the open source.
With the many technical innovations it brings along with its unique vision and philosophy, it is considered the 4 G (4th Generation) of Big Data Analytics frameworks providing the only hybrid (Real-Time Streaming + Batch) open source distributed data processing engine supporting many use cases: batch, streaming, relational queries, machine learning and graph processing.
In this talk, you will learn about:
1. What is Apache Flink stack and how it fits into the Big Data ecosystem?
2. How Apache Flink integrates with Hadoop and other open source tools for data input and output as well as deployment?
3. Why Apache Flink is an alternative to Apache Hadoop MapReduce, Apache Storm and Apache Spark.
4. Who is using Apache Flink?
5. Where to learn more about Apache Flink?
Real-time Analytics with Presto and Apache PinotXiang Fu
Presto Con 2021
In this world, most analytics products either focus on ad-hoc analytics, which requires query flexibility without guaranteed latency, or low latency analytics with limited query capability.
In this talk, we will explore how to get the best of both worlds using Apache Pinot and Presto:
1. How people do analytics today to trade-off Latency and Flexibility: Comparison over analytics on raw data vs pre-join/pre-cube dataset.
2. Introduce Apache Pinot as a column store for fast real-time data analytics and Presto Pinot Connector to cover the entire landscape.
3. Deep dive into Presto Pinot Connector to see how the connector does predicate and aggregation push down.
4. Benchmark results for Presto Pinot connector.
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
Building data pipelines is pretty hard! Building a multi-datacenter active-active real time data pipeline for multiple classes of data with different durability, latency and availability guarantees is much harder.
Real time infrastructure powers critical pieces of Uber (think Surge) and in this talk we will discuss our architecture, technical challenges, learnings and how a blend of open source infrastructure (Apache Kafka and Samza) and in-house technologies have helped Uber scale.
Real-time Analytics with Upsert Using Apache Kafka and Apache Pinot | Yupeng ...HostedbyConfluent
Apache Kafka is used as the primary message bus for propagating events and logs across Uber. In particular, it pairs with Apache Pinot, a real-time distributed OLAP datastore, to deliver real-time insights seconds after the messages produced to Kafka.
One challenge we faced was to update existing data in Pinot with the changelog in Kafka, and deliver an accurate view in the real-time analytical results. For example, the financial dashboard can report gross booking with the corrected Ride fares. And restaurant owners can analyze the UberEats orders with their latest delivery status.
Implementing upserts in an immutable real-time OLAP store like Pinot is nontrivial. We need to make architectural changes in how data is distributed via Kafka amongst the server nodes, how it's indexed and queried in a distributed fashion. In this talk I will discuss how we leveraged Kafka's partition-by-key feature to this end and how we added this ability in Pinot without any performance degradation.
Extending Flink SQL for stream processing use casesFlink Forward
Flink Forward San Francisco 2022.
Apache Flink is a powerful stream processing platform that enables users to build complex real time applications. Flink SQL provides a SQL interface that implements standard SQL. While the standard SQL provides a perfect interface for batch processing, in stream processing context, it can result is ambiguity and complex syntax. As an example, consider these three types of streams: Append-only stream, Retract stream and Upsert stream. Using standard SQL, we would represent all of these streams as Table along with the Table concept in batch processing. Such overloading of concepts can result in ambiguity in SQL statements in streaming context. In this talk, we will present extensions to the Flink SQL that simplify SQL statements in the context of stream processing. We will show how such extensions work in the context of a Flink application using different use cases. These extensions are only sugar syntax and users should be able to use Flink SQL as is if they desire.
by
Hojjat Jafarpour
Near real-time statistical modeling and anomaly detection using Flink!Flink Forward
Flink Forward San Francisco 2022.
At ThousandEyes we receive billions of events every day that allow us to monitor the internet; the most important aspect of our platform is to detect outages and anomalies that have a potential to cause serious impact to customer applications and user experience. Automatic detection of such events at lowest latency and highest accuracy is extremely important for our customers and their business. After launching several resilient and low latency data pipelines in production using Flink we decided to take it up a notch; we leveraged Flink to build statistical models in near real-time and apply them on incoming stream of events to detect anomalies! In this session we will deep dive into the design as well as discuss pitfalls and learnings while developing our real-time platform that leverages Debezium, Kafka, Flink, ElasticCache and DynamoDB to process events at scale!
by
Kunal Umrigar & Balint Kurnasz
Mario Molina, Software Engineer
CDC systems are usually used to identify changes in data sources, capture and replicate those changes to other systems. Companies are using CDC to sync data across systems, cloud migration or even applying stream processing, among others.
In this presentation we’ll see CDC patterns, how to use it in Apache Kafka, and do a live demo!
https://www.meetup.com/Mexico-Kafka/events/277309497/
Lambda architecture is a popular technique where records are processed by a batch system and streaming system in parallel. The results are then combined during query time to provide a complete answer. Strict latency requirements to process old and recently generated events made this architecture popular. The key downside to this architecture is the development and operational overhead of managing two different systems.
There have been attempts to unify batch and streaming into a single system in the past. Organizations have not been that successful though in those attempts. But, with the advent of Delta Lake, we are seeing lot of engineers adopting a simple continuous data flow model to process data as it arrives. We call this architecture, The Delta Architecture.
Personalizing "The Netflix Experience" with Deep LearningAnoop Deoras
These are the slides from my talk presented at AI Next Con conference in Seattle in Jan 2019. Here I talk in a bit more detail about the intuition behind collaborative filtering and go a bit deeper into the details of non linear deep learned models.
How Opera Syncs Tens of Millions of Browsers and Sleeps Well at NightScyllaDB
Opera chose Scylla over Cassandra to sync the data of millions of browsers to a back-end data repository. The results of the migration and further optimizations they made in their stack helped Opera to gain better latency/throughput and lower resources usage beyond their expectations.
Attend this session to learn how to
Migrate your data in a sane way, without any downtime
Connect a Python+Django web app to Scylla, how to use intranode sharding to improve your application
Impatience is a Virtue: Revisiting Disorder in High-Performance Log AnalyticsBadrish Chandramouli
There is a growing interest in processing real-time queries over out-of-order streams in this big data era. This paper presents a comprehensive solution to meet this requirement. Our solution is based on Impatience sort, an online sorting technique that is based on an old technique called Patience sort. Impatience sort is tailored for incrementally sorting streaming datasets that present themselves as almost sorted, usually due to network delays and machine failures. With several optimizations, our solution can adapt to both input streams and query logic. Further, we develop a new Impatience framework that leverages Impatience sort to reduce the latency and memory usage of query execution, and supports a range of user latency requirements, without compromising on query completeness and throughput, while leveraging existing efficient in-order streaming engines and operators. We evaluate our proposed solution in Trill, a high-performance streaming engine, and demonstrate that our techniques significantly improve sorting performance and reduce memory usage – in some cases, by over an order of magnitude.
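The classic Patience sort the paper builds on can be illustrated in a few lines of Python. This is a simplified sketch of the textbook algorithm, not the paper's Impatience variant; the key property it demonstrates is that nearly-sorted input produces very few piles, so the final merge is cheap:

```python
import heapq

def patience_sort(stream):
    """Classic Patience sort: deal items onto piles, each pile kept in
    ascending order, then k-way merge the piles. A nearly-sorted stream
    (e.g. out-of-order only due to network delay) yields few piles."""
    piles = []  # each pile is an ascending list; pile[-1] is its top
    for x in stream:
        # place x on the first pile whose top is <= x
        for pile in piles:
            if pile[-1] <= x:
                pile.append(x)
                break
        else:
            piles.append([x])  # no pile fits: start a new one
    # every pile is already sorted ascending; merge them
    return list(heapq.merge(*piles))

# A nearly-sorted stream needs only two piles here:
print(patience_sort([1, 2, 4, 3, 5, 7, 6, 8]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```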
Tutorial: The Role of Event-Time Analysis Order in Data StreamingVincenzo Gulisano
Slides for our tutorial, titled “The Role of Event-Time Analysis Order in Data Streaming”, presented at the 14th ACM International Conference on Distributed and Event-Based Systems (DEBS) conference. We have recorded the tutorial, and you can find the videos at the following links:
Part 1: https://youtu.be/SW_WS6ULsdY
Part 2: https://youtu.be/bq3ECNvPwOU
You can find these slides, as well as the code examples, at https://github.com/vincenzo-gulisano/debs2020_tutorial_event_time and on SlideShare.
Organizations continue to adopt Solr because of its ability to scale to meet even the most demanding workflows. Recently, LucidWorks has been leading the effort to identify, measure, and expand the limits of Solr. As part of this effort, we've learned a few things along the way that should prove useful for any organization wanting to scale Solr. Attendees will come away with a better understanding of how sharding and replication impact performance. Also, no benchmark is useful without being repeatable; Tim will also cover how to perform similar tests using the Solr-Scale-Toolkit in Amazon EC2.
Clojure is a modern dynamically typed Lisp. Dynamic typing is often associated with poor performance and runtime failures. In this talk, I'll present some of the lessons learned building Clojure/Script systems that are both ridiculously fast and fail fast on errors. We'll compare the performance of mutable, persistent & zero-copy data structures and show how we can use interpreters and compilers to build beautiful and performant abstractions. A quick demo shows how to build a simple non-blocking web server that runs idiomatic Clojure to serve millions of requests per second.
Clojure is awesome, and it can be fast too.
Video: https://www.youtube.com/watch?v=3SSHjKT3ZmA
There are at least 40 to 50 different GC log formats. Here we explain the commonly used formats, along with the tricks, patterns, and tools to analyze them effectively.
This presentation is from Gophercon India, where we talked about how to design a concurrent, high-performance database client in Go. We discussed how we use goroutines and channels to our advantage, and how to use pools for efficient memory utilization.
Dataflow - A Unified Model for Batch and Streaming Data ProcessingDoiT International
Batch and Streaming Data Processing and Vizualize 300Tb in 5 Seconds meetup on April 18th, 2016 (http://www.meetup.com/Big-things-are-happening-here/events/229532500)
Learn about the various approaches to sharding your data with MongoDB. This presentation will help you answer questions such as when to shard and how to choose a shard key.
On the way to low latency (2nd edition)Artem Orobets
This is the second edition of the story about how we struggled to implement strict latency requirements in a service implemented with Java and how we managed to do that.
The most common latency contributors are in-process locking, thread scheduling, I/O, algorithmic inefficiencies and, of course, the garbage collector.
I will share our experience of dealing with these causes, and explain what you can do to prevent them from affecting production.
Cassandra Tools and Distributed Administration (Jeffrey Berger, Knewton) | C*...DataStax
At Knewton we operate a total of 29 clusters across five different VPCs, each ranging from 3 to 24 nodes. For a team of three, maintaining this is not herculean; however, good tools to diagnose issues and gather information in a distributed manner are vital to moving quickly and minimizing engineering time spent.
The database team at Knewton has been successfully using a combination of Ansible and custom open sourced tools to maintain and improve the Cassandra deployment at Knewton. I will be talking about several of these tools and giving examples of how we are using them. Specifically I will discuss the cassandra-tracing tool, which analyzes the contents of the system_traces keyspace, and the cassandra-stat tool, which gives real-time output of the operations of a cassandra cluster. Distributed administration with ad-hoc Ansible will also be covered and I will walk through examples of using these commands to identify and remediate clusterwide issues.
About the Speaker
Jeffrey Berger Lead Database Engineer, Knewton
Dr. Jeffrey Berger is currently the lead database engineer at Knewton, an education tech startup in NYC. He joined the tech scene in NYC in 2013 and spent two years working with MongoDB, becoming a certified MongoDB administrator and a MongoDB Master. He received his Cassandra Administrator certification at Cassandra Summit 2015. He holds a Ph.D. in Theoretical Physics from Penn State and spent several years working on high energy nuclear interactions.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...pchutichetpong
M Capital Group (“MCG”) expects demand and supply to keep evolving, facilitated through institutional investment rotating out of offices and into work from home (“WFH”), while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as progressing cloud services and edge sites, allowing the industry to expect strong annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23...John Andrews
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list-based graph representation used by graph algorithms like PageRank.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptxOpendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex. Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits, Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay: the marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
3. Analytics Use Case: Interactive Dashboard
select sum(pageView), time from T
where country = us,
browser = chrome,…
group by time
Slice and dice over arbitrary dimensions
Human driven queries
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
4. Analytics Use Case: Site Facing
select sum(pageView) from T
where memberId = 456,
pageKey = “profilePage”,
privacySettings in (…)
group by time,[title|geo|industry]
Pre-defined query format with different primary key values
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store
5. Analytics Use Case: Anomaly Detection
for d1 in [us, ca, … ]
for d2 in [chrome, ie, … ]
…
select sum(pageView), time from T
where country = d1, browser = d2
group by time
Identifying all issues requires us to monitor all possible combinations
Periodic machine-generated queries (bursty)
Use Case | Response Latency | Query Rate | Possible Solutions
Anomaly Detection | sub-second to few seconds | 10-100s qps | Streaming Engine
6.
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP style query
What makes these use cases need different solutions?
Different solutions are chosen based on different workload characteristics.
Can we support all these use cases in one single system?
7. What is Pinot?
SQL-like interface with predictable latency (no joins)
Batch Data Ingestion (Hadoop)
Realtime Data Ingestion (Kafka)
Distributed, horizontally scalable
Open source! (https://github.com/linkedin/pinot)
8. Pinot @ LinkedIn
50+ site facing use cases
60k+ queries per second
1.4M+ records ingested per second
2000+ tables
• 300B documents per data center
• 2 trillion documents for internal use case
10. Architecture Overview
• Controller - handles cluster-wide coordination using Apache Helix and Zookeeper
• Broker - handles query fan-out and query routing to servers
• Server - responds to query requests originating from the brokers
15. Query Execution: Hybrid Querying
[Diagram: a broker in front of an offline server (segments for t = 1, 2) and a realtime server (segments for t = 3, 4, 5, overlapping the offline data at t = 1-2); time boundary: 2]
16. Query Execution: Hybrid Querying
The broker receives: select sum(m) from T
17. Query Execution: Hybrid Querying
The broker splits the query at the time boundary:
offline server: select sum(m) from T where t <= 2
realtime server: select sum(m) from T where t > 2
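The broker's split above can be sketched as follows. The function name and the string-rewriting approach are illustrative only, not Pinot's actual broker code; the point is that one logical query becomes two disjoint queries at the time boundary so no row is counted twice:

```python
def split_hybrid_query(base_query, time_column, time_boundary):
    """Rewrite one logical query into an offline part (immutable segments)
    and a realtime part (freshly ingested rows), split at the time
    boundary so the two result sets never overlap."""
    offline = f"{base_query} where {time_column} <= {time_boundary}"
    realtime = f"{base_query} where {time_column} > {time_boundary}"
    return offline, realtime

off, rt = split_hybrid_query("select sum(m) from T", "t", 2)
print(off)  # select sum(m) from T where t <= 2
print(rt)   # select sum(m) from T where t > 2
```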
18. Query Execution: Single Node Query Optimization
select max(col) from T: use metadata instead of scanning.
select sum(metric) from T where country = us and accountId = x: reorder filters for better performance (apply the accountId predicate before the country predicate).
Dynamic query planning based on column metadata, index, and dictionary.
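A toy illustration of the two optimizations named above. The metadata shapes and function names are hypothetical (Pinot's real planner works on per-segment metadata, not Python dicts), but they show the idea: answer max() from metadata without a scan, and apply the most selective predicate first:

```python
def plan_max_query(column, segment_metadata):
    """If each segment's metadata already records min/max for the column,
    select max(col) needs no row scan at all."""
    return max(meta["max"] for meta in segment_metadata[column])

def reorder_filters(predicates, cardinality):
    """Apply the most selective predicate first: a higher-cardinality
    column (e.g. accountId) usually matches far fewer rows per value
    than a low-cardinality one (e.g. country)."""
    return sorted(predicates, key=lambda p: -cardinality[p[0]])

segments = {"col": [{"min": 1, "max": 40}, {"min": 5, "max": 97}]}
print(plan_max_query("col", segments))  # 97

preds = [("country", "us"), ("accountId", "x")]
card = {"country": 200, "accountId": 1_000_000}
print(reorder_filters(preds, card))  # accountId predicate comes first
```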
19. Anatomy of a Pinot Segment
Metadata: start/end time, available indexes, partitioning info, min/max value, …
Indexes: inverted, sorted, star-tree
Raw data:
docId | country | code
0 | us | 002
1 | ca | 001
2 | jp | 003
… | … | …
Dictionary (sorted unique values; position = dictId):
country: ca, jp, us, …
code: 001, 002, 003, …
Forward index (docId to dictId):
country: 2, 0, 1, …
code: 1, 0, 2, …
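The relationship between the dictionary, forward index, and inverted index can be sketched in a few lines. The data mirrors the sample country column above; the function itself is illustrative, not Pinot's on-disk segment format:

```python
def build_column_indexes(values):
    """Dictionary-encode one column: a sorted dictionary of unique
    values, a forward index of dictIds per docId, and an inverted
    index mapping each dictId back to the docIds containing it."""
    dictionary = sorted(set(values))
    dict_id = {v: i for i, v in enumerate(dictionary)}
    forward = [dict_id[v] for v in values]      # one entry per docId
    inverted = {}
    for doc_id, d in enumerate(forward):
        inverted.setdefault(d, []).append(doc_id)
    return dictionary, forward, inverted

country = ["us", "ca", "jp"]                    # docIds 0, 1, 2
dictionary, forward, inverted = build_column_indexes(country)
print(dictionary)  # ['ca', 'jp', 'us']
print(forward)     # [2, 0, 1]  -- matches the forward index above
```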
21. Recap: Analytics Use Cases
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
Site facing | 100ms (99th percentile) | 1000s qps | KV Store (pre-cube)
Anomaly detection | sub-second to few seconds | 10-100s qps | Streaming Engine
Same input data (Pageview)
Same OLAP-style query
Different solutions based on different workload characteristics
22. Interactive Dashboard
Use Case | Response Latency | Query Rate | Possible Solutions
Interactive dashboard | sub-second to few seconds | ~1 qps | Columnar Store
select sum(pageView), time from T
where country = us, browser = chrome,…
group by time
[Chart: latency distribution (0-500 ms) by query frequency, Pinot vs Druid]
23. Site Facing
Use Case | Response Latency | Query Rate | Possible Solutions
Site facing | 100ms (99th percentile) | 1000s qps | KV Store
select sum(pageView) from T
where memberId = xx, privacySettings in…
group by time,[title|geo|industry]
[Charts: latency (ms) vs queries per second (10-1000), Pinot vs Druid]
24. Pinot Optimizations For Site Facing Use Cases
• Optimizing Query Processing
1. Sorted Index + Dynamic execution planning
• Optimizing Scatter and Gather
1. Smart segment assignment and routing
2. Data partitioning and pruning
25. Optimizing Query Processing: Sorted Index
• Access to both forward/inverted index
• Fetch contiguous blocks, benefit from locality
• For item filtering, pick scanning or inverted index based on the cardinality of the sorted column
Sorted index:
memberId | start docId | end docId
123 | 0 | 100
456 | 101 | 300
… | … | …
Forward index:
docId | memberId
0 | 123
… | …
100 | 123
101 | 456
… | …
300 | 456
… | …
select … where memberId = 456, item in (…) group by …
[Chart: latency (ms) vs queries per second, sorted index vs inverted index]
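A minimal sketch of why a sorted column makes this lookup cheap: two binary searches resolve an equality predicate to one contiguous docId range (here matching the sample table, docIds 101-300 for member 456). The function name is illustrative:

```python
import bisect

def sorted_index_range(sorted_column, value):
    """On a column sorted within the segment, a predicate like
    memberId = 456 resolves to a single contiguous half-open
    [start, end) docId range via two binary searches, so the scan
    is one locality-friendly block -- no inverted index needed."""
    start = bisect.bisect_left(sorted_column, value)
    end = bisect.bisect_right(sorted_column, value)
    return start, end

# docIds 0-100 hold member 123, docIds 101-300 hold member 456
member_ids = [123] * 101 + [456] * 200
print(sorted_index_range(member_ids, 456))  # (101, 301)
```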
26. Optimizing Scatter and Gather: Querying All Servers
Replica group: a set of servers that contains a complete set of all segments.
[Diagram: without replica groups, query 1 and query 2 each fan out to all servers holding segments 1-4; with replica groups RG1 and RG2 (each holding all segments), query 1 is routed only to RG1 and query 2 only to RG2]
[Chart: latency (ms) vs queries per second, with and without the routing optimization]
Problem | Impact | Solution
Querying all servers | 99th percentile is impacted by the slowest server (e.g. GC) | Control the number of servers to fan out to
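Replica-group routing can be sketched as follows. The hashing scheme and names are assumptions for illustration (Pinot's actual assignment is configuration-driven); the point is that each query fans out to only one group's servers, so a slow server in another group cannot affect it:

```python
import zlib

def route_query(query_id, replica_groups):
    """Pick one replica group deterministically per query. Since every
    group holds a complete copy of all segments, the query fans out
    only to that group's servers instead of the whole cluster."""
    index = zlib.crc32(query_id.encode()) % len(replica_groups)
    return replica_groups[index]

rg1 = ["server1", "server2"]
rg2 = ["server3", "server4"]
servers = route_query("query-42", [rg1, rg2])
print(servers)  # the servers of exactly one replica group
```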
27. Optimizing Scatter and Gather: Querying All Segments
select … where memberId = 456, item in (…) group by …
[Diagram: without partitioning, each query hits all segments S1-S4; with partitioning (S1, S2 in partition 1; S3, S4 in partition 2), a query whose memberId falls in partition 1 hits only S1 and S2, and one in partition 2 hits only S3 and S4]
Problem | Impact | Solution
Querying all segments | More CPU work on the servers | Minimize the number of segments queried (partitioning and pruning)
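Partition-based pruning can be sketched as below. The modulo partition function and the 0-based partition numbering are illustrative assumptions (the slide numbers its partitions 1 and 2); the idea is that when data is partitioned on memberId at ingestion time, the broker skips every segment whose partition cannot contain the queried key:

```python
def prune_segments(segments, member_id, num_partitions):
    """Keep only the segments whose partition matches the query's
    memberId, so the servers never touch the rest."""
    target = member_id % num_partitions
    return [s for s in segments if s["partition"] == target]

segments = [{"name": "S1", "partition": 0}, {"name": "S2", "partition": 0},
            {"name": "S3", "partition": 1}, {"name": "S4", "partition": 1}]
# memberId 456 falls in partition 0 (456 % 2 == 0): only S1 and S2 survive
print(prune_segments(segments, 456, 2))
```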
28. Anomaly Detection: Challenge
for d1 in [us, ca, …]
for d2 in [key1, key2,…]
…
select sum(pageViews) from T
where country=d1, page_key=d2,
source_app=d3, device_name=d4…
group by country, time
…
Filter | Aggregation | Latency
select … where country = us,… | Slow, scans 60-70% of the data | high
select … where country = kenya,… | Scans less than 1% | low
• Latency is not predictable: it depends on the query predicate
• Monitoring all possible combinations makes the problem worse!
29. Time vs Space Trade-off
[Chart: latency vs storage requirement]
Columnar Store: variable latency, low storage overhead
KV Store (pre-computed): low latency, high storage overhead
Star-tree Index: in between on both axes
30. Star-tree Index Generation
1. Multidimensional sort
2. Split on a column and create a node for each value
3. Create a star node (aggregate the metric after removing the split column)
4. Apply steps 1-3 to each node recursively; stop when the number of records in a node < SplitThreshold
Raw records:
docId | country | browser | other dimensions | impression
0 | al | ie | … | 10
1 | ca | safari | … | 10
2 | … | … | … | …
… | us | chrome | … | 10
… | us | chrome | … | 10
… | us | ie | … | 10
N | us | safari | … | 10
Aggregated records:
N+1 | * | chrome | … | 40
N+2 | * | ie | … | 20
N+3 | * | safari | … | 20
[Tree: root splits on country into al, ca, …, us, and a star (*) node; the star node splits on browser into chrome, …, safari]
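Step 3 of the recipe above (star-node creation) can be sketched as a group-by that drops the split column. This is a minimal illustration with a small assumed record set, not Pinot's star-tree builder, and the sample totals differ from the slide's (which aggregates over more raw rows):

```python
from collections import defaultdict

def star_records(records, split_column, metric):
    """Replace the split column with '*' and aggregate the metric over
    the remaining dimensions -- the records stored under a star node."""
    agg = defaultdict(int)
    for rec in records:
        rest = tuple(sorted((c, v) for c, v in rec.items()
                            if c not in (split_column, metric)))
        agg[rest] += rec[metric]
    return [{**dict(rest), split_column: "*", metric: total}
            for rest, total in agg.items()]

raw = [
    {"country": "al", "browser": "ie", "impression": 10},
    {"country": "ca", "browser": "safari", "impression": 10},
    {"country": "us", "browser": "chrome", "impression": 10},
    {"country": "us", "browser": "chrome", "impression": 10},
    {"country": "us", "browser": "ie", "impression": 10},
    {"country": "us", "browser": "safari", "impression": 10},
]
for rec in star_records(raw, "country", "impression"):
    print(rec)  # e.g. {'browser': 'chrome', 'country': '*', 'impression': 20}
```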
31. Time vs Space Trade-off with Star-tree
[Chart: latency vs storage requirement for Columnar Store, KV Store (pre-computed), and Star-tree Index]
SplitThreshold = infinity: no pre-materialization
SplitThreshold = 1: full materialization
SplitThreshold = 100,000: partial, data-aware materialization
32. Star-tree Query Execution
select sum(X)
from T
where d1=v1 and d2=v2 and …
Any query pattern will scan less than SplitThreshold records.
select sum(pageViews) from T where country = AL
select sum(pageViews) from T where country = CA
select sum(pageViews) from T where browser = Chrome
select sum(pageViews) from T
[Tree: root splits on country (al, ca, …, us, *); each node splits on browser (chrome, …, safari); raw docs sit under value nodes, aggregated docs under star nodes]
33. Anomaly Detection
[Chart: latency (ms) vs queries per second (1-100) for Druid, Pinot with inverted index, and Pinot with star-tree index]
Use Case | Response Latency | Query Throughput | Possible Solutions
Anomaly detection | sub-second to few seconds | 10-100s queries per second | Streaming Engine
34. Pinot vs Druid
Feature | Druid | Pinot
Inverted index | Always on all columns, fixed | Configurable on a per-column basis
Query execution layer | Fixed plan | Split into planning and execution
Data organization | N/A | Sorted column
Partitioning | Only available for the time column | Available for any column
Controlling query fan-out | N/A | Replica-group-based segment assignment and routing
Smart pre-materialization | N/A | Star-tree
35. Can we support all these use cases in one single system?
Use Case | Response Latency | Query Rate | Solution
Interactive dashboard | sub-second to few seconds | ~1 qps | Pinot
Site facing | 100ms (99th percentile) | 1000s qps | Pinot
Anomaly detection | sub-second to few seconds | 10-100s qps | Pinot