This tutorial presents an in-depth overview of the streaming analytics landscape: applications, algorithms, and platforms. We walk through how the field has evolved over the last decade and then discuss the current challenges, in particular the impact of the other three Vs of Big Data (Volume, Variety, and Veracity) on streaming analytics.
Rainbird: Realtime Analytics at Twitter (Strata 2011), by Kevin Weil
Introducing Rainbird, Twitter's high volume distributed counting service for realtime analytics, built on Cassandra. This presentation looks at the motivation, design, and uses of Rainbird across Twitter.
Unique ID generation in distributed systems, by Dave Gardner
The document discusses different strategies for generating unique IDs in a distributed system. It covers using auto-incrementing numeric IDs in MySQL, which are not resilient, and various solutions like UUIDs, Twitter Snowflake IDs, and Flickr ticket servers that generate IDs in a distributed and ordered way without coordination between data centers. It also provides code examples of generating Twitter Snowflake-like IDs in PHP without coordination using ZeroMQ.
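The deck's examples are in PHP; for illustration, here is a minimal Python sketch of the same Snowflake-style scheme (41-bit millisecond timestamp, 10-bit worker ID, 12-bit sequence, which is the commonly documented layout). The class name and epoch handling are illustrative, not Twitter's actual code.

```python
import time

# Snowflake-style 64-bit layout (commonly documented convention):
# 41 bits: milliseconds since a custom epoch
# 10 bits: worker ID
# 12 bits: per-millisecond sequence
EPOCH_MS = 1288834974657  # Twitter's published custom epoch

class SnowflakeGenerator:
    def __init__(self, worker_id):
        assert 0 <= worker_id < 1024, "worker ID must fit in 10 bits"
        self.worker_id = worker_id
        self.last_ms = -1
        self.sequence = 0

    def next_id(self, now_ms=None):
        now = int(time.time() * 1000) if now_ms is None else now_ms
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF  # wrap at 4096
            if self.sequence == 0:
                # sequence exhausted within this millisecond: spin to the next
                while now <= self.last_ms:
                    now = int(time.time() * 1000)
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - EPOCH_MS) << 22) | (self.worker_id << 12) | self.sequence

gen = SnowflakeGenerator(worker_id=7)
a, b = gen.next_id(), gen.next_id()
assert a < b  # IDs remain time- and sequence-ordered
```

Because the timestamp occupies the high bits, IDs sort by creation time without any coordination between workers, which is the property that makes this scheme attractive for distributed counters and event logs.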
InfluxDB is an open source time series database written in Go that stores metric data and performs real-time analytics. It has no external dependencies. InfluxDB stores data as time series with measurements, tags, and fields. Data is written using a line protocol and can be visualized using Grafana, an open source metrics dashboard.
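To make the write format concrete, here is a small Python sketch that renders one point in line protocol. The `i` suffix for integers and the `measurement,tags fields timestamp` shape follow the documented format; escaping of special characters is omitted for brevity, and the helper name is invented.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as InfluxDB line protocol:
    measurement,tag=val field=val timestamp
    (escaping of spaces/commas in values is omitted for brevity)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))

    def fmt(v):
        if isinstance(v, bool):          # bool must be checked before int
            return "true" if v else "false"
        if isinstance(v, int):
            return f"{v}i"               # integer fields carry an 'i' suffix
        if isinstance(v, float):
            return repr(v)
        return f'"{v}"'                  # string fields are double-quoted

    field_part = ",".join(f"{k}={fmt(v)}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

line = to_line_protocol("cpu", {"host": "server01", "region": "us-west"},
                        {"usage_idle": 87.2, "core": 0}, 1465839830100400200)
# cpu,host=server01,region=us-west core=0i,usage_idle=87.2 1465839830100400200
```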
This document provides an overview of patterns for scalability, availability, and stability in distributed systems. It discusses general recommendations like immutability and referential transparency. It covers scalability trade-offs around performance vs scalability, latency vs throughput, and availability vs consistency. It then describes various patterns for scalability including managing state through partitioning, caching, sharding databases, and using distributed caching. It also covers patterns for managing behavior through event-driven architecture, compute grids, load balancing, and parallel computing. Availability patterns like fail-over, replication, and fault tolerance are discussed. The document provides examples of popular technologies that implement many of these patterns.
Aljoscha Krettek is the PMC chair of Apache Flink and Apache Beam, and co-founder of data Artisans. Apache Flink is an open-source platform for distributed stream and batch data processing. It allows for stateful computations over data streams in real-time and historically. Flink supports batch and stream processing using APIs like DataSet and DataStream. Data Artisans originated Flink and provides an application platform powered by Flink and Kubernetes for building stateful stream processing applications.
Building a Data Pipeline using Apache Airflow (on AWS / GCP), by Yohei Onishi
This is the slide deck I presented at PyCon SG 2019. I gave an overview of Airflow and discussed how Airflow and other data engineering services on AWS and GCP can be used to build data pipelines.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...), by Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of proactive and reactive approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
This document discusses Apache Airflow and Google Cloud Composer. It begins by providing background on Apache Airflow, including that it is an open source workflow engine contributed by Airbnb. It then discusses how Codementor uses Airflow for ETL pipelines and machine learning workflows. The document mainly focuses on comparing self-hosting Airflow versus using Google Cloud Composer. Cloud Composer reduces efforts around hosting, permissions management, and monitoring. However, it has some limitations like occasional zombie tasks and higher costs. Overall, Cloud Composer allows teams to focus more on data logic and performance versus infrastructure maintenance.
Snowflake: The Good, the Bad, and the Ugly, by Tyler Wishnoff
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
A Thorough Comparison of Delta Lake, Iceberg and Hudi, by Databricks
Recently, a set of modern table formats such as Delta Lake, Hudi, and Iceberg has sprung up. Along with the Hive Metastore, these table formats aim to solve long-standing problems in traditional data lakes with features like ACID transactions, schema evolution, upserts, time travel, and incremental consumption.
Presentation for the Pervasive Systems class, lectured by Prof. Ioannis Chatzigiannakis, a.y. 2015-16, about the NoSQL database InfluxDB. The course is intended for students of the MS in Engineering in Computer Science at Sapienza - University of Rome.
The complete code for the demo is available on GitHub:
https://github.com/RobGaud/PervasiveSystemsPersonal
You can also find me on LinkedIn:
https://www.linkedin.com/in/roberto-gaudenzi-4b0422116
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022, by HostedbyConfluent
For 40 years SQL has been the dominant language for data access and manipulation. Now that an increasing proportion of data is being processed in a streaming way, tool vendors (commercial and open source) have begun using SQL-like syntax in their event stream processing tools.
Over the last couple of years, several of these vendors - including AWS, Confluent, Google, IBM, Microsoft, Oracle, Snowflake and SQLstream - have got together with the Data Management group at INCITS (who maintain the SQL standard) to work on streaming extensions.
INCITS -- the InterNational Committee for Information Technology Standards -- is the central U.S. forum dedicated to creating technology standards for the next generation of innovation. INCITS is accredited by the American National Standards Institute (ANSI).
This talk will look at:
o Why is this happening?
o Who is involved?
o How does the process work?
o What progress has been made?
o When can we expect to see a standard?
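To make the streaming-SQL idea concrete: most of these proposals add window functions (such as a tumbling window) to ordinary GROUP BY aggregation. The following Python sketch mimics the semantics of tumbling-window grouping outside SQL; it illustrates the concept, not the syntax under standardization.

```python
from collections import defaultdict

def tumble(events, window_ms):
    """Group (timestamp_ms, key, value) events into fixed, non-overlapping
    windows and sum values per (window_start, key) -- the semantics a
    streaming GROUP BY over a tumbling window expresses in SQL."""
    out = defaultdict(int)
    for ts, key, value in events:
        window_start = (ts // window_ms) * window_ms  # align to window boundary
        out[(window_start, key)] += value
    return dict(out)

events = [(1000, "clicks", 1), (1500, "clicks", 1), (2200, "clicks", 1)]
tumble(events, window_ms=1000)
# {(1000, 'clicks'): 2, (2000, 'clicks'): 1}
```

The hard part the standardization effort wrestles with is not this arithmetic but when a window's result may be emitted over an unbounded stream, i.e. watermarks and late data.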
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
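The transformation/action distinction the talk covers can be illustrated without Spark itself: transformations only record lineage, and an action executes it. A toy sketch follows (the MiniRDD class is invented for illustration; real RDDs are partitioned and distributed):

```python
class MiniRDD:
    """Toy illustration of Spark's RDD evaluation model: transformations
    (map, filter) only record lineage; an action (collect, count) triggers
    the actual computation."""
    def __init__(self, data, lineage=()):
        self._data = data
        self._lineage = lineage          # recorded, not yet executed

    def map(self, f):                    # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("map", f),))

    def filter(self, pred):              # transformation: lazy
        return MiniRDD(self._data, self._lineage + (("filter", pred),))

    def collect(self):                   # action: runs the whole lineage
        items = self._data
        for op, f in self._lineage:
            items = map(f, items) if op == "map" else filter(f, items)
        return list(items)

    def count(self):                     # action
        return len(self.collect())

rdd = MiniRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
rdd.collect()   # nothing ran until this call
# [0, 4, 16, 36, 64]
```

Because only lineage is stored, a lost partition can be recomputed from its source, which is how Spark gets fault tolerance without replicating intermediate data.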
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ..., by Simplilearn
This presentation about Apache Spark covers all the basics a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what Spark is, and the difference between Hadoop and Spark. You will learn the different components in Spark and how Spark works, with the help of its architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what Apache Spark is.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark use case
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course has been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training is designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming, and Spark shell scripting
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
A microservices architecture involves many services distributed over the network, which introduces many more ways to fail. This session covers the available tools that can help you when designing and building such distributed systems in Go.
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingest and query data in InfluxDB
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy..., by Databricks
Building data products requires a Lambda architecture to bridge batch and streaming processing. AirStream is a framework built on top of Apache Spark that allows users to easily build data products at Airbnb. It proved that Spark is impactful and useful in production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
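Tunable consistency is largely arithmetic: with replication factor N, read consistency level R, and write consistency level W, a read is guaranteed to overlap the replicas of the latest write whenever R + W > N. A minimal sketch of that rule:

```python
def is_strongly_consistent(n, r, w):
    """With replication factor n, reading r replicas is guaranteed to
    intersect the w replicas of the latest write whenever r + w > n,
    so the read cannot miss the newest value."""
    return r + w > n

# Typical Cassandra-style settings with replication factor 3:
assert is_strongly_consistent(3, 2, 2)      # QUORUM reads + QUORUM writes
assert not is_strongly_consistent(3, 1, 1)  # ONE + ONE may read stale data
```

Lowering R or W trades this guarantee away for lower latency and higher availability, which is exactly the tuning knob the summary above refers to.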
Tutorial - Modern Real Time Streaming Architectures, by Karthik Ramasamy
Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights, and there has been a proliferation of messaging and streaming frameworks that enterprises utilize to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
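As one example of the one-pass algorithms listed above, Welford's method computes the mean and variance of a stream in constant memory, without ever storing the elements:

```python
class RunningStats:
    """Welford's one-pass algorithm: mean and variance over a stream in
    O(1) memory, updated once per arriving element."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0   # sum of squared deviations from the current mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # uses the updated mean

    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.push(x)
# stats.mean is 5.0 and stats.variance() is 4.0 for this stream
```

Unlike the naive sum-of-squares formula, this update is numerically stable, which matters when a stream runs for days and the counters grow large.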
Building Better Data Pipelines using Apache Airflow, by Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
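The scheduling behavior described above boils down to resolving task dependencies in a DAG before dispatching work. A small pure-Python sketch of that resolution (Kahn-style topological ordering; the function and task names are illustrative, not Airflow's API):

```python
def run_order(dag):
    """Return one valid execution order for a DAG given as
    {task: [upstream_tasks]} -- the dependency resolution an Airflow-style
    scheduler performs before handing tasks to workers."""
    pending = {t: set(ups) for t, ups in dag.items()}
    order = []
    while pending:
        # tasks whose upstreams have all completed are ready to run
        ready = sorted(t for t, ups in pending.items() if not ups)
        if not ready:
            raise ValueError("cycle detected: not a DAG")
        for t in ready:
            order.append(t)
            del pending[t]
        for ups in pending.values():
            ups.difference_update(ready)
    return order

# extract >> transform >> load, plus an audit task downstream of extract
etl = {"extract": [], "transform": ["extract"],
       "load": ["transform"], "audit": ["extract"]}
run_order(etl)
# ['extract', 'audit', 'transform', 'load']
```

In Airflow itself the same shape is declared with operators and `>>` dependencies in a Python DAG file; the scheduler then repeatedly computes which tasks are runnable and pushes them to workers, retrying failures per task rather than per pipeline.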
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Grafana is an open source analytics and monitoring tool that uses InfluxDB to store time series data and provide visualization dashboards. It collects metrics like application and server performance from Telegraf every 10 seconds, stores the data in InfluxDB using the line protocol format, and allows users to build dashboards in Grafana to monitor and get alerts on metrics. An example scenario is using it to collect and display load time metrics from a QA whitelist VM.
Applying DevOps to Databricks can be a daunting task. In this talk it will be broken down into bite-size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IaC (Infrastructure as Code), and build agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, we will focus on the Databricks REST API (using Python) to perform our tasks.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure, such as YARN or Mesos. This talk covers a basic introduction to Apache Spark and its various components like MLlib, Shark, and GraphX, with a few examples.
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine-generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
Building a Data Pipeline using Apache Airflow (on AWS / GCP)Yohei Onishi
This is the slide I presented at PyCon SG 2019. I talked about overview of Airflow and how we can use Airflow and the other data engineering services on AWS and GCP to build data pipelines.
Infrastructure at Scale: Apache Kafka, Twitter Storm & Elastic Search (ARC303...Amazon Web Services
"This is a technical architect's case study of how Loggly has employed the latest social-media-scale technologies as the backbone ingestion processing for our multi-tenant, geo-distributed, and real-time log management system. This presentation describes design details of how we built a second-generation system fully leveraging AWS services including Amazon Route 53 DNS with heartbeat and latency-based routing, multi-region VPCs, Elastic Load Balancing, Amazon Relational Database Service, and a number of pro-active and re-active approaches to scaling computational and indexing capacity.
The talk includes lessons learned in our first generation release, validated by thousands of customers; speed bumps and the mistakes we made along the way; various data models and architectures previously considered; and success at scale: speeds, feeds, and an unmeltable log processing engine."
This document discusses Apache Airflow and Google Cloud Composer. It begins by providing background on Apache Airflow, including that it is an open source workflow engine contributed by Airbnb. It then discusses how Codementor uses Airflow for ETL pipelines and machine learning workflows. The document mainly focuses on comparing self-hosting Airflow versus using Google Cloud Composer. Cloud Composer reduces efforts around hosting, permissions management, and monitoring. However, it has some limitations like occasional zombie tasks and higher costs. Overall, Cloud Composer allows teams to focus more on data logic and performance versus infrastructure maintenance.
Snowflake: The Good, the Bad, and the UglyTyler Wishnoff
Learn how to solve the top 3 challenges Snowflake customers face, and what you can do to ensure high-performance, intelligent analytics at any scale. Ideal for those currently using Snowflake and those considering it. Learn more at: https://kyligence.io/
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
Recently, a set of modern table formats such as Delta Lake, Hudi, Iceberg spring out. Along with Hive Metastore these table formats are trying to solve problems that stand in traditional data lake for a long time with their declared features like ACID, schema evolution, upsert, time travel, incremental consumption etc.
Presentation for Pervasive Systems class lectured by prof. Ioannis Chatzigiannakis, a.y. 2015-16, about the No-SQL database InfluxDB. The course is intended for students of MS in Engineering in Computer Science at Sapienza - University of Rome.
The complete code for the demo is available on Github:
https://github.com/RobGaud/PervasiveSystemsPersonal
You can also find me on LinkedIn:
https://www.linkedin.com/in/roberto-gaudenzi-4b0422116
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022HostedbyConfluent
SQL Extensions to Support Streaming Data With Fabian Hueske | Current 2022
For 40 years SQL has been the dominant language for data access and manipulation. Now that an increasing proportion of data is being processed in a streaming way, tool vendors (commercial and open source) have begun using SQL-like syntax in their event stream processing tools.
Over the last couple of years, several of these vendors - including AWS, Confluent, Google, IBM, Microsoft, Oracle, Snowflake and SQLstream - have got together with the Data Management group at INCITS (who maintain the SQL standard) to work on streaming extensions.
INCITS -- the InterNational Committee for Information Technology Standards -- is the central U.S. forum dedicated to creating technology standards for the next generation of innovation. INCITS is accredited by the American National Standards Institute (ANSI).
This talk will look at:
o Why is this happening?
o Who is involved?
o How does the process work?
o What progress has been made?
o When can we expect to see a standard?
This document provides an overview of a talk on Apache Spark. It introduces the speaker and their background. It acknowledges inspiration from a previous Spark training. It then outlines the structure of the talk, which will include: a brief history of big data; a tour of Spark including its advantages over MapReduce; and explanations of Spark concepts like RDDs, transformations, and actions. The document serves to introduce the topics that will be covered in the talk.
What Is Apache Spark? | Introduction To Apache Spark | Apache Spark Tutorial ...Simplilearn
This presentation about Apache Spark covers all the basics that a beginner needs to know to get started with Spark. It covers the history of Apache Spark, what is Spark, the difference between Hadoop and Spark. You will learn the different components in Spark, and how Spark works with the help of architecture. You will understand the different cluster managers on which Spark can run. Finally, you will see the various applications of Spark and a use case on Conviva. Now, let's get started with what is Apache Spark.
Below topics are explained in this Spark presentation:
1. History of Spark
2. What is Spark
3. Hadoop vs Spark
4. Components of Apache Spark
5. Spark architecture
6. Applications of Spark
7. Spark usecase
What is this Big Data Hadoop training course about?
The Big Data Hadoop and Spark developer course have been designed to impart an in-depth knowledge of Big Data processing using Hadoop and Spark. The course is packed with real-life projects and case studies to be executed in the CloudLab.
What are the course objectives?
Simplilearn’s Apache Spark and Scala certification training are designed to:
1. Advance your expertise in the Big Data Hadoop Ecosystem
2. Help you master essential Apache and Spark skills, such as Spark Streaming, Spark SQL, machine learning programming, GraphX programming and Shell Scripting Spark
3. Help you land a Hadoop developer job requiring Apache Spark expertise by giving you a real-life industry project coupled with 30 demos
What skills will you learn?
By completing this Apache Spark and Scala course you will be able to:
1. Understand the limitations of MapReduce and the role of Spark in overcoming these limitations
2. Understand the fundamentals of the Scala programming language and its features
3. Explain and master the process of installing Spark as a standalone cluster
4. Develop expertise in using Resilient Distributed Datasets (RDD) for creating applications in Spark
5. Master Structured Query Language (SQL) using SparkSQL
6. Gain a thorough understanding of Spark streaming features
7. Master and describe the features of Spark ML programming and GraphX programming
Who should take this Scala course?
1. Professionals aspiring for a career in the field of real-time big data analytics
2. Analytics professionals
3. Research professionals
4. IT developers and testers
5. Data scientists
6. BI and reporting professionals
7. Students who wish to gain a thorough understanding of Apache Spark
Learn more at https://www.simplilearn.com/big-data-and-analytics/apache-spark-scala-certification-training
Microservices architecture involves many services that are being distributed over the network resulting in many more ways of failure. This session will try to cover the available tools that can help you when designing/building such distributed system in Go
In this training webinar, we will walk you through the basics of InfluxDB – the purpose-built time series database. InfluxDB has everything you need from a time series platform in a single binary – a multi-tenanted time series database, UI and dashboarding tools, background processing and monitoring agent. This one-hour session will include the training and time for live Q&A.
What you will learn
Core concepts of time series databases
An overview of the InfluxDB platform
How to ingesting and query data in InfluxDB
Building Data Product Based on Apache Spark at Airbnb with Jingwei Lu and Liy...Databricks
Building data product requires having Lambda Architecture to bridge the batch and streaming processing. AirStream is a framework built on top of Apache Spark to allow users to easily build data products at Airbnb. It proved Spark is impactful and useful in the production for mission-critical data products.
On the streaming side, hear how AirStream integrates multiple ecosystems with Spark Streaming, such as HBase, Elasticsearch, MySQL, DynamoDB, Memcache and Redis. On the batch side, learn how to apply the same computation logic in Spark over large data sets from Hive and S3. The speakers will also go through a few production use cases, and share several best practices on how to manage Spark jobs in production.
Apache Cassandra is a free, distributed, open source, and highly scalable NoSQL database that is designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure, linear scalability, and tunable consistency. Cassandra's architecture allows it to spread data across a cluster of servers and replicate across multiple data centers for fault tolerance. It is used by many large companies for applications that require high performance, scalability, and availability.
Tutorial - Modern Real Time Streaming ArchitecturesKarthik Ramasamy
Across diverse segments in industry, there has been a shift in focus from big data to fast data, stemming, in part, from the deluge of high-velocity data streams as well as the need for instant data-driven insights, and there has been a proliferation of messaging and streaming frameworks that enterprises utilize to satisfy the needs of various applications.
Drawing on their experience operating streaming systems at Twitter scale, Karthik Ramasamy, Sanjeev Kulkarni, Arun Kejariwal, and Sijie Guo walk you through state-of-the-art streaming architectures, streaming frameworks, and streaming algorithms, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. They also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, they explore the interplay between storage and stream processing and speculate about future developments.
Topics include:
Basic requirements of stream processing
Streaming and one-pass algorithms
Different types of streaming architectures
An in-depth review of streaming frameworks
Deploying and operating stream processing applications
Lessons learned from building a real-time stack using Apache Pulsar and Apache Heron at Twitter Scale
Building Better Data Pipelines using Apache AirflowSid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
This document summarizes Netflix's use of Kafka in their data pipeline. It discusses how Netflix evolved from using S3 and EMR to introducing Kafka and Kafka producers and consumers to handle 400 billion events per day. It covers challenges of scaling Kafka clusters and tuning Kafka clients and brokers. Finally, it outlines Netflix's roadmap which includes contributing to open source projects like Kafka and testing failure resilience.
Grafana is an open source analytics and monitoring tool that uses InfluxDB to store time series data and provide visualization dashboards. It collects metrics like application and server performance from Telegraf every 10 seconds, stores the data in InfluxDB using the line protocol format, and allows users to build dashboards in Grafana to monitor and get alerts on metrics. An example scenario is using it to collect and display load time metrics from a QA whitelist VM.
Applying DevOps to Databricks can be a daunting task. In this talk this will be broken down into bite size chunks. Common DevOps subject areas will be covered, including CI/CD (Continuous Integration/Continuous Deployment), IAC (Infrastructure as Code) and Build Agents.
We will explore how to apply DevOps to Databricks (in Azure), primarily using Azure DevOps tooling. As a lot of Spark/Databricks users are Python users, will will focus on the Databricks Rest API (using Python) to perform our tasks.
Apache Spark is an in-memory data processing solution that can work with existing data sources like HDFS and can make use of your existing computation infrastructure like YARN or Mesos. This talk will cover a basic introduction to Apache Spark and its various components, like MLlib, Shark, and GraphX, with a few examples.
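The RDD programming model Spark is built on can be illustrated in plain Python (a sketch of the flatMap/reduceByKey shape, not PySpark itself; in Spark the same operations run partitioned across a cluster):

```python
# Pure-Python sketch of the RDD word-count pattern
# (flatMap -> map to (word, 1) -> reduceByKey). In PySpark the same
# shape runs distributed; here Counter plays the role of reduceByKey.
from collections import Counter

lines = ["to be or not to be", "to stream or to batch"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split()]
# reduceByKey: sum the per-word counts
counts = Counter(words)
print(counts["to"])  # 4
```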
This session is recommended for anyone interested in understanding how to use AWS big data services to develop real-time analytics applications. In this session, you will get an overview of a number of Amazon's big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, EMR and Redshift can be used for data ingestion, processing and storage to enable real-time insights and analysis into customer, operational and machine generated data and log files. We'll explore system requirements, design considerations, and walk through a specific customer use case to illustrate the power of real-time insights on their business.
This document discusses real-time big data applications and provides a reference architecture for search, discovery, and analytics. It describes combining analytical and operational workloads using a unified data model and operational database. Examples are given of organizations using this approach for real-time search, analytics and continuous adaptation of large and diverse datasets.
Many safety leading indicators fall short of helping you account for areas of risk within your organization. The best leading indicator of the health of your safety program, and predictor of safety risk, is safety culture. Since measuring safety culture is an intricate matter, we will discuss during this presentation an approach to measure safety culture through the Safety Culture Index (SCI). This presentation will elaborate on how we identify the aspects that measure the SCI in your organization, as well as how to help the safety professional understand, interpret, and influence SCI trends.
Virtual Gov Day - Introduction & Keynote - Alan Webber, IDC Government InsightsSplunk
The document outlines an agenda for a Virtual Gov Day event hosted by Splunk. The agenda includes a welcome and keynote presentation, customer use case presentations on security and business analytics, and concurrent breakout sessions on Splunk for security, IT operations, and application delivery. It also includes a presentation by an IDC analyst on challenges governments face with big data and how operational intelligence can help address issues around data management, timely decision-making, and use cases in security, IT operations, and industrial/IoT applications.
Streaming and Visual Data Discovery for the Internet of ThingsDatawatchCorporation
Sensor devices and their associated data streams are rapidly becoming a big source of differentiation for organizations that can effectively harness this information to drive new insights and take action. The breakthrough is enabled by new solutions for applying visual data discovery to streaming data in motion. This session will focus on industrial analytics and how best to apply new technologies that drive synergies between IT and OT.
Democratization - New Wave of Data Science (홍운표 상무, DataRobot) :: AWS Techfor...Amazon Web Services Korea
This document discusses the democratization of data science and machine learning using automated machine learning tools. It provides examples of how DataRobot has helped customers in various industries build predictive models faster and with less coding than traditional approaches. Specifically, it summarizes how DataRobot has helped customers in banking, insurance, retail, and other industries with use cases like predictive maintenance, sales forecasting, fraud detection, customer churn prediction, and insurance underwriting.
IoT: How Data Science Driven Software is Eating the Connected WorldDataWorks Summit
The document discusses how data science can be used to improve operations in the oil and gas industry through the Internet of Things. Large amounts of sensor data are generated during drilling operations that can be used to build predictive models to optimize drilling and predict equipment failures. Examples of opportunities include using models to predict drill rate of penetration to lower costs and failure prediction to allow for early warning and reduce downtime. The challenges of working with large sensor datasets and building accurate models at scale are also covered.
Smarter Analytics: Supporting the Enterprise with AutomationInside Analysis
The Briefing Room with Barry Devlin and WhereScape
Live Webcast on June 10, 2014
Watch the archive:
https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=5230c31ab287778c73b56002bc2c51a
The data warehouse is intended to support analysis by making the right data available to the right people in a timely fashion. But conditions change all the time, and when data doesn’t keep up with the business, analysts quickly turn to workarounds. This leads to ungoverned and largely un-managed side projects, which trade short-term wins for long-term trouble. One way to keep everyone happy is by creating an integrated environment that pulls data from all sources, and is capable of automating both the model development and delivery of analyst-ready data.
Register for this episode of The Briefing Room to hear data warehousing pioneer and Analyst Barry Devlin as he explains the critical components of a successful data warehouse environment, and how traditional approaches must be augmented to keep up with the times. He’ll be briefed by WhereScape CEO Michael Whitehead, who will showcase his company’s data warehousing automation solutions. He’ll discuss how a fast, well-managed and automated infrastructure is the key to empowering faster, smarter, repeatable decision making.
Visit InsideAnalysis.com for more information.
“Real Time Machine Learning Architecture and Sentiment Analysis Applied to Fi...Quantopian
1. Infotrie provides real-time machine learning architecture and sentiment analysis services for financial news and signals.
2. Their platform FinSentS processes millions of news sources in multiple languages through APIs or as SaaS/on-premises software. It provides real-time alerts, signals, and historical sentiment data.
3. Infotrie also offers consultancy and training in areas like trading technology, algorithmic trading, big data, natural language processing, and machine learning.
Internet Of Things: How Data Science Driven Software is Eating the Connected ...Sarah Aerni
Hadoop Summit Talk on IoT and Data Science:
The Internet of Things (IoT) will forever change the way businesses interact with consumers and other businesses. Pivotal will present a series of use cases illustrating how such devices and the data from these devices drive real impact across industries. From smart sensors to connected hospitals, each example will highlight the fundamental concepts to success.
* Starting with the basics: How data science drives action and outcomes
* Avoiding the obstacles: How to avoid the pitfalls that prevent models from driving real action
* Building your toolbox: What tools are available
Internet Of Things: How Data Science Driven Software is Eating the Connected ...VMware Tanzu
Hadoop Summit Talk on IoT and Data Science:
The Internet of Things (IoT) will forever change the way businesses interact with consumers and other businesses. Pivotal will present a series of use cases illustrating how such devices and the data from these devices drive real impact across industries. From smart sensors to connected hospitals, each example will highlight the fundamental concepts to success.
* Starting with the basics: How data science drives action and outcomes
* Avoiding the obstacles: How to avoid the pitfalls that prevent models from driving real action
* Building your toolbox: What tools are available
Creating an IT Revolution within your Organization - QuickBase, Inc. at CIO V...QuickBase, Inc.
CIO Visions presentation by QuickBase, Inc. highlighting the major market trends of Digital Transformation, Citizen Development, Unification of Business & IT, and the Broadening App Ecosystem. Ankit Shah, Senior Manager, Product Marketing
Technology trends are rapidly changing, and the public sector is set to experience a more disruptive future. Along with many other high priority items, such as repairing or replacing infrastructure, new housing developments to meet the needs of the growing population, and first responder initiatives, technology trends should be on your list of priorities in order to save money, secure data, and improve productivity. Here is a quick summary of 5 key public sector technology trends in 2018.
Microservices And Fast Data: Industry And Architecture Trends [with 451 Resea...Lightbend
This document discusses trends in microservices and fast data architectures. It summarizes Lightbend's Fast Data Platform, which provides streaming engines, pluggable machine learning libraries, a reactive platform, operational tooling, and infrastructure support to enable fast data applications. The platform aims to support real-time use cases like predictive analytics, personalization, financial processes, and industrial IoT by moving from batch to streaming data processing.
1) In-memory computing is growing rapidly, with the total data market expected to grow from $69 billion in 2015 to $132 billion in 2020.
2) In-memory databases are gaining popularity for applications that require fast response times, like telecommunications and mobile advertising, as memory access is faster than disk access.
3) Modern applications are driving adoption of in-memory solutions as they generate more data from more users and transactions and require faster performance to handle growing traffic.
4) Two examples presented were DellEMC using MemSQL for a real-time customer 360 application and an IoT logistics application called MemEx that processes sensor data from warehouses for predictive analytics.
This document provides an agenda for a presentation on harnessing the power of big data with Oracle. The agenda includes introductions to big data and market trends, defining big data, an overview of the Oracle Big Data Appliance, Oracle's integrated software solution, and a demonstration. The presentation aims to show how Oracle can help organizations access and analyze large, diverse datasets to drive innovation.
Benchmarking Digital Readiness: Moving at the Speed of the MarketApigee | Google Cloud
This document discusses how companies can benchmark their digital readiness and move faster in the digital market. It finds that digital leaders who adopt apps, APIs, and data analytics outperform digital laggards. To move up, companies need business and technology leadership. They should think strategically about customer experience, operations, data, and innovation to access new revenue channels beyond direct monetization. Technologically, companies should take a "cloud first" and "outside in" approach to deliver fast, differentiated customer experiences through systems of engagement built on APIs and backends.
This document summarizes a presentation by PwC on data and analytics in the digital age. PwC consists of data professionals who help clients leverage their data and manage risks. Recent projects include analyzing payroll, designing websites, and building systems to visualize customer orders. The presentation covers how digital transformation allows companies to use analytics to stay competitive. It also demonstrates a data visualization tool to support digital transformations.
Agile Data Science is a lean methodology that is adopted from Agile Software Development. At the core it centers around people, interactions, and building minimally viable products to ship fast and often to solicit customer feedback. In this presentation, I describe how this work was done in the past with examples. Get started today with our help by visiting http://www.alpinenow.com
Similar to Real Time Analytics: Algorithms and Systems (20)
In the wake of IoT becoming ubiquitous, there has been a large interest in the industry to develop novel techniques for anomaly detection at the Edge. Example applications include, but are not limited to, smart cities/grids of sensors, industrial process control in manufacturing, smart home, wearables, connected vehicles, and agriculture (sensing for soil moisture and nutrients). What makes anomaly detection at the Edge different? The following constraints, be it due to the sensors or the applications, necessitate the development of new algorithms for AD:
* Very low power and low compute/memory resources
* High data volume making centralized AD infeasible owing to the communication overhead
* Need for low latency to drive fast action taking
* Guaranteeing privacy
In this talk we shall throw light on the above in detail. Subsequently, we shall walk through the algorithm design process for anomaly detection at the Edge. Specifically, we shall dive into the need to build small models/ensembles owing to limited memory on the sensors. Further, we discuss how to train models in an online fashion, as long-term historical data is not available due to limited storage. Given the need for data compression to contain the communication overhead, can one carry out anomaly detection on compressed data? We shall throw light on building small models, sequential and one-shot learning algorithms, compressing the data with the models, and limiting the communication to only the data corresponding to the anomalies and the model description. We shall illustrate the above with concrete examples from the wild!
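A concrete example of an Edge-friendly detector under the constraints above is Welford's online algorithm: O(1) state per stream (count, mean, second moment), no sample buffer, single pass. A minimal sketch with a 3-sigma rule (the threshold and data are illustrative):

```python
# Low-memory online anomaly detection suited to the Edge: Welford's
# running mean/variance (constant state, no stored history) with a
# simple k-sigma rule.
import math

class OnlineDetector:
    def __init__(self, threshold=3.0):
        self.n, self.mean, self.m2, self.k = 0, 0.0, 0.0, threshold

    def update(self, x):
        """Return True if x is anomalous relative to the history so far."""
        anomalous = False
        if self.n > 1:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = std > 0 and abs(x - self.mean) > self.k * std
        # Welford's update: numerically stable, keeps no sample buffer
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous

d = OnlineDetector()
flags = [d.update(x) for x in [10, 11, 10, 12, 11, 10, 50]]
print(flags[-1])  # the spike at 50 is flagged
```

Because the state is three floats per stream, thousands of sensor channels fit comfortably in the memory budget of a constrained device.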
Serverless Streaming Architectures and Algorithms for the EnterpriseArun Kejariwal
In recent years, serverless has gained momentum in the realm of cloud computing. Broadly speaking, it comprises function as a service (FaaS) and backend as a service (BaaS). The distinction between the two is that under FaaS, one writes and maintains the code (e.g., the functions) for serverless compute; in contrast, under BaaS, the platform provides the functionality and manages the operational complexity behind it. Serverless provides a great means to boost development velocity. With greatly reduced infrastructure costs, more agile and focused teams, and faster time to market, enterprises are increasingly adopting serverless approaches to gain a key advantage over their competitors.
Example early use cases of serverless include, for example, data transformation in batch and ETL scenarios and data processing using MapReduce patterns. As a natural extension, serverless is being used in the streaming context such as, but not limited to, real-time bidding, fraud detection, intrusion detection. Serverless is, arguably, naturally suited to extracting insights from fast data, that is, high-volume, high-velocity data. Example tasks in this regard include filtering and reducing noise in the data and leveraging machine learning and deep learning models to provide continuous insights about business operations.
We walk the audience through the landscape of streaming systems for each stage of an end-to-end data processing pipeline—messaging, compute, and storage. We overview the inception and growth of the serverless paradigm. Further, we deep dive into Apache Pulsar, which provides native serverless support in the form of Pulsar functions, and paint a bird’s-eye view of the application domains where Pulsar functions can be leveraged.
Baking in intelligence in a serverless flow is paramount from a business perspective. To this end, we detail different serverless patterns—event processing, machine learning, and analytics—for different use cases and highlight the trade-offs. We present perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of serverless streaming architectures and algorithms. The topics covered include an introduction to streaming, an introduction to serverless, serverless and streaming requirements, Apache Pulsar, application domains, serverless event processing patterns, serverless machine learning patterns, and serverless analytics patterns.
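The shape of a serverless stream function described above can be sketched in plain Python (this is not the Pulsar Functions API; the event fields and the fraud rule are hypothetical):

```python
# Sketch (not the Pulsar API) of the serverless event-processing shape:
# a stateless function takes one event and returns a transformed event
# or None (filtered); the platform handles wiring, scaling, and retries.
def fraud_filter(event):
    """Drop obviously invalid events, enrich the rest."""
    if event["amount"] <= 0:
        return None                                   # filtered out
    return {**event, "flagged": event["amount"] > 10_000}

stream = [{"amount": 50}, {"amount": -1}, {"amount": 25_000}]
out = [e for e in (fraud_filter(ev) for ev in stream) if e is not None]
print([e["flagged"] for e in out])  # [False, True]
```

In Pulsar the equivalent function would be deployed against input/output topics, with the broker invoking it per message rather than the application looping over a list.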
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
In this talk we overview Sequence-2-Sequence (S2S) and explore its early use cases. We walk the audience through how to leverage S2S modeling for several use cases, particularly with regard to real-time anomaly detection and forecasting.
Sequence-to-Sequence Modeling for Time SeriesArun Kejariwal
This document provides an overview of time series forecasting using deep learning techniques. It discusses recurrent neural networks (RNNs) and their application to time series forecasting, including different RNN architectures like LSTMs and attention mechanisms. It also summarizes various approaches to training RNNs, such as backpropagation through time, and regularization techniques. Finally, it lists several examples of time series forecasting applications and provides references for further reading on the topic.
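The LSTM cells underlying the architectures above follow a small set of gate equations. A single scalar step in pure Python (weights and inputs are made up for illustration; real models use matrices and learned parameters):

```python
# One LSTM cell step with scalar state, to show the gate mechanics
# behind the RNN forecasters discussed above. Weights are arbitrary.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h, c, w):
    """One time step; w holds (input, hidden, bias) weights per gate."""
    f = sigmoid(w["f"][0] * x + w["f"][1] * h + w["f"][2])   # forget gate
    i = sigmoid(w["i"][0] * x + w["i"][1] * h + w["i"][2])   # input gate
    o = sigmoid(w["o"][0] * x + w["o"][1] * h + w["o"][2])   # output gate
    g = math.tanh(w["g"][0] * x + w["g"][1] * h + w["g"][2]) # candidate
    c = f * c + i * g        # cell state: keep some memory, admit some new
    h = o * math.tanh(c)     # hidden state, also the step's output
    return h, c

w = {k: (0.5, 0.5, 0.0) for k in "fiog"}
h = c = 0.0
for x in [0.1, 0.2, 0.3]:    # feed a short input sequence
    h, c = lstm_step(x, h, c, w)
print(round(h, 3))
```

In a sequence-to-sequence model, an encoder runs this recurrence over the input series and a decoder runs it again to emit the forecast horizon step by step.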
In this talk we walk through an architecture in which models are served in real time and updated, using Apache Pulsar, without restarting the application at hand. We then describe how to apply Pulsar functions to support two example uses—sampling and filtering—and explore a concrete case study of the same.
Designing Modern Streaming Data ApplicationsArun Kejariwal
Many industry segments have been grappling with fast data (high-volume, high-velocity data). The enterprises in these industry segments need to process this fast data just in time to derive insights and act upon it quickly. Such tasks include but are not limited to enriching data with additional information, filtering and reducing noisy data, enhancing machine learning models, providing continuous insights on business operations, and sharing these insights just in time with customers. In order to realize these results, an enterprise needs to build an end-to-end data processing system, from data acquisition, data ingestion, data processing, and model building to serving and sharing the results. This presents a significant challenge, due to the presence of multiple messaging frameworks and several streaming computing frameworks and storage frameworks for real-time data.
In this tutorial we lead a journey through the landscape of state-of-the-art systems for each stage of an end-to-end data processing pipeline, messaging frameworks, streaming computing frameworks, storage frameworks for real-time data, and more. We also share case studies from the IoT, gaming, and healthcare, as well as our experience operating these systems at internet scale at Twitter and Yahoo. We conclude by offering our perspectives on how advances in hardware technology and the emergence of new applications will impact the evolution of messaging systems, streaming systems, storage systems for streaming data, and reinforcement learning-based systems that will power fast processing and analysis of a large (potentially of the order of hundreds of millions) set of data streams.
Topics include:
* An introduction to streaming
* Common data processing patterns
* Different types of end-to-end stream processing architectures
* How to seamlessly move data across different frameworks
* Case studies: Healthcare and the IoT
* Data sketches for mining insights from data streams
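The data sketches mentioned in the topics above trade exactness for fixed memory. A minimal count-min sketch, one of the classic examples, approximating item frequencies over a stream (the width/depth values are illustrative):

```python
# A tiny count-min sketch: approximate frequency counts over a stream
# in fixed memory. Estimates never undercount; hash collisions can
# only inflate them.
import hashlib

class CountMin:
    def __init__(self, width=64, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, item):
        for row in range(self.depth):
            # a different salt per row gives independent-ish hash functions
            h = hashlib.blake2b(item.encode(), salt=bytes([row] * 8)).digest()
            yield row, int.from_bytes(h[:8], "big") % self.width

    def add(self, item):
        for row, col in self._buckets(item):
            self.table[row][col] += 1

    def estimate(self, item):
        # the minimum across rows is the least-inflated count
        return min(self.table[row][col] for row, col in self._buckets(item))

cm = CountMin()
for w in ["a", "b", "a", "c", "a"]:
    cm.add(w)
print(cm.estimate("a"))  # >= 3 (exact unless hashes collide)
```

The memory footprint is width × depth counters regardless of stream length, which is what makes sketches viable for mining insights from unbounded streams.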
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk, we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Overview of alternative measures, such as co-median
* Trade-offs between speed and accuracy
* Correlation analysis in large dimensions
In this talk we walk the audience through how to marry correlation analysis with anomaly detection, discuss how the topics are intertwined, and detail the challenges one may encounter based on production data. We also showcase how deep learning can be leveraged to learn nonlinear correlation, which in turn can be used to further contain the false positive rate of an anomaly detection system. Further, we provide an overview of how correlation can be leveraged for common representation learning.
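The marriage of per-stream anomaly detection with correlation analysis can be sketched as follows: flag z-score anomalies in one stream, then keep only the flags that a strongly correlated peer stream corroborates (the data and thresholds are made up for illustration; production systems use robust correlation measures, as the talk discusses):

```python
# Sketch: z-score anomalies per stream, gated by correlation with a
# peer stream to reduce false positives.
import math

def zscores(xs):
    n, m = len(xs), sum(xs) / len(xs)
    s = math.sqrt(sum((v - m) ** 2 for v in xs) / n)
    return [(v - m) / s for v in xs]

def pearson(xs, ys):
    return sum(a * b for a, b in zip(zscores(xs), zscores(ys))) / len(xs)

a = [1, 1, 2, 1, 9, 1, 2]          # spike at index 4
b = [1, 2, 1, 1, 8, 2, 1]          # correlated peer stream spikes too

flagged = [i for i, z in enumerate(zscores(a)) if abs(z) > 2]
if pearson(a, b) > 0.8:            # only trust a strongly correlated peer
    peer = {i for i, z in enumerate(zscores(b)) if abs(z) > 2}
    flagged = [i for i in flagged if i in peer]
print(flagged)  # [4]
```

A corroborated anomaly is far more likely to reflect a real upstream event than sensor noise, which is the mechanism by which correlation analysis contains the false positive rate.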
There has been a shift from big data to live streaming data to facilitate faster data-driven decision making. As the number of live data streams grow—partly a result of the expanding IoT—it is critical to develop techniques to better extract actionable insights.
One current application, anomaly detection, is a necessary but insufficient step, due to the fact that anomaly detection over a set of live data streams may result in an anomaly fatigue, limiting effective decision making. One way to address the above is to carry out anomaly detection in a multidimensional space. However, this is typically very expensive computationally and hence not suitable for live data streams. Another approach is to carry out anomaly detection on individual data streams and then leverage correlation analysis to minimize false positives, which in turn helps in surfacing actionable insights faster.
In this talk we explain how marrying correlation analysis with anomaly detection can help and share techniques to guide effective decision making.
Topics include:
* An overview of correlation analysis
* Robust correlation analysis
* Trade-offs between speed and accuracy
* Multi-modal correlation analysis
Detection and filtering of anomalies in live data is of paramount importance for robust decision making. To this end, in this talk we share techniques for anomaly detection in live data.
In this tutorial we walk through state-of-the-art streaming systems, algorithms, and deployment architectures, covering the typical challenges in modern real-time big data platforms and offering insights on how to address them. We also discuss how advances in technology might impact the streaming architectures and applications of the future. Along the way, we explore the interplay between storage and stream processing and discuss future developments.
Anomaly detection in real-time data streams using HeronArun Kejariwal
Twitter has become the de facto medium for consumption of news in real time, and billions of events are generated and analyzed on a daily basis. To analyze these events, Twitter designed its own next-generation streaming system, Heron. Arun Kejariwal and Karthik Ramasamy walk you through how Heron is used to detect anomalies in real-time data streams. Although there’s been over 75 years of prior work in anomaly detection, most of the techniques cannot be used off the shelf because they’re not suitable for high-velocity data streams. Arun and Karthik explain how to make trade-offs between accuracy and speed and discuss incremental approaches that marry sampling with robust measures such as median and MCD for anomaly detection.
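The robust measures mentioned above (median, and MCD in higher dimensions) resist exactly the contamination that breaks mean/standard-deviation detectors: a large spike inflates the mean and standard deviation enough to mask itself. A minimal median + MAD sketch (the data and the k threshold are illustrative):

```python
# Robust anomaly detection: median + MAD instead of mean + std dev,
# so the spike cannot mask its own detection. MCD generalizes this
# robustness idea to multidimensional data.
from statistics import median

def mad_anomalies(xs, k=3.0):
    med = median(xs)
    mad = median(abs(x - med) for x in xs)
    scale = 1.4826 * mad       # consistency factor vs. the std dev
    return [i for i, x in enumerate(xs) if abs(x - med) > k * scale]

data = [12, 11, 13, 12, 90, 12, 11, 13]
print(mad_anomalies(data))  # [4]
```

The speed/accuracy trade-off the talk describes comes in when these robust statistics must be maintained incrementally over high-velocity streams, e.g., via sampling, rather than recomputed over full windows.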
Data Data Everywhere: Not An Insight to Take Action UponArun Kejariwal
The big data era is characterized by ever-increasing velocity and volume of data. Over the last two or three years, several talks at Velocity have explored how to analyze operations data at scale, focusing on anomaly detection, performance analysis, and capacity planning, to name a few topics. Knowledge sharing of the techniques for the aforementioned problems helps the community to build highly available, performant, and resilient systems.
A key aspect of operations data is that data may be missing—referred to as “holes”—in the time series. This may happen for a wide variety of reasons, including (but not limited to):
* Packets being dropped due to unresponsive downstream services
* A network hiccup
* Transient hardware or software failure
* An issue with the data collection service
“Holes” in the time series can potentially skew the analysis of data. This in turn can materially impact decision making. Arun Kejariwal presents approaches for analyzing operations data in the presence of “holes” in the time series. He highlights how missing data impacts common data analyses such as anomaly detection and forecasting, discusses the implications of missing data on time series of different granularities, such as minutely and hourly, and explores a gamut of techniques that can be used to address the missing data issue (e.g., approximating the data using interpolation, regression, ensemble methods, etc.). Arun then walks you through how the techniques can be leveraged using real data.
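The simplest of the gap-filling techniques mentioned, linear interpolation across holes in a regular time series, can be sketched as follows (holes are represented as `None`; the series must contain at least one observed value):

```python
# Fill "holes" (None values) in a regularly spaced time series by
# linear interpolation between the nearest observed neighbors.
# Leading/trailing holes are filled by extending the nearest value.
def fill_holes(series):
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1                       # find end of the hole
            left = out[i - 1] if i > 0 else out[j]
            right = out[j] if j < len(out) else out[i - 1]
            gap = j - i + 1
            for k in range(i, j):
                out[k] = left + (right - left) * (k - i + 1) / gap
            i = j
        else:
            i += 1
    return out

print(fill_holes([10, None, None, 16, None, 18]))
```

Interpolation is cheap but assumes smoothness; as the abstract notes, regression and ensemble methods are the heavier alternatives when that assumption fails, e.g., across a hole spanning a seasonal peak.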
Finding bad apples early: Minimizing performance impactArun Kejariwal
The big data era is characterized by the ever-increasing velocity and volume of data. In order to store and analyze the ever-growing data, the operational footprint of data stores and Hadoop have also grown over time. (As per a recent report from IDC, the spending on big data infrastructure is expected to reach $41.5 billion by 2018.) The clusters comprise several thousands of nodes. The high performance of such clusters is vital for delivering the best user experience and productivity of teams.
The performance of such clusters is often limited by slow/bad nodes. Finding slow nodes in large clusters is akin to finding a needle in a haystack; hence, manual identification of slow/bad nodes is not practical. To this end, we developed a novel statistical technique to automatically detect slow/bad nodes in clusters comprising hundreds to thousands of nodes. We modeled the problem as a classification problem and employed a simple, yet very effective, distance measure to determine slow/bad nodes. The key highlights of the proposed technique are the following:
* Robustness against anomalies (note that anomalies may occur, for example, due to an ad-hoc heavyweight job on a Hadoop cluster)
* Given the varying data characteristics of different services, no one model fits all. Consequently, we parameterized the threshold used for classification
The proposed technique works well with both hourly and daily data, and has been in use in production by multiple services. This has not only eliminated manual investigation efforts, but has also mitigated the impact of slow nodes, which used to get detected after several weeks/months of lag!
We shall walk the audience through how the techniques are being used with REAL data.
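The classification idea described above, a simple distance measure with a parameterized threshold, can be sketched as scoring each node by its distance from the cluster median in robust (MAD) units (the node names, metric, and k value here are illustrative, not the production technique's exact parameters):

```python
# Sketch of slow-node detection: score each node by its one-sided
# distance from the cluster median in MAD units. The threshold k is a
# parameter, since no single value fits every service.
from statistics import median

def slow_nodes(latencies, k=5.0):
    """latencies: node -> median task latency. Returns flagged nodes."""
    vals = list(latencies.values())
    med = median(vals)
    mad = median(abs(v - med) for v in vals) or 1e-9  # avoid divide-by-zero
    return sorted(n for n, v in latencies.items() if (v - med) / mad > k)

cluster = {"n01": 100, "n02": 104, "n03": 98, "n04": 101, "n05": 390}
print(slow_nodes(cluster))  # ['n05']
```

Using the median and MAD rather than the mean gives the robustness against anomalies called out above: one ad-hoc heavyweight job cannot shift the baseline enough to hide a genuinely slow node.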
This document discusses stream processing and anomaly detection. It covers real-time analytics using streaming systems like Storm. Storm provides a framework for processing streaming data reliably and at scale. The document describes Storm's architecture and data model. It also discusses how Twitter uses Storm to process billions of messages daily. The document then covers anomaly detection in Storm systems, including identifying performance bottlenecks, anomalous nodes, and input traffic spikes in real-time. Statistical and correlation techniques are used to detect anomalies while minimizing false positives.
Statistical Learning Based Anomaly Detection @ TwitterArun Kejariwal
This document discusses Twitter's approach to statistical learning based anomaly detection. It begins with an overview of anomaly detection challenges at scale given Twitter's massive time series data. It then reviews traditional approaches and their limitations, particularly in dealing with seasonality. The document proposes addressing seasonality through time series decomposition before applying a robust statistical approach like ESD on the residual. It provides an example and discusses applications and production deployment at Twitter. In closing, it promotes joining Twitter's efforts in open sourcing their anomaly detection work.
Days In Green (DIG): Forecasting the life of a healthy serviceArun Kejariwal
This document describes Twitter's Days In Green (DIG) methodology for forecasting the lifespan of a healthy service before it exceeds a predefined capacity threshold. It involves collecting time series data on a service's key performance metric, detecting anomalies and breakouts, fitting an ARIMA model to capture trends and seasonality, and forecasting the number of days before the threshold is breached to determine capacity needs. The methodology has been deployed at Twitter to help plan capacity for hundreds of services and detect those nearing disaster recovery thresholds.
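The core projection step can be sketched with a plain linear trend fit (a simplification: as noted above, the deployed methodology uses ARIMA with anomaly and breakout handling; the utilization numbers are made up):

```python
# Sketch of the "days in green" projection: fit a linear trend to
# daily utilization and compute the days until a capacity threshold
# is crossed. (Production uses ARIMA to capture trend + seasonality.)
def days_until_threshold(daily, threshold):
    n = len(daily)
    xs = range(n)
    mx, my = (n - 1) / 2, sum(daily) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, daily)) / \
            sum((x - mx) ** 2 for x in xs)
    intercept = my - slope * mx
    if slope <= 0:
        return None                   # not trending toward the threshold
    crossing = (threshold - intercept) / slope
    return max(0, round(crossing) - (n - 1))  # days from "today"

usage = [40, 42, 41, 44, 46, 45, 48]  # % utilization per day
print(days_until_threshold(usage, 80))  # 26 days of headroom
```

Services whose projected headroom drops below a planning horizon are the ones flagged for capacity action, which is how the forecast becomes a leading indicator rather than a post-mortem.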
Gimme More! Supporting User Growth in a Performant and Efficient FashionArun Kejariwal
This document discusses capacity planning approaches for supporting user growth at Twitter. It describes the need to plan capacity proactively through forecasting to ensure good user experience without overprovisioning resources. The document evaluates several forecasting models like linear regression, splines, Holt-Winters, and ARIMA and their suitability for Twitter's data based on characteristics like outliers, seasonality, and boundary conditions. It emphasizes that accurate forecasting requires continuous refinement of models as the data stream evolves over time.
This document summarizes Twitter's approach to capacity planning for large events like the Super Bowl. It discusses using historical traffic patterns to predict capacity needs, analyzing key metrics like tweets per second, and planning for potential traffic spikes through statistical analysis and scenario modeling. For Super Bowl 2013, Twitter's models predicted a traffic spike could push tweets per second into the 20,000+ range, higher than previous years, and the company was able to maintain high availability during the game despite the brief blackout.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag...sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Open Source Contributions to Postgres: The Basics POSETTE 2024ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
End-to-end pipeline agility - Berlin Buzzwords 2024Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change," the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
Orchestrating the Future: Navigating Today's Data Workflow Challenges with Ai...Kaxil Naik
Navigating today's data landscape isn't just about managing workflows; it's about strategically propelling your business forward. Apache Airflow has stood out as the benchmark in this arena, driving data orchestration forward since its early days. As we dive into the complexities of our current data-rich environment, where the sheer volume of information and its timely, accurate processing are crucial for AI and ML applications, the role of Airflow has never been more critical.
In my journey as the Senior Engineering Director and a pivotal member of Apache Airflow's Project Management Committee (PMC), I've witnessed Airflow transform data handling, making agility and insight the norm in an ever-evolving digital space. At Astronomer, our collaboration with leading AI & ML teams worldwide has not only tested but also proven Airflow's mettle in delivering data reliably and efficiently—data that now powers not just insights but core business functions.
This session is a deep dive into the essence of Airflow's success. We'll trace its evolution from a budding project to the backbone of data orchestration it is today, constantly adapting to meet the next wave of data challenges, including those brought on by Generative AI. It's this forward-thinking adaptability that keeps Airflow at the forefront of innovation, ready for whatever comes next.
The ever-growing demands of AI and ML applications have ushered in an era where sophisticated data management isn't a luxury—it's a necessity. Airflow's innate flexibility and scalability are what makes it indispensable in managing the intricate workflows of today, especially those involving Large Language Models (LLMs).
This talk isn't just a rundown of Airflow's features; it's about harnessing these capabilities to turn your data workflows into a strategic asset. Together, we'll explore how Airflow remains at the cutting edge of data orchestration, ensuring your organization is not just keeping pace but setting the pace in a data-driven future.
Session in https://budapestdata.hu/2024/04/kaxil-naik-astronomer-io/ | https://dataml24.sessionize.com/session/667627
Beyond the Basics of A/B Tests: Highly Innovative Experimentation Tactics You...Aggregage
This webinar will explore cutting-edge, less familiar but powerful experimentation methodologies which address well-known limitations of standard A/B Testing. Designed for data and product leaders, this session aims to inspire the embrace of innovative approaches and provide insights into the frontiers of experimentation!
6. 6
Challenge: Surfacing Relevant Content
Explosive Content Creation
Large variety of media: blogs, reviews, news articles, streaming content
> 500M Tweets every day
> 300 hrs of video uploaded every minute
> 1.8B photos uploaded online in 2014
[1] http://www.kpcb.com/blog/2014-internet-trends
7. 7
High Volume
Content Consumption
WhatsApp messages per day: >30B [1]
Pandora listener hours (Q2 2015): 5.3B [3]
Skype calls per month: 4.76B
E-mails per second: >2.2M
Google searches per year: >1T [2]
Netflix hours streamed per month: >1B
[1] https://www.facebook.com/jan.koum/posts/10152994719980011?pnref=story
[2] http://searchengineland.com/google-1-trillion-searches-per-year-212940
[3] http://press.pandora.com/phoenix.zhtml?c=251764&p=irol-newsArticle&ID=2070623
8. 8
A New World
Mobile, Mobile, Mobile
Anywhere, Anytime, Any Device
5.4B mobile phone users [1]
2.1B smartphone subscriptions in 2014 [1]
69% Y/Y growth in data traffic
55% mobile video traffic
34% of global e-commerce [2]
AVAILABILITY, PERFORMANCE, RELIABILITY
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.criteo.com/media/1894/criteo-state-of-mobile-commerce-q1-2015-ppt.pdf
9. 9
Market pulse
Finance/Investing
One-minute bids and offers, March 8, 2011 [1]
Mobile trading on the rise [2]:
NSE: 48% increase in turnover, Jan'14 -> Dec'14
BSE: 0.25% (Jan'14) -> 0.5% (Nov'14) of total volume
[1] Image borrowed from http://www.bloomberg.com/bw/articles/2013-06-06/how-the-robots-lost-high-frequency-tradings-rise-and-fall
[2] http://articles.economictimes.indiatimes.com/2014-12-26/news/57420480_1_ravi-varanasi-mobile-platform-nse
10. 10
Entertainment: MMOs
Game of War
Largest single-world concurrent mobile game in the world
"Real-time Many-to-Many is Tomorrow's Internet" - Francois Orsini
Global scale
Collaborative: make alliances
Real-time messaging
Chat translation in multiple languages
11. 11
On the rise
Cybersecurity
2014 breaches: Michaels Jan'14, PF Changs June'14, New York July'14, UPS Aug'14, Home Depot Sept'14, JP Morgan Oct'14, Sony Nov'14, Staples Dec'14
2015: OPM, Anthem, UCLA
$400B: estimated annual economic impact of cybercrime [1]
[1] http://www.mcafee.com/us/resources/reports/rp-economic-impact-cybercrime2.pdf
12. 12
Supporting higher volume and speed
Hardware Innovations
Massively parallel: Intel's "Knights Landing" Xeon Phi - 72 cores [1]
High speed, low power
Intel and Micron's 3D XPoint Technology [2]: 1000x faster than NAND
"... quickly identify fraud detection patterns in financial transactions; healthcare researchers could process and analyze larger data sets in real time, accelerating complex tasks such as genetic analysis and disease tracking." [3]
[1] http://www.anandtech.com/show/9436/quick-note-intel-knights-landing-xeon-phi-omnipath-100-isc-2015
[2] Intel IDS'15
[3] http://newsroom.intel.com/community/intel_newsroom/blog/2015/07/28/intel-and-micron-produce-breakthrough-memory-technology
13. 13
Hardware support for apps
Hardware Innovations
Image and touch processing support in Intel's Skylake [1]
[1] Images borrowed from Julius Madelblat's and Andy Vargas, Rajeev Nalawadi and Shane Abreu's Technology Insight at IDF'15.
15. 15
Real time
User Experience, Productivity
Real-time video streams: news
Drones - delivery/monitoring: $1.7B for 2015 [1]
Robotics [2] - industry: $40B by 2020 [3]
[1] http://www.kpcb.com/blog/2015-internet-trends
[2] http://www.bostondynamics.com/robot_Atlas.html
[3] http://www.marketsandmarkets.com/Market-Reports/Industrial-Robotics-Market-643.html
16. 16
Internet of Things
Large Market Potential
$1.9T in value by 2020 - Mfg (15%), Health Care (15%), Insurance (11%)
26B - 75B units [2, 3, 4, 5]
Improve operational efficiencies, customer experience, new business models
Beacons: retailers and bank branches - 60M units market by 2019 [6]
Smart buildings: reduce energy costs, cut maintenance costs, increase safety & security
[1] Background image taken from https://www.uspsoig.gov/sites/default/files/document-library-files/2015/rarc-wp-15-013.pdf
[2] http://www.gartner.com/newsroom/id/2636073
[3] https://www.abiresearch.com/press/more-than-30-billion-devices-will-wirelessly-conne
[4] http://newsroom.cisco.com/feature-content?type=webcontent&articleId=1208342
[5] http://www.businessinsider.com/75-billion-devices-will-be-connected-to-the-internet-by-2020-2013-10
[6] https://www.abiresearch.com/press/ibeaconble-beacon-shipments-to-break-60-million-by/
17. 17
The Future
Biostamps [2]
Mobile: exponential growth [1]
Sensor networks
[1] http://opensignal.com/assets/pdf/reports/2015_08_fragmentation_report.pdf
[2] http://www.ericsson.com/thinkingahead/networked_society/stories/#/film/mc10-biostamp
18. 18
Continuous Monitoring
Intelligent Health Care
Tracking movements: measure the effect of social influences
Google Lens: measure glucose level in tears
Watch/Wristband
Smart textiles: skin temperature, perspiration
Ingestible sensors: medication compliance, heart function [1]
[1] http://www.hhnmag.com/Magazine/2015/Apr/cover-medical-technology
19. 19
Connected World
Internet of Things: 30B connected devices by 2020
Health Care: 153 Exabytes (2013) -> 2314 Exabytes (2020)
Machine Data: 40% of the digital universe by 2020
Connected Vehicles: data transferred per vehicle per month, 4 MB -> 5 GB
Digital Assistants (Predictive Analytics): $2B (2012) -> $6.5B (2019) [1]; Siri/Cortana/Google Now
Augmented/Virtual Reality: $150B by 2020 [2]; Oculus/HoloLens/Magic Leap
[1] http://www.siemens.com/innovation/en/home/pictures-of-the-future/digitalization-and-software/digital-assistants-trends.html
[2] http://techcrunch.com/2015/04/06/augmented-and-virtual-reality-to-hit-150-billion-by-2020/#.7q0heh:oABw
21. 21
What is Analytics?
According to Wikipedia:
DISCOVERY - the ability to identify patterns in data
COMMUNICATION - provide insights in a meaningful way
22. 22
Types of Analytics
CUBE ANALYTICS: Business Intelligence
PREDICTIVE ANALYTICS: Statistics and Machine Learning
23. 23
What is Real-Time Analytics?
It's contextual
BATCH: high throughput, > 1 hour - monthly active users, relevance for ads, ad hoc queries
NEAR REAL TIME: approximate, > 1 sec - ad impressions count, hashtag trends
REAL TIME (online non-transactional): latency sensitive, < 500 ms - fanout Tweets, search for Tweets, deterministic workflows
REAL TIME (online transactional): low latency, < 1 ms - financial trading
28. 28
It's different
Key Characteristics
FAULT TOLERANCE [1] - availability
SCALE OUT - high performance
ROBUST - incomplete data
[1] Byzantine failures are described in the following journal paper: Driscoll, Kevin; Hall, Brendan; Sivencrona, Håkan; Zumsteg, Phil (2003). "Byzantine Fault Tolerance, from Theory to Reality". LNCS 2788, pp. 235-248.
33. 33
Sampling
Obtain a representative sample from a data stream
Maintain a dynamic sample: a data stream is a continuous process, and it is not known in advance how many points may elapse before an analyst may need a representative sample
Reservoir sampling [1]
Probabilistic insertions and deletions on arrival of new stream points
The probability of successive insertion of new points reduces with the progression of the stream
An unbiased sample contains a larger and larger fraction of points from the distant history of the stream
Practical perspective: the data stream may evolve, and hence the majority of the points in the sample may represent stale history
[1] J. S. Vitter. Random Sampling with a Reservoir. ACM Transactions on Mathematical Software, Vol. 11(1):37-57, March 1985.
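The classic reservoir scheme above fits in a few lines. This is a minimal illustrative sketch (the function name and seed are mine, not from the tutorial): the first k points fill the reservoir, after which the ith arriving point is kept with probability k/i, replacing a random resident.

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Vitter-style reservoir sampling: a uniform random sample of k
    items from a stream of unknown length, in a single pass."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(i + 1)     # insertion probability k/(i+1)
            if j < k:                    # shrinks as the stream progresses
                sample[j] = item
    return sample

print(reservoir_sample(range(10_000), 10))  # 10 items, uniform over the stream
```

Note how the shrinking insertion probability matches the slide: each resident point, old or new, ends up in the final sample with equal probability.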
34. 34
Sampling
Sliding window approach (sample size k, window width n) [1]
Sequence-based
  Replace the expired element with the newly arrived element; disadvantage: highly periodic
  Chain-sample approach: select the ith element with probability Min(i,n)/n; select uniformly at random an index from [i+1, i+n] for the element which will replace the ith item; maintain k independent chain samples
Timestamp-based
  # elements in a moving window may vary over time
  Priority-sample approach
[1] B. Babcock. Sampling From a Moving Window Over Streaming Data. In Proceedings of SODA, 2002.
(Illustration: a window sliding over the stream 3 5 1 4 6 2 8 5 2 3 5 4 2 2 5 0 9 8 4 6 7 3.)
35. 35
Sampling
Biased Reservoir Sampling [1]
Use a temporal bias function - recent points have a higher probability of being represented in the sample reservoir
Memory-less bias functions: the future probability of retaining a current point in the reservoir is independent of its past history or arrival time
The probability of the rth point belonging to the reservoir at time t is proportional to the bias function
Exponential bias function for the rth data point at time t: f(r, t) = e^(-λ(t-r)), where r ≤ t and λ ∈ [0, 1] is the bias rate
The maximum reservoir requirement R(t) is bounded
[1] C. C. Aggarwal. On Biased Reservoir Sampling in the presence of Stream Evolution. In Proceedings of VLDB, 2006.
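One way to realize a memory-less exponential bias is the replacement scheme from the cited paper, sketched below under my reading of it (names and seed are illustrative, not from the tutorial): with reservoir capacity n = 1/λ, every arriving point is inserted; with probability equal to the current fill fraction it replaces a random resident, otherwise it is appended.

```python
import random

def biased_reservoir(stream, capacity, seed=7):
    """Sketch of exponentially biased reservoir sampling: capacity
    n = 1/lambda; recent points dominate the reservoir because every
    arrival is inserted and victims are chosen uniformly at random."""
    rng = random.Random(seed)
    reservoir = []
    for item in stream:
        fill = len(reservoir) / capacity
        if reservoir and rng.random() < fill:
            # replace a random resident; old points decay geometrically
            reservoir[rng.randrange(len(reservoir))] = item
        else:
            reservoir.append(item)
    return reservoir

res = biased_reservoir(range(100_000), capacity=100)
print(len(res), sum(1 for x in res if x >= 99_000))
```

Each slot is overwritten with probability 1/n per step once full, so a point of age Δ survives with probability about e^(-Δ/n): exactly the exponential bias on the slide, with the reservoir size bounded by construction.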
36. 36
Sampling
General problem
Input: tuples of n components; a subset are key components - the basis for sampling
Sample of size a/b:
  Hash the key to b buckets
  Accept a tuple if its hash value < a
Space constraint: a <- a - 1; remove tuples whose keys hash to a
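The bucket-hash scheme above can be sketched as follows (class and helper names are hypothetical). Because acceptance depends only on the key's bucket, all tuples sharing a key are kept or dropped together, and shrinking the sample evicts exactly one bucket at a time.

```python
import hashlib

def bucket_of(key, b):
    """Deterministically map a key to one of b buckets."""
    digest = hashlib.sha1(str(key).encode()).hexdigest()
    return int(digest, 16) % b

class KeySample:
    """Keep an a/b sample of a tuple stream, sampled by key component:
    accept a tuple iff its key hashes below threshold a; under space
    pressure, decrement a and evict the newly rejected bucket."""
    def __init__(self, a, b, key=lambda t: t[0]):
        self.a, self.b, self.key = a, b, key
        self.kept = []
    def offer(self, tup):
        if bucket_of(self.key(tup), self.b) < self.a:
            self.kept.append(tup)
    def shrink(self):
        # a <- a - 1; drop tuples whose keys hash to the removed bucket
        self.a -= 1
        self.kept = [t for t in self.kept
                     if bucket_of(self.key(t), self.b) < self.a]

sample = KeySample(a=3, b=10)          # a 3/10 sample keyed on t[0]
for user in range(100):
    for q in range(3):
        sample.offer((user, q))        # e.g., (user_id, query_id) tuples
```

For a sampled user, all three of their tuples are retained, which is what makes key-based sampling useful for per-key statistics.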
37. 37
Set Membership
Filtering
Determine, with some false probability, if an item in a data stream has been seen before
Databases (e.g., speed up semi-join operations), caches, routers, storage systems
  Reduce space requirements in probabilistic routing tables
  Speed up longest-prefix matching of IP addresses
  Encode multicast forwarding information in packets
  Summarize content to aid collaborations in overlay and peer-to-peer networks
  Improve network state management and monitoring
38. 38
Set Membership
Filtering
Application to hyphenation programs; early UNIX spell checkers
[1] Illustration borrowed from http://www.eecs.harvard.edu/~michaelm/postscripts/im2005b.pdf
39. 39
Set Membership
Filtering
The Bloom filter: a natural generalization of hashing
False positives are possible; no false negatives; no deletions allowed
For false positive rate ε, # hash functions = log2(1/ε)
where n = # elements, k = # hash functions, m = # bits in the array
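A minimal Bloom filter makes these parameters concrete. In this sketch (class name and sizing helper are mine), m is derived from the standard formula m = -n·ln(ε)/(ln 2)², and k = ln 2·(m/n) hash functions are simulated by salting one cryptographic hash.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: m bits, k salted hash functions.
    False positives possible; no false negatives; no deletions."""
    def __init__(self, n_items, fp_rate):
        self.m = math.ceil(-n_items * math.log(fp_rate) / math.log(2) ** 2)
        self.k = max(1, round(math.log(2) * self.m / n_items))
        self.bits = bytearray((self.m + 7) // 8)
    def _positions(self, item):
        for i in range(self.k):          # k hashes via salting
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)
    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))

bf = BloomFilter(n_items=1000, fp_rate=0.01)
for w in ("storm", "heron", "flink"):
    bf.add(w)
print("storm" in bf)  # True; unseen items are false with high probability
```

Membership queries on inserted items always return True, while a query on an unseen item is wrong only with probability about ε.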
40. 40
Set Membership
Filtering
Minimizing the false positive rate ε w.r.t. k [1]:
  k = ln 2 * (m/n)
  ε = (1/2)^k ≈ (0.6185)^(m/n)
1.44 * log2(1/ε) bits per item, independent of item size or # items
Information-theoretic minimum: log2(1/ε) bits per item, so 44% overhead
X = # 0 bits
[1] A. Broder and M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Internet Mathematics Vol. 1, No. 4, 2005.
41. 41
Set Membership
Filtering
Derivatives:
  Counting Bloom filters: support deletion; each bit becomes a small counter (typically 4 bits per counter suffice) with increment/decrement
  Blocked Bloom filters
  d-left Counting Bloom filters
  Quotient filters
  Rank-Indexed Hashing
42. 42
Set Membership
Filtering
Cuckoo Filter [1]
Key highlights:
  Add and remove items dynamically
  For false positive rate ε < 3%, more space efficient than a Bloom filter
  Higher performance than a Bloom filter for many real workloads
  Asymptotically worse performance than a Bloom filter: minimum fingerprint size ∝ log(# entries in the table)
Overview:
  Stores only a fingerprint of each inserted item; the original key and value bits of an item are not retrievable
  Set membership query for item x: search the hash table for the fingerprint of x
[1] Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014.
43. 43
Set Membership
Filtering
Cuckoo Hashing [1]
  High space occupancy
  Practical implementations: multiple items per bucket
  Example uses: software-based Ethernet switches
Cuckoo Filter
  Uses a multi-way associative Cuckoo hash table
  Employs partial-key cuckoo hashing: relocate existing fingerprints to their alternative locations [2]
[1] R. Pagh and F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122-144, 2004.
[2] Illustration borrowed from "Fan et al., Cuckoo Filter: Practically Better Than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, 2014."
44. 44
Set Membership
Filtering
Cuckoo Filter
Partial-key cuckoo hashing
  Fingerprint hashing ensures uniform distribution of items in the table
  Length of fingerprint << size of h1 or h2
  Possible to have multiple entries of a fingerprint in a bucket
Deletion: the item must have been previously inserted
Comparison
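Partial-key cuckoo hashing can be illustrated with a toy filter (all names and parameters below are mine; real implementations pack fingerprints into raw bits). The key trick from the paper is that the alternate bucket is i2 = i1 XOR hash(fingerprint), so an entry can be relocated, or deleted, without ever consulting the original item.

```python
import hashlib
import random

def _h(x):
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

class CuckooFilter:
    """Toy partial-key cuckoo filter: stores only fingerprints; the
    alternate bucket is derived from bucket index and fingerprint alone."""
    def __init__(self, n_buckets=1024, bucket_size=4, fp_bits=12, max_kicks=500):
        assert n_buckets & (n_buckets - 1) == 0   # power of two for XOR trick
        self.bs, self.kicks = bucket_size, max_kicks
        self.mask, self.fp_mask = n_buckets - 1, (1 << fp_bits) - 1
        self.buckets = [[] for _ in range(n_buckets)]
        self.rng = random.Random(1)
    def _fp_index(self, item):
        fp = (_h(("fp", item)) & self.fp_mask) or 1   # nonzero fingerprint
        return fp, _h(item) & self.mask
    def _alt(self, i, fp):
        return (i ^ _h(fp)) & self.mask               # involutive: alt(alt(i)) == i
    def add(self, item):
        fp, i1 = self._fp_index(item)
        for i in (i1, self._alt(i1, fp)):
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        i = self.rng.choice((i1, self._alt(i1, fp)))
        for _ in range(self.kicks):                   # relocate existing fingerprints
            j = self.rng.randrange(self.bs)
            fp, self.buckets[i][j] = self.buckets[i][j], fp
            i = self._alt(i, fp)
            if len(self.buckets[i]) < self.bs:
                self.buckets[i].append(fp)
                return True
        return False                                  # table considered full
    def __contains__(self, item):
        fp, i1 = self._fp_index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt(i1, fp)]
    def remove(self, item):
        fp, i1 = self._fp_index(item)
        for i in (i1, self._alt(i1, fp)):
            if fp in self.buckets[i]:
                self.buckets[i].remove(fp)
                return True
        return False
```

Deletion simply removes one copy of the fingerprint from either candidate bucket, which is why the slide insists the item must actually have been inserted first.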
45. 45
Estimating Cardinality
# Distinct Elements
Large set of real-world applications
  Database systems/search engines: # distinct queries
  Network monitoring applications
  Natural language processing
  # distinct motifs in a DNA sequence
  # distinct elements in RFID/sensor networks
46. 46
Estimating Cardinality
# Distinct Elements
Historical context (N ≤ 10^9):
  Probabilistic counting [Flajolet and Martin, 1983]
  LogLog counting [Durand and Flajolet, 2003]
  HyperLogLog [Flajolet et al., 2007]
  Sliding HyperLogLog [Chabchoub and Hebrail, 2010]
  HyperLogLog in Practice [Heule et al., 2013]
  Self-Organizing Bitmap [Chen and Cao, 2009]
  Discrete Max-Count [Ting, 2014]: the sequence of sketches forms a Markov chain when h is a strong universal hash; estimate cardinality using a martingale
47. 47
Estimating Cardinality
# Distinct Elements
HyperLogLog
Apply a hash function h to every element in a multiset
The cardinality of the multiset is estimated as 2^max(ϱ), where 0^(ϱ-1)1 is the bit pattern observed at the beginning of a hash value
The above suffers from high variance; employ stochastic averaging:
Partition the input stream into m sub-streams S_i using the first p bits of the hash values (m = 2^p)
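The two ideas above, leading-zero ranks plus stochastic averaging over m = 2^p registers, fit in a short sketch. This is a bare-bones illustration (function name is mine; the small- and large-range corrections of the full algorithm are deliberately omitted), combining registers with the harmonic mean and the usual α_m constant.

```python
import hashlib

def hll_estimate(stream, p=12):
    """Minimal HyperLogLog: route each item to one of m = 2**p registers
    by the first p hash bits; each register keeps the max rank (position
    of the leftmost 1-bit) seen in the remaining bits."""
    m = 1 << p
    registers = [0] * m
    for item in stream:
        x = int.from_bytes(
            hashlib.sha256(str(item).encode()).digest()[:8], "big")
        idx = x >> (64 - p)                     # first p bits pick the register
        w = x & ((1 << (64 - p)) - 1)
        rank = (64 - p) - w.bit_length() + 1    # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)            # bias-correction constant
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(hll_estimate(range(100_000)))  # close to 100,000
```

With p = 12 the sketch uses 4096 registers and the relative error is around 1.04/sqrt(m), roughly 1.6%, regardless of the true cardinality.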
48. 48
Estimating Cardinality
# Distinct Elements
HyperLogLog in Practice: Optimizations
Use of a 64-bit hash function: total memory requirement 5 * 2^p -> 6 * 2^p, where p is the precision
Empirical bias correction: uses empirically determined data for cardinalities smaller than 5m, and the unmodified raw estimate otherwise
Sparse representation: for n ≪ m, store an integer obtained by concatenating the bit patterns for idx and ϱ(w)
  Use variable-length encoding for integers (a variable number of bytes per integer)
  Use difference encoding - store the difference between successive elements
Other optimizations [1, 2]
[1] http://druid.io/blog/2014/02/18/hyperloglog-optimizations-for-real-world-systems.html
[2] http://antirez.com/news/75
49. 49
Estimating Cardinality
# Distinct Elements
Self-Learning Bitmap (S-bitmap) [1]
Achieves constant relative estimation error for unknown cardinalities over a wide range, say from 10s to > 10^6
Bitmap obtained via an adaptive sampling process
  Bits corresponding to sampled items are set to 1
  Sampling rates are learned from the # distinct items already passed and reduced sequentially as more bits are set to 1
For given input parameters Nmax and estimation precision ε, the size of the bit mask m is chosen; with r = 1 - 2ε²(1+ε²)^(-1) and sampling probability p_k = m(m+1-k)^(-1)(1+ε²)r^k, where k ∈ [1, m], the relative error ≈ ε
[1] Chen et al. "Distinct counting with a self-learning bitmap". Journal of the American Statistical Association, 106(495):879-890, 2011.
50. 50
Estimating Quantiles
Quantiles, Histograms, Icebergs
Large set of real-world applications: database applications, sensor networks, operations
Properties:
  Provide tunable and explicit guarantees on the precision of approximation
  Single pass
Early work:
  [Greenwald and Khanna, 2001] - worst-case space requirement
  [Arasu and Manku, 2004] - sliding-window-based model, worst-case space requirement
51. 51
Estimating Quantiles
Quantiles, Histograms, Icebergs
q-digest [1]
Groups values into variable-size buckets of almost equal weights; unlike a traditional histogram, buckets can overlap
Key features:
  Detailed information about frequent values is preserved; less frequent values are lumped into larger buckets
  Using a message of size m, answers a quantile query within an error of 3 log(σ)/m
Structure: a complete binary tree over the value range, where σ = max signal value, n = # elements, k = compression factor
Except for root and leaf nodes, a node v ∈ q-digest iff count(v) ≤ ⌊n/k⌋ and count(v) + count(v_parent) + count(v_sibling) > ⌊n/k⌋
[1] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks. In Proceedings of SenSys, 2004.
52. 52
Estimating Quantiles
Quantiles, Histograms, Icebergs
Building a q-digest
q-digests can be constructed in a distributed fashion by merging q-digests
53. 53
Frequent Elements
A core streaming problem
Applications:
  Track bandwidth hogs
  Determine popular tourist destinations
  Itemset mining
  Entropy estimation
  Compressed sensing
  Search log mining
  Network data analysis
  DBMS optimization
54. 54
Frequent Elements
A core streaming problem
Count-Min Sketch [1]
A two-dimensional array of counts with w columns and d rows; each entry of the array is initially zero
d hash functions h_1, ..., h_d : {1 ... n} -> {1 ... w} are chosen uniformly at random from a pairwise independent family
Update: for a new element i, for each row j and k = h_j(i), increment the kth column by one
Point query: est(i) = min_j sketch[j, h_j(i)], where sketch is the table
Parameters (ε, δ): w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉
[1] Cormode, Graham; S. Muthukrishnan (2005). "An Improved Data Stream Summary: The Count-Min Sketch and its Applications". J. Algorithms 55: 29-38.
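The update and point-query rules above translate directly into code. This is an illustrative sketch (class name is mine; the d pairwise-independent hash functions are simulated by salting one hash): estimates never undercount, and overcount by at most ε·N with probability 1-δ.

```python
import hashlib
import math

class CountMinSketch:
    """Count-min sketch with w = ceil(e/eps) columns and
    d = ceil(ln(1/delta)) rows of counters."""
    def __init__(self, eps=0.001, delta=0.01):
        self.w = math.ceil(math.e / eps)
        self.d = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.w for _ in range(self.d)]
    def _cols(self, item):
        for j in range(self.d):              # one salted hash per row
            h = hashlib.sha256(f"{j}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.w
    def update(self, item, count=1):
        for j, k in enumerate(self._cols(item)):
            self.table[j][k] += count        # increment column k in row j
    def query(self, item):
        # min over rows bounds the overcount from hash collisions
        return min(self.table[j][k] for j, k in enumerate(self._cols(item)))

cms = CountMinSketch()
for i in range(10_000):
    cms.update(i % 100)      # 100 distinct items, each with true frequency 100
print(cms.query(7))          # >= 100, and close to it
```

Taking the minimum across rows is what makes the estimator one-sided: every row overcounts or is exact, never undercounts.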
55. 55
Frequent Elements
A core streaming problem
Variants of the Count-Min Sketch [1]
Count-Min sketch with conservative update (CU sketch)
  When updating an item with frequency c, avoid unnecessary updating of counter values => reduces over-estimation error
  Still prone to over-estimation error on low-frequency items
Lossy Conservative Update (LCU) - SWS
  Divide the stream into windows
  At window boundaries, ∀ 1 ≤ i ≤ w, 1 ≤ j ≤ d, decrement sketch[i,j] if 0 < sketch[i,j] ≤ …
[1] Cormode, G. 2009. Encyclopedia entry on 'Count-Min Sketch'. In Encyclopedia of Database Systems. Springer, 511-516.
56. 56
Anomaly Detection
Researched over > 50 yrs
Large set of real-world applications:
  Social media: trending analysis
  Fraud detection: insurance, e-commerce, marketing
  Network intrusion detection
  Health care
  Sensor networks: anomalous state detection (e.g., wind turbines)
  Operations
    Metric space: system, application, data center
    Anomalies potentially impact performance, availability, reliability
57. 57
Anomaly Detection
Researched over > 50 yrs
Anomaly is contextual; studied across many fields:
  Manufacturing
  Statistics
  Econometrics, financial engineering
  Signal processing
  Control systems, autonomous systems - fault detection [1]
  Networking
  Computational biology (e.g., microarray analysis)
  Computer vision
[1] A. S. Willsky, "A survey of design methods for failure detection systems," Automatica, vol. 12, pp. 601-611, 1976.
59. 59
Anomaly Detection
Researched over > 50 yrs
Traditional approaches
Rule based: μ ± σ
  Manufacturing, Statistical Process Control [1]
Moving averages: SMA, EWMA, PEWMA
Assumption: normal distribution - mostly does not hold in real life
[1] W. A. Shewhart. Economic Quality Control of Manufactured Product, The Bell Labs Technical Journal, 9(2):364-389, 1930.
60. 60
Anomaly Detection
Researched over > 50 yrs
In practice
Robustness: μ and σ are not robust in the presence of anomalies; use the median and MAD (Median Absolute Deviation)
Seasonality, trend, multi-modal distributions: use time series decomposition
AnomalyDetection R package [1]
[1] https://github.com/twitter/AnomalyDetection
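The median/MAD idea above can be shown in a few lines. This sketch (function name and threshold are illustrative, using the common modified z-score with its 0.6745 constant) flags points far from the median in MAD units; unlike μ and σ, these statistics are not dragged toward the anomalies themselves.

```python
import statistics

def mad_anomalies(series, threshold=3.5):
    """Flag indices whose modified z-score
    |0.6745 * (x - median) / MAD| exceeds the threshold."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    if mad == 0:
        return []                      # degenerate: over half the points identical
    return [i for i, x in enumerate(series)
            if abs(0.6745 * (x - med) / mad) > threshold]

data = [10, 11, 9, 10, 12, 10, 95, 11, 10, 9]   # one obvious spike
print(mad_anomalies(data))  # [6]
```

On the same data, a μ ± 3σ rule is weakened because the spike inflates both the mean and the standard deviation; the median and MAD ignore it, which is exactly the robustness argument on the slide.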
61. 61
Anomaly Detection
Researched over > 50 yrs
Marrying Time Series Decomposition and Robust Statistics
Trend smoothing distortion creates "phantom" anomalies; the median is free from distortion
62. 62
Anomaly Detection
Researched over > 50 yrs
Real-Time Challenges
  Adaptive learning
  Automated modeling
  Marrying theory with contextual relevance
  Operations: a large set of different services in a technology stack; different stacks use different services
  Promising products such as OpsClarity
63. 63
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
Evolving Needs of Modern Operations
Contextual Application Topology Map
Hierarchical: Datacenter -> Applications -> Services -> Hosts
• Automatically discover the Developer/Architect's view of the application - for the Operations team
  - Framework for system config and context
• Real-time, streaming architecture
  - Keeps up with today's elastic infrastructure
• Scale to 1000s of hosts, 100s of (micro)services
• Present evolution of system state over time
  - DVR-like replay of health, system changes, failures
64. 64
Anomaly Detection
Researched over > 50 yrs
Anomalies in operational data: Challenges
  Automatically learn baselines for metrics
  Data variety requires advanced statistical approaches
  Detect issues earlier; proactive alerting
Example: Detecting Disk Full Issues Early
66. 66
The Key Aspects
Requirements of Stream Processing
In-stream: process data as it passes by
Handle imperfections: delayed, missing and out-of-order data
Predictable: deterministic and repeatable outcomes
Performance and scalability
"8 Requirements of Stream Processing," Mike Stonebraker et al., SIGMOD Record 2005
67. 67
The Key Aspects
Requirements of Stream Processing
High-level languages: SQL or a DSL
Integrate stored and streaming data: for comparing the present with the past
Data safety and availability
Process and respond: the application should keep up at high volumes
"8 Requirements of Stream Processing," Mike Stonebraker et al., SIGMOD Record 2005
68. 68
Window Processing
Stream Processing
T. Akidau et al., The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. In VLDB, 2015.
69. 69
Three Generations
First Generation
  Extensions to existing database engines, or simplistic engines
  Dedicated to specific applications or use cases
Second Generation
  Enhanced methods regarding language expressiveness
  Distributed processing, load balancing and fault tolerance
Third Generation
  Massive parallelization for processing large data sets
  Dedicated towards cloud computing
http://www.slideshare.net/zbigniew.jerzak/cloudbased-data-stream-processing
73. 73
Notable features
1st Generation Systems
Early: Active DBs, ECA (Event-Condition-Action) rules, triggers, publish-subscribe
Rule pipeline: Event Source -> Signaling (event occurrences) -> Triggering (triggered rules) -> Evaluation (evaluated rules) -> Scheduling (selected rules) -> Execution
Systems: HiPAC, Starburst, Postgres, ODE
"Active Database Systems", Paton and Diaz, ACM Computing Surveys, 1999
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
74. 74
Notable features
1st Generation Applications
  Finance
  Actuation (also IoT?)
  Enforcing database integrity constraints
  Monitoring the physical world (IoT?)
  Supply chain
  News and update dissemination
  Battlefield awareness
  Health monitoring
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
75. 75
Issues
1st Generation Systems
  Rules were (are) hard to program or understand
  Smart engineering of traditional approaches can get you close enough?!
  Little commercial activity
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
77. 77
2nd Generation Systems
Early 2000s - Late 2000s
NiagaraCQ [Jianjun Chen et al., 2000]
Telegraph, TelegraphCQ [Hellerstein et al., 2000] [Chandrasekaran et al., 2003]
80. 80
The basic idea
Stream Query Processing
Repeatedly apply generic SQL to the results of window operators
Support the full SQL language and ecosystem
A table is a set of records; a stream is an unbounded sequence of records
Window operators convert streams to tables: Streams -> Window Operators -> Tables; each window outputs a set of records
Rstream semantics in CQL, Arvind Arasu et al., VLDB Journal 2006
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
81. 81
TelegraphCQ
Developed at University of California, Berkeley
01 Data stream query processor
02 Continuous and adaptive query processing
03 Built by modifying PostgreSQL
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
82. 82
NiagaraCQ
Developed at UW-Madison
A distributed database system for continuous queries over changing data sets, using a query language like XML-QL
01 Incremental group optimization strategy; incremental evaluation of continuous queries
02 Query grouping: allows sharing common parts of two or more queries
03 Caching: for performance
04 Push/pull data ingestion for detected changes in data; change-based and timer CQs: continuous queries triggered on data changes and on regular timers
84. 84
Borealis
Developed at MIT, Brown and Brandeis
A low-latency distributed stream processing engine with a focus on fault tolerance and distribution
01 Load-aware distribution
02 Fine-grained high availability; load shedding mechanisms
03 Dynamic query modification
04 Dynamic system optimization; dynamic revision of results
85. 85
Summary
2nd Generation Systems
  Streams can be processed using relational operators; many relational operators can be reused
  Historical comparison becomes a join of a stream and its history table
  Views on streams can be created
  Can leverage an RDBMS: streams and stream results can be stored in tables for later querying
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
86. 86
Issues
2nd Generation Systems
  Despite significant commercial activity, no real breakout
  No standardization and no comprehensive benchmarks
  Value proposition for learning new concepts was not clear
Slide from Mike Franklin, VLDB 2015 BIRTE Talk on Real Time Analytics
88. 88
The last decade
Streaming Platforms
  S4 (Yahoo!)
  Storm (Twitter)
  Heron (Twitter)
  Spark (Databricks)
  Samza (LinkedIn)
  MillWheel (Google)
  Pulsar (eBay)
  Flink (Apache)
  S-Store (ISTC, Intel, MIT, Brown, CMU, Portland State)
  Trill (Microsoft)
89. 89
Earliest distributed stream system
Apache S4
Scalable: throughput is linear as additional nodes are added
Cluster management: hides management using a layer on ZooKeeper
Decentralized: all nodes are symmetric; no centralized service
Extensible: building blocks of the platform can be replaced by custom implementations
Fault tolerance: standby servers take over when a node fails
Proven: deployed in Yahoo!, processing thousands of search queries per second
91. 91
Storm Terminology
Topology: a directed acyclic graph - vertices = computation, edges = streams of data tuples
Spouts: sources of data tuples for the topology; examples - Kafka/Kestrel/MySQL/Postgres
Bolts: process incoming tuples and emit outgoing tuples; examples - filtering/aggregation/join/any function
93. 93
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
Live stream of Tweets; e.g., #worldcup: 1M, soccer: 400K, ...
94. 94
Tweet Word Count Topology
Tweet Spout -> Parse Tweet Bolt -> Word Count Bolt
When a parse tweet bolt task emits a tuple, which word count bolt task should it send it to?
95. 95
Storm Groupings
01 Shuffle Grouping: random distribution of tuples
02 Fields Grouping: group tuples by a field or multiple fields
03 All Grouping: replicate tuples to all tasks
04 Global Grouping: send the entire stream to one task
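Fields grouping answers the question posed two slides back: route by a hash of the grouping field so every tuple with the same word lands on the same word-count task. A minimal stand-in (function name and the CRC32 choice are illustrative, not Storm's actual hashing):

```python
import zlib

def fields_grouping(values, fields, num_tasks):
    """Pick a downstream task by hashing the grouping field(s): tuples
    sharing the same field values always go to the same task, whereas
    shuffle grouping would pick a task at random."""
    key = "|".join(str(values[f]) for f in fields)
    return zlib.crc32(key.encode()) % num_tasks

# hypothetical parse-tweet output routed to 4 word-count tasks
counts = [{} for _ in range(4)]
for word in "world cup world soccer cup world".split():
    task = fields_grouping({"word": word}, ["word"], num_tasks=4)
    counts[task][word] = counts[task].get(word, 0) + 1
print(counts)  # each word is counted on exactly one task
```

Because routing is deterministic in the word, each task holds the complete count for the words assigned to it, so no cross-task merging is needed.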
108. 108
Some experiments
Storm Overheads
Java program: read from a Kafka cluster and serialize in a loop; sustains input rates of 300K msgs/sec from a Kafka topic
1-stage topology: no acks (acks are needed for at-least-once semantics); Storm processes co-located using the isolation scheduler
1-stage topology with acks: acks enabled for at-least-once semantics
115. 115
MillWheel
Core Concepts
Computations: arbitrary user logic; per-key operation
Persistent state: key/value API, backed by BigTable
Streams: identified by names
Keys: unbounded; per-key operation is serial, different keys run in parallel
116. 116
MillWheel
Low Watermark: The Concept of Time
Caught-up time: defined per computation
Discard late data: ~0.001% of data at Google
Seeded by injectors: the input sources
Monotonic: makes life easy for users
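The low watermark of a computation can be thought of as the oldest event timestamp still in flight anywhere upstream: the minimum over its own pending work and the watermarks of the stages feeding it. A simplified one-stage sketch (MillWheel computes this recursively across the whole pipeline):

```python
def low_watermark(oldest_pending_work, input_low_watermark):
    """Low watermark of one computation:
    min(timestamps of work not yet completed here,
        low watermark of the inputs feeding this stage).
    Everything older than this value has almost certainly been seen,
    which is what lets the system discard (rare) late data."""
    pending = min(oldest_pending_work) if oldest_pending_work else float("inf")
    return min(pending, input_low_watermark)

# The injector reports inputs complete up to t=100, but a record with
# timestamp 40 is still buffered locally: the watermark holds at 40.
print(low_watermark([40, 75], 100))  # 40
# Once pending work drains, the watermark advances (monotonically).
print(low_watermark([], 100))        # 100
```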
117. 117
MillWheel
Strong and Weak Productions
Checkpoint: at the same time as user state (strong productions)
Double count: possible without dedup (weak productions)
No checkpoint: simpler API (weak productions)
Seeded by injectors: the input sources
125. 125
One Size Fits All
Apache Flink
General purpose analytics engine
Open source and community driven
Works well with the Hadoop ecosystem
Came out of Stratosphere
126. 126
Apache Flink
Ambitious Goal: One Size Fits All
Fast runtime: complex DAG operators; data streamed to operators
Iterative algorithms: much faster with in-memory operations
Intuitive APIs: Java/Scala/Python; concise queries, coming from the OLTP world
129. 129
One system to replace them all!
General purpose compute engine
Open source, big community
MapReduce, Streaming, SQL, ...
Integrates well with the Hadoop ecosystem
130. 130
Core Concept: Lots of RDDs
Lots: a huge collection, with lineage info
Resilient: lost datasets are re-computed
Distributed: across the cluster
DataSet: input data divided into batches
Streaming
134. 134
Streaming: With DStreams
Series of RDDs: a lines DStream (batches T0 to T1, T1 to T2, T2 to T3) is transformed by flatMap into a words DStream (T0 to T1, T1 to T2, T2 to T3)
Window functions
Can create other DStreams
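A DStream is just a series of RDDs, one per batch interval, and transforming the DStream means transforming every underlying RDD. A plain-Python sketch of the lines-to-words picture above, with lists standing in for RDDs (illustrative only, not the Spark API):

```python
# Each element of the DStream is one micro-batch (one RDD) covering
# an interval such as T0-T1, T1-T2, T2-T3.
lines_dstream = [
    ["to be or", "not to be"],   # batch T0 to T1
    ["the quick brown fox"],     # batch T1 to T2
    ["hello world"],             # batch T2 to T3
]

def flat_map(batch, fn):
    """flatMap over one RDD: apply fn to each element and flatten."""
    return [item for element in batch for item in fn(element)]

# A DStream transformation is applied batch-by-batch, yielding a new
# series of RDDs: the 'words' DStream.
words_dstream = [flat_map(batch, str.split) for batch in lines_dstream]
print(words_dstream[0])  # ['to', 'be', 'or', 'not', 'to', 'be']
```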
136. 136
Input DStreams: Sources of Data
Basic sources: HDFS, S3, ...
Reliability: ack vs no-ack sources
Custom: implement the interface
Advanced: Kafka, Twitter utils
137. 137
Basic Premise: One Size Fits All
Exactly once: confident about results
Ecosystem: Hadoop, YARN, Kafka, ...
Scalable: RDDs as the unit of scale
Single system: batch + streaming
138. 138
Stream Processing: With SQL
Processing logic in SQL
Annotation plugin framework to extend SQL
Clustering with elastic scaling
No downtime during upgrades
141. 141
Messaging Models
Push (at most once): used for low latency; the producer pushes data directly to the consumer; events are written to Kafka for later replay if the consumer is down or unable to keep up
Pull (at least once): the producer writes events to Kafka; the consumer consumes from Kafka; storing to Kafka allows for replay
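The pull model gets its at-least-once guarantee from when the consumer commits: only after processing a batch. A toy sketch, with the broker modeled as an in-memory list (this is not the Kafka client API; `ToyLog` and its methods are illustrative):

```python
class ToyLog:
    """A minimal stand-in for a Kafka-like partition log."""
    def __init__(self):
        self.events = []
        self.committed_offset = 0

    def append(self, event):
        """Producer side: write an event to the log."""
        self.events.append(event)

    def poll(self):
        """Consumer side: pull everything after the last commit."""
        return self.events[self.committed_offset:]

    def commit(self, offset):
        """Advance the committed offset -- only AFTER processing."""
        self.committed_offset = offset

log = ToyLog()
for e in ["e1", "e2", "e3"]:
    log.append(e)

processed = []
batch = log.poll()
for event in batch:
    processed.append(event)   # process first...
# ...a crash here, before the commit, means the batch is replayed on
# restart: possible duplicates, never loss -- the at-least-once contract.
log.commit(log.committed_offset + len(batch))
print(log.poll())  # [] -- nothing left to replay after the commit
```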
142. 142
Deployment Architecture
Events are partitioned: all events with the same key are routed to the same cell
Scaling: more cells are added to the pipeline; Pulsar automatically detects new cells and rebalances traffic
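Routing every event with the same key to the same cell only needs a deterministic hash. A sketch of the idea, and of why rebalancing onto new cells needs care (this is not Pulsar's actual routing algorithm):

```python
import zlib

def route(key, num_cells):
    """Deterministic key -> cell routing: same key, same cell."""
    return zlib.crc32(key.encode()) % num_cells

keys = [f"user-{i}" for i in range(1000)]

# Every event for a key lands on one cell, so per-key state stays local.
assert route("user-42", 3) == route("user-42", 3)

# Adding a cell (3 -> 4) remaps most keys under plain modulo routing;
# this is why rebalancing traffic is non-trivial (consistent hashing,
# for example, limits how many keys have to move).
moved = sum(route(k, 3) != route(k, 4) for k in keys)
print(f"{moved} of {len(keys)} keys changed cells")
```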
147. 147
Heron
Design: Goals
Fully API compatible with Storm: directed acyclic graphs; topologies, spouts and bolts
Task isolation: ease of debug-ability, isolation and profiling
Batching of tuples: amortizing the cost of transferring tuples
Support for back pressure: topologies should be self-adjusting
Use of mainstream languages: C++, Java and Python
Efficiency: reduce resource consumption
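Back pressure is what makes a topology self-adjusting: a fast producer is slowed to the consumer's pace instead of overwhelming it or forcing drops. A minimal sketch using a bounded queue between two threads (illustrative only; Heron's actual mechanism works at the stream-manager/TCP level):

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # small bounded channel: spout -> bolt
results = []

def spout():
    for i in range(20):
        buf.put(i)             # blocks when the queue is full: the
                               # producer is throttled to consumer speed
    buf.put(None)              # sentinel: end of stream

def bolt():
    while True:
        item = buf.get()
        if item is None:
            break
        results.append(item * 2)

consumer = threading.Thread(target=bolt)
producer = threading.Thread(target=spout)
consumer.start()
producer.start()
producer.join()
consumer.join()
print(len(results))  # 20 -- nothing dropped despite the tiny buffer
```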
168. 168
3rd Generation Systems: Issues
Bit early to tell
Still no standardization and too many systems
Slide from Mike Franklin's VLDB 2015 BIRTE talk on real-time analytics
172. 172
Lambda Architecture - The Good
Message Broker -> Collection Pipeline -> Analytics Pipeline -> Results
173. 173
Lambda Architecture - The Bad
Have to write everything twice!
Have to fix everything (maybe twice)!
Subtle differences in semantics
What about Graphs, ML, SQL, etc.?
How much duct tape required?
176. 176
Technology Challenges
The Road Ahead
Auto scaling the system in the presence of unpredictability
Auto tuning of real-time analytics jobs/queries
Exploiting faster networks for efficiently moving data
177. 177
Applications
The Road Ahead
Real-time personalization: preferences, time, location and social
Wearable computing: screen size fragmentation
Analytics on image, video and touch: pattern recognition, anomaly detection