Building & Operating High-Fidelity Data Streams - QCon Plus 2021, by Sid Anand
The world we live in today is fed by data. From self-driving cars and route planning, to fraud prevention, to content and network recommendations, to ranking and bidding, our world not only consumes low-latency data streams but also adapts to changing conditions modeled by that data.
While software engineering has settled on best practices for developing and managing both stateless service architectures and database systems, the ecosystem of data infrastructure still presents a greenfield opportunity. To thrive, this field borrows from several disciplines: distributed systems, database systems, operating systems, control systems, and software engineering, to name a few.
Of particular interest to me is the subfield of data streams, specifically how to build high-fidelity nearline data streams as a service within a lean team. To build such systems, relying on human operations is a non-starter: all aspects of operating streaming data pipelines must be automated. Come to this talk to learn how to build such a system soup-to-nuts.
This is a talk about Netflix's path to Cassandra. The first few slides may look similar to previous presentations, but they are just there to set the context. Most of the content is brand new!
Resilient Predictive Data Pipelines (GOTO Chicago 2016), by Sid Anand
Sid Anand presented on building resilient predictive data pipelines. The key challenges discussed were the "blast radius" problem where bugs can affect downstream jobs and data, and ensuring timeliness as pipelines and algorithms change over time. Design goals for resilient pipelines include operability, correctness, timeliness and cost. AWS services like SQS, SNS, EMR and auto scaling groups were presented as ways to address these goals by enabling quick recoverability, pay as you go costs, and auto scaling to improve timeliness. Apache Airflow was also discussed as a way to provide workflow automation, scheduling, performance insights and integration with monitoring tools to improve operability and correctness.
Kafka and Storm - event processing in realtime, by Guido Schmutz
Apache Kafka is publish-subscribe messaging rethought as a distributed commit log. It is designed to allow a single cluster to serve as the central data backbone for a large organization. It can be elastically and transparently expanded without downtime. Storm is a distributed realtime computation system. Storm makes it easy to reliably process unbounded streams of data, doing for realtime processing what Hadoop did for batch processing. Storm has many use cases: realtime analytics, online machine learning, continuous computation, distributed RPC, ETL, and more. This session presents the main concepts of Kafka and Storm and then shows how a simple stream processing application is implemented using these two technologies.
This session will go into best practices and detail on how to architect a near-real-time application on Hadoop, using an end-to-end fraud detection case study as an example. It will discuss the various options available for ingest, schema design, processing frameworks, storage handlers, and more, and walk through each of the architectural decisions among those choices.
Many architectures include both real-time and batch processing components. This often results in two separate pipelines performing similar tasks, which can be challenging to maintain and operate. We'll show how a single, well-designed ingest pipeline can be used for both real-time and batch processing, making the desired architecture feasible for scalable production use cases.
Resilient Predictive Data Pipelines (QCon London 2016), by Sid Anand
This document discusses building resilient predictive data pipelines. It begins by distinguishing between ETL and predictive data pipelines, noting that predictive pipelines require high availability with downtimes of less than an hour. The document then outlines design goals for resilient data pipelines, including being scalable, available, instrumented/monitored/alert-enabled, and quickly recoverable. It proposes using AWS services like SQS, SNS, S3, and Auto Scaling Groups to build such pipelines. The document also recommends using Apache Airflow for workflow automation and scheduling to reliably manage pipelines as directed acyclic graphs. It presents an architecture using these techniques and assesses how well it meets the outlined design goals.
The document discusses Spark Streaming and its execution model. It describes how Spark Streaming receives data streams, divides them into micro-batches, and processes the micro-batches using Spark. It also covers transformations and actions that can be performed on DStreams, window operations, and deployment options for Spark Streaming including local, standalone, and cluster modes.
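The micro-batch execution model described above can be sketched in a few lines of plain Python. This is a toy illustration that batches by count rather than by time interval, and it is not Spark's actual API:

```python
def micro_batches(events, batch_size):
    """Group an unbounded event stream into fixed-size micro-batches,
    loosely mimicking how Spark Streaming discretizes a DStream."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Each micro-batch is then processed with ordinary batch logic.
stream = iter(range(10))
results = [sum(b) for b in micro_batches(stream, 4)]
print(results)  # [6, 22, 17]
```

The key idea is that once the stream is discretized, every downstream transformation can reuse the same code paths as a batch job.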
Apache Spark Streaming: Architecture and Fault Tolerance, by Sachin Aggarwal
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers:
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Agari uses Apache Airflow to automate and orchestrate their data pipelines. They have two main classes of orchestration - operational automation and building new products. One use case described is message scoring, where Airflow manages a batch pipeline to score messages from multiple enterprises on S3, run Spark jobs to score the messages, write outputs to S3/DB, and ingest the results. Airflow allows them to monitor SLAs for correctness and timeliness and integrate with monitoring tools to alert on SLA misses. They operate Airflow in AWS across multiple environments for security, fault tolerance and production deployments.
Why is My Stream Processing Job Slow? with Xavier Leaute (Databricks)
The goal of most stream processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our stream processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn't fast enough? In this talk, we'll dive into the challenges of analyzing the performance of real-time stream processing applications. We'll share troubleshooting suggestions and some of our favorite tools. So next time someone asks "why is this taking so long?", you'll know what to do.
Whether you are developing a greenfield data project or migrating a legacy system, there are many critical design decisions to be made. Often, it is advantageous to consider not only immediate requirements but also the future requirements and technologies you may want to support. Your project may start out supporting batch analytics with the vision of adding realtime support. Or your data pipeline may feed data to one technology today, but tomorrow an entirely new system needs to be integrated. Apache Kafka can help decouple these decisions and provide a flexible core to your data architecture. This talk will show how building Kafka into your pipeline can provide the flexibility to experiment, evolve and grow. It will also cover a brief overview of Kafka, its architecture, and terminology.
This document discusses challenges with writing streaming data from Kafka to Parquet files stored in HDFS. It evaluates several approaches: 1) windowing, which works but uses too much memory; 2) bucketing data into time-based files, which works for failures but requires modifying Flink; 3) closing files at checkpoints, which was later supported in Flink; and 4) hourly batch jobs, which avoids streaming complexity but limits real-time use. The conclusion is that streaming solutions are not trivial and it may be better to use a database or different tool instead of files for this use case. Supporting both real-time and batch processing is challenging.
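The time-based bucketing approach (option 2 above) boils down to mapping each event's timestamp onto a windowed file path, so that all events in the same window land in the same file. A minimal sketch follows; the path layout (`events/dt=.../hr=...`) is a hypothetical example, not the layout from the talk:

```python
from datetime import datetime, timezone

def hourly_bucket(epoch_seconds):
    """Map an event timestamp to an hourly file bucket. In the real
    pipeline, the file would be a Parquet file on HDFS that is closed
    once its hour has passed (or at a checkpoint)."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return ts.strftime("events/dt=%Y-%m-%d/hr=%H/part.parquet")

print(hourly_bucket(0))     # events/dt=1970-01-01/hr=00/part.parquet
print(hourly_bucket(3600))  # events/dt=1970-01-01/hr=01/part.parquet
```

The hard part the text alludes to is not the mapping itself but deciding when a bucket's file can safely be closed despite failures and late data, which is exactly what checkpoint-aligned file closing addresses.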
Spark Streaming & Kafka - The Future of Stream Processing, by Jack Gudenkauf
Hari Shreedharan / Cloudera @Playtika. With its easy-to-use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, and Kinesis, Spark Streaming has become the go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming, how it integrates with Kafka natively with no data loss, and even how to do exactly-once processing!
Processing data from social media streams and sensors in real time is becoming increasingly prevalent, and there are plenty of open-source solutions to choose from. To help practitioners decide what to use when, we compare three popular Apache stream processing projects: Apache Storm, Apache Spark, and Apache Samza.
RedisConf18 - Fail-Safe Starvation-Free Durable Priority Queues in Redis, by Redis Labs
This document describes an approach called "Ick" that uses Redis to provide a durable, starvation-free priority queue for asynchronous job processing. Ick uses two Redis sorted sets - a producer set and a consumer set. Messages are added to the producer set and the lowest scoring messages are moved to the consumer set in batches. This ensures messages are not lost on failure and hot data is not starved. Ick operations like ICKADD, ICKRESERVE, and ICKCOMMIT are implemented in Lua to provide atomicity. Ick addresses issues with basic queue approaches like message loss, unbounded backlogs, and hot data starvation.
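The two-sorted-set design can be modeled with a small in-memory sketch. This is illustrative only: the real Ick keeps both sets in Redis sorted sets and runs each operation atomically in Lua, and the method names here merely mirror ICKADD/ICKRESERVE/ICKCOMMIT:

```python
class IckModel:
    """In-memory model of Ick's two-sorted-set durable priority queue."""

    def __init__(self):
        self.pset = {}  # producer set: message -> score (lower = more urgent)
        self.cset = {}  # consumer set: reserved but not yet committed

    def add(self, message, score):
        # ICKADD: keep the most urgent (lowest) score seen for a message,
        # which also deduplicates re-enqueued work.
        self.pset[message] = min(score, self.pset.get(message, score))

    def reserve(self, n):
        # ICKRESERVE: move up to n lowest-scoring messages into the
        # consumer set. Nothing is deleted yet, so a crashed consumer's
        # messages are simply re-delivered on the next reserve.
        while len(self.cset) < n and self.pset:
            msg = min(self.pset, key=self.pset.get)
            self.cset[msg] = self.pset.pop(msg)
        return sorted(self.cset, key=self.cset.get)[:n]

    def commit(self, message):
        # ICKCOMMIT: drop a message only after it was fully processed.
        self.cset.pop(message, None)

q = IckModel()
q.add("job-a", 5)
q.add("job-b", 1)
print(q.reserve(1))  # ['job-b'] -- lowest score is served first
q.commit("job-b")
```

Because reservation and deletion are separate steps, a failure between them loses no messages, and because reserves always pull the lowest scores, a flood of new low-priority work cannot starve urgent items.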
This document discusses Apache Kafka and how it can be used by Oracle DBAs. It begins by explaining how Kafka builds upon the concept of a database redo log by providing a distributed commit log service. It then discusses how Kafka is a publish-subscribe messaging system and can be used to log transactions from any database, application logs, metrics and other system events. Finally, it discusses how schemas are important for Kafka since it only stores messages as bytes, and how Avro can be used to define and evolve schemas for Kafka messages.
Apache Samza is a framework for reliable stream processing using Apache Kafka and Hadoop YARN. It provides low-latency stream processing by allowing users to write stream processing jobs that consume messages from Kafka topics and process them using simple process functions. Samza jobs are distributed and run across clusters using YARN to provide reliability and scalability. The process functions in Samza allow users to easily integrate stream processing with state storage and message output to other Kafka topics.
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...), by Confluent
Apache Kafka is now nearly ubiquitous in modern data pipelines and use cases. While the Kafka development model is elegantly simple, operating Kafka clusters in production environments is a challenge. It’s hard to troubleshoot misbehaving Kafka clusters, especially when there are potentially hundreds or thousands of topics, producers and consumers and billions of messages.
The root cause of lag in a real-time application may be an application problem (like poor data partitioning or load imbalance) or a Kafka problem (like resource exhaustion or suboptimal configuration). As a result, getting the best performance, predictability, and reliability for Kafka-based applications can be difficult. In the end, the operation of your Kafka-powered analytics pipelines could itself benefit from machine learning (ML).
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ..., by Kai Wähner
Talk from Kafka Summit San Francisco 2019 (https://kafka-summit.org/sessions/event-driven-model-serving-stream-processing-vs-rpc-kafka-tensorflow/). Video recording will be available for free on the Summit website.
Event-based stream processing is a modern paradigm for continuously processing incoming data feeds, e.g. for IoT sensor analytics, payment and fraud detection, or logistics. Machine Learning / Deep Learning models can be leveraged in different ways to make predictions and improve business processes. Either analytic models are deployed natively in the application, or they are hosted in a remote model server. In the latter case, you combine stream processing with an RPC / Request-Response paradigm instead of doing inference directly within the application. This talk discusses the pros and cons of both approaches and shows examples of stream processing vs. RPC model serving using Kubernetes, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving. The trade-offs of using a public cloud service like AWS or GCP for model deployment are also discussed and compared to local hosting for offline predictions directly “at the edge”.
Key takeaways
• Machine Learning / Deep Learning models can be used in different ways to do predictions. Scalability and loose coupling are important success factors
• Stream processing vs. RPC / Request-Response for model serving has many trade-offs – learn about alternatives and best practices for your different scenarios
• Understand the alternatives and trade-offs of model deployment in modern infrastructures like Kubernetes or Cloud Services like AWS or GCP
• See live demos with Java, gRPC, Apache Kafka, KSQL and TensorFlow Serving to understand the trade-offs
The document discusses using Apache Kafka for event detection pipelines. It describes how Kafka can be used to decouple data pipelines and ingest events from various source systems in real-time. It then provides an example use case of using Kafka, Hadoop, and machine learning for fraud detection in consumer banking, describing the online and offline workflows. Finally, it covers some of the challenges of building such a system and considerations for deploying Kafka.
PayPal has seen tremendous growth in recent years, processing over 7.8 billion payments transactions annually for over 227 million active customer accounts across 200+ markets and currencies. To support this scale, PayPal's data infrastructure includes over 2,000 database instances, 116 billion database calls per day, and over 74 petabytes of total storage. PayPal continues enhancing its data infrastructure to meet growing analytics and machine learning needs through technologies like Kafka, Hadoop, graph databases and real-time OLAP engines.
Lambda at Weather Scale, by Robbie Strickland (Spark Summit)
This document discusses The Weather Company's use of Cassandra and data analytics. Some key points:
- TWC collects ~30 billion API requests and ~360 PB of data daily from 120 million mobile users.
- Early attempts involved batch loading large datasets into Cassandra, which was slow and expensive. Streaming data via Kafka and REST services was also unnecessary.
- The improved architecture uses Cassandra for streaming data with individual tables for each event type. All other data is stored in S3. Amazon SQS replaces Kafka for reliable streaming ingestion.
- Data exploration is critical and is now done in minutes using tools like Zeppelin, rather than over a month as before.
The Netflix Way to deal with Big Data Problems, by Monal Daxini
The document discusses Netflix's approach to handling big data problems. It summarizes Netflix's data pipeline system called Keystone that was built in a year to replace a legacy system. Keystone ingests over 1 trillion events per day and processes them using technologies like Kafka, Samza and Spark Streaming. The document emphasizes Netflix's culture of freedom and responsibility and how it helped the small team replace the legacy system without disruption while achieving massive scale.
Building High Fidelity Data Streams (QCon London 2023), by Sid Anand
The document discusses building reliable data streams. It begins by describing PayPal's need for a change data capture system to offload database queries. The author then built their own solution at PayPal to meet requirements like high availability and scalability.
Next, the document discusses building a simple initial streaming system with a source, destination, and messaging system between them. It emphasizes making non-functional requirements like reliability first-class citizens.
The document then explores how to make the system reliable by ensuring at-least-once delivery across each link. It proposes using transactions and auto-scaling groups. Finally, it discusses how to measure reliability using lag and loss metrics to track message delays across the system.
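The lag and loss metrics mentioned above can be computed from send/receive timestamps. Here is a minimal sketch, assuming each message carries a unique ID and that timestamps are integer milliseconds (both assumptions for illustration):

```python
def lag_and_loss(sent, received):
    """Compute the two stream-fidelity metrics from the talk:
    - lag:  per-message end-to-end delay (receive time minus send time)
    - loss: fraction of sent message IDs that never arrived
    `sent` and `received` map message_id -> timestamp in milliseconds."""
    lags = {mid: received[mid] - sent[mid] for mid in sent if mid in received}
    lost = set(sent) - set(received)
    loss_rate = len(lost) / len(sent) if sent else 0.0
    return lags, loss_rate

sent = {"m1": 1000, "m2": 1005, "m3": 1010}
received = {"m1": 1002, "m3": 1019}
lags, loss = lag_and_loss(sent, received)
print(max(lags.values()))  # 9 -- worst-case delay in ms
print(round(loss, 2))      # 0.33 -- m2 never arrived
```

In a real pipeline these would be tracked per hop (source to bus, bus to sink) so an SLA miss can be attributed to a specific link rather than to the stream as a whole.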
C* Summit 2013: Time is Money, by Jake Luciani and Carl Yeksigian (DataStax Academy)
This session will focus on our approach to building a scalable TimeSeries database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes as well as how we monitor and track performance of the system.
Apache Spark Streaming: Architecture and Fault ToleranceSachin Aggarwal
Agenda:
• Spark Streaming Architecture
• How different is Spark Streaming from other streaming applications
• Fault Tolerance
• Code Walk through & demo
• We will supplement theory concepts with sufficient examples
Speakers :
Paranth Thiruvengadam (Architect (STSM), Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/paranth-thiruvengadam-2567719
Sachin Aggarwal (Developer, Analytics Platform at IBM Labs)
Profile : https://in.linkedin.com/in/nitksachinaggarwal
Github Link: https://github.com/agsachin/spark-meetup
Agari uses Apache Airflow to automate and orchestrate their data pipelines. They have two main classes of orchestration - operational automation and building new products. One use case described is message scoring, where Airflow manages a batch pipeline to score messages from multiple enterprises on S3, run Spark jobs to score the messages, write outputs to S3/DB, and ingest the results. Airflow allows them to monitor SLAs for correctness and timeliness and integrate with monitoring tools to alert on SLA misses. They operate Airflow in AWS across multiple environments for security, fault tolerance and production deployments.
Why is My Stream Processing Job Slow? with Xavier LeauteDatabricks
The goal of most streams processing jobs is to process data and deliver insights to the business – fast. Unfortunately, sometimes our streams processing jobs fall short of this goal. Or perhaps the job used to run fine, but one day it just isn’t fast enough? In this talk, we’ll dive into the challenges of analyzing performance of real-time stream processing applications. We’ll share troubleshooting suggestions and some of our favorite tools. So next time someone asks “why is this taking so long?”, you’ll know what to do.
Whether you are developing a greenfield data project or migrating a legacy system,
there are many critical design decisions to be made. Often, it is advantageous to not only
consider immediate requirements, but also the future requirements and technologies you may
want to support. Your project may start out supporting batch analytics with the vision of adding
realtime support. Or your data pipeline may feed data to one technology today, but tomorrow
an entirely new system needs to be integrated. Apache Kafka can help decouple these
decisions and provide a flexible core to your data architecture. This talk will show how building
Kafka into your pipeline can provide the flexibility to experiment, evolve and grow. It will also
cover a brief overview of Kafka, its architecture, and terminology.
This document discusses challenges with writing streaming data from Kafka to Parquet files stored in HDFS. It evaluates several approaches: 1) windowing, which works but uses too much memory; 2) bucketing data into time-based files, which works for failures but requires modifying Flink; 3) closing files at checkpoints, which was later supported in Flink; and 4) hourly batch jobs, which avoids streaming complexity but limits real-time use. The conclusion is that streaming solutions are not trivial and it may be better to use a database or different tool instead of files for this use case. Supporting both real-time and batch processing is challenging.
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
Hari Shreedharan/Cloudera @Playtika. With its easy to use interfaces and native integration with some of the most popular ingest tools, such as Kafka, Flume, Kinesis etc, Spark Streaming has become go-to tool for stream processing. Code sharing with Spark also makes it attractive. In this talk, we will discuss the latest features in Spark Streaming and how it integrates with Kafka natively with no data loss, and even do exactly once processing!
Processing data from social media streams and sensors in real-time is becoming increasingly prevalent and there are plenty open source solutions to choose from. To help practitioners decide what to use when we compare three popular Apache projects allowing to do stream processing: Apache Storm, Apache Spark and Apache Samza.
RedisConf18 - Fail-Safe Starvation-Free Durable Priority Queues in RedisRedis Labs
This document describes an approach called "Ick" that uses Redis to provide a durable, starvation-free priority queue for asynchronous job processing. Ick uses two Redis sorted sets - a producer set and a consumer set. Messages are added to the producer set and the lowest scoring messages are moved to the consumer set in batches. This ensures messages are not lost on failure and hot data is not starved. Ick operations like ICKADD, ICKRESERVE, and ICKCOMMIT are implemented in Lua to provide atomicity. Ick addresses issues with basic queue approaches like message loss, unbounded backlogs, and hot data starvation.
This document discusses Apache Kafka and how it can be used by Oracle DBAs. It begins by explaining how Kafka builds upon the concept of a database redo log by providing a distributed commit log service. It then discusses how Kafka is a publish-subscribe messaging system and can be used to log transactions from any database, application logs, metrics and other system events. Finally, it discusses how schemas are important for Kafka since it only stores messages as bytes, and how Avro can be used to define and evolve schemas for Kafka messages.
Apache Samza is a framework for reliable stream processing using Apache Kafka and Hadoop YARN. It provides low-latency stream processing by allowing users to write stream processing jobs that consume messages from Kafka topics and process them using simple process functions. Samza jobs are distributed and run across clusters using YARN to provide reliability and scalability. The process functions in Samza allow users to easily integrate stream processing with state storage and message output to other Kafka topics.
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
Apache Kafka is now nearly ubiquitous in modern data pipelines and use cases. While the Kafka development model is elegantly simple, operating Kafka clusters in production environments is a challenge. It’s hard to troubleshoot misbehaving Kafka clusters, especially when there are potentially hundreds or thousands of topics, producers and consumers and billions of messages.
The root cause of why real-time applications is lag may be due to an application problem – like poor data partitioning or load imbalance – or due to a Kafka problem – like resource exhaustion or suboptimal configuration. Therefore getting the best performance, predictability, and reliability for Kafka-based applications can be difficult. In the end, the operation of your Kafka powered analytics pipelines could themselves benefit from machine learning (ML).
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Kai Wähner
Talk from Kafka Summit San Francisco 2019 (https://kafka-summit.org/sessions/event-driven-model-serving-stream-processing-vs-rpc-kafka-tensorflow/). Video recording will be available for free on the Summit website.
Event-based stream processing is a modern paradigm to continuously process incoming data feeds, e.g. for IoT sensor analytics, payment and fraud detection, or logistics. Machine Learning / Deep Learning models can be leveraged in different ways to do predictions and improve the business processes. Either analytic models are deployed natively in the application or they are hosted in a remote model server. In the latter you combine stream processing with RPC / Request-Response paradigm instead of direct doing direct inference within the application. This talk discusses the pros and cons of both approaches and shows examples of stream processing vs. RPC model serving using Kubernetes, Apache Kafka, Kafka Streams, gRPC and TensorFlow Serving. The trade-offs of using a public cloud service like AWS or GCP for model deployment are also discussed and compared to local hosting for offline predictions directly “at the edge”.
Key takeaways
• Machine Learning / Deep Learning models can be used in different ways to do predictions. Scalability and loose coupling are important success factors
• Stream processing vs. RPC / Request-Response for model serving has many trade-offs – learn about alternatives and best practices for your different scenarios
• Understand the alternatives and trade-offs of model deployment in modern infrastructures like Kubernetes or Cloud Services like AWS or GCP
• See live demos with Java, gRPC, Apache Kafka, KSQL and TensorFlow Serving to understand the trade-offs
The document discusses using Apache Kafka for event detection pipelines. It describes how Kafka can be used to decouple data pipelines and ingest events from various source systems in real-time. It then provides an example use case of using Kafka, Hadoop, and machine learning for fraud detection in consumer banking, describing the online and offline workflows. Finally, it covers some of the challenges of building such a system and considerations for deploying Kafka.
PayPal has seen tremendous growth in recent years, processing over 7.8 billion payments transactions annually for over 227 million active customer accounts across 200+ markets and currencies. To support this scale, PayPal's data infrastructure includes over 2,000 database instances, 116 billion database calls per day, and over 74 petabytes of total storage. PayPal continues enhancing its data infrastructure to meet growing analytics and machine learning needs through technologies like Kafka, Hadoop, graph databases and real-time OLAP engines.
Lambda at Weather Scale by Robbie StricklandSpark Summit
This document discusses The Weather Company's use of Cassandra and data analytics. Some key points:
- TWC collects ~30 billion API requests and ~360 PB of data daily from 120 million mobile users.
- Early attempts involved batch loading large datasets into Cassandra, which was slow and expensive. Streaming data via Kafka and REST services was also unnecessary.
- The improved architecture uses Cassandra for streaming data with individual tables for each event type. All other data is stored in S3. Amazon SQS replaces Kafka for reliable streaming ingestion.
- Data exploration is critical and is now done in minutes using tools like Zeppelin, rather than over a month as before.
The Netflix Way to deal with Big Data Problems - Monal Daxini
The document discusses Netflix's approach to handling big data problems. It summarizes Netflix's data pipeline system called Keystone that was built in a year to replace a legacy system. Keystone ingests over 1 trillion events per day and processes them using technologies like Kafka, Samza and Spark Streaming. The document emphasizes Netflix's culture of freedom and responsibility and how it helped the small team replace the legacy system without disruption while achieving massive scale.
Building High Fidelity Data Streams (QCon London 2023) - Sid Anand
The document discusses building reliable data streams. It begins by describing PayPal's need for a change data capture system to offload database queries. The author then built their own solution at PayPal to meet requirements like high availability and scalability.
Next, the document discusses building a simple initial streaming system with a source, destination, and messaging system between them. It emphasizes making non-functional requirements like reliability first-class citizens.
The document then explores how to make the system reliable by ensuring at-least-once delivery across each link. It proposes using transactions and auto-scaling groups. Finally, it discusses how to measure reliability using lag and loss metrics to track message delays across the system.
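The lag and loss metrics mentioned above can be sketched in a few lines; the message shape, timestamps, and per-window bookkeeping here are illustrative assumptions, not the talk's actual implementation:

```python
import time

def lag_ms(event_create_ts_ms, now_ms=None):
    """End-to-end lag: how long ago the message was created at the source."""
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    return now_ms - event_create_ts_ms

def loss(sent_ids, received_ids):
    """Loss: fraction of messages sent in a window that never arrived."""
    sent, received = set(sent_ids), set(received_ids)
    missing = sent - received
    return len(missing) / len(sent) if sent else 0.0

# Example window: 4 messages sent, 3 received -> 25% loss
assert loss([1, 2, 3, 4], [1, 2, 4]) == 0.25
```

In practice these would be computed per hop (source to broker, broker to sink) so a delay or drop can be attributed to a specific link in the pipeline.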
C* Summit 2013: Time is Money - Jake Luciani and Carl Yeksigian - DataStax Academy
This session will focus on our approach to building a scalable TimeSeries database for financial data using Cassandra 1.2 and CQL3. We will discuss how we deal with a heavy mix of reads and writes as well as how we monitor and track performance of the system.
Anton Moldovan, "Building an efficient replication system for thousands of ter..." - Fwdays
For one of our projects, we needed to improve the current content delivery system for terminals. In this talk, I will share our experience in building an efficient data replication system for thousands of terminals. We will touch on architecture decisions and tradeoffs, technologies that we used, and a bit of load testing.
Spoiler: We didn't use Kafka.
This document discusses various data link layer protocols. It begins by describing the services provided by the data link layer, including framing, error control, and flow control. It then discusses different types of framing such as fixed-size and variable-size. The document also covers different protocols for handling flow control and error control, including stop-and-wait, go-back-N ARQ, and selective repeat ARQ. It analyzes the performance of these protocols on both noiseless and noisy channels.
Apache Kafka - Scalable Message-Processing and more! - Guido Schmutz
Independent of the source of data, the integration of event streams into an enterprise architecture gets more and more important in the world of sensors, social media streams, and the Internet of Things. Events have to be accepted quickly and reliably, and they have to be distributed and analysed, often with many consumers or systems interested in all or a subset of the events. How can we make sure that all these events are accepted and forwarded in an efficient and reliable way? This is where Apache Kafka comes into play: a distributed, highly scalable message broker, built for exchanging huge volumes of messages between a source and a target.
This session will start with an introduction to Apache Kafka and present its role in a modern data/information architecture and the advantages it brings to the table. Additionally, the Kafka ecosystem will be covered, as well as the integration of Kafka into the Oracle stack, with products such as GoldenGate, Service Bus, and Oracle Stream Analytics all being able to act as a Kafka consumer or producer.
Amazon Web Services offers several cloud computing services including Mechanical Turk (MTurk), Simple Queue Service (SQS), Simple Storage Service (S3), and Elastic Compute Cloud (EC2). MTurk allows users to submit human intelligence tasks and pay workers to complete them. SQS provides a queue system to communicate between distributed applications. S3 provides storage buckets for files. EC2 offers virtual servers that can be configured and deployed on demand in the cloud.
Synapse 2018: Guarding against failure in a hundred step pipeline - Calvin French-Owen
Control store (ctlstore) is a new infrastructure component that solves the "n+1 problem" of independent database failures bringing down a distributed system. It replicates control data from a system of record to local SQLite databases on each service node. Ctlstore loads data transactionally from the system of record using a loader, executive, and control data log. It then replicates changes to local databases using a reflector. Snapshots to object storage allow new instances to start up with the latest data. This approach provides high availability and fast querying of shared control data across many services.
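The reflector's job of applying an ordered change log to a local SQLite replica can be sketched with the stdlib `sqlite3` module. The change-log format and table name below are illustrative assumptions, not ctlstore's actual wire format:

```python
import sqlite3

# Hypothetical ordered change-log entries replicated from the system of record.
change_log = [
    ("upsert", "dark_mode", "on"),
    ("upsert", "beta_api", "off"),
    ("delete", "beta_api", None),
]

def apply_changes(conn, entries):
    """Apply change-log entries, in order, to the local SQLite replica."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS feature_flags (k TEXT PRIMARY KEY, v TEXT)"
    )
    for op, key, value in entries:
        if op == "upsert":
            conn.execute(
                "INSERT INTO feature_flags (k, v) VALUES (?, ?) "
                "ON CONFLICT(k) DO UPDATE SET v = excluded.v",
                (key, value),
            )
        elif op == "delete":
            conn.execute("DELETE FROM feature_flags WHERE k = ?", (key,))
    conn.commit()

conn = sqlite3.connect(":memory:")
apply_changes(conn, change_log)
# Reads are now fast, local, in-process queries -- no network round trip.
rows = dict(conn.execute("SELECT k, v FROM feature_flags"))
```

The key property is that every node answers control-data reads from its own replica, so the shared system of record is never on the hot path.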
Deep Dive: AWS X-Ray (London Summit 2017) - Randall Hunt
Instrument production applications (both in AWS and on prem) with x-ray to collect live telemetry and latency metrics on your applications. You can also use it to debug live!
Hard Truths About Streaming and Eventing (Dan Rosanova, Microsoft) - Kafka Summ... - confluent
Eventing and streaming open a world of compelling new possibilities to our software and platform designs. They can reduce time to decision and action while lowering total platform cost. But they are not a panacea. Understanding the edges and limits of these architectures can help you avoid painful missteps. This talk will focus on event driven and streaming architectures and how Apache Kafka can help you implement these. It will also discuss key tradeoffs you will face along the way from partitioning schemes to the impact of availability vs. consistency (CAP Theorem). Finally we’ll discuss some challenges of scale for patterns like Event Sourcing and how you can use other tools and even features of Kafka to work around them. This talk assumes a basic understanding of Kafka and distributed computing, but will include brief refresher sections.
This document discusses cloud native applications and service meshes. It notes the challenges of scaling monolithic applications to the cloud, including issues around redundancy, scheduling, service discovery, and resiliency. It then introduces containers as a way to address these challenges by pushing complexity between services. However, this introduces new challenges around observability, dependencies, failures, traffic flow, and security. The document proposes service meshes as a solution to these challenges, providing features like load balancing, failure handling, auto-scaling, and security across distributed services without requiring code changes. It provides examples using Consul Connect and discusses how service meshes can provide an "immune system" for cloud native applications.
1-1- What is traceroute- in Linux- or tracert- in Windows- useful- How.pdf - aakarsalem
1.1. What is traceroute, in Linux, or tracert, in Windows, useful? How does this command work?
Which protocol is used in this command? Give an example of this command, explaining each
step it takes. [6] 1.2. Explain both TCP / IP and ISO/OSI communication protocol stacks,
identifying each of the protocol layers. How does encapsulation work in this layered design? In
terms of ratio header/payload, estimate the optimistic accumulate overhead these protocols incur
for each application "packet" that is transmitted over the Internet. [6] 1.3. What are the principles
of Reliable Data Transfer? Explain pipeline transmission strategy, as well as Go-BackN and
Selective transmissions. [6] 1.4. What is TCP silly window syndrome? Give an example to
illustrate the problem. [6] 1.5. Why does TCP operate poorly in Wireless Networks? Given an
example to illustrate the problem. [6] 1.6. What is Head-of-the-line (HOL) blocking in a router?
How does the router fabric influence the related delay and drop issues? [6] 1.7. Explain TDMA,
FDMA, CDMA, and CSMA protocols. [6] 1.8. What is ARP resolution protocol? Explain how
ARP works. Give an example showing how it works. What does an ARP table store (attributes)?
[6] 1.9. What is Ethernet? What is the role of Ethernet in the Internet? How does an Ethernet
switch work? What is the difference between a switch and a router? [6] 1.10. What is SNR?
What factors influence SNR? What is BER? What does SNR and BER have to do with each
other? [6] 1.11. Write a TCP/IP client application that should connect to a TCP/IP server running
at host 139.57.100.45 and listening at port 12000 . Your client must be implemented in Java.
Your client application should be simple and should conduct the following steps: connect to the
server, send a question in the form of a string over the connection, and read the answer sent by
the server. Your question should be simple and not too long (up to 100 characters, including the
question mark). Please note that the server is at Brock University, so you need to run your client
in one of the lab computers or sandcastle. The server will send a String as a response. For
completing this question, you will need to write here the following items: (i) the question you
have sent to the server, (ii) the exact answer of the server, (iii) your code (it should be concise e
include comments). [10].
Lessons learned migrating 100+ services to Kubernetes - Jose Galarza
TransferWise migrated over 150 services from bare metal servers to Kubernetes in AWS over the course of a year. They developed tools and processes to automate cluster creation with Terraform, service deployment through custom manifest generation from YAML definitions, and a network mesh with Envoy to enable service-to-service communication. Some lessons learned included not exposing Kubernetes complexity, choosing the right CNI plugin, and installing node local DNS to reduce errors. Future plans include using EKS for simplified upgrades, enabling developer environments, and implementing network policies and advanced deployment pipelines.
Developing a Globally Distributed Purging System - Fastly
(Surge 2014) How do you build a distributed cache invalidation system that can invalidate content in 150 milliseconds across a global network of servers? Fastly CTO Tyler McMullen and engineer Bruce Spang will discuss the process of constructing a production-ready distributed system built on solid theoretical foundations. This talk will cover using research to design systems, the bimodal multicast algorithm, and the behavior of this system in production.
This document discusses Reactive Programming and Reactive Streams. It introduces Reactor, a reactive programming framework, and how it addresses issues like latency in microservices architectures. Reactive Streams provide an interoperable way to work with asynchronous data streams in a non-blocking manner. Streams represent sequences of data that can be processed reactively through operators like map and filter.
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data... - DataStax Academy
Typesafe did a survey of Spark usage last year and found that a large percentage of Spark users combine it with Cassandra and Kafka. This talk focuses on streaming data scenarios that demonstrate how these three tools complement each other for building robust, scalable, and flexible data applications. Cassandra provides resilient and scalable storage, with flexible data format and query options. Kafka provides durable, scalable collection of streaming data with message-queue semantics. Spark provides very flexible analytics, everything from classic SQL queries to machine learning and graph algorithms, running in a streaming model based on "mini-batches", offline batch jobs, or interactive queries. We'll consider best practices and areas where improvements are needed.
101 mistakes FINN.no has made with Kafka (Baksida meetup) - Henning Spjelkavik
1. The document summarizes 101 mistakes that Finn.no has made with Kafka. It discusses various configuration mistakes and operational mistakes made when initially adopting and using Kafka, and the consequences of those mistakes.
2. Common mistakes included not considering backwards compatibility of Kafka versions, treating Kafka like a database, not properly defining schemas, and not understanding client-side rebalancing.
3. Finn.no's solutions to address the mistakes included running multiple Kafka clusters during a migration, using fewer Kafka partitions as a default, and adopting better configuration practices tailored to their use cases.
Air traffic controller - Streams Processing meetup - Ed Yakabosky
ATC is a system built using Samza to manage communications with LinkedIn members. It aims to improve the member experience by applying common functionality across different communication types and use cases. It handles thousands of communications per second while maintaining a good understanding of members' states in near-real-time. ATC focuses on sending the right message to the right member through the right channel at the right time using techniques like filtering, aggregation, channel selection and delivery optimization. It was built to be highly scalable using streaming technologies like Kafka, RocksDB and host affinity to replicate state across datacenters for redundancy. Personalization is achieved through relevance scores computed offline and stored in RocksDB.
Sadayuki Furuhashi created Kumofs, a distributed key-value store, and MessagePack, a cross-language object serialization library. Kumofs is optimized for low latency with zero-hop reads and no single points of failure. It scales out linearly as servers are added without impacting applications. MessagePack is a compact binary format like JSON used for cross-language communication. MessagePack-RPC is a cross-language messaging library that uses an asynchronous, pipelined protocol over an event-driven I/O model.
- Sadayuki Furuhashi is a computer science student at the University of Tsukuba who created Kumofs, a distributed key-value store, and MessagePack, a cross-language communication library.
- Kumofs is optimized for low-latency and scalability through a consistent hashing algorithm and dynamic rebalancing without impacting applications. It has no single point of failure.
- MessagePack is a compact binary serialization format like JSON used for fast cross-language communication. MessagePack-RPC builds on this to enable asynchronous, pipelined RPC between languages.
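Kumofs's own consistent-hashing code isn't shown in the summary; as an illustration of the general technique it describes, a minimal hash ring with virtual nodes might look like this (the node names and vnode count are assumptions):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        # Each physical node gets many positions on the ring, which
        # smooths out the key distribution across nodes.
        self.ring = sorted(
            (self._hash(f"{n}:{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        """Walk clockwise to the first vnode at or after the key's hash."""
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["server-a", "server-b", "server-c"])
owner = ring.node_for("user:42")  # stable while membership is stable
```

The payoff is rebalancing: when a node joins or leaves, only the keys adjacent to its ring positions move, rather than nearly all keys as with naive modulo hashing.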
Building Better Data Pipelines using Apache Airflow - Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs). It allows users to programmatically author DAGs in Python without needing to bundle many XML files. The UI provides a tree view to see DAG runs over time and Gantt charts to see performance trends. Airflow is useful for ETL pipelines, machine learning workflows, and general job scheduling. It handles task dependencies and failures, monitors performance, and enforces service level agreements. Behind the scenes, the scheduler distributes tasks from the metadata database to Celery workers via RabbitMQ.
Cloud Native Predictive Data Pipelines (micro talk) - Sid Anand
This document summarizes Sid Anand's microtalk on cloud native predictive data pipelines at a PayPal risk infrastructure all-hands meeting in January 2018. It discusses Agari's approach to detecting spear phishing emails in near real-time by intercepting emails and sending metadata to AWS cloud services for trust modeling and scoring, then returning signals to quarantine, label, or pass through emails. The architecture uses microservices, decoupled services, immutable services, polyglot persistence leveraging various AWS services, and focuses on building trust models for scoring emails in near real-time and batch processing.
Cloud Native Data Pipelines (GoTo Chicago 2017) - Sid Anand
The document discusses building data pipelines in a cloud native way using open source technologies and cloud native techniques. It describes a message scoring use case at Agari where data is ingested from multiple enterprises into S3 and then processed through a Spark job on EMR hourly. The results are written to S3 and trigger downstream processing. Design goals for resilient data pipelines include operability, correctness, timeliness and cost. Techniques discussed to achieve these goals include using Apache Airflow for workflow management, auto scaling groups, and leveraging serverless technologies where possible.
Cloud Native Data Pipelines (DataEngConf SF 2017) - Sid Anand
This document discusses cloud native data pipelines. It begins by introducing the speaker and their company, Agari, which applies trust models to email metadata to score messages. The document then discusses design goals for resilient data pipelines, including operability, correctness, timeliness and cost. It presents two use cases at Agari: batch message scoring and near real-time message scoring. For each use case, the pipeline architecture is shown including components like S3, SNS, SQS, ASGs, EMR and databases. The document discusses leveraging AWS services and tools like Airflow, Packer and Terraform to tackle issues like cost, timeliness, operability and correctness. It also introduces innovations like Apache Avro for
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo - Sid Anand
Slides from "Cloud Native Data Pipelines" talk given @ QCon Tokyo 2016. The slides are in both English and Japanese. Thanks to Kiro Harada (https://jp.linkedin.com/in/haradakiro) for the translation.
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016) - Sid Anand
This document discusses cloud native data pipelines. It begins by describing the speaker and their work experience. Then, it outlines some key qualities of resilient data pipelines like operability, correctness, timeliness and cost. Two use cases at the speaker's company for applying trust models to messages are presented - one using batch processing and the other using near real-time processing. The document discusses how tools like Apache Airflow, auto-scaling groups, Amazon Kinesis and Avro can help achieve those qualities for data pipelines in the cloud.
Introduction to Apache Airflow - Data Day Seattle 2016 - Sid Anand
Apache Airflow is a platform for authoring, scheduling, and monitoring workflows or directed acyclic graphs (DAGs) of tasks. It includes a DAG scheduler, web UI, and CLI. Airflow allows users to author DAGs in Python without needing to bundle many XML files. The UI provides tree and Gantt chart views to monitor DAG runs over time. Airflow was accepted into the Apache Incubator in 2016 and has over 300 users from 40+ companies. Agari uses Airflow to orchestrate message scoring pipelines across AWS services like S3, Spark, SQS, and databases to enforce SLAs on correctness and timeliness. Areas for further improvement include security, APIs, execution scaling, and on
Software Development and Architecture @ LinkedIn (QCon SF 2014) - Sid Anand
The document provides details about Sid Anand's career and then discusses LinkedIn's software development process and architecture when he was there. It notes that when Sid started at LinkedIn in 2011, compiling the code took a long time due to the large codebase and many dependencies. It then describes how LinkedIn scaled to support hundreds of millions of members and thousands of employees by splitting the monolithic codebase into individual Git repos, using intermediate JARs to reduce dependencies, and connecting development machines to test environments instead of deploying everything locally. It also discusses LinkedIn's use of Kafka, search federation, and not making web service calls between data centers to scale across multiple data centers.
Building a Modern Website for Scale (QCon NY 2013) - Sid Anand
LinkedIn uses several technologies to scale its services and infrastructure to support over 200 million members. It uses a dynamic discovery and client-side load balancing approach for its web services to improve fault tolerance. The presentation tier is composed of various front-end frameworks while business logic is encapsulated in services. LinkedIn's databases Espresso and Oracle are scaled using techniques like data replication, read replicas and change data capture via Databus. Databus provides a consistent, real-time stream of database changes to power services like search, recommendations and standardization. Messaging is handled using Apache Kafka which provides pub-sub streaming capabilities.
Maven is a build system that provides:
1) A standard directory structure for projects;
2) A standard build lifecycle of phases like compile, test, package; and
3) The ability to override defaults through plugins.
The document then discusses several key aspects of Maven including:
1) The standard Maven directory structure for source code, resources, and test code;
2) The default Maven lifecycle phases like compile, test, package; and
3) The pom.xml file which is Maven's build specification and configuration file.
Git is a version control system that stores snapshots of files rather than tracking changes between file versions. It allows for offline work and nearly all operations are performed locally. Files can exist in three states - committed, modified, or staged. Commits create snapshots of the staged files. Branches act as pointers to commits, with the default branch being master.
LinkedIn Data Infrastructure Slides (Version 2) - Sid Anand
Learn about Espresso, Databus, and Voldemort. LinkedIn Data Infrastructure Slides (Version 2). This talk was given in NYC on June 20, 2012
You can download the slides as PPT in order to see the transitions here :
http://bit.ly/LfH6Ru
LinkedIn Data Infrastructure (QCon London 2012) - Sid Anand
LinkedIn uses DataBus to replicate data changes from Oracle in real-time. DataBus consists of relay and bootstrap services that capture changes from Oracle and distribute them to various services like search indexes, graphs, and read replicas to keep them updated in real-time. This allows users to immediately see profile updates or new connections in search results and feeds.
Keeping Movies Running Amid Thunderstorms! - Sid Anand
How does Netflix strive to deliver an uninterrupted service? This talk, delivered for the first time in November, 2011, covers some engineering design concepts that help us deliver features at a rapid pace while assuring high availability.
OSCON Data 2011 -- NoSQL @ Netflix, Part 2 - Sid Anand
The document discusses translating concepts from relational databases to key-value stores. It covers normalizing data to avoid issues like data inconsistencies and loss. While key-value stores don't support relations, transactions, or SQL, the relationships can be composed in the application layer for smaller datasets. Picking the right data for key-value stores involves accessing data primarily by key lookups.
Intuit CTOF 2011 - Netflix for Mobile in the Cloud - Sid Anand
The document summarizes Netflix's mobile app development strategy and tips for the Apple iPhone:
1. Netflix develops mobile web pages that run in the iPhone's Web Kit to enable A/B testing, fast deployments, and leveraging code across devices like iPhone and Android.
2. The Netflix iPhone app loads key data like the user's rental history, movie lists, and ratings in parallel to provide a smooth user experience with minimal loading delays.
3. Netflix prefetches non-essential data like movie images, titles, and synopses on list pages to reduce load times when users view individual movie pages.
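The parallel-loading idea in point 2 can be sketched with `asyncio`; the call names and delays below are hypothetical stand-ins for the real API endpoints:

```python
import asyncio

async def fetch(name, delay):
    """Stand-in for one backend call (e.g. rental history, lists, ratings)."""
    await asyncio.sleep(delay)
    return name

async def load_home_screen():
    # Issue all requests concurrently rather than one after another,
    # so the total wait equals the slowest call, not the sum of all calls.
    return await asyncio.gather(
        fetch("rental_history", 0.02),
        fetch("movie_lists", 0.01),
        fetch("ratings", 0.03),
    )

results = asyncio.run(load_home_screen())
```

With three sequential calls the user would wait for the sum of the latencies; issued in parallel, the screen is ready after the slowest one.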
Netflix migrated to the cloud to avoid single points of failure and to focus on their core competencies. They chose Amazon Web Services and migrated non-sensitive data and applications to the cloud. Netflix picked SimpleDB and S3 as their data stores in the cloud. Migrating from an RDBMS required translating relational concepts like normalization to key-value stores and working around issues with SimpleDB like lack of data types and transactions.
Netflix's Transition to High-Availability Storage (QCon SF 2010) - Sid Anand
This talk focuses on Netflix's transition from Oracle to SimpleDB -- a cloud-hosted, key-value store -- during Netflix's transition to the cloud (i.e. AWS). Stay tuned for future talks as Netflix evaluates more technologies, e.g. Cassandra.
Understanding Inductive Bias in Machine Learning - SUTEJAS
This presentation explores the concept of inductive bias in machine learning. It explains how algorithms come with built-in assumptions and preferences that guide the learning process. You'll learn about the different types of inductive bias and how they can impact the performance and generalizability of machine learning models.
The presentation also covers the positive and negative aspects of inductive bias, along with strategies for mitigating potential drawbacks. We'll explore examples of how bias manifests in algorithms like neural networks and decision trees.
By understanding inductive bias, you can gain valuable insights into how machine learning models work and make informed decisions when building and deploying them.
Introduction: e-waste definition; sources of e-waste; hazardous substances in e-waste; effects of e-waste on environment and human health; need for e-waste management; e-waste handling rules; waste minimization techniques for managing e-waste; recycling of e-waste; disposal treatment methods of e-waste; mechanism of extraction of precious metal from leaching solution; global scenario of e-waste; e-waste in India; case studies.
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024 - Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
TIME DIVISION MULTIPLEXING TECHNIQUE FOR COMMUNICATION SYSTEM - HODECEDSIET
Time Division Multiplexing (TDM) is a method of transmitting multiple signals over a single communication channel by dividing the signal into many segments, each having a very short duration of time. These time slots are then allocated to different data streams, allowing multiple signals to share the same transmission medium efficiently. TDM is widely used in telecommunications and data communication systems.
### How TDM Works
1. **Time Slots Allocation**: The core principle of TDM is to assign distinct time slots to each signal. During each time slot, the respective signal is transmitted, and then the process repeats cyclically. For example, if there are four signals to be transmitted, the TDM cycle will divide time into four slots, each assigned to one signal.
2. **Synchronization**: Synchronization is crucial in TDM systems to ensure that the signals are correctly aligned with their respective time slots. Both the transmitter and receiver must be synchronized to avoid any overlap or loss of data. This synchronization is typically maintained by a clock signal that ensures time slots are accurately aligned.
3. **Frame Structure**: TDM data is organized into frames, where each frame consists of a set of time slots. Each frame is repeated at regular intervals, ensuring continuous transmission of data streams. The frame structure helps in managing the data streams and maintaining the synchronization between the transmitter and receiver.
4. **Multiplexer and Demultiplexer**: At the transmitting end, a multiplexer combines multiple input signals into a single composite signal by assigning each signal to a specific time slot. At the receiving end, a demultiplexer separates the composite signal back into individual signals based on their respective time slots.
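The multiplexer/demultiplexer round trip described in the steps above can be sketched in Python; one symbol per slot and a fixed slot order are simplifying assumptions of this sketch:

```python
def tdm_mux(streams, frames):
    """Interleave one symbol from each stream per frame (synchronous TDM)."""
    composite = []
    for t in range(frames):
        for s in streams:            # one fixed slot per stream, every frame
            composite.append(s[t])
    return composite

def tdm_demux(composite, n_streams):
    """Recover each stream by taking every n-th symbol from its slot offset."""
    return [composite[i::n_streams] for i in range(n_streams)]

a, b, c = list("AAAA"), list("BBBB"), list("CCCC")
line = tdm_mux([a, b, c], frames=4)   # slot order: A B C A B C ...
assert tdm_demux(line, 3) == [a, b, c]
```

Note that correct recovery depends entirely on the receiver knowing the frame boundaries and slot count, which is exactly the synchronization requirement described in step 2.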
### Types of TDM
1. **Synchronous TDM**: In synchronous TDM, time slots are pre-assigned to each signal, regardless of whether the signal has data to transmit or not. This can lead to inefficiencies if some time slots remain empty due to the absence of data.
2. **Asynchronous TDM (or Statistical TDM)**: Asynchronous TDM addresses the inefficiencies of synchronous TDM by allocating time slots dynamically based on the presence of data. Time slots are assigned only when there is data to transmit, which optimizes the use of the communication channel.
### Applications of TDM
- **Telecommunications**: TDM is extensively used in telecommunication systems, such as in T1 and E1 lines, where multiple telephone calls are transmitted over a single line by assigning each call to a specific time slot.
- **Digital Audio and Video Broadcasting**: TDM is used in broadcasting systems to transmit multiple audio or video streams over a single channel, ensuring efficient use of bandwidth.
- **Computer Networks**: TDM is used in network protocols and systems to manage the transmission of data from multiple sources over a single network medium.
### Advantages of TDM
- **Efficient Use of Bandwidth**: TDM all
International Conference on NLP, Artificial Intelligence, Machine Learning an... - gerogepatton
International Conference on NLP, Artificial Intelligence, Machine Learning and Applications (NLAIM 2024) offers a premier global platform for exchanging insights and findings in the theory, methodology, and applications of NLP, Artificial Intelligence, Machine Learning, and their applications. The conference seeks substantial contributions across all key domains of NLP, Artificial Intelligence, Machine Learning, and their practical applications, aiming to foster both theoretical advancements and real-world implementations. With a focus on facilitating collaboration between researchers and practitioners from academia and industry, the conference serves as a nexus for sharing the latest developments in the field.
Harnessing WebAssembly for Real-time Stateless Streaming Pipelines - Christina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Batteries: Introduction; types of batteries; discharging and charging of battery; characteristics of battery; battery rating; various tests on battery; primary battery: silver button cell; secondary battery: Ni-Cd battery; modern battery: lithium-ion battery; maintenance of batteries; choice of batteries for electric vehicle applications.
Fuel Cells: Introduction; importance and classification of fuel cells; description, principle, components, and applications of fuel cells: H2-O2 fuel cell, alkaline fuel cell, molten carbonate fuel cell, and direct methanol fuel cells.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
19. Why Are Streams Hard?
In streaming architectures, implementation gaps in non-functional requirements can be unforgiving.
You end up spending a lot of your time fighting fires & keeping systems up.
If you don’t build your systems with the -ilities as first-class citizens, you pay an operational tax.
… and this translates to unhappy customers and burnt-out team members!
21. Why Are Streams Hard?
Data Infrastructure is an iceberg.
Your customers may only see 10% of your effort — those that manifest in features.
The remaining 90% of your work goes unnoticed because it relates to keeping the lights on.
In this talk, we will build high-fidelity streams-as-a-service from the ground up!
24. Start Simple
Goal: Build a system that can deliver messages from source S to destination D.
But first, let’s decouple S and D by putting messaging infrastructure between them:
S → E (Events topic) → D
30. Start Simple
Make a few more implementation decisions about this system:
• Run our system on a cloud platform (e.g. AWS)
• Operate at low scale
• Kafka with a single partition
• Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas = 2)
• Run S & D on single, separate EC2 instances
S → E → D
31. Start Simple
To make things a bit more interesting, let’s provide our stream as a service.
We define our system boundary using a blue box as shown below!
[ S → E → D ]
35. Reliability
Goal: Build a system that can deliver messages reliably from S to D.
Concrete Goal: 0 message loss.
Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver.
[ S → E → D ]
42. Reliability
A → B → C (message m1 flows through each link)
Treat the messaging system like a chain — it’s only as strong as its weakest link.
In order to make this system reliable:
Insight: If each process/link is transactional in nature, the chain will be transactional!
Transactionality = At-least-once delivery
How do we make each link transactional?
47. Reliability
But, how do we handle edge nodes A & C?
A → B → C (message m1 flows through each link)
What does A need to do?
• Receive a Request (e.g. REST)
• Do some processing
• Reliably send data to Kafka
  • kProducer.send(topic, message)
  • kProducer.flush()
  • Producer Config: acks = all
• Send HTTP Response to caller
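The A-side flow above can be sketched as follows. The `FakeProducer` class is a stand-in for a real Kafka producer client (any client exposing send/flush with acks=all semantics), so the names here are illustrative rather than a specific library’s API; only the `acks = all` setting and the send → flush → respond ordering come from the slide.

```python
# Sketch of edge node A: ACK the HTTP caller only after Kafka has
# durably accepted the message (acks=all + flush before responding).

PRODUCER_CONFIG = {
    "acks": "all",  # wait for all in-sync replicas to persist the write
}

class FakeProducer:
    """Stand-in for a Kafka producer client (illustrative only)."""
    def __init__(self, config):
        self.config = config
        self.sent = []
        self.flushed = False

    def send(self, topic, message):
        self.sent.append((topic, message))

    def flush(self):
        # Blocks until all buffered sends are ACKed by the brokers.
        self.flushed = True

def handle_request(producer, topic, message):
    # 1. Receive a request, 2. do some processing, 3. reliably send,
    # 4. only then send the HTTP response to the caller.
    producer.send(topic, message)
    producer.flush()   # do not respond until Kafka has the data
    return 200         # HTTP response only after the durable write

producer = FakeProducer(PRODUCER_CONFIG)
status = handle_request(producer, "events", b"m1")
```

The key design point is that the flush sits between the send and the HTTP response: a crash before the flush means the caller never got an ACK, so the sender will retry.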
48. Reliability
But, how do we handle edge nodes A & C?
A → B → C (message m1 flows through each link)
What does C need to do?
• Read data (a batch) from Kafka
• Do some processing
• Reliably send data out
• ACK / NACK Kafka
  • Consumer Config: enable.auto.commit = false
  • ACK moves the read checkpoint forward
  • NACK forces a reread of the same data
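The C-side ACK/NACK discipline can be sketched similarly. `FakeConsumer` stands in for a real Kafka consumer client and its names are illustrative; the settings that come from the slide are `enable.auto.commit = false` and the rule that only a successful delivery commits (ACKs) the offset.

```python
# Sketch of edge node C: manual offset commits so a failed delivery
# forces a re-read (NACK) instead of silently dropping data.

CONSUMER_CONFIG = {
    "enable.auto.commit": False,  # we move the read checkpoint ourselves
}

class FakeConsumer:
    """Stand-in for a Kafka consumer client (illustrative only)."""
    def __init__(self, messages):
        self.messages = messages
        self.committed = 0  # the read checkpoint

    def poll_batch(self, size):
        return self.messages[self.committed : self.committed + size]

    def commit(self):
        # ACK: move the read checkpoint forward.
        self.committed += 1

def consume_one(consumer, deliver):
    batch = consumer.poll_batch(1)
    if not batch:
        return
    try:
        deliver(batch[0])
        consumer.commit()  # ACK on success
    except Exception:
        pass               # NACK: no commit, so the same data is re-read

consumer = FakeConsumer(["m1", "m2"])
consume_one(consumer, deliver=lambda m: None)   # succeeds -> ACK
consume_one(consumer, deliver=lambda m: 1 / 0)  # fails -> NACK, no commit
```

After these two calls the checkpoint has advanced past m1 only; m2 will be re-read on the next poll, which is exactly the at-least-once behavior the chain needs.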
51. Reliability
And what about middle node B?
A → B → C
B is a combination of A and C:
B needs to act like a reliable Kafka Producer.
B needs to act like a reliable Kafka Consumer.
55. Reliability
How reliable is our system now?
What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
If C crashes, we will stop delivering messages to external consumers!
A → B → C
57. Reliability
Solution: Place each service in an autoscaling group of size T.
Each group can now tolerate T-1 concurrent failures.
For now, we appear to have a pretty reliable data stream.
64. Lag : What is it?
Lag is simply a measure of message delay in a system.
The longer a message takes to transit a system, the greater its lag.
The greater the lag, the greater the impact to the business.
Hence, our goal is to minimize lag in order to deliver insights as quickly as possible.
66. Lag : How do we compute it?
eventTime: the creation time of an event message (T0).
Lag can be calculated for any message m1 at any node N in the system as:
lag(m1, N) = current_time(m1, N) - eventTime(m1)
A → B → C (m1 created at eventTime T0)
67. Lag : How do we compute it?
Arrival Lag (Lag-in): time message arrives - eventTime.
With eventTime T0, m1 arrives at A, B, C at times T1, T3, T5:
Lag-in @
A = T1 - T0 (e.g. 1 ms)
B = T3 - T0 (e.g. 5 ms)
C = T5 - T0 (e.g. 10 ms)
70. Lag : How do we compute it?
Departure Lag (Lag-out): time message leaves - eventTime.
With eventTime T0, m1 departs A, B, C at times T2, T4, T6:
Lag-out @
A = T2 - T0 (e.g. 3 ms)
B = T4 - T0 (e.g. 8 ms)
C = T6 - T0 (e.g. 12 ms)
E2E Lag is the total time a message spent in the system: E2E Lag = T6 - T0.
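The per-node lag computations above can be written out directly; the node names and millisecond values below are the ones from the worked example, with T0 taken as 0.

```python
# Arrival lag (Lag-in), departure lag (Lag-out), and E2E Lag for message
# m1, using the example's timestamps: eventTime T0 = 0 ms, arrivals
# T1/T3/T5 at nodes A/B/C, departures T2/T4/T6.

def lag(observed_time_ms, event_time_ms):
    # lag(m1, N) = current_time(m1, N) - eventTime(m1)
    return observed_time_ms - event_time_ms

event_time = 0                            # T0
arrivals   = {"A": 1, "B": 5, "C": 10}    # T1, T3, T5
departures = {"A": 3, "B": 8, "C": 12}    # T2, T4, T6

lag_in  = {n: lag(t, event_time) for n, t in arrivals.items()}
lag_out = {n: lag(t, event_time) for n, t in departures.items()}

# E2E Lag: total time spent in the system, T6 - T0.
e2e_lag = departures["C"] - event_time
```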
72. Lag : How do we compute it?
While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages.
Instead, we prefer statistics (e.g. P95) to capture population behavior.
74. Lag : How do we compute it?
Some useful Lag statistics are:
E2E Lag (p95): 95th percentile time of messages spent in the system
Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
Process_Duration(N, p95): Lag_out(N, p95) - Lag_in(N, p95)
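These statistics can be sketched over a population of lags. The simple nearest-rank percentile below is illustrative; production systems typically lean on their metrics library’s percentile or histogram support, and the lag values are made-up sample data.

```python
# P95 lag statistics over a population of messages.

def p95(values):
    # Nearest-rank percentile: good enough for a sketch.
    ordered = sorted(values)
    rank = max(0, int(len(ordered) * 0.95) - 1)
    return ordered[rank]

# E2E lags (ms) for 100 messages: most are fast, with a slow tail.
e2e_lags = [10] * 90 + [50] * 10

e2e_lag_p95 = p95(e2e_lags)  # lands in the slow tail

# Process_Duration(N, p95) = Lag_out(N, p95) - Lag_in(N, p95)
lag_out_p95, lag_in_p95 = 8, 5
process_duration_p95 = lag_out_p95 - lag_in_p95
```

Process_Duration isolates how much of the lag a single node contributes, which is what you watch when hunting for the slow stage in a pipe.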
80. Loss : What is it?
Loss is simply a measure of messages lost while transiting the system.
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality.
Hence, our goal is to minimize loss in order to deliver high quality insights.
82. Loss : How do we compute it?
Loss can be computed as the set difference of messages between any 2 points in the system.
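The set-difference computation is a one-liner; the message IDs below are illustrative sample data.

```python
# Loss as a set difference between two points in the system: messages
# seen at the first point but never seen at the second were lost.

seen_at_ingest = {"m1", "m2", "m3", "m4", "m5"}
seen_at_dest   = {"m1", "m2", "m4"}

lost = seen_at_ingest - seen_at_dest
loss_pct = 100.0 * len(lost) / len(seen_at_ingest)
```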
83. Loss : How do we compute it?
Tallying each message at four successive points in the pipe:

Message Id        Node 1  Node 2  Node 3  Node 4  E2E Loss  E2E Loss %
m1                   1       1       1       1
m2                   1       1       1       1
m3                   1       0       0       0
…                    …       …       …       …
m10                  1       1       0       0
Count               10       9       7       5
Per Node Loss(N)     0       1       2       2        5         50%
85. Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime.
86. Loss : How do we compute it?
For example, counting only the messages whose eventTime falls in the 12:34p bucket:

Message Id        Node 1  Node 2  Node 3  Node 4  E2E Loss  E2E Loss %
m1                   1       1       1       1
m2                   1       1       1       1
m3                   1       0       0       0
…                    …       …       …       …
m10                  1       1       0       0
Count               10       9       7       5
Per Node Loss(N)     0       1       2       2        5         50%
87. Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution
Allocate messages to 1-minute wide time buckets using message eventTime.
Wait a few minutes for messages to transit, then compute loss.
Raise alarms if loss occurs over a configured threshold (e.g. > 1%).
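The whole scheme — bucket, wait, count, alarm — can be sketched as below. The 1-minute bucket width and the >1% alarm threshold come from the slides; the 5-minute wait, the epoch-second timestamps, and the sample messages are illustrative assumptions.

```python
# Bucket messages into 1-minute windows by eventTime, wait for a bucket
# to age past the transit window, then compute loss and alarm on a
# threshold.

BUCKET_SECONDS = 60
WAIT_SECONDS = 300           # "wait a few minutes for messages to transit"
LOSS_ALARM_THRESHOLD = 0.01  # alarm if loss > 1%

def bucket_of(event_time):
    # Start of the 1-minute bucket this eventTime falls into.
    return event_time - (event_time % BUCKET_SECONDS)

def loss_for_bucket(ingested, delivered, bucket, now):
    if now - bucket < BUCKET_SECONDS + WAIT_SECONDS:
        return None  # bucket not yet safe to count; messages may still transit
    expected = {m for m, t in ingested if bucket_of(t) == bucket}
    arrived  = {m for m, t in delivered if bucket_of(t) == bucket}
    loss = len(expected - arrived) / len(expected)
    return loss, loss > LOSS_ALARM_THRESHOLD

# (message_id, eventTime) pairs; all fall in the bucket starting at t=0.
ingested  = [("m1", 5), ("m2", 20), ("m3", 59), ("m4", 42)]
delivered = [("m1", 5), ("m2", 20), ("m4", 42)]

result = loss_for_bucket(ingested, delivered, bucket=0, now=400)
```

Here m3 never arrived, so the closed bucket reports 25% loss and trips the alarm; asked too early, the function declines to count at all.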
88. Loss : How do we compute it?
We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system.
But wait…
90. Performance
Goal: Build a system that can deliver messages reliably from S to D with low latency.
S → … → D
To understand streaming system performance, let’s understand the components of E2E Lag.
93. Performance
Expel Time: Time to process and egest a message at D.
94. Performance
E2E Lag = Ingest Time + Transit Time + Expel Time
E2E Lag: Total time messages spend in the system, from message ingest to expel!
95. Performance
Transit Time: Rest of the time spent in the data pipe (i.e. internal nodes).
97. Performance : Penalties
In order to have stream reliability, we must sacrifice latency!
How can we handle our performance penalties?
98. Performance
Challenge 1 : Ingest Penalty
In the name of reliability, S needs to call kProducer.flush() on every inbound API request.
S also needs to wait for 3 ACKs from Kafka before sending its API response.
99. Performance
Challenge 1 : Ingest Penalty
Approach : Amortization
Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty.
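The amortization is easy to see numerically. The per-flush and per-message costs below are illustrative assumptions, not measurements; the point is only that the fixed flush/ACK cost is paid once per request rather than once per message.

```python
# Amortizing the ingest penalty: one flush + one replica-ACK wait per
# web request, regardless of how many messages the request carries.

FLUSH_OVERHEAD_MS = 5.0  # assumed cost of flush + waiting for 3 ACKs
PER_MESSAGE_MS = 0.05    # assumed cost to enqueue one message

def ingest_cost_per_message(batch_size):
    # Fixed flush cost is shared across every message in the batch.
    return (FLUSH_OVERHEAD_MS + PER_MESSAGE_MS * batch_size) / batch_size

single  = ingest_cost_per_message(1)    # full flush cost per message
batched = ingest_cost_per_message(100)  # flush cost amortized over 100
```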
100. Performance
Challenge 2 : Expel Penalty
Observations
Kafka is very fast — many orders of magnitude faster than HTTP RTTs.
The majority of the expel time is the HTTP RTT.
101. Performance
Challenge 2 : Expel Penalty
Approach : Amortization
In each D node, add batch + parallelism.
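Batch + parallelism at D can be sketched with a thread pool: read a batch from Kafka, then overlap the per-message HTTP round trips. `send_to_destination` is a stub standing in for a real HTTP call; the batch size and worker count are illustrative.

```python
# Expel amortization at D: overlap HTTP RTTs across a batch so the
# batch pays roughly one RTT per worker instead of one per message.

from concurrent.futures import ThreadPoolExecutor

def send_to_destination(message):
    # Stub for an HTTP POST to the remote receiver; returns a status code.
    return 200

def expel_batch(messages, parallelism=8):
    # Fan the batch out over a small worker pool; order is preserved
    # by pool.map, which matters if downstream expects it.
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        return list(pool.map(send_to_destination, messages))

statuses = expel_batch([f"m{i}" for i in range(20)])
```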
105. Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts.
We call these Recoverable Failures.
In contrast, we should never retry a message that has 0 chance of success!
We call these Non-Recoverable Failures.
E.g. Any 4xx HTTP response code, except for 429 (Too Many Requests).
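The classification rule above reduces to a small predicate. Treating all non-4xx failures (e.g. 5xx) as recoverable is an assumption consistent with, but not spelled out on, the slide.

```python
# Classify delivery failures by HTTP status: retry only failures that
# can succeed on a later attempt. Per the slide, any 4xx is
# non-recoverable except 429 (Too Many Requests).

def is_recoverable(status_code):
    if 200 <= status_code < 300:
        return False               # not a failure at all
    if 400 <= status_code < 500:
        return status_code == 429  # only 429 is worth retrying
    return True                    # 5xx etc. may succeed later

recoverable = [c for c in (400, 404, 429, 500, 503) if is_recoverable(c)]
```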
107. Performance
Challenge 3 : Retry Penalty
Approach
We pay a latency penalty on retry, so we need to be smart about:
What we retry — Don’t retry any non-recoverable failures
How we retry — One Idea : Tiered Retries
110. Performance - Tiered Retries
Local Retries
Try to send the message a configurable number of times @ D.
If we exhaust local retries, D transfers the message to a Global Retrier.
Global Retries
The Global Retrier then retries the message over a longer span of time.
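The tiered scheme can be sketched as below; `attempt` is a stub for the actual delivery call, and the global retrier is modeled as a simple queue, so those names are illustrative.

```python
# Tiered retries: a bounded number of local (in-process) attempts at D;
# on exhaustion, hand the message to a global retrier that keeps
# retrying over a longer span of time.

LOCAL_RETRY_LIMIT = 3  # configurable number of local attempts

def deliver_with_tiered_retries(message, attempt, global_retry_queue):
    for _ in range(LOCAL_RETRY_LIMIT):
        if attempt(message):
            return "delivered"
    # Local retries exhausted: defer to the global retrier rather than
    # blocking D's consume loop (which would inflate E2E Lag).
    global_retry_queue.append(message)
    return "deferred"

global_queue = []
outcomes = [
    deliver_with_tiered_retries("m1", lambda m: True, global_queue),
    deliver_with_tiered_retries("m2", lambda m: False, global_queue),
]
```

The design point: local retries keep latency low for transient blips, while the global tier absorbs long outages without stalling the main pipe.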
116. Scalability
First, let’s dispel a myth!
Each system is traffic-rated.
The traffic rating comes from running load tests.
There is no such thing as a system that can handle infinite scale.
We only achieve higher scale by iteratively running load tests & removing bottlenecks.
120. Scalability - Autoscaling
Autoscaling Goals (for data streams):
Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
What can autoscale? What can’t autoscale?
121. Scalability - Autoscaling EC2
The most important part of autoscaling is picking the right metric to trigger autoscaling actions.
124. Scalability - Autoscaling EC2
Pick a metric that:
• Preserves low latency
• Goes up as traffic increases
• Goes down as the microservice scales out
E.g. Average CPU
What to be wary of:
• Any locks/code synchronization & IO waits
• Otherwise, as traffic increases, CPU will plateau, auto-scale-out will stop, and latency (i.e. E2E Lag) will increase
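A minimal scale-out/scale-in decision on average CPU, the trigger metric named above, looks like this; the threshold values are illustrative assumptions, not recommendations.

```python
# Scale out when average CPU runs hot (to keep E2E Lag low), scale in
# when it runs cold (to minimize cost), hold otherwise.

SCALE_OUT_ABOVE = 0.60  # keep headroom so latency stays low
SCALE_IN_BELOW  = 0.30  # release instances to save cost

def autoscale_decision(cpu_samples):
    avg = sum(cpu_samples) / len(cpu_samples)
    if avg > SCALE_OUT_ABOVE:
        return "scale_out"
    if avg < SCALE_IN_BELOW:
        return "scale_in"
    return "hold"

decisions = [
    autoscale_decision([0.8, 0.7, 0.9]),   # hot: add instances
    autoscale_decision([0.1, 0.2, 0.2]),   # idle: remove instances
    autoscale_decision([0.4, 0.5, 0.45]),  # in band: do nothing
]
```

Note the caveat from the slide: if lock contention or IO waits cap CPU, this trigger plateaus while lag keeps climbing, which is why the metric must actually rise with traffic.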
125. What Next?
We now have a system with the Non-functional Requirements (NFRs) that we desire!
126. What Next?
What if we want to handle
• Different types of messages
• More complex processing (i.e. more processing stages)
• More complex stream topologies (e.g. 1-1, 1-many, many-many)
127. What Next?
It will take a lot of work to rebuild our data pipe for each variation of customers’ needs!
What we need to do is build a more generic Streams-as-a-Service (StaaS) platform!
128. Building StaaS
Firstly, let’s make our pipeline a bit more realistic by adding more processing stages:
Ingest → Normalize → Enrich → Route → Transform → Transmit
129. Building StaaS
And by handling more complex topologies (e.g. many-to-many):
Ingest (n1), Ingest (n2), Ingest (n3) → Normalize → Enrich → Route → Transform → Transmit (T1), Transmit (T2), Transmit (T3), Transmit (T4), Transmit (T5)
130. Building StaaS
This is our data plane — it sends messages from multiple sources to multiple destinations.
Ingest (n1, n2, n3) → Normalize → Enrich → Route → Transform → Transmit (T1 through T5)
131. Building StaaS
But, we also want to allow users the ability to define their own data pipes in this data plane.
132. Building StaaS
Hence, we need a management plane to capture the intent of the users: an Admin FE backed by an Admin BE.
142. Building StaaS
Control Plane (O, D, P, A)
• The observer (O) is the source of truth for system health
• It is aware of D, P, and A activity & may quiet alarms during certain actions
• It can collect and monitor more complex health metrics than lag and loss. For example, in ML pipelines, it can track scoring skew
• The system can also detect common causes of non-recoverable failures & alert customers
143. Building StaaS
Control Plane (O, D, P, A)
• The deployer (D) deploys new code to the data plane
• It will not, however, deploy if the system is unstable or autoscaling
• It can also automatically roll back if the system becomes unstable due to a deployment
145. Building StaaS
Control Plane (O, D, P, A)
• The provisioner (P) deploys customer data pipes to the system
• It can pause if the system is unstable or autoscaling
• The provisioner (P) can also control things like phased traffic ramp-ups for newly deployed pipelines
147. Conclusion
• We have built a Streams-as-a-Service system with many NFRs as first-class citizens
• While we’ve covered many key elements, a few areas will be covered in future talks (e.g. Isolation, Containerization, Caching)
• Should you have questions, join me for Q&A and follow for more (@r39132)
149. And thanks to the many people who help build these systems with me..
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Aastha Sinha
• Esther Kent
• Dheeraj Rampali
• Deepak Chandramouli
• Prasanna Krishna
• Sandhu Santhakumar
• Maneesh CM & team at Active
Lobby
• Shiju & the team at Xminds
• Bob Carlson
• Tony Gentille
• Josh Evans