Building & Operating High-Fidelity Data Streams (QCon Plus 2021)
Why Are Streams Hard?
In streaming architectures, any gaps in non-functional requirements can be unforgiving.
You end up spending a lot of your time fighting fires & keeping systems up.
If you don't build your systems with the -ilities as first class citizens, you pay an operational tax…
… and this translates to unhappy customers and burnt-out team members!
In this talk, we will focus on building high-fidelity streams from the ground up!
Start Simple
Goal: Build a system that can deliver messages from source S to destination D.
But first, let's decouple S and D by putting messaging infrastructure between them.
[Diagram: S → E (Events topic) → D]
Start Simple
Make a few more implementation decisions about this system:
• Run our system on a cloud platform (e.g. AWS)
• Operate at low scale
• Kafka with a single partition
• Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas = 2)
• Run S & D on single, separate EC2 instances
Start Simple
To make things a bit more interesting, let's provide our stream as a service.
We define our system boundary using a blue box, as shown below!
[Diagram: a service-boundary box enclosing S → E → D]
Reliability
(Is This System Reliable?)
Goal: Build a system that can deliver messages reliably from S to D.
Concrete Goal: 0 message loss.
Once S has ACKed a message to a remote sender, D must deliver that message to a remote receiver.
Reliability
How do we build reliability into our system? Let's first generalize our system!
[Diagram: a generalized chain A → B → C, with message m1 flowing in and out]
Reliability
In order to make this system reliable:
Treat the messaging system like a chain — it's only as strong as its weakest link.
Insight: If each process/link is transactional in nature, the chain will be transactional!
Transactionality = At least once delivery.
How do we make each link transactional?
Reliability
Let's first break this chain into its component processing links:
• A is an ingest node
• B is an internal node
• C is an expel node
[Diagram: A, B, and C each shown as a link receiving and emitting m1]
Reliability
But, how do we handle edge nodes A & C?
What does A need to do?
• Receive a Request (e.g. REST)
• Do some processing
• Reliably send data to Kafka
  • kProducer.send(topic, message)
  • kProducer.flush()
  • Producer Config: acks = all
• Send HTTP Response to caller
(A minimal sketch of this follows the list.)
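Here is a minimal sketch of how an ingest node like A might wire these steps together, assuming the Java Kafka client. The class name, topic name, and String serialization are illustrative assumptions, not the talk's actual implementation.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical ingest node (names and framing are illustrative).
public class IngestNode {
    private final KafkaProducer<String, String> kProducer;

    public IngestNode(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        // Reliability: wait for all in-sync replicas to acknowledge each write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        this.kProducer = new KafkaProducer<>(props);
    }

    // Called once per inbound API request; returns only after Kafka has ACKed.
    public int handleRequest(String message) {
        kProducer.send(new ProducerRecord<>("events", message));
        kProducer.flush(); // block until the broker acknowledges the write
        return 200;        // only now send the HTTP response to the caller
    }
}
```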
Reliability
But, how do we handle edge nodes A & C?
What does C need to do?
• Read data (a batch) from Kafka
• Do some processing
• Reliably send data out
• ACK / NACK Kafka
  • Consumer Config: enable.auto.commit = false
  • ACK moves the read checkpoint forward
  • NACK forces a reread of the same data
(A minimal sketch of this follows the list.)
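A minimal sketch of the expel loop at C, again assuming the Java Kafka client: committing offsets only after a successful send plays the role of the ACK, and skipping the commit plays the role of the NACK. The topic name and downstream send call are assumptions.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

// Hypothetical expel node (names are illustrative).
public class ExpelNode {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "expel-node");
        // Reliability: no auto-commit; we move the checkpoint only after success.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
                try {
                    for (ConsumerRecord<String, String> r : batch) {
                        sendDownstream(r.value()); // assumed delivery call; may throw
                    }
                    consumer.commitSync();          // ACK: advance the read checkpoint
                } catch (Exception e) {
                    // NACK: skip the commit; the checkpoint stays put, so the same
                    // data is re-read after a seek back or restart (at-least-once).
                }
            }
        }
    }

    static void sendDownstream(String message) { /* e.g. HTTP POST to the receiver */ }
}
```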
Reliability
And what about internal node B? B is a combination of A and C:
• B needs to act like a reliable Kafka Producer
• B needs to act like a reliable Kafka Consumer
Reliability
How reliable is our system now? What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
If C crashes, we will stop delivering messages to external consumers!
Reliability
Solution: Place each service in an autoscaling group of size T.
The system can now tolerate T-1 concurrent failures.
For now, we appear to have a pretty reliable data stream.
But how do we measure its reliability?
(This brings us to …)
Observability
(A story about Lag & Loss Metrics)
Lag : What is it?
Lag is simply a measure of message delay in a system.
The longer a message takes to transit a system, the greater its lag.
The greater the lag, the greater the impact to the business.
Hence, our goal is to minimize lag in order to deliver insights as quickly as possible.
Lag : How do we compute it?
eventTime: the creation time of an event message.
Lag can be calculated for any message m1 at any node N in the system as:
lag(m1, N) = current_time(m1, N) - eventTime(m1)
Worked example: m1 has eventTime T0 = 12:00p and arrives at A at T1 = 12:01p, at B at T3 = 12:04p, and at C at T5 = 12:10p. Then:
lag(m1, A) = T1 - T0 = 1m
lag(m1, B) = T3 - T0 = 4m
lag(m1, C) = T5 - T0 = 10m
Lag : How do we compute it?
Arrival Lag (Lag-in): time the message arrives - eventTime.
Lag-in @ A = T1 - T0 (e.g. 1 ms)
Lag-in @ B = T3 - T0 (e.g. 3 ms)
Lag-in @ C = T5 - T0 (e.g. 8 ms)
Observation: Lag is cumulative; the lag measured at a downstream node includes all the lag accrued at the nodes upstream of it.
Lag : How do we compute it?
Departure Lag (Lag-out): time the message leaves - eventTime.
Lag-out @ A = T2 - T0 (e.g. 2 ms)
Lag-out @ B = T4 - T0 (e.g. 4 ms)
Lag-out @ C = T6 - T0 (e.g. 10 ms)
The most important metric for Lag in any streaming system is E2E Lag: the total time a message spent in the system. This is simply Lag-out at the last node, C: E2E Lag = T6 - T0.
Lag : How do we compute it?
While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages.
Instead, we prefer statistics (e.g. P95) to capture population behavior.
Lag : How do we compute it?
Some useful Lag statistics are:
• E2E Lag (p95): 95th-percentile time messages spend in the system
• Lag_[in|out](N, p95): P95 Lag-in or Lag-out at any node N
• Process_Duration(N, p95): time spent at any node in the chain, i.e. Lag_out(N, p95) - Lag_in(N, p95)
Process_Duration graphs show you the contribution to overall Lag from each hop.
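As a rough illustration, here is one way a node could track these statistics. The class shape and the in-memory percentile math are assumptions made for the sketch; a production system would more likely emit these samples to a metrics library's histograms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical per-node lag tracker; stores raw samples only for clarity.
public class LagStats {
    private final List<Long> lagInMs = new ArrayList<>();
    private final List<Long> lagOutMs = new ArrayList<>();

    // Record lag-in when a message arrives and lag-out when it departs.
    public void onArrival(long eventTimeMs)   { lagInMs.add(System.currentTimeMillis() - eventTimeMs); }
    public void onDeparture(long eventTimeMs) { lagOutMs.add(System.currentTimeMillis() - eventTimeMs); }

    // Process_Duration(N, p95) = Lag_out(N, p95) - Lag_in(N, p95)
    public long processDurationP95() { return p95(lagOutMs) - p95(lagInMs); }

    static long p95(List<Long> samples) {
        List<Long> sorted = new ArrayList<>(samples);
        Collections.sort(sorted);
        int idx = (int) Math.ceil(0.95 * sorted.size()) - 1;
        return sorted.get(Math.max(idx, 0));
    }
}
```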
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system.
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality.
Hence, our goal is to minimize loss in order to deliver high-quality insights.
Loss : How do we compute it?
Loss can be computed as the set difference of messages between any 2 points in the system.
[Diagram: copies of each message observed at successive checkpoints in the pipeline]
Loss : How do we compute it?
For each message, we record whether it was seen at each of the four checkpoints in the pipeline (1 = seen, 0 = missing):

Message Id          C1   C2   C3   C4
m1                   1    1    1    1
m2                   1    1    1    1
m3                   1    0    0    0
…                    …    …    …    …
m10                  1    1    0    0
Count               10    9    7    5
Per-Node Loss(N)     0    1    2    2     E2E Loss = 5   E2E Loss % = 50%
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution: Allocate messages to 1-minute-wide time buckets using the message eventTime.
[Diagram: messages grouped into per-minute buckets by eventTime]
Loss : How do we compute it?
Each bucket (e.g. @12:34p) gets its own loss table like the one above.
As wall-clock time ("Now") advances past a bucket, the bucket moves through a simple lifecycle:
• Update late-arrival data
• Compute Loss
• Age Out
[Diagram: buckets @12:34p through @12:40p sliding past "Now" through these stages]
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution:
• Allocate messages to 1-minute-wide time buckets using the message eventTime
• Wait a few minutes for messages to transit, then compute loss (e.g. the 12:35p loss table)
• Raise alarms if loss exceeds a configured threshold (e.g. > 1%)
(A sketch of such a bucketed loss tracker follows.)
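A rough sketch of a bucketed loss counter under these rules. The bucket width and alarm threshold come from the slides; the two-checkpoint comparison (ingest vs. expel), the settle window, and the alarm hook are assumptions made for illustration.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical bucketed loss tracker: checkpoints report the ids they saw,
// and once a bucket has settled we compare the sets and alarm on loss.
public class LossTracker {
    private static final long BUCKET_MS = 60_000;             // 1-minute buckets
    private static final long SETTLE_MS = 5 * 60_000;         // wait for late arrivals
    private static final double LOSS_ALARM_THRESHOLD = 0.01;  // > 1% triggers an alarm

    // bucket start time -> ids seen at ingest (A) and at expel (C)
    private final Map<Long, Set<String>> seenAtIngest = new ConcurrentHashMap<>();
    private final Map<Long, Set<String>> seenAtExpel = new ConcurrentHashMap<>();

    private static long bucketOf(long eventTimeMs) { return eventTimeMs - (eventTimeMs % BUCKET_MS); }

    public void recordIngest(String id, long eventTimeMs) {
        seenAtIngest.computeIfAbsent(bucketOf(eventTimeMs), b -> ConcurrentHashMap.newKeySet()).add(id);
    }

    public void recordExpel(String id, long eventTimeMs) {
        seenAtExpel.computeIfAbsent(bucketOf(eventTimeMs), b -> ConcurrentHashMap.newKeySet()).add(id);
    }

    // Called periodically: compute loss for settled buckets, then age them out.
    public void sweep(long nowMs) {
        for (Long bucket : seenAtIngest.keySet()) {
            if (nowMs - bucket < BUCKET_MS + SETTLE_MS) continue; // still settling
            Set<String> in = seenAtIngest.remove(bucket);
            Set<String> out = seenAtExpel.remove(bucket);
            long lost = in.stream().filter(id -> out == null || !out.contains(id)).count();
            double lossRate = (double) lost / in.size();
            if (lossRate > LOSS_ALARM_THRESHOLD) {
                System.err.printf("ALARM: bucket %d lost %.1f%% of messages%n", bucket, 100 * lossRate);
            }
        }
    }
}
```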
We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system.
But wait…
Performance
(Have we tuned our system for performance yet?)
Performance
Goal: Build a system that can deliver messages reliably from S to D with low latency.
To understand streaming system performance, let's understand the components of E2E Lag.
Ingest Time: time from Last_Byte_In_of_Request to First_Byte_Out_of_Response.
• This time includes the overhead of reliably sending messages to Kafka.
Performance
Expel Time: time to process and egest a message at D.
Transit Time: the time in between, spent moving through the messaging system.
E2E Lag: total time messages spend in the system from message ingest to expel, i.e. Ingest Time + Transit Time + Expel Time.
Performance Penalties
(Trade-off: Latency for Reliability)
Challenge 1: Ingest Penalty
• In the name of reliability, S needs to call kProducer.flush() on every inbound API request
• S also needs to wait for 3 ACKs from Kafka before sending its API response
Performance
Challenge 1: Ingest Penalty
Approach: Amortization
• Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty, as sketched below
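A sketch of the amortized ingest path, reusing the hypothetical producer shape from earlier; the batch endpoint framing and topic name are illustrative assumptions, not the talk's API.

```java
import java.util.List;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical batch ingest: one flush() and one HTTP response amortized
// across many messages, instead of paying that cost per message.
public class BatchIngest {
    private final KafkaProducer<String, String> kProducer;

    public BatchIngest(KafkaProducer<String, String> kProducer) { this.kProducer = kProducer; }

    public int handleBatchRequest(List<String> messages) {
        for (String m : messages) {
            kProducer.send(new ProducerRecord<>("events", m)); // async, queued
        }
        kProducer.flush(); // single blocking wait for ACKs for the whole batch
        return 200;        // one response covers the entire batch
    }
}
```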
Performance
Challenge 2: Expel Penalty
Observations:
• Kafka is very fast — many orders of magnitude faster than HTTP RTTs
• The majority of the expel time is the HTTP RTT
Performance
Challenge 2: Expel Penalty
Approach: Amortization
• In each D node, add batch + parallelism, as sketched below
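One way to read "batch + parallelism" at D, sketched under assumptions: the batch is fanned out across a thread pool so that many HTTP round-trips overlap. The pool size and the downstream send call are invented for illustration.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical parallel expel: overlap many slow HTTP round-trips so the
// batch completes in roughly one RTT instead of batchSize RTTs.
public class ParallelExpel {
    private final ExecutorService pool = Executors.newFixedThreadPool(32);

    public void expelBatch(List<String> batch) {
        CompletableFuture<?>[] sends = batch.stream()
            .map(m -> CompletableFuture.runAsync(() -> sendDownstream(m), pool))
            .toArray(CompletableFuture[]::new);
        CompletableFuture.allOf(sends).join(); // ACK Kafka only after all succeed
    }

    void sendDownstream(String message) { /* e.g. HTTP POST to the receiver */ }
}
```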
Performance
Challenge 3: Retry Penalty (@ D)
Concepts:
• In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
  • We call these Recoverable Failures
• In contrast, we should never retry a message that has 0 chance of success!
  • We call these Non-Recoverable Failures
  • E.g. any 4xx HTTP response code, except for 429 (Too Many Requests)
(A small classifier capturing this rule follows the list.)
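A small classifier capturing that rule. Treating 429 as recoverable and all other 4xx as non-recoverable is the slide's example; the exact method shape is an assumption.

```java
// Decide whether a failed delivery attempt is worth retrying.
// Rule from the talk: never retry 4xx responses, except 429 (Too Many Requests).
public final class RetryPolicy {
    public static boolean isRecoverable(int httpStatus) {
        if (httpStatus == 429) return true;                      // throttled: back off and retry
        if (httpStatus >= 400 && httpStatus < 500) return false; // client error: will never succeed
        return true;                                             // 5xx, timeouts, etc.: retry
    }
}
```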
Performance
Challenge 3: Retry Penalty
Approach:
We pay a latency penalty on retry, so we need to be smart about:
• What we retry — don't retry any non-recoverable failures
• How we retry — one idea: Tiered Retries
Performance - Tiered Retries
Local Retries:
• Try to send the message a configurable number of times @ D
• If we exhaust local retries, D transfers the message to a Global Retrier
Global Retries:
• The Global Retrier then retries the message over a longer span of time
[Diagram: D hands failed messages to the Global Retrier via a Retry_Out (RO) topic and receives them back via a Retry_In (RI) topic]
(A sketch of the local tier follows.)
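A sketch of the local tier under assumptions: the attempt count, backoff, and retry-out topic name are invented for illustration, and it reuses the RetryPolicy classifier from above. The global tier would consume that topic and retry over a longer horizon.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Hypothetical local retry tier at D: retry a few times in-process, then
// hand the message to the global retrier via a Kafka topic ("Retry_Out").
public class LocalRetrier {
    private static final int MAX_LOCAL_ATTEMPTS = 3;
    private final KafkaProducer<String, String> retryOutProducer;

    public LocalRetrier(KafkaProducer<String, String> retryOutProducer) {
        this.retryOutProducer = retryOutProducer;
    }

    public void deliver(String message) throws InterruptedException {
        for (int attempt = 1; attempt <= MAX_LOCAL_ATTEMPTS; attempt++) {
            int status = sendDownstream(message);
            if (status < 400) return;                        // delivered
            if (!RetryPolicy.isRecoverable(status)) return;  // never retry these
            Thread.sleep(100L * attempt);                    // simple backoff
        }
        // Local retries exhausted: hand off to the global retrier, which
        // retries over a longer span of time without blocking this node.
        retryOutProducer.send(new ProducerRecord<>("retry-out", message));
    }

    int sendDownstream(String message) { return 200; /* e.g. HTTP POST; returns status */ }
}
```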
Performance
At this point, we have a system that works well at low scale.
Scalability
(But how does this system scale with increasing traffic?)
First, let's dispel a myth!
• Each system is traffic-rated
• The traffic rating comes from running load tests
• There is no such thing as a system that can handle infinite scale
We only achieve higher scale by iteratively running load tests & removing bottlenecks.
Scalability - Autoscaling
Autoscaling Goals (for data streams):
• Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
• Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations: What can autoscale? What can't autoscale?
Scalability - Autoscaling EC2
The most important part of autoscaling is picking the right metric to trigger autoscaling actions.
Pick a metric that:
• Preserves low latency
• Goes up as traffic increases
• Goes down as the microservice scales out
E.g. Average CPU
What to be wary of:
• Any locks/code synchronization & IO waits
• Otherwise … as traffic increases, CPU will plateau, auto-scale-out will stop, and latency (i.e. E2E Lag) will increase
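As a toy illustration of target tracking on such a metric, the proportional rule below computes a desired instance count from average CPU; in practice a cloud autoscaler (e.g. an AWS target-tracking policy) applies essentially this rule for you. The target value and rounding here are assumptions.

```java
// Toy target-tracking calculation: size the group so that average CPU
// lands near the target.
public final class TargetTracking {
    public static int desiredCapacity(int currentCapacity, double avgCpuPercent, double targetCpuPercent) {
        double desired = currentCapacity * (avgCpuPercent / targetCpuPercent);
        return Math.max(1, (int) Math.ceil(desired)); // scale out fast, never below 1
    }

    public static void main(String[] args) {
        // 10 instances running at 80% CPU against a 50% target -> 16 instances.
        System.out.println(desiredCapacity(10, 80.0, 50.0));
    }
}
```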
Conclusion
• We now have a system with the Non-functional Requirements (NFRs) that we desire!
• While we've covered many key elements, a few areas will be covered in future talks (e.g. Isolation, Containerization, Caching).
• These will be covered in upcoming blogs! Follow @r39132 for updates.
More Related Content

What's hot

Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingJack Gudenkauf
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaDataWorks Summit
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planningJamieAlquiza
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNblueboxtraveler
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...confluent
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsMonal Daxini
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzaAbhishek Shivanna
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasMonal Daxini
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsLightbend
 
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019confluent
 
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Kai Wähner
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Journeys from Kafka to Parquet
Journeys from Kafka to ParquetJourneys from Kafka to Parquet
Journeys from Kafka to ParquetDataWorks Summit
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformLegacy Typesafe (now Lightbend)
 
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...StreamNative
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming AnalyticsGuido Schmutz
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Monal Daxini
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingGwen (Chen) Shapira
 

What's hot (20)

Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Event Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache KafkaEvent Detection Pipelines with Apache Kafka
Event Detection Pipelines with Apache Kafka
 
Kafka for DBAs
Kafka for DBAsKafka for DBAs
Kafka for DBAs
 
Thoughts on kafka capacity planning
Thoughts on kafka capacity planningThoughts on kafka capacity planning
Thoughts on kafka capacity planning
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
Using Machine Learning to Understand Kafka Runtime Behavior (Shivanath Babu, ...
 
The Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data ProblemsThe Netflix Way to deal with Big Data Problems
The Netflix Way to deal with Big Data Problems
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samza
 
Flink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paasFlink forward-2017-netflix keystones-paas
Flink forward-2017-netflix keystones-paas
 
Operationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML ModelsOperationalizing Machine Learning: Serving ML Models
Operationalizing Machine Learning: Serving ML Models
 
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
Help, my Kafka is broken! (Emma Humber, IBM) Kafka Summit SF 2019
 
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
Event-Driven Stream Processing and Model Deployment with Apache Kafka, Kafka ...
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Journeys from Kafka to Parquet
Journeys from Kafka to ParquetJourneys from Kafka to Parquet
Journeys from Kafka to Parquet
 
Spark+flume seattle
Spark+flume seattleSpark+flume seattle
Spark+flume seattle
 
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive PlatformAkka 2.4 plus new commercial features in Typesafe Reactive Platform
Akka 2.4 plus new commercial features in Typesafe Reactive Platform
 
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
How Pulsar Enables Netdata to Offer Unlimited Infrastructure Monitoring for F...
 
Introduction to Streaming Analytics
Introduction to Streaming AnalyticsIntroduction to Streaming Analytics
Introduction to Streaming Analytics
 
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015Netflix Keystone Pipeline at Samza Meetup 10-13-2015
Netflix Keystone Pipeline at Samza Meetup 10-13-2015
 
Data Architectures for Robust Decision Making
Data Architectures for Robust Decision MakingData Architectures for Robust Decision Making
Data Architectures for Robust Decision Making
 

Similar to Building & Operating High-Fidelity Data Streams - QCon Plus 2021

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Sid Anand
 
DataLinkControl.ppt
DataLinkControl.pptDataLinkControl.ppt
DataLinkControl.pptMaddalaSeshu
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianDataStax Academy
 
Synapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineSynapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineCalvin French-Owen
 
Full Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsFull Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsCassandra Austin
 
Congestion control Assignment Help
Congestion control Assignment HelpCongestion control Assignment Help
Congestion control Assignment HelpJosephErin
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Fwdays
 
Cloud Native & Service Mesh
Cloud Native & Service MeshCloud Native & Service Mesh
Cloud Native & Service MeshRoi Ezra
 
Computer network (13)
Computer network (13)Computer network (13)
Computer network (13)NYversity
 
Unit 2 data link control
Unit 2 data link controlUnit 2 data link control
Unit 2 data link controlVishal kakade
 
chapter 3.2 TCP.pptx
chapter 3.2 TCP.pptxchapter 3.2 TCP.pptx
chapter 3.2 TCP.pptxTekle12
 
FEC & File Multicast
FEC & File MulticastFEC & File Multicast
FEC & File MulticastYoss Cohen
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesStéphane Maldini
 
Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Randall Hunt
 
How to move a mission critical system to 4 AWS regions in one year?
How to move a mission critical system to 4 AWS regions in one year?How to move a mission critical system to 4 AWS regions in one year?
How to move a mission critical system to 4 AWS regions in one year?Wojciech Gawroński
 
Lessons learned migrating 100+ services to Kubernetes
Lessons learned migrating 100+ services to KubernetesLessons learned migrating 100+ services to Kubernetes
Lessons learned migrating 100+ services to KubernetesJose Galarza
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldKonrad Malawski
 

Similar to Building & Operating High-Fidelity Data Streams - QCon Plus 2021 (20)

Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)Building High Fidelity Data Streams (QCon London 2023)
Building High Fidelity Data Streams (QCon London 2023)
 
DataLinkControl.ppt
DataLinkControl.pptDataLinkControl.ppt
DataLinkControl.ppt
 
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl YeksigianC* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
C* Summit 2013: Time is Money Jake Luciani and Carl Yeksigian
 
Synapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipelineSynapse 2018 Guarding against failure in a hundred step pipeline
Synapse 2018 Guarding against failure in a hundred step pipeline
 
Full Consistency Lag and its Applications
Full Consistency Lag and its ApplicationsFull Consistency Lag and its Applications
Full Consistency Lag and its Applications
 
Congestion control Assignment Help
Congestion control Assignment HelpCongestion control Assignment Help
Congestion control Assignment Help
 
Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...Anton Moldovan "Building an efficient replication system for thousands of ter...
Anton Moldovan "Building an efficient replication system for thousands of ter...
 
Cloud Native & Service Mesh
Cloud Native & Service MeshCloud Native & Service Mesh
Cloud Native & Service Mesh
 
Computer network (13)
Computer network (13)Computer network (13)
Computer network (13)
 
Computer Networking Assignment Help
Computer Networking Assignment HelpComputer Networking Assignment Help
Computer Networking Assignment Help
 
Unit 2 data link control
Unit 2 data link controlUnit 2 data link control
Unit 2 data link control
 
chapter 3.2 TCP.pptx
chapter 3.2 TCP.pptxchapter 3.2 TCP.pptx
chapter 3.2 TCP.pptx
 
FEC & File Multicast
FEC & File MulticastFEC & File Multicast
FEC & File Multicast
 
Reactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServicesReactor, Reactive streams and MicroServices
Reactor, Reactive streams and MicroServices
 
Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017Deep Dive: AWS X-Ray London Summit 2017
Deep Dive: AWS X-Ray London Summit 2017
 
How to move a mission critical system to 4 AWS regions in one year?
How to move a mission critical system to 4 AWS regions in one year?How to move a mission critical system to 4 AWS regions in one year?
How to move a mission critical system to 4 AWS regions in one year?
 
Lessons learned migrating 100+ services to Kubernetes
Lessons learned migrating 100+ services to KubernetesLessons learned migrating 100+ services to Kubernetes
Lessons learned migrating 100+ services to Kubernetes
 
Computer Networking Assignment Help
Computer Networking Assignment HelpComputer Networking Assignment Help
Computer Networking Assignment Help
 
The Need for Async @ ScalaWorld
The Need for Async @ ScalaWorldThe Need for Async @ ScalaWorld
The Need for Async @ ScalaWorld
 
Unit 2
Unit 2Unit 2
Unit 2
 

More from Sid Anand

Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionSid Anand
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Sid Anand
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowSid Anand
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Sid Anand
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Sid Anand
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Sid Anand
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon TokyoSid Anand
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Sid Anand
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Sid Anand
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari Sid Anand
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Sid Anand
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Sid Anand
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Sid Anand
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)Sid Anand
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Sid Anand
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with MavenSid Anand
 
Learning git
Learning gitLearning git
Learning gitSid Anand
 
LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)Sid Anand
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)Sid Anand
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Sid Anand
 

More from Sid Anand (20)

Low Latency Fraud Detection & Prevention
Low Latency Fraud Detection & PreventionLow Latency Fraud Detection & Prevention
Low Latency Fraud Detection & Prevention
 
Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)Big Data, Fast Data @ PayPal (YOW 2018)
Big Data, Fast Data @ PayPal (YOW 2018)
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)Cloud Native Predictive Data Pipelines (micro talk)
Cloud Native Predictive Data Pipelines (micro talk)
 
Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)Cloud Native Data Pipelines (GoTo Chicago 2017)
Cloud Native Data Pipelines (GoTo Chicago 2017)
 
Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)Cloud Native Data Pipelines (DataEngConf SF 2017)
Cloud Native Data Pipelines (DataEngConf SF 2017)
 
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese)  - QCon TokyoCloud Native Data Pipelines (in Eng & Japanese)  - QCon Tokyo
Cloud Native Data Pipelines (in Eng & Japanese) - QCon Tokyo
 
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
Cloud Native Data Pipelines (QCon Shanghai & Tokyo 2016)
 
Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016Introduction to Apache Airflow - Data Day Seattle 2016
Introduction to Apache Airflow - Data Day Seattle 2016
 
Airflow @ Agari
Airflow @ Agari Airflow @ Agari
Airflow @ Agari
 
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)Resilient Predictive Data Pipelines (GOTO Chicago 2016)
Resilient Predictive Data Pipelines (GOTO Chicago 2016)
 
Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)Resilient Predictive Data Pipelines (QCon London 2016)
Resilient Predictive Data Pipelines (QCon London 2016)
 
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)Software Developer and Architecture @ LinkedIn (QCon SF 2014)
Software Developer and Architecture @ LinkedIn (QCon SF 2014)
 
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
LinkedIn's Segmentation & Targeting Platform (Hadoop Summit 2013)
 
Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)Building a Modern Website for Scale (QCon NY 2013)
Building a Modern Website for Scale (QCon NY 2013)
 
Hands On with Maven
Hands On with MavenHands On with Maven
Hands On with Maven
 
Learning git
Learning gitLearning git
Learning git
 
LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)LinkedIn Data Infrastructure Slides (Version 2)
LinkedIn Data Infrastructure Slides (Version 2)
 
LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)LinkedIn Data Infrastructure (QCon London 2012)
LinkedIn Data Infrastructure (QCon London 2012)
 
Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1Linked in nosql_atnetflix_2012_v1
Linked in nosql_atnetflix_2012_v1
 

Recently uploaded

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Natan Silnitsky
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfMarharyta Nedzelska
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Mater
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noidabntitsolutionsrishis
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Velvetech LLC
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Hr365.us smith
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...stazi3110
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 

Recently uploaded (20)

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
Taming Distributed Systems: Key Insights from Wix's Large-Scale Experience - ...
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)Ahmed Motair CV April 2024 (Senior SW Developer)
Ahmed Motair CV April 2024 (Senior SW Developer)
 
Advantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your BusinessAdvantages of Odoo ERP 17 for Your Business
Advantages of Odoo ERP 17 for Your Business
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in NoidaBuds n Tech IT Solutions: Top-Notch Web Services in Noida
Buds n Tech IT Solutions: Top-Notch Web Services in Noida
 
Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...Software Project Health Check: Best Practices and Techniques for Your Product...
Software Project Health Check: Best Practices and Techniques for Your Product...
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)Recruitment Management Software Benefits (Infographic)
Recruitment Management Software Benefits (Infographic)
 
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
Building a General PDE Solving Framework with Symbolic-Numeric Scientific Mac...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 

Building & Operating High-Fidelity Data Streams - QCon Plus 2021

  • 2.
  • 3.
  • 4.
  • 5.
  • 6.
  • 7.
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13.
  • 14.
  • 15.
  • 17. Why Are Streams Hard? In streaming architectures, any gaps in non-functional requirements can be unforgiving You end up spending a lot of your time fi ghting fi res & keeping systems up If you don’t build your systems with the -ilities as fi rst class citizens, you pay an operational tax
  • 18. Why Are Streams Hard? In streaming architectures, any gaps in non-functional requirements can be unforgiving You end up spending a lot of your time fi ghting fi res & keeping systems up If you don’t build your systems with the -ilities as fi rst class citizens, you pay an operational tax … and this translates to unhappy customers and burnt-out team members!
  • 19. Why Are Streams Hard? In streaming architectures, any gaps in non-functional requirements can be unforgiving You end up spending a lot of your time fi ghting fi res & keeping systems up If you don’t build your systems with the -ilities as fi rst class citizens, you pay an operational tax … and this translates to unhappy customers and burnt-out team members! In this talk, we will focus on building high- fi delity streams from the ground up!
  • 21. Start Simple Goal : Build a system that can deliver messages from source S to destination D S D
  • 22. Start Simple Goal : Build a system that can deliver messages from source S to destination D S D But fi rst, let’s decouple S and D by putting messaging infrastructure between them E S D Events topic
  • 23. Start Simple Make a few more implementation decisions about this system E S D
  • 24. Start Simple Make a few more implementation decisions about this system E S D Run our system on a cloud platform (e.g. AWS)
  • 25. Start Simple Make a few more implementation decisions about this system Operate at low scale E S D Run our system on a cloud platform (e.g. AWS)
  • 26. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition E S D Run our system on a cloud platform (e.g. AWS)
  • 27. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) E S D Run our system on a cloud platform (e.g. AWS)
  • 28. Start Simple Make a few more implementation decisions about this system Operate at low scale Kafka with a single partition Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas =2) Run S & D on single, separate EC2 Instances E S D Run our system on a cloud platform (e.g. AWS)
  • 29. Start Simple T o make things a bit more interesting, let’s provide our stream as a service We de fi ne our system boundary using a blue box as shown below! È S D
  • 31. Reliability Goal : Build a system that can deliver messages reliably from S to D È S D
  • 32. Reliability Goal : Build a system that can deliver messages reliably from S to D È S D Concrete Goal : 0 message loss
  • 33. Reliability Goal : Build a system that can deliver messages reliably from S to D È S D Concrete Goal : 0 message loss Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver
  • 34. Reliability How do we build reliability into our system? È S D
• 36. Reliability In order to make this system reliable [diagram: chain A → B → C carrying message m1]
• 37. Reliability In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link
• 38. Reliability In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional!
• 39. Reliability In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = At least once delivery
• 40. Reliability In order to make this system reliable Treat the messaging system like a chain — it’s only as strong as its weakest link Insight : If each process/link is transactional in nature, the chain will be transactional! Transactionality = At least once delivery How do we make each link transactional?
• 41. Reliability Let’s first break this chain into its component processing links [diagram: separate links A, B, C, each receiving and emitting m1]
• 42. Reliability Let’s first break this chain into its component processing links A is an ingest node
• 43. Reliability Let’s first break this chain into its component processing links B is an internal node
• 44. Reliability Let’s first break this chain into its component processing links C is an expel node
• 45. Reliability But, how do we handle edge nodes A & C? What does A need to do? (see the list and the producer sketch below)
• Receive a Request (e.g. REST) • Do some processing • Reliably send data to Kafka • kProducer.send(topic, message) • kProducer.flush() • Producer Config • acks = all • Send HTTP Response to caller
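A minimal sketch of A’s ingest path using the Kafka Java client, following the bullets above; the class name, the topic name "events", and the ingest() signature are illustrative assumptions, not from the talk:

```java
import java.util.Properties;
import java.util.concurrent.Future;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

public class IngestNode {
    private static final String EVENTS_TOPIC = "events"; // assumed topic name
    private final KafkaProducer<String, String> producer;

    public IngestNode(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for all in-sync replicas
        this.producer = new KafkaProducer<>(props);
    }

    /** Called per request: returns only once the message is durably in Kafka,
        after which the caller sends the HTTP response. */
    public void ingest(String key, String message) throws Exception {
        Future<RecordMetadata> ack =
            producer.send(new ProducerRecord<>(EVENTS_TOPIC, key, message));
        producer.flush(); // block until in-flight sends complete
        ack.get();        // surface any send failure before we ACK the caller
    }
}
```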
• 46. Reliability But, how do we handle edge nodes A & C? What does C need to do? (see the list and the consumer sketch below)
• Read data (a batch) from Kafka • Do some processing • Reliably send data out • ACK / NACK Kafka • Consumer Config • enable.auto.commit = false • ACK moves the read checkpoint forward • NACK forces a reread of the same data
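A corresponding sketch of C’s read-process-ACK/NACK loop; the group and topic names are assumptions and sendDownstream() is a hypothetical egest call. On failure we skip the commit and rewind, so the same batch is re-read:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ExpelNode {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "expel-nodes");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // manual checkpoints

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events"));
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                try {
                    for (ConsumerRecord<String, String> record : batch) {
                        sendDownstream(record.value());
                    }
                    consumer.commitSync(); // ACK: the read checkpoint moves forward
                } catch (Exception e) {
                    // NACK: rewind each partition to the start of this batch for a reread
                    for (TopicPartition tp : batch.partitions()) {
                        consumer.seek(tp, batch.records(tp).get(0).offset());
                    }
                }
            }
        }
    }

    /** Hypothetical egest call, e.g. an HTTP POST to the remote receiver. */
    static void sendDownstream(String message) { /* ... */ }
}
```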
• 47. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C
• 48. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C B needs to act like a reliable Kafka Producer
• 49. Reliability But, how do we handle edge nodes A & C? B is a combination of A and C B needs to act like a reliable Kafka Producer B needs to act like a reliable Kafka Consumer
• 50. Reliability How reliable is our system now? [diagram: A → B → C]
• 51. Reliability How reliable is our system now? What happens if a process crashes?
• 52. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion!
• 53. Reliability How reliable is our system now? What happens if a process crashes? If A crashes, we will have a complete outage at ingestion! If C crashes, we will stop delivering messages to external consumers!
• 54. Reliability Solution : Place each service in an autoscaling group of size T, so that the system tolerates T-1 concurrent failures
• 55. Reliability Solution : Place each service in an autoscaling group of size T, so that the system tolerates T-1 concurrent failures For now, we appear to have a pretty reliable data stream
  • 56. But how do we measure its reliability?
• 57. (This brings us to …) Observability (A story about Lag & Loss Metrics)
  • 58. Lag : What is it? Lag is simply a measure of message delay in a system
  • 59. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag
  • 60. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business
  • 61. Lag : What is it? Lag is simply a measure of message delay in a system The longer a message takes to transit a system, the greater its lag The greater the lag, the greater the impact to the business Hence, our goal is to minimize lag in order to deliver insights as quickly as possible
  • 62. Lag : How do we compute it?
• 63. Lag : How do we compute it? eventTime : the creation time of an event message Lag can be calculated for any message m1 at any node N in the system as lag(m1, N) = current_time(m1, N) - eventTime(m1) [diagram: A → B → C carrying m1; eventTime = T0]
• 64. Lag : How do we compute it? [diagram: m1 created at eventTime T0 = 12:00p]
• 65. Lag : How do we compute it? [m1 reaches A at T1 = 12:01p]
• 66. Lag : How do we compute it? [m1 reaches B at T3 = 12:04p]
• 67. Lag : How do we compute it? [m1 reaches C at T5 = 12:10p]
• 68. Lag : How do we compute it? lag(m1, A) = T1 - T0 = 1m; lag(m1, B) = T3 - T0 = 4m; lag(m1, C) = T5 - T0 = 10m (see the sketch below)
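As a tiny illustration, the lag computation above is just a subtraction against the stamped eventTime; epoch-millis timestamps are an assumption:

```java
public final class Lag {
    /** lag(m, N) = current_time(m, N) - eventTime(m), at whichever node calls this. */
    public static long lagMillis(long eventTimeMillis) {
        return System.currentTimeMillis() - eventTimeMillis;
    }

    public static void main(String[] args) {
        long t0 = System.currentTimeMillis() - 600_000; // event created 10 minutes ago
        System.out.println(lagMillis(t0) / 60_000 + "m"); // ~10m, like lag(m1, C) above
    }
}
```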
• 69. Lag : How do we compute it? Arrival Lag (Lag-in): time message arrives - eventTime Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms)
• 70. Lag : How do we compute it? Arrival Lag (Lag-in): time message arrives - eventTime Lag-in @ A = T1 - T0 (e.g. 1 ms), B = T3 - T0 (e.g. 3 ms), C = T5 - T0 (e.g. 8 ms) Observation: Lag is Cumulative
• 71. Lag : How do we compute it? Departure Lag (Lag-out): time message leaves - eventTime Lag-out @ A = T2 - T0 (e.g. 2 ms), B = T4 - T0 (e.g. 4 ms), C = T6 - T0 (e.g. 10 ms)
• 72. Lag : How do we compute it? The most important metric for Lag in any streaming system is E2E Lag: the total time a message spent in the system
• 73. Lag : How do we compute it? E2E Lag = Lag-out @ C = T6 - T0
  • 74. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages
  • 75. Lag : How do we compute it? While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages Instead, we prefer statistics (e.g. P95) to capture population behavior
  • 76. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95): P95 Lag_in or Lag_out at any Node N
• 77. Lag : How do we compute it? Some useful Lag statistics are: E2E Lag (p95) : 95th percentile time of messages spent in the system Lag_[in|out](N, p95) : P95 Lag_in or Lag_out at any Node N Process_Duration(N, p95) : Time spent at any node in the chain! = Lag_out(N, p95) - Lag_in(N, p95) (see the sketch below)
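A sketch of these statistics over raw per-message lag samples; in practice a metrics library with histograms would replace the in-memory sort:

```java
import java.util.Collections;
import java.util.List;

public final class LagStats {
    /** P95 over a non-empty list of lag samples (sorts the list in place). */
    public static long p95(List<Long> lagsMillis) {
        Collections.sort(lagsMillis);
        return lagsMillis.get((int) Math.ceil(0.95 * lagsMillis.size()) - 1);
    }

    /** Process_Duration(N, p95) = Lag_out(N, p95) - Lag_in(N, p95). */
    public static long processDuration(long lagOutP95, long lagInP95) {
        return lagOutP95 - lagInP95;
    }
}
```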
• 78. Lag : How do we compute it? Process_Duration graphs show you the contribution to overall Lag from each hop [diagram: per-hop Process_Duration chart]
  • 79. Loss : What is it?
  • 80. Loss : What is it? Loss is simply a measure of messages lost while transiting the system
  • 81. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate!
  • 82. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality
  • 83. Loss : What is it? Loss is simply a measure of messages lost while transiting the system Messages can be lost for various reasons, most of which we can mitigate! The greater the loss, the lower the data quality Hence, our goal is to minimize loss in order to deliver high quality insights
  • 84. Loss : How do we compute it?
• 85. Loss : How do we compute it? Concepts : Loss Loss can be computed as the set difference of messages between any 2 points in the system [diagram: copies of each message observed at 4 checkpoints in the pipeline]
• 86. Loss : How do we compute it? [reconstructed table: each message is marked 1 (seen) or 0 (lost) at the pipeline’s 4 checkpoints]
Message Id | Hop 1 | Hop 2 | Hop 3 | Hop 4
m1         |   1   |   1   |   1   |   1
m2         |   1   |   1   |   1   |   1
m3         |   1   |   0   |   0   |   0
…          |   …   |   …   |   …   |   …
m10        |   1   |   1   |   0   |   0
Count      |  10   |   9   |   7   |   5
Loss(N)    |   0   |   1   |   2   |   2
E2E Loss = 5, E2E Loss % = 50%
• 87. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? [diagram: an unending stream of messages m1 … m11 … m21 … m31]
• 88. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution Allocate messages to 1-minute wide time buckets using message eventTime [diagram: the same stream, grouped into eventTime buckets]
• 89. Loss : How do we compute it? [the loss table above, now scoped to the 12:34p eventTime bucket] (see the loss-computation sketch below)
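A sketch of the set-difference computation for one bucket; how the id sets are collected per node is an assumption. For the table above, 10 ids at ingest and 5 at expel yields 50%:

```java
import java.util.HashSet;
import java.util.Set;

public final class LossCalc {
    /** E2E loss % for one eventTime bucket: ids seen at ingest but never at expel.
        Assumes a non-empty ingest set. */
    public static double lossPercent(Set<String> idsAtIngest, Set<String> idsAtExpel) {
        Set<String> lost = new HashSet<>(idsAtIngest);
        lost.removeAll(idsAtExpel); // set difference between the 2 points
        return 100.0 * lost.size() / idsAtIngest.size();
    }
}
```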
• 90. Loss : How do we compute it? [diagram: a row of 1-minute buckets @12:34p @12:35p @12:36p @12:37p @12:38p @12:39p @12:40p]
• 91. Loss : How do we compute it? [the same buckets, with “Now” marking the current minute]
• 92. Loss : How do we compute it? [the most recent buckets still update with late-arrival data]
• 93. Loss : How do we compute it? [older buckets are stable, so we compute loss on them]
• 94. Loss : How do we compute it? [the oldest buckets are aged out]
• 95. Loss : How do we compute it? In a streaming data system, messages never stop flowing. So, how do we know when to count? Solution Allocate messages to 1-minute wide time buckets using message eventTime Wait a few minutes for messages to transit, then compute loss (e.g. the 12:35p loss table) Raise alarms if loss occurs over a configured threshold (e.g. > 1%) (a sketch of this bucketing scheme follows)
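A sketch of that bucketing scheme: messages are counted into 1-minute buckets by eventTime, buckets are finalized only after a waiting period, and an alarm fires over a threshold. The wait time and threshold are example settings, and the bare counters are not thread-safe (a real system would use atomic counters or a metrics store):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class LossBuckets {
    static final long BUCKET_MS = 60_000;       // 1-minute wide buckets
    static final long WAIT_MS = 5 * 60_000;     // let late-arrival messages land
    static final double ALARM_THRESHOLD = 1.0;  // alert if loss > 1%

    // bucket index -> [messages in, messages out]
    final Map<Long, long[]> buckets = new ConcurrentHashMap<>();

    void recordIn(long eventTime)  { counts(eventTime)[0]++; }
    void recordOut(long eventTime) { counts(eventTime)[1]++; }

    long[] counts(long eventTime) {
        return buckets.computeIfAbsent(eventTime / BUCKET_MS, k -> new long[2]);
    }

    /** Called periodically: compute loss for buckets old enough to be stable. */
    void sweep(long nowMillis) {
        buckets.forEach((bucket, c) -> {
            if (bucket * BUCKET_MS + WAIT_MS < nowMillis) {
                double lossPct = 100.0 * (c[0] - c[1]) / Math.max(c[0], 1);
                if (lossPct > ALARM_THRESHOLD) {
                    System.err.println("LOSS ALARM bucket=" + bucket + " loss=" + lossPct + "%");
                }
                buckets.remove(bucket); // age out
            }
        });
    }
}
```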
• 96. Loss : How do we compute it? We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system. But wait…
  • 97. Performance (have we tuned our system for performance yet??)
• 98. Performance Goal : Build a system that can deliver messages reliably from S to D with low latency To understand streaming system performance, let’s understand the components of E2E Lag [diagram: scaled-out pipeline, a fleet of S instances feeding D]
• 99. Performance Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response
• 100. Performance Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response • This time includes overhead of reliably sending messages to Kafka
• 101. Performance Expel Time : Time to process and egest a message at D.
• 102. Performance E2E Lag : Total time messages spend in the system from message ingest to expel! [diagram: E2E Lag = Ingest Time + Transit Time + Expel Time]
• 103. Performance Penalties (Trade-off: Latency for Reliability)
• 104. Performance Challenge 1 : Ingest Penalty In the name of reliability, S needs to call kProducer.flush() on every inbound API request S also needs to wait for 3 ACKs from Kafka before sending its API response
• 105. Performance Challenge 1 : Ingest Penalty Approach : Amortization Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty (as sketched below)
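A sketch of the batch-API amortization, written as an additional method on the hypothetical IngestNode class above: one flush and one HTTP response now cover N messages, dividing the per-request penalty by the batch size:

```java
import java.util.ArrayList;
import java.util.List;
// (plus the producer imports from the IngestNode sketch)

/** Hypothetical batch endpoint on IngestNode. */
public void ingestBatch(List<String> messages) throws Exception {
    List<Future<RecordMetadata>> acks = new ArrayList<>();
    for (String m : messages) {
        acks.add(producer.send(new ProducerRecord<>(EVENTS_TOPIC, m)));
    }
    producer.flush();              // a single blocking flush for the whole batch
    for (Future<RecordMetadata> ack : acks) {
        ack.get();                 // fail the request if any message failed
    }
    // ... then send one HTTP response covering all N messages
}
```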
• 106. Performance Challenge 2 : Expel Penalty Observations Kafka is very fast — many orders of magnitude faster than HTTP RTTs The majority of the expel time is the HTTP RTT
• 107. Performance Challenge 2 : Expel Penalty Approach : Amortization In each D node, add batch + parallelism (as sketched below)
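A sketch of batch + parallelism at D: fan a polled batch out across a thread pool so the HTTP round-trips overlap; the pool size is an example and sendDownstream() is the hypothetical egest call from earlier:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelExpel {
    private final ExecutorService pool = Executors.newFixedThreadPool(16); // example size

    /** Send a polled batch downstream in parallel; return only when all succeed. */
    void expelBatch(List<String> batch) throws Exception {
        List<Future<?>> inFlight = new ArrayList<>();
        for (String message : batch) {
            inFlight.add(pool.submit(() -> sendDownstream(message))); // overlap the RTTs
        }
        for (Future<?> f : inFlight) {
            f.get(); // propagate any failure so the whole batch is NACKed and re-read
        }
    }

    void sendDownstream(String message) { /* hypothetical HTTP POST */ }
}
```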
  • 108. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
  • 109. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures
  • 110. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures
• 111. Performance Challenge 3 : Retry Penalty (@ D) Concepts In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts We call these Recoverable Failures In contrast, we should never retry a message that has 0 chance of success! We call these Non-Recoverable Failures E.g. Any 4xx HTTP response code, except for 429 (Too Many Requests) (see the sketch below)
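One way to encode that decision; treating 429 and 5xx as recoverable and all other 4xx as non-recoverable follows the rule described above:

```java
/** Should we retry a failed delivery with this HTTP status? */
static boolean isRecoverable(int httpStatus) {
    if (httpStatus == 429) return true;   // Too Many Requests: back off and retry
    if (httpStatus >= 400 && httpStatus < 500) return false; // other 4xx: never retry
    return httpStatus >= 500;             // 5xx (and timeouts mapped here): retry
}
```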
• 112. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don’t retry any non-recoverable failures How we retry
• 113. Performance Challenge 3 : Retry Penalty Approach We pay a latency penalty on retry, so we need to be smart about What we retry — Don’t retry any non-recoverable failures How we retry — One Idea : Tiered Retries
• 114. Performance - Tiered Retries Local Retries Try to send message a configurable number of times @ D Global Retries
• 115. Performance - Tiered Retries Local Retries Try to send message a configurable number of times @ D If we exhaust local retries, D transfers the message to a Global Retrier Global Retries
• 116. Performance - Tiered Retries Local Retries Try to send message a configurable number of times @ D If we exhaust local retries, D transfers the message to a Global Retrier Global Retries The Global Retrier then retries the message over a longer span of time
• 117. Performance - 2-Tiered Retries [diagram: S → E → D, with a global retrier (gr) attached via retry topics] RI : Retry_In RO : Retry_Out (a sketch follows)
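A sketch of the 2-tiered flow at D: a few immediate local attempts, then a hand-off to the global retrier via the Retry_Out topic. The topic name, attempt count, and exception types are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

class TieredRetry {
    static final int LOCAL_ATTEMPTS = 3;                // example setting
    static final String RETRY_OUT_TOPIC = "retry_out";  // RO in the diagram

    final KafkaProducer<String, String> producer;

    TieredRetry(KafkaProducer<String, String> producer) { this.producer = producer; }

    void deliver(String message) {
        for (int attempt = 1; attempt <= LOCAL_ATTEMPTS; attempt++) {
            try {
                sendDownstream(message); // hypothetical HTTP call
                return;                  // success: no retry needed
            } catch (RecoverableException e) {
                // fall through to the next local attempt
            } catch (NonRecoverableException e) {
                return; // never retry a request with 0 chance of success (e.g. 400)
            }
        }
        // Local retries exhausted: hand off to the global retrier, which
        // replays the message over a longer span of time.
        producer.send(new ProducerRecord<>(RETRY_OUT_TOPIC, message));
        producer.flush();
    }

    void sendDownstream(String m) throws RecoverableException, NonRecoverableException { /* ... */ }

    static class RecoverableException extends Exception {}
    static class NonRecoverableException extends Exception {}
}
```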
• 118. Performance At this point, we have a system that works well at low scale
• 119. Scalability (But how does this system scale with increasing traffic?)
• 121. Scalability First, let’s dispel a myth! Each system is traffic-rated The traffic rating comes from running load tests There is no such thing as a system that can handle infinite scale
• 122. Scalability First, let’s dispel a myth! Each system is traffic-rated The traffic rating comes from running load tests There is no such thing as a system that can handle infinite scale We only achieve higher scale by iteratively running load tests & removing bottlenecks
  • 123. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost
  • 125. Scalability - Autoscaling Autoscaling Goals (for data streams): Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag) Goal 2: Automatically scale in to minimize cost Autoscaling Considerations What can autoscale? What can’t autoscale?
  • 127. Scalability - Autoscaling EC2 The most important part of autoscaling is picking the right metric to trigger autoscaling actions
• 128. Scalability - Autoscaling EC2 Pick a metric that Preserves low latency Goes up as traffic increases Goes down as the microservice scales out
• 129. Scalability - Autoscaling EC2 Pick a metric that Preserves low latency Goes up as traffic increases Goes down as the microservice scales out E.g. Average CPU
• 130. Scalability - Autoscaling EC2 Pick a metric that Preserves low latency Goes up as traffic increases Goes down as the microservice scales out E.g. Average CPU What to be wary of Any locks/code synchronization & IO Waits Otherwise … As traffic increases, CPU will plateau, auto-scale-out will stop, and latency (i.e. E2E Lag) will increase
• 131. Conclusion • We now have a system with the Non-functional Requirements (NFRs) that we desire! • While we’ve covered many key elements, a few areas (e.g. Isolation, Containerization, Caching) will be covered in future talks and upcoming blogs! • Follow @r39132 for updates!