The burden of a successful feature: Scaling our real time logging platform

Ines Sombra
Director of Engineering
The burden of a successful feature:  
Scaling our real time logging platform
presents

Today’s Agenda
A delightful
demo &
context
A deep dive
into logging
Challenges
& future

https://vimeo.com/267641392
Fresh from Altitude NYC
—Peter Bourgon, Altitude NYC
Observability is an umbrella term. There are diﬀerent
techniques to achieve observability in a system.

Peter’s classification of Observability
TECHNIQUES SYSTEMS
* Lovingly stolen from Peter Bourgon

SYSTEMS
* Lovingly stolen from Peter Bourgon
TODAY 👉
Peter’s classification of Observability
TECHNIQUES

But Why?
This pipeline is one of the oldest systems at Fastly
Born out of our dissatisfaction w the status quo
We wanted something that would send you logs
extremely fast (stream them near realtime) to
anywhere you want (many endpoints)

Logging @ Fastly
Caches Aggregators Endpoints
s3
syslog
gcs
sumologic
bigquery
ftp
papertrail
…

s3
syslog
gcs
sumologic
bigquery
ftp
papertrail
…
Logging @ Fastly
Caches Aggregators Endpoints

Logging pipeline is Stateless
We don’t batch your logs
We don’t store your logs
We stream your logs in
near real-time to your
deﬁned endpoints
We really don’t want your
logs on disk

Logging @ Fastly
Caches + Senders Aggregators
Varnish
Varnish
Varnish
Varnish

Varnish
Varnish
Varnish
Varnish
Logging @ Fastly

Varnish
Varnish
Varnish
Logging @ Fastly
Varnish

Logging pipeline is Best Effort
We try our best to send logs to
your defined endpoint
Your endpoint must be up &
healthy in order for us to be
able to send data to it
We have minimal buffering
Pipeline optimized for log
streaming speed

Logging Endpoints
We don’t limit the number
of endpoints or log lines
per request
~8.6K active endpoints
Ecosystem of endpoints in
diﬀerent stages of
evolution
Aggregators
Endpoints
s3
syslog
gcs
sumologic
bigquery
ftp
papertrail
…

Logging Streams data
File-based endpoints (time ranged)
Streaming endpoints (protocol or http-requests)
s3 gcs ftp sftp
syslog
sumologic
bigquery logentries papertrail
splunk scalyr honeycomb

Logging Growth (2014-2015)
😳
~430K LPS ~1.2K endpoints ~ 2GBps

Logging Growth (2017-2018)
~3M LPS ~8.6K endpoints ~4GBps

Logging Growth (8X!!)
~3M LPS ~8.6K endpoints ~4GBps

We send a lot of data continuously to
our supported endpoints
Syslog continues to be our most
popular endpoint but S3 & GCS have
the highest volume
The 70's are still alive with a very
respectable 13 MBps to ftp and 74
kBps to sftp*
* for the non-millennials
Logging Endpoints

Challenges &  
Lessons learned

Volume Challenges
No hard limits to what you
can log, this can be
challenging
System is multi-tenant. Noisy
neighbors can aﬀect delivery
Consider sampling for high
volume logging

Burden of many
endpoints
Classic integrations
challenges (each endpoint is
a downstream dependency)
Standard endpoint clients
often don’t meet our needs
Having our own clients
aﬀords us extra optimizations

Endpoints & Health
Some endpoints have known
limitations (infamous
examples: S3, BigQuery, GCS)
Diﬃcult to infer if an
endpoint is working or not
(Hard to test setup too)
Structured logging (JSON via
VCL) is challenging

Service Isolation
Prioritize delivery of content over
log retention
An aggregator discards the oldest
logs it has when it can’t deliver
them fast enough
In a cache node we are our own
customers so senders do the
same when they can’t reach
aggregators fast enough

Expectation Mismatch
Burden of a system that works so well is that it
makes you believe you have strong guarantees
Design constraints determine the SLA of the
pipeline
General advice: Understand the design choices of
the systems you use because they limit what is
possible to guarantee *

The team have been Busy bees
H2
H1
Platform performance
& addressing the
challenges of
individual endpoints
We are getting fancy!

Platform Performance
Reducing lock contention & CPU usage
Smarter memory allocation &
management
Overhauling all endpoints
Halving the time it takes for a log line to
be processed (from sender read to
aggregator line preparation)

Getting fancy
BigQuery improvements
New endpoints: Kafka
More integrations with
cloud services
Make endpoints easier to
debug

Want more endpoints?
Want metrics?
Want easier structured logging?
Want VCL counters + secondly
aggregation + a higher SLA?
Want More?

Want more endpoints?
Want metrics?
Want easier structured logging?
Want VCL counters + secondly
aggregation + a higher SLA?
Dom Fee
Want More?

tl;dr LOGGING
Fastly lets you extend the
visibility of your system to the
edge & gain meaningful insights
in near real-time
Is a pipeline with very speciﬁc
constraints & guarantees
Exciting things are coming!

(l,d)ogs of Fastly
https://github.com/Randommood/Altitude2018

The burden of a successful feature: Scaling our real time logging platform

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to The burden of a successful feature: Scaling our real time logging platform

Similar to The burden of a successful feature: Scaling our real time logging platform (20)

More from Fastly

More from Fastly (20)

Recently uploaded

Recently uploaded (20)

The burden of a successful feature: Scaling our real time logging platform