Hi. I am Jason Stangroome. I work for section.io and I’d like to talk to you about how we handle operational visibility.
I am definitely not here to claim that the way we do things is the best way. Some days I wonder if it’s even one of the good ways. But hopefully through sharing, we can all improve.
Now, to provide some context for this presentation, I’d like to give you some background on what section.io does:
Section.io is a new class of Content Delivery Network.
Prior to section.io, the CDN market focused on solutions that operate in-front of, or beside, your production website. Furthermore, the market dominators were very closed and slow to respond to both industry change and customer requests.
In contrast, at section.io, we have three driving tenets: be open, be easy, and give users control.
These tenets have manifested in our CDN today in several ways:
Firstly, you can run our CDN in local development and other environments prior to production.
This is important because even a basic CDN has an impact on which browser requests actually reach your origin web server and how those requests and responses are modified in transit. An advanced CDN configuration is going to be manipulating the traffic significantly if it is delivering the best value possible for your particular site. Being able to see how this affects your site’s behaviour before production is critical to reproducing and understanding reported issues and even catching new issues before a feature deployment becomes a production incident.
See Menulog’s customer data leakage earlier this month for an example of what can happen when your CDN configuration is insufficiently tested.
The configuration for every website on section.io is tracked in a per-site git version control repository, providing natural change tracking. We provide a friendly web GUI over most of it but you can also just clone the git repository locally and use your preferred tools.
Deployments happen simply by pushing to the remote git branch that corresponds to your environment and moments later the new configuration is active at all the delivery nodes.
Similarly, a request to flush your entire site cache, or just specific URLs, or even some combination in between happens in seconds.
We provide a strong Qualys Grade A HTTPS configuration by default at no extra charge if you bring your own certificate. Our team are also currently working on Let’s Encrypt integration so you’ll even get a free HTTPS certificate as soon as you point your DNS records at our edge nodes.
Once HTTPS is activated, you’ll find HTTP/2 is also enabled out-of-the-box, even if your origin doesn’t support it.
section.io consists of quite a few popular open-source systems that we have integrated together and we believe this gives at least two important benefits to our users:
If you have existing skills with these products, these skills are immediately useful when working with the section.io platform. If you don’t have these skills, there is already a wealth of existing content, and an established community for these products, on top of what our section.io documentation and support team already provide.
If you later find section.io is not the right fit for your website, you haven’t coupled to some section.io-specific implementation. All our architectural decisions err away from building a section.io custom build that could result in vendor lock-in.
Everything in our web management portal is built API-first. If you can perform an operation through our web UI, you can also do it via our REST API. This makes it easy to automate tasks and include section.io deployments into your internal deployment pipelines.
We focus on providing access to logs as close as possible to the time the event occurred. You shouldn’t have to raise a support ticket to request access to detailed data of the traffic for your website.
Lastly, we don’t manage our customers’ origin web servers – they have their own operations staff. But to keep the origin web servers running smoothly, and to understand anomalies, those staff need access to much of the same data that we need at section.io to ensure the CDN platform itself is operating as expected.
As such, we’ve built our system so that the data and techniques that we use to run section.io are also available to our users. The primary difference is permissions – we don’t give our users visibility into the data of websites they don’t own.
section.io, in its current form, began in the second half of 2014, just after Docker reached version 1.0. Prior to that we only operated a fully-managed CDN service and wanted to find a way to put the control back in the hands of our users.
Docker’s approach to containerisation proved to be the catalyst and after watching it slowly mature, we seized the opportunity to re-architect.
Containers enable us to more easily build a multi-tenant system giving each user their own isolated environment for handling their website’s traffic and the CDN configuration and operational data.
As just a few examples: we’re using Varnish Cache as our CDN’s caching solution. If you don’t know Varnish, it is the caching solution used by Wikipedia, The New York Times, Pinterest, some competing CDNs, and *many* others. On section.io each site gets its own Varnish instance in its own Docker container with dedicated configuration.
Similarly we provide ModSecurity as our Web Application Firewall offering and we have a content-rewriting proxy in development right now.
We use Kibana for querying logs, Graphite for metrics, and Umpire for alerting, and these are all containerised per website too.
There’s very little left in our platform that isn’t in a container and that list diminishes with each iteration.
The first step to monitoring is gathering the data and containers brought some challenges.
Most of the data is web access logs and syslog. We also run various processes and jobs to capture additional data to a useful log format.
We are then running multiple Docker containers per customer website. There’s a whole debate raging on about just how much a single container should do. On one end of the spectrum you have the one-process-per-container crowd, and on the other end you have an init daemon, various system services, a handful of other processes and the kitchen sink. There are merits to each perspective but for us the sheer process count is a driver toward the minimalist end.
We have over 300 containers actively running on some nodes and if each container is running its own log shipping process, that’s another 300 processes fighting for a slice of CPU time and another 300 connections to our log ingestion system, *per node*. Instead we leverage Docker Volumes to map the log directories of each container out to where a single per-node log shipping process can harvest them all.
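A minimal sketch of the single-harvester idea, in Python, might look like the following. The directory layout (one subdirectory per container under a shared root on the host) and the function name are hypothetical, chosen only for illustration; this is not the shipper we run.

```python
from pathlib import Path


def find_container_logs(shared_root):
    """Map each container's directory name to the log files found inside it.

    Assumes a hypothetical layout of <shared_root>/<container>/<name>.log,
    where each subdirectory is a volume mapped out of one container.
    """
    logs = {}
    for container_dir in sorted(Path(shared_root).iterdir()):
        if container_dir.is_dir():
            # Only files ending in .log are treated as harvestable logs.
            logs[container_dir.name] = sorted(
                p.name for p in container_dir.glob("*.log")
            )
    return logs
```

One process walking a shared root like this replaces one shipper per container, which is the saving described above.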
Today we’re using Elastic Filebeat for shipping log files. We like that it uses TLS, it batches logs together and gets some good compression from repeated values, and it requires acknowledgement from the receiver before proceeding. Filebeat is a fairly new product though and we’ve been hitting a number of edge cases. Luckily the Elastic team has been responsive to our bug reports (once we started including repro scripts).
We’re also interested in adopting Elastic’s Topbeat and Packetbeat solutions where we have previously used collectd.
Log rotation is also a little more involved in this world. Again we don’t want 300 cron daemons running to handle each container but at the same time we do need to signal all the container processes that have open file descriptors to close and re-open the log files after rotation. For now we’ve integrated `docker exec` calls into shared logrotate.d configurations.
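As a rough illustration of that last step, the sketch below builds one `docker exec` command per container. The assumptions are loud ones: that the logging process is PID 1 in the container, that it re-opens its log files on `USR1`, and that a `kill` binary exists in the image. All three vary by program, and the function name is hypothetical.

```python
def reopen_commands(container_names, signal="USR1"):
    """Build one `docker exec` argv per container asking PID 1 to reopen logs.

    Assumptions: the logging process is PID 1 and reopens its log files on
    the given signal. Nothing is executed here; the caller decides how to
    run the commands (for example from a post-rotation hook).
    """
    return [
        ["docker", "exec", name, "kill", "-s", signal, "1"]
        for name in container_names
    ]
```

Returning argument lists rather than shell strings keeps container names from being interpreted by a shell when the commands are eventually run.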
Our container hosts are short-lived – days, sometimes weeks in the quieter partitions. This happens for two reasons:
Platform deployments are implemented by provisioning new hosts, bringing them into service, and retiring the old hosts. We scale horizontally in response to increased load but whenever we scale-in as the load declines we retire the oldest hosts – just one more nail in the coffin for configuration drift.
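The "retire the oldest first" rule is simple enough to sketch. The host names and the shape of the input (a mapping from host name to launch time) are hypothetical.

```python
from datetime import datetime


def hosts_to_retire(launch_times, count):
    """Return the names of the `count` oldest hosts, oldest first.

    `launch_times` maps a host name to the datetime it was provisioned.
    """
    by_age = sorted(launch_times, key=lambda name: launch_times[name])
    return by_age[:max(count, 0)]


# Example fleet with made-up names and dates.
fleet = {
    "node-a": datetime(2016, 5, 1),
    "node-b": datetime(2016, 5, 9),
    "node-c": datetime(2016, 5, 4),
}
oldest_two = hosts_to_retire(fleet, 2)
```

Because every scale-in removes the longest-lived hosts, no host survives long enough to accumulate much drift.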
Just how much data are we dealing with at the moment?
Focused only on our self-managed customers and a portion of our fully-managed customers, and only on the web traffic logs, we’re handling about 600 million new logs a week.
Many of our fully-managed customers are still logging through our previous generation system and are being migrated incrementally.
Our non-web logs are not included in this number.
This is only expected to grow with our user base.
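To put the weekly figure in per-second terms, here is the arithmetic. It gives an average only; traffic is not evenly spread across the week, so peak rates will be higher.

```python
LOGS_PER_WEEK = 600_000_000
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800 seconds

# Average ingest rate if the logs arrived evenly across the week.
average_per_second = LOGS_PER_WEEK / SECONDS_PER_WEEK
print(round(average_per_second))  # prints 992
```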
Our users can extract their logs via the Elasticsearch API for their own archives before the 7 days of retention pass. We are investigating other options for shipping logs directly to our users’ own systems.
From the moment a web request is handled in a delivery node and written to the local log file it is typically as little as 5 seconds until that log is searchable in the Kibana UI for ES. Under peak load the latency can reach 2 minutes for some log sources and this is our acceptable upper bound.
We found that we needed to split the Logstash pipeline into separate processes for resiliency, especially due to the design of the Elasticsearch output plugins blocking the pipeline and signalling back pressure all the way to the delivery nodes.
Redis helps to decouple the performance and availability of the components in the pipeline.
We autoscale our Logstash machines based on log flow rate. Essentially it’s a combination of CPU usage and queue depth.
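A scaling rule combining those two signals could be sketched as below. Every threshold, the one-step-at-a-time policy, and the minimum and maximum counts are hypothetical placeholders, not our production values.

```python
def desired_instance_count(current, cpu_percent, queue_depth,
                           cpu_high=75.0, cpu_low=25.0,
                           queue_high=100_000, queue_low=1_000,
                           minimum=2, maximum=20):
    """Suggest a new instance count from CPU usage and queue depth.

    Scale out if either signal is high; scale in only if both are low.
    All thresholds are illustrative assumptions.
    """
    if cpu_percent > cpu_high or queue_depth > queue_high:
        target = current + 1
    elif cpu_percent < cpu_low and queue_depth < queue_low:
        target = current - 1
    else:
        target = current
    # Never go below the floor or above the ceiling.
    return max(minimum, min(maximum, target))
```

Scaling out on either signal but in only on both is a deliberately cautious choice: a deep queue with idle CPUs still means logs are arriving faster than they are being processed.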
Kibana containers use nginx auth subrequests to ensure containers are running so that we can limit the number of Kibana containers running concurrently to the same number of users actively using the Kibana UI – a much smaller number than the total number of users on the platform. We run a single Elasticsearch cluster but we then use an nginx proxy with Lua to parse the requests and whitelist which indexes are permitted per user. On-demand messages allow varnishlog to be run in cache containers to grab a snapshot of recent requests in much greater detail than we can currently ship. This is very useful for diagnosing issues.
Everything in metrics could be queried from Elasticsearch but metrics make many queries more efficient and we can get better retention. Umpire allows smart alerting by leveraging existing synthetics platforms instead of building our own.
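The index whitelisting can be illustrated with a simplified check, written here in Python rather than Lua. It looks only at index names in the URL path; requests that name indexes in the body would need extra handling. The index naming scheme and function names are hypothetical.

```python
from fnmatch import fnmatchcase


def indexes_in_path(path):
    """Extract index names from a path such as /a,b/_search?q=x."""
    first = path.split("?", 1)[0].lstrip("/").split("/", 1)[0]
    if not first or first.startswith("_"):
        return []  # no explicit index: the request targets all indexes
    return first.split(",")


def is_request_allowed(path, allowed_patterns):
    """Allow a request only if every named index matches an allowed pattern."""
    indexes = indexes_in_path(path)
    if not indexes:
        return False  # deny requests not scoped to named indexes
    return all(
        any(fnmatchcase(index, pattern) for pattern in allowed_patterns)
        for index in indexes
    )
```

For a user permitted `logs-site1-*`, a search on `/logs-site1-2016.05.01/_search` passes, while `/logs-*/_search` or a comma-separated list that includes another site's index is refused.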
Boring stuff like CPU, etc
All the same traffic data as mentioned on the last slide plus…
Front-line with a buddy. Including the CEO. Both customer support and platform support. Gives a great range of perspectives.
Alerts are all actionable and documented. If it can’t be actioned, the alert is removed. The documentation lists impacted systems and user experiences, possible causes and sometimes how the failure may cascade.
When incidents occur the immediate focus is on rectifying the issue, ideally without destroying any diagnostic data. Then post-mortems are performed to document the series of conditions that allowed the situation to exist – careful not to target individual actions. The post-mortems then become a basis for identifying room for improvement in the platform and workflows that would circumvent similar incidents in the future, and those ideas go into our product backlog for consideration in the next iteration of development.
Like Umpire but for Elasticsearch data
But we don’t want string_1, integer_7, custom_field_9
Sometimes you just want to tail a log and get all Matrix-y
It is common for a site to have high traffic during the business day and drop to low overnight, typically 3am Sydney time for an Australian website. Weekends have similar shapes. But the absolute numbers of these peaks and troughs change over time, by season, and as business grows, so it’s difficult to establish a baseline from which to trigger alerts. We’d like to investigate options to be notified, quickly, when traffic is not following the trend.
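One option among many, sketched here purely as an illustration and not something we have built, is to compare the current request count with the same hour of the week in recent weeks, so the baseline follows the daily and weekly shape as it shifts. The tolerance value and function name are hypothetical.

```python
from statistics import median


def deviates_from_trend(current, same_hour_previous_weeks, tolerance=0.5):
    """Return True when `current` differs from the median of earlier weeks
    by more than `tolerance`, expressed as a fraction of that median."""
    if not same_hour_previous_weeks:
        return False  # no history yet, so nothing to compare against
    baseline = median(same_hour_previous_weeks)
    if baseline == 0:
        return current > 0
    return abs(current - baseline) / baseline > tolerance
```

Using only the last few weeks as history lets the baseline drift with seasonal change and business growth, at the cost of reacting slowly to a genuine step change.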
Thank you all for listening.
I hope you found at least some of what I’ve shared today to be useful and I’d love to hear back from you about anything I’ve mentioned tonight.
Monitoring at section.io - Operational Intelligence Meetup May 2016
Monitoring at section.io
Operational visibility for both the platform and our users
• Runs on your local machine and pre-production
• Configuration and deployment via git
• Fast global cache management
• HTTPS and HTTP/2 by default
A modern CDN
• Integrates with popular open-source
• API driven
• Near real-time log access
• Consistent operational interface