My name is Laurynas, and I will be speaking about practical application logging and monitoring. If there is something you want to ask me but won’t be able to today, or you think of some questions later, you can drop me an email at this address. You can also visit my GitHub profile, where you can find an example application using the tools we are about to discuss.
So, a little bit about me first. I am 24 years old, and I live and work in Kaunas. I am a functional programming and DevOps enthusiast. I currently work at iGeolise, where we are building the TravelTime platform, an API that makes maps searchable by time. I have about 5 years of professional experience. I am also very active in sports: I do kickboxing and jiu jitsu, usually 5-6 times a week, and I suggest that anyone with zero sports activity take something up. It is very important in our line of work, sitting in front of a monitor for at least 8 hours a day.
I am currently working at iGeolise, where we are building the TravelTime platform, an API that makes maps searchable by time. For example, imagine that you are looking for a new place to live. The property listing you are looking at has an integration with our API. Now, instead of searching for places that are within 10 miles of your work, you could search for places that are reachable within 15 minutes by public transport or by driving, so you get results that are actually relevant to you. That is the kind of service we provide. The picture shows which places you can reach by tube in central London, versus a simple radius search.
We are a relatively small team: we have only 7 developers, and 2 of them, myself included, are responsible for operations. I personally prefer not to be 100% devops all the time, because I cannot imagine doing just one of the two, devops or programming. I enjoy both, and our structure allows that. Also, two of us is plenty for the 20-something servers we have to manage.
Our platform consists of several services written in Scala using Akka and the Play framework. We also have some single-page web applications; those are written in Scala.js using Facebook's React framework.
We use Ansible heavily: all of our servers are managed with it, and we also do deployments with it.
Let’s start off with an example application. This is of course very simple: you have your load balancers, application servers and a database. As you know, load balancers can be configured to use sticky sessions, meaning that when you visit a website for the first time, the load balancer picks a server for you and all your following requests, for a certain period of time, are routed to that same server. This can help with session handling: session information is stored on each server separately, so if you hit a different server each time, that server creates a new session for you and you have to log in again. However, sticky sessions have downsides as well; for example, during a rolling deploy you will be redirected to another server and lose your session information anyway.
To combat this you can store sessions in some database, which of course is slower than just keeping them in memory, or have some kind of session replication. The Play framework also offers the option to store session information on the client side, encrypted of course, so it requires zero setup. For single-page applications that only make background AJAX calls to the servers, a common practice is to have no sessions at all and authenticate on each request. That is slower, but not significantly, as the database can cache such requests.
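To make the client-side option concrete, here is a minimal sketch of what it looks like in Play (2.x style); the controller and value names are made up. Play's session is just a signed cookie, so any application server behind the load balancer can read it, and no stickiness or server-side session storage is needed.

```scala
import play.api.mvc._

// Hypothetical controller; the point is that Play stores the session in a
// signed cookie, so every server behind the load balancer can read it.
class AuthController extends Controller {

  // On login, put the user id into the session cookie that Play sends back.
  def login(userId: String) = Action {
    Ok("logged in").withSession("userId" -> userId)
  }

  // Any later request, landing on any server, can read the same session.
  def whoAmI = Action { implicit request =>
    request.session.get("userId") match {
      case Some(id) => Ok(s"you are $id")
      case None     => Unauthorized("please log in")
    }
  }
}
```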
So, for the sake of our example, let's assume that the load balancers do not use sticky sessions and that each request hits a different server.
You have set up logging and you capture when someone searches for an item, views an item, adds or removes an item from the cart, enters the checkout and completes the purchase. Each server logs into a separate file.
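As a rough illustration of this kind of event logging, a sketch using SLF4J might look like the following; the event names and line format are assumptions, and the important part is that every line carries the client ID so a whole session can be traced later.

```scala
import org.slf4j.LoggerFactory

// A minimal sketch of the web shop event logging described above.
object ShopLogging {
  private val logger = LoggerFactory.getLogger("shop.events")

  def searched(clientId: String, query: String): Unit =
    logger.info(s"client=$clientId action=search query=$query")

  def viewedItem(clientId: String, itemId: Long): Unit =
    logger.info(s"client=$clientId action=view item=$itemId")

  def addedToCart(clientId: String, itemId: Long): Unit =
    logger.info(s"client=$clientId action=add-to-cart item=$itemId")

  def completedPurchase(clientId: String, orderId: Long): Unit =
    logger.info(s"client=$clientId action=purchase order=$orderId")
}
```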
So imagine that an exception is thrown and you are emailed the stack trace. It includes the client ID, and you want to find out what exactly happened, so you try to reproduce the bug by doing the exact same steps. The issue is that each action was logged on a different server, so now you have to scavenge for logs on all of the servers. This may not be an issue if you have just a couple of servers, but it gets really painful as you scale. My talk will focus on how to solve this problem, and also on how to capture meaningful metrics and visualize the captured data.
So let's talk about the tools that we are going to use. We will store our logs in Elasticsearch. It is a full-text search and analytics engine, often used with applications that have complex search features and requirements; the most common use case is probably a web shop, where it provides search autocompletion and complex item filtering. It works great for log files and metrics too.
For log storage, Elasticsearch is usually accompanied by Logstash, a data collection engine that can read the log files, extract some values from them and store them in Elasticsearch so they can be analyzed later. For example, at iGeolise, at the end of each request we capture and log the following information: which client made the request, how long the request took, which transportation mode was used and which API method was called. So instead of just storing plain log lines in Elasticsearch, Logstash extracts these fields. You should store the plain log lines as well, so you can easily find all actions done by a specific client, as I described with the web shop example previously. Reading the files directly requires Logstash to run on the same node where the log files are stored, meaning we would have to run it on every node; it runs on the JVM and has some overhead, so that is not what you usually want to do.
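To make this concrete, here is a hedged sketch of what such a per-request summary line could look like on the application side, together with the rough shape of a grok pattern Logstash could use to pull the fields out. The exact format and field names are assumptions, not our actual ones.

```scala
import org.slf4j.LoggerFactory

// Hypothetical format for the per-request summary line described above.
// A Logstash grok filter could then extract the fields with a pattern
// roughly like:
//   client=%{WORD:client} method=%{WORD:api_method} mode=%{WORD:mode} took=%{NUMBER:duration_ms:int}
object RequestLogging {
  private val logger = LoggerFactory.getLogger("api.requests")

  def requestFinished(client: String, apiMethod: String,
                      transportMode: String, durationMs: Long): Unit =
    logger.info(
      s"client=$client method=$apiMethod mode=$transportMode took=$durationMs")
}
```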
And the last piece of the puzzle is Kibana. It is an analytics and visualization platform designed to work with Elasticsearch. You can easily perform advanced data analysis and visualize your data in a variety of charts, tables and even maps, which makes it easy to understand large volumes of data. It features a browser-based interface that allows you to create dynamic dashboards and interact with the data.
Usually you have a small service running on each node that ships the logs to a Redis database, and Logstash then reads from Redis. So you have a single Elasticsearch, Logstash and Redis installation, and on each of the nodes you run Beaver, a service that ships those logs.
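Conceptually, a shipper like Beaver does something very simple: follow the log file and push every new line onto a Redis list. The sketch below, using the Jedis client, shows the idea; the file path, Redis host and list key are assumptions, and a real shipper also handles details like file rotation.

```scala
import java.io.RandomAccessFile
import redis.clients.jedis.Jedis

// A rough sketch of what a log shipper does: follow the log file and push
// every new line onto a Redis list, from which Logstash (or a replacement)
// can consume. File path, Redis host and list key are assumptions.
object LogShipper {
  def main(args: Array[String]): Unit = {
    val redis = new Jedis("logs.internal", 6379)
    val file  = new RandomAccessFile("/var/log/app/app.log", "r")
    var pos   = file.length() // start at the end, ship only new lines

    while (true) {
      file.seek(pos)
      var line = file.readLine()
      while (line != null) {
        redis.rpush("app-logs", line) // the consumer reads with BLPOP
        line = file.readLine()
      }
      pos = file.getFilePointer
      Thread.sleep(1000) // poll once a second
    }
  }
}
```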
However, Elastic recently released a tool called Filebeat that can ship logs straight to Logstash, so there is no need for a Redis server. Filebeat also has to run on each server.
So we started off with this simple setup; however, there were some problems. I wasn't part of the company yet, but the guys said that Logstash was just randomly crashing once in a while and they were unable to diagnose the problem. Now we have a small in-house application that does a small portion of what Logstash is capable of. Logstash has many plugins and pipelining capabilities, but our needs are very simple: just extract some fields and store the data into the appropriate indices. So the application we've built reads from Redis and indexes into Elasticsearch. Going forward, if we ever need more features, we may go back to Logstash; maybe it's fine now.
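As a rough idea of what such a small indexer can look like, here is a sketch that pops lines from the Redis list and indexes them into Elasticsearch over its HTTP API; the index naming, the list key and the (omitted) field extraction are all assumptions, not our actual implementation.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import redis.clients.jedis.Jedis

// A minimal sketch, under assumed names, of a small indexer: pop log lines
// from the Redis list the shippers write to and index them into
// Elasticsearch over its HTTP API.
object LogIndexer {

  private def indexIntoEs(json: String): Unit = {
    // Daily indices ("logs-<date>") are a common convention; the exact
    // index and type names here are assumptions.
    val date = java.time.LocalDate.now.toString
    val url  = new URL(s"http://localhost:9200/logs-$date/logline")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(json.getBytes(StandardCharsets.UTF_8))
    conn.getResponseCode // force the request; ignore the body in this sketch
    conn.disconnect()
  }

  private def escape(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  def main(args: Array[String]): Unit = {
    val redis = new Jedis("localhost", 6379)
    while (true) {
      // BLPOP blocks until a line is available; returns [key, value].
      val popped = redis.blpop(0, "app-logs")
      val line   = popped.get(1)
      // In reality fields like client and duration would be extracted here;
      // this sketch just stores the raw line.
      indexIntoEs(s"""{"message": ${escape(line)}}""")
    }
  }
}
```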
So now we have the metrics and dashboards in place, but we don't have any kind of monitoring yet. Monitoring system RAM, CPU and disk activity and alerting when values get high is pretty easy, but we also want to monitor our business metrics. We have developed another small application that queries the entries indexed in Elasticsearch over the last 10 minutes. It monitors the average response time, requests per minute, status codes returned and the number of requests per minute per client, so if the average response time gets very high for some reason, we are alerted immediately by a HipChat notification. We are also alerted if there have been no requests at all, which means something may be down, or if a single client starts making a lot of requests. It also alerts if it hasn't received any logs from a server for a certain period of time; that way we know that the log shipping tool, Beaver, is working. We had a few occasions where it stopped working because we used it together with logrotate, so we were glad we had this monitoring in place.
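The checks themselves boil down to something like the sketch below. The helpers for querying Elasticsearch and posting to HipChat are hypothetical placeholders, and the thresholds are made up for illustration; the real application has more checks and is configurable.

```scala
// A sketch of the kind of checks described above, not the actual application.
// `averageResponseTimeMs`, `requestsLastTenMinutes` and `notifyHipChat` are
// hypothetical helpers: the first two would run aggregation queries against
// Elasticsearch over the last 10 minutes, the last one would POST a message
// to a HipChat room.
object Monitoring {

  def averageResponseTimeMs(): Double = ??? // avg aggregation over last 10 min
  def requestsLastTenMinutes(): Long  = ??? // count query over last 10 min
  def notifyHipChat(message: String): Unit = ??? // POST to the room's notification endpoint

  def runChecks(): Unit = {
    val avg = averageResponseTimeMs()
    if (avg > 2000)
      notifyHipChat(s"Average response time is ${avg}ms over the last 10 minutes")

    if (requestsLastTenMinutes() == 0)
      notifyHipChat("No requests in the last 10 minutes - something may be down")
  }

  def main(args: Array[String]): Unit =
    while (true) { runChecks(); Thread.sleep(10 * 60 * 1000) }
}
```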
I’ve created a simple endpoint to demonstrate the functionality. It’s similar to what our TravelTime platform accepts; of course, in reality you would receive an API key instead of a client name, and you would pass search coordinates and quite a few other parameters.
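For reference, the demo endpoint has roughly the following shape in Play; the parameter names and the response are made up for illustration, not the exact demo code.

```scala
import play.api.libs.json.Json
import play.api.mvc._

// Roughly the shape of the demo endpoint; all names here are hypothetical.
class DemoController extends Controller {

  // GET /search?client=acme&lat=54.89&lng=23.90&minutes=15
  def search(client: String, lat: Double, lng: Double, minutes: Int) = Action {
    // A real handler would run the travel-time search and log the request
    // summary line described earlier; here we just echo the parameters.
    Ok(Json.obj(
      "client"  -> client,
      "lat"     -> lat,
      "lng"     -> lng,
      "minutes" -> minutes
    ))
  }
}
```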