In recent months at ENA we have been using Graylog, a log aggregation system.
We use it to keep track of our interactive and programmatic submission systems.
In this talk I want to share what we have learnt so far and to encourage you to try it with your own projects.
So why did we start using Graylog?
Our submission systems have been around for a long time, and we get errors.
More specifically our submitters get errors and report them to us through our helpdesk.
Two types of submitters tend to go to the trouble of reporting:
The most helpful submitters
The most frustrated submitters
There is also a group of submitters who experience errors but don’t report them.
We don’t know how many of these there are, but they are the ones more likely to give up and submit through NCBI instead.
Our helpdesk creates a JIRA for each reported error and we work through them.
In recent years we have moved away from prioritising based on who shouts the loudest and now attempt to prioritise by potential impact.
However, we are still reactive and tend to deal with a single error in isolation.
When we receive an error, the detective work starts.
An error report can be detailed, but it can also be as simple as “I got error when submitting”.
I first spend time finding out the submitter’s details. I then delve into the logs on each server, looking for stack traces and other logging messages.
Once I have pieced together what has happened, it usually does not take too long to get a fix deployed.
We inform the submitter and I move on to the next JIRA in the queue.
Sometimes the submitter will come back and let us know that the problem has been resolved; sometimes they don’t.
There are limitations with this approach:
It relies on submitters reporting errors
It focuses on individual errors
It is hard to verify that an error has been fixed once and for all
The most important point is that submitters experience errors at all.
It is very easy to fall into thinking that, because we fix errors promptly and submitters are grateful, we are doing a good job.
I don’t believe we are.
Our submission systems should be like air conditioning. You probably have not thought about the air conditioning in this room today as it is working fine. It is boring. It has faded into the background.
You have forgotten that a complicated system is there. Our submission systems should be like that. They should work so smoothly that submitters think they are trivial.
This is why we started looking at Graylog. We wanted to move away from reacting to individual error reports.
Instead we wanted to move towards detecting trends, resolving classes of error and preventing errors before submitters notice.
Graylog provides a central destination to receive and store all logging messages from our applications. It then provides:
search
alerting
analysis
With search, when a submitter reports an error we don’t need to go digging around in server logs any more.
We search Graylog.
When we have found the error we can then see all occurrences. We can also see all logging information around the errors giving us the context we need to understand them.
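For example (the exact field names depend on what your own applications send), a single query such as “source:submission-api AND level:3” pulls up every error-level message from a hypothetical submission-api service; GELF uses the syslog severity codes, and 3 means error.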
By setting up monitors to track the error we get real-time counts and graphs of when it is occurring, which means that when we deploy a fix we can see the graph drop to zero.
We are still reactive to error reports but we are much more effective in resolving them.
We can also define baseline indicators and display them on a dashboard.
If we normally get 10 submissions per hour and it drops to 0 we know something is likely to be wrong.
Likewise, if failed logins rise from 10 per hour to 1000 per hour, we know something is amiss.
We can set up alerts to flag this unusual activity to Slack or email.
We can deal with it before submitters start contacting the helpdesk.
We become proactive.
For me, where Graylog comes into its own though is with analysis. We can model the journey of submitters through the system.
By creating and logging messages that act as checkpoints we can see where submitters are getting stuck and this can be very enlightening.
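As a hypothetical sketch (the class and field names here are illustrative, not our actual code), a checkpoint in a Java application using SLF4J might look like the following; most GELF appenders can be configured to forward MDC values as extra message fields that Graylog can then group and count:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.slf4j.MDC;

    public final class SubmissionCheckpoints {

        private static final Logger log =
                LoggerFactory.getLogger(SubmissionCheckpoints.class);

        // Logs one message per step of the submission journey. The step
        // name goes into the MDC so it arrives in Graylog as a separate
        // field, which makes grouping and counting by checkpoint trivial.
        public static void checkpoint(String submissionId, String step) {
            MDC.put("submission_id", submissionId);
            MDC.put("checkpoint", step);
            try {
                log.info("Checkpoint '{}' reached for submission {}", step, submissionId);
            } finally {
                MDC.remove("submission_id");
                MDC.remove("checkpoint");
            }
        }
    }

A call such as SubmissionCheckpoints.checkpoint(id, "spreadsheet_uploaded") at each stage then lets us chart how many submitters reach each step.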
We can spot trends and use that to target our efforts on error prevention.
For example, one option in our submission process is for the submitter to download a template spreadsheet for entering sample data.
They then return and upload the completed spreadsheet.
With logging in place we found that over half the attempts to upload a spreadsheet were failing.
Submitters saw a message telling them the spreadsheet was not in the right format.
They were not reporting it as an error.
However, it was certainly a source of problems: we could see from the logs that in many cases it prevented them from continuing with their submission.
When we looked at the screen in question we saw there was simply a button labelled “Upload Spreadsheet”.
There was no indication of what type of spreadsheet we were expecting.
In Graylog we were able to group failures into three categories:
Unexpected spreadsheet format - submitters were uploading Excel instead of CSV
Unexpected type of file - submitters were uploading non-spreadsheet files. This issue first came to our attention when a submitter tried to upload a 150MB fasta file. The server ran out of memory trying to parse it.
Unexpected content - the spreadsheets were the submitters’ own spreadsheets and not based on our templates
This feature was causing a very high failure rate but the fix was simple:
Restrict the file extensions that can be uploaded (a sketch of this check follows below)
Provide clear explanatory text about what the upload spreadsheet function is for.
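As an illustration only (a sketch of this kind of guard, not our actual code), the extension check might look like:

    import java.util.Locale;
    import java.util.Set;

    public final class SpreadsheetUploadGuard {

        // We only expect CSV files saved from our template.
        private static final Set<String> ALLOWED_EXTENSIONS = Set.of("csv");

        // Rejects unexpected files by name before any parsing happens,
        // so a 150MB fasta file never reaches the spreadsheet parser.
        public static void checkFilename(String filename) {
            int dot = filename.lastIndexOf('.');
            String ext = dot < 0 ? ""
                    : filename.substring(dot + 1).toLowerCase(Locale.ROOT);
            if (!ALLOWED_EXTENSIONS.contains(ext)) {
                throw new IllegalArgumentException(
                        "Please upload a .csv file based on our template (got: "
                                + filename + ")");
            }
        }
    }

Rejecting by filename is deliberately crude but cheap; the clearer explanatory text then deals with the content failures that a filename check cannot catch.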
When we applied this change we were able to use Graylog to monitor the failure rate.
We could prove that this change eliminated file related failures.
Content failures were also reduced to a fraction of what they were before, avoiding a lot of submitter confusion.
These are some of the reasons we like Graylog.
In this talk I wanted to concentrate more on why Graylog was useful to us rather than how it works.
There is much better documentation and video material online than I could create, but I will give a very brief overview.
Graylog is a standalone server that uses Elasticsearch and MongoDB. Log messages are collected in a JSON-based format called GELF and indexed in Elasticsearch, which provides the search engine; MongoDB stores Graylog’s configuration and metadata.
Our instance is installed on a standard VM provided by technical services. It took me a morning to set up manually.
If you can use Docker or Amazon Web Services it appears much more straightforward, as you can just download pre-configured images.
Graylog itself then provides a web interface that allows setting up of inputs, streams, alerts and dashboards.
There are several different options for getting data into Graylog. We are just using a single input that takes GELF messages on a UDP port.
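For reference, a GELF message is just a small JSON document. Under the GELF 1.1 specification it looks roughly like this (the host and field values here are made up; keys prefixed with an underscore are custom fields):

    {
      "version": "1.1",
      "host": "submission-api-01",
      "short_message": "Checkpoint 'spreadsheet_uploaded' reached",
      "timestamp": 1528109386.507,
      "level": 6,
      "_submission_id": "ERA123456",
      "_checkpoint": "spreadsheet_uploaded"
    }

The level field reuses the syslog severity codes, so 6 is informational and 3 is error.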
We use a library for Logback that provides an appender that converts logging messages into GELF format and delivers them to a specified host and port.
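One such library is the open-source logback-gelf; as an illustration, its configuration looks along these lines (the host and port are placeholders, and details may differ between versions):

    <configuration>
      <!-- Send every log event as a GELF message over UDP to Graylog -->
      <appender name="GELF" class="de.siegmar.logbackgelf.GelfUdpAppender">
        <graylogHost>graylog.example.org</graylogHost>
        <graylogPort>12201</graylogPort>
        <encoder class="de.siegmar.logbackgelf.GelfEncoder"/>
      </appender>

      <root level="INFO">
        <appender-ref ref="GELF"/>
      </root>
    </configuration>

With something like this in place, nothing else in the application code has to change: existing log statements start arriving in Graylog.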
We then use Slack for our alerts.
There is a library of official and third-party plugins, from big-screen visualisation to automatic creation of JIRAs, but we are using Graylog out of the box at the moment.
It is very early days for us with Graylog. We are still learning, still exploring what is possible. We realise we are currently using a small fraction of its potential.
For example, it can cope with 100k messages per second, while we are sending at most 100 per second.
We have not even started looking at plugins, so we have only scratched the surface of what it can do.
However, it already shows real promise, as it enables a different way of working that could make a big difference to submitters’ experience of our services.
We hope they will first notice a more responsive and efficient service when they do encounter errors, and in time find that they are not experiencing errors at all.
I encourage anyone who would like to try Graylog in their team to do so and to share their experience. We have limited time to spend on Graylog in our group, so help will be invaluable.
If enough people are using it and finding it useful it may be something we can ask to be managed centrally.