The ELK Stack @ Inbot
Jilles van Gurp - Inbot Inc.
Who is Jilles?
www.jillesvangurp.com, and @jillesvangurp on everything I've signed up for
Java, (J)Ruby, Python, JavaScript/Node.js
Servers, reluctant DevOps guy, software architecture
Universities of Utrecht (NL), Blekinge (SE), and Groningen (NL)
GX (NL), Nokia Research (FI), Nokia/Here (DE), Localstream (DE), Inbot (DE).
Inbot app - available for Android & iOS
ELK Stack?
Elasticsearch
Logstash
Kibana
Recent trends
Clustered/scalable time series DBs
People other than sysadmins looking at graphs
Databases do some funky stuff these days: aggregations, search
Serverless, Docker, Amazon Lambda, microservices, etc. - where do the logs go?
More moving parts = more logs than ever
Logging
Kind of a boring topic ...
Stuff runs on servers, cloud, whatever
Produces errors, warnings, debug, telemetry, analytics, KPIs, UX events, ...
Where does all this go and how do you make sense of it?
WHAT IS HAPPENING??!?!
Old school: cat, grep, awk, cut, …
Good luck with that on 200GB of unstructured logs from a gazillion microservices
on 40 virtual machines, docker images, etc.
That doesn't really work anymore ...
If you are doing this: you are doing it wrong!
Hadoop ecosystem?
Works great for structured data, if you know what you are looking for.
Requires a lot of infrastructure and hassle.
Not really real-time, tedious to explore data
Some hipster with a Ph.D. will fix it or ...
I’m not a data scientist, are you?
Monitoring/graphing ecosystem
Mostly geared toward measuring stuff like CPU load, IO, memory, etc.
Intended for system administrators
What about the higher level stuff?
You probably should do monitoring but it’s not really what we need either ...
So, ELK ….
Logging
Most languages/servers ship with awful logging defaults; you can fix this
Log enough: not too much, not too little
Log at the right log level ⇒ turn off DEBUG logging in production; use ERROR sparingly (see the level sketch below)
Log metadata so you can pick your logs apart ⇒ metadata == JSON fields
Log opportunistically, it's cheap
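To make the level guidance concrete, a rough sketch in Java/SLF4J; the class, method, and argument names are made up for illustration:

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class LogLevelExamples {
    private static final Logger LOG = LoggerFactory.getLogger(LogLevelExamples.class);

    void handleRequest(String userId, String orderId, Exception cause) {
        LOG.debug("cache miss for user {}", userId);              // DEBUG: off in production
        LOG.info("user {} signed up", userId);                    // INFO: normal business event
        LOG.warn("slow response, retrying order {}", orderId);    // WARN: degraded but handled
        LOG.error("payment failed for order {}", orderId, cause); // ERROR: somebody needs to look at this
    }
}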
Too much logging
Your Elasticsearch cluster dies, or you pay a fortune to keep data around that you don’t need.
Not enough logging
Something happened, but you don’t know what because there’s nothing in the logs;
you can't find the relevant events because metadata is missing.
You will spend what you saved in cost, and probably more, on finding out WTF is going on.
Log entries in ELK
{
  "message": "[3017772.750979] device-mapper: thin: 252:0: unable to service pool target messages in READ_ONLY or FAIL mode",
  "@timestamp": "2016-08-16T09:50:01.000Z",
  "type": "syslog",
  "host": "10.1.6.7",
  "priority": 3,
  "timestamp": "Aug 16 09:50:01",
  "logsource": "ip-10-1-6-7",
  "program": "kernel",
  "severity": 3,
  "facility": 0,
  "facility_label": "kernel",
  "severity_label": "Error"
}
Plumbing your logs
Simple problem: given some logs, convert them into JSON and shove them into
Elasticsearch.
Lots of components to help you do that: Logstash, the Docker GELF driver, Beats, etc.
If you can, log JSON natively: e.g. the Logback logstash encoder (logstash-logback-encoder), http://jsonlines.org/ (example below)
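For example, assuming logstash-logback-encoder is on the classpath with its LogstashEncoder configured on the appender, structured arguments come out as top-level JSON fields, one line per entry; the field names below are illustrative:

import static net.logstash.logback.argument.StructuredArguments.kv;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class JsonLogging {
    private static final Logger LOG = LoggerFactory.getLogger(JsonLogging.class);

    void onSignup(String userId, String plan) {
        // with the LogstashEncoder this becomes a single JSON line (jsonlines style)
        // with "user_id" and "plan" as fields next to "message" and "@timestamp"
        LOG.info("user signed up", kv("user_id", userId), kv("plan", plan));
    }
}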
Inbot technical setup
Ca. 40 Amazon EC2 instances, most of which have Docker containers
VPC with several subnets and a DMZ.
Testing, production, and dev environments + dev infrastructure.
AWS comes with monitoring & alerts for basic stuff.
Everything logs to http://logs-internal.inbot.io/
Elasticsearch 2.2.0, Logstash 2.2.1, Kibana 4.4.1
1 week data retention, 14M events/day
Demo time
Things to watch out for
Avoid split-brain and other nasty ES failure modes -> RTFM & configure ...
Data retention policies are not optional
Use curator https://github.com/elastic/curator
Customise your mappings; changing them sucks on a live logstash cluster.
Dynamic mappings on fields that sometimes look like a number will break shit.
Running out of CPU credits in Amazon can kill your ES cluster
ES rolling restarts take time when you have 6 months of logs
Mapped Diagnostic Context (MDC)
Common in Java logging frameworks: log4j, slf4j, logback, etc.
Great for adding context to your logs
E.g. user_id, request url, host name, environment, headers, user agent, etc.
Makes it easy to slice and dice your logs
// values put into the MDC become fields on every log entry until they are removed
MDC.put("user_id", "123");
LOG.info("some message");  // this entry gets a user_id field in Elasticsearch
MDC.remove("user_id");     // don't forget, or the value leaks into unrelated log entries
MDC for node.js: our log4js fork
https://github.com/joona/log4js-node
Allows for MDC-style attributes
Sorry: works for us but not in shape for a pull request; maybe later.
But: this was an easy hack.
MdcContext
https://github.com/Inbot/inbot-utils/blob/master/src/main/java/io/inbot/utils/MdcContext.java
try (MdcContext ctx = MdcContext.create()) {  // AutoCloseable wrapper around the MDC
    ctx.put("user_id", "123");
    LOG.info("some message");
}  // user_id is removed from the MDC automatically when the try block exits
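For reference, a minimal sketch of what an AutoCloseable MDC wrapper like this can look like; this shows the idea, not the actual inbot-utils implementation:

import java.util.HashSet;
import java.util.Set;
import org.slf4j.MDC;

public class MdcContext implements AutoCloseable {
    private final Set<String> keys = new HashSet<>();

    public static MdcContext create() {
        return new MdcContext();
    }

    public void put(String key, String value) {
        keys.add(key);  // remember the key so it can be cleaned up on close
        MDC.put(key, value);
    }

    @Override
    public void close() {
        keys.forEach(MDC::remove);  // remove everything that was put through this context
    }
}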
Application Metrics
http://metrics.dropwizard.io/
Add counters, timers, gauges, etc. to your business logic.
// gauge exposing how many connections the http client pool currently has leased
metrics.register("httpclient_leased", new Gauge<Integer>() {
    @Override
    public Integer getValue() {
        return connectionManager.getTotalStats().getLeased();
    }
});
Our reporter uses the MDC to log all metrics once per minute: a giant JSON blob, but it works (a stock reporter is sketched below).
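For comparison, the stock Slf4jReporter that ships with Dropwizard metrics can be wired up like this; our MDC-based reporter is a custom variant along the same lines:

import java.util.concurrent.TimeUnit;

import org.slf4j.LoggerFactory;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.Slf4jReporter;

public class MetricsSetup {
    public static Slf4jReporter startReporter(MetricRegistry metrics) {
        Slf4jReporter reporter = Slf4jReporter.forRegistry(metrics)
                .outputTo(LoggerFactory.getLogger("metrics"))
                .convertRatesTo(TimeUnit.SECONDS)
                .convertDurationsTo(TimeUnit.MILLISECONDS)
                .build();
        reporter.start(1, TimeUnit.MINUTES);  // dump all registered metrics once per minute
        return reporter;
    }
}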
Docker Gelf driver
Configure your Docker hosts to log the output of any Docker container via the GELF log driver.
Command, container id, etc. become fields in the log entry.
Nice as a fallback when you don't control the logging.
/usr/bin/docker daemon --log-driver=gelf --log-opt gelf-address=udp://logs-internal.inbot.io:12201
Thanks
@jillesvangurp, @inbotapp
