Slides for my talk at the Austrian Perl Workshop in Salzburg on October 10th.
A video of the talk can be found at https://www.youtube.com/watch?v=4Qj-_eimGuE
2. Logging is Like Lego
Many
Interchangeable
Options
Not the focus of this talk
3. Our Journey
• Almost no logging when I joined in 2008
• Incremental improvements as a background
project over years
• Currently capturing 600-900 logs / minute
from ~200 machines
• Not claiming "best practice", just some
hopefully useful tips from our long journey
4. Log file per-application
• Adopted Log::Log4perl
• Wrote utility function to add a log file
• Intercept warnings and fatal exceptions
• Simple layout with timestamp and severity
6. Capture Warnings
$SIG{__WARN__} = sub {

    # protect against infinite recursion
    return warn @_ ## no critic (RequireCarping)
        if $within_log_sig
        or not defined $Log::Log4perl::Logger::ROOT_LOGGER;
    local $within_log_sig = 1;

    local $Log::Log4perl::caller_depth = $Log::Log4perl::caller_depth + 1;

    chomp(my $msg = shift);
    get_logger()->warn($msg);
};
7. Capture Fatal Exceptions
$SIG{__DIE__} = sub {

    return if $^S; # We're in an eval, so ignore it
    die @_ if not defined $^S; # Parsing module/eval

    # protect against infinite recursion
    die @_ ## no critic (RequireCarping)
        if $within_log_sig
        or not defined $Log::Log4perl::Logger::ROOT_LOGGER;
    local $within_log_sig = 1;

    local $Log::Log4perl::caller_depth = $Log::Log4perl::caller_depth + 1;

    chomp(my $msg = shift);
    get_logger()->fatal($msg);
    die "$msg\n"; # may duplicate message but that's better than losing it
};
8. Were there any errors?
log4perl.rootLogger = INFO, TLScreen, TLErrorBuffer

log4perl.appender.TLErrorBuffer = TigerLead::Log::Appender::RecentSummaryBuffer
log4perl.appender.TLErrorBuffer.Threshold = ERROR
log4perl.appender.TLErrorBuffer.max_messages = 10
log4perl.appender.TLErrorBuffer.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.TLErrorBuffer.layout.ConversionPattern = %m{chomp}

Ring buffer for log messages.
Used at the end of old batch job code to decide if something went wrong.
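The RecentSummaryBuffer appender itself is not shown in these slides. As a rough sketch of the ring buffer idea only, assuming a hypothetical RecentBuffer class that is not a real Log4perl appender, something like this would keep the most recent messages plus a running total:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the ring buffer idea; this is NOT the real
# TigerLead::Log::Appender::RecentSummaryBuffer, whose code is not shown.
package RecentBuffer;

sub new {
    my ($class, %args) = @_;
    return bless {
        max_messages => $args{max_messages} || 10,
        messages     => [],
        total        => 0,
    }, $class;
}

# Record a message, discarding the oldest once the buffer is full.
sub add {
    my ($self, $msg) = @_;
    $self->{total}++;
    push @{ $self->{messages} }, $msg;
    shift @{ $self->{messages} }
        while @{ $self->{messages} } > $self->{max_messages};
    return;
}

sub total    { return $_[0]{total} }            # messages seen overall
sub messages { return @{ $_[0]{messages} } }    # most recent, oldest first

package main;

my $errors = RecentBuffer->new(max_messages => 3);
$errors->add("error $_") for 1 .. 5;

# At the end of a batch job: did anything go wrong?
if ($errors->total) {
    printf "%d errors, most recent:\n", $errors->total;
    print "  $_\n" for $errors->messages;
}
```

Keeping only the last few messages bounds memory use while still letting the job report whether anything went wrong.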
9. State of play
• Timestamped log message with severity etc
• Per-app log files
• Can tell if warnings or errors were produced
But:
• Not capturing stdout/stderr & non-perl apps
12. Capturing stdout/stderr
setsid $start_daemons_command 2>&1 | setsid $capture_logs_command &

setsid puts daemons into a separate process group, isolated from the terminal.
Capture stdout/stderr from all child processes and pipe to the logger process.
The logger process is also in a separate, isolated process group.
We use daemontools, so for us:
start_daemons_command="svscan $supervise_dir"
capture_logs_command="multilog t s1000000 n100 dir $logdir"
multilog t prepends high-resolution timestamps to log messages
multilog t accuracy depends on when the log was flushed
multilog s1000000 n100 dir does log rotation for us
The logger exits only when all child processes have closed stdout/stderr,
even if they've become daemons, forked more child processes and died.
14. State of play
• Capturing stdout/stderr & non-perl apps
But:
• We had to log in to see what was happening
• No single place to watch errors and
warnings across the systems
• Wanted to parse log messages to extract
more useful info
15. Log Stream-Store-View
Stream:
Logstash – collect, edit, and forward logs
Store:
Elasticsearch – real-time distributed search
and analytics engine. JSON REST over Lucene
View:
Kibana – browser based analytics and search
dashboard for Elasticsearch
20. Our ELK setup
• Started with single machine
• Now using three machines
• Logstash, Elasticsearch and Kibana on each
• Elasticsearch cluster across all three
• HAProxy load balancer in front of all three
22. syslog forwarding
• Forwarding system syslog was easy first step
• We're using CentOS6 with rsyslog v7.6
• Started forwarding notice+ severity messages
but now forward info+
23. Rsyslog forwarding
# buffering config
$WorkDirectory /var/lib/rsyslog # where to place spool files
$ActionQueueFileName logstash # unique name prefix for spool files
$ActionQueueMaxDiskSpace 1g # 1gb space limit
$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
$ActionQueueType LinkedList # run asynchronously
$ActionResumeRetryCount -1 # infinite retries if host is down

# forward info+ level logs from all facilities to logstash
*.info @@logstash-app-stag.tigerlead.local:5544;RSYSLOG_ForwardFormat

# RSYSLOG_ForwardFormat gives us high-resolution timestamp and timezone
# We use TCP (not UDP) for reliability; may switch to RELP later
26. Ship our logs to logstash
• Wanted to parse messages but didn't want
to do that on the central logstash server
• Started with a Message::Passing utility to tail
and parse specific log files and ship as JSON
• Turned out we don't need much parsing
• Now using an extra rsyslogd that follows log
files and forwards to the local root rsyslogd
27. Flow of log messages
[Diagram. Labels: Apps, Files, Shipper, System rsyslog, common queue, logstash, ES, Kibana]
28. Flow of log messages
[Diagram. Labels: Apps, Files, rsyslog, System rsyslog, common queue, logstash, ES, Kibana]
29. Eradicating 'our' log files
• Still have our 'app log files' separate from the
'system log files' in /var/log/*
• Harder to correlate events between them
• Experiment: use syslog for more/everything?
• Want: per-app log files, high-res timestamp
with lexical ordering (sort -m *.log | ...)
• Let the system look after log rotation etc
30. Send app logs to syslog
log4perl.rootLogger = INFO, TLScreen, TLErrorBuffer, TLSyslog

log4perl.appender.TLSyslog = TigerLead::Log::Appender::Syslog
log4perl.appender.TLSyslog.layout = Log::Log4perl::Layout::PatternLayout
log4perl.appender.TLSyslog.layout.ConversionPattern = %m{chomp} [@%F{1}:%L %M{1}()}]%n

The syslog format provides program name, severity and pid.
31. Eradicating 'our' log files
template( name="sortable_log_format" type="string" # format for log lines
    # e.g. "2014-06-28 17:47:11.636078 $facility.$severity $program: $message"
    string="%TIMESTAMP:::date-pgsql%.%TIMESTAMP:::date-subseconds% %PRI-TEXT% %syslogtag%%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n"
)

template( name="file_per_programname" type="string" # format for log file names
    # e.g. program="run-parts(/etc/cron.hourly)"
    # becomes "/var/log/tiger/run-parts" using the 'leading safe characters'
    string="/var/log/tiger/%programname:R,ERE,0,ZERO:^[-_a-zA-Z0-9]+--end%.log"
)

ruleset(name="write_tiger_progname_log_files") {
    action( Type="omfile" Template="sortable_log_format"
            DynaFile="file_per_programname")
}

if ( ($syslogseverity <= 5) or not ($programname == [ ... ]) ) then {
    call write_tiger_progname_log_files
}
32. Flow of log messages
[Diagram. Labels: Apps, Files (twice), rsyslog, System rsyslog, common queue, logstash, ES, Kibana]
33. Logstash Enrichment #1
hostgroup - first word of server name
• handy to focus in on a group of servers
related to a particular service
punct - just the punctuation chars
• handy to focus on, or exclude, a particular
'shape' of message
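We compute these two fields in logstash, and that config is not shown here. As a small Perl sketch of the same idea, with hypothetical function names and assuming the 'first word' of a host name is its leading run of letters:

```perl
use strict;
use warnings;

# Assumption: the 'first word' of a host name is its leading run of letters.
sub hostgroup {
    my ($host) = @_;
    return $host =~ /^([A-Za-z]+)/ ? lc $1 : '';
}

# Keep only the punctuation characters, giving the 'shape' of the message.
sub punct {
    my ($message) = @_;
    (my $shape = $message) =~ s/[^[:punct:]]//g;
    return $shape;
}

my $host    = 'web-03.example.com';   # made-up host name
my $message = 'GET /index.html?id=42 failed: timeout (30s)';
printf "hostgroup=%s punct=%s\n", hostgroup($host), punct($message);
```

Messages that differ only in their words and numbers share the same punct value, which is what makes it handy for filtering.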
35. State of play
• No longer had to log in to multiple machines
to see what was happening
• Can easily drill-down to explore the logs
from multiple machines and systems
• Can share a URL to that view - very handy
But now:
• Want to be able to live-stream errors
36. Live-stream to IRC
• Separate production and staging channels
• Currently just error severity or higher
• Messages with 'alert' or 'emergency' severity
are also sent to main developer channel
• Proven to be very useful
37. Live-stream to IRC
But:
• occasionally have floods of messages
• logstash irc rate limiting behaviour is dumb
• want to rate-limit only 'repeated' messages
• 'repeated' should allow for minor differences
• logstash can help...
38. Enrichment: message_gist
mutate {
    add_field => [ "message_gist", "%{message}" ] # copy to edit
}
mutate {
    # normalize numbers
    gsub => [ "message_gist", "[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?", "N" ]
    # normalize double quoted strings
    gsub => [ "message_gist", "\"[^\"]*\"", "S" ]
    # normalize single quoted strings, but try to avoid matching apostrophes
    gsub => [ "message_gist", "(\A|\W)'[^']*'(?!\w)", "\1S" ]
    # truncate urls to remove the query/fragment part
    gsub => [ "message_gist", "(\w:/[^?#\s]*)\S*", "\1" ]
}
fingerprint { # convert the normalized string into an integer hash
    source => "message_gist"
    target => "message_gist"
    method => "MURMUR3"
}
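To see what the normalization does, here is a rough Perl equivalent of the substitutions above. The function name is hypothetical, and Digest::MD5 is only a stand-in for the MURMUR3 fingerprint because it ships with Perl.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Reduce a message to its 'gist' so that minor differences hash the same.
sub message_gist {
    my ($gist) = @_;
    $gist =~ s/[-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?/N/g;  # numbers
    $gist =~ s/"[^"]*"/S/g;                                 # "double quoted"
    $gist =~ s/(\A|\W)'[^']*'(?!\w)/${1}S/g;                # 'single quoted'
    $gist =~ s{(\w:/[^?#\s]*)\S*}{$1}g;                     # URL query/fragment
    return md5_hex($gist);   # stand-in for the MURMUR3 fingerprint
}

my $gist_a = message_gist('Failed to load user 1234 from "db-a" after 2.5s');
my $gist_b = message_gist('Failed to load user 99 from "db-b" after 10s');
my $gist_c = message_gist('Connection refused');

print "a and b ", ($gist_a eq $gist_b ? "share" : "do not share"), " a gist\n";
print "a and c ", ($gist_a eq $gist_c ? "share" : "do not share"), " a gist\n";
```

Both of the first two messages normalize to "Failed to load user N from S after Ns", so they can be rate-limited as repeats of each other.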
39. Enrichment: repeat tag
if [severity] and [severity] =~ /0|1|2|3|4/ {

    throttle {
        period => 60 # seconds

        before_count => -1
        after_count => 2 # allow N within period before throttling

        key => "%{hostgroup}%{severity}%{program}%{message_gist}"
        max_counters => 10000 # track this many variants

        add_tag => "repeat"
    }

    # may add a more strict 'duplicate' tag here in future
    # using period=>5, after_count=>1, and %{message} not %{message_gist}
}
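The counting behind the repeat tag can be sketched in Perl as below. This is a simplified illustration, not the logstash throttle filter's actual implementation: is_repeat is a hypothetical helper, it uses a fixed window per key, and it has no max_counters limit.

```perl
use strict;
use warnings;

my %counters;   # key => { start => epoch seconds, count => events so far }

# Return true if this event should be tagged 'repeat', i.e. it is beyond
# the first $after_count events for this key within $period seconds.
sub is_repeat {
    my ($key, $now, $period, $after_count) = @_;
    my $counter = $counters{$key};
    if (!$counter or $now - $counter->{start} >= $period) {
        # first event for this key, or the previous window has expired
        $counter = $counters{$key} = { start => $now, count => 0 };
    }
    return ++$counter->{count} > $after_count;
}

# Five events with the same key, arriving at these times (in seconds).
my @tags = map { is_repeat('web|3|myapp|gist', $_, 60, 2) ? 'repeat' : '-' }
           (0, 1, 2, 3, 61);
print "@tags\n";
```

The first two events in each 60 second window pass through untagged and the rest are tagged, so the IRC output can skip them.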
40. Enrichment: late tag
# flooding may cause a backlog that delays messages reaching logstash
# tag messages that arrive 'late'
ruby {
    code => "
        msg_age = Time.now - event['@timestamp']

        if msg_age >= +60 then msg_tag = 'late' # delayed
        elsif msg_age <= -60 then msg_tag = 'early' # craziness
        end

        if msg_tag
        then
            event.tag msg_tag
            event['message_delay'] = msg_age.to_i # age
        end
    "
}
41. Better IRC live-stream
if [severity] and [severity] =~ /0|1|2|3|4/
    and "repeat" not in [tags]
    and (![message_delay] or [message_delay] < 600) # not too 'late'
{
    if [severity] =~ /0|1|2|3/ { # 4 (warning) is currently too noisy
        irc {
            channels => [ "#logprod" ]
            messages_per_second => 10
            format => "%{severity_label} %{host} %{program}: %{message}"
        }
    }
    if [severity] =~ /0|1/ { # emergency and alert only
        irc {
            channels => [ "#l2dev" ]
            messages_per_second => 5
            format => "%{severity_label} %{host} %{program}: %{message}"
        }
    }
}
42. Flow of log messages
[Diagram. Labels: Apps, Files (twice), rsyslog, System rsyslog, common queue, logstash, ES, Kibana, IRC]
43. State of play
• Live-stream to IRC, promotes awareness
• Developers work to reduce spurious noise
But now we want more context:
• "what was the app working on when that
warning or error was triggered?"
• "what was the web request URL?"
or "what were the async job parameters?"
44. How to get context?
• Add more info into every log message text,
then parse it out again? Not ideal.
• Start by capturing all the HTTP access logs
• Could do log-shipping for each access log file
• But all traffic passes through HAProxy
• So HAProxy logging can give us everything
45. HAProxy logs
• already had haproxy notice+ messages
• now added haproxy traffic logs,
first HTTP then TCP as well
• can include one request and response cookie
• plus multiple request and response headers
48. Logstash for HAProxy
• change the host field (and thus hostgroup) to
the backend machine name, so the logs from
haproxy appear to be coming from the
appropriate machine
• parse out request URL parameters
• decode URL parameters
49. Logstash for HAProxy
# extract the request url params into a 'params' hash
mutate { gsub => [ "request", "#.*", "" ] } # remove fragment, if any, first
kv { source => "request" field_split => "&?" target => "params" }

# XXX disabled re https://github.com/elasticsearch/logstash/issues/1695
# urldecode { field => "params" all_fields => true }

if [response] >= 500 {
    mutate { replace => [ "severity", "4", "severity_label", "warn" ] }
}
else if [response] >= 400 {
    mutate { replace => [ "severity", "5", "severity_label", "notice" ] }
}

mutate { # replace raw message with a human friendly version to view/search on
    gsub => [ "request", "\?.*", "" ] # remove params now we've extracted them
    replace => [ "message", "%{be_host} %{client_ip} %{Tw}/%{Tc}/%{Tt}ms %{bytes_in}b %{bytes_out}b %{response} %{verb} %{request}" ]
}
(Abridged!)
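For comparison, the parameter extraction and decoding steps can be sketched in plain Perl. This illustrates the idea only and is not what logstash's kv and urldecode filters do internally; request_params is a hypothetical helper and the URL is made up.

```perl
use strict;
use warnings;

# Split a request URL into a hash of decoded query parameters.
sub request_params {
    my ($request) = @_;
    $request =~ s/#.*//s;                           # remove fragment first
    my (undef, $query) = split /\?/, $request, 2;   # text after the first '?'
    my %params;
    for my $pair (split /&/, defined $query ? $query : '') {
        my ($name, $value) = split /=/, $pair, 2;
        next unless defined $name and length $name;
        $value = '' unless defined $value;
        for ($name, $value) {                       # URL-decode both in place
            tr/+/ /;
            s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        $params{$name} = $value;
    }
    return \%params;
}

my $params = request_params('/search?q=perl+logging&page=2&tag=a%2Fb#top');
print "$_=$params->{$_}\n" for sort keys %$params;
```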
50. State of play
• now have detailed TCP and HTTP traffic logs
But:
• still parsing textual messages
• still hard to handle multi-line messages
• still don't have contextual data for logs
• still can't correlate http to application logs
51. Log as JSON from app
• Parsing textual log messages to extract data
that your own code put there is a bit dumb
• Log as JSON lines instead (jsonlines.org)
• Opens the door to logging extra information
• Bonus: solves the multi-line message problem,
at least for perl apps
52. Log::Log4perl::Layout::JSON
log4perl.rootLogger = INFO, TLScreen, TLFile, TLErrorBuffer, TLSyslogJSON

log4perl.appender.TLSyslogJSON = TigerLead::Log::Appender::Syslog
log4perl.appender.TLSyslogJSON.Threshold = INFO
log4perl.appender.TLSyslogJSON.layout = Log::Log4perl::Layout::JSON
log4perl.appender.TLSyslogJSON.layout.prefix = @cee: # used as tag
log4perl.appender.TLSyslogJSON.layout.field.message = %m
log4perl.appender.TLSyslogJSON.layout.field.src_file = %F{1}
log4perl.appender.TLSyslogJSON.layout.field.src_sub = %M{1}
log4perl.appender.TLSyslogJSON.layout.field.src_line = %L

Example output (spaces and line breaks added for clarity):

2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
"message":"...\n...\n...", "src_file":"Foo.pm", "src_sub":"frobnicate",
"src_line":"18" }

Note that src_file, src_sub and src_line used to be appended to the message text.
53. Decoding JSON in logstash
grok {
    # @cee: is syslog 'CEE Event Flag' per https://cee.mitre.org/
    match => { message => "^@cee: ?%{GREEDYDATA:cee_data}" }
    add_tag => [ "cee" ]
    tag_on_failure => []
}

if ("cee" in [tags]) {
    json {
        source => "cee_data"
        remove_field => [ "cee_data" ]
    }
}
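The same two steps (spot the @cee: prefix, then decode the rest as JSON) look like this in Perl. A minimal sketch with a hypothetical decode_cee helper, using the core JSON::PP module; malformed JSON would make decode_json die, which a real consumer would want to trap.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Return the decoded event hash for an '@cee:' message, or nothing
# if the message does not carry the CEE prefix.
sub decode_cee {
    my ($message) = @_;
    my ($cee_data) = $message =~ /^\@cee: ?(.*)/s
        or return;                  # plain text message, leave it alone
    return decode_json($cee_data);
}

my $event = decode_cee(
    '@cee:{"message":"line 1\nline 2","src_file":"Foo.pm","src_line":"18"}'
);
print "$event->{src_file}:$event->{src_line}\n$event->{message}\n";
```

The multi-line message travels as a single syslog line and only becomes two lines again after decoding.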
54. State of play
• now have rich JSON formatted log messages
• multi-line messages are no longer a problem
But:
• still only very basic contextual data for logs
• still can't correlate http to application logs
55. "Context Data"
• Significant items of 'ambient information'
• The current 'things being worked on'
• Would like that info added to any log msgs
• Including warnings and fatal exceptions
(e.g. if hooked via $SIG{__WARN__})
56. Context Data
for my $foo_id (@list_of_foo_ids) {

    # we want the current $foo_id value to be included
    # in any log messages in this scope

    do_something_useful($foo_id);
}

# we DON'T want $foo_id to be included in any future log messages
57. Context Data
• Put the 'ambient information' in a hash
• Add the contents of the hash to the JSON
• Use local to limit the scope
58. Context Data
for my $foo_id (@list_of_foo_ids) {
    local log_context->{foo_id} = $foo_id; # simple!
    do_something_useful($foo_id);
}
The imported log_context utility:
sub log_context { return \%Log::Log4perl::MDC::MDC_HASH }
The Log::Log4perl::Layout::JSON config line:
log4perl.appender.TLSyslogJSON.layout.include_mdc = 1
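A self-contained sketch of how local scopes the context follows. It assumes a plain %CONTEXT hash as a stand-in for the Log4perl MDC hash so that it runs without Log4perl, and do_something_useful here just records what a logger would have seen.

```perl
use strict;
use warnings;

# Stand-in for %Log::Log4perl::MDC::MDC_HASH (an assumption for this sketch).
our %CONTEXT;
sub log_context { return \%CONTEXT }

my @seen;   # the context a logger would have seen at each call
sub do_something_useful {
    push @seen, join ',', map { "$_=$CONTEXT{$_}" } sort keys %CONTEXT;
    return;
}

for my $foo_id (7, 8) {
    my $context = log_context();
    local $context->{foo_id} = $foo_id;   # undone at the end of each iteration
    do_something_useful();
}
do_something_useful();   # outside the loop: foo_id is gone again

print "[$_]\n" for @seen;
```

Because local restores the previous state on any scope exit, the context is also cleaned up when the block is left via an exception.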
59. Context Data
Context added to root hash by default:
2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
"message":"...\n...\n...", "src_file":"Foo.pm", "src_sub":"frobnicate",
"src_line":"18", "foo_id":42 }
Optionally put context data items into a nested hash:
log4perl.appender.TLSyslogJSON.layout.name_for_mdc = extra_stuff

2014-10-08 12:56:28.641086 local0.info 70-lead-basic-t[13374]: @cee:{
"message":"...\n...\n...", "src_file":"Foo.pm", "src_sub":"frobnicate",
"src_line":"18", "extra_stuff":{ "foo_id":42 } }
60. State of play
• now have easy way to add contextual data
• array and hash refs work (keep it small)
But:
• what contextual data should we include?
• request URL? decoded parameters?
• expensive to include in every message
61. HAProxy Correlation
• We have a stream of haproxy logs
• We have a stream of application logs
• Want to be able to correlate them
"what HTTP request caused this warning?"
• Add unique-id to HTTP log & HTTP header
62. HAProxy Configuration
defaults
    mode http
    unique-id-format %{+X}o\ %ci:%cp_%fi:%fp_%Ts_%rt:%pid
    unique-id-header X-TLXID
    log-format %ci\ [%t]\ %ft\ %b/%s\ %Tw/%Tc/%Tt\ %U\ %B\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %ID\ %{+Q}r\ %ST\ %Tq/%Tr\ %{+Q}CC\ %{+Q}hr\ %{+Q}CS\ %{+Q}hs

• HAProxy now generates a unique-id for each HTTP request
• Adds it to the HTTP request as an X-TLXID header
• Includes the unique-id value in the syslog message
63. Capture X-TLXID
package TigerLead::Plack::Middleware::SetUpLogContext;
use strict;
use warnings;
use parent qw( Plack::Middleware );

use Plack::Request;
use TigerLead::Log qw(log_context);

sub call {
    my($self, $env) = @_;

    my $req = Plack::Request->new($env);
    # reset log context at start of a new request
    %{log_context()} = (tlxid => scalar $req->header('X-TLXID'));

    return $self->app->($env);
}

1; # a module file must return a true value
64. Correlation
• Given any log message from a web app we
can now find the HTTP request that was
being processed at the time
• That includes the session cookie, so we can
view the stream of requests for that session
• Demo...