Slides for my talk "Never gonna give you up, never gonna let you down - 7 lessons learned building high availability, high performance systems".
Presented at Codemotion Berlin 2015
4. @EdMcBane
The challenge
● Leading European client
● Innovative service for the consumer market
● Non-trivial userbase (400K+ users)
● High request rate
● Low latency requirement (<< RTT)
Challenges
● Support for required failover modes
● Support for required scale-out/scale-up modes
● Operability in general
○ and monitoring in particular
● Most important of all: avoiding complexity
LESS(1) General Commands Manual LESS(1)
NAME
less - opposite of more
SYNOPSIS
less -?
less --help
less -V
less --version
less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
[-b space] [-h lines] [-j line] [-k keyfile]
[-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
[-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
[-# shift] [+[+]cmd] [--] [filename]...
(See the OPTIONS section for alternate option syntax with long option
names.)
DESCRIPTION
LESS IS similar to MORE (1), but has many more features.
Less does not have to read the entire input file before starting, so
with large input files it starts up faster than text editors like vi
(1). Less uses termcap (or terminfo on some systems), so it can run on
Manual page less(1) line 1 (press h for help or q to quit) .
SO_REUSEPORT
For TCP, so_reuseport allows multiple
listener sockets to be bound to the same
port.
Received packets are distributed to
multiple sockets bound to the same port
using a 4-tuple hash.
With so_reuseport the distribution is
uniform.
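A minimal sketch of the behaviour described above, assuming Linux 3.9+ and a Python build that exposes `socket.SO_REUSEPORT`. The helper name `make_listener` and the loopback address are illustrative choices, not something from the talk.

```python
import socket


def make_listener(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Create a TCP listener that shares its port with other listeners."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Every socket sharing the port must set SO_REUSEPORT before bind().
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


if __name__ == "__main__":
    first = make_listener(0)            # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    second = make_listener(port)        # second listener on the same port
    print("two listeners bound to port", port)
    first.close()
    second.close()
```

In a real deployment each listener would typically live in its own worker process or thread, and the kernel would spread incoming connections across them.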
Suggestions
● Prefer open source solutions
○ when things break, you want to be able to fix them
● Be skeptical
○ pick any software, chances are it is crap
○ +1 for open source, you can “peek under the hood”
● Do not use tools you do not fully understand
○ or as I’d rather say...
TCP_TW_RECYCLE
Enable fast recycling TIME-WAIT sockets.
Default value is 0. It should not be changed
without advice/request of technical experts.
Linux will drop any segment from the remote
host whose timestamp is not strictly bigger
than the latest recorded timestamp
TCP_TW_RECYCLE + NAT = MADNESS
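Why NAT makes this dangerous can be shown with a toy model. This is a deliberately simplified sketch, not the kernel's logic: the class name `PerHostTimestampCheck`, the IP address and the timestamp values are all made up for illustration. The one fact it relies on is that TCP timestamps come from each client's own clock, so two machines behind the same NAT address present unrelated values.

```python
class PerHostTimestampCheck:
    """Toy model: remember the latest TCP timestamp seen per remote IP."""

    def __init__(self) -> None:
        self.latest = {}  # remote IP -> latest timestamp accepted

    def accept(self, remote_ip: str, timestamp: int) -> bool:
        last = self.latest.get(remote_ip)
        # Drop anything whose timestamp is not strictly bigger than the last one.
        if last is not None and timestamp <= last:
            return False
        self.latest[remote_ip] = timestamp
        return True


if __name__ == "__main__":
    server = PerHostTimestampCheck()
    nat_ip = "203.0.113.7"  # hypothetical public address shared by two clients

    # Client A has been up for a long time, client B has just booted.
    print("client A accepted:", server.accept(nat_ip, 5_000_000))
    print("client B accepted:", server.accept(nat_ip, 1_000))
```

From the server's point of view both clients are the same host, so the client with the "older" clock is silently dropped.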
...but be prepared to improvise
“Processes designed for ordinary times are not
resilient in a crisis and need to be changed.”
Dave Snowden
Easier said than done
“No, improvising is wonderful.
But, the thing is that you cannot
improvise unless you know
exactly what you're doing.”
Christopher Walken
Also from Walken...
“At its best, life is completely
unpredictable.”
“Everybody has to be a little
lucky, I think.”
“I try not to worry about things
I can't do anything about.”
No one size fits all
● “Monitor everything”, like “100% test coverage”,
is a nice slogan, nothing more.
● Each environment requires a slightly different
solution
● Balance between data availability, cost and
ability to keep it actionable
We are doing logging wrong
● Unstructured
● Inconsistent
● Poor defaults
● Complex, obscure components
● A huge waste of computing power
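As a contrast to the "unstructured" and "inconsistent" points above, here is one possible sketch of structured logging using only the Python standard library. The `JsonFormatter` class, the `fields` attribute and the field names (`request_id`, `latency_ms`) are assumptions made for this example, not a recommendation from the talk.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Extra key/value pairs passed as extra={"fields": {...}} are merged in.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("api")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("request served",
             extra={"fields": {"request_id": "abc123", "latency_ms": 12}})
```

Emitting the same keys on every line is what makes the output machine-readable and easier to cross-reference with metrics and alerts.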
We need a complete overview
● Logs
● Metrics
● Alerts
● Together, coherent, cross-referenced
○ correlating different stores poses challenges
“Human beings, who are almost unique in
having the ability to learn from the
experience of others, are also remarkable
for their apparent disinclination to do so.”
Douglas Adams