This presentation gives a lot of insights into Jimdo's infrastructure, which hosts 20 million websites. To enable our application developers to quickly launch and improve their services, we've created a platform called Wonderland that does all the infrastructure work for them.
In this talk, I present the parts of Wonderland related to monitoring and logging. You can learn about our Prometheus setup as well as how we stream log messages from Docker to Logstash.
5. WONDERLAND
• Jimdo’s internal PaaS that runs 250 services
• 2,500 Docker containers at a time
• 600 deployments per day
6. WONDERLAND
[Architecture diagram: the Wonderland infrastructure-automation APIs sit on top of AWS and other service providers; monitoring/logging, CLI tools, and other tooling are built around them.]
7. WONDERLAND
[Diagram: the Wonderland API drives AWS ECS; each EC2 instance runs an ECS agent, a logging daemon, and a metric daemon.]
8. IMAGINE…
• Your team is responsible for the software component that delivers the websites of 20m customers
• You are on call tonight
11. 4:01 AM
Partial outage of the web delivery component
12. PAGERDUTY CALLS
• either because a health check failed
• or because a metric exceeded a configured threshold
13. [Diagram: alerts originate either from health checks or from Prometheus, routed through the Alertmanager.]
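On the Alertmanager side, paging is a matter of routing configuration. A minimal sketch of a PagerDuty receiver (the receiver name and key placeholder are illustrative, not Jimdo's actual configuration):

```yaml
# alertmanager.yml (fragment)
route:
  receiver: on-call
receivers:
  - name: on-call
    pagerduty_configs:
      - service_key: <pagerduty-service-key>
```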
14. API HEALTH CHECKS
• All services on Wonderland: Route 53 health checks
• Infrastructure components: Pingdom checks

GET /health
HTTP/1.1 200 OK
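The contract is simple: the checker issues GET /health and expects a 200. A minimal sketch of such an endpoint (handler and port are illustrative, not Jimdo's actual implementation):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # A real service would also verify its critical dependencies here.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the health checker's frequent polling out of the logs

def start_health_server(port):
    """Run the health endpoint on a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Route 53 and Pingdom then only need the URL and the expected status code.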
15. WORKER HEALTH CHECKS
• Workers notify a health check service after each execution:
  • Prometheus Pushgateway
  • cronitor.io
  • healthchecks.io
• If no notification arrives for a certain time, an alert is created
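With the Pushgateway variant, the worker pushes a success timestamp after each run and an alert rule fires when the metric goes stale. A sketch of such a rule (metric name, job name, and threshold are assumptions, not Jimdo's actual setup):

```yaml
# prometheus rule file (fragment)
groups:
  - name: worker-health
    rules:
      - alert: WorkerMissedRun
        # job_last_success_timestamp is pushed to the Pushgateway by the worker
        expr: time() - job_last_success_timestamp{job="backup_worker"} > 3600
        for: 5m
        labels:
          severity: page
```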
16. SEMANTIC MONITORING / SYNTHETIC MONITORING
Run tests against production periodically, monitor the results, and alert on issues
19. GRAFANA
• Each service running on Wonderland automatically has a dashboard showing key metrics for debugging
• Developers can create custom dashboards for more detailed analysis
• Grafana pulls data from Prometheus instances
20. PROMETHEUS
• Semi-centralized metric system
• Pull-based metric retrieval
• On-the-fly calculation of derived metrics
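Derived metrics are computed at query time with PromQL rather than pre-aggregated by the application. For example, turning a raw request counter (as used later in these slides) into a per-second rate per service:

```promql
# per-second HTTP request rate over the last 5 minutes, summed per service
sum by (service_name) (rate(http_requests_total[5m]))
```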
21. METRICS
• INFRASTRUCTURE METRICS
• SYSTEM METRICS
• APPLICATION METRICS
22. INFRASTRUCTURE METRICS
[Diagram: Prometheus scrapes the CloudWatch exporter (backed by AWS) and custom exporters (backed by the Wonderland APIs).]
23. EXAMPLES

aws_autoscaling_group_desired_capacity_average{
  auto_scaling_group_name="crims",
  job="cloudwatch_exporter"
}

aws_elb_request_count_sum{
  cluster="crims",
  job="wonderland_elb_exporter",
  service_name="web-prod"
}
24. SYSTEM METRICS
[Diagram: Prometheus scrapes collectd and cAdvisor on each instance.]
25. EXAMPLES

container_memory_rss{
  container_label_cluster="crims",
  container_label_container_name="web-prod--web",
  image="web-prod:abc123",
  instance="10.8.4.91:9104",
  job="crims_cadvisor_metrics"
}

collectd_memory{
  instance="10.8.4.42:9103",
  job="crims_collectd_metrics",
  memory="free"
}
26. APPLICATION METRICS
[Diagram: Prometheus scrapes GET /metrics from container A, container B, …]
27. SERVICE DISCOVERY
[Diagram: the service discovery downloader gets scrape targets from the Wonderland API, locates the containers, updates the Prometheus config and reloads it; Prometheus then scrapes the metrics from container A, container B, …]
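One common way to wire such a downloader into Prometheus is file-based service discovery: the downloader rewrites target files and Prometheus picks up changes. A sketch (paths and job name are assumptions, not Jimdo's actual configuration):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: application_metrics
    file_sd_configs:
      # the service discovery downloader rewrites these files;
      # Prometheus reloads them without a restart
      - files:
          - /etc/prometheus/targets/*.json
```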
30. FEDERATION
[Diagram: a long-term Prometheus (32 days retention) scrapes filtered metrics from a short-term Prometheus (30 min retention).]

'match[]':
  - '{job="application_metrics", instance=""}'
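The match[] filter above is passed as a parameter to the short-term server's /federate endpoint. A minimal federation scrape config might look like this (the target address is an assumption):

```yaml
# prometheus.yml on the long-term server (fragment)
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="application_metrics", instance=""}'
    static_configs:
      - targets:
          - short-term-prometheus:9090
```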
31. FEDERATION
[Diagram: the short-term Prometheus keeps the raw per-instance series, while the long-term Prometheus scrapes only the filtered, aggregated series.]

http_requests_total{instance="10.8.3.101:80"}
http_requests_total{instance="10.8.3.102:80"}
http_requests_total{instance="10.8.3.103:80"}
...

job:http_requests_total:sum{}
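The aggregated series follows the Prometheus recording-rule naming convention (level:metric:operation). A rule producing it could look like this (a sketch, not Jimdo's actual rule file):

```yaml
# recording rules on the short-term server (fragment)
groups:
  - name: aggregation
    rules:
      # drop the instance label so the federated series stays small
      - record: job:http_requests_total:sum
        expr: sum without (instance) (http_requests_total)
```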
35. CENTRALISED LOGGING
• Centralised logging is a must-have in a distributed system
• It should be very easy to gather all information that concerns a service
36. CENTRALISED LOGGING
• Output of all services running on Wonderland is stored centrally
• Optionally, logs are parsed with configurable formats

$ cat wonderland.yaml
---
components:
  - name:
    image: my-nginx-image
    logging:
      types:
        - access_log
        - error_log_nginx
37. CENTRALISED LOGGING
[Diagram: Docker → (fluentd protocol) → Logbeat → (lumberjack protocol) → logz.io]

Wonderland Logbeat
• receives logs via the fluentd protocol,
• parses them,
• adds metadata,
• and streams them to our logging provider logz.io
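Docker hands logs to such a daemon via its built-in fluentd logging driver. In a Compose file, that wiring might look like this (the address and tag are assumptions, not Jimdo's actual values):

```yaml
# docker-compose.yml (fragment)
services:
  web:
    image: my-nginx-image
    logging:
      driver: fluentd
      options:
        # Logbeat listens for the fluentd protocol on this address
        fluentd-address: 127.0.0.1:24224
        tag: web-prod
```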
39. THE TRUTH
[Diagram: Docker → (fluentd protocol) → Logbeat → (lumberjack protocol) → logz.io]
40. THE TRUTH
[Diagram: Docker → (fluentd) → Logbeat → (lumberjack) → logz.io, and in parallel Docker → log-stream → (syslog) → papertrail.com]
We are in a migration right now.
41. 4:17 AM
You find this log message of the service autoscaler:

Unable to scale-out service "web-delivery". Configured maximum number of instances reached.
42. 4:17 AM
You increase the maximum number of instances:

$ cat wonderland.yaml
[…]
auto-scaling:
  min-instances: 60
  max-instances: 150
44. 2:00 PM
In the PMA for this night’s incident, you create the action item:

Monitor the number of instances of web-delivery to detect potential breaches of auto-scaling limits before they affect the system’s health
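Such an action item can be implemented as a Prometheus alert on the CloudWatch exporter metrics shown earlier. A sketch (the max-size metric name and the 90% threshold are assumptions about the exporter's output, not Jimdo's actual rule):

```yaml
# prometheus rule file (fragment)
groups:
  - name: autoscaling-limits
    rules:
      - alert: AutoScalingGroupNearMaximum
        # fire before the group is pinned at its configured maximum
        expr: >
          aws_autoscaling_group_desired_capacity_average
            / aws_autoscaling_group_max_size_average > 0.9
        for: 15m
        labels:
          severity: warning
```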
48. FURTHER READING / SOURCES
• Beyer, Jones, Petoff & Murphy: Site Reliability Engineering
• Susan Fowler: Production-Ready Microservices
• Sam Newman: Building Microservices
• Stripe / Increment: On-Call (https://increment.com/on-call/)
• Mathias Lafeldt & Paul Seiffert: A Journey Through Wonderland (https://speakerdeck.com/mlafeldt/a-journey-through-wonderland)
49. PHOTOS
• Marcel Stockmann: https://www.flickr.com/photos/marcelstockmann/33068471286
• Michael Theis: https://www.flickr.com/photos/huskyte/6931056896