This presentation gives a lot of insights into Jimdo's infrastructure, which hosts 20 million websites. To enable our application developers to quickly launch and improve their services, we've created a platform called Wonderland that does all the infrastructure work for them.
In this talk, I present the parts of Wonderland related to monitoring and logging. You can learn about our Prometheus setup as well as how we stream log messages from Docker to Logstash.
5. WONDERLAND
• Jimdo’s internal PaaS that runs 250 services
• 2,500 Docker containers at a time
• 600 deployments per day
6. WONDERLAND
[Architecture diagram: the Wonderland infrastructure-automation APIs sit on top of AWS and other service providers; monitoring/logging, CLI tools, and other tooling are built around them.]
7. WONDERLAND
[Diagram: the Wonderland API drives AWS ECS; each EC2 instance runs an ECS agent, a logging daemon, and a metric daemon.]
8. IMAGINE…
• Your team is responsible for the software component that delivers the websites of 20m customers
• You are on call tonight
11. 4:01 AM
Partial outage of the web delivery component
12. PAGERDUTY CALLS
• either because a health check failed
• or because a metric exceeded a configured threshold
13. [Diagram: alerts originate either from health checks or from Prometheus, routed through the Alertmanager.]
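On the Alertmanager side, paging is a matter of routing configuration. A minimal sketch of a PagerDuty receiver (the receiver name and key placeholder are illustrative, not Jimdo's actual configuration):

```yaml
# alertmanager.yml (fragment)
route:
  receiver: on-call
receivers:
  - name: on-call
    pagerduty_configs:
      - service_key: <pagerduty-service-key>
```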
14. API HEALTH CHECKS
• All services on Wonderland: Route 53 health checks
• Infrastructure components: Pingdom checks

GET /health
HTTP/1.1 200 OK
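The contract is simple: the checker issues GET /health and expects a 200. A minimal sketch of such an endpoint (handler and port are illustrative, not Jimdo's actual implementation):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/health":
            # A real service would also verify its critical dependencies here.
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"OK")
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, fmt, *args):
        pass  # keep the health checker's frequent polling out of the logs

def start_health_server(port):
    """Run the health endpoint on a background thread and return the server."""
    server = HTTPServer(("127.0.0.1", port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

Route 53 and Pingdom then only need the URL and the expected status code.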
15. WORKER HEALTH CHECKS
• Workers notify a health check service after each execution:
  • Prometheus Pushgateway
  • cronitor.io
  • healthchecks.io
• If no notification arrives for a certain time, an alert is created
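With the Pushgateway variant, the worker pushes a success timestamp after each run and an alert rule fires when the metric goes stale. A sketch of such a rule (metric name, job name, and threshold are assumptions, not Jimdo's actual setup):

```yaml
# prometheus rule file (fragment)
groups:
  - name: worker-health
    rules:
      - alert: WorkerMissedRun
        # job_last_success_timestamp is pushed to the Pushgateway by the worker
        expr: time() - job_last_success_timestamp{job="backup_worker"} > 3600
        for: 5m
        labels:
          severity: page
```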
16. SEMANTIC MONITORING / SYNTHETIC MONITORING
Run tests against production periodically, monitor the results, and alert on issues
19. GRAFANA
• Each service running on Wonderland automatically has a dashboard showing key metrics for debugging
• Developers can create custom dashboards for more detailed analysis
• Grafana pulls data from Prometheus instances
20. PROMETHEUS
• Semi-centralized metric system
• Pull-based metric retrieval
• On-the-fly calculation of derived metrics
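Derived metrics are computed at query time with PromQL rather than pre-aggregated by the application. For example, turning a raw request counter (as used later in these slides) into a per-second rate per service:

```promql
# per-second HTTP request rate over the last 5 minutes, summed per service
sum by (service_name) (rate(http_requests_total[5m]))
```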
21. METRICS
• INFRASTRUCTURE METRICS
• SYSTEM METRICS
• APPLICATION METRICS
22. INFRASTRUCTURE METRICS
[Diagram: Prometheus scrapes the CloudWatch exporter (backed by AWS) and custom exporters (backed by the Wonderland APIs).]
23. EXAMPLES

aws_autoscaling_group_desired_capacity_average{
  auto_scaling_group_name="crims",
  job="cloudwatch_exporter"
}

aws_elb_request_count_sum{
  cluster="crims",
  job="wonderland_elb_exporter",
  service_name="web-prod"
}
24. SYSTEM METRICS
[Diagram: Prometheus scrapes collectd and cAdvisor on each instance.]
25. EXAMPLES

container_memory_rss{
  container_label_cluster="crims",
  container_label_container_name="web-prod--web",
  image="web-prod:abc123",
  instance="10.8.4.91:9104",
  job="crims_cadvisor_metrics"
}

collectd_memory{
  instance="10.8.4.42:9103",
  job="crims_collectd_metrics",
  memory="free"
}
26. APPLICATION METRICS
[Diagram: Prometheus scrapes GET /metrics from container A, container B, …]
27. SERVICE DISCOVERY
[Diagram: the service discovery downloader gets scrape targets from the Wonderland API, locates the containers, updates the Prometheus config and reloads it; Prometheus then scrapes the metrics from container A, container B, …]
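One common way to wire such a downloader into Prometheus is file-based service discovery: the downloader rewrites target files and Prometheus picks up changes. A sketch (paths and job name are assumptions, not Jimdo's actual configuration):

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: application_metrics
    file_sd_configs:
      # the service discovery downloader rewrites these files;
      # Prometheus reloads them without a restart
      - files:
          - /etc/prometheus/targets/*.json
```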
30. FEDERATION
[Diagram: a long-term Prometheus (32 days retention) scrapes filtered metrics from a short-term Prometheus (30 min retention).]

'match[]':
  - '{job="application_metrics", instance=""}'
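The match[] filter above is passed as a parameter to the short-term server's /federate endpoint. A minimal federation scrape config might look like this (the target address is an assumption):

```yaml
# prometheus.yml on the long-term server (fragment)
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job="application_metrics", instance=""}'
    static_configs:
      - targets:
          - short-term-prometheus:9090
```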
31. FEDERATION
[Diagram: the short-term Prometheus keeps the raw per-instance series, while the long-term Prometheus scrapes only the filtered, aggregated series.]

http_requests_total{instance="10.8.3.101:80"}
http_requests_total{instance="10.8.3.102:80"}
http_requests_total{instance="10.8.3.103:80"}
...

job:http_requests_total:sum{}
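The aggregated series follows the Prometheus recording-rule naming convention (level:metric:operation). A rule producing it could look like this (a sketch, not Jimdo's actual rule file):

```yaml
# recording rules on the short-term server (fragment)
groups:
  - name: aggregation
    rules:
      # drop the instance label so the federated series stays small
      - record: job:http_requests_total:sum
        expr: sum without (instance) (http_requests_total)
```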
35. CENTRALISED LOGGING
• Centralised logging is a must-have in a distributed system
• It should be very easy to gather all information that concerns a service
36. CENTRALISED LOGGING
• Output of all services running on Wonderland is stored centrally
• Optionally, logs are parsed with configurable formats

$ cat wonderland.yaml
---
components:
  - name:
    image: my-nginx-image
    logging:
      types:
        - access_log
        - error_log_nginx
37. CENTRALISED LOGGING
[Diagram: Docker → (fluentd protocol) → Logbeat → (lumberjack protocol) → logz.io]

Wonderland Logbeat
• receives logs via the fluentd protocol,
• parses them,
• adds metadata,
• and streams them to our logging provider logz.io
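Docker hands logs to such a daemon via its built-in fluentd logging driver. In a Compose file, that wiring might look like this (the address and tag are assumptions, not Jimdo's actual values):

```yaml
# docker-compose.yml (fragment)
services:
  web:
    image: my-nginx-image
    logging:
      driver: fluentd
      options:
        # Logbeat listens for the fluentd protocol on this address
        fluentd-address: 127.0.0.1:24224
        tag: web-prod
```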
39. THE TRUTH
[Diagram: Docker → (fluentd protocol) → Logbeat → (lumberjack protocol) → logz.io]
40. THE TRUTH
[Diagram: Docker → (fluentd) → Logbeat → (lumberjack) → logz.io, and in parallel Docker → log-stream → (syslog) → papertrail.com]
We are in a migration right now.
41. 4:17 AM
You find this log message of the service autoscaler:

Unable to scale-out service "web-delivery". Configured maximum number of instances reached.
42. 4:17 AM
You increase the maximum number of instances:

$ cat wonderland.yaml
[…]
auto-scaling:
  min-instances: 60
  max-instances: 150
44. 2:00 PM
In the PMA for this night’s incident, you create the action item:

Monitor the number of instances of web-delivery to detect potential breaches of auto-scaling limits before they affect the system’s health
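Such an action item can be implemented as a Prometheus alert on the CloudWatch exporter metrics shown earlier. A sketch (the max-size metric name and the 90% threshold are assumptions about the exporter's output, not Jimdo's actual rule):

```yaml
# prometheus rule file (fragment)
groups:
  - name: autoscaling-limits
    rules:
      - alert: AutoScalingGroupNearMaximum
        # fire before the group is pinned at its configured maximum
        expr: >
          aws_autoscaling_group_desired_capacity_average
            / aws_autoscaling_group_max_size_average > 0.9
        for: 15m
        labels:
          severity: warning
```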
48. FURTHER READING / SOURCES
• Beyer, Jones, Petoff & Murphy: Site Reliability Engineering
• Susan Fowler: Production-Ready Microservices
• Sam Newman: Building Microservices
• Stripe / Increment: On-Call (https://increment.com/on-call/)
• Mathias Lafeldt & Paul Seiffert: A Journey Through Wonderland (https://speakerdeck.com/mlafeldt/a-journey-through-wonderland)
49. PHOTOS
• Marcel Stockmann: https://www.flickr.com/photos/marcelstockmann/33068471286
• Michael Theis: https://www.flickr.com/photos/huskyte/6931056896