How logging makes a private cloud a better cloud
Dec/01/2016
Kentaro Sasaki
Global Operations Department, Rakuten, Inc.
2
Rakuten is …
a Tokyo-based e-commerce and Internet company
3
Rakuten Ecosystem
The Rakuten Ecosystem and our
membership database form the
foundation of our business
4
Membership
116.52 Million persons
Gross Transaction Volume
7.6 Trillion JPY
5
Logging Infrastructure for Private Cloud
6
Private Cloud at Rakuten
7
Timeline of Private Cloud History
Gen1 (2010)
Hypervisor: Xen
OS Instances: 2,000+
Management features from scratch
Gen2 (2012)
Hypervisor: VMware ESXi
OS Instances: 25,000+
Management features from scratch
Gen3 (2015)
Hypervisor: KVM
Use OpenStack API
8
Logging Matters
9
Benefits
Logging enables log visualization
Makes analysis and debugging easier
From a business point of view
Shorten the time spent on troubleshooting
Leads to better customer support
10
Assumptions
Messages might be unmanageable
Growing log volume requires huge log storage
Concerns
How to take care of data loss
How to parse data from different sources
11
Log Management
12
High Availability
Availability, Redundancy and Scalability
Maintainability
Minimal data loss and operational overhead
13
Huge Number of Targets
Hundreds of Hypervisors (ESXi & KVM)
Tens of thousands of VMs
Cover many sorts of logs
Splunk is suited for log analytics
Need a time-series DB for performance logs
Splunk
InfluxDB
14
Overview of Our Logging Infrastructure
15
Logging Infrastructure
[Diagram: event logs and performance logs; components shown are Fluentd, Metricbeat, Kafka, Splunk & PagerDuty, InfluxDB & Grafana, Google Cloud Storage, and Cloud Foundry]
16
Event Logging Infrastructure
17
Event Logs in OpenStack
18
Huge number of log files
22 log files in a single cluster
Manage logs for every Region & Availability Zone
Manage unmanageable logs
CRITICAL messages are unmanageable
Need a strong analytical storage engine
Component   # Log files
Nova        8
Keystone    1
Neutron     6
Glance      2
Cinder      5
etc.        etc.
2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
...
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder
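As a rough illustration of the parsing concern raised earlier, the excerpt above can be split into fields with a regular expression. This is only a sketch: the field names (timestamp, pid, level, module, message) are assumptions read off the sample lines, not a documented OpenStack log schema, and the slides do not say how parsing was actually done.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: timestamp, PID, level, module, message.
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>\S+)"
    r"(?:\s+(?P<message>.*))?$"
)


def parse_log_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%Y-%m-%d %H:%M:%S"
    )
    record["pid"] = int(record["pid"])
    # TRACE lines may carry no message at all.
    record["message"] = record["message"] or ""
    return record


sample = (
    "2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response "
    "from the storage volume backend API: volume group cinder-volumes doesn't exist"
)
parsed = parse_log_line(sample)
print(parsed["level"], parsed["module"], parsed["message"])
```

Lines that do not follow this layout return None, so a caller can count or route them separately instead of dropping them silently.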
19
Event Logs in VMware
20
Almost all VMware logs
Event logs from vSphere
Warning and error logs from ESXi
SAN storage logs
Error logs from multi-vendor SAN storage
21
Log storage for Event logs: Splunk
System Configuration
Splunk v6.4.x (as of Nov 2016)
Using Indexer cluster and Search head cluster
Manage huge data
150+ GB input size per day
30+ TB indexed data size
22
Input size per day
Indexed data size
23
Alerting and Reporting on Splunk
24
OpenStack logs
26 alerts
16 dashboards for reporting
VMware logs
68 alerts
12 dashboards for reporting (e.g. Visualize number of errors)
25
Useful alerting function
Integrates with PagerDuty
Strong analytical engine
Manage and analyze almost all types of logs
Manage unmanageable logs
26
Performance Logging Infrastructure
27
Log Collector Requirements
28
Handle log streams
Support various log file formats
Strong parsing engine
User-friendly agent
Minimal compute resource usage
Pluggable Architecture
29
Log Collector: Fluentd, Metricbeat
30
HVs and Storage Performance logs
31
OpenStack Hosts logs
Use Fluentd exec plugin for getting nf_conntrack_count
Metricbeat v5 for cpu, mem, diskio, filesystem, network
VMware HVs and SAN logs
Use In-house Fluentd custom plugin for getting
Output to InfluxDB and analyze on Grafana
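A script run by the Fluentd exec plugin for nf_conntrack_count could look roughly like the following. This is a sketch under assumptions: the slides do not show the actual command, the procfs path is the usual Linux location when the nf_conntrack module is loaded, and the JSON key name is chosen here for illustration.

```python
import json
import sys

# Usual Linux procfs location when the nf_conntrack module is loaded
# (an assumption; the slides do not show the path that was used).
CONNTRACK_COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path=CONNTRACK_COUNT_PATH):
    """Read the current number of tracked connections as an int."""
    with open(path) as f:
        return int(f.read().strip())


def main():
    try:
        count = read_conntrack_count()
    except (OSError, ValueError) as exc:
        # Report the failure on stderr so stdout carries only JSON lines.
        print("could not read conntrack count: %s" % exc, file=sys.stderr)
        return 1
    # One JSON object per line on stdout.
    print(json.dumps({"nf_conntrack_count": count}))
    return 0


if __name__ == "__main__":
    main()
```

Printing one JSON object per run keeps the script stateless; the collector decides how often to run it.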
32
VMs Performance logs from Hypervisors
33
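#!/usr/bin/env python
# Print one JSON object per active libvirt domain.
import json
import libvirt

# Read-only connection using libvirt's default URI.
conn = libvirt.openReadOnly()
for dom_id in conn.listDomainsID():
    dom = conn.lookupByID(dom_id)
    print(json.dumps({
        "uuid": dom.UUIDString(),
        "name": dom.name(),
        "id": dom.ID(),
        # vcpus()[0] is the per-vCPU info list; [3] is kept as on the slide.
        "vcpus": dom.vcpus()[0][3],
    }))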
From KVM (OpenStack)
Use libvirt Python bindings to build the custom scripts
Generate JSON data and read it with the Fluentd in_tail plugin
From ESXi (VMware)
Get logs from vCenter
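For the KVM path, the JSON records have to land in a file that a tailing collector can follow. A minimal sketch of that step is below; the helper name `append_records`, the added `time` field, and the sample record are all hypothetical and only illustrate the one-JSON-object-per-line shape.

```python
import json
import os
import tempfile
import time


def append_records(path, records, now=None):
    """Append each record as one JSON line, stamped with a Unix time."""
    stamp = int(time.time() if now is None else now)
    with open(path, "a") as f:
        for record in records:
            line = dict(record, time=stamp)
            f.write(json.dumps(line, sort_keys=True) + "\n")


# Hypothetical sample in the shape the libvirt script prints.
sample = [
    {"uuid": "00000000-0000-0000-0000-000000000000", "name": "vm-example", "id": 1},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "vm_stats.log")
    # A fixed timestamp keeps the example reproducible.
    append_records(path, sample, now=1480550400)
    with open(path) as f:
        print(f.read(), end="")
```

Appending rather than rewriting the file matters here, because a tailing reader only picks up lines added after its last read position.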
34
Log streaming: Kafka
35
Kafka Specs
Kafka v0.10.0
Runs on OpenStack with all-SSD storage
System Configuration
100-500 partitions and 3 replicas per topic
Back up important logs to GCS
Forward to another Kafka cluster (if necessary)
[Diagram: Kafka, Google Cloud Storage, Kafka]
36
Log storage for Performance logs:
InfluxDB and Grafana
37
InfluxDB
Run InfluxDB v1.1.0 on physical servers
Post to multiple InfluxDB instances by using Kafka and Fluentd
Grafana
72 dashboards for visualizing performance data
Access multiple InfluxDB instances via a load balancer
Kafka
Grafana
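Performance records reach InfluxDB as points, and InfluxDB's text write format (line protocol) is `measurement,tag=value field=value timestamp`. The sketch below builds such a line for numeric and boolean fields only. The measurement, tag, and field names are hypothetical, and the slides do not show how the Fluentd output was configured.

```python
import math


def _escape(text, chars):
    """Backslash-escape each character in `chars` within `text`."""
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def _format_field(value):
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return "%di" % value  # integers carry an "i" suffix
    if isinstance(value, float) and math.isfinite(value):
        return repr(value)
    raise TypeError("unsupported field value: %r" % (value,))


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement,tag=v,... field=v,... timestamp."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += ",%s=%s" % (_escape(key, ",= "), _escape(str(tags[key]), ",= "))
    body = ",".join(
        "%s=%s" % (_escape(key, ",= "), _format_field(fields[key]))
        for key in sorted(fields)
    )
    return "%s %s %d" % (head, body, timestamp_ns)


print(to_line_protocol(
    "vm_cpu",                                 # hypothetical measurement name
    {"host": "hv 01", "name": "vm-example"},  # hypothetical tags
    {"vcpus": 4, "usage": 12.5},
    1480550400000000000,
))
```

String fields are left out on purpose, since their quoting rules differ from those for tags; a fuller version would need to handle them separately.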
38
Fluentd - Useful Log Collector
Fluentd handles various log formats and makes logs easy to parse
Minimum resource usage
Redundant system
Achieve InfluxDB mirroring with Kafka and Fluentd
Minimize data loss by transporting logs to Kafka; additionally back up to GCS
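One way to picture mirroring through a log stream is that each InfluxDB instance is fed by its own consumer, which reads the whole topic at its own pace. The toy model below uses in-memory stand-ins, not real Kafka, Fluentd, or InfluxDB clients. The class names are hypothetical, and the actual topology behind the slides is not shown.

```python
class Topic:
    """In-memory stand-in for a Kafka topic: an append-only record list."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class MirrorConsumer:
    """Stand-in for one consumer feeding one InfluxDB instance."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0   # each consumer tracks its own read position
        self.store = []   # stand-in for the InfluxDB instance

    def poll(self):
        """Copy every record this consumer has not seen into its store."""
        new = self.topic.records[self.offset:]
        self.store.extend(new)
        self.offset += len(new)
        return len(new)


topic = Topic()
primary = MirrorConsumer(topic)
mirror = MirrorConsumer(topic)

topic.append({"measurement": "cpu", "value": 0.42})
primary.poll()
topic.append({"measurement": "cpu", "value": 0.57})
primary.poll()
mirror.poll()  # the mirror was behind, but catches up from its own offset

print(primary.store == mirror.store)
```

Because the topic retains records, a consumer that was down for a while can resume from its last offset, which is the property that limits data loss in this arrangement.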
39
Summary
40
Two logging engines
Splunk for event logs, InfluxDB for performance logs
Cover all of our requirements
Makes troubleshooting, visualization, analysis and improvement easy
41
Our logging infra makes our private cloud a better cloud

How logging makes a private cloud a better cloud - OpenStack Latest Information Seminar (December 2016)
