How logging makes a private cloud a better cloud - OpenStack Latest Information Seminar (December 2016)

How logging makes a private cloud a better cloud
Dec/01/2016
Kentaro Sasaki
Global Operations Department, Rakuten, Inc.

2
Rakuten is …
a Tokyo-based e-commerce and Internet company

3
Rakuten Ecosystem
The Rakuten Ecosystem and our membership database form the foundation of our business

4
Membership
116.52 Million persons
Gross Transaction Volume
7.6 Trillion JPY

5
Logging Infrastructure for Private Cloud

7
Timeline of Private Cloud History
Gen1 (2010)
Hypervisor: Xen
OS Instances: 2,000+
Management features from scratch
Gen2 (2012)
Hypervisor: VMware ESXi
OS Instances: 25,000+
Management features from scratch
Gen3 (2015)
Hypervisor: KVM
Use OpenStack API

9
Benefits
Logging enables log visualization
Makes analysis and debugging easier
From a business point of view
Shortens the time spent on troubleshooting
Leads to better customer support

10
Assumptions
Messages might be unmanageable
Growing log volume requires huge log storage
Concerns
How to handle data loss
How to parse data from different sources

12
High Availability
Availability, Redundancy and Scalability
Maintainability
Minimum data loss and operation overhead

13
Huge Number of Targets
Hundreds of Hypervisors (ESXi & KVM)
Tens of thousands of VMs
Cover many sorts of logs
Splunk is suited to log analytics
Need a time-series DB for performance logs
Splunk
InfluxDB

14
Overview of Our Logging Infrastructure

15
Logging Infrastructure
[Diagram: event log and performance log pipelines. Components shown: Fluentd, Metricbeat, Kafka, Splunk & PagerDuty, InfluxDB & Grafana, Google Cloud Storage, Cloud Foundry]

16
Event Logging Infrastructure

18
Huge Number of log files
22 log files in a single cluster
Manage logs for every Region and Availability Zone
Manage unmanageable logs
CRITICAL messages are unmanageable
Need a strong analytical storage engine
Component    # Log files
Nova         8
Keystone     1
Neutron      6
Glance       2
Cinder       5
etc.         etc.
2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
...
2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
2013-02-25 21:05:51 17409 TRACE cinder
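To illustrate why lines like these need a parsing step before they can be counted or alerted on, here is a minimal Python sketch that splits the "date time pid LEVEL module message" layout of the sample above into fields and counts records per level. The regular expression and the function names are assumptions made for this illustration; they are not taken from the slides, and real OpenStack log formats vary with configuration.

```python
import re
from collections import Counter

# Assumed layout, based only on the sample lines above:
#   <date> <time> <pid> <LEVEL> <module> [message]
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<module>\S+)"
    r"(?:\s+(?P<message>.*))?$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def count_levels(lines):
    """Count log records per level, skipping lines that do not parse."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record:
            counts[record["level"]] += 1
    return counts


sample = [
    "2013-02-25 21:05:51 17409 CRITICAL cinder [-] Bad or unexpected response "
    "from the storage volume backend API: volume group cinder-volumes doesn't exist",
    "2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or "
    "unexpected response from the storage volume backend API: volume group "
    "cinder-volumes doesn't exist",
    "2013-02-25 21:05:51 17409 TRACE cinder",
]
print(count_levels(sample))
```

In the deck, this kind of parsing and analysis is delegated to Splunk; the sketch only shows the shape of the problem.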

20
Almost all VMware logs
Event logs from vSphere
Warning and error logs from ESXi
SAN storage logs
Error logs from multi-vendor SAN storage

21
Log storage for Event logs: Splunk

22
System Configuration
Splunk v6.4.x (as of Nov 2016)
Uses an indexer cluster and a search head cluster
Manage huge data
150+ GB input per day
30+ TB of indexed data
[Charts: input size per day; indexed data size]

23
Alerting and Reporting on Splunk

24
OpenStack logs
26 alerts
16 dashboards for reporting
VMware logs
68 alerts
12 dashboards for reporting (e.g. Visualize number of errors)

25
Useful alerting function
Integrates with PagerDuty
Strong analytical engine
Manage and analyze almost all types of logs
Manage unmanageable logs

26
Performance Logging Infrastructure

28
Handle log streams
Support various log file formats
Strong parsing engine
User-friendly agent
Minimal compute resource usage
Pluggable Architecture

29
Log Collector: Fluentd, Metricbeat

30
HVs and Storage Performance logs

31
OpenStack Hosts logs
Use the Fluentd exec plugin to get nf_conntrack_count
Metricbeat v5 for cpu, mem, diskio, filesystem, network
VMware HVs and SAN logs
Use an in-house custom Fluentd plugin for getting
Output to InfluxDB and analyze in Grafana
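As a rough idea of what a command run by the Fluentd exec input could look like for nf_conntrack_count, the Python sketch below reads the counter from procfs and prints one JSON object. The procfs path is the standard Linux location when the nf_conntrack module is loaded; the record field names and function names are assumptions for this example, and the deck does not show the actual script or Fluentd configuration.

```python
#!/usr/bin/env python
import json
import time

# Standard Linux procfs location when the nf_conntrack module is loaded.
CONNTRACK_COUNT_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path=CONNTRACK_COUNT_PATH):
    """Return the current number of tracked connections, or None if unavailable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def build_record(count, now=None):
    """Build the record to print; the field names are hypothetical."""
    return {
        "time": int(now if now is not None else time.time()),
        "nf_conntrack_count": count,
    }


if __name__ == "__main__":
    count = read_conntrack_count()
    # Print nothing when the counter cannot be read, so no bogus value is emitted.
    if count is not None:
        print(json.dumps(build_record(count)))
```

Printing a single JSON object per run keeps the collector side simple: the agent only has to run the command at an interval and parse its output as JSON.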

32
VMs Performance logs from Hypervisors

33
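#!/usr/bin/env python
# Print one JSON object per running domain (VM) on this hypervisor.
import json
import libvirt

# Read-only connection to the local hypervisor (default URI).
conn = libvirt.openReadOnly()
for dom_id in conn.listDomainsID():
    dom = conn.lookupByID(dom_id)
    print(json.dumps({
        "uuid": dom.UUIDString(),
        "name": dom.name(),
        "id": dom.ID(),
        # vcpus()[0] is the per-vCPU info list; its length is the vCPU count.
        "vcpus": len(dom.vcpus()[0]),
    }))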
From KVM (OpenStack)
Use the libvirt Python bindings to build custom scripts
Generate JSON data and read it with the Fluentd in_tail plugin
From ESXi (VMware)
Get logs from vCenter

35
Kafka Specs
Kafka v0.10.0
Runs on OpenStack with all-SSD storage
System Configuration
100 to 500 partitions and 3 replicas per topic
Back up important logs to GCS
Transfer to another Kafka cluster (if necessary)
[Diagram: Kafka, Google Cloud Storage, Kafka]

36
Log storage for Performance logs: InfluxDB and Grafana

37
InfluxDB
Runs InfluxDB v1.1.0 on physical servers
Writes to multiple InfluxDB instances via Kafka and Fluentd
Grafana
72 dashboards for visualizing performance data
Accesses multiple InfluxDB instances via a load balancer
[Diagram: Kafka, Grafana]
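For readers unfamiliar with how a JSON-style metric ends up in InfluxDB, the sketch below formats a record as an InfluxDB line protocol string (measurement, tags, fields, nanosecond timestamp). In the deck this conversion is handled by Fluentd; the function names, the measurement name vm_cpu, and the sample tags and fields are assumptions invented for this illustration.

```python
def escape_key(text):
    """Escape characters that are special in tag keys, tag values and field keys."""
    for ch in (",", "=", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def escape_measurement(text):
    """Escape characters that are special in measurement names."""
    for ch in (",", " "):
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value: booleans, integers (i suffix), floats, quoted strings."""
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build one line: measurement[,tag=value...] field=value[,...] timestamp."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{escape_key(k)}={escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(
        f"{escape_key(k)}={format_field(v)}" for k, v in sorted(fields.items())
    )
    return f"{escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample record.
line = to_line_protocol(
    "vm_cpu",
    {"host": "hv-01", "name": "instance 1"},
    {"vcpus": 4, "usage": 12.5},
    1480550400000000000,
)
print(line)
# vm_cpu,host=hv-01,name=instance\ 1 usage=12.5,vcpus=4i 1480550400000000000
```

Sorting tags by key is optional, but it keeps output deterministic and matches the ordering InfluxDB recommends for write performance.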

38
Fluentd - Useful Log Collector
Fluentd handles various log formats and makes logs easy to parse
Minimal resource usage
Redundant system
InfluxDB mirroring is implemented with Kafka and Fluentd
Minimize data loss by transporting logs to Kafka, with GCS as an additional backup
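One way to picture mirroring through a durable log: each InfluxDB instance gets its own consumer that reads the same topic from its own offset, so an instance that was down catches up later without affecting the others. The toy in-memory sketch below illustrates that idea only. It is not Kafka client code, it is not the configuration used in the deck, and the class names Topic and MirrorConsumer are hypothetical.

```python
class Topic:
    """In-memory stand-in for a single-partition topic: an append-only log."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)


class MirrorConsumer:
    """Reads the topic from its own offset and writes to its own store."""

    def __init__(self, topic, store):
        self.topic = topic
        self.store = store  # stand-in for one InfluxDB instance
        self.offset = 0

    def poll(self, max_records=100):
        """Copy up to max_records unread records into the store."""
        batch = self.topic.records[self.offset:self.offset + max_records]
        self.store.extend(batch)
        self.offset += len(batch)
        return len(batch)


topic = Topic()
db_a, db_b = [], []
consumer_a = MirrorConsumer(topic, db_a)
consumer_b = MirrorConsumer(topic, db_b)

for i in range(5):
    topic.append({"seq": i})

consumer_a.poll()          # A keeps up with the stream
topic.append({"seq": 5})   # arrives while B has not consumed anything yet
consumer_a.poll()
consumer_b.poll()          # B catches up from its own offset

print(db_a == db_b)
```

Because the log retains records independently of any consumer, the two stores converge to the same contents even though they were written at different times.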

40
Two logging engines
Splunk for event logs, InfluxDB for performance logs
Cover all of our requirements
Easy troubleshooting, visualization, analysis and improvement

41
Our logging infra makes our private cloud a better cloud
