
How logging makes a private cloud a better cloud - OpenStack Latest Information Seminar (December 2016)


Speaker: Kentaro Sasaki (Rakuten)
Agenda:
- Logging Infrastructure for Private Cloud
-- Private Cloud at Rakuten
-- Logging Matters
-- Log Management
- Overview of Our Logging Infrastructure
-- Event Logging Infrastructure
--- Event Logs in OpenStack
--- Event Logs in VMware
-- Log storage for Event logs: Splunk
-- Alerting and Reporting on Splunk
-- Performance Logging Infrastructure
-- Log Collector Requirements
--- Log Collector: Fluentd, Metricbeat
-- HVs and Storage Performance logs
-- VMs Performance logs from Hypervisors
-- Log streaming: Kafka
-- Log storage for Performance logs:
--- InfluxDB and Grafana
- Summary


  1. How logging makes a private cloud a better cloud. Dec/01/2016. Kentaro Sasaki, Global Operations Department, Rakuten, Inc.
  2. Rakuten is a Tokyo-based e-commerce and Internet company.
  3. Rakuten Ecosystem: the Rakuten Ecosystem and our membership database form the foundation of our business.
  4. Membership: 116.52 million persons. Gross Transaction Volume: 7.6 trillion JPY.
  5. Logging Infrastructure for Private Cloud
  6. Private Cloud at Rakuten
  7. Timeline of Private Cloud History. Generations: Gen1 (2010), Gen2 (2012), Gen3 (2015). Platforms shown: Xen hypervisor (2,000+ OS instances, management features built from scratch); VMware ESXi hypervisor (25,000+ OS instances, management features built from scratch); KVM hypervisor (uses the OpenStack API).
  8. Logging Matters
  9. Benefits: logging enables log visualization; analysis and debugging become easier. From a business point of view: less time spent on troubleshooting, which leads to better customer support.
  10. Assumptions: messages might be unmanageable; growing log volume requires huge log storage. Concerns: how to handle data loss; how to parse data from different sources.
  11. Log Management
  12. High Availability: availability, redundancy and scalability. Maintainability: minimum data loss and operational overhead.
  13. Huge Number of Targets: hundreds of hypervisors (ESXi & KVM); tens of thousands of VMs. Cover many sorts of logs: Splunk is suited for log analytics; a time-series DB is needed for performance logs. Tools: Splunk, InfluxDB.
  14. Overview of Our Logging Infrastructure
  15. Logging Infrastructure (architecture diagram). Paths: event log, performance log. Components shown: Fluentd, Metricbeat, Kafka, Splunk & PagerDuty, InfluxDB & Grafana, Google Cloud Storage, Cloud Foundry.
  16. Event Logging Infrastructure
  17. Event Logs in OpenStack
  18. Huge number of log files: 22 log files in a single cluster; manage logs for every Region and Availability Zone. Manage unmanageable logs: CRITICAL messages are unmanageable; a strong analytical storage engine is needed.
      Component   # Log files
      Nova        8
      Keystone    1
      Neutron     6
      Glance      2
      Cinder      5
      etc.        etc.
      Sample log:
      2013-02-25 21:05:51 17409 CRITICAL cinder [- ] Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
      ...
      2013-02-25 21:05:51 17409 TRACE cinder VolumeBackendAPIException: Bad or unexpected response from the storage volume backend API: volume group cinder-volumes doesn't exist
      2013-02-25 21:05:51 17409 TRACE cinder
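The sample log on slide 18 follows a "date time pid LEVEL logger message" layout. As a minimal sketch of the parsing step that any analytical store needs before it can count CRITICAL messages, the Python below splits such lines into fields. The regular expression and the function names `parse_line` and `count_levels` are illustrative assumptions, not something the deck shows, and real OpenStack log formats vary with configuration.

```python
import re
from collections import Counter

# Assumed pattern for the layout seen in the slide's sample:
# "<date> <time> <pid> <LEVEL> <logger> [message]".
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d+)?)\s+"
    r"(?P<pid>\d+)\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"(?P<logger>\S+)"
    r"(?:\s+(?P<message>.*))?$"
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    # Bare continuation lines such as "... TRACE cinder" carry no message.
    record["message"] = record["message"] or ""
    return record


def count_levels(lines):
    """Count parsed lines per log level, skipping unparseable lines."""
    return Counter(r["level"] for r in map(parse_line, lines) if r)
```

Counting levels per component in this way is the kind of aggregation an alert or dashboard on error counts would start from.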
  19. Event Logs in VMware
  20. Almost all VMware logs: event logs from vSphere; warning and error logs from ESXi. SAN storage logs: error logs from multi-vendor SAN storage.
  21. Log storage for Event logs: Splunk
  22. System Configuration: Splunk v6.4.x (as of Nov 2016), using an indexer cluster and a search head cluster. Manage huge data: 150+ GB input size per day; 30+ TB indexed data size. (Charts: input size per day; indexed data size.)
  23. Alerting and Reporting on Splunk
  24. OpenStack logs: 26 alerts; 16 dashboards for reporting. VMware logs: 68 alerts; 12 dashboards for reporting (e.g. visualizing the number of errors).
  25. Useful alerting function: integrates with PagerDuty. Strong analytical engine: manages and analyzes almost all types of logs; manages unmanageable logs.
  26. Performance Logging Infrastructure
  27. Log Collector Requirements
  28. Handle log streams: support various log file formats; strong parsing engine. User-friendly agent: minimal compute resource usage; pluggable architecture.
  29. Log Collector: Fluentd, Metricbeat
  30. HVs and Storage Performance logs
  31. OpenStack host logs: use the Fluentd exec plugin to get nf_conntrack_count; Metricbeat v5 for cpu, mem, diskio, filesystem, network. VMware HV and SAN logs: use an in-house custom Fluentd plugin to collect them. Output to InfluxDB and analyze on Grafana.
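Slide 31 mentions the Fluentd exec plugin for nf_conntrack_count but does not show the command it runs. The following is a minimal sketch of such a command, assuming a Linux host with the conntrack module loaded (so that the standard procfs file exists) and an exec source configured to parse JSON. The record field names and the function names are hypothetical.

```python
import json
import sys
import time

# Standard Linux procfs location of the connection-tracking counter.
CONNTRACK_PATH = "/proc/sys/net/netfilter/nf_conntrack_count"


def read_conntrack_count(path=CONNTRACK_PATH):
    """Read the current number of tracked connections as an integer."""
    with open(path) as f:
        return int(f.read().strip())


def build_record(count, now=None):
    """Build the JSON-serializable record (field names are assumptions)."""
    timestamp = int(time.time() if now is None else now)
    return {"time": timestamp, "nf_conntrack_count": count}


def main():
    try:
        count = read_conntrack_count()
    except OSError as exc:
        # Report on stderr so stdout stays clean for the collector.
        print(f"cannot read {CONNTRACK_PATH}: {exc}", file=sys.stderr)
        return
    # One JSON object per run, printed on a single line.
    print(json.dumps(build_record(count)))


if __name__ == "__main__":
    main()
```

Keeping the script to a single JSON line per invocation leaves scheduling and tagging to the collector.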
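  32. VMs Performance logs from Hypervisors
  33. From KVM (OpenStack): use the libvirt Python bindings to build custom scripts; generate JSON data and read it with the in_tail plugin. From ESXi (VMware): get logs from vCenter.
      Example script:
      #!/usr/bin/env python
      import json

      import libvirt

      # Open a read-only connection to the local hypervisor.
      conn = libvirt.openReadOnly()
      # listDomainsID() returns the IDs of running domains only.
      for dom_id in conn.listDomainsID():
          dom = conn.lookupByID(dom_id)
          # Emit one JSON object per line so a tailing collector can read it.
          print(json.dumps({
              "uuid": dom.UUIDString(),
              "name": dom.name(),
              "id": dom.ID(),
              # Assumed intent: the vCPU count. The slide's vcpus()[0][3]
              # would select only the fourth per-vCPU entry.
              "vcpus": len(dom.vcpus()[0]),
          }))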
  34. Log streaming: Kafka
  35. Kafka Specs: Kafka v0.10.0; runs on OpenStack and uses SSDs throughout. System Configuration: 100~500 partitions and 3 replicas per topic; back up important logs to GCS; transfer to another Kafka cluster (if necessary). (Diagram: Kafka, Google Cloud Storage, Kafka.)
  36. Log storage for Performance logs: InfluxDB and Grafana
  37. InfluxDB: run InfluxDB v1.1.0 on physical servers; post to multiple instances using Kafka and Fluentd. Grafana: 72 dashboards for visualizing performance data; access multiple InfluxDBs via a load balancer. (Diagram: Kafka, Grafana.)
  38. Fluentd, a useful log collector: Fluentd handles various log formats and makes logs easy to parse; minimum resource usage. Redundant system: InfluxDB mirroring is realized with Kafka and Fluentd; data loss is minimized by transporting logs to Kafka, with GCS used additionally.
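Slides 37 and 38 describe mirroring: the same performance data is posted to more than one InfluxDB. The deck realizes this with Kafka and Fluentd and shows no code, so the sketch below only illustrates the idea in plain Python. It formats one point in InfluxDB line protocol and hands it to several writer callables. The function names and the `writers` mapping are hypothetical, and the writers are injected so that no network access is assumed.

```python
def _escape(text, chars):
    """Backslash-escape each character of `chars` found in `text`."""
    text = str(text)
    for ch in chars:
        text = text.replace(ch, "\\" + ch)
    return text


def format_field(value):
    """Render a field value in line-protocol syntax."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Build 'measurement,tag=v field=v timestamp' for one point."""
    if not fields:
        raise ValueError("at least one field is required")
    head = _escape(measurement, ", ")
    for key in sorted(tags):
        head += f",{_escape(key, ',= ')}={_escape(tags[key], ',= ')}"
    body = ",".join(
        f"{_escape(key, ',= ')}={format_field(value)}"
        for key, value in sorted(fields.items())
    )
    return f"{head} {body} {timestamp_ns}"


def mirror_write(line, writers):
    """Send one line to every writer; return the names that failed."""
    failed = []
    for name, write in writers.items():
        try:
            write(line)
        except Exception:
            failed.append(name)
    return failed
```

Returning the failed targets instead of raising lets the caller retry a single mirror without resending to the ones that succeeded.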
  39. Summary
  40. Two logging engines: Splunk for event logs, InfluxDB for performance logs. Together they cover all of our requirements: easy troubleshooting, visualization, analysis and improvement.
  41. Our logging infra makes our private cloud a better cloud.
