
NTT DOCOMO case study: OpenStack Summit 2015 Tokyo talk "After One Year of OpenStack Cloud Operation (NTT DOCOMO)"

After One Year of OpenStack Cloud Operation (NTT DOCOMO)

Agenda:
- Our Project
- Operation
- Monitoring System
- Log Analytics

  1. Copyright©2015 NTT DOCOMO, INC. All rights reserved. After One Year of OpenStack Cloud Operation (NTT DOCOMO)
     - NTT DOCOMO Inc.: Ken Igarashi
     - NTT Software: Asako Ishigaki
     - NEC: Akihiro Motoki
  2. About Us
     - Ken Igarashi: leads the OpenStack project at NTT DOCOMO; one of the first members to propose OpenStack Bare Metal Provisioning (now called "Ironic") - bit.ly/1stuN2E
     - Asako Ishigaki: engineer at NTT Software; develops OpenStack log collection and analytics tools
     - Akihiro Motoki: Senior Research Engineer at NEC; core developer of Neutron and Horizon
  3. Our Project
  4. Project timeline (figure)
     - Time axis: 2014-6, 2014-8, 2014-11, 2015-2, 2015-5, 2015-8, 2015-11
     - Phase labels as printed on the slide: Scalable Test using 100 nodes (10), System Design (8), Recovery Tests (12), Racking and Cabling (14), 24/7 support (14), User Support (+x)
  5. Team Rules (Culture)
     - Focus on using OpenStack instead of developing OpenStack
       - Think about how to use it.
       - Don't think "OpenStack can't do XXXX."
     - Reduce Opex / promote automation
       - Operation tools: "Anything that a human needs to do more than twice must be automated."
       - Reduce the number of operators through HA and self-healing.
  6. Tools and Deployment Procedure
     - Tools: Ansible, Python, shell script
     - CI/CD: pep-8, Ansible-lint, install
     - Deployment procedure (figure labels): Spec, Writing, Test, Review, Production, with a "+5" label
     - 200+ deployments (2015), 2000+ patches (2015)
  7. Operation
  8. OpenStack Configuration (http://bit.ly/1DbJPUO)
     - Double redundancy for hardware
     - Triple redundancy for software
     - Diagram components: VMs, MySQL (Galera) with DB1-DB4 and an arbitrator, Nova, OpenStack APIs, Zabbix, two load balancers (LB), Neutron agents, PXE/DNS/DHCP, MaaS, RabbitMQ
  9. OpenStack Configuration (http://bit.ly/1DbJPUO), continued; the slide text is identical to slide 8
  10. Deployment: CMDB registration
  11. Choose playbooks for Ansible (figure labels: Dynamic Inventory, Ansible)
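The slides do not show how the CMDB feeds Ansible. As a minimal sketch, assuming the CMDB can be exported as simple host records, a dynamic inventory script could group hosts by role and print the JSON layout Ansible reads from an inventory script called with `--list`. The record fields (`hostname`, `role`, `ip`), the sample hosts, and the function name `build_inventory` are hypothetical.

```python
import json

# Hypothetical CMDB export; field names and hosts are assumptions for illustration.
CMDB_RECORDS = [
    {"hostname": "ctrl01", "role": "controller", "ip": "192.0.2.11"},
    {"hostname": "comp01", "role": "compute", "ip": "192.0.2.21"},
    {"hostname": "comp02", "role": "compute", "ip": "192.0.2.22"},
]


def build_inventory(records):
    """Group hosts by role in the JSON layout Ansible expects from `--list`."""
    inventory = {"_meta": {"hostvars": {}}}
    for rec in records:
        # One Ansible group per CMDB role.
        group = inventory.setdefault(rec["role"], {"hosts": []})
        group["hosts"].append(rec["hostname"])
        # Per-host variables go under _meta.hostvars.
        inventory["_meta"]["hostvars"][rec["hostname"]] = {"ansible_host": rec["ip"]}
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(CMDB_RECORDS), indent=2))
```

With an inventory like this, choosing playbooks per role reduces to targeting the matching group name in each playbook's `hosts:` line.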
  12. Deployments
      - Common: network, account, logging, Zabbix agent, drivers/firmware x 37
      - OpenStack: Nova, Swift, Neutron, ……. x 62
      - HA Configuration
      - Figure ("Install HDD Driver") labels: initial setup, update, kernel, driver, firmware, filesystem, development environment, compile
  13. Operation x 31
      - Common: process restart, log correction
      - OpenStack operation: usage, VM migration/backup, user add/delete/quota change
      - OpenStack monitoring: health check tools
      - Per-host instance check
        - Launch instances on the given node(s)
        - Check that boot succeeds and inspect the instance log
        - Check metadata retrieval, login prompt, and SSH access
        - Optionally, test volume attach and its read/write access
  14. Presentation: "What are operators doing behind the Cloud?", 2015/10/27 4:40pm - 5:20pm, Heian (New Takanawa)
  15. Monitoring System
  16. Monitoring System (figure)
      - Monitored components: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron agents, databases
      - Tools shown: Fluentd, Elasticsearch, Kibana, Zabbix
      - Coverage labels: "Weekday daytime" and "24h / 365d"
  17. Monitoring items and self-healing (figure)
      - Monitored components: VMs, Swift, Cinder, Nova, RabbitMQ, Neutron agents, databases
      - General resources: memory, CPU, network, HDD
      - Table labels: General, OpenStack; Monitoring Items, Self Healing
      - Values in slide order: 1,970, 25, 3,957, 59
  18. RabbitMQ
      - Configuration
        - 3-node cluster
        - cluster_partition_handling set to autoheal
      - Monitoring
        - Split-brain check: rabbitmqctl eval '[N||{partitions,N}<-rabbit_mnesia:status()].'
        - Port check (5672, 25672)
        - Process check: beam.smp, rabbitmq-server
      - Zabbix triggers for "at least one node running (1/3)"
        - {Openstack-RabbitMQ:grpsum["HostG-RabbitMQ","net.tcp.service[tcp,,25672]",last,0].count(#3,0,"eq")}=3
        - {OpenStack-RabbitMQ:grpsum["HostG-RabbitMQ","proc.num[beam.smp]",last,0].count(#3,0,"eq")}=3
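The port check on this slide is done by Zabbix (`net.tcp.service`). For readers without Zabbix, the same idea can be sketched with a plain TCP connect using Python's standard `socket` module. The function names and the 2-second default timeout are assumptions; only the port numbers 5672 and 25672 come from the slide.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def rabbitmq_ports_ok(host, ports=(5672, 25672)):
    """Check the two RabbitMQ ports named on the slide; returns {port: bool}."""
    return {port: port_open(host, port) for port in ports}
```

A connect test only shows that something is listening; it does not detect a partitioned cluster, which is why the slide pairs it with the separate split-brain check.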
  19. MySQL
      - Configuration: 4 nodes + 1 arbitrator (diagram labels: LB, R/W, Arbitrator)
      - Monitoring: cluster check
        - wsrep_local_recv_queue
        - wsrep_local_send_queue
        - wsrep_flow_control_paused
        - wsrep_local_commits
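The slide lists the Galera status variables but not the alert conditions. The sketch below shows one way such a check could look, assuming the values have already been read from the node (for example with `SHOW GLOBAL STATUS LIKE 'wsrep_%'`) into a dict. The thresholds and the function name `galera_alerts` are illustrative assumptions, not values from the talk. `wsrep_local_commits` is a counter, so it would be watched as a rate and is left out of the threshold table here.

```python
# Illustrative limits only; the talk does not state its thresholds.
DEFAULT_LIMITS = {
    "wsrep_local_recv_queue": 10,
    "wsrep_local_send_queue": 10,
    "wsrep_flow_control_paused": 0.1,
}


def galera_alerts(status, limits=DEFAULT_LIMITS):
    """Return the sorted names of status variables whose value exceeds its limit.

    `status` maps variable names to values (strings or numbers).
    Variables missing from `status` are treated as 0.
    """
    return sorted(
        name
        for name, limit in limits.items()
        if float(status.get(name, 0)) > limit
    )
```

For example, `galera_alerts({"wsrep_local_recv_queue": "25"})` flags only the receive queue, the symptom of a node that cannot apply writes as fast as it receives them.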
  20. MySQL Cluster (figure)
      - Labels: MySQL Client, Master, Slave, Disk, Galera, send_queue, recv_queue, Commit, Replication, OK
      - Caption: wait until OK is received from replication
  21. MySQL Cluster Freeze (same figure as slide 20, with a fault marked 👿)
      - Disk failure: 😀 (node is removed from the cluster)
      - Disk speed throttling: 😢
  22. (no text)
  23. Prohibited Actions While the MySQL Cluster Is Frozen
      - Prohibit some self-healing actions
        - Do not restart some OpenStack processes, e.g. neutron-plugin-openvswitch-agent
        - Do not reboot network nodes: they lose network reachability (can't recreate the network namespace)
      - Slide callouts: "All the VMs lose connections"; "Solved in Liberty?"
  24. Throttling happens during DB backup
      - Countermeasures: limit the backup node; choose the backup method
      - LOCK TABLES FOR BACKUP (online)
      - Backup on the limited node:
        1. Take the node out of the cluster (Donor/Desynced)
        2. Lock the DB and run the backup (FLUSH TABLES WITH READ LOCK)
        3. Return the node to the cluster (wsrep_desync=OFF)
      - Listed alongside: wsrep_local_recv_queue, wsrep_local_commits
      - Diagram labels: LB, R/W
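The three backup steps can be sketched as an ordered sequence that always returns the node to the cluster, even when the backup fails. This is a sketch under assumptions: `execute` and `run_backup` are hypothetical callables supplied by the caller, `execute` must send every statement over the same database session (the read lock belongs to that session), and the explicit `UNLOCK TABLES` is added here although the slide does not list it.

```python
def backup_desynced_node(execute, run_backup):
    """Run the slide's three backup steps against one backup node.

    execute(sql)  -- hypothetical callable that sends one SQL statement,
                     always over the same session.
    run_backup()  -- hypothetical callable that performs the actual dump.
    """
    # 1. Desync the node so a slow backup does not throttle the cluster.
    execute("SET GLOBAL wsrep_desync = ON")
    try:
        # 2. Lock the database and run the backup.
        execute("FLUSH TABLES WITH READ LOCK")
        try:
            run_backup()
        finally:
            execute("UNLOCK TABLES")
    finally:
        # 3. Return the node to the cluster.
        execute("SET GLOBAL wsrep_desync = OFF")
```

The nested `try`/`finally` blocks are the point of the sketch: a failed dump must not leave the node locked or permanently desynced.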
  25. Log Analytics (Kibana)
  26. Purpose of Log Analytics
      - (1) Detect critical system failures: we have to recover immediately
      - (2) Detect malicious access: we need to notify users
      - (3) Detect non-critical errors: better to fix them as soon as possible
      - (4) Find errors/warnings that have no service impact: we want to filter them out next time
  27. Treasure Hunt in the Ocean of Logs
      - Example: logs of one day
        - Total: 100 GB, 80M lines
        - Sum of critical, error, and warning logs: 200K lines
      - The meaningful logs are far fewer:
        - (1) 0 critical failures
        - (2) 0 malicious accesses
        - (3) 6 non-critical failures
        - (4) 6 ignorable failures
      - Chart "Breakdown of Logs": labels Critical, Error, Warning, Info, Debug, Other; percentages in slide order 0%, 0%, 1%, 30%, 39%, 30%
      - Second chart: labels HW, OS, OpenStack backend, OpenStack, Operation tools; percentages in slide order 0%, 24%, 24%, 49%, 3%
  28. Log Analytics Based on White/Black List
      - We analyze logs to enhance our black list and white list.
      - Logs found in our black list are sent to Zabbix.
      - Figure labels: Logs, trash, Zabbix, Kibana, black list (add, expand), white list (add, expand), reduce, analyze
  29. How to Adopt Black/White List Using Fluentd
      - Sources: network node, control node, compute node, LB, and UTM, via fluentd and rsyslog
      - Log server: Fluentd, Elasticsearch, zabbix_sender
        - Add an "ignorable" flag according to the white list
        - Put metadata to create graphs from the logs
        - Notify Zabbix according to the black list
      - Outputs: Zabbix alerts; Kibana graphs
  30. How to Adopt Black/White List Using Fluentd (example)
      - Incoming logs
        1. syslog, 10:01, crit: hardware failure
        2. rsyslog, 10:03, warn: IDS: from x.x.x.x
        3. api.log, 10:04, ERROR: invalid request format
      - Records after processing on the log server (Fluentd, Elasticsearch, zabbix_sender)
        - path=syslog, timestamp=10:01, severity=crit, item=-, source_ip=-, message="hardware failure"
        - path=rsyslog, timestamp=10:03, severity=warn, item=ids, source_ip=x.x.x.x, message="IDS: from x.x.x.x"
        - path=api.log, timestamp=10:04, severity=ERROR, item=ignore, source_ip=-, message="invalid request format"
      - Outputs: Zabbix receives "hardware failure"; Kibana shows the IDS graph and the crit graph
  31. Example of Our White List (with Juno)
      - 1. Count response codes and understand the trend. That's enough.
        ^keystonemiddleware.auth_token \[-\] Unable to find authentication token in headers$
      - 2. This ERROR means the user's operation was denied due to quota. It has no impact on our system. Should it be an INFO log?
        ^nova.api.openstack \[[^]]*\] Caught error: VolumeSizeExceedsAvailableQuota: Requested volume or snapshot exceeds allowed Gigabytes quota..*$
      - 3. This WARNING is caused by the presence of SHUTOFF instances. It is a commonplace condition and needs to be ignored.
        ^nova.scheduler.host_manager \[[^]]+\] Host has more disk space than database expected .*$
  32. Effect of Our White List
      - We succeeded in reducing the number of logs to be analyzed: 160K without the white list, 37 with it (a 99.98% reduction).
      - In other words, many meaningless logs have high log levels.
      - 1 year ago: we couldn't analyze all logs in a day.
      - Today: we can analyze all logs in 2-3 hours a day.
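The 99.98% figure follows directly from the two counts on the slide, reading "160K" as 160,000 lines. A short check (the function name is hypothetical):

```python
def reduction_percent(before, after):
    """Percentage of lines removed when going from `before` to `after`."""
    return (before - after) / before * 100


# 160K lines without the white list, 37 lines with it (figures from the slide).
print(round(reduction_percent(160_000, 37), 2))  # 99.98
```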
  33. Example of Our Black List
      - 1. Warning alert: this message indicates a disk problem on a compute node.
        ^kernel: \[[^]]*\] XXXXX.*hardware failure.$
      - 2. Information alert: Corosync needs to clean up its resources.
        ^pengine: warning: unpack_rsc_op: Processing failed op monitor for .*$
      - 3. Information alert: a full backup of MySQL failed once.
        ^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$
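The white/black list flow described on slides 28 to 33 can be sketched as a small classifier. The talk implements this inside Fluentd; the Python below only illustrates the decision order and is not the authors' implementation. The patterns are adapted from the slide examples with regex escaping restored (an assumption, since the transcript lost the backslashes), the function name `classify` is hypothetical, and checking the black list before the white list is a design choice made here so that an alert is never suppressed.

```python
import re

# Patterns adapted from the slides; escaping was restored by assumption.
WHITE_LIST = [
    re.compile(r"^keystonemiddleware\.auth_token \[-\] "
               r"Unable to find authentication token in headers$"),
    re.compile(r"^nova\.scheduler\.host_manager \[[^]]+\] "
               r"Host has more disk space than database expected .*$"),
]
BLACK_LIST = [
    (re.compile(r"^pengine: warning: unpack_rsc_op: "
                r"Processing failed op monitor for .*$"), "information"),
    (re.compile(r"^mysql_fullbackup\[\d+\]:\sFailed\sto\sMySQL\sfullbackup.*$"),
     "information"),
]


def classify(message):
    """Return ("alert", level), ("ignorable", None), or ("analyze", None)."""
    # Black list first: these messages are sent to Zabbix as alerts.
    for pattern, level in BLACK_LIST:
        if pattern.search(message):
            return ("alert", level)
    # White list next: known-harmless messages get the "ignorable" flag.
    for pattern in WHITE_LIST:
        if pattern.search(message):
            return ("ignorable", None)
    # Everything else is left for a human to analyze in Kibana.
    return ("analyze", None)
```

Messages that land in "analyze" are the candidates for extending either list, which is how the daily volume shrinks over time.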
  34. Demonstration with Kibana
      - 3 dashboards
      - Items listed on the slide: OpenStack, All Logs, Error Logs, Critical Logs, Warning Logs, IDS
  35. Trademarks
      - Kibana is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
      - Elasticsearch is a trademark of Elasticsearch BV, registered in the U.S. and in other countries.
      - logstash is a trademark of Elasticsearch BV.
  36. Presentation and Exhibition
      - Presentation (Operation): "What are operators doing behind the Cloud?", 2015/10/27 4:40pm - 5:20pm, Heian (New Takanawa)
      - Exhibition
        - NEC Booth (H4): 28 (Wed.) 10:45-13:00 and 16:30-18:30; 29 (Thu.) 9:00-14:00
        - NTT Group Booth (S14): 28 (Wed.) 13:15-16:15, "Touch and Feel! NTT DOCOMO's Cloud Operation"
      - Contact: contact-cloudpf-ml@nttdocomo.com
  37. NEC, NTT
  38. Thank you for your attention.
