
OpenStack Summit Tokyo 2015 Presentation, NTT Resonant

NTT Resonant Inc., an NTT Group company, operates the "goo" Japanese web portal and is a leading provider of Internet services. NTT Resonant deployed OpenStack as its service infrastructure and has been operating it in production since October 2014. The infrastructure started with 400 hypervisors and now accommodates more than 80 services and over 1,700 virtual servers, handling most of the portal's 170 million unique users and 1 billion page views per month.
We share the lessons learned from this experience.


  1. 1. OpenStack at NTT Resonant: Lessons Learned in Web Infrastructure Tomoya Hashimoto, Business Platform Division, NTT Resonant Inc. Kazuhiro Tooriyama, Business Platform Division, NTT Resonant Inc. Toshikazu Ichikawa, NTT Software Innovation Center, NTT Corporation
  2. 2. Presentation Video This slide deck was presented at OpenStack Summit Tokyo 2015. You will find our video-recorded presentation at https://www.openstack.org/summit/tokyo-2015/videos/presentation/openstack-at-ntt-resonant-lessons-learned-in-web-infrastructure 2
  3. 3. Speakers Tomoya Hashimoto Kazuhiro Tooriyama 2010–2014 NTT Communications Development of ISP NW (OCN) 2014–current NTT Resonant Engineer of server platform 2001–2012 NTT Resonant goo blog, oshiete goo (Q&A service) Development and operation of core services 2012–current NTT Resonant Architect of server platform 3 Toshikazu Ichikawa 2011–2014 Verio (NTT America) Development of cloud service "Cloudn" and managed hosting service 2014–current NTT Development of cloud service platform
  4. 4. Agenda 1. About NTT Resonant 2. OpenStack Infrastructure Design 3. VM setup by Puppet with OpenStack 4. Monitoring OpenStack and VMs 5. Current Activity and Future Plan 4
  5. 5. 1. About NTT Resonant 5
  6. 6. 1. About NTT Resonant 6 Regional Communications Business Long Distance and International Communications Business Mobile Communications Business Data Communications Business $112 billion in total revenue 240,000 employees worldwide #1 in Data Center floor space #2 in Global IP Backbone Source: TeleGeography All facts and figures accurate as of March 2014 R&D
  7. 7. 1. About NTT Resonant NTT Resonant's Business Area 7 B2C services Platform and B2B2C services Portal Site Smartphone application goo milk feeder goo disaster prevention application Services of Customers Healthcare Disaster prevention/response solutions Phone Cloud / Developer support e-commerce site for communications devices
  8. 8. 1. About NTT Resonant 8 Web portal site "goo" http://www.goo.ne.jp/ Launched in 1997, 18 years old. Providing 60+ services including • Web search • Blogging • News • Oshiete! goo Q&A site. Dictionaries ZIP codes Laboratory Bodycloud Housing and real estate Search Baby-care Movies Maps Navigation Horoscopes Rankings Car and bike News Weather Healthcare Smartphone applications Blogs Job search Love and marriage Online store Travel
  9. 9. 1. About NTT Resonant 9 How large is "goo"? The 3rd largest web portal in Japan (after Yahoo! and Google; ahead of Rakuten and MSN) Scale of web portal "goo": 170 million unique browsers per month, 1 billion page views per month Source: 2015.02 NetRatings
  10. 10. 2. OpenStack Infrastructure Design 10
  11. 11. 2. OpenStack Infrastructure Design What was required of us for the OpenStack deployment • Migrate to another data center within a limited timeframe –The termination date of the existing data center (DC) contract was fixed; we had to migrate our system to another DC by then • Shorten the lead time for service release –Speed up by automating the manual operations used to create and manage VMs –Be comparable to public cloud services such as AWS • Support all workflows needed to provide a service –Not just introducing OpenStack, but building an infrastructure for web services –Not only VM creation, but also installation and configuration of software inside VMs 11
  12. 12. 2. OpenStack Infrastructure Design Organization and Formation 12 Service Teams: NTT Resonant service developers for 60+ services, 300+ engineers. Platform Team: platform service, ~10 engineers, supported by operation partners (operators, outsourcing). Design Team: NTT R&D (joint experiment) and the OpenStack Community (contribution, distribution)
  13. 13. 2. OpenStack Infrastructure Design OpenStack deployment timeline with our services • It was decided to migrate our services to another data center –2014/03 Project started; design and deployment of the OpenStack installation begins –2014/10 OpenStack is ready, in production –2014/10–2015/01 (4 months) 70 services and 1,300 VMs migrated 13 March July Oct. Jan June April May Aug. Sep. Nov. Dec. 2014 2015 Migration of services from the old environment ★OpenStack started, in production ★Migration completed OpenStack installation: design / deployment Requirement definition About 6 months ★Old environment closed
  14. 14. 2. OpenStack Infrastructure Design OpenStack scale at NTT Resonant's main data center • Using OpenStack as a private cloud • In production since October 2014 • As of now, it supports –80+ services –1 billion page views per month • With –400 hypervisors •2 Nova cells –4,800 physical cores –1,800+ virtual servers 14 Launch
  15. 15. 2. OpenStack Infrastructure Design OpenStack Components (Icehouse Release) 15 What we use: Horizon (Dashboard), Neutron (Network: virtual router and LAN, virtual load balancer), Nova (Hypervisor: virtual servers), Glance (Image: VM template images, snapshots), Cinder (Block Storage: virtual volumes), Swift (Object Storage: RESTful file store, replication), Keystone (Identity). Other components: Heat (Orchestration), Trove (Database services), Ceilometer (Telemetry). VM APP OS designed by Freepik
  16. 16. 2. OpenStack Infrastructure Design Deployment • Distribution –RDO with CentOS 6 –Icehouse release • Automation –Puppet for configuration management •Thanks to the RDO community for the Puppet manifests 16
  17. 17. 2. OpenStack Infrastructure Design Networking with Neutron • Provider network with VLAN –No L3+ control (router, NAT, load balancer, firewall) • Using ML2 with the Linux bridge agent –We are familiar with it • Service model –An administrator prepares networks and subnets per tenant –A tenant is not allowed to create or delete a network • Close to "Scenario: Provider networks with Linux bridge" in the OpenStack Networking Guide [1]; a configuration sketch follows 17 [1] http://docs.openstack.org/networking-guide/scenario_provider_lb.html Neutron L4-7: Load Balancer, VPN L3: Router, NAT L2: Network, Port What we use
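To make the model concrete, the ML2/Linux bridge settings for such a provider-VLAN setup would look roughly like the following. This is a minimal sketch consistent with the Networking Guide scenario cited above; the physical network name, interface, and IDs are our assumptions, not values from the presentation.

```ini
# /etc/neutron/plugins/ml2/ml2_conf.ini -- a sketch, hypothetical names
[ml2]
type_drivers = flat,vlan
tenant_network_types =              # empty: tenants get no self-service networks
mechanism_drivers = linuxbridge

[ml2_type_vlan]
network_vlan_ranges = physnet1      # the admin assigns segmentation IDs explicitly

[linux_bridge]
physical_interface_mappings = physnet1:eth1

# The administrator then creates each tenant's network, for example:
#   neutron net-create svc-a-net --tenant-id <TENANT_A> \
#     --provider:network_type vlan --provider:physical_network physnet1 \
#     --provider:segmentation_id 100
```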
  18. 18. 2. OpenStack Infrastructure Design Node types and HA (High Availability) strategy 18

  Node Type             | OpenStack Components                     | RabbitMQ (MQ) / MariaDB (Database) | HAProxy (LB) / Pacemaker (HA cluster)
  Top cell Controller   | Nova, Glance, Keystone, Neutron, Horizon | RabbitMQ mirrored queue            | Nova, Keystone, Neutron, Horizon, DB, MQ
  Child cell Controller | Nova                                     | RabbitMQ mirrored queue            | MQ
  Database              | N/A                                      | MariaDB Galera Cluster             | N/A
  Swift Proxy           | Swift, Glance                            | N/A                                | Swift, Glance
  Swift Storage         | Swift                                    | N/A                                | N/A
  Compute               | Nova, Neutron                            | N/A                                | N/A
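As an illustration of the LB half of this strategy, a stateless API such as nova-api can sit behind HAProxy with a Pacemaker-managed VIP. The fragment below is a sketch with hypothetical addresses; the presentation does not show the actual configuration.

```
# haproxy.cfg fragment -- a sketch, hypothetical IPs
listen nova-api
    bind 192.0.2.10:8774               # VIP; failover of the VIP is Pacemaker's job
    balance roundrobin
    server top-ctl1 192.0.2.11:8774 check
    server top-ctl2 192.0.2.12:8774 check
```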
  19. 19. 2. OpenStack Infrastructure Design Contribution to the community related to this project 19 • This bug was a show-stopper for the project until we fixed it –Bug fix [1]: •The shelve function didn't work in the Icehouse release with a nova-cells deployment •We use shelve/unshelve for hypervisor maintenance • Some bugs we found and fixed –Security bug fix [2]: •This was recently announced as OSSA 2015-017 –8 bug fixes other than the above [1] "shelve api does not work in the nova-cell environment" https://bugs.launchpad.net/nova/+bug/1338451 [2] "Deleting instance while resize instance is running leads to unuseable compute nodes" https://bugs.launchpad.net/nova/+bug/1392527
  20. 20. 2. OpenStack Infrastructure Design Customization on the Dashboard • We modified code to enforce our operation rules –We modified only Horizon •Users come through Horizon, not the API •What we implemented –Server naming restriction (see the sketch below) –Access limits on the security group function –About 40 items in total –No modification to other components except bug-fix backports •Minimizes the cost of maintenance 20 Server creation dialog of Horizon
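As one example of the roughly 40 items, a server-naming restriction can be enforced in Horizon's Django form validation. The sketch below is hypothetical: the actual rules and how they were wired into Horizon are not published in the slides.

```python
# A sketch of a naming-rule validator for a Horizon form (hypothetical rule).
import re
from django.core.exceptions import ValidationError

# Hypothetical in-house convention: starts with a letter, lowercase, 3-30 chars.
NAME_RE = re.compile(r"^[a-z][a-z0-9-]{2,29}$")

def validate_server_name(name):
    """Reject server names that do not follow the in-house convention."""
    if not NAME_RE.match(name):
        raise ValidationError("Server name must be 3-30 characters: "
                              "lowercase letters, digits, and hyphens only.")
```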
  21. 21. 3. VM setup by Puppet with OpenStack 21
  22. 22. 3. VM setup by Puppet with OpenStack Issue of VM setup (installation and configuration) 22 • Only 4 months from VM creation to service migration – Time for VM setup was limited – 1,300 VMs needed to be migrated onto OpenStack – Automate procedures as much as possible • The key was the Puppet manifests used at the existing data center (DC) – We already set up VMs with Puppet manifests at the existing DC • Build a bridge between OpenStack and Puppet – The goal was to set up our services on top of OpenStack quickly and easily We resolved this issue by integrating Puppet with OpenStack
  23. 23. OpenStack 3. VM setup by Puppet with OpenStack How we use Puppet with OpenStack 23 • Our Puppet design – An individual Puppet master per tenant – Manages Linux accounts, middleware, config files, etc. – A single manifest repository • What is required to use Puppet – Host names can be resolved via DNS – Host groups are defined in LDAP – The Puppet manifest has an entry for each host group Tenant: A Tenant A User VM-A VM-B SVN Puppet Master DNS LDAP Necessary
  24. 24. OpenStack 3. VM setup by Puppet with OpenStack How we use Puppet with OpenStack 24 • Synchronization tool (sketched below) – Polls the Nova API to detect new VMs – Registers each VM in DNS, LDAP, and the Puppet manifest – Completes these steps every 5 minutes • OpenStack users can apply Puppet manifests easily and quickly right after creating a VM Tenant: A Tenant A User VM-A VM-B SVN Puppet Master Synchronization tool Polling Nova API DNS LDAP Add entry Add entry
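A skeleton of such a synchronization loop, using python-novaclient, might look like the following. The credentials, endpoint, domain name, and the three registration helpers are hypothetical stand-ins; the real tool's internals are not published.

```python
# A sketch of the polling synchronization tool (hypothetical names throughout).
import time
from novaclient import client

nova = client.Client("2", "sync-user", "secret", "admin",   # hypothetical creds
                     "http://keystone.example.com:5000/v2.0")

def add_dns_record(fqdn, server):      # stand-in for the real DNS registration
    print("DNS   :", fqdn, server.networks)

def add_ldap_host_group(fqdn):         # stand-in for the LDAP host-group entry
    print("LDAP  :", fqdn)

def add_puppet_manifest_entry(fqdn):   # stand-in for the SVN manifest commit
    print("PUPPET:", fqdn)

known = set()
while True:
    # Poll the Nova API across all tenants and register newly built VMs.
    for server in nova.servers.list(search_opts={"all_tenants": 1}):
        if server.id not in known and server.status == "ACTIVE":
            fqdn = "%s.example.com" % server.name
            add_dns_record(fqdn, server)
            add_ldap_host_group(fqdn)
            add_puppet_manifest_entry(fqdn)
            known.add(server.id)
    time.sleep(300)  # one full cycle every 5 minutes, as on the slide
```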
  25. 25. 3. VM setup by Puppet with OpenStack Outcome from the VM setup framework with OpenStack 25 • Drastically shortened timeline and a more efficient workflow –Deployed services on 1,000 VMs within 1 month –Only 30 minutes from VM creation to service start •It took 5 business days without OpenStack –Eliminated two operators' worth of tasks by reducing manual operation • A common process to build service environments –Service engineers don't have to worry about the environment and can focus on the business
  26. 26. 4. Monitoring OpenStack and VMs 26
  27. 27. 4. Monitoring OpenStack and VMs Abstract of our monitoring environment • Two monitoring environments 1. For the cloud infrastructure •Network, physical servers, OpenStack itself 2. For web services •Providing standard service-monitoring methods on the private cloud • Tools and roles – Zabbix •Semi-automatic VM monitoring – Redmine and Wiki •As an issue (ticket) management system •Automatic issuing: 1 ticket per 1 trouble – Operation Center •24/7 monitoring and escalation by phone •First response to simple incidents 27 Operation Center Web Service Teams Automatic Issuing Infra Team (us) Watching 24/7 In case of trouble or provisioning In case of serious situation designed by Freepik
  28. 28. 4. Monitoring OpenStack and VMs 1. For the cloud infrastructure (OpenStack monitoring) • In order of severity – API monitoring •keystone-api, nova-api, neutron-api, Horizon GUI, glance-api, swift-proxy •Failures here are quite serious – Process failure detection (see the item-key sketch below) •nova-*, swift-*, keystone-*, rabbitmq-server, mysqld (MariaDB), etc. – Process performance monitoring •Depends on the middleware •e.g., number of MySQL connections – Log monitoring •At first, treat any log message at ERROR level or above as "trouble" –Lack of knowledge breeds doubt •Filter out known problem-free logs day by day 28
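In Zabbix terms, these checks map to agent item keys along the following lines. This is a sketch: the item-key syntax is standard Zabbix, but the log path assumes RDO defaults on CentOS 6 and the UserParameter is a hypothetical example.

```
# Zabbix agent item keys -- a sketch; paths assume RDO on CentOS 6
proc.num[,nova,,nova-api]            # process-failure check: expect >= 1 nova-api
proc.num[,rabbitmq,,beam]            # rabbitmq-server (Erlang "beam" process)
log[/var/log/nova/api.log,ERROR]     # log check: alert on ERROR and above
# Performance items need a UserParameter in zabbix_agentd.conf, e.g. (hypothetical):
# UserParameter=mysql.connections,mysql -N -e "SHOW STATUS LIKE 'Threads_connected'" | awk '{print $2}'
```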
  29. 29. 4. Monitoring OpenStack and VMs Problems: complicated logs (Icehouse release) • What's this? – The log messages OpenStack emits while launching a single virtual machine 29 223 lines, 119,698 characters (only 24 lines without DEBUG level)
  30. 30. 4. Monitoring OpenStack and VMs Problems: complicated logs (Icehouse release) • Analyzing without DEBUG logs – a failure to create a new instance 30
  2015-07-XX 17:00:YY TopCellController INFO nova.osapi_compute.wsgi.server 172.X.X.X "GET <API_URL>/servers/<VM-UUID> HTTP/1.1" status: 200
  -> Accepting the request to create a new instance
  2015-07-XX 17:00:YY TopCellController INFO nova.scheduler.filter_scheduler Attempting to build 1 instance(s)
  -> Just reporting
  2015-07-XX 17:00:YY ChildCellController WARNING nova.scheduler.driver [instance:<VM-UUID>] Setting instance to ERROR state.
  -> The beginning of a sleepless night
  2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts
  -> A lack of free disk? Where is the processing sequence? For newbies, this is not friendly.
  31. 31. 4. Monitoring OpenStack and VMs Problems: complicated logs (Icehouse release) • Analyzing DEBUG logs – the same failure to create a new instance 31
  2015-07-XX 17:00:YY ChildCellController DEBUG nova.filters Filter RamFilter returned 88 host(s) get_filtered_objects /usr/lib/python2.6/site-packages/nova/filters.py:88
  -> Report: enough memory
  2015-07-XX 17:00:YY ChildCellController DEBUG nova.scheduler.filters.disk_filter (<hypervisor-name>) ram:46581 disk:731136 io_ops:0 instances:3 does not have 1433600 MB usable disk, it only has 731136.0 MB usable disk.
  -> Report: not enough disk (repeated 88 times)
  2015-07-XX 17:00:YY ChildCellController INFO nova.filters Filter DiskFilter returned 0 hosts
  -> A lack of free disk space; we need to add disks quickly.
  The DEBUG log shows the internal processing, but it is quite scruffy.
  32. 32. 4. Monitoring OpenStack and VMs Log request ID mapping 32 • A new function to trace logs easily, even across components – Targets: nova, cinder, glance, neutron, keystone, etc. • Current situation – Each component assigns its own request ID – Request IDs must be mapped by hand to trace logs – Finding the IDs is hard – e.g., creating a new volume from an image (cinder calls the glance API):
  cinder-volume: 2015-10-08 16:14:33.498 DEBUG cinder.volume.manager [req-A admin] image download from glance req-B
  cinder-volume: 2015-10-08 16:14:33.521 DEBUG glanceclient.common.http [req-A admin] HTTP/1.1 200 OK ... x-openstack-request-id: req-B ... (buried deep in the response headers!)
  glance-api: 2015-10-08 16:14:33.520 INFO eventlet.wsgi.server [req-B ...] ...
  • NTT's suggestion – Log the request ID mapping within one line in each caller (a sketch of the idea follows) – Approved as a cross-project spec, to be implemented – https://review.openstack.org/#/c/156508
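The idea behind the spec, in code form: the caller reads the callee's x-openstack-request-id response header and logs both IDs on a single line. The sketch below illustrates the concept only; it is not the actual patch.

```python
# A sketch of the idea behind the cross-project spec (not the actual patch).
import logging
import requests

LOG = logging.getLogger(__name__)

def download_image(image_url, caller_req_id):
    resp = requests.get(image_url)
    # Glance returns its own request ID in a response header.
    callee_req_id = resp.headers.get("x-openstack-request-id", "unknown")
    # One log line now maps the caller's ID to the callee's ID.
    LOG.debug("[%s] image download from glance, glance request id: %s",
              caller_req_id, callee_req_id)
    return resp.content
```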
  33. 33. 4. Monitoring OpenStack and VMs 2. For web services: changing the operation workflow • We have been providing a standard monitoring system inside our company – A standardized monitoring workflow for internal service developers • Standard monitoring item sets and rules • Alert thresholds for parameters – Monitoring configuration entered into Zabbix (or Nagios) by hand • Rethinking the monitoring scheme for OpenStack – Over 1,000 virtual machines are born, and can also suddenly die – By hand? No. – We gave our Zabbix a new function • Detects new VMs and starts monitoring them semi-automatically • Before getting along with OpenStack... – Re-examine your current workflow closely for efficient operation 33
  34. 34. 5. Current Activity and Future Plan 34
  35. 35. 5. Current Activity and Future Plan Current Activity • Changing sizing and improving VM density • The initial flavors were designed around the migration project •Compatibility with the old DC mattered more than resource efficiency •Keeping VM specs identical to the old DC was best for the migration plan • Current usage –Disk capacity is over-provisioned •Design: 37 GB of disk per 1 GB of memory •Actual usage: 7 GB of disk per 1 GB of memory • Providing new flavors based on actual usage (sketched below), asking tenants to return unused disk capacity • Doubling the physical memory per server • Aiming to increase VM density 1.3–2x 35
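For instance, a right-sized flavor based on the observed 7-GB-per-GB ratio could be registered via python-novaclient as below. The flavor name, credentials, and numbers are illustrative assumptions, not the actual new flavors.

```python
# A sketch: register a flavor sized to actual usage (illustrative numbers).
from novaclient import client

nova = client.Client("2", "admin", "secret", "admin",   # hypothetical creds
                     "http://keystone.example.com:5000/v2.0")

# Observed: ~7 GB disk per 1 GB RAM (vs. 37 GB per GB in the original design).
ram_gb = 4
disk_gb = ram_gb * 7    # 28 GB instead of the original 148 GB

nova.flavors.create(name="web.4g-rightsized",   # hypothetical flavor name
                    ram=ram_gb * 1024, vcpus=2, disk=disk_gb)
```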
  36. 36. 5. Current Activity and Future Plan Future Plan • Upgrade OpenStack –Load Balancer as a Service (LBaaS) is desired •Current: manual operation of the load balancer •LBaaS API v1 is not enough •Waiting for our vendor's driver for LBaaS API v2 –Establish an upgrade operation •We need to apply our patches •We need to develop and test those patches •These prevent us from upgrading frequently –Mitaka release •NTT R&D is located in Mitaka 36
  37. 37. 37 Summary 1. About NTT Resonant, operator of the web portal "goo" • 170 million unique browsers and 1 billion page views per month 2. OpenStack Infrastructure Design • It increased our business speed and agility • We successfully deployed 400 hypervisors in 6 months • Stable in production for more than 1 year 3. VM setup by Puppet with OpenStack • We started 70+ services on 1,300 VMs in 4 months • It shortened the time to deploy a service from 5 days to 30 minutes 4. Monitoring both OpenStack and VMs, with Zabbix 5. Current Activity and Future Plans • Current: sizing to improve VM density • Future: upgrades, LBaaS, and more, toward the Mitaka release
  38. 38. OpenStack new VM Appendix: Our monitoring environment TIP: semi-automatic monitoring setup — Zabbix polls VMs, reads monitoring.conf, and then applies the specified template. Server Zabbix Server Server Server Server Redmine (1) The Zabbix server polls the IP segments of OpenStack VMs (X.Y.Z.0/24) and finds Zabbix agents => registering each as a monitoring target (auto discovery) (2) It gets the monitoring definition via the agent (monitoring.conf: apache_prod mysql_prod linux_prod alert_on) (3) It applies the monitoring template corresponding to monitoring.conf (4) On catching trouble, it kicks the auto-issuing script => sending a request to the Redmine API (trouble ticket issuing; a sketch follows) Examples: apache_prod = monitor Apache in production, apache_dev = monitor Apache in development, linux_prod = monitor Linux OS in production, alert_on = send alerts to the VM users, alert_off = maintenance (silent) mode … Zabbix Agent Script New! 38
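The auto-issuing script in step (4) could be as small as a single POST to the Redmine REST API. The sketch below assumes API-key authentication is enabled; the host, project, and fields are hypothetical.

```python
# A sketch of the trouble-ticket script (hypothetical Redmine setup).
import json
import requests

REDMINE_URL = "http://redmine.example.com/issues.json"   # hypothetical host
API_KEY = "changeme"                                     # hypothetical API key

def issue_ticket(host, trigger, severity):
    """Open one Redmine ticket per trouble event (1 ticket / 1 trouble)."""
    payload = {"issue": {
        "project_id": "infra-ops",                       # hypothetical project
        "subject": "[%s] %s on %s" % (severity, trigger, host),
        "description": "Auto-issued by the Zabbix alert script.",
    }}
    resp = requests.post(REDMINE_URL,
                         headers={"X-Redmine-API-Key": API_KEY,
                                  "Content-Type": "application/json"},
                         data=json.dumps(payload))
    resp.raise_for_status()   # fail loudly if the ticket was not created
```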
