
/bin/tails from OpenStack Operations: Rarm Nagalingam, Red Hat

Audience: Intermediate
About: Tales from an OpenStack operations team that had to learn to walk before they could fly: a small agile team that follows Scrum to reduce single points of failure and relies heavily on orchestration. This presentation outlines how we use metrics to investigate, troubleshoot and influence purchasing decisions, why up/down monitoring alone is no longer enough, and how to support the inevitable persistent VM in the cloud.

Speaker Bio: Rarm Nagalingam – Senior Consultant, Red Hat

Rarm is a Senior Consultant at Red Hat working with customers to deploy and manage their cloud infrastructure. As a passionate cloud advocate, he has assisted in the migration of workloads running on legacy virtualisation to the cloud. Rarm has over 13 years of experience in the ICT industry, specializing in rapid development of bespoke systems.

OpenStack Australia Day - Sydney 2016
http://australiaday.openstack.org.au/sydney-2016/

Published in: Technology

  1. 1. /bin/tails from OpenStack Operations OpenStack Australia Day Rarm Nagalingam DevOps J.O.A.T Engineer May 2016
  2. 2. OpenStack Australia Day 2016 INTRODUCTION Rarm Nagalingam DevOps J.O.A.T Engineer rarm@redhat.com linkedin.com/in/rarm-nagalingam-736aa54
  3. 3. OpenStack Australia Day 2016 AGENDA ● Current Architecture, Size, Workloads ● Patch Methodology ● User Issue: Is the Cloud Slow!! today? ● egrep fail -R ./ == fail ● Let's play the blame game ● Fool me once, shame on you, fool me twice, monitor it! ● Role Play ● Questions & possibly Answers
  4. 4. Architecture
  5. 5. OpenStack Australia Day 2016 Current - RHELOSP 5.0 (ICEHOUSE) • 3 x Physical Controllers • 3 x Physical DB Nodes • 2 x Virtual Load Balancers • 26 x Compute Nodes (56 vCPUs and 256 GB RAM each) • 1456 vCPUs / 6.6 TB of RAM – 90% allocated • Storage: NFS via Filer
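The totals on this slide follow from the per-node figures. A minimal Python sketch of that arithmetic, assuming the quoted 56 vCPUs and 256 GB apply to each of the 26 compute nodes and that the 90% allocation figure applies to vCPUs (the slide does not say which resource it refers to):

```python
# Figures taken from the slide.
COMPUTE_NODES = 26
VCPUS_PER_NODE = 56
RAM_GB_PER_NODE = 256
ALLOCATED_PERCENT = 90  # assumption: applied to vCPUs here


def cluster_capacity(nodes, vcpus_per_node, ram_gb_per_node):
    """Return (total vCPUs, total RAM in GB) for identical compute nodes."""
    return nodes * vcpus_per_node, nodes * ram_gb_per_node


total_vcpus, total_ram_gb = cluster_capacity(
    COMPUTE_NODES, VCPUS_PER_NODE, RAM_GB_PER_NODE
)

# Integer arithmetic keeps the headroom figure free of rounding surprises.
allocated_vcpus = total_vcpus * ALLOCATED_PERCENT // 100
headroom_vcpus = total_vcpus - allocated_vcpus

print(f"{total_vcpus} vCPUs, {total_ram_gb} GB RAM, "
      f"{headroom_vcpus} vCPUs unallocated")
```

The RAM total comes out at 6656 GB, which the slide quotes as roughly 6.6 TB. At 90% allocation the remaining headroom is on the order of two to three compute nodes' worth of vCPUs.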
  6. 6. OpenStack Australia Day 2016
  7. 7. OpenStack Australia Day 2016 Future - RHELOSPd 8.0 (LIBERTY) ● 3 x Physical Controllers ● 3 x Physical DB Nodes ● 3 x Physical CEPH Monitor Nodes ● 9 x Physical CEPH Storage Nodes (~ 36 TB per node with NVMe Journals) ● 2 x Virtual Load Balancers ● (xxx) x Compute Nodes (56 vCPUs and 512 GB RAM each)
  8. 8. OpenStack Australia Day 2016
  9. 9. OpenStack Australia Day 2016 Current Workloads ● Cloud Based ● Web Apps ● Cloudy-VMs ++ https://www.flickr.com/photos/truedimensions/
  10. 10. Patch Methodology
  11. 11. OpenStack Australia Day 2016 Patch Methodology https://www.flickr.com/photos/emma-lego/
  12. 12. Is the Cloud Slow!! today?
  13. 13. OpenStack Australia Day 2016 ● Option 1: Scatter Gun Take Aim Fire Ah... www.safaribooksonline.com
  14. 14. OpenStack Australia Day 2016 Option 2: Become an Elite Cloud Admin (cc) https://www.flickr.com/photos/-chuckc-/
  15. 15. egrep fail -R ./ == FAIL
  16. 16. OpenStack Australia Day 2016 ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback (most recent call last):\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data\n **args)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 439, in inner\n return catch_client_exception(exceptions, func, *args, **kwargs)\n', ' File "/opt/stack/nova/nova/openstack/common/rpc/common.py", line 420, in catch_client_exception\n return func(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/network/manager.py", line 573, in get_instance_nw_info\n instance_uuid)\n', ' File "/opt/stack/nova/nova/db/api.py", line 561, in virtual_interface_get_by_instance\n return IMPL.virtual_interface_get_by_instance(context, instance_id)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 138, in wrapper\n instance_get_by_uuid(context, instance_uuid)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1678, in instance_get_by_uuid\n columns_to_join=columns_to_join)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\n', ' Filepython2.7/dist-packages/sqlalchemy/engine/base.py", line 1449, in execute\n params)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement\n compiled_sql, distilled_params\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context\n context)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context\n context)\n', ' File "/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 331, in do_execute\n cursor.execute(statement, parameters)\n', ' File "/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', ' File "/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', 'OperationalError: (OperationalError) (1054, "Unknown column 'instances.locked_by' in 'field list'") 'SELECT anon_1.instances_created_at AS anon_1_instances_created_at, anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at, anon_1.instances_deleted AS anon_1_i instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS instances_vm_state, instances.task_state AS instances_task_state…...ces_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config, instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS instances_cleaned \nFROM instances \nWHERE instances.deleted = %s AND instances.uuid = %s \n LIMIT %s) AS anon_1 LEFT OUTER JOIN instance_info_caches AS instance_info_caches_1
  17. 17. OpenStack Australia Day 2016 http://logstash.openstack.org/#/dashboard/file/logstash.json
  18. 18. OpenStack Australia Day 2016 Got Logs ● Troubleshooting from the 90’s ● Log Aggregation FTW ● Support infrastructure just as important as the Cloud ● Testing in Prod == a resume generating event
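The traceback on slide 16 shows why grepping for "fail" does not scale: the useful part is a handful of fields buried in a very long line. As a rough illustration of what a log aggregation pipeline does with such a line, here is a minimal Python sketch that reduces it to a searchable summary. The regular expressions and the `summarize` helper are assumptions for illustration only, and `SAMPLE` is a shortened stand-in for the full log entry.

```python
import re

# Hypothetical patterns, written against the log line shown on slide 16.
HEADER = re.compile(
    r"(?P<level>[A-Z]+) (?P<logger>[\w.]+) \[(?P<request_id>req-[0-9a-f-]+)"
)
EXCEPTION = re.compile(r"'(?P<exc>\w+(?:Error|Exception)): ")
MYSQL_ERROR = re.compile(r'\((?P<code>\d+), "(?P<message>[^"]*)"\)')

# Shortened stand-in for the real entry: header, first frame, final error.
SAMPLE = (
    "ERROR nova.openstack.common.rpc.common "
    "[req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] "
    "['Traceback (most recent call last):\\n', "
    "'OperationalError: (OperationalError) (1054, "
    "\"Unknown column 'instances.locked_by' in 'field list'\")"
)


def summarize(line):
    """Pull the fields an operator would search on out of one log line."""
    header = HEADER.search(line)
    exc = EXCEPTION.search(line)
    mysql = MYSQL_ERROR.search(line)
    return {
        "level": header.group("level") if header else None,
        "logger": header.group("logger") if header else None,
        "request_id": header.group("request_id") if header else None,
        "exception": exc.group("exc") if exc else None,
        "db_error_code": int(mysql.group("code")) if mysql else None,
        "db_error": mysql.group("message") if mysql else None,
    }


summary = summarize(SAMPLE)
for field, value in summary.items():
    print(f"{field}: {value}")
```

Once lines are reduced to fields like these, questions such as "how many requests hit database error 1054 in the last hour" become a query rather than a grep across every node.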
  19. 19. Difference between Metrics and Monitoring
  20. 20. OpenStack Australia Day 2016 Use metrics to prove your theories https://www.elastic.co/blog/kibana-4-5-0-released
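To make the difference between monitoring and metrics concrete, here is a small Python sketch in which an up/down check reports a healthy service while the latency metrics show that users are having a bad day. The probe values, the 500 ms threshold and the helper names are all hypothetical and not taken from the talk.

```python
import math


def up_down_check(probes_ms):
    """Classic up/down monitoring: 'up' if every probe got an answer."""
    return all(p is not None for p in probes_ms)


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical API response times in milliseconds (None = no response).
PROBES_MS = [120, 130, 110, 125, 140, 135, 128, 2400, 2600, 122]
SLOW_THRESHOLD_MS = 500  # hypothetical target for the 95th percentile

answered = [p for p in PROBES_MS if p is not None]
is_up = up_down_check(PROBES_MS)
p50 = percentile(answered, 50)
p95 = percentile(answered, 95)
is_slow = p95 > SLOW_THRESHOLD_MS

print(f"up={is_up} p50={p50}ms p95={p95}ms slow={is_slow}")
```

Every probe answered, so the up/down check stays green, yet the 95th percentile is far above the median. That gap is the kind of evidence that can confirm or rule out a "the cloud is slow" theory.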
  21. 21. Let's Play the Blame Game
  22. 22. OpenStack Australia Day 2016 Let's Play the Blame Game ∙ Enforce OLAs ∙ Influence and support purchasing
  23. 23. Fool me once, shame on you. Fool me twice, monitor it!
  24. 24. OpenStack Australia Day 2016 Fool me twice, monitor it! (cc) rarm
  25. 25. Role Play
  26. 26. OpenStack Australia Day 2016 (cc) https://www.flickr.com/photos/d0ppler/ Role Play
  27. 27. OpenStack Australia Day 2016 Example Scenario Exercise 1: You arrive at work and discover that one of your compute nodes has been hard powered off. The node was running three high-priority instances: a small 60GB Windows instance and two medium RHEL instances. Goal: Without rebuilding the compute node, restart the instances on another node.
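One possible approach to this exercise is Nova's evacuate operation, which rebuilds an instance on another host once Nova considers the failed host's compute service down. The Python sketch below only assembles the `nova evacuate` command lines for review; it does not run them. It assumes the instance disks live on shared storage (the deck mentions NFS via a filer but does not say whether instance disks are on it) and an Icehouse-era python-novaclient that accepts `--on-shared-storage`. The instance and host names are hypothetical.

```python
import shlex


def evacuate_commands(instances, target_host, shared_storage=True):
    """Build one `nova evacuate` command line per instance.

    The commands are returned as strings so an operator can review
    them before running anything against the cloud.
    """
    commands = []
    for instance in instances:
        argv = ["nova", "evacuate"]
        if shared_storage:
            # Assumption: disks are on shared storage, so they are reused.
            argv.append("--on-shared-storage")
        argv += [instance, target_host]
        commands.append(" ".join(shlex.quote(arg) for arg in argv))
    return commands


# Hypothetical names for the three instances and the surviving node.
STRANDED = ["win-small-01", "rhel-medium-01", "rhel-medium-02"]
commands = evacuate_commands(STRANDED, "compute-07")

for command in commands:
    print(command)
```

Without shared storage the flag would be omitted and the instances would be rebuilt from their images, losing any data on the local disks, which is one reason the persistent VM needs special attention.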
  28. 28. BackUps!
  29. 29. OpenStack Australia Day 2016 BackUp Scenario Exercise 3: One of the admins accidentally dropped a database table. However, rather than just clearing out the redundant data, they dropped all the tables from the OpenStack nova database. Thankfully you saw the user do this and can respond quickly. Goal: Redirect users to a temporary site stating that an outage has occurred. Restore the database and ensure that all services can successfully interact with the database before removing the redirect.
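Part of "ensure that all services can interact with the database" is confirming that the restore actually brought every table back. A minimal Python sketch of that check, assuming the backup is a mysqldump-style SQL file containing `CREATE TABLE` statements and that the list of restored tables has been obtained separately (for example from `SHOW TABLES`). The dump excerpt and its column definitions are hypothetical; only the two table names come from the traceback on slide 16.

```python
import re

# mysqldump writes table definitions as: CREATE TABLE `name` (
CREATE_TABLE = re.compile(r"^CREATE TABLE `(?P<name>[^`]+)`", re.MULTILINE)


def tables_in_dump(dump_sql):
    """Return the set of table names defined in a SQL dump."""
    return set(CREATE_TABLE.findall(dump_sql))


def missing_tables(dump_sql, restored_tables):
    """Tables present in the backup but absent after the restore."""
    return sorted(tables_in_dump(dump_sql) - set(restored_tables))


# Hypothetical, heavily shortened dump excerpt.
DUMP_EXCERPT = """
DROP TABLE IF EXISTS `instances`;
CREATE TABLE `instances` (
  `id` int(11) NOT NULL AUTO_INCREMENT
);
DROP TABLE IF EXISTS `instance_info_caches`;
CREATE TABLE `instance_info_caches` (
  `id` int(11) NOT NULL AUTO_INCREMENT
);
"""

# Hypothetical result of listing tables after a partial restore.
restored = ["instances"]
still_missing = missing_tables(DUMP_EXCERPT, restored)

print("missing after restore:", still_missing)
```

An empty result is a reasonable precondition for removing the outage redirect, although it says nothing about whether the restored rows are current.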
  30. 30. OpenStack Australia Day 2016 Now you are an Elite Cloud Admin (cc) https://www.flickr.com/photos/-chuckc-/
  31. 31. Questions & Possibly Answers
  32. 32. THANK YOU plus.google.com/+RedHat linkedin.com/company/red-hat youtube.com/user/RedHatVideos facebook.com/redhatinc twitter.com/RedHatNews
