/bin/tails from OpenStack Operations
OpenStack Australia Day
Rarm Nagalingam
DevOps J.O.A.T Engineer
May 2016
OpenStack Australia Day 2016
INTRODUCTION
Rarm Nagalingam
DevOps J.O.A.T Engineer
rarm@redhat.com
linkedin.com/in/rarm-nagalingam-736aa54
AGENDA
● Current Architecture, Size, Workloads
● Patch Methodology
● User Issue: Is the Cloud Slow!! today?
● egrep fail -R ./ == fail
● Let's play the blame game
● Fool me once, shame on you, fool me twice, monitor it!
● Role Play
● Questions & possibly Answers
Architecture
Current - RHELOSP 5.0 (ICEHOUSE)
• 3 x Physical Controllers
• 3 x Physical DB Nodes
• 2 x Virtual Load Balancers
• 26 x Compute Nodes (56 vCPUs and 256 GB RAM each)
• 1456 vCPUs / 6.6 TB of RAM – 90% allocated
• Storage NFS via Filer
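The totals on this slide follow from the per-node figures. A minimal sketch of the arithmetic, assuming all 26 compute nodes are identical (the variable names are illustrative and not from the talk):

```python
# Per-node figures are taken from the slide; identical nodes are an assumption.
COMPUTE_NODES = 26
VCPUS_PER_NODE = 56
RAM_GB_PER_NODE = 256
ALLOCATED_FRACTION = 0.90  # "90% allocated" from the slide

total_vcpus = COMPUTE_NODES * VCPUS_PER_NODE
total_ram_gb = COMPUTE_NODES * RAM_GB_PER_NODE
allocated_vcpus = round(total_vcpus * ALLOCATED_FRACTION)

print(f"Total vCPUs: {total_vcpus}")
print(f"Total RAM: {total_ram_gb} GB")
print(f"vCPUs allocated at 90%: about {allocated_vcpus}")
```

The RAM total comes out at 6656 GB, which the slide quotes as 6.6 TB.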
Future - RHELOSPd 8.0 (LIBERTY)
● 3 x Physical Controllers
● 3 x Physical DB Nodes
● 3 x Physical CEPH Monitor Nodes
● 9 x Physical CEPH Storage Nodes (~ 36 TB per node with NVMe Journals)
● 2 x Virtual Load Balancers
● (xxx) x Compute Nodes (56 vCPUs and 512 GB RAM each)
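As a rough sizing sketch for the Ceph tier: the raw capacity follows from the slide, but the replica count below is an assumption (the talk does not state a replication or erasure-coding scheme), so the usable figure is only illustrative.

```python
# Node count and per-node capacity are from the slide ("~ 36TB", so approximate).
CEPH_STORAGE_NODES = 9
RAW_TB_PER_NODE = 36

# ASSUMPTION: three-way replication. The talk does not say which scheme is used.
REPLICA_COUNT = 3

raw_tb = CEPH_STORAGE_NODES * RAW_TB_PER_NODE
usable_tb = raw_tb / REPLICA_COUNT

print(f"Raw capacity: ~{raw_tb} TB")
print(f"Usable with {REPLICA_COUNT} replicas (assumed): ~{usable_tb:.0f} TB")
```

Real usable capacity would be lower still once headroom for rebalancing and near-full thresholds is reserved.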
Current Workloads
● Cloud Based
● Web Apps
● Cloudy-VMs ++
https://www.flickr.com/photos/truedimensions/
Patch Methodology
Patch Methodology
https://www.flickr.com/photos/emma-lego/
Is the Cloud Slow!! today?
● Option 1: Scatter Gun
Take Aim Fire Ah...
www.safaribooksonline.com
Option 2: Become an Elite Cloud Admin
(cc) https://www.flickr.com/photos/-chuckc-/
egrep fail -R ./ == FAIL
ERROR nova.openstack.common.rpc.common [req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback (most recent call last):\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/amqp.py", line 461, in _process_data\n **args)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/dispatcher.py", line 172, in dispatch\n result = getattr(proxyobj, method)(ctxt, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 439, in inner\n return catch_client_exception(exceptions, func, *args, **kwargs)\n', ' File
"/opt/stack/nova/nova/openstack/common/rpc/common.py", line 420, in catch_client_exception\n return func(*args, **kwargs)\n', ' File
"/opt/stack/nova/nova/network/manager.py", line 573, in get_instance_nw_info\n instance_uuid)\n', ' File "/opt/stack/nova/nova/db/api.py", line 561, in
virtual_interface_get_by_instance\n return IMPL.virtual_interface_get_by_instance(context, instance_id)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line
138, in wrapper\n instance_get_by_uuid(context, instance_uuid)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 126, in wrapper\n return
f(*args, **kwargs)\n', ' File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1678, in instance_get_by_uuid\n columns_to_join=columns_to_join)\n', ' File
"/opt/stack/nova/nova/db/sqlalchemy/api.py", line 1684, in _instance_get_by_uuid\n filter_by(uuid=uuid).\n',
' Filepython2.7/dist-packages/sqlalchemy/engine/base.py", line 1449, in execute\n params)\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1584, in _execute_clauseelement\n compiled_sql, distilled_params\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1698, in _execute_context\n context)\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/base.py", line 1691, in _execute_context\n context)\n', ' File
"/usr/lib/python2.7/dist-packages/sqlalchemy/engine/default.py", line 331, in do_execute\n cursor.execute(statement, parameters)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/cursors.py", line 174, in execute\n self.errorhandler(self, exc, value)\n', ' File
"/usr/lib/python2.7/dist-packages/MySQLdb/connections.py", line 36, in defaulterrorhandler\n raise errorclass, errorvalue\n', 'OperationalError: (OperationalError) (1054,
"Unknown column 'instances.locked_by' in 'field list'") 'SELECT anon_1.instances_created_at AS anon_1_instances_created_at,
anon_1.instances_updated_at AS anon_1_instances_updated_at, anon_1.instances_deleted_at AS anon_1_instances_deleted_at,
anon_1.instances_deleted AS anon_1_i instances_hostname, instances.launch_index AS instances_launch_index, instances.key_name AS
instances_key_name, instances.key_data AS instances_key_data, instances.power_state AS instances_power_state, instances.vm_state AS
instances_vm_state, instances.task_state AS instances_task_state…...ces_access_ip_v6, instances.auto_disk_config AS instances_auto_disk_config,
instances.progress AS instances_progress, instances.shutdown_terminate AS instances_shutdown_terminate, instances.disable_terminate AS
instances_disable_terminate, instances.cell_name AS instances_cell_name, instances.internal_id AS instances_internal_id, instances.cleaned AS
instances_cleaned \nFROM instances \nWHERE instances.deleted = %s AND instances.uuid = %s \n LIMIT %s) AS anon_1 LEFT OUTER JOIN
instance_info_caches AS instance_info_caches_1
http://logstash.openstack.org/#/dashboard/file/logstash.json
Got Logs
● Troubleshooting from the 90’s
● Log Aggregation FTW
● Support infrastructure just as important as the Cloud
● Testing in Prod == a resume generating event
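To illustrate what log aggregation buys over a recursive egrep, here is a minimal sketch that groups log lines by request ID, so every message belonging to one user request can be read together. It assumes the "LEVEL logger [req-<uuid> ...]" prefix seen in the nova error earlier in this deck; the second and third sample lines, and the function name, are hypothetical.

```python
import re
from collections import defaultdict

# Matches the "LEVEL logger [req-<uuid>" prefix of an OpenStack-style log line.
LOG_PATTERN = re.compile(
    r"(?P<level>[A-Z]+)\s+(?P<logger>[\w.]+)\s+\[(?P<request_id>req-[0-9a-f-]+)"
)


def group_by_request(lines):
    """Return {request_id: [(level, logger), ...]} for lines that match."""
    grouped = defaultdict(list)
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match:
            grouped[match.group("request_id")].append(
                (match.group("level"), match.group("logger"))
            )
    return dict(grouped)


# First line is the prefix from the slide; the other two are hypothetical.
sample = [
    "ERROR nova.openstack.common.rpc.common "
    "[req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] ['Traceback ...",
    "INFO nova.compute.manager "
    "[req-c5e13da1-97f2-4da5-855f-1c09a11f328a None None] hypothetical message",
    "WARNING nova.scheduler "
    "[req-00000000-0000-0000-0000-000000000000 None None] hypothetical message",
]

for request_id, events in group_by_request(sample).items():
    print(request_id, events)
```

A log aggregation stack does the same kind of field extraction at ingest time, across every node, which is what makes a single request traceable through several services.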
Difference between Metrics and Monitoring
Use metrics to prove your theories
https://www.elastic.co/blog/kibana-4-5-0-released
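One way to turn "is the cloud slow today?" into a testable theory is to compare today's measurements against a baseline. The sketch below is illustrative only: the latency samples, the function name, and the three-sigma threshold are all hypothetical and not from the talk.

```python
from statistics import mean, stdev


def is_slower_than_baseline(baseline, today, threshold_sigmas=3.0):
    """Return True if today's mean exceeds the baseline mean by N std devs."""
    limit = mean(baseline) + threshold_sigmas * stdev(baseline)
    return mean(today) > limit


# HYPOTHETICAL API response times in seconds.
baseline_latency = [0.20, 0.22, 0.19, 0.21, 0.20, 0.23, 0.18]
normal_day = [0.21, 0.20, 0.22]
slow_day = [0.45, 0.50, 0.48]

print("normal day slow?", is_slower_than_baseline(baseline_latency, normal_day))
print("slow day slow?", is_slower_than_baseline(baseline_latency, slow_day))
```

With a recorded baseline, the user's complaint can be confirmed or refuted with data rather than argued about.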
Let's Play the Blame Game
Let's Play the Blame Game
∙ Enforce OLAs
∙ Influence and support purchasing
Fool me once, shame on you.
Fool me twice, monitor it!
Fool me twice, monitor it!
(cc) rarm
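Once an issue has bitten twice, it can be captured as a check. A minimal sketch of a threshold check on vCPU allocation, using the conventional Nagios-style status codes (0 OK, 1 WARNING, 2 CRITICAL). The thresholds and the function name are hypothetical; the 1456 vCPU total comes from the architecture slide.

```python
# Conventional Nagios-style plugin status codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_allocation(allocated, total, warn=0.80, crit=0.95):
    """Return (status, message) for an allocation ratio.

    warn and crit are HYPOTHETICAL thresholds chosen for illustration.
    """
    ratio = allocated / total
    message = f"{allocated}/{total} vCPUs allocated ({ratio:.0%})"
    if ratio >= crit:
        return CRITICAL, "CRITICAL: " + message
    if ratio >= warn:
        return WARNING, "WARNING: " + message
    return OK, "OK: " + message


# Roughly 90% of 1456 vCPUs, matching the "90% allocated" figure.
status, text = check_allocation(1310, 1456)
print(status, text)
```

In a real plugin the status would be passed to `sys.exit()` so the monitoring system can act on it.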
Role Play
(cc) https://www.flickr.com/photos/d0ppler/
Role Play
Example Scenario
Exercise 1:
You arrive at work and discover that one of your compute nodes has been hard powered off. The
node was running three high-priority instances: a small 60 GB Windows instance and two
medium RHEL instances.
Goal:
Without rebuilding the compute node, restart the instances on another node.
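Before moving anything (for example with nova's evacuate operation), it helps to confirm that the surviving nodes have room. The following is a planning sketch only, not the speaker's procedure: the host names, free-RAM figures, and per-instance RAM sizes are hypothetical, since the exercise gives only the instance types.

```python
def plan_evacuation(instances, free_ram_gb):
    """First-fit placement: biggest instances first, first host with room.

    instances:   list of (name, ram_gb) tuples
    free_ram_gb: dict of host name -> free RAM in GB
    Returns a dict of instance name -> target host.
    """
    free = dict(free_ram_gb)  # work on a copy; do not mutate the caller's data
    plan = {}
    for name, ram_gb in sorted(instances, key=lambda item: item[1], reverse=True):
        target = next((host for host, avail in free.items() if avail >= ram_gb), None)
        if target is None:
            raise RuntimeError(f"no host has {ram_gb} GB free for {name}")
        free[target] -= ram_gb
        plan[name] = target
    return plan


# HYPOTHETICAL sizes and hosts, for illustration only.
stranded = [("windows-small", 2), ("rhel-medium-1", 4), ("rhel-medium-2", 4)]
survivors = {"compute-02": 6, "compute-03": 16}

for instance, host in plan_evacuation(stranded, survivors).items():
    print(f"{instance} -> {host}")
```

With the cloud already 90% allocated, a check like this is worth running before the outage rather than during it.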
BackUps!
BackUp Scenario
Exercise 3:
One of the admins accidentally dropped a database table. However, rather than just clearing
out the redundant data, they dropped all the tables from the OpenStack nova database.
Thankfully you saw the user do this and can respond quickly.
Goal:
Redirect users to a temporary site stating that an outage has occurred. Restore the database
and ensure that all services are able to successfully interact with the database before
removing the redirect.
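The "before removing the redirect" step can be backed by a simple post-restore check. The sketch below uses Python's built-in sqlite3 purely as a stand-in for the real nova database so that it is runnable anywhere; the expected table names come from the SQL in the error slide, while the column definitions and the function name are hypothetical.

```python
import sqlite3

# Table names seen in the nova query earlier in this deck; a real check would
# list every table the services need.
EXPECTED_TABLES = {"instances", "instance_info_caches"}


def missing_tables(conn, expected):
    """Return the expected tables that are absent, sorted by name."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {name for (name,) in rows}
    return sorted(expected - present)


# Stand-in database: an in-memory SQLite file, NOT the production MySQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE instances (uuid TEXT, deleted INTEGER)")
partial = missing_tables(conn, EXPECTED_TABLES)

conn.execute("CREATE TABLE instance_info_caches (instance_uuid TEXT)")
restored = missing_tables(conn, EXPECTED_TABLES)

safe_to_remove_redirect = not restored
print("missing after partial restore:", partial)
print("safe to remove redirect:", safe_to_remove_redirect)
```

A schema check like this only proves the tables exist; each service should still be exercised end to end before users are let back in.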
Now you are an Elite Cloud Admin
(cc) https://www.flickr.com/photos/-chuckc-/
Questions & Possibly Answers
THANK YOU
plus.google.com/+RedHat
linkedin.com/company/red-hat
youtube.com/user/RedHatVideos
facebook.com/redhatinc
twitter.com/RedHatNews

/bin/tails from OpenStack Operations: Rarm Nagalingam, Red Hat
