Enforcing Application SLAs with Congress and Monasca
Fabio Giannetti, Ken Owens
April 28, 2016
Outline
• Vision
• Congress and Monasca implementing:
  • OPS/NOC SLA Policies
  • App Intent SLA Policies
• Current State and Next Steps
Vision
• Application owners/developers do not care about the underlying infrastructure unless it is a problem.
• Microservices-based architectures demand inherently granular application design.
• SLAs for applications must be holistic and independent of the underlying infrastructure.
[Diagram: Host, Virtualization and Container layers, and services (Srvc) belonging to Application A and Application B]
Vision
Enable business/application owners to easily define the aspects that are relevant in running their applications within the budget constraints that are imposed by IT.
Vision
Monitoring is now holistic: it has to consider various levels of virtualization and harmonize data across the different layers.
Containers are short-lived and are moved around the available infrastructure.
[Diagram: Host, Virtualization and Container layers]
Vision
Application owners are notified when soft limits are crossed (alarms), and actions are performed whenever hard limits are crossed.
OPS/NOC SLA using Congress and Monasca
OPS/NOC Policy Example: Underutilized Servers
If a VM has had less than 5% CPU utilization over the last 2 months, notify its owner via email.
error(vm, email) :-
    nova:server_owner(vm, owner),
    two_months_before_today(start, end),
    ceilometer:statistics(vm, start, end, "cpu_util", cpu),
    cpu < 5,
    keystone:email(owner, email)
two_months_before_today(start, end) :-
    date:today(end),
    date:minus(end, "2 months", start)
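The rule above amounts to a filter over a time window. The following is a minimal Python sketch of that reading, not Congress or Ceilometer code. The `cpu_samples` mapping and both function names are hypothetical, and "2 months" is approximated as 60 days.

```python
from datetime import datetime, timedelta
from statistics import mean


def two_months_before(end):
    # Assumption: "2 months" is approximated as 60 days.
    return end - timedelta(days=60), end


def underutilized(cpu_samples, today, threshold=5.0):
    """Return the VMs whose average CPU % inside the window is below threshold.

    cpu_samples is a hypothetical mapping: vm id -> list of (timestamp, cpu %).
    """
    start, end = two_months_before(today)
    result = []
    for vm, samples in cpu_samples.items():
        in_window = [cpu for ts, cpu in samples if start <= ts <= end]
        # VMs without any sample in the window are skipped rather than flagged.
        if in_window and mean(in_window) < threshold:
            result.append(vm)
    return sorted(result)
```

VMs with no samples in the window are deliberately not reported, since the absence of data is not evidence of low utilization.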
Current Solution
Components: Ceilometer API, Congress API, Policy Engine, Ceilometer Datasource
The Ceilometer Datasource polls the Ceilometer API every <n>s:
GET /v2/meters/cpu_util/statistics?resource_id=…
VM UUID (Resource ID)             CPU
xxxxxxxx-0001-xxxx-xxxxxxxxxxx    40
xxxxxxxx-0002-xxxx-xxxxxxxxxxx    30
xxxxxxxx-0003-xxxx-xxxxxxxxxxx    2
xxxxxxxx-0004-xxxx-xxxxxxxxxxx    70
xxxxxxxx-0005-xxxx-xxxxxxxxxxx    55
Current Solution
Components: Congress API, Policy Engine, Ceilometer Datasource, Nova Datasource, Keystone Datasource, Nova API, Keystone API
Ceilometer Datasource table:
VM UUID (Resource ID)    CPU
xxxxxxxx-0001-xxxx       40
xxxxxxxx-0002-xxxx       30
xxxxxxxx-0003-xxxx       2
xxxxxxxx-0004-xxxx       70
xxxxxxxx-0005-xxxx       55
Nova Datasource table:
VM                       Owner
xxxxxxxx-0001-xxxx       Ann
xxxxxxxx-0002-xxxx       Fabio
xxxxxxxx-0003-xxxx       Fabio
xxxxxxxx-0004-xxxx       Ken
xxxxxxxx-0005-xxxx       Ken
Keystone Datasource table:
Owner                    Email
Ann                      AnnNotRealEmail@cisco.com
Fabio                    FabioNotRealEmail@cisco.com
Ken                      KenNotRealEmail@cisco.com
Policy result:
VM                       Email
xxxxxxxx-0003-xxxx       FabioNotRealEmail@cisco.com
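The result table is a join of the CPU, owner and email tables. As a rough illustration, the Python sketch below reproduces it from the values shown on this slide. The function name `derive_errors` and the dictionary layout are assumptions made for the example, not part of Congress.

```python
def derive_errors(cpu, owner_of, email_of, threshold=5):
    """Join low-CPU VMs with their owner and the owner's email."""
    rows = []
    for vm, util in cpu.items():
        if util < threshold:
            owner = owner_of.get(vm)
            email = email_of.get(owner)
            # A VM without a known owner or email yields no row.
            if email is not None:
                rows.append((vm, email))
    return rows


# Table contents as shown on the slide.
cpu = {
    "xxxxxxxx-0001-xxxx": 40,
    "xxxxxxxx-0002-xxxx": 30,
    "xxxxxxxx-0003-xxxx": 2,
    "xxxxxxxx-0004-xxxx": 70,
    "xxxxxxxx-0005-xxxx": 55,
}
owner_of = {
    "xxxxxxxx-0001-xxxx": "Ann",
    "xxxxxxxx-0002-xxxx": "Fabio",
    "xxxxxxxx-0003-xxxx": "Fabio",
    "xxxxxxxx-0004-xxxx": "Ken",
    "xxxxxxxx-0005-xxxx": "Ken",
}
email_of = {
    "Ann": "AnnNotRealEmail@cisco.com",
    "Fabio": "FabioNotRealEmail@cisco.com",
    "Ken": "KenNotRealEmail@cisco.com",
}

errors = derive_errors(cpu, owner_of, email_of)
```

With the slide's data, `errors` holds a single row for VM `xxxxxxxx-0003-xxxx`, the only VM below the 5% threshold.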
From Policy to Alarm
error(vm, email) :-
    nova:server_owner(vm, owner),
    two_months_before_today(start, end),
    monasca_alarms:stats(vm, start, end, "cpu.user_perc", cpu),
    cpu < 5,
    keystone:email(owner, email)
two_months_before_today(start, end) :-
    date:today(end),
    date:minus(end, "2 months", start)
{
  "name": "Average CPU percent is less than 5",
  "description": "The average CPU percent is less than 5",
  "expression": "(avg(cpu.user_perc{resource_id=vm}) < 5)",
  "match_by": [
    "resource_id"
  ],
  "severity": "HIGH",
  "ok_actions": [
    "action_id_for_ok"
  ],
  "alarm_actions": [
    "action_id_for_alarm"
  ]
}
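One way to picture the policy-to-alarm step is a small function that takes the metric, comparison and threshold from the policy and emits a JSON body like the one above. The Python sketch below is only an illustration under that assumption. The function `alarm_definition` and its parameters are hypothetical and do not describe the conversion component itself.

```python
import json


def alarm_definition(name, description, metric, dimension, variable,
                     threshold, ok_actions, alarm_actions,
                     function="avg", operator="<", severity="HIGH"):
    """Build an alarm-definition body shaped like the one on the slide."""
    # Doubled braces produce literal '{' and '}' around the dimension filter.
    expression = (
        f"({function}({metric}{{{dimension}={variable}}}) "
        f"{operator} {threshold})"
    )
    return {
        "name": name,
        "description": description,
        "expression": expression,
        "match_by": [dimension],
        "severity": severity,
        "ok_actions": list(ok_actions),
        "alarm_actions": list(alarm_actions),
    }


definition = alarm_definition(
    name="Average CPU percent is less than 5",
    description="The average CPU percent is less than 5",
    metric="cpu.user_perc",
    dimension="resource_id",
    variable="vm",
    threshold=5,
    ok_actions=["action_id_for_ok"],
    alarm_actions=["action_id_for_alarm"],
)
body = json.dumps(definition, indent=2)
```

Here `body` is the JSON text that would be sent when creating the alarm definition. The action IDs are the placeholders used on the slide.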
Proposed Solution (receiving notifications)
Monasca components: Monasca Agents, Monasca API, Kafka Cluster, Persister, Threshold Engine, Notification Engine, Metrics DB, Settings DB
Congress components: Congress API, Policy Engine, Monasca Alarm Datasource
Alarm definition creation: POST /v2.0/alarm-definitions
Notification registration: monasca notification-create congress WEBHOOK http:…/v1/data-sources/monasca_alarm?execute&action=handle_ala
Webhook: …/v1/data-sources/monasca_alarm?execute&action=handle_alarm
The webhook invokes handle_alarm(params) on the Monasca Alarm Datasource, which holds:
VM UUID (Resource ID)    CPU
xxxxxxxx-0003-xxxx       2
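To make the push model concrete, here is a minimal Python sketch of what a `handle_alarm(params)` action could do with an incoming notification: keep a table of resources that are currently in alarm. The class name and the payload keys (`state`, `alarm_name`, `metrics`, `dimensions`) are assumptions for illustration and should be checked against the actual Monasca webhook payload. The sketch stores the alarm name rather than the CPU value, because it does not assume the payload carries the measured value.

```python
class MonascaAlarmTable:
    """Hypothetical in-memory table of resources currently in ALARM state."""

    def __init__(self):
        self.rows = {}  # resource_id -> name of the alarm that fired

    def handle_alarm(self, params):
        # Assumed payload shape:
        # {"state": ..., "alarm_name": ..., "metrics": [{"dimensions": {...}}]}
        state = params.get("state")
        name = params.get("alarm_name")
        for metric in params.get("metrics", []):
            resource_id = metric.get("dimensions", {}).get("resource_id")
            if resource_id is None:
                continue
            if state == "ALARM":
                self.rows[resource_id] = name
            else:
                # Any other state clears the row for that resource.
                self.rows.pop(resource_id, None)
        return self.rows
```

Because rows are added and removed only when a notification arrives, the table stays small and no polling of per-VM statistics is needed.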
Proposed Solution (receiving notifications)
Components: Congress API, Policy Engine, Monasca Alarm Datasource, Nova Datasource, Keystone Datasource, Nova API, Keystone API
Monasca Alarm Datasource table:
VM UUID (Resource ID)    CPU
xxxxxxxx-0003-xxxx       2
Nova Datasource table:
VM                       Owner
xxxxxxxx-0003-xxxx       Fabio
Keystone Datasource table:
Owner                    Email
Fabio                    FabioNotRealEmail@cisco.com
Policy result:
VM                       Email
xxxxxxxx-0003-xxxx       FabioNotRealEmail@cisco.com
Application Intent SLA using Congress and Monasca
App Intent Policy Example: VM Evacuation for Biz Critical App if Host has potential health issues
error(vm) :-
    nova:show(vm, hostID),
    monasca_alarm:host_issues(hostID)
A Host has issues if, for instance:
1. It is unhealthy: it cannot be pinged and/or reached over SSH
2. It shows network errors and packet loss
3. Its free disk space is below a certain threshold
App Intent Policy: Metrics Correlation
error(vm) :-
    nova:show(vm, hostID),
    monasca_alarm:host_issues(hostID)
Metric Name: host_alive_status
  Dimensions: observer_host=fqdn, hostname=supplied hostname being checked, test_type=ping or ssh
  Value: 0=online, 1=offline
Metric Name: disk.space_used_perc
  Dimensions: device, mount_point
  Value: The percentage of disk space that is being used on a device
Metric Name: net.in_packets_dropped_sec
  Dimensions: device
  Value: Number of inbound network packets dropped per second
Metric Name: net.out_packets_dropped_sec
  Dimensions: device
  Value: Number of outbound network packets dropped per second
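Read as a join, the policy flags every VM placed on a host that currently has at least one open issue. The Python sketch below illustrates that reading only. The function `vms_to_evacuate`, the `vm_host` mapping and the `host_alarms` mapping are hypothetical stand-ins for the `nova:show` and `monasca_alarm:host_issues` tables.

```python
def vms_to_evacuate(vm_host, host_alarms):
    """error(vm) :- nova:show(vm, hostID), monasca_alarm:host_issues(hostID).

    vm_host:     hypothetical mapping vm id -> host id
    host_alarms: hypothetical mapping host id -> set of firing alarm names
    """
    hosts_with_issues = {host for host, alarms in host_alarms.items() if alarms}
    return sorted(vm for vm, host in vm_host.items()
                  if host in hosts_with_issues)


vm_host = {"vm-a": "host-1", "vm-b": "host-2", "vm-c": "host-1"}
host_alarms = {"host-1": {"disk.space_used_perc"}, "host-2": set()}
candidates = vms_to_evacuate(vm_host, host_alarms)
```

In this made-up example only `host-1` has an open issue, so `candidates` contains `vm-a` and `vm-c`.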
App Intent Policy: Multi-Alarms #1
{
  "name": "Host is Unhealthy",
  "description": "The host is considered unhealthy",
  "expression": "(host_alive_status{host_id=hostID} = 1)",
  "match_by": [
    "host_id"
  ],
  ...
}
{
  "name": "Host disk getting full",
  "description": "The host disk is reaching capacity",
  "expression": "(disk.space_used_perc{host_id=hostID} > 90)",
  "match_by": [
    "host_id"
  ],
  ...
}
Metric Name                    Value
host_alive_status              0=online, 1=offline
disk.space_used_perc           The percentage of disk space that is being used on a device
net.in_packets_dropped_sec     Number of inbound network packets dropped per second
net.out_packets_dropped_sec    Number of outbound network packets dropped per second
App Intent Policy: Multi-Alarms #2
{
  "name": "Host is dropping inbound packets",
  "description": "The host is dropping inbound network packets",
  "expression": "(net.in_packets_dropped_sec{host_id=hostID} > 30)",
  "match_by": [
    "host_id"
  ],
  ...
}
{
  "name": "Host is dropping outbound packets",
  "description": "The host is dropping outbound network packets",
  "expression": "(net.out_packets_dropped_sec{host_id=hostID} > 30)",
  "match_by": [
    "host_id"
  ],
  ...
}
Metric Name                    Value
host_alive_status              0=online, 1=offline
disk.space_used_perc           The percentage of disk space that is being used on a device
net.in_packets_dropped_sec     Number of inbound network packets dropped per second
net.out_packets_dropped_sec    Number of outbound network packets dropped per second
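Taken together, the four alarm definitions are four threshold checks, any of which marks the host as having issues. The Python sketch below evaluates the same thresholds locally on a dictionary of metric values. It is an illustration only: in the proposed design the Monasca Threshold Engine performs these checks, and the names `HOST_CHECKS`, `firing_checks` and `host_issues` are hypothetical.

```python
import operator

# Thresholds taken from the alarm expressions on the two Multi-Alarms slides.
HOST_CHECKS = [
    ("host_alive_status", operator.eq, 1),
    ("disk.space_used_perc", operator.gt, 90),
    ("net.in_packets_dropped_sec", operator.gt, 30),
    ("net.out_packets_dropped_sec", operator.gt, 30),
]


def firing_checks(metrics):
    """Return the names of the metrics whose threshold is violated."""
    return [name for name, compare, limit in HOST_CHECKS
            if name in metrics and compare(metrics[name], limit)]


def host_issues(metrics):
    """A host has issues as soon as any single check fires."""
    return bool(firing_checks(metrics))


healthy = {"host_alive_status": 0, "disk.space_used_perc": 40,
           "net.in_packets_dropped_sec": 0, "net.out_packets_dropped_sec": 0}
disk_full = dict(healthy, **{"disk.space_used_perc": 95})
```

Metrics missing from the input are ignored, so a host reporting only some of the four metrics is judged on those alone.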
Current State and Future Work
Overall Architecture
[Diagram: Monasca Agents, Monasca API, Keystone, Kafka Cluster, Persister, Threshold Engine, Notification Engine, Metrics DB and Settings DB; Congress API, Policy Engine and Monasca Alarm Datasource, with an In Mem DB holding Metric/Value rows (metric1 val1 ... metricN valN); connections labeled webhook and rpc]
Current Status and Next Steps
• Done:
  • Developed a Monasca Datasource to validate the integration.
  • Designed the solution and identified the main integration points.
• To be Done:
  • Develop a Monasca Alarm Datasource leveraging the RPC capabilities in Congress.
  • Create a Congress Notification Webhook for Monasca.
  • Develop a policy-to-alarm conversion component for policies prefixed with monasca-alarm.
OpenStack Summit
Austin, Texas 2016
Thank You!
