By U.S. Navy photo by Mass Communication Specialist 1st Class James E. Foehl [Public domain], via Wikimedia Commons
The Age Old Question: What should
our APM Solution be monitoring?
Follow Us: #ITSMSummit!
Mr. White has fifteen years of experience designing and managing the
deployment of Systems Monitoring and Event Management software. Prior
to joining IBM, Mr. White held various positions including the leader of the
Monitoring and Event Management organization of a Fortune 100 company
and developing solutions as a consultant for a wide variety of organizations,
including the Mexican Secretaría de Hacienda y Crédito Público, Telmex,
Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US
Navy Facilities and Engineering Command.
Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation
http://weheartit.com/entry/12433848!
Ground rules for this
session…
•  If you can’t tell if I am trying to be funny…
–  
GO AHEAD AND LAUGH!
•  Feel free to text, tweet, yammer, or whatever.
Use 
•  If you have a question, no need to wait until
the end. Just interrupt me. Seriously… I
don’t mind.
My name is Andrew White
I have a lot of experience leading
Systems and Event Management teams
I am here today to share some of what I have learned about
Systems Thinking,
and APM.
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research
58% of mobile phone users expect websites
to load as quickly, almost as quickly or faster
on their mobile phone, compared to the
computer they use at home*
http://www.flickr.com/photos/lucianbickerton/3858380291/sizes/l/!
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research
60% of mobile web users have had a problem in the
past year when accessing a website on their phone*
http://www.flickr.com/photos/rickyromero/1357938629/sizes/l/!
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=602) – Gomez Mobile Web Experience Survey conducted by Equation Research
Slow load time was the number one issue,
experienced by almost 75% of them*
http://bighugelabs.com/onblack.php?id=2497744197&size=large!
Is 5 seconds really bad?
Start…
90th Percentile: 5.44 seconds… DONE!
Observed Maximum: 15.4 seconds… DONE!
If you were the one on the phone
with one of those customers…
how would you fill that silence?
The rationality of individuals is
limited by the information they
have. This causes “The
Tragedy of the Commons.”
What Is a System?
It is a set of interconnected actors that change
over time when they are influenced by other
elements of the system.
[Diagram: eight interconnected actors]
As we have become more
aware that things are always
happening, our behavior has
changed.
We are no longer thinking,
we are reacting…
http://static4.businessinsider.com/image/5176c232ecad04805d000010-505-277/screen%20shot%202013-04-23%20at%201.09.49%20pm.png
April 23, 2013
The Twitter account for the
Associated Press was hacked

The hackers posted a fake
notice that the White House
was attacked and President
Obama was injured

The Dow dropped 150 points
in less than 5 minutes
Systems are Volatile
This change makes it difficult to control the
behavior of the system. The good news is that
systems are perfect. They always deliver the
optimum result given a specific stimulus.
Anatomy of an Outage
P0 - Affecting Multiple Apps
Corporate
LANs & VPNs
Load Balancer
Firewall
Web
Servers
Message
Queue
zOS
CICS
WAS
Database
WAS
Database
zOS
MQ
DB2




1.  5:45-ish pm: CICS ABENDs start flooding OMEGAMON but not high enough to ticket
2.  6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3.  6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident
4.  6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5.  10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
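The elapsed times in this outage are easy to lose when reading the clock values one at a time. The sketch below simply subtracts them. It assumes the "-ish" times are exact and that every event happened on the same evening; the date placeholder and the helper name `minutes_between` are illustrative only.

```python
from datetime import datetime

# Clock times come from the outage timeline; the calendar date is an
# arbitrary placeholder because the slides do not give one (assumption).
DAY = "2000-01-01"
TIMELINE = [
    ("17:45", "CICS ABENDs begin"),
    ("18:00", "MQ flows interrupted"),
    ("18:04", "Synthetic transactions fail"),
    ("18:14", "P0 incident created"),
    ("18:54", "Back-end problem suspected"),
    ("22:29", "CICS reset"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two HH:MM clock times on the same day."""
    fmt = "%Y-%m-%d %H:%M"
    delta = datetime.strptime(f"{DAY} {end}", fmt) - datetime.strptime(f"{DAY} {start}", fmt)
    return int(delta.total_seconds() // 60)


first_symptom = TIMELINE[0][0]
for clock, label in TIMELINE[1:]:
    print(f"{label}: {minutes_between(first_symptom, clock)} min after the first symptom")
```

Read this way, roughly half an hour passed between the first ABENDs and the P0 incident, and more than four hours passed between the incident and the CICS reset.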
Our Problem Statement:
The business needs to reliably reach its customers and
users regardless of where they may be located. Latency
forces close geographic proximity of the components
and limits the quality of service provided to
geographically distributed customers. 

If the users can’t use it, it doesn’t work.
Our Constraints
At the same time, there are a few inescapable facts we face:
1.  Today’s users demand reliable systems to do their work
2.  Our systems mirror the complexity of the businesses they
support
3.  Our environments must be massive to scale to handle the
workload
4.  There is too much activity for a single person to be totally
situationally aware
When all of these happen at the same time…
Ug…
Question
Is there a better way to figure out what
monitoring would help?
Your monitoring should help you answer:
•  How will we know if the users are getting the experience
they are expecting?
•  How much capacity do we need during normal and peak
times to ensure user expectations are met?
•  How quickly can the provider we select ramp up to meet
our needs if we find that the service is underperforming?
•  How fast do we need to be able to access additional
capacity once it is ready for us?
What Do You Want To Accomplish?
When decisions are not made based
on information, it’s called gambling.
Composite
Applications
Site Content
Search
Session Information
User Login & Identity Mgmt
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
The Same Old Problem
Corporate LANs & VPNs
ISP Connection
DNS & Internet Services
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
Home Wireless & Broadband
Mobile Broadband
Is It My Data Center?
•  Configuration errors
•  Application design issues
•  Code defects
•  Insufficient infrastructure
•  Oversubscription issues
•  Poor routing optimization
•  Low cache hit rate
Is It a Service Provider Problem?
•  Non-optimized mobile content
•  Bad performance under load
•  Blocking content delivery
•  Incorrect geo-targeted content
Is It an ISP Problem?
•  Peering problems
•  ISP outages
Is It My Code or a Browser Problem?
•  Missing content
•  Poorly performing JavaScript
•  Inconsistent CSS rendering
•  Browser/device incompatibility
•  Page size too big
•  Conflicting HTML tag support
•  Too many objects
•  Content not optimized for device
The Cloud
Distributed
Database
Mainframe
Network
Middleware
Storage
Cognitive Dissonance
Corporate
LANs & VPNs
Distributed
Database
Mainframe
Network
Middleware
Storage
ISP
Connection
DNS & Internet
Services
Content Mgmt
System
Social Network
Widgets
Site Tracking
& Analytics
Banner Ads & 
Revenue Generators
Multimedia &
CDN Content
Home Wireless
& Broadband
Mobile Broadband
The Part You Control
The Part They Experience
All our systems
look great,
SLAs are being
met…
…meanwhile
the user is
NOT
happy
You Have More
Control Here Than
You Think
Gaining Perspective
Requires Balance
Four Perspectives of User Experience
1.  Client to the Server
2.  Server to the Client
3.  “3rd Party” Vantage Point
4.  Synthetic Transactions
Packet Capture
Synthetic Transactions
Client Monitoring
Server Probe
What Does Good
Monitoring Look Like?
Corporate!
LANs & VPNs!
Load Balancer!
Load Balancer!
Firewall!
Switch!
Web Server Farm!
Database!
Data Power!
Mainframe!
Middleware!
Load Balancer!
1.  System Availability
2.  Operating System Performance
3.  Hardware Monitoring
4.  Service/Daemon and Process Availability
5.  Error Logs
6.  Application Resource KPIs
7.  End-to-End Transactions
8.  Point of Failure Transactions
9.  Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
Elements of Good Monitoring
Finding Metrics That Matter
§  Will the metric be used in a report? If so, which one? How is it used in the report?
§  Will the metric be used in a dashboard? If so, which one? How will it be used?
§  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket
be generated? If so, what severity?
§  How often is this event likely to occur? What is the impact if the event occurs? What
is the likelihood it can be detected by monitoring?
§  Will the metric help identify the source of a problem? Is it a coincident / symptomatic
indicator?
§  Is the metric always associated with a single problem? Could this metric become a
false indicator?
§  What is the impact if this goes undetected?
§  What is the lifespan for this metric? What is the potential for changes that may
reduce the efficacy of the metric?
Evaluating the Effectiveness of a Metric
Beware of Averages
0.5  0.7  0.9  1.8  2.5  2.5  2.6  2.9  3.3  3.5
Chart markers: 25th Percentile, 50th Percentile, 75th Percentile, Average
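To see why the average deserves suspicion, it helps to compute it next to the quartiles for the ten values on the slide. The sketch below uses Python's standard `statistics` module. The slide does not say what the values measure, and the exact quartile values depend on the interpolation method (the module's default "exclusive" method is used here), so treat the numbers as illustrative.

```python
import statistics

# The ten sample values shown on the slide (what they measure is not stated).
samples = [0.5, 0.7, 0.9, 1.8, 2.5, 2.5, 2.6, 2.9, 3.3, 3.5]

mean = statistics.mean(samples)

# quantiles(n=4) returns the three cut points: 25th, 50th, 75th percentile.
# The default method is "exclusive"; other methods give slightly different values.
q1, median, q3 = statistics.quantiles(samples, n=4)

print(f"average:         {mean:.2f}")
print(f"25th percentile: {q1:.2f}")
print(f"50th percentile: {median:.2f}")
print(f"75th percentile: {q3:.2f}")
```

The average lands below the median here because the three small values pull it down, so more than half of the samples are actually worse than the "average" suggests.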
What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
§  Is the patient feeling unstable
angina?
§  Is there fluid in the patient’s lungs?
§  Is the patient’s systolic blood
pressure below 100?

The Goldman Algorithm
Prediction of Patients Expected to
Have a Heart Attack Within 72 Hours
[Bar chart, axis 0 to 100]
Traditional Techniques
 Goldman Algorithm
By paying attention to what really matters, Dr.
Goldman improved the “false negatives” by 20
percentage points and eliminated the “false
positives” altogether.
The Goldman Algorithm
ECG Evidence of Acute Ischemia?
ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads
(New or Unknown Age) or
T- Wave Inversion in ≥ 2 Contiguous Leads (New or
Unknown Age) or
Left Bundle-Branch Block (New or Unknown Age)
Observation
Unit
Inpatient
Telemetry Unit
High Risk
 Low Risk
 Very Low Risk
Moderate Risk
Yes
 No
Coronary
Care Unit
No
ECG Evidence of Acute Myocardial Infarction (MI)?
ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous
Leads (New or Unknown Age)
or
Pathologic Q Waves in ≥ 2 Contiguous Leads (New
or Unknown Age)
Yes
Patient suspected of
Acute Cardiac
Ischemia
Perform
Electrocardiogram
(EKG)
0 Factors
2 or 3 Factors
 1 Factor
0 or 1 Factors
2 or 3 Factors
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
Urgent Factors Present?
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
Driving the Right Action
Application!
End User
Experience!
Gainesville!
Transaction 1!
Transaction 2!
Transaction N!
San Antonio!
Transaction 1!
Transaction 2!
Transaction N!
Des Moines!
Transaction 1!
Transaction 2!
Transaction N!
Columbus!
Transaction 1!
Transaction 2!
Transaction N!
Infrastructure!
Network!
KPI 1!
KPI 2!
KPI N!
Mainframe!
KPI 1!
KPI 2!
KPI N!
Storage!
KPI 1!
KPI 2!
KPI N!
Linux!
KPI 1!
KPI 2!
KPI N!
Middleware!
KPI 1!
KPI 2!
KPI N!
Database!
KPI 1!
KPI 2!
KPI N!
Our success in any endeavor depends directly on
our ability to solve problems
What do we need to do that?
You Gotta Have Skillz…!
Common Problem Types
§  Design Problems
§  Creative Problems
§  Daily Problems
§  People Problems
Rule-Based
Approach
Event Based
Approach
Event-Based Problem Solving
•  Appreciative Understanding
•  Know What We Are Solving
•  Create A Common Reality
•  Solutions Based on Causes
Rules for Causal
Relationships
Database
Down !
(Effect)!
Drive Full
(Cause/Effect)!
Logs Not
Truncated
(Cause)!
①  Causes are effects, and effects are causes!
Rules for Causal
Relationships
End of the
Universe
(Effect)!
Database Down !
(Primary Effect)!
Drive Full
(Cause/Effect)!
Logs Not
Truncated
(Cause/Effect)!
Beginning of
Time (Cause)!
②  You can keep identifying causes – there is no limit!
Two Important Questions
End of the
Universe
(Effect)!
Database Down !
(Primary Effect)!
Drive Full
(Cause/Effect)!
Logs Not
Truncated
(Cause/Effect)!
Beginning of
Time (Cause)!
Ask “Why?”!
Ask “What”!
Rules for Causal
Relationships
③  An Effect is often the result of multiple causes!
SQL Server was
not processing
queries (Effect)!
Transaction log
was unable to grow!
T: Drive at 0 Bytes
free!
Logs were not
truncated!
DBA on
honeymoon
vacation in Fiji!
Logs are truncated
manually!
Company has only
1 DBA!
“Backup” DBA was
not aware the logs
require truncation!
Space allocations
are fixed!
Lack of Control!
-AND-!
-AND-!
-AND-!
Rules for Causal
Relationships
④  Causes need to be both necessary and sufficient!
SQL Server was not
processing queries
(Effect)!
Transaction log was
unable to grow
(Transitory Cause)!
T: Drive at 0 Bytes free!
(Non-transitory Cause
& Effect)!
Logs were not
truncated!
(Transitory Cause &
Effect)!
DBA on honeymoon
vacation in Fiji!
(Transitory Cause)!
Logs are truncated
manually!
(Non-Transitory Cause)!
Company has only 1
DBA!
(Non-Transitory Cause)!
“Backup” DBA was not
aware the logs require
truncation!
(Non-Transitory Cause)!
Space allocations are
fixed!
(Non-Transitory Cause)!
Lack of Control!
-AND-!
-AND-!
-AND-!
How Fire Works
Time
Oxygen
Heat
Fuel
Fire
MatchStrike
Transitory
Non-Transitory
Fire
Oxygen
Heat
Fuel
Match
Strike
-AND-
• Transitory Causes act as catalysts to bring
about change (think Transition)
• Non-Transitory Causes are objects,
properties/attributes, and status
RCA Diagram
Customers
Complaining
Web Server returning
500 errors
The application
server was timing
out
SQL Server was not
processing queries
Transaction log was
unable to grow
T: Drive at 0 Bytes
free
Logs were not
truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1
DBA
“Backup” DBA was
not aware the logs
require truncation
Space allocations
are fixed
 Lack of Control
Only one database
cluster in use
DR SQL Cluster
DR Cluster being
used for UAT testing
More Information
Needed
Only one application
server exists
More Information
Needed
Trying to do business
on the website
 Desired Condition
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
-AND-
Add Evidence
Statistical Data
Situational
Observation
Failure Modes Analysis
SQL Server Not Available
Transaction log is unable
to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon
vacation in Fiji
Logs are truncated
manually
Company has only 1 DBA
“Backup” DBA was not
aware the logs require
truncation
(Condition Cause)
Space allocations are
fixed
(Condition Cause)
Lack of Control
SQL is unable to cache
query results 
Available RAM at 0 Bytes
Free
C: Drive at 0 Bytes free
Minidump is configured to
write to C: Drive
Server was ASRing
frequently
Software distributions
were leaving files in the
TEMP folder
%TEMP% configured to
C:\Temp
Kernel able to write to
page file
-AND-
-AND-
-AND-
-AND-
-OR-
-AND-
-OR-
Picking Monitors
Monitor the
intersections at
the “OR’s”
At least one point
along each branch
after the “OR”
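The selection rule above can be sketched as a walk over the failure-mode tree. The structure below is a simplified, assumed reading of the flattened diagram (the real slide has more branches), and the names `Node` and `pick_monitors` are hypothetical. For every OR gate, the sketch picks the gate itself plus the first node along each branch beneath it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list["Node"] = field(default_factory=list)


def pick_monitors(node: Node) -> list[str]:
    """Pick each OR intersection plus the first point on every branch below it."""
    picks: list[str] = []
    if node.gate == "OR":
        picks.append(node.name)
        picks.extend(child.name for child in node.children)
    for child in node.children:
        for name in pick_monitors(child):
            if name not in picks:
                picks.append(name)
    return picks


# Assumed, simplified tree built from labels on the slide.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
    ]),
])

for monitor in pick_monitors(tree):
    print(monitor)
```

Picking the first node on each branch is only one way to satisfy "at least one point along each branch"; a point deeper in the branch may give earlier warning.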
FMEA Matrix
(Impact Calculation)
Severity (impact of the failure):
Negligible (1-2): no loss in functionality,
mostly cosmetic
Marginal (3-4): temporary interruptions or the
degradation lasts for a brief period of time
Critical (5-6): the problem will not resolve
itself but a work around exists allowing the
problem to be bypassed
Serious (7-8): the problem will not resolve
itself and no work around is possible.
Functionality is impaired or lost but the
system is usable to some extent
Catastrophic (9-10): the system is
completely unusable
Occurrence (how often the failure happens):
Improbable (1-2): less than 1 time per year
Remote (3-4): 1 time per year
Occasional (5-6): 1 time per month
Probable (7-8): 1 time per day
Chronic (9-10): 1 or more times per day
Detection (when the failure is likely to be caught):
Very high (1-2): during the design phase
High (3-4): during peer review or unit testing
Moderate (5-6): during system testing or
acceptance testing
Remote (7-8): during or immediately after
production deployment
Very Remote (9-10): only after heavy usage
by users
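The slide gives three 1-10 scales but not the formula that combines them. A common FMEA convention multiplies the three scores into a risk priority number (RPN); the sketch below assumes that convention. The function name and the example scores are hypothetical and are not taken from the deck.

```python
def risk_priority_number(severity: int, occurrence: int, detection: int) -> int:
    """Multiply the three 1-10 FMEA scores (conventional RPN, assumed here)."""
    for label, score in (("severity", severity),
                         ("occurrence", occurrence),
                         ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{label} must be between 1 and 10, got {score}")
    return severity * occurrence * detection


# Hypothetical scoring of one failure mode: the system is unusable (9),
# it happens about once a month (5), and it is only caught around
# production deployment (7).
rpn = risk_priority_number(severity=9, occurrence=5, detection=7)
print(rpn)
```

Ranking failure modes by this number is one way to decide which ones earn a monitor first.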
FMEA Matrix
(Evidence)
These are the events that help us to RULE IN a
failure mode as a possible cause
These are the events that help us RULE OUT the
failure mode as not relevant
Determining Severity
Logical Server
Virtual
Machine 1
Virtual
Machine 2
Severity: Description
Critical: The component has completely failed
Major: The component is operating but is in a degraded or crippled state
Minor: The component is functioning normally but is at risk of a more serious failure
Informational: The component is functioning normally but is reporting a change in state
Unknown: The component has changed its operating state but the effect is not known
Clear: The component is operating normally or a higher severity event has been resolved
•  The event severity is determined with
respect to the component generating the
event
•  The event severity does not consider impact
or urgency
•  The incident priority is not determined by
event severity
•  The event severity helps drive an effective
triage when multiple events arrive at
approximately the same time
•  Only after the affected components and
their relationships to each other have been
determined can impact and urgency be
determined
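A minimal sketch of severity-driven triage, assuming the six levels can be ranked numerically. The slide does not say where Unknown sits relative to the other levels, so the ordering below is an assumption, as are the `Severity` and `triage` names and the sample events.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Numeric ranks are an assumption; the slide only names the six levels.
    CLEAR = 0
    INFORMATIONAL = 1
    UNKNOWN = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events: list[tuple[str, Severity]]) -> list[tuple[str, Severity]]:
    """Order events that arrived together so the most severe is handled first."""
    return sorted(events, key=lambda event: event[1], reverse=True)


# Hypothetical events arriving at approximately the same time.
arrivals = [
    ("Volume Group 1", Severity.MINOR),
    ("Server 1", Severity.CRITICAL),
    ("Virtual Machine 2", Severity.MAJOR),
]

for component, severity in triage(arrivals):
    print(f"{severity.name:<13} {component}")
```

Note that the sort only says which event to look at first. Incident priority still has to come from impact and urgency once the relationships between the components are known.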
Six Levels of Severity
Physical Server
Server 1
Server 2
Logical Volumes
Volume
Group 1
Volume
Group 2
Physical Volumes
Hard
Drive 1
Hard
Drive 2
Hard
Drive 3
Monitoring Patterns
Layers of Pre-Defined Monitoring Patterns
•  The OS template is deployed when the
server is provisioned
•  As a server is customized to fit its role,
additional templates are deployed
•  Templates are stacked on top of each
other until no gaps remain
•  This approach provides a high degree of
standardization without sacrificing the
ability to develop a custom solution
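The stacking idea can be sketched as a layered merge of template dictionaries, where later layers add monitors or override individual thresholds from earlier ones. The template contents, threshold values, and the `stack_templates` helper are all hypothetical; real monitoring products express templates in their own formats.

```python
def stack_templates(*templates: dict[str, dict]) -> dict[str, dict]:
    """Merge templates in order; later layers add monitors or override settings."""
    merged: dict[str, dict] = {}
    for template in templates:
        for monitor, settings in template.items():
            merged[monitor] = {**merged.get(monitor, {}), **settings}
    return merged


# Hypothetical templates: the OS layer is deployed at provisioning time,
# the database layer is added when the server takes on that role.
os_template = {
    "cpu_busy": {"warn": 80, "crit": 95},
    "disk_free": {"warn": 20, "crit": 10},
}
database_template = {
    "disk_free": {"crit": 15},      # tighter threshold for a database server
    "log_growth": {"warn": 70},
}

server_monitoring = stack_templates(os_template, database_template)
for monitor, settings in sorted(server_monitoring.items()):
    print(monitor, settings)
```

Because each layer only states what it changes, the OS baseline stays standard while the role-specific layer carries the customization.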
Application-Technology Matrix
Maps services, applications and technologies
enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new
hardware is acquired
• Whether a service has sufficient monitoring
coverage based on its application components
• This approach allows for anticipating changes to
a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
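A small sketch of how the matrix scores might be used to check a service's coverage. The application name, the technology scores, and the coverage formula (achieved score divided by the maximum possible score) are illustrative assumptions rather than something the deck specifies.

```python
SCORE_LABELS = {0: "No Strategy", 1: "Limited Monitoring", 2: "Fully Integrated Strategy"}

# Hypothetical matrix data: monitoring maturity per technology,
# and the technologies each application depends on.
technology_scores = {"Linux": 2, "Database": 1, "Middleware": 0, "Mainframe": 2}
application_stack = {"Online Banking": ["Linux", "Database", "Middleware"]}


def coverage(application: str) -> float:
    """Fraction of the maximum possible score achieved by the application's stack."""
    technologies = application_stack[application]
    achieved = sum(technology_scores[tech] for tech in technologies)
    return achieved / (2 * len(technologies))


def gaps(application: str) -> list[str]:
    """Technologies under the application that are not yet fully integrated."""
    return [tech for tech in application_stack[application] if technology_scores[tech] < 2]


print(f"coverage: {coverage('Online Banking'):.0%}")
for tech in gaps("Online Banking"):
    print(f"gap: {tech} ({SCORE_LABELS[technology_scores[tech]]})")
```

The gap list is what feeds investment prioritization: it names the technologies where a template is missing or incomplete.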
Event Lifecycle
Legend!
Element Manager!
Distributed Collectors!
Object Server Triggers!
Impact Policies!
ITNM RCA Engine!
Gateway Replication!
Webtop Event List!
Software-Operating System!
Data Collection!
Anomaly Detection!
Event Generation!
Integration!
Event Processing!
Enrichment!
Event Suppression!
Correlation!
Root Cause Analysis!
Business Impact Analysis!
Automation!
Notification & Escalation!
Presentation!
User Interaction Tools!
Archiving!
Reporting!
Activity! Responsible Tool!
Trigger Ticket Request!
Create Ticket!
Update Event with IM#!
Trigger Courtesy Pages!
Send Pages!
Activity! Responsible Tool!
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Meta-Data Integration Bus
Distributed Collectors
Distributed Collectors
LOB Managed
Monitoring System
Service Provider
Monitoring System
Vendor Managed
Monitoring System
Element
Manager
Element
Manager
Element
Manager
Other
Enterprise
Data
Document
Sharing
Service Desk
 CMDB
Batch
Scheduling
Knowledge
Database
Online Run
Book
PBX/Call
Manager
Visualization Framework
Common Event
Format
Topology And
Relationship
Database
Automated Action
Tools
Distributed Collectors
Automated
Provisioning
System
Predictive
Analysis
Automated
Change
Reconciliation
Security
Management
Archive and Report
Business
Telemetry Data
Service Center
and Enterprise
Notification Tool
Event Processing
As you recognize opportunities to
capture knowledge, use it to improve
your Event Management System. 
Iterative Development
How do we keep it evolving?!
Sometimes We Miss What’s Going On
Say… what’s a mountain
goat doing all the way up
here in a cloud bank?
The Path to Situational
Awareness
Collection
Aggregation
Analytics
Presentation
Situational Awareness
Each phase builds on the previous helping to establish
situational awareness:
•  Data is collected from our IT systems
•  These data are aggregated into a central location
•  Correlations transform the data into information and predictive
analytics process them further into knowledge
•  The processed and enriched knowledge is presented to users in a
way that helps them make good decisions
Cleaning Up the Landscape
Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov
2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391
Silo
Monolithic
Framework
 Niche
Launch Pad
Information Bus
Directed Workflows
Directed !
Non Directed!
Launchpad!
Executive Dashboard!
Business Area!
Dashboards!
Application PAC!
Dashboards!
Command Center!
Dashboards!
Technology Owner!
Dashboard!
Application Owner!
Dashboard!
Problem
Isolation!
Workspace!
Problem
Diagnostics!
Workspace!
System Detail!
View!
Component
Detail!
View!
Here comes the elevator pitch…
The IBM Solution
IBM SmartCloud APM Suite offers essential management capabilities
for applications in complex cloud and hybrid environments.
•  At-a-glance status determination
via network topology graphs!
•  Proactively identify and respond to
compliance issues!
•  Monitor the performance of the
environment and the tenants living
inside of it!
•  Understand the current capacity
needs and forecast future needs!
•  Understand the costs associated
with providing the service and
enable “showback” and “chargeback”
reporting to the application
owners!
SINGLE POINT OF
MANAGEMENT!
!
•  Minimize service and system
outages!
•  Identify recurring incidents and
implement action to remediate
problems before they cause
impacts!
•  Assist troubleshooting by
suppressing “noise” events and
providing root cause determination!
MAXIMIZE SERVICE
AVAILABILITY!
!
•  Reduce the need for manual
action or intervention!
•  Automate for repeatability and
elimination of human error!
•  Develop standardized practices
for complex business processes!
•  Enable the development of APIs
to allow for self-service
management by the consumers!
IMPROVED OPERATIONAL
EFFICIENCY!
Understand the 
end-user experience 
Follow changing 
workloads
Mobile devices &
smart endpoints
Private, public &
hybrid clouds
Highly virtualized applications,
storage & networks 
Discovery
Visibility into
application
resources
End User
Experience
Transaction
performance
monitoring to
ensure SLA
compliance


Transaction
Tracking
Rapid problem
isolation through
transaction
path analysis


Diagnostics


Domain-specific
operations tools
for diagnosis and
repair


Predictive
Analytics
Proactive
approach to
reduce outages
& improve
performance


shared data & common services
See steps 
across the cloud 
VISIBILITY, CONTROL AND AUTOMATION TO INTELLIGENTLY MANAGE
CRITICAL APPLICATIONS IN CLOUD AND HYBRID ENVIRONMENTS.
Tivoli Enterprise Portal
Monitor the complete Application
and Application Infrastructure
Measure, Baseline and Analyze the Service and
Transactions
ITCAM for
Applications
ITM for
Microsoft
Applications
ITM
ITCAM for
Transactions
ITCAM for
SOA Platform
OMEGAMON XE
Tivoli
Enterprise
Portal
Tivoli
Automation
Tivoli Data
Warehouse
Tivoli
Common
Reporting
IBM Tivoli Monitoring Solution
Business Value of Adopting APM
Predictive Outage Avoidance
Ensure availability of applications and services
• Use learning tools to
augment custom best
practices
• Leverage statistical
methods to maximize
predictive warning
• Improve problem
detection across IT silos
Predict
Faster Problem Resolution
Find & correct problems faster with tools that determine actions
required to resolve issues
• Identify problems quicker
with insight to large
unstructured repositories
• Isolate problems quicker by
bringing relevant unstructured
data into problem
investigations
• Repair problems quicker with
the right details quickly to hand.
Resolve
Optimized Performance
Track, Optimize, and Predict capacity and performance needs
over time
• Track capacity and
performance of applications
and services in classic and
cloud environments
• Optimize resource
deployment with what-if and
best fit planning tools
• Escalate capacity and
performance problems before
they cause critical failures
Perform
Improved Insight
Enhance visibility into systems resource relationships while
increasing customer satisfaction
• Determine what resources
are interdependent to
assess impact of failures
• Gain insight into what is
important to your customer
• Decrease customer churn
and acquisition costs while
increasing customer
retention and satisfaction
Know
Automated Analytics helps lower IT Administration Costs:
•  Performance and Capacity planning tools monitor appropriately and escalate, reducing time consuming
report browsing
•  Learning tools reduce customization and best practices investment on initial deployment
•  Log Analysis helps speed problem resolution to be able to do more with less
Let’s keep the
conversation going…
Andrew.P.White@Gmail.com!
ReverendDrew!
SystemsManagementZen.Wordpress.com!
systemsmanagementzen.wordpress.com/feed/!
@SystemsMgmtZen!
ReverendDrew!
APWhite@us.ibm.com!
614-306-3434!
Brighttalk   what should we be monitoring - final

  • 1. By U.S. Navy photo by Mass Communication Specialist 1st Class James E. Foehl [Public domain], via Wikimedia Commons The Age Old Question: What should our APM Solution be monitoring?
  • 2. Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions, including leading the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command. Andrew White, Cloud and Smarter Infrastructure Solution Specialist, IBM Corporation
  • 4. Ground rules for this session… • If you can’t tell if I am trying to be funny… – GO AHEAD AND LAUGH! • Feel free to text, tweet, yammer, or whatever. Use • If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
  • 5. My name is Andrew White. I have a lot of experience leading Systems and Event Management teams.
  • 6. I am here today to share some of what I have learned about Systems Thinking, and APM.
  • 7. 58% of mobile phone users expect websites to load as quickly, almost as quickly or faster on their mobile phone, compared to the computer they use at home* *Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research http://www.flickr.com/photos/lucianbickerton/3858380291/sizes/l/
  • 8. 60% of mobile web users have had a problem in the past year when accessing a website on their phone* *Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) – Gomez Mobile Web Experience Survey conducted by Equation Research http://www.flickr.com/photos/rickyromero/1357938629/sizes/l/
  • 9. Slow load time was the number one issue, experienced by almost 75% of them* *Among adults who accessed the internet with a mobile phone in the past 12 months (n=602) – Gomez Mobile Web Experience Survey conducted by Equation Research http://bighugelabs.com/onblack.php?id=2497744197&size=large
  • 10. Is 5 seconds really bad?
  • 14. If you were the one on the phone with one of those customers… how would you fill that silence?
  • 15. The rationality of individuals is limited by the information they have. This causes “The Tragedy of the Commons.”
  • 16. What Is a System? It is a set of interconnected actors that change over time when they are influenced by other elements of the system. [Diagram: eight interconnected actors]
  • 17. As we have become more aware that things are always happening, our behavior has changed.
  • 18.
  • 19.
  • 20.
  • 21. We are no longer thinking, we are reacting…
  • 22. April 23, 2013: The Twitter account for the Associated Press was hacked. The hackers posted a fake notice that the White House was attacked and President Obama was injured. The Dow dropped 150 points in less than 5 minutes. http://static4.businessinsider.com/image/5176c232ecad04805d000010-505-277/screen%20shot%202013-04-23%20at%201.09.49%20pm.png
  • 23. Systems are Volatile. This change makes it difficult to control the behavior of the system. The good news is that systems are perfect. They always deliver the optimum result given a specific stimulus.
  • 24. Anatomy of An Outage (P0 - Affecting Multiple apps). Diagram labels: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS, Database, WAS, Database, zOS MQ, DB2. Timeline: 1. 5:45-ish pm: CICS ABENDs start flooding OMEGAMON but not high enough to ticket 2. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics 3. 6:04pm: Synthetic transactions fail, and at 6:14 the Ops Center confirms the issue and creates a P0 Incident 4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem 5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
  • 25. Our Problem Statement: The business needs to reliably reach its customers and users regardless of where they may be located. Latency forces close geographic proximity of the components and limits the quality of service provided to geographically distributed customers. If the users can’t use it, it doesn’t work.
  • 26. Our Constraints. At the same time, there are a few inescapable facts we face: 1. Today’s users demand reliable systems to do their work 2. Our systems mirror the complexity of the businesses they support 3. Our environments must be massive to scale to handle the workload 4. There is too much activity for a single person to be totally situationally aware
  • 27. When all of these happen at the same time… Ug…
  • 28. Question: Is there a better way to figure out what monitoring would help?
  • 29. What Do You Want To Accomplish? Your monitoring should help you answer: • How will we know if the users are getting the experience they are expecting? • How much capacity do we need during normal and peak times to ensure user expectations are met? • How quickly can the provider we select ramp up to meet our needs if we find that the service is underperforming? • How fast do we need to be able to access additional capacity once it is ready for us?
  • 30. When decisions are not made based on information, it’s called gambling.
  • 31. Composite Applications: Site Content; Search; Session Information; User Login & Identity Mgmt; Content Mgmt System; Social Network Widgets; Site Tracking & Analytics; Banner Ads & Revenue Generators; Multimedia & CDN Content
  • 32. The Same Old Problem. Diagram labels: Corporate LANs & VPNs; ISP Connection; DNS & Internet Services; Content Mgmt System; Social Network Widgets; Site Tracking & Analytics; Banner Ads & Revenue Generators; Multimedia & CDN Content; Home Wireless & Broadband; Mobile Broadband; The Cloud; Distributed; Database; Mainframe; Network; Middleware; Storage. Is It My Data Center? • Configuration errors • Application design issues • Code defects • Insufficient infrastructure • Oversubscription issues • Poor routing optimization • Low cache hit rate. Is It a Service Provider Problem? • Non-optimized mobile content • Bad performance under load • Blocking content delivery • Incorrect geo-targeted content. Is It an ISP Problem? • Peering problems • ISP outages. Is It My Code or a Browser Problem? • Missing content • Poorly performing JavaScript • Inconsistent CSS rendering • Browser/device incompatibility • Page size too big • Conflicting HTML tag support • Too many objects • Content not optimized for device
  • 33. Cognitive Dissonance
      The Part You Control: Corporate LANs & VPNs; Distributed; Database; Mainframe; Network; Middleware; Storage
      The Part They Experience: ISP Connection; DNS & Internet Services; Content Mgmt System; Social Network Widgets; Site Tracking & Analytics; Banner Ads & Revenue Generators; Multimedia & CDN Content; Home Wireless & Broadband; Mobile Broadband
      All our systems look great, SLAs are being met… meanwhile the user is NOT happy.
      You Have More Control Here Than You Think
  • 34. Gaining Perspective Requires Balance
      Diagram: Packet Capture; Synthetic Transactions; Client Monitoring; Server Probe
      Four Perspectives of User Experience:
      1.  Client to the Server
      2.  Server to the Client
      3.  “3rd Party” Vantage Point
      4.  Synthetic Transactions
  • 35. What Does Good Monitoring Look Like?
      Diagram: Corporate LANs & VPNs; Load Balancer; Firewall; Switch; Web Server Farm; Database; Data Power; Mainframe; Middleware
      Elements of Good Monitoring:
      1.  System Availability
      2.  Operating System Performance
      3.  Hardware Monitoring
      4.  Service/Daemon and Process Availability
      5.  Error Logs
      6.  Application Resource KPIs
      7.  End-to-End Transactions
      8.  Point of Failure Transactions
      9.  Fail-Over Success
      10. “Activity Monitors” and “Reverse Hockey Stick”
  • 36. Finding Metrics That Matter: Evaluating the Effectiveness of a Metric
      §  Will the metric be used in a report? If so, which one? How is it used in the report?
      §  Will the metric be used in a dashboard? If so, which one? How will it be used?
      §  What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
      §  How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
      §  Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
      §  Is the metric always associated with a single problem? Could this metric become a false indicator?
      §  What is the impact if this goes undetected?
      §  What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
  • 37. Beware of Averages
      Chart values: 0.5, 0.7, 0.9, 1.8, 2.5, 2.5, 2.6, 2.9, 3.3, 3.5
      Chart markers: 25th Percentile; 50th Percentile; 75th Percentile; Average
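The comparison on the "Beware of Averages" slide can be reproduced with a short sketch. It assumes the ten chart values are response times in seconds, which the slide does not state, and it adds one hypothetical 30-second outlier to show how the average moves while the median stays put. The function name `summarize` is illustrative.

```python
from statistics import mean, quantiles

# Values shown on the slide's chart (assumed here to be response times in seconds).
samples = [0.5, 0.7, 0.9, 1.8, 2.5, 2.5, 2.6, 2.9, 3.3, 3.5]


def summarize(values):
    """Return the average next to the quartiles so they can be compared."""
    q25, q50, q75 = quantiles(values, n=4, method="inclusive")
    return {"average": mean(values), "p25": q25, "p50": q50, "p75": q75}


baseline = summarize(samples)

# Hypothetical outlier: a single 30-second response, added only for illustration.
with_outlier = summarize(samples + [30.0])

for label, stats in (("baseline", baseline), ("with outlier", with_outlier)):
    print(label, {name: round(value, 2) for name, value in stats.items()})
```

One slow transaction more than doubles the average, yet the 50th percentile is unchanged, which is one reason to alert on percentiles rather than on the mean alone.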
  • 38. What Matters Most? The Goldman Algorithm
      Dr. Lee Goldman, Cook County Hospital, Chicago, IL
      §  Is the patient feeling unstable angina?
      §  Is there fluid in the patient’s lungs?
      §  Is the patient’s systolic blood pressure below 100?
      Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours (scale 0 to 100), Traditional Techniques vs. Goldman Algorithm
      By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
  • 39. The Goldman Algorithm (flowchart)
      •  Start: Patient suspected of acute cardiac ischemia. Perform electrocardiogram (EKG).
      •  ECG evidence of acute myocardial infarction (MI)? ST-segment elevation ≥ 1 mm in ≥ 2 contiguous leads (new or unknown age) or pathologic Q waves in ≥ 2 contiguous leads (new or unknown age)
      •  ECG evidence of acute ischemia? ST-segment depression ≥ 1 mm in ≥ 2 contiguous leads (new or unknown age) or T-wave inversion in ≥ 2 contiguous leads (new or unknown age) or left bundle-branch block (new or unknown age)
      •  Urgent factors present? Rales above both lung bases; systolic blood pressure < 100 mm Hg; unstable ischemic heart disease
      •  Branch labels: Yes; No; 0 Factors; 1 Factor; 0 or 1 Factors; 2 or 3 Factors
      •  Risk levels: High Risk; Moderate Risk; Low Risk; Very Low Risk
      •  Destinations: Coronary Care Unit; Inpatient Telemetry Unit; Observation Unit
  • 40. Driving the Right Action
      Application
      •  End User Experience: Gainesville; San Antonio; Des Moines; Columbus (each with Transaction 1, Transaction 2, … Transaction N)
      •  Infrastructure: Network; Mainframe; Storage; Linux; Middleware; Database (each with KPI 1, KPI 2, … KPI N)
  • 46. Our success in any endeavor depends directly on our ability to solve problems. What do we need to do that?
  • 47. You Gotta Have Skillz…!
  • 48. Common Problem Types
      §  Design Problems
      §  Creative Problems
      §  Daily Problems
      §  People Problems
      Approaches: Rule-Based Approach; Event-Based Approach
  • 49. Event-Based Problem Solving
      •  Appreciative Understanding
      •  Know What We Are Solving
      •  Create A Common Reality
      •  Solutions Based on Causes
  • 50. Rules for Causal Relationships
      ①  Causes are effects, and effects are causes
      Chain: Database Down (Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause)
  • 51. Rules for Causal Relationships
      ②  You can keep identifying causes; there is no limit
      Chain: End of the Universe (Effect); Database Down (Primary Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause/Effect); Beginning of Time (Cause)
  • 52. Two Important Questions: Ask “Why?”; Ask “What”
      Chain: End of the Universe (Effect); Database Down (Primary Effect); Drive Full (Cause/Effect); Logs Not Truncated (Cause/Effect); Beginning of Time (Cause)
  • 53. Rules for Causal Relationships
      ③  An effect is often the result of multiple causes
      Diagram (causes joined by -AND-):
      •  SQL Server was not processing queries (Effect)
      •  Transaction log was unable to grow
      •  T: Drive at 0 Bytes free
      •  Logs were not truncated
      •  DBA on honeymoon vacation in Fiji
      •  Logs are truncated manually
      •  Company has only 1 DBA
      •  “Backup” DBA was not aware the logs require truncation
      •  Space allocations are fixed
      •  Lack of Control
  • 54. Rules for Causal Relationships
      ④  Causes need to be both necessary and sufficient
      Diagram (causes joined by -AND-):
      •  SQL Server was not processing queries (Effect)
      •  Transaction log was unable to grow (Transitory Cause)
      •  T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
      •  Logs were not truncated (Transitory Cause & Effect)
      •  DBA on honeymoon vacation in Fiji (Transitory Cause)
      •  Logs are truncated manually (Non-Transitory Cause)
      •  Company has only 1 DBA (Non-Transitory Cause)
      •  “Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
      •  Space allocations are fixed (Non-Transitory Cause)
      •  Lack of Control
  • 55. How Fire Works
      Diagram: Fire results from Oxygen -AND- Heat -AND- Fuel -AND- Match Strike, plotted over time; causes are labeled Transitory or Non-Transitory
      •  Transitory Causes act as catalysts to bring about change (think Transition)
      •  Non-Transitory Causes are objects, properties/attributes, and status
  • 56. RCA Diagram (causes joined by -AND-)
      •  Customers Complaining
      •  Trying to do business on the website (Desired Condition)
      •  Web Server returning 500 errors
      •  The application server was timing out
      •  Only one application server exists (More Information Needed)
      •  SQL Server was not processing queries
      •  Only one database cluster in use
      •  DR SQL Cluster; DR Cluster being used for UAT testing (More Information Needed)
      •  Transaction log was unable to grow
      •  T: Drive at 0 Bytes free
      •  Logs were not truncated
      •  DBA on honeymoon vacation in Fiji
      •  Logs are truncated manually
      •  Company has only 1 DBA
      •  “Backup” DBA was not aware the logs require truncation
      •  Space allocations are fixed
      •  Lack of Control
  • 57. Add Evidence: the RCA diagram from slide 56, with two kinds of evidence attached to the causes: Statistical Data and Situational Observation
  • 58. Failure Modes Analysis (causes joined by -AND- and -OR-)
      •  SQL Server Not Available
      •  Transaction log is unable to grow
      •  T: Drive at 0 Bytes free
      •  Logs were not truncated
      •  DBA on honeymoon vacation in Fiji
      •  Logs are truncated manually
      •  Company has only 1 DBA
      •  “Backup” DBA was not aware the logs require truncation (Condition Cause)
      •  Space allocations are fixed (Condition Cause)
      •  Lack of Control
      •  SQL is unable to cache query results
      •  Available RAM at 0 Bytes Free
      •  C: Drive at 0 Bytes free
      •  Minidump is configured to write to C: Drive
      •  Server was ASRing frequently
      •  Software distributions were leaving files in the TEMP folder
      •  %TEMP% configured to C:\Temp
      •  Kernel able to write to page file
  • 59. Picking Monitors: the failure modes tree from slide 58, with two rules for placing monitors:
      •  Monitor the intersections at the “OR’s”
      •  At least one point along each branch after the “OR”
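The two monitor-placement rules can be sketched against a small cause tree. The tree below is a simplified, assumed arrangement of a few labels from the failure modes slide (the transcript does not preserve which gate joins which causes), and `Node`, `occurs`, and `monitor_points` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def occurs(node, active):
    """True if the node's condition holds given the set of active leaf causes."""
    if node.gate == "LEAF":
        return node.name in active
    results = [occurs(child, active) for child in node.children]
    return all(results) if node.gate == "AND" else any(results)


def monitor_points(node):
    """List each OR intersection plus the first node on every branch below it."""
    points = []
    if node.gate == "OR":
        points.append(node.name)
        points.extend(child.name for child in node.children)
    for child in node.children:
        points.extend(monitor_points(child))
    return list(dict.fromkeys(points))  # drop repeats, keep order


# Assumed structure: two failure modes, each needing both of its causes.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Logs were not truncated"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
        Node("C: Drive at 0 Bytes free"),
    ]),
])

print(monitor_points(tree))
```

Monitoring the OR node shows that the service is down; monitoring one point on each branch shows which failure mode caused it.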
  • 60. FMEA Matrix (Impact Calculation)
      First scale:
      •  Negligible (1-2): no loss in functionality, mostly cosmetic
      •  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
      •  Critical (5-6): the problem will not resolve itself but a work around exists allowing the problem to be bypassed
      •  Serious (7-8): the problem will not resolve itself and no work around is possible. Functionality is impaired or lost but the system is usable to some extent
      •  Catastrophic (9-10): the system is completely unusable
      Second scale:
      •  Improbable (1-2): less than 1 time per year
      •  Remote (3-4): 1 time per year
      •  Occasional (5-6): 1 time per month
      •  Probable (7-8): 1 time per day
      •  Chronic (9-10): 1 or more times per day
      Third scale:
      •  Very high (1-2): during the design phase
      •  High (3-4): during peer review or unit testing
      •  Moderate (5-6): during system testing or acceptance testing
      •  Remote (7-8): during or immediately after production deployment
      •  Very Remote (9-10): only after heavy usage by users
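The slide gives three 1 to 10 scales but not the formula that combines them. A minimal sketch, assuming they correspond to the conventional FMEA severity, occurrence, and detection ratings and are combined as a risk priority number (their product), might look like this. The class name and the example ratings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) to 10 (catastrophic)
    occurrence: int  # 1 (improbable) to 10 (chronic)
    detection: int   # 1 (caught in design) to 10 (found only after heavy usage)

    def __post_init__(self):
        for rating in (self.severity, self.occurrence, self.detection):
            if not 1 <= rating <= 10:
                raise ValueError("ratings must be between 1 and 10")

    @property
    def rpn(self):
        """Risk priority number: the product of the three ratings (assumed formula)."""
        return self.severity * self.occurrence * self.detection


# Hypothetical ratings, chosen only to illustrate the arithmetic.
modes = [
    FailureMode("SQL is unable to cache query results", severity=6, occurrence=5, detection=7),
    FailureMode("Transaction log is unable to grow", severity=9, occurrence=4, detection=8),
]

# Highest score first: these failure modes deserve monitoring attention first.
ranked = sorted(modes, key=lambda mode: mode.rpn, reverse=True)
for mode in ranked:
    print(mode.rpn, mode.name)
```

Ranking by the product keeps a rare but severe and hard-to-detect failure from being buried under frequent cosmetic ones.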
  • 61. FMEA Matrix (Evidence)
      •  These are the events that help us to RULE IN a failure mode as a possible cause
      •  These are the events that help us RULE OUT the failure mode as not relevant
  • 62. Determining Severity
      Diagram: Logical Server (Virtual Machine 1, Virtual Machine 2); Physical Server (Server 1, Server 2); Logical Volumes (Volume Group 1, Volume Group 2); Physical Volumes (Hard Drive 1, Hard Drive 2, Hard Drive 3)
      Six Levels of Severity (Severity: Description):
      •  Critical: The component has completely failed
      •  Major: The component is operating but is in a degraded or crippled state
      •  Minor: The component is functioning normally but is at risk of a more serious failure
      •  Informational: The component is functioning normally but is reporting a change in state
      •  Unknown: The component has changed its operating state but the effect is not known
      •  Clear: The component is operating normally or a higher severity event has been resolved
      Notes:
      •  The event severity is determined with respect to the component generating the event
      •  The event severity does not consider impact or urgency
      •  The incident priority is not determined by event severity
      •  The event severity helps drive an effective triage when multiple events arrive at approximately the same time
      •  Only after the affected components and their relationships to each other have been determined can impact and urgency be determined
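As a rough sketch of how the six levels could drive triage when several events arrive together, the snippet below encodes them as an ordered enumeration. The numeric ranks simply follow the order in which the slide lists the levels, which is an assumption, and the sample events are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ranks follow the slide's listing order (an assumption); higher is worse.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events):
    """Order (component, severity) pairs so the most severe are handled first."""
    return sorted(events, key=lambda event: event[1], reverse=True)


# Hypothetical events arriving at approximately the same time.
events = [
    ("Hard Drive 2", Severity.MINOR),
    ("Server 1", Severity.INFORMATIONAL),
    ("Virtual Machine 1", Severity.CRITICAL),
    ("Volume Group 1", Severity.MAJOR),
]

for component, severity in triage(events):
    print(f"{severity.name:<14}{component}")
```

Severity here describes only the reporting component. Impact and urgency would still have to come from the relationships between components, as the slide notes.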
  • 63. Monitoring Patterns: Layers of Pre-Defined Monitoring Patterns
      •  The OS template is deployed when the server is provisioned
      •  As a server is customized to fit its role, additional templates are deployed
      •  Templates are stacked on top of each other until no gaps remain
      •  This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
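The stacking idea can be illustrated with plain dictionaries. This is only a sketch: the template names, metric names, and thresholds are hypothetical, and letting a later layer override an earlier one is a design assumption rather than something the slide specifies.

```python
# Hypothetical templates: metric name -> alert threshold.
OS_TEMPLATE = {"cpu_busy_pct": 90, "disk_free_pct": 10}
WEB_SERVER_TEMPLATE = {"http_5xx_per_min": 5}
DATABASE_TEMPLATE = {"txn_log_free_pct": 15, "disk_free_pct": 20}


def stack_templates(*layers):
    """Merge templates in order; a later layer overrides an earlier one (assumed rule)."""
    monitors = {}
    for layer in layers:
        monitors.update(layer)
    return monitors


def coverage_gaps(monitors, required):
    """Return the required metrics that no deployed template covers yet."""
    return sorted(set(required) - set(monitors))


# The OS layer is applied at provisioning, the role layer once the server becomes a database host.
db_server = stack_templates(OS_TEMPLATE, DATABASE_TEMPLATE)
print(db_server)
print(coverage_gaps(db_server, ["cpu_busy_pct", "txn_log_free_pct", "replication_lag_s"]))
```

Any metric reported as a gap is a candidate for either another standard template or a custom monitor.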
  • 64. Application-Technology Matrix
      Maps services, applications and technologies, enabling:
      •  Monitoring investment prioritization
      •  Monitoring maturity
      •  Which templates need to be deployed when new hardware is acquired
      •  Whether a service has sufficient monitoring coverage based on its application components
      •  This approach allows for anticipating changes to a customer’s monitoring needs
      Scores indicate: 0 = No Strategy; 1 = Limited Monitoring; 2 = Fully Integrated Strategy
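One way to picture the matrix is as a lookup from applications to the technologies they depend on, each scored 0 to 2 as on the slide. The applications, technologies, and scores below are hypothetical, and judging a service by its weakest technology is an assumed rule for illustration.

```python
# Scores follow the slide: 0 = no strategy, 1 = limited monitoring, 2 = fully integrated strategy.
TECHNOLOGY_SCORES = {"Linux": 2, "Database": 1, "Middleware": 0, "Network": 2}  # hypothetical

# Hypothetical mapping of applications to the technologies they run on.
APPLICATIONS = {
    "Online Ordering": ["Linux", "Database", "Middleware"],
    "Intranet": ["Linux", "Network"],
}


def coverage(application):
    """Summarize monitoring coverage for one application (weakest-link rule, assumed)."""
    technologies = APPLICATIONS[application]
    scores = [TECHNOLOGY_SCORES[tech] for tech in technologies]
    gaps = [tech for tech in technologies if TECHNOLOGY_SCORES[tech] < 2]
    return {"weakest": min(scores), "gaps": gaps}


for name in APPLICATIONS:
    print(name, coverage(name))
```

Sorting applications by their weakest score gives a simple first cut at where monitoring investment is most needed.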
  • 65. Event Lifecycle
      Legend: Activity; Responsible Tool
      Activities: Data Collection; Anomaly Detection; Event Generation; Integration; Event Processing; Enrichment; Event Suppression; Correlation; Root Cause Analysis; Business Impact Analysis; Automation; Notification & Escalation; Presentation; User Interaction Tools; Archiving; Reporting
      Responsible tools: Element Manager; Distributed Collectors; Object Server Triggers; Impact Policies; ITNM RCA Engine; Gateway Replication; Webtop Event List; Software-Operating System
      Ticketing and paging steps: Trigger Ticket Request; Create Ticket; Update Event with IM#; Trigger Courtesy Pages; Send Pages
  • 66. Event management architecture (diagram)
      Event Processing stages: Common Event Format; Enrichment; Correlation and Event Suppression; Root Cause Analysis; Business Impact Analysis; Notification and Escalation; Automated Action; Archive and Report
      Event sources: LOB Managed Monitoring System; Service Provider Monitoring System; Vendor Managed Monitoring System; Element Managers; Distributed Collectors
      Meta-Data Integration Bus and other enterprise data: Document Sharing; Service Desk; CMDB; Batch Scheduling; Knowledge Database; Online Run Book; PBX/Call Manager; Topology And Relationship Database
      Connected tools: Visualization Framework; Automated Action Tools; Automated Provisioning System; Predictive Analysis; Automated Change Reconciliation; Security Management; Business Telemetry Data; Service Center and Enterprise Notification Tool
  • 67. Iterative Development: As you recognize opportunities to capture knowledge, use it to improve your Event Management System.
  • 68. How do we keep it evolving?!
  • 69. Sometimes We Miss What’s Going On: “Say… what’s a mountain goat doing all the way up here in a cloud bank?”
  • 70. The Path to Situational Awareness
      Phases: Collection; Aggregation; Analytics; Presentation; Situational Awareness
      Each phase builds on the previous, helping to establish situational awareness:
      •  Data is collected from our IT systems
      •  These data are aggregated into a central location
      •  Correlations transform the data into information and predictive analytics process them further into knowledge
      •  The processed and enriched knowledge is presented to users in a way that helps them make good decisions
  • 71. Cleaning Up the Landscape
      Diagram labels: Silo; Monolithic Framework; Niche; Launch Pad; Information Bus
      Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov 2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391
  • 72. Directed Workflows
      Workflow types: Directed; Non Directed
      Views: Launchpad; Executive Dashboard; Business Area Dashboards; Application PAC Dashboards; Command Center Dashboards; Technology Owner Dashboard; Application Owner Dashboard; Problem Isolation Workspace; Problem Diagnostics Workspace; System Detail View; Component Detail View
  • 73. Here comes the elevator pitch…
  • 74. The IBM Solution: IBM SmartCloud APM Suite offers essential management capabilities for applications in complex cloud and hybrid environments.
      SINGLE POINT OF MANAGEMENT
      •  At-a-glance status determination via network topology graphs
      •  Proactively identify and respond to compliance issues
      •  Monitor the performance of the environment and the tenants living inside of it
      •  Understand the current capacity needs and forecast future needs
      •  Understand the costs associated with providing the service and enable “showback” and “charge back” reporting to the application owners
      MAXIMIZE SERVICE AVAILABILITY
      •  Minimize service and system outages
      •  Identify recurring incidents and implement action to remediate problems before they cause impacts
      •  Assist troubleshooting by suppressing “noise” events and providing root cause determination
      IMPROVED OPERATIONAL EFFICIENCY
      •  Reduce the need for manual action or intervention
      •  Automate for repeatability and elimination of human error
      •  Develop standardized practices for complex business processes
      •  Enable the development of APIs to allow for self-service management by the consumers
  • 75. VISIBILITY, CONTROL AND AUTOMATION TO INTELLIGENTLY MANAGE CRITICAL APPLICATIONS IN CLOUD AND HYBRID ENVIRONMENTS.
      Understand the end-user experience; follow changing workloads; see steps across the cloud
      Environments: Mobile devices & smart endpoints; Private, public & hybrid clouds; Highly virtualized applications, storage & networks
      Capabilities (on shared data & common services):
      •  Discovery: Visibility into application resources
      •  End User Experience: Transaction performance monitoring to ensure SLA compliance
      •  Transaction Tracking: Rapid problem isolation through transaction path analysis
      •  Diagnostics: Domain-specific operations tools for diagnosis and repair
      •  Predictive Analytics: Proactive approach to reduce outages & improve performance
  • 76. IBM Tivoli Monitoring Solution
      •  Monitor the complete Application and Application Infrastructure
      •  Measure, Baseline and Analyze the Service and Transactions
      Components: Tivoli Enterprise Portal; ITCAM for Applications; ITM for Microsoft Applications; ITM; ITCAM for Transactions; ITCAM for SOA Platform; OMEGAMON XE; Tivoli Automation; Tivoli Data Warehouse; Tivoli Common Reporting
  • 77. Business Value of Adopting APM
      Predict: Predictive Outage Avoidance. Ensure availability of applications and services
      •  Use learning tools to augment custom best practices
      •  Leverage statistical methods to maximize predictive warning
      •  Improve problem detection across IT silos
      Resolve: Faster Problem Resolution. Find & correct problems faster with tools that determine actions required to resolve issues
      •  Identify problems quicker with insight to large unstructured repositories
      •  Isolate problems quicker by bringing relevant unstructured data into problem investigations
      •  Repair problems quicker with the right details quickly to hand
      Perform: Optimized Performance. Track, Optimize, and Predict capacity and performance needs over time
      •  Track capacity and performance of applications and services in classic and cloud environments
      •  Optimize resource deployment with what-if and best fit planning tools
      •  Escalate capacity and performance problems before they cause critical failures
      Know: Improved Insight. Enhance visibility into systems resource relationships while increasing customer satisfaction
      •  Determine what resources are interdependent to assess impact of failures
      •  Gain insight into what is important to your customer
      •  Decrease customer churn and acquisition costs while increasing customer retention and satisfaction
      Automated Analytics helps lower IT Administration Costs:
      •  Performance and Capacity planning tools monitor appropriately and escalate, reducing time consuming report browsing
      •  Learning tools reduce customization and best practices investment on initial deployment
      •  Log Analysis helps speed problem resolution to be able to do more with less
  • 78. Let’s keep the conversation going… Andrew.P.White@Gmail.com; ReverendDrew; SystemsManagementZen.Wordpress.com; systemsmanagementzen.wordpress.com/feed/; @SystemsMgmtZen; ReverendDrew; APWhite@us.ibm.com; 614-306-3434