Brighttalk what should we be monitoring - final
1. By U.S. Navy photo by Mass Communication Specialist 1st Class James E. Foehl [Public domain], via Wikimedia Commons
The Age Old Question: What should
our APM Solution be monitoring?
2. Follow Us: #ITSMSummit!
Mr. White has fifteen years of experience designing and managing the
deployment of Systems Monitoring and Event Management software. Prior
to joining IBM, Mr. White held various positions including the leader of the
Monitoring and Event Management organization of a Fortune 100 company
and developing solutions as a consultant for a wide variety of organizations,
including the Mexican Secretaría de Hacienda y Crédito Público, Telmex,
Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US
Navy Facilities and Engineering Command.
Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation
4. Ground rules for this session…
• If you can’t tell if I am trying to be funny…
  - GO AHEAD AND LAUGH!
• Feel free to text, tweet, yammer, or whatever.
Use
• If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
5. I have a lot of experience leading
Systems and Event Management teams
My name is Andrew White
6. I am here today to share some of what I have learned about
Systems Thinking,
and APM.
7. 58% of mobile phone users expect websites to load as quickly, almost as quickly or faster on their mobile phone, compared to the computer they use at home*
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) - Gomez Mobile Web Experience Survey conducted by Equation Research
http://www.flickr.com/photos/lucianbickerton/3858380291/sizes/l/
8. 60% of mobile web users have had a problem in the past year when accessing a website on their phone*
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=1,001) - Gomez Mobile Web Experience Survey conducted by Equation Research
http://www.flickr.com/photos/rickyromero/1357938629/sizes/l/
9. Slow load time was the number one issue, experienced by almost 75% of them*
*Among adults who accessed the internet with a mobile phone in the past 12 months (n=602) - Gomez Mobile Web Experience Survey conducted by Equation Research
http://bighugelabs.com/onblack.php?id=2497744197&size=large
14. If you were the one on the phone with one of those customers… how would you fill that silence?
15. The rationality of individuals is limited by the information they have. This causes “The Tragedy of the Commons.”
16. What Is a System?
It is a set of interconnected actors that change over time when they are influenced by other elements of the system.
[Diagram: eight interconnected Actor nodes]
17. As we have become more
aware that things are always
happening, our behavior has
changed.
22. http://static4.businessinsider.com/image/5176c232ecad04805d000010-505-277/screen%20shot%202013-04-23%20at%201.09.49%20pm.png
April 23, 2013
The Twitter account for the
Associated Press was hacked
The hackers posted a fake
notice that the White House
was attacked and President
Obama was injured
The Dow dropped 150 points
in less than 5 minutes
23. Systems are Volatile
This change makes it difficult to control the behavior of the system. The good news is that systems are perfect. They always deliver the optimum result given a specific stimulus.
24. Anatomy of an Outage: P0 Affecting Multiple Apps
Components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS, Database, WAS, Database, zOS MQ, DB2
Timeline:
1. 5:45-ish pm: CICS ABENDs start flooding OMEGAMON but not high enough to ticket
2. 6:00-ish pm: MQ flows are interrupted and start alerting in Flow Diagnostics
3. 6:04pm: Synthetic transactions fail; at 6:14 the Ops Center confirms the issue and creates a P0 Incident
4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
25. Our Problem Statement:
The business needs to reliably reach its customers and users regardless of where they may be located. Latency forces close geographic proximity of the components and limits the quality of service provided to geographically distributed customers.
If the users can’t use it, it doesn’t work.
26. Our Constraints
At the same time, there are a few inescapable facts we face:
1. Today’s users demand reliable systems to do their work
2. Our systems mirror the complexity of the businesses they support
3. Our environments must be massive to scale to handle the workload
4. There is too much activity for a single person to be totally situationally aware
27. When all of these happen at the same time…
Ugh…
29. What Do You Want To Accomplish?
Your monitoring should help you answer:
• How will we know if the users are getting the experience they are expecting?
• How much capacity do we need during normal and peak times to ensure user expectations are met?
• How quickly can the provider we select ramp up to meet our needs if we find that the service is underperforming?
• How fast do we need to be able to access additional capacity once it is ready for us?
31. Composite Applications
Site Content
Search
Session Information
User Login & Identity Mgmt
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
32. The Same Old Problem
Corporate LANs & VPNs
ISP Connection
DNS & Internet Services
Content Mgmt System
Social Network Widgets
Site Tracking & Analytics
Banner Ads & Revenue Generators
Multimedia & CDN Content
Home Wireless & Broadband
Mobile Broadband
Is It My Data Center?
• Configuration errors
• Application design issues
• Code defects
• Insufficient infrastructure
• Oversubscription issues
• Poor routing optimization
• Low cache hit rate
Is It a Service Provider Problem?
• Non-optimized mobile content
• Bad performance under load
• Blocking content delivery
• Incorrect geo-targeted content
Is It an ISP Problem?
• Peering problems
• ISP outages
Is It My Code or a Browser Problem?
• Missing content
• Poorly performing JavaScript
• Inconsistent CSS rendering
• Browser/device incompatibility
• Page size too big
• Conflicting HTML tag support
• Too many objects
• Content not optimized for device
The Cloud
Distributed
Database
Mainframe
Network
Middleware
Storage
33. Cognitive Dissonance
Corporate
LANs & VPNs
Distributed
Database
Mainframe
Network
Middleware
Storage
ISP
Connection
DNS & Internet
Services
Content Mgmt
System
Social Network
Widgets
Site Tracking
& Analytics
Banner Ads &
Revenue Generators
Multimedia &
CDN Content
Home Wireless
& Broadband
Mobile Broadband
The Part You Control
The Part They Experience
All our systems look great, SLAs are being met…
…meanwhile the user is NOT happy
You Have More Control Here Than You Think
34. Gaining Perspective Requires Balance
Packet Capture
Synthetic Transactions
Client Monitoring
Client Monitoring
Synthetic Transactions
Server Probe
Four Perspectives of User Experience
1. Client to the Server
2. Server to the Client
3. “3rd Party” Vantage Point
4. Synthetic Transactions
35. What Does Good Monitoring Look Like?
Corporate LANs & VPNs
Load Balancer
Load Balancer
Firewall
Switch
Web Server Farm
Database
Data Power
Mainframe
Middleware
Load Balancer
Elements of Good Monitoring
1. System Availability
2. Operating System Performance
3. Hardware Monitoring
4. Service/Daemon and Process Availability
5. Error Logs
6. Application Resource KPIs
7. End-to-End Transactions
8. Point of Failure Transactions
9. Fail-Over Success
10. “Activity Monitors” and “Reverse Hockey Stick”
36. Finding Metrics That Matter
Evaluating the Effectiveness of a Metric
▪ Will the metric be used in a report? If so, which one? How is it used in the report?
▪ Will the metric be used in a dashboard? If so, which one? How will it be used?
▪ What action(s) will be taken if an alert is generated? Who are the actors? Will a ticket be generated? If so, what severity?
▪ How often is this event likely to occur? What is the impact if the event occurs? What is the likelihood it can be detected by monitoring?
▪ Will the metric help identify the source of a problem? Is it a coincident / symptomatic indicator?
▪ Is the metric always associated with a single problem? Could this metric become a false indicator?
▪ What is the impact if this goes undetected?
▪ What is the lifespan for this metric? What is the potential for changes that may reduce the efficacy of the metric?
38. What Matters Most?
Dr. Lee Goldman, Cook County Hospital, Chicago, IL
The Goldman Algorithm
▪ Is the patient feeling unstable angina?
▪ Is there fluid in the patient’s lungs?
▪ Is the patient’s systolic blood pressure below 100?
Prediction of Patients Expected to Have a Heart Attack Within 72 Hours
[Bar chart, axis 0 to 100: Traditional Techniques vs. Goldman Algorithm]
By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
39. The Goldman Algorithm
Patient suspected of Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
ECG Evidence of Acute Myocardial Infarction (MI)?
ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
Yes / No
ECG Evidence of Acute Ischemia?
ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or
T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or
Left Bundle-Branch Block (New or Unknown Age)
Yes / No
Urgent Factors Present? (asked on both the Yes and the No branch)
Rales Above Both Lung Bases
Systolic Blood Pressure <100 mm Hg
Unstable Ischemic Heart Disease
Branch labels: 0 Factors, 1 Factor, 2 or 3 Factors; 0 or 1 Factors, 2 or 3 Factors
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Units: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
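The decision tree on this slide can be sketched in a few lines of Python. The flattened slide text no longer shows which branch leads to which risk level, so the mapping below is an assumption based on a common reading of the published Goldman criteria and should be checked against the original figure. The function names `count_urgent_factors` and `goldman_risk` are hypothetical, and the sketch is meant only to show how few inputs drive the outcome.

```python
def count_urgent_factors(rales_above_both_bases, systolic_bp_mm_hg, unstable_ischemic_disease):
    """Count how many of the three urgent factors from the slide are present."""
    return sum([
        bool(rales_above_both_bases),
        systolic_bp_mm_hg < 100,
        bool(unstable_ischemic_disease),
    ])


def goldman_risk(ecg_mi, ecg_ischemia, urgent_factors):
    """Return a risk level; the branch-to-risk mapping is an assumed reading of the slide."""
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_mi:
        return "High Risk"
    if ecg_ischemia:
        # Three branches: 0 factors, 1 factor, 2 or 3 factors
        if urgent_factors >= 2:
            return "High Risk"
        if urgent_factors == 1:
            return "Moderate Risk"
        return "Low Risk"
    # Two branches: 0 or 1 factors, 2 or 3 factors
    return "Moderate Risk" if urgent_factors >= 2 else "Very Low Risk"


factors = count_urgent_factors(False, 95, False)
risk = goldman_risk(ecg_mi=False, ecg_ischemia=True, urgent_factors=factors)
```

The point for monitoring is the same as the slide's: a short list of well-chosen signals, evaluated in a fixed order, can classify a situation without collecting everything.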
48. Common Problem Types
▪ Design Problems
▪ Creative Problems
▪ Daily Problems
▪ People Problems
Rule-Based Approach
Event-Based Approach
49. Event-Based Problem Solving
• Appreciative Understanding
• Know What We Are Solving
• Create A Common Reality
• Solutions Based on Causes
50. Rules for Causal Relationships
① Causes are effects, and effects are causes
Database Down (Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause)
51. Rules for Causal Relationships
② You can keep identifying causes; there is no limit
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
52. Two Important Questions
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
Ask “Why?”
Ask “What”
53. Rules for Causal Relationships
③ An Effect is often the result of multiple causes
SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
(Causes are joined by -AND- connectors in the diagram.)
54. Rules for Causal Relationships
④ Causes need to be both necessary and sufficient
SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control
(Causes are joined by -AND- connectors in the diagram.)
55. How Fire Works
[Timeline diagram labels: Time, Oxygen, Heat, Fuel, Match Strike, Fire; legend: Transitory, Non-Transitory]
[Cause diagram: Fire, joined by -AND- to Oxygen, Heat, Fuel, Match Strike]
• Transitory Causes act as catalysts to bring about change (think Transition)
• Non-Transitory Causes are objects, properties/attributes, and status
56. RCA Diagram
Customers Complaining
Web Server returning 500 errors
The application server was timing out
SQL Server was not processing queries
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Only one application server exists
More Information Needed
Trying to do business on the website
Desired Condition
(Causes are joined by -AND- connectors in the diagram.)
57. Add Evidence
[Same RCA diagram as slide 56, with evidence attached to the causes]
Evidence types: Statistical Data, Situational Observation
58. Failure Modes Analysis
SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file
(Causes are joined by -AND- and -OR- connectors in the diagram.)
59. Picking Monitors
[Same failure modes diagram as slide 58]
Monitor the intersections at the “ORs”
At least one point along each branch after the “OR”
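The rule on this slide (monitor each OR intersection, plus at least one point on every branch beneath it) can be sketched as a walk over a failure tree. The `Node` class, the `monitor_points` function, and the shape of the example tree are assumptions for illustration; the slide text no longer shows how the boxes were connected, so the tree below is only one plausible reading.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list[Node] = field(default_factory=list)


def monitor_points(node):
    """Collect each OR intersection and the first node on every branch below it."""
    points = []
    if node.gate == "OR":
        points.append(node.label)
        points.extend(child.label for child in node.children)
    for child in node.children:
        for label in monitor_points(child):
            if label not in points:
                points.append(label)
    return points


# Assumed structure: two alternative failure modes under one OR gate.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
    ]),
])

points = monitor_points(tree)
```

Choosing the first node under each OR is the simplest way to satisfy "at least one point along each branch"; a real deployment might prefer a deeper node that gives earlier warning.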
60. FMEA Matrix (Impact Calculation)
Negligible (1-2): no loss in functionality, mostly cosmetic
Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
Catastrophic (9-10): the system is completely unusable
Improbable (1-2): less than 1 time per year
Remote (3-4): 1 time per year
Occasional (5-6): 1 time per month
Probable (7-8): 1 time per day
Chronic (9-10): 1 or more times per day
Very high (1-2): during the design phase
High (3-4): during peer review or unit testing
Moderate (5-6): during system testing or acceptance testing
Remote (7-8): during or immediately after production deployment
Very Remote (9-10): only after heavy usage by users
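The slide lists three 1-10 scales but does not say how they are combined. A conventional FMEA approach multiplies them into a risk priority number (RPN); the sketch below assumes that convention, and it assumes the three scales correspond to severity, occurrence, and detection, since the column headings did not survive. The names `band` and `risk_priority_number` and the example scores are hypothetical.

```python
SEVERITY = ["Negligible", "Marginal", "Critical", "Serious", "Catastrophic"]
OCCURRENCE = ["Improbable", "Remote", "Occasional", "Probable", "Chronic"]
DETECTION = ["Very high", "High", "Moderate", "Remote", "Very Remote"]


def band(score, labels):
    """Map a 1-10 score to its band label (1-2, 3-4, 5-6, 7-8, 9-10)."""
    if not 1 <= score <= 10:
        raise ValueError("score must be between 1 and 10")
    return labels[(score - 1) // 2]


def risk_priority_number(severity, occurrence, detection):
    """Conventional FMEA product of the three scores (assumed, not stated on the slide)."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("score must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical scoring of the "T: Drive at 0 Bytes free" failure mode.
rpn = risk_priority_number(severity=8, occurrence=5, detection=7)
labels = (band(8, SEVERITY), band(5, OCCURRENCE), band(7, DETECTION))
```

Ranking failure modes by this number is one way to decide which monitors to build first; other weightings are equally defensible.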
61. FMEA Matrix (Evidence)
These are the events that help us to RULE IN a
failure mode as a possible cause
These are the events that help us RULE OUT the
failure mode as not relevant
62. Determining Severity
Six Levels of Severity
Severity | Description
Critical | The component has completely failed
Major | The component is operating but is in a degraded or crippled state
Minor | The component is functioning normally but is at risk of a more serious failure
Informational | The component is functioning normally but is reporting a change in state
Unknown | The component has changed its operating state but the effect is not known
Clear | The component is operating normally or a higher severity event has been resolved
• The event severity is determined with respect to the component generating the event
• The event severity does not consider impact or urgency
• The incident priority is not determined by event severity
• The event severity helps drive an effective triage when multiple events arrive at approximately the same time
• Only after the affected components and their relationships to each other have been determined can impact and urgency be determined
[Diagram: Logical Server (Virtual Machine 1, Virtual Machine 2); Physical Server (Server 1, Server 2); Logical Volumes (Volume Group 1, Volume Group 2); Physical Volumes (Hard Drive 1, Hard Drive 2, Hard Drive 3)]
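As a rough illustration of severity-driven triage, the six levels can be modeled as an ordered enumeration and used to sort events that arrive close together. The numeric ranks follow the order in which the slide lists the levels, which is an assumption about their relative priority (in particular where Unknown sits). The event fields and the `triage` function are hypothetical.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Ranks assumed from the order the levels appear on the slide.
    CLEAR = 0
    UNKNOWN = 1
    INFORMATIONAL = 2
    MINOR = 3
    MAJOR = 4
    CRITICAL = 5


def triage(events):
    """Order events by severity (highest first), then by arrival order."""
    return sorted(events, key=lambda e: (-e["severity"], e["received"]))


# Component names taken from the slide's diagram; arrival order is invented.
events = [
    {"component": "Volume Group 1", "severity": Severity.MINOR, "received": 2},
    {"component": "Hard Drive 2", "severity": Severity.CRITICAL, "received": 3},
    {"component": "Virtual Machine 1", "severity": Severity.MAJOR, "received": 1},
]

ordered = [e["component"] for e in triage(events)]
```

Note that this ordering only decides what to look at first. As the slide says, incident priority still depends on impact and urgency, which need the component relationships.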
63. Monitoring Patterns
Layers of Pre-Defined Monitoring Patterns
• The OS template is deployed when the server is provisioned
• As a server is customized to fit its role, additional templates are deployed
• Templates are stacked on top of each other until no gaps remain
• This approach provides a high degree of standardization without sacrificing the ability to develop a custom solution
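Template stacking can be sketched as layered dictionaries, where each role adds its monitors on top of the OS baseline and a gap check reports anything still uncovered. The template names, monitor names, and thresholds below are all hypothetical; the slide does not define a concrete format.

```python
# Hypothetical templates: each maps a monitor name to an alert threshold.
TEMPLATES = {
    "os": {"cpu_busy_pct": 90, "disk_free_pct": 10},
    "web_server": {"http_5xx_per_min": 5},
    "database": {"disk_free_pct": 20, "log_free_pct": 15},
}


def stack_templates(roles, templates=TEMPLATES):
    """Layer templates in order; a later layer overrides an earlier threshold."""
    deployed = {}
    for role in roles:
        deployed.update(templates[role])
    return deployed


def coverage_gaps(required, deployed):
    """Return the required monitor names that no deployed template provides."""
    return sorted(set(required) - set(deployed))


deployed = stack_templates(["os", "database"])
gaps = coverage_gaps(["cpu_busy_pct", "log_free_pct", "http_5xx_per_min"], deployed)
```

Letting later layers override earlier ones is a design choice made here for simplicity: it keeps the OS baseline standard while a role template can tighten a threshold for its own needs.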
64. Application-Technology Matrix
Maps services, applications and technologies, enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs
Scores indicate:
0 - No Strategy
1 - Limited Monitoring
2 - Fully Integrated Strategy
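A matrix like this can be held as a nested mapping from service to technology to score, which makes coverage questions easy to answer. The service names and scores below are invented for illustration, and the `coverage` ratio (sum of scores over the maximum possible) is an assumed way of summarizing a row, not something the slide defines.

```python
MAX_SCORE = 2  # 0 = No Strategy, 1 = Limited Monitoring, 2 = Fully Integrated Strategy

# Hypothetical services scored against technologies named elsewhere in the deck.
matrix = {
    "Online Ordering": {"WAS": 2, "DB2": 1, "MQ": 0},
    "Payroll": {"CICS": 2, "DB2": 2},
}


def coverage(scores):
    """Fraction of the maximum possible score achieved across a service's technologies."""
    return sum(scores.values()) / (MAX_SCORE * len(scores))


def unmonitored(scores):
    """Technologies with no monitoring strategy at all."""
    return sorted(tech for tech, score in scores.items() if score == 0)


report = {service: coverage(scores) for service, scores in matrix.items()}
missing = unmonitored(matrix["Online Ordering"])
```

Sorting services by this ratio gives a first cut at investment prioritization; the zero-score list shows which templates are still missing.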
66.
Automated Action
Notification and Escalation
Business Impact Analysis
Root Cause Analysis
Correlation and Event Suppression
Enrichment
Meta-Data Integration Bus
Distributed Collectors
Distributed Collectors
LOB Managed Monitoring System
Service Provider Monitoring System
Vendor Managed Monitoring System
Element Manager
Element Manager
Element Manager
Other Enterprise Data
Document Sharing
Service Desk
CMDB
Batch Scheduling
Knowledge Database
Online Run Book
PBX/Call Manager
Visualization Framework
Common Event Format
Topology And Relationship Database
Automated Action Tools
Distributed Collectors
Automated Provisioning System
Predictive Analysis
Automated Change Reconciliation
Security Management
Archive and Report
Business Telemetry Data
Service Center and Enterprise Notification Tool
Event Processing
67. Iterative Development
As you recognize opportunities to capture knowledge, use it to improve your Event Management System.
69. Sometimes We Miss What’s Going On
Say… what’s a mountain goat doing all the way up here in a cloud bank?
70. The Path to Situational Awareness
Collection, Aggregation, Analytics, Presentation, Situational Awareness
Each phase builds on the previous, helping to establish situational awareness:
• Data is collected from our IT systems
• These data are aggregated into a central location
• Correlations transform the data into information and predictive analytics process them further into knowledge
• The processed and enriched knowledge is presented to users in a way that helps them make good decisions
71. Cleaning Up the Landscape
Adapted from: Akella, Janaki. “IT Architecture: Cutting costs and complexity.” McKinsey Quarterly 13 Nov 2009 https://www.mckinseyquarterly.com/IT_architecture_Cutting_costs_and_complexity_2391
Silo
Monolithic
Framework
Niche
Launch Pad
Information Bus
72. Directed Workflows
Directed
Non Directed
Launchpad
Executive Dashboard
Business Area Dashboards
Application PAC Dashboards
Command Center Dashboards
Technology Owner Dashboard
Application Owner Dashboard
Problem Isolation Workspace
Problem Diagnostics Workspace
System Detail View
Component Detail View
74. The IBM Solution
IBM SmartCloud APM Suite offers essential management capabilities for applications in complex cloud and hybrid environments.
SINGLE POINT OF MANAGEMENT
• At-a-glance status determination via network topology graphs
• Proactively identify and respond to compliance issues
• Monitor the performance of the environment and the tenants living inside of it
• Understand the current capacity needs and forecast future needs
• Understand the costs associated with providing the service and enable “showback” and “chargeback” reporting to the application owners
MAXIMIZE SERVICE AVAILABILITY
• Minimize service and system outages
• Identify recurring incidents and implement action to remediate problems before they cause impacts
• Assist troubleshooting by suppressing “noise” events and providing root cause determination
IMPROVED OPERATIONAL EFFICIENCY
• Reduce the need for manual action or intervention
• Automate for repeatability and elimination of human error
• Develop standardized practices for complex business processes
• Enable the development of APIs to allow for self-service management by the consumers
75. VISIBILITY, CONTROL AND AUTOMATION TO INTELLIGENTLY MANAGE CRITICAL APPLICATIONS IN CLOUD AND HYBRID ENVIRONMENTS.
Understand the end-user experience
Follow changing workloads
See steps across the cloud
Mobile devices & smart endpoints
Private, public & hybrid clouds
Highly virtualized applications, storage & networks
Discovery: Visibility into application resources
End User Experience: Transaction performance monitoring to ensure SLA compliance
Transaction Tracking: Rapid problem isolation through transaction path analysis
Diagnostics: Domain-specific operations tools for diagnosis and repair
Predictive Analytics: Proactive approach to reduce outages & improve performance
shared data & common services
76. Tivoli Enterprise Portal
Monitor the complete Application and Application Infrastructure
Measure, Baseline and Analyze the Service and Transactions
ITCAM for Applications
ITM for Microsoft Applications
ITM
ITCAM for Transactions
ITCAM for SOA Platform
OMEGAMON XE
Tivoli Enterprise Portal
Tivoli Automation
Tivoli Data Warehouse
Tivoli Common Reporting
IBM Tivoli Monitoring Solution
77. Business Value of Adopting APM
Predict: Predictive Outage Avoidance
Ensure availability of applications and services
• Use learning tools to augment custom best practices
• Leverage statistical methods to maximize predictive warning
• Improve problem detection across IT silos
Resolve: Faster Problem Resolution
Find & correct problems faster with tools that determine actions required to resolve issues
• Identify problems quicker with insight to large unstructured repositories
• Isolate problems quicker by bringing relevant unstructured data into problem investigations
• Repair problems quicker with the right details quickly to hand.
Perform: Optimized Performance
Track, Optimize, and Predict capacity and performance needs over time
• Track capacity and performance of applications and services in classic and cloud environments
• Optimize resource deployment with what-if and best fit planning tools
• Escalate capacity and performance problems before they cause critical failures
Know: Improved Insight
Enhance visibility into systems resource relationships while increasing customer satisfaction
• Determine what resources are interdependent to assess impact of failures
• Gain insight into what is important to your customer
• Decrease customer churn and acquisition costs while increasing customer retention and satisfaction
Automated Analytics helps lower IT Administration Costs:
• Performance and Capacity planning tools monitor appropriately and escalate, reducing time consuming report browsing
• Learning tools reduce customization and best practices investment on initial deployment
• Log Analysis helps speed problem resolution to be able to do more with less