Tech Talk: Give Me the Bad News Straight: Why Models are a Broken Approach to Alerting
David B. Martin
DevOps: Agile Ops
CA Technologies
APM Product Manager
DO5T41T
#CAWorld
© 2015 CA. ALL RIGHTS RESERVED. @CAWORLD #CAWORLD
Give Me the Bad News Straight: Why Models are a Broken Approach to Alerting
The industry-standard approach to automatic alerts is to create models
by baselining application latencies. But when something goes wrong,
is it because something is really broken or because the model was
incorrect? Training the model to avoid mistakes is complex and
time-intensive. CA Application Performance Management (CA APM) 10
replaces the whole approach with a brand new one: react to changes in
application stability as they occur. Outliers are automatically ignored,
while tremors in latency register progressively bigger values for the
intensity of an event, a little like the Richter scale for earthquakes. Join
the discussion and learn how CA APM transforms automatic alerting.
For Informational Purposes Only
Terms of this Presentation
© 2015 CA. All rights reserved. All trademarks referenced herein belong to their respective companies.
The content provided in this CA World 2015 presentation is intended for informational purposes only and does not form any type of warranty. The information provided by a CA partner and/or CA customer has not been reviewed for accuracy by CA.
Agenda
WHY MODELS ARE FAILING
A BRIEF HISTORY OF APM ALERTING
CA TECHNOLOGIES DIFFERENTIAL ANALYSIS
MODELS ARE MADE TO BE BROKEN
DATA-DRIVEN DIVE INTO AUTOMATIC ALERTING MODELS
SHEWHART SAVES THE DAY
Keeping my promise!
§ I will begin this session by making a detailed, data-centric case for why CA Technologies' new differential analysis feature is a superior, market-leading approach to automatic alerting.
§ No, I will not then pull a rabbit out of a hat. ’Cuz this ain’t magic, people … even if it looks like magic.
§ “Any sufficiently advanced technology is indistinguishable from magic.” (Arthur C. Clarke)
What was CA’s last answer?
§ In the early 90s, Wily implemented Holt’s Linear Exponential Smoothing (HLES) to calculate baselines for metrics.
§ Baselines were fooled by regular production events, many of which were more about regular patterns in load than about maintenance events. Seasonality debuts to address this.
§ This leads to rules, and rules engines, to address the edge cases that seasonality does not cover (e.g. “+3 std dev from baseline” to deaden the sensitivity of triggers).
And what are our competitors doing?
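For readers who have not met Holt's linear exponential smoothing, the following is a minimal sketch of the textbook method, which tracks a smoothed level and a smoothed trend and forecasts the next point as their sum. It is not the Wily or CA APM implementation; the function name `holt_baseline` and the smoothing factors `alpha` and `beta` are assumptions chosen for illustration.

```python
def holt_baseline(values, alpha=0.5, beta=0.3):
    """One-step-ahead forecasts from Holt's linear exponential smoothing.

    alpha (level) and beta (trend) are illustrative smoothing factors,
    not values taken from any CA product.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    level = values[0]
    trend = values[1] - values[0]
    forecasts = [values[0]]  # no history exists for the first point, so echo it
    for x in values[1:]:
        forecasts.append(level + trend)  # forecast made before seeing x
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return forecasts


# Hypothetical response times in milliseconds.
baseline = holt_baseline([500, 510, 505, 520, 515])
```

A rule such as “+3 std dev from baseline” would then compare each new measurement against the corresponding forecast.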
What’s the problem with the state-of-the-art?
§ As the following slides will explain, seasonal baselines miss problems that you don’t want to miss.
§ Inevitably, they also report too often.
§ When they do, you have to write rules to resolve the issue with your issues.
§ Now you’ve failed to find the automatic alerting grail.
§ It may actually be more efficient to go back to writing static thresholds for your key components.
Or, a good reason for teaching you some interesting math.
[Figure: chart of Average Response Time (y-axis 440 to 620) with +1 StdDev, +2 StdDev and +3 StdDev bands]
This is a stable application response time, with bands of standard deviation.
Most baselines are fancy forms of standard deviation that take into account things like seasonality.
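The bands in a chart like this one can be sketched in a few lines. This is an illustrative calculation only, assuming a simple population standard deviation over a fixed sample with no seasonality; the function name `stddev_bands` and the sample values are hypothetical.

```python
from statistics import mean, pstdev


def stddev_bands(samples, max_k=3):
    """Return {k: mean + k * stddev} for k = 1..max_k."""
    mu = mean(samples)
    sigma = pstdev(samples)  # population standard deviation of the sample
    return {k: mu + k * sigma for k in range(1, max_k + 1)}


# Hypothetical response times in milliseconds.
bands = stddev_bands([500, 520, 480, 500])
```

A seasonal baseline would compute the same bands per time window (for example, per hour of the week) rather than over the whole history.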
[Figure: response time chart with a single spike, y-axis 0 to 1800]
An outlier…
What to do? If it’s in a seasonal window, it has to be a bigger outlier, but the problem of “To Alert or Not to Alert” remains the same.
You must either send an alert for this single spike or write a rule to say that the spike has to be “so big” before you care (which is usually done with a manually written rule like “> 3 stddev”).
“Mr. Ops won’t even put down his sandwich for a single failed transaction.”
[Figure: response time chart with a sustained spike, y-axis 0 to 2500]
What about the situation of a sustained spike?
Supposedly, seasonality cancels out the normal operations. But how many of you have apps in which a single user logs in and starts running expensive (e.g. reporting) transactions?
The traditional approach again has to decide: when to alert? If app users log in at irregular intervals and perform this type of transaction, do you trigger alerts on their normal (non-seasonal) activity?
“cat alerts > /dev/null”.
But how long do you wait then?
Once again, a decision you have to make and configure for each of your apps.
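The “how long do you wait” decision can be made concrete with a small sketch of a hand-configured rule: alert only once the threshold has been exceeded for a set number of consecutive measurements. The function name and both parameters are hypothetical; the point is that `threshold` and `min_consecutive` are exactly the per-application knobs the traditional approach asks you to tune.

```python
def first_sustained_breach(samples, threshold, min_consecutive):
    """Index at which an alert would fire, or None if it never fires.

    The alert fires once `min_consecutive` samples in a row exceed `threshold`.
    """
    run = 0
    for i, value in enumerate(samples):
        run = run + 1 if value > threshold else 0
        if run >= min_consecutive:
            return i
    return None


# Hypothetical response times in milliseconds.
samples = [500, 2000, 500, 2000, 2000, 2000, 500]
fires_at = first_sustained_breach(samples, threshold=1500, min_consecutive=3)
```

Setting `min_consecutive` to 1 alerts on every lone spike, while a larger value delays the alert for a real incident by the same number of intervals.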
[Figure: response time chart with a sustained change in response time, y-axis 0 to 3000]
Better hope that sustained, normal changes in response time are seasonal when they happen…
If not, you must write rules!
And if you write rules, you might accidentally deaden your sensitivity to actual problems.
Dang gum!
Our Hero: Walter Shewhart
§ In the 1920s, Walter Shewhart et al. worked on quality control for buried telephone lines.
§ Shewhart observed that while every line displays variation, some lines occasionally display uncontrolled variation. Like a seismometer, there are normal fluctuations and then there are earthquakes.
§ Shewhart invented control charts, the basis of the Western Electric Rules for identifying uncontrolled variance, earning himself the title: “Father of Statistical Quality Control.”
Translation please!
§ Shewhart taught us to favor real-time observation over mathematical models of a signal’s behavior.
§ We still baseline the signal, but the Western Electric Rules define the situations in which the signal should be considered to be in a bad state, rather than relying on a simple delta from the baseline model.
§ Shewhart’s method of characterizing the quality of a signal mirrors the behavior of a human observer.
Trust us, you will understand this math.
Shewhart’s Western Electric Rules
Straight off Wikipedia…
The canonical Western Electric Rules use plain, old standard deviation as their real-time measure. Each rule identifies a pattern in the signal:
Rule #1 – A statistically interesting outlier.
Rule #2 – Two somewhat interesting outliers out of three measurements.
Rule #3 – Four smaller outliers out of five measurements.
Rule #4 – Many small outliers over many measurements.
This much we flat out stole from math history!
See Comments to the right
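The four rules above can be sketched as follows, using the commonly stated zone limits: one point beyond 3 standard deviations, two of three beyond 2, four of five beyond 1, and a run of points on one side of the centerline. This is an illustration of the published rules, not CA APM code. The function name is hypothetical, and the run length for Rule #4 varies between statements of the rules, so the default of 8 is an assumption.

```python
def western_electric_breaches(window, mean, sigma, run_length=8):
    """Return the set of rule numbers breached by the latest points in `window`.

    `mean` and `sigma` are the baseline centerline and standard deviation.
    `run_length` for Rule #4 is an assumption (8 and 9 are both in common use).
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = [(x - mean) / sigma for x in window]

    def same_side_count(points, limit):
        # Largest number of points beyond the limit on one side of the centerline.
        above = sum(1 for p in points if p > limit)
        below = sum(1 for p in points if p < -limit)
        return max(above, below)

    breaches = set()
    if z and abs(z[-1]) > 3:                                    # Rule #1
        breaches.add(1)
    if len(z) >= 3 and same_side_count(z[-3:], 2) >= 2:         # Rule #2
        breaches.add(2)
    if len(z) >= 5 and same_side_count(z[-5:], 1) >= 4:         # Rule #3
        breaches.add(3)
    if len(z) >= run_length and same_side_count(z[-run_length:], 0) >= run_length:
        breaches.add(4)                                         # Rule #4
    return breaches
```

Each rule looks only at the most recent few measurements, which is what makes the approach a real-time observation rather than a prediction from a model.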
CA Technologies Innovation
§ Western Electric Rules are brilliant for real-time analysis of both telephone signals and application signals.
§ A single rule breach, however, is too dull a blade for slicing through this tough problem.
§ By assigning weights to each rule breach, keeping a running sum and aging out old breaches, we can produce a single, normalized value for variance intensity.
CA APM 10 has several patents pending.
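One way to picture the weighted running sum is the sketch below. It is not CA's patented algorithm: the weights, the aging window, the 0-to-1 normalization and the class name `VarianceIntensity` are all assumptions made up for illustration. Breaches are scored per interval, old intervals fall out of a fixed-size window, and the sum is divided by the worst possible score.

```python
from collections import deque

# Hypothetical weights: the rarer, more severe rule breaches count for more.
RULE_WEIGHTS = {1: 4, 2: 3, 3: 2, 4: 1}


class VarianceIntensity:
    """Running, aged-out sum of weighted rule breaches, normalized to 0..1."""

    def __init__(self, max_age=10, weights=RULE_WEIGHTS):
        self.weights = weights
        self.recent = deque(maxlen=max_age)  # scores for the last max_age intervals
        self.max_score = sum(weights.values())

    def update(self, breached_rules):
        """Record the rules breached in this interval and return the intensity."""
        score = sum(self.weights[r] for r in breached_rules)
        self.recent.append(score)  # the oldest interval ages out automatically
        # Worst case: every rule breached in every interval of the window.
        return sum(self.recent) / (self.max_score * self.recent.maxlen)


intensity = VarianceIntensity(max_age=4)
first = intensity.update({1})      # a lone outlier barely registers
second = intensity.update({1, 2})  # repeated breaches push the value up
```

With a scheme like this, a single outlier yields a small value that fades as the window moves on, while sustained instability accumulates into a large one.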
In a busy system, there are always varying levels of stability.
In this picture, can you tell which signals are least stable?
This signal experienced an outlier, but it didn’t turn blue.
A single rule breach isn’t enough for “Pete to put down his sandwich.”
In this case, the change in stability was sustained over about forty minutes.
What happened? Click to find out…
This application experienced a remarkable degradation in performance over a forty-minute period.
Both the old approach and our new one would alert here, but CA’s alert would happen early in the event and trigger trace collection automatically.
The old approach might not have let an operator know for thirty minutes or more, depending on the rules they had configured.
Triage is a battlefield medicine term: where are the wounded soldiers?
CA’s approach means identifying chronic problems as well as acute ones. Which of these lines are more stable but still have chronic stability events at regular intervals?
Differential Analysis Default Configuration
CA TECHNOLOGIES TEAM PEGASUS
Clockwise from left:
Prashant Pathak, Mark LoSacco, Weini Yu, Prasanna Ram Venkatachalam, Naresh Chippada, Carey Feldstein, Paul Callahan and Sai Krishna Rayanapati.
[not pictured: me]
Recommended Sessions
SESSION # | TITLE | DATE/TIME
DO5X189S | How to Achieve a Customer-Centric View in an Omni-Channel World | 11/18/2015 at 1:00 pm
DO5X194S | Monitor Microservices, Containers, Cloud Foundry and Node with CA Application Performance Management | 11/18/2015 at 4:30 pm
DO5X193S | Customize CA Application Performance Management with Tips for Using the CA Application Performance Management Open APIs | 11/19/2015 at 4:30 pm
Must See Demos
§ Application Performance Management and DevOps, featuring APM use in preproduction scenarios (Application Performance Management, Theater 5)
§ Application Performance Management, Modern Monitoring, featuring the new APM Team Center (Application Performance Management, Theater 5)
§ Ensuring a “5 star” mobile app experience with CA Mobile App Analytics (Mobile App Analytics, Theater 5)
§ Unified Monitoring: APM Integrations including UIM (Application Performance Management, Theater 5)
Follow On Conversations At…
§ Smart Bar: Application Performance Management, Theater 5
§ Tech Talks: Application Performance Management, Theater 5
Question and Answer
DAVID.B.MARTIN@CA.COM
