The road to monitoring Nirvana
June 2017
Pedro Araújo
Who am I?
Studied Computer Engineering
Did web development for a couple of years
Moved to systems administration for a couple of years
Had a run at build and automation engineering
Landed in SRE, <3 it
135 million daily transactions
4.7 billion daily API calls most weeks (55k/s, 100k/s peak)
2.5 terabytes of daily log data output
250,000 time series data points per second
14k Nagios checks
What this talk is not going to be about
What is monitoring?
Different things to different people
Everyone starts with (Nagios-style) checks
Also, black-box monitoring
(from: https://www.novell.com/coolsolutions/feature/16723.html)
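A Nagios-style check boils down to a small program that inspects one thing and reports a status code. The sketch below is a minimal illustration, not taken from the talk: the thresholds and the function names `classify` and `check_disk` are assumptions, while the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_pct, warn=80.0, crit=90.0):
    """Map a usage percentage to a (status code, label) pair."""
    if used_pct >= crit:
        return CRITICAL, "CRITICAL"
    if used_pct >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit code, one-line status message) for a mount point."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself failed, so the state is unknown rather than bad.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    used_pct = 100.0 * usage.used / usage.total
    code, label = classify(used_pct, warn, crit)
    return code, f"DISK {label} - {path} is {used_pct:.1f}% used"

# A real plugin would print the message and exit with the code,
# e.g. print(message) followed by sys.exit(code).
```

Note that the check only keeps the latest verdict; the measured value behind it is discarded once the status is reported.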
Logs give you initial insights, at the cost of brittle and heavy scripting
RSS_URI="/rss"
LOG_FILE="/var/log/httpd/access_log"
LOG_DATE_FORMAT="%d/%b/%Y"
DATE="-1 day"
LOG_FDATE=`date -d "$DATE" "+${LOG_DATE_FORMAT}"`
# Unique IPs requesting RSS, except those reporting "subscribers":
IPSUBS=$(
  fgrep "$LOG_FDATE" "$LOG_FILE" \
    | fgrep " $RSS_URI" \
    | egrep -v '[0-9]+ subscribers' \
    | cut -d' ' -f 1 \
    | sort \
    | uniq \
    | wc -l
)
# Other user-agents reporting "subscribers", for which we'll use the entire
# user-agent string for uniqueness:
OTHERSUBS=$(
  fgrep "$LOG_FDATE" "$LOG_FILE" \
    | fgrep " $RSS_URI" \
    | fgrep -v 'subscribers; feed-id=' \
    | egrep '[0-9]+ subscribers' \
    | egrep -o '"[^"]+"$' \
    | sort -t\( -k2 -sr \
    | awk '!x[$1]++' \
    | egrep -o '[0-9]+ subscribers' \
    | awk '{s+=$1} END {print s}'
)
(from: https://gist.github.com/marcoarment/3783146 (abbr.))
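For comparison, the first pipeline (unique IPs) can be sketched in Python. This is an illustrative rewrite under the same assumptions as the shell version (space-separated access log, client IP in the first field); the function name `count_unique_rss_ips` is made up for this example. It is just as dependent on the log format, which is the point of the slide.

```python
import re

SUBSCRIBERS = re.compile(r"[0-9]+ subscribers")


def count_unique_rss_ips(lines, log_date, rss_uri="/rss"):
    """Count distinct client IPs requesting rss_uri on log_date.

    Mirrors the IPSUBS pipeline: lines whose user-agent reports
    "N subscribers" are skipped, and the first space-separated
    field is taken as the client IP.
    """
    ips = set()
    for line in lines:
        if log_date not in line or f" {rss_uri}" not in line:
            continue
        if SUBSCRIBERS.search(line):
            continue
        ips.add(line.split(" ", 1)[0])
    return len(ips)
```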
Client-side analytics
Online
Crash reporting
Synthetic monitoring
RUM
Offline
Detailed logging
Round-Robin Databases
(from: https://megalytic.com/blog/tips-for-segmenting-stats-by-geography)
(from: https://piwik.org/docs/piwik-tour/#toc-dashboard-widgets)
Time series are like check results saved continuously
But we can do so much more
Named value at some time.
Metric identity
name
dimensions
Metric value
Timestamp
os.filesystem.size 1486469296 961130496 mount=/ type=Used
os.filesystem.size 1486469296 8903143424 mount=/ type=Free
os.filesystem.size 1486469296 1143103488 mount=/var type=Used
os.filesystem.size 1486469296 249068044288 mount=/var type=Free
os.filesystem.size 1486469296 0 mount=/dev/shm type=Used
os.filesystem.size 1486469296 50682404864 mount=/dev/shm type=Free
os.filesystem.size-inodes 1486469296 18862 mount=/ type=Used
os.filesystem.size-inodes 1486469296 2602578 mount=/ type=Free
os.filesystem.size-inodes 1486469296 14518 mount=/var type=Used
os.filesystem.size-inodes 1486469296 66504522 mount=/var type=Free
os.filesystem.size-inodes 1486469296 1 mount=/dev/shm type=Used
os.filesystem.size-inodes 1486469296 12373633 mount=/dev/shm type=Free
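The samples above follow a "name timestamp value key=value ..." layout. A minimal sketch of turning one such line into the identity/value/timestamp model from the previous slides might look like this; the `DataPoint` type and the helper names are assumptions made for illustration.

```python
from typing import NamedTuple


class DataPoint(NamedTuple):
    name: str
    timestamp: int
    value: float
    dimensions: dict


def parse_datapoint(line):
    """Parse 'name timestamp value key=value ...' into a DataPoint."""
    name, timestamp, value, *pairs = line.split()
    dimensions = dict(pair.split("=", 1) for pair in pairs)
    return DataPoint(name, int(timestamp), float(value), dimensions)


def series_key(point):
    """Identity of a series: metric name plus its sorted dimensions."""
    return (point.name, tuple(sorted(point.dimensions.items())))
```

Two samples belong to the same time series exactly when their `series_key` matches, so `mount=/ type=Used` and `mount=/ type=Free` are separate series under one metric name.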
Common datapoint types
Counters
Counter example
Counter rated example
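A counter only ever goes up, so it is usually graphed as a rate. The sketch below shows one way to derive per-second rates from raw counter samples; the function name and the reset handling (treating any drop as a restart from zero) are assumptions for illustration, not a description of a specific tool.

```python
def counter_rate(samples):
    """Turn (timestamp, counter_value) samples into per-second rates.

    A drop in value is treated as a counter reset (for example a process
    restart), so the new value is counted as the increase since the reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order samples
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append((t1, increase / elapsed))
    return rates
```

For example, samples of 100, 150 and 250 taken ten seconds apart give rates of 5 and 10 per second.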
Common datapoint types
Gauges
Gauge example
Advanced datapoint types
Histogram
Example histogram sample
# HELP prometheus_local_storage_series_chunks_persisted The number of chunks persisted per series.
# TYPE prometheus_local_storage_series_chunks_persisted histogram
prometheus_local_storage_series_chunks_persisted_bucket{le="1"} 3.205911e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="2"} 3.652375e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="4"} 4.405614e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="8"} 5.66866e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="16"} 8.226382e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="32"} 8.73615e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="64"} 8.770525e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="128"} 8.770525e+06
prometheus_local_storage_series_chunks_persisted_bucket{le="+Inf"} 8.770525e+06
prometheus_local_storage_series_chunks_persisted_sum 5.5495433e+07
prometheus_local_storage_series_chunks_persisted_count 8.770525e+06
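The buckets in this sample are cumulative: each `le` bucket counts every observation less than or equal to its bound. A rough sketch of estimating a quantile from such buckets follows. It uses linear interpolation within the matching bucket, which is an assumption about how values are distributed inside a bucket; the function name and the `SAMPLE_*` constants are introduced here only to carry the numbers from the slide.

```python
import math

# Cumulative (upper bound, count) pairs copied from the sample above.
SAMPLE_BUCKETS = [
    (1, 3.205911e+06), (2, 3.652375e+06), (4, 4.405614e+06),
    (8, 5.66866e+06), (16, 8.226382e+06), (32, 8.73615e+06),
    (64, 8.770525e+06), (128, 8.770525e+06), (math.inf, 8.770525e+06),
]
SAMPLE_SUM = 5.5495433e+07
SAMPLE_COUNT = 8.770525e+06


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket containing the target rank,
    assuming observations are spread evenly within each bucket and the
    lowest bucket starts at 0.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # cannot interpolate into +Inf
            if count == lower_count:
                return upper_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


median_estimate = histogram_quantile(0.5, SAMPLE_BUCKETS)
mean = SAMPLE_SUM / SAMPLE_COUNT
```

On the sample data the median estimate lands just under 4 chunks per series while the mean is about 6.3, a gap that a single averaged gauge would hide.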
Histogram example
White-box monitoring
Detailed insight via native instrumentation
You can roll your own instrumentation.
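import time
from contextlib import contextmanager

# increment(), queue, db and compute_route() belong to the surrounding
# application and are not shown on the slide.
@contextmanager
def op(what):
    # Time the wrapped block and add the elapsed seconds to a counter
    # tagged with the operation name.
    start = time.time()
    yield
    increment('hitcount.total_s',
              value=(time.time() - start),
              tags=["op:" + what])

while True:
    with op('receive'):
        req = queue.pop()
    with op('compute_route'):
        route = compute_route(req)
    with op('update'):
        db.execute('''
            UPDATE hitcount SET hits = hits + 1 WHERE route = ?
            ''', (route, ))
    with op('finish'):
        req.finish()
(from: https://honeycomb.io/blog/2017/01/instrumentation-measuring-capacity-through-utilization/)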
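The slide's snippet relies on an `increment` function that is not shown. To make the idea runnable on its own, here is a minimal sketch with a hypothetical in-memory store standing in for a metrics client; the `totals` and `calls` dictionaries, the `clock` parameter and the `utilization` helper are all assumptions made for this example.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory store: total seconds and call count per operation.
totals = defaultdict(float)
calls = defaultdict(int)


@contextmanager
def op(what, clock=time.perf_counter):
    """Accumulate wall-clock time spent in the wrapped block under `what`."""
    start = clock()
    try:
        yield
    finally:
        # Record the timing even if the block raised.
        totals[what] += clock() - start
        calls[what] += 1


def utilization(window_seconds):
    """Fraction of the window spent in each operation."""
    return {name: spent / window_seconds for name, spent in totals.items()}
```

Summing time per operation and dividing by the observation window gives a utilization figure per step, which is the measurement the cited post is after.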
Aspects help instrumentation.
@Controller
public class MyController {
  @RequestMapping("/")
  @TimeMethod(name = "app_duration_seconds", help = "Some helpful info here")
  public Object handleMain() {
    // Do something
  }
}

# Assuming the Prometheus Python client, whose API this snippet matches.
from prometheus_client import Counter, Histogram

c = Counter('request_failure_total', 'Description of counter')
h = Histogram('request_latency_seconds', 'Description of histogram')

@c.count_exceptions()
@h.time()
def businessFunction():
    # Do something
    pass
Collectors
or, how to get all that interesting data
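As a rough illustration of what a collector does, the sketch below reads filesystem usage with the standard library and renders it in the same "name timestamp value key=value" layout as the earlier `os.filesystem.size` samples. The function names are assumptions, and a real collector would also ship the lines to a time series backend on a schedule.

```python
import shutil
import time


def format_datapoint(name, timestamp, value, **dimensions):
    """Render one sample as 'name timestamp value key=value ...'."""
    tags = " ".join(f"{key}={val}" for key, val in dimensions.items())
    return f"{name} {timestamp} {value} {tags}".rstrip()


def collect_filesystem(mounts, now=None):
    """Yield used/free byte gauges for each mount point."""
    timestamp = int(time.time()) if now is None else now
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        yield format_datapoint("os.filesystem.size", timestamp, usage.used,
                               mount=mount, type="Used")
        yield format_datapoint("os.filesystem.size", timestamp, usage.free,
                               mount=mount, type="Free")
```

Calling `list(collect_filesystem(["/", "/var"]))` would return four lines, two per mount point.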
Event-based monitoring
grep'ing logs across hosts is hard
I CAN HAZ LOG AGGREGATION
(from: https://dzone.com/articles/getting-started-splunk)
(from: http://blog.takipi.com/splunk-vs-elk-the-log-management-tools-decision-making-guide/)
Micro-services - perf debugging is hard
Distributed tracing to the rescue
(from: http://opentracing.io/documentation/)
(from: http://opentracing.io/documentation/)
(from: http://jaeger.readthedocs.io/en/latest/#trace-detail-view)
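The core data model behind distributed tracing is small: every unit of work is a span, spans share a trace id, and each span points at its parent. The sketch below is a toy model of that idea only. It is not the OpenTracing or Jaeger API, and the `Span` class and `start_trace` helper are names invented for this example.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    operation: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = field(default_factory=time.time)
    end: Optional[float] = None

    def child(self, operation):
        """Start a span in the same trace, caused by this span."""
        return Span(operation, self.trace_id, parent_id=self.span_id)

    def finish(self):
        """Mark the span as complete."""
        self.end = time.time()


def start_trace(operation):
    """Start a root span with a fresh trace id."""
    return Span(operation, trace_id=uuid.uuid4().hex)
```

When a service calls another, it passes the trace id and its own span id along with the request, so the callee can create child spans; a trace view then rebuilds the tree from the parent links.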
Advanced visualisations
(from: https://www.circonus.com/2012/09/understanding-data-with-histograms/)
(from: http://www.brendangregg.com/frequencytrails.html)
Anomaly detection
(from: https://eng.uber.com/argos/)
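To give a feel for the simplest form of anomaly detection on a time series, here is a sketch that flags points far from the mean of a trailing window. This is a basic z-score rule chosen for illustration, not the method used by the system cited above; the window size and threshold are arbitrary assumptions.

```python
from statistics import mean, stdev


def zscore_anomalies(values, window=5, threshold=3.0):
    """Return indexes whose value deviates from the trailing window mean
    by more than `threshold` sample standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma == 0:
            # A perfectly flat history: any change at all stands out.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies
```

A rule like this is easy to reason about but reacts poorly to seasonality and trends, since a spike also inflates the window that follows it.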
Alerting
not covered here
Conclusion
Thank you
Pedro Araújo
https://keybase.io/phcrva