This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
4. Intelligent Monitoring
Motivation:
Only point-in-time monitoring available
Decrease time to repair incidents
Proactive monitoring
Realistic view from live environment
5. Intelligent Monitoring
Motivation:
Learn (identify patterns)
Automation
Store historical data with no loss
Improve credibility and Situational Awareness
7. Intelligent Monitoring
Where are we?:
Lots of information (1200 servers with more than 14000 monitors)
– more than 40000 graphs being plotted
Lots of tools for monitoring running (SME, IPMonitor, Cricket,
SiteScope, SiteSeer, Logs)
Difficulties with specific customizations, performance and cost
Alarms have little credibility (too many emails), though much better than
before.
9. Intelligent Monitoring
Where are we going?:
Use of events. E.g.: Appenders for log frameworks to integrate
information from applications
Knowledge to anticipate undesired situations
Unified interface for monitoring
Root cause detection
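The appender idea above can be sketched in Python, where the closest equivalent of a log framework appender is a `logging.Handler`. This is only an illustration: `EventBusHandler`, the `publish` callable, the `LOG` event type and the field names are assumptions, not part of the presented system.

```python
import json
import logging


class EventBusHandler(logging.Handler):
    """Turns each log record into an event dict and hands it to `publish`."""

    def __init__(self, publish, event_type="LOG"):
        super().__init__()
        self.publish = publish        # hypothetical callable that sends to the bus
        self.event_type = event_type  # hypothetical event type name

    def emit(self, record):
        event = {
            "eventtype": self.event_type,
            "body": json.dumps({
                "logger": record.name,
                "level": record.levelname,
                "message": record.getMessage(),
            }),
        }
        self.publish(event)


sent = []                              # stand-in for the real bus
log = logging.getLogger("checkout")
log.propagate = False                  # keep the example output quiet
log.setLevel(logging.INFO)
log.addHandler(EventBusHandler(sent.append))
log.warning("payment gateway timeout after %d ms", 3000)
```

The application keeps logging as usual; only the handler knows that records also become events.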
11. Intelligent Monitoring
Action Plan:
Unify the monitoring tools with Nagios (scalability and integration)
Integrate Nagios with correlation system using NEB (Nagios Event
Broker)
available at:
code.google.com/p/neb2activemq
Map events and systems to correlate
(manual and analytic task)
12. Intelligent Monitoring
Summary:
Motivation
Where are we?
Where are we going?
Action Plan
Event Correlation
Overview and system architecture
Event Bus
Correlation technique
Correlation engine
Visualization
Machine Learning
Project
13. Overview and system architecture
Modular and event-driven architecture
[Diagram: Collector, Correlation Engine, Machine Learning and Visualization modules connected through the Event Bus]
14. Overview and system architecture
What is the system architecture?
Unique bus for message exchange
Modules are separate operating system processes and can run on
different machines
Modules can publish / subscribe to queues / topics on the bus
Why an Event Driven Architecture?
Loosely coupled and distributed
Less intrusive for monitored systems
Modules are independent
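A toy in-memory bus can illustrate the difference between the two channel types the modules use. It assumes the usual JMS semantics: a queue delivers each message to one consumer, a topic delivers it to every subscriber. `ToyBus` and the channel names are hypothetical; the real system uses a message broker, not this class.

```python
from collections import defaultdict
from itertools import count


class ToyBus:
    """In-memory stand-in for a message bus with queues and topics."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # (kind, name) -> callbacks
        self.turn = defaultdict(count)        # round-robin counter per queue

    def subscribe(self, kind, name, callback):
        self.subscribers[(kind, name)].append(callback)

    def publish(self, kind, name, message):
        targets = self.subscribers[(kind, name)]
        if not targets:
            return
        if kind == "topic":
            # Topic: every subscriber receives the message.
            for callback in targets:
                callback(message)
        else:
            # Queue: exactly one consumer receives it, in round-robin order.
            index = next(self.turn[name]) % len(targets)
            targets[index](message)


bus = ToyBus()
engine_a, engine_b, dashboard, learner = [], [], [], []
bus.subscribe("queue", "events", engine_a.append)
bus.subscribe("queue", "events", engine_b.append)
bus.subscribe("topic", "alerts", dashboard.append)
bus.subscribe("topic", "alerts", learner.append)
for i in range(4):
    bus.publish("queue", "events", {"id": i})
bus.publish("topic", "alerts", {"alert": "db"})
```

Because publishers only know channel names, a module can be added, moved to another machine, or restarted without touching the others.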
15. Event bus
Open source project
Chose Apache ActiveMQ:
Stable
Performance
Active community
Connectivity
JMS
STOMP
REST
XMPP (...)
16. Event Bus
Message format
JSON (not XML)
Simplicity
Structure
Header: channel type (queue or topic) and event type
Body: data
$ curl -d "type=queue&body={'idle'=70, 'sys'=20,
'usr'=10, 'host'='ws122' }&eventtype=CPU"
http://barramento/message/events;
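The same message can be assembled programmatically. The sketch below reuses the `type`, `eventtype` and `body` field names from the curl example and only builds the form-encoded payload; it does not send anything. `build_message` is a hypothetical helper, and it writes the body as strict JSON (with `:` separators), which is an assumption about what the bus accepts.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_message(channel_type, event_type, data):
    """Return the form-encoded payload: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,        # header: channel type
        "eventtype": event_type,     # header: event type
        "body": json.dumps(data),    # body: the event data as JSON
    })


payload = build_message(
    "queue", "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}
)
fields = parse_qs(payload)           # decode again to inspect the result
```

`urlencode` takes care of escaping the braces, quotes and spaces that the JSON body contains.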
17. Correlation Technique
CEP (Complex Event Processing)
Technology that enables processing multiple events in real time with
the goal of identifying meaningful events
Based on rules or queries (“SQL-like”)
Queries created at run time
History
In 1995, Professor David Luckham from Stanford, working on the Rapide
project, coined the term CEP
Database research topic: Data Stream Management Systems (DSMS)
18. Correlation technique
“upside down database”
[Diagram: in a traditional database, a query is processed against persistent relations and returns an answer; in a data stream system, data streams flow through query processing held in memory and produce a continuous answer]
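The "upside down" idea can be sketched as a standing query: the query stays in memory, data flows through it, and every arriving event produces an updated answer. This is a minimal Python illustration of a time-window average, not how any particular CEP engine is implemented; `SlidingAverage` and the `idle` field are hypothetical.

```python
from collections import deque


class SlidingAverage:
    """Standing query: average of `field` over the last `window` seconds."""

    def __init__(self, field, window):
        self.field = field
        self.window = window
        self.events = deque()  # (timestamp, value) pairs kept in memory
        self.total = 0.0

    def on_event(self, timestamp, event):
        value = event[self.field]
        self.events.append((timestamp, value))
        self.total += value
        # Expire events that have slid out of the time window.
        while self.events[0][0] <= timestamp - self.window:
            _, old = self.events.popleft()
            self.total -= old
        return self.total / len(self.events)  # the continuous answer


query = SlidingAverage("idle", window=180)
stream = [(0, 70), (60, 50), (120, 60), (200, 40)]
answers = [query.on_event(t, {"idle": v}) for t, v in stream]
```

Only the events inside the window are kept, so memory use depends on the window size and event rate rather than on the history of the stream.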
19. Correlation Technique
Marketing
Trend(Buzz)
CEP market estimated at 460 million dollars by 2010 (source: IEEE
Computer Society – April 2009)
Useful wherever there are data streams and a need to extract
information from them in real time
Financial Market
Logistics processes (RFID)
Airport control
ICUs
Datacenters
21. Correlation Technique
Open Source Players
Academic projects:
STREAM – Stanford – 2003 (officially deprecated)
TelegraphCQ – Berkeley - 2003
Based on PostgreSQL 7.3.2
No activity
Cayuga – Cornell
From the industry:
Esper, a Codehaus project, complete in terms of features
Compact and flexible syntax
Excellent documentation
Performance
Our choice!
22. Correlation Engine
Application
If sessions rose 10% in the
last 3 min, the average
server CPU did not rise
5%, and MySQL slow queries
are above 10, then there is a
database bottleneck causing
users to queue
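The rule above can be written as a small predicate over the samples of the last 3 minutes. This is a plain Python sketch of the logic only, not Esper syntax; the field names `sessions`, `avg_cpu` and `slow_queries` are hypothetical, and the thresholds are the ones stated in the rule.

```python
def pct_change(first, last):
    """Percentage change from `first` to `last` (assumes `first` is non-zero)."""
    return (last - first) / first * 100.0


def database_bottleneck(window):
    """`window` holds the samples from the last 3 minutes, oldest first."""
    oldest, newest = window[0], window[-1]
    sessions_up = pct_change(oldest["sessions"], newest["sessions"]) >= 10
    cpu_flat = pct_change(oldest["avg_cpu"], newest["avg_cpu"]) < 5
    slow_queries = newest["slow_queries"] > 10
    return sessions_up and cpu_flat and slow_queries


suspicious = [
    {"sessions": 1000, "avg_cpu": 40.0, "slow_queries": 2},
    {"sessions": 1150, "avg_cpu": 41.0, "slow_queries": 14},
]
busy_but_healthy = [
    {"sessions": 1000, "avg_cpu": 40.0, "slow_queries": 2},
    {"sessions": 1150, "avg_cpu": 48.0, "slow_queries": 14},
]
alert = database_bottleneck(suspicious)
no_alert = database_bottleneck(busy_but_healthy)
```

In the second window the CPU rises along with the sessions, so the servers are doing more work and the rule does not fire.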
32. Machine Learning
Chose unsupervised, incremental algorithms
Incremental PCA
Transforms a number of possibly correlated variables into a smaller
number of uncorrelated variables, the principal components
A change in the principal components means a broken correlation, or
anomaly
Can be used for data compression
Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)
Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf
Implementation had two main challenges: measurements with missing values
and different scales
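A deliberately small, two-metric sketch can show the idea. It is not the authors' implementation and not the algorithm from the cited paper. It keeps running means and co-moments (updated one sample at a time), standardizes both metrics so their scales do not matter, and skips samples with a missing value, which is the simplest possible policy. For two standardized variables with correlation r, the principal directions are (1, 1)/√2 and (1, −1)/√2, so the projection onto the second component can be computed in closed form. `TwoMetricPCA` and the sample data are hypothetical.

```python
import math


class TwoMetricPCA:
    """Running statistics for two metrics; PCA on their correlation matrix."""

    def __init__(self):
        self.n = 0
        self.mean_x = self.mean_y = 0.0
        self.sxx = self.syy = self.sxy = 0.0  # sums of products of deviations

    def update(self, x, y):
        if x is None or y is None:
            return  # missing value: skip the sample
        self.n += 1
        dx = x - self.mean_x
        dy = y - self.mean_y
        self.mean_x += dx / self.n
        self.mean_y += dy / self.n
        # Welford-style update: old deviation times new deviation.
        self.sxx += dx * (x - self.mean_x)
        self.syy += dy * (y - self.mean_y)
        self.sxy += dx * (y - self.mean_y)

    def second_component_score(self, x, y):
        """Projection of a standardized sample onto the minor component."""
        sd_x = math.sqrt(self.sxx / (self.n - 1))
        sd_y = math.sqrt(self.syy / (self.n - 1))
        r = self.sxy / math.sqrt(self.sxx * self.syy)
        zx = (x - self.mean_x) / sd_x
        zy = (y - self.mean_y) / sd_y
        sign = 1.0 if r >= 0 else -1.0
        return (zx - sign * zy) / math.sqrt(2)


pca = TwoMetricPCA()
for i in range(50):
    pca.update(float(i), 1000.0 * i + (i % 3 - 1))  # strongly correlated
pca.update(None, 123.0)                             # ignored: missing value
normal = pca.second_component_score(25.0, 25000.0)
broken = pca.second_component_score(25.0, 0.0)
```

While the two metrics move together the score stays close to zero; a sample that breaks the correlation produces a large score, which a sensitivity threshold could turn into an anomaly.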
35. Machine Learning
[Plot: second principal component with the sensitivity setting, showing three anomalies]
36. Project
Status
All functionality developed
Algorithms being validated through tests with
RRDs and meetings with the operations team
Performance tests ongoing
System in the live environment with reduced scope
37. Project at Globo.com – Next challenges
Scale
Event “sharding”
Rule balancing
Cache
Optimize the algorithm
Adaptive control of memory and sensitivity parameters
Add a supervised layer
Other algorithms to cooperate
40. Questions
Contacts
Denis A. Vieira Jr
denis@corp.globo.com (www.globo.com)
Ricardo Clemente
ricardo@intelie.com.br (www.intelie.com.br)
Globo.com stand
This afternoon
Raise your hand!