Intelligent Monitoring


This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and an incremental PCA algorithm to detect anomalies.

Published in: Technology, Business

  1. 1. Intelligent Monitoring Denis A. Vieira Jr. Ricardo Clemente
  2. 2. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation
  3. 3. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation
  4. 4. Intelligent Monitoring Motivation: • Only point-in-time monitoring available • Decrease time to repair incidents • Proactive monitoring • Realistic view of the live environment
  5. 5. Intelligent Monitoring Motivation: • Learn (identify patterns) • Automation • Store historical data with no loss • Improve credibility and situational awareness
  6. 6. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation
  7. 7. Intelligent Monitoring Where are we?: • Lots of information (1200 servers with more than 14000 monitors), with more than 40000 graphs being plotted • Many monitoring tools running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, Logs) • Difficulties with specific customizations, performance and cost • Alarms lack credibility (too many emails), although this is much better than before
  8. 8. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation
  9. 9. Intelligent Monitoring Where are we going?: • Use of events, e.g. appenders for log frameworks to integrate information from applications • Knowledge to anticipate undesired situations • Unified interface for monitoring • Root cause detection
  10. 10. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation
  11. 11. Intelligent Monitoring Action Plan: • Unify the monitoring tools with Nagios (scalability and integration) • Integrate Nagios with the correlation system using NEB (Nagios Event Broker); available at: code.google.com/p/neb2activemq • Map events and systems to correlate (a manual and analytical task)
  12. 12. Intelligent Monitoring Summary: • Motivation • Where are we? • Where are we going? • Action Plan • Event Correlation • Overview and system architecture • Event Bus • Correlation technique • Correlation engine • Visualization • Machine Learning • Project
  13. 13. Overview and system architecture • Modular and event-driven architecture [Diagram modules: Collector, Correlation Engine, Event Bus, Machine Learning, Visualization]
  14. 14. Overview and system architecture What is the system architecture? • A single bus for message exchange • Modules are separate operating-system processes and can run on different machines • Modules can publish / subscribe to queues / topics on the bus Why an event-driven architecture? • Loosely coupled and distributed • Less intrusive for monitored systems • Modules are independent
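
To make the publish / subscribe idea on this slide concrete, here is a minimal in-memory sketch in Python. It is not the ActiveMQ-based bus the presentation describes: the class `InMemoryBus` and its methods are hypothetical names, and the only behavior it illustrates is the usual difference between a topic (every subscriber receives the event) and a queue (each event goes to exactly one consumer, here chosen round-robin).

```python
from collections import defaultdict


class InMemoryBus:
    """Toy event bus (hypothetical; stands in for a real message broker)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel name -> handlers
        self._next_consumer = defaultdict(int)  # channel name -> round-robin index

    def subscribe(self, channel, handler):
        """Register a callable that receives events published on `channel`."""
        self._subscribers[channel].append(handler)

    def publish(self, channel_type, channel, event):
        """Deliver `event` and return how many handlers received it."""
        handlers = self._subscribers[channel]
        if not handlers:
            return 0
        if channel_type == "topic":
            # Topic semantics: broadcast to every subscriber.
            for handler in handlers:
                handler(event)
            return len(handlers)
        if channel_type == "queue":
            # Queue semantics: exactly one consumer gets each event.
            index = self._next_consumer[channel] % len(handlers)
            self._next_consumer[channel] += 1
            handlers[index](event)
            return 1
        raise ValueError("channel_type must be 'queue' or 'topic'")
```

Because modules only share the bus and the event format, a collector can publish without knowing which modules (correlation, visualization, machine learning) consume its events.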
  15. 15. Event bus Open source project chosen: Apache ActiveMQ • Stable • Performance • Active community • Connectivity: JMS, STOMP, REST, XMPP (...)
  16. 16. Event Bus Message format • JSON (not XML) • Simplicity • Structure • Header: channel type (queue or topic) and event type • Body: data $ curl -d 'type=queue&body={"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}&eventtype=CPU' http://barramento/message/events
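
The message layout on this slide (a header with channel type and event type, plus a JSON body) can be sketched in Python as follows. The field names `type`, `eventtype` and `body` are taken from the curl example; the helper names `build_message` and `parse_message` are hypothetical, and the sketch only builds and parses the form-encoded payload without sending it anywhere.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_message(channel_type, event_type, data):
    """Return a form-encoded payload with a JSON body (helper name is hypothetical)."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel_type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,      # header: channel type
        "eventtype": event_type,   # header: event type
        "body": json.dumps(data),  # body: event data serialized as JSON
    })


def parse_message(payload):
    """Inverse of build_message: decode the header fields and the JSON body."""
    fields = parse_qs(payload)
    return {
        "type": fields["type"][0],
        "eventtype": fields["eventtype"][0],
        "body": json.loads(fields["body"][0]),
    }
```

URL-encoding the body avoids quoting problems when the JSON contains characters such as `&` or spaces.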
  17. 17. Correlation Technique CEP (Complex Event Processing) • Technology that processes multiple events in real time to identify meaningful events • Based on rules or queries (“SQL-like”) • Queries can be created at run time History • In 1995, Professor David Luckham of Stanford, working on the Rapide project, coined the term CEP • Database research topic: Data Stream Management Systems (DSMS)
  18. 18. Correlation technique “Upside-down database” [Diagram contrasting two models. Database: a query goes through query processing over persistent relations and returns an answer. Stream processing: a data stream goes through query processing in memory and returns a continuous answer.]
  19. 19. Correlation Technique Market trend (buzz) • The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society, April 2009) Useful wherever there are data streams and a need to extract information from them in real time • Financial markets • Logistics processes (RFID) • Airport control • ICUs • Data centers
  20. 20. Correlation Technique Big Players
  21. 21. Correlation Technique Open Source Players Academic projects: • STREAM - Stanford - 2003 (officially deprecated) • TelegraphCQ - Berkeley - 2003 (based on PostgreSQL 7.3.2; no activity) • Cayuga - Cornell From the industry: Esper, a Codehaus project, complete in terms of features • Compact and flexible syntax • Excellent documentation • Performance • Our choice!
  22. 22. Correlation Engine Application If sessions rose by 10% in the last 3 min, the average server CPU did not rise by 5%, and MySQL slow queries are above 10, then there is a database bottleneck causing users to queue
  23. 23. Correlation Engine Application [Timeline diagram: Vip session and Server cpu_usr are each compared between t - 3 min and t; Mysql slow_query is evaluated at t]
  24. 24. Correlation Engine Application SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session, Vip_PAST.session, Mysql.slow_query FROM Server.win:time(1 min) as Server, Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST, Vip.win:time(1 min) as Vip, Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST, Mysql.win:time(1 min) as Mysql HAVING Vip.session > Vip_PAST.session * 1.10 AND avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND Mysql.slow_query > 10
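
For readers unfamiliar with EPL, the three conditions of the HAVING clause above can be read as a plain Python predicate. This is only an illustrative sketch: the function name `database_bottleneck` and its parameters are hypothetical, and the windowing that Esper performs is assumed to have happened already, so the function receives the current and past values directly.

```python
from statistics import mean


def database_bottleneck(sessions_now, sessions_past,
                        cpu_usr_now, cpu_usr_past, slow_queries):
    """Return True when all three conditions of the correlation rule hold.

    sessions_now / sessions_past: VIP session counts now and 3 minutes ago.
    cpu_usr_now / cpu_usr_past: lists of per-server cpu_usr readings.
    slow_queries: current MySQL slow query count.
    """
    sessions_rose = sessions_now > sessions_past * 1.10
    cpu_did_not_rise = mean(cpu_usr_now) < mean(cpu_usr_past) * 1.05
    many_slow_queries = slow_queries > 10
    return sessions_rose and cpu_did_not_rise and many_slow_queries
```

The value of the rule is in the combination: more sessions without more CPU work, together with slow queries, points at the database rather than at the application servers.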
  25. 25. Correlation Engine Identifying an outlier: select host, free, avg(free) from Memory.win:time(240 sec) group by host having free < avg(free) Event sequence: select * from pattern [every Memory(free < 10) -> (timer:interval(60 sec) and Log(text like '%OutOfMemory%')) ] Schedule and extensions: select idle from pattern [every timer:at(*, [16:22], *, [0,3], *) ].win:time(30 sec), CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
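
The outlier query above (free memory below the per-host average over a 240-second window) can be approximated in plain Python as follows. This is a sketch of the idea, not of Esper's internals: the class name `FreeMemoryOutlierDetector` is hypothetical, timestamps are assumed to be seconds supplied by the caller in increasing order, and the newest event is included in the average.

```python
from collections import defaultdict, deque


class FreeMemoryOutlierDetector:
    """Flags a reading when it is below its host's sliding-window average."""

    def __init__(self, window_seconds=240):
        self.window_seconds = window_seconds
        self._events = defaultdict(deque)  # host -> deque of (timestamp, free)

    def add(self, host, timestamp, free):
        """Record a reading and return True if it is below the window average."""
        window = self._events[host]
        window.append((timestamp, free))
        # Evict readings that have fallen out of the time window.
        while window and window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        return free < average
```

Keeping one window per host mirrors the `group by host` clause, so a busy host does not distort the baseline of a quiet one.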
  26. 26. Correlation engine Esper performance. Test setup: Esper server hardware: 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM; VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon. Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds. Results: number of queries: 1000; evt/s: 519 728; latency: 99.66% < 10us; average latency: 2.8us; note: CPU at 85%, 70 Mbit/s. Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
  27. 27. Correlation engine Processes inside the correlation engine
  28. 28. Visualization – Console Querying the live environment
  29. 29. Visualization – Troubleshooting Anticipating and solving incidents more quickly
  30. 30. Visualization – Dashboard Consolidated view of the environment
  31. 31. What about unseen problems?
  32. 32. Machine Learning We chose unsupervised, incremental algorithms Incremental PCA • Transforms a number of possibly correlated variables into a smaller number of uncorrelated variables, the principal components • A change in the principal components means a broken correlation, or anomaly • Can be used for data compression Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006) Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf The implementation had two main challenges: measures with missing values and different scales
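
To illustrate how an incremental method can track a principal component one sample at a time, here is a small Python sketch based on Oja's rule. Oja's rule is used here only as a simple stand-in; it is not claimed to be the algorithm of the cited paper or of the system presented. The class `OjaPCA` and the function `train_on_correlated_signals` are hypothetical names, and the sketch assumes inputs that are already centered, similarly scaled and free of missing values (the two challenges the slide mentions are not handled).

```python
import math
import random


class OjaPCA:
    """Tracks the first principal component incrementally with Oja's rule."""

    def __init__(self, dimensions, learning_rate=0.05):
        self.learning_rate = learning_rate
        self.weights = [1.0] + [0.0] * (dimensions - 1)

    def project(self, sample):
        """Coordinate of the sample along the current component."""
        return sum(w * x for w, x in zip(self.weights, sample))

    def update(self, sample):
        """Oja's rule: w <- w + eta * y * (x - y * w), with y = w . x."""
        y = self.project(sample)
        self.weights = [w + self.learning_rate * y * (x - y * w)
                        for w, x in zip(self.weights, sample)]

    def residual(self, sample):
        """Reconstruction error; a large value suggests a broken correlation."""
        y = self.project(sample)
        return math.sqrt(sum((x - y * w) ** 2
                             for w, x in zip(self.weights, sample)))


def train_on_correlated_signals(samples=2000, seed=0):
    """Train on two synthetic signals that move together, plus small noise."""
    rng = random.Random(seed)
    model = OjaPCA(dimensions=2)
    for _ in range(samples):
        t = rng.uniform(-1, 1)
        model.update((t + rng.gauss(0, 0.05), t + rng.gauss(0, 0.05)))
    return model
```

After training, a sample in which both signals move together, such as (1, 1), should reconstruct well, while a sample in which they diverge, such as (1, -1), should leave a large residual. That residual is the kind of signal the slide describes as a broken correlation.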
  33. 33. Machine Learning 60 input signals
  34. 34. Machine Learning Summarized in 1 principal component + generation matrix
  35. 35. Machine Learning Second principal component sensitivity: three anomalies
  36. 36. Project Status • All functionality developed • Algorithms being validated through tests with RRDs and meetings with the operations team • Performance tests ongoing • System running in the live environment with reduced scope
  37. 37. Project at Globo.com – Next challenges • Scale • Event “sharding” • Rule balancing • Cache • Optimize algorithm • Adaptive control of memory and sensitivity parameters • Insert a supervised layer • Other algorithms to cooperate
  38. 38. Intelligent Monitoring Final considerations
  39. 39. References http://delicious.com/fisl10
  40. 40. Questions Contacts Denis A. Vieira Jr denis@corp.globo.com (www.globo.com) Ricardo Clemente ricardo@intelie.com.br (www.intelie.com.br) Globo.com stand This afternoon Raise your hand!
