Intelligent Monitoring

    Intelligent Monitoring - Presentation Transcript

    1. Intelligent Monitoring Denis A. Vieira Jr. Ricardo Clemente
    2. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
    3. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
    4. Intelligent Monitoring Motivation:  Only point-in-time (spot) monitoring available  Decrease time to repair incidents  Proactive monitoring  Realistic view of the live environment
    5. Intelligent Monitoring Motivation:  Learn (identify patterns)  Automation  Store historical data with no loss  Improve credibility and situational awareness
    6. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
    7. Intelligent Monitoring Where are we?:  Lots of information (1200 servers with more than 14000 monitors) – more than 40000 graphs being plotted  Many monitoring tools running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, Logs)  Difficulties with specific customizations, performance and cost  Alarms lack credibility (too many emails), although the situation is much better than before
    8. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
    9. Intelligent Monitoring Where are we going?:  Use of events, e.g. appenders for log frameworks to integrate information from applications  Knowledge to anticipate undesired situations  Unified interface for monitoring  Root cause detection
    10. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation
    11. Intelligent Monitoring Action Plan:  Unify the monitoring tools with Nagios (scalability and integration)  Integrate Nagios with the correlation system using NEB (Nagios Event Broker)  available at: code.google.com/p/neb2activemq  Map events and systems to correlate (manual and analytic task)
    12. Intelligent Monitoring Summary:  Motivation  Where are we?  Where are we going?  Action Plan  Event Correlation  Overview and system architecture  Event Bus  Correlation technique  Correlation engine  Visualization  Machine Learning  Project
    13. Overview and system architecture  Modular and event-driven architecture  Diagram modules: Collector, Correlation Engine, Event Bus, Machine Learning, Visualization
    14. Overview and system architecture What is the system architecture?  Single bus for message exchange  Modules are separate operating system processes and can run on different machines  Modules can publish / subscribe to a queue / topic on the bus Why an Event Driven Architecture?  Loosely coupled and distributed  Less intrusive for monitored systems  Modules are independent
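The publish / subscribe model on slide 14 can be illustrated with a minimal in-process sketch. It is not the project's code and does not use ActiveMQ; the `EventBus` class and its method names are hypothetical. It only assumes the usual messaging semantics: a topic delivers each event to every subscriber, while a queue delivers each event to exactly one consumer (round-robin here).

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for a message bus with topics and queues."""

    def __init__(self):
        self.topic_subscribers = defaultdict(list)
        self.queue_consumers = defaultdict(list)
        self.next_consumer = defaultdict(int)

    def subscribe(self, kind, name, handler):
        if kind == "topic":
            self.topic_subscribers[name].append(handler)
        elif kind == "queue":
            self.queue_consumers[name].append(handler)
        else:
            raise ValueError("kind must be 'topic' or 'queue'")

    def publish(self, kind, name, event):
        if kind == "topic":
            # Topic: every subscriber receives a copy of the event.
            for handler in self.topic_subscribers[name]:
                handler(event)
        elif kind == "queue":
            # Queue: exactly one consumer receives the event (round-robin).
            consumers = self.queue_consumers[name]
            if consumers:
                index = self.next_consumer[name] % len(consumers)
                self.next_consumer[name] += 1
                consumers[index](event)
        else:
            raise ValueError("kind must be 'topic' or 'queue'")


bus = EventBus()
seen = []
bus.subscribe("topic", "CPU", lambda e: seen.append(("visualization", e)))
bus.subscribe("topic", "CPU", lambda e: seen.append(("correlation", e)))
bus.publish("topic", "CPU", {"host": "ws122", "idle": 70})
```

Because modules only share the bus and the event format, a collector can be added or restarted without touching the modules that consume its events.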
    15. Event Bus  Open source project chosen: Apache ActiveMQ  Stable  Performance  Active community  Connectivity: JMS, STOMP, REST, XMPP (...)
    16. Event Bus Message format  JSON (not XML)  Simplicity  Structure  Header: channel type (queue or topic) and event type  Body: data
        $ curl -d 'type=queue&body={"idle":70,"sys":20,"usr":10,"host":"ws122"}&eventtype=CPU' http://barramento/message/events
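As a hedged illustration of the message format on slide 16, the sketch below builds and parses the same form-encoded payload with a JSON body. The field names `type`, `eventtype` and `body` come from the curl example; the helper names `encode_event` and `decode_event` are hypothetical, and how the real bus endpoint parses the request is not shown in the slides.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_event(channel_type, event_type, body):
    """Build the form-encoded payload: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,          # header: channel type
        "eventtype": event_type,       # header: event type
        "body": json.dumps(body),      # body: the measured data as JSON
    })


def decode_event(payload):
    """Split a payload back into header and body."""
    fields = parse_qs(payload)
    return {
        "header": {
            "type": fields["type"][0],
            "eventtype": fields["eventtype"][0],
        },
        "body": json.loads(fields["body"][0]),
    }


payload = encode_event("queue", "CPU",
                       {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"})
event = decode_event(payload)
```

URL-encoding the body keeps characters such as `&` or `=` inside the JSON from being read as field separators.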
    17. Correlation Technique CEP (Complex Event Processing)  Technology that processes multiple events in real time with the goal of identifying meaningful events  Based on rules or queries ("SQL-like")  Queries can be created at run time History  In 1995, Professor David Luckham of Stanford, working on the Rapide project, coined the term CEP  Database research topic: Data Stream Management Systems (DSMS)
    18. Correlation Technique "Upside-down database" (diagram: in a traditional database, a query is processed against persistent relations and returns an answer; in stream processing, a data stream flows through queries held in memory and produces continuous answers)
    19. Correlation Technique Market trend (buzz)  The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society – April 2009) Useful wherever there are data streams and a need to extract information from them in real time  Financial markets  Logistics processes (RFID)  Airport control  ICUs  Datacenters
    20. Correlation Technique Big Players
    21. Correlation Technique Open Source Players Academic projects:  STREAM – Stanford – 2003 (officially deprecated)  TelegraphCQ – Berkeley – 2003  Based on PostgreSQL 7.3.2  No activity  Cayuga – Cornell From the industry: Esper, a Codehaus project  Feature-complete  Compact and flexible syntax  Excellent documentation  Performance  Our choice!
    22. Correlation Engine Application If sessions rose 10% in the last 3 min, the average server CPU did not rise 5%, and MySQL slow queries are above 10, then there is a database bottleneck causing users to queue
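The rule on slide 22 can be written down as a plain predicate over two snapshots, one current and one taken 3 minutes earlier. This is only a sketch of the logic, not the Esper query used in the project; the `Snapshot` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical aggregate of the monitored values at one point in time."""
    vip_sessions: float
    avg_server_cpu: float
    mysql_slow_queries: int


def database_bottleneck(now, past):
    """True when sessions rose 10% but CPU did not rise 5% and slow queries exceed 10."""
    sessions_rose = now.vip_sessions > past.vip_sessions * 1.10
    cpu_flat = now.avg_server_cpu < past.avg_server_cpu * 1.05
    slow_queries_high = now.mysql_slow_queries > 10
    return sessions_rose and cpu_flat and slow_queries_high


past = Snapshot(vip_sessions=1000, avg_server_cpu=40.0, mysql_slow_queries=2)
queued = Snapshot(vip_sessions=1200, avg_server_cpu=41.0, mysql_slow_queries=15)
busy = Snapshot(vip_sessions=1200, avg_server_cpu=50.0, mysql_slow_queries=15)
```

With these made-up values, `queued` matches the rule (more sessions, flat CPU, many slow queries), while `busy` does not because the servers are working harder in step with the extra sessions.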
    23. Correlation Engine Application (timeline diagram: Vip session and Server cpu_usr compared between t – 3 min and t; Mysql slow_query evaluated at t)
    24. Correlation Engine Application
        SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session, Vip_PAST.session, Mysql.slow_query
        FROM Server.win:time(1 min) as Server,
             Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
             Vip.win:time(1 min) as Vip,
             Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST,
             Mysql.win:time(1 min) as Mysql
        HAVING Vip.session > Vip_PAST.session * 1.10
           AND avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05
           AND Mysql.slow_query > 10
    25. Correlation Engine
        Identifying an outlier:
          select host, free, avg(free) from Memory.win:time(240 sec) group by host having free < avg(free)
        Event sequence:
          select * from pattern [every Memory(free < 10) -> (timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]
        Schedule and extensions:
          select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec), CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
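To make the outlier query on slide 25 more concrete, here is a rough Python approximation of what a 240-second time window grouped by host computes. It is not how Esper is implemented, and the class name `TimeWindowAverage` is hypothetical; it simply keeps the recent `free` values per host and flags an event whose value is below the window average.

```python
from collections import defaultdict, deque


class TimeWindowAverage:
    """Keeps the last `seconds` of (timestamp, free) events per host."""

    def __init__(self, seconds):
        self.seconds = seconds
        self.windows = defaultdict(deque)

    def push(self, timestamp, host, free):
        """Add an event; return (is_outlier, window_average) for that host."""
        window = self.windows[host]
        window.append((timestamp, free))
        # Drop events that have fallen out of the time window.
        while window[0][0] <= timestamp - self.seconds:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        return free < average, average


memory = TimeWindowAverage(seconds=240)
results = [
    memory.push(0, "ws122", 100),
    memory.push(60, "ws122", 100),
    memory.push(120, "ws122", 40),   # well below the recent average
    memory.push(400, "ws122", 40),   # older events have expired by now
]
```

In this made-up sequence only the third event is flagged: by the fourth event the window holds a single value, so the low reading has become the new normal.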
    26. Correlation Engine: Esper Performance
        Hardware (Esper server): 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM
        VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon
        Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds
        Number of queries: 1000
        Events/s: 519 728
        Average latency: 2.8us
        Latency: 99.66% < 10us
        Note: CPU at 85%, 70 Mbit/s
        Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
    27. Correlation Engine Processing inside the correlation engine
    28. Visualization – Console Querying the live environment
    29. Visualization – Troubleshooting Anticipating and solving incidents more quickly
    30. Visualization – Dashboard Consolidated view of the environment
    31. What about unseen problems?
    32. Machine Learning Choice of unsupervised and incremental algorithms Incremental PCA  Transforms a number of possibly correlated variables into a smaller number of uncorrelated variables, the principal components  A change in the principal components means a broken correlation, or anomaly  Can be used for data compression Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006) Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf Implementation had two main challenges: measures with missing values and different scales
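The idea on slide 32 can be sketched with a much simpler incremental estimator than the one in the cited paper: Oja's rule, which tracks only the first principal component. This is a swapped-in technique for illustration, not the project's algorithm. The signal names and values are invented, running standardization is one assumed way to handle different scales, and imputing a missing value with the running mean is likewise an assumption.

```python
import math
import random


class RunningScaler:
    """Per-signal running mean and variance (Welford), to put signals on one scale."""

    def __init__(self, n):
        self.count = [0] * n
        self.mean = [0.0] * n
        self.m2 = [0.0] * n

    def update(self, x):
        for i, v in enumerate(x):
            if v is None:              # missing measure: statistics stay untouched
                continue
            self.count[i] += 1
            delta = v - self.mean[i]
            self.mean[i] += delta / self.count[i]
            self.m2[i] += delta * (v - self.mean[i])

    def scale(self, x):
        z = []
        for i, v in enumerate(x):
            if v is None or self.count[i] < 2:
                z.append(0.0)          # assumption: impute with the running mean
                continue
            std = math.sqrt(self.m2[i] / self.count[i]) or 1.0
            z.append((v - self.mean[i]) / std)
        return z


class OjaPCA:
    """Tracks the first principal component incrementally with Oja's rule."""

    def __init__(self, n, lr=0.01):
        self.w = [1.0 / math.sqrt(n)] * n
        self.lr = lr

    def project(self, z):
        return sum(wi * zi for wi, zi in zip(self.w, z))

    def update(self, z):
        y = self.project(z)
        self.w = [wi + self.lr * y * (zi - y * wi) for wi, zi in zip(self.w, z)]
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        self.w = [wi / norm for wi in self.w]

    def residual(self, z):
        """Distance between a sample and its reconstruction from one component."""
        y = self.project(z)
        return math.sqrt(sum((zi - y * wi) ** 2 for wi, zi in zip(self.w, z)))


# Three invented signals driven by one hidden load: cpu %, sessions, idle %.
rng = random.Random(0)
scaler, pca = RunningScaler(3), OjaPCA(3)
for _ in range(2000):
    load = rng.uniform(0.0, 1.0)
    sample = [100 * load + rng.gauss(0, 1),
              5000 * load + rng.gauss(0, 50),
              100 - 100 * load + rng.gauss(0, 1)]
    scaler.update(sample)
    pca.update(scaler.scale(sample))

normal = pca.residual(scaler.scale([90.0, 4500.0, 10.0]))
broken = pca.residual(scaler.scale([90.0, 500.0, 10.0]))   # sessions no longer follow cpu
```

A sample that respects the learned correlation is reconstructed almost exactly from one component, so `normal` stays small; when one signal stops following the others, the residual `broken` grows, which is the "broken correlation" signal the slide describes.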
    33. Machine Learning 60 input signals
    34. Machine Learning Summarized in 1 principal component + generation matrix
    35. Machine Learning Sensitivity of the second principal component: three anomalies
    36. Project Status  All functionality developed  Algorithms being validated through tests with RRDs and meetings with the operations team  Performance tests ongoing  System running in the live environment with reduced scope
    37. Project at Globo.com – Next challenges  Scale  Event "sharding"  Rule balancing  Cache  Optimize algorithm  Adaptive control of memory and sensitivity parameters  Add a supervised layer  Other algorithms to cooperate
    38. Intelligent Monitoring Final considerations
    39. References http://delicious.com/fisl10
    40. Questions Contacts Denis A. Vieira Jr denis@corp.globo.com (www.globo.com) Ricardo Clemente ricardo@intelie.com.br (www.intelie.com.br) Globo.com stand This afternoon Raise your hand!