Intelligent Monitoring

        Denis A. Vieira Jr.
       Ricardo Clemente
Intelligent Monitoring

Summary:


 Motivation
 Where are we?
 Where are we going?
 Action Plan
 Event Correlation
Intelligent Monitoring


Motivation:

    Only spot (point-in-time) monitoring available

    Decrease time to repair incidents

    Proactive monitoring

    Realistic view from live environment
Intelligent Monitoring


Motivation:

    Learn (identify patterns)

    Automation

    Store historical data with no loss

    Improve credibility and Situational Awareness
Intelligent Monitoring


 Where are we?

    Lots of information (1200 servers with more than 14000 monitors)
     – more than 40000 graphs being plotted

    Lots of monitoring tools running (SME, IPMonitor, Cricket,
     SiteScope, SiteSeer, Logs)

    Difficulties with specific customizations, performance and cost

    Alarms have little credibility (too many emails), but it is much
     better than before.
Intelligent Monitoring


Where are we going?

    Use of events, e.g. appenders for log frameworks to integrate
     information from applications

    Knowledge to anticipate undesired situations

    Unified interface for monitoring

    Root cause detection
Intelligent Monitoring


Action Plan:

    Unify the monitoring tools with Nagios (scalability and integration)

    Integrate Nagios with the correlation system using NEB (Nagios Event
     Broker),
    available at:
         code.google.com/p/neb2activemq

    Map the events and systems to correlate
   (a manual, analytical task)
Intelligent Monitoring

Summary:


 Motivation
 Where are we?
 Where are we going?
 Action Plan
 Event Correlation
    Overview and system architecture
    Event Bus
    Correlation technique
    Correlation engine
    Visualization
    Machine Learning
    Project
Overview and system architecture

 Modular and event-driven architecture



 [Diagram: the Collector, Correlation Engine, Machine Learning and
  Visualization modules all connect to a central Event Bus]
Overview and system architecture
What is the system architecture?

 Single bus for message exchange
 Modules are separate operating system processes and can run on
  different machines
 Modules can publish or subscribe to queues and topics on the bus

Why an event-driven architecture?

 Loosely coupled and distributed
    Less intrusive for monitored systems
    Modules are independent
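
The publish / subscribe idea above can be sketched in a few lines. This is only an in-process toy, assuming a hypothetical `EventBus` class and handler lists in place of the real broker and modules; it shows topic-style delivery, where every subscriber receives its own copy of an event and publishers know nothing about the consumers.

```python
from typing import Any, Callable, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Toy in-process stand-in for the message bus (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers.setdefault(topic, []).append(handler)

    def publish(self, topic: str, event: Dict[str, Any]) -> None:
        # Topic semantics: each subscriber gets its own copy of the event.
        for handler in self._subscribers.get(topic, []):
            handler(dict(event))


bus = EventBus()
seen_by_engine: List[Dict[str, Any]] = []      # stands in for the correlation engine
seen_by_dashboard: List[Dict[str, Any]] = []   # stands in for the visualization module

bus.subscribe("CPU", seen_by_engine.append)
bus.subscribe("CPU", seen_by_dashboard.append)

# The collector publishes without knowing who is listening.
bus.publish("CPU", {"host": "ws122", "idle": 70, "sys": 20, "usr": 10})
bus.publish("Memory", {"host": "ws122", "free": 512})  # no subscribers, so nothing happens
```

In the real system the modules are separate processes, so the handler call would be replaced by a network delivery through the broker.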
Event bus
Open source project

Chosen: Apache ActiveMQ
 Stable
 Performance
 Active community
 Connectivity
     JMS
     STOMP
     REST
     XMPP (...)
Event Bus
Message format

 JSON (not XML)
     Simplicity
 Structure
     Header: channel type (queue or topic) and event type
     Body: data



 $ curl -d 'type=queue&body={"idle": 70, "sys": 20,
 "usr": 10, "host": "ws122"}&eventtype=CPU'
 http://barramento/message/events
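
As a rough sketch of the same message built from code: the field names `type`, `body` and `eventtype` come from the curl call above, while the helper names `build_event_form` and `parse_event_form` are hypothetical. Nothing is sent over the network here; the returned string is what would go in the POST body.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_event_form(event_type: str, data: dict, channel: str = "queue") -> str:
    """Return the form-encoded body for one event (header fields plus JSON body)."""
    if channel not in ("queue", "topic"):
        raise ValueError("channel must be 'queue' or 'topic'")
    return urlencode({
        "type": channel,            # header: channel type
        "eventtype": event_type,    # header: event type
        "body": json.dumps(data),   # body: the measurement itself, as JSON
    })


def parse_event_form(payload: str) -> dict:
    """Decode a payload produced by build_event_form (e.g. on the consumer side)."""
    fields = parse_qs(payload)
    return {
        "type": fields["type"][0],
        "eventtype": fields["eventtype"][0],
        "body": json.loads(fields["body"][0]),
    }


payload = build_event_form("CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"})
decoded = parse_event_form(payload)
```

Using `urlencode` escapes the braces and quotes of the JSON body, which the hand-written curl command leaves to the server's tolerance.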
Correlation Technique

CEP (Complex Event Processing)
 Technology that processes multiple events in real time with
  the goal of identifying meaningful events
 Based on rules or queries (“SQL-like”)
 Queries created at run time

History
 In 1995, Professor David Luckham of Stanford, working on the Rapide
  project, coined the term CEP
 Database research topic: Data Stream Management Systems (DSMS)
Correlation technique
                 “upside down database”

 [Diagram: in a database, persistent relations are stored and each
  query is processed once to return one answer; in a data stream
  system, continuous queries are held in memory and the data stream
  flows through them, producing continuous answers]
Correlation Technique
 Marketing
 Trend (buzz)
  CEP market estimated at 460 million dollars by 2010 (source: IEEE
   Computer Society – April 2009)

 Useful wherever there are data streams and a need to extract
   information from that data in real time
  Financial markets
  Logistics processes (RFID)
  Airport control
  ICUs
  Datacenters
Correlation Technique
 Big Players
Correlation Technique
 Open Source Players
 Academic projects:
  STREAM – Stanford – 2003 (officially deprecated)
  TelegraphCQ – Berkeley – 2003
      Based on PostgreSQL 7.3.2
      No activity
  Cayuga – Cornell

 From the industry:
 Esper, a Codehaus project, complete in terms of features
  Compact and flexible syntax
  Excellent documentation
  Performance
  Our choice!
Correlation Engine
 Application




                     If sessions rose 10% in the
                     last 3 min, and the servers'
                     average CPU did not rise
                     5%, and MySQL slow queries
                     are above 10, then there is
                     database contention causing
                     users to queue
Correlation Engine
Application
 [Timeline: Vip session and Server cpu_usr are each compared between
  t – 3 min and t; Mysql slow_query is read at t]
Correlation Engine
 Application

 SELECT Server.host , Server.cpu_usr, Server_PAST.cpu_usr, Vip.session,
   Vip_PAST.session, Mysql.slow_query
   FROM
          Server.win:time(1 min) as Server,
          Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
          Vip.win:time(1 min) as Vip,
          Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST ,
          Mysql.win:time(1 min) as Mysql
   HAVING
          Vip.session > Vip_PAST.session * 1.10 AND
          avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND
          Mysql.slow_query > 10
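
The three conditions of the rule can also be read as plain arithmetic. The sketch below is not how Esper evaluates the query (it keeps no time windows); it only assumes the caller already has the current and 3-minutes-ago values, and the function name is hypothetical.

```python
from statistics import mean
from typing import Sequence


def users_queueing_on_database(
    session_now: float,
    session_3min_ago: float,
    cpu_now: Sequence[float],
    cpu_3min_ago: Sequence[float],
    slow_queries: int,
) -> bool:
    """Apply the slide's rule to values that were already windowed elsewhere."""
    sessions_rose = session_now > session_3min_ago * 1.10       # sessions up more than 10%
    cpu_flat = mean(cpu_now) < mean(cpu_3min_ago) * 1.05        # average CPU up less than 5%
    many_slow_queries = slow_queries > 10                        # MySQL slow queries above 10
    return sessions_rose and cpu_flat and many_slow_queries


# Sessions jump 20% while CPU stays flat and slow queries pile up.
alert = users_queueing_on_database(1200, 1000, [40, 42], [40, 41], 15)
```

The point of the CEP engine is that it maintains these windows and re-evaluates the condition on every incoming event, which this function leaves to the caller.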
Correlation Engine
 Identifying an outlier
     select host, free, avg(free)
     from Memory.win:time(240 sec) group by host
     having free < avg(free)

 Event sequence
    select * from
       pattern [every Memory(free < 10) ->
            (timer:interval(60 sec) and Log(text like '%OutOfMemory%')) ]

 Schedule and extensions
     select idle from pattern [every timer:at(*, [16:22], *, [0,3], *) ].win:time(30
        sec), CPU.win:time(30 sec) where idle < 30 AND Filter.isInNode(id,
        "Sports.BigFarm")
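
To make the first query concrete, here is a minimal sketch of the same idea outside Esper: keep a 240-second window of `free` per host and flag a sample that falls below that host's window average. The class name and the explicit timestamps are assumptions for illustration; Esper manages the window and clock itself.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple


class BelowWindowAverage:
    """Per-host sliding time window; flags samples below the window average."""

    def __init__(self, window_sec: float = 240.0) -> None:
        self.window_sec = window_sec
        self._samples: Dict[str, Deque[Tuple[float, float]]] = defaultdict(deque)

    def observe(self, host: str, timestamp: float, free: float) -> bool:
        samples = self._samples[host]
        samples.append((timestamp, free))
        # Evict samples that have fallen out of the time window.
        while samples and samples[0][0] <= timestamp - self.window_sec:
            samples.popleft()
        # The current sample is part of the window, as in the query above.
        average = sum(value for _, value in samples) / len(samples)
        return free < average


detector = BelowWindowAverage(window_sec=240)
flags = [
    detector.observe("ws122", 0, 100),
    detector.observe("ws122", 60, 100),
    detector.observe("ws122", 120, 40),   # drops below the average of the window
]
```

Each host has its own window, which mirrors the `group by host` clause.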
Correlation Engine
 Esper performance

      Item                     Specification
      Esper server hardware    2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM
      VM config                -Xms2g -Xmx2g -Xns128m -Xgc:gencon

  Query:          select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds
  # queries:      1000
  evt/s:          519 728
  Latency:        99.66% < 10us
  Mean latency:   2.8us
  Note:           CPU at 85%, 70 Mbit/s

Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
Correlation engine
 Process inside the correlation engine
Visualization – Console
Querying the live environment
Visualization – Troubleshooting
Anticipating and solving incidents more quickly
Visualization – Dashboard
Consolidated view of the environment
What about unseen problems?
Machine Learning

Choice of unsupervised and incremental algorithms

Incremental PCA
 Transforms a number of possibly correlated variables into a smaller
  number of uncorrelated variables, the principal components
 A change in the principal components means a broken correlation, or
  anomaly
 Can be used for data compression

Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)
Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf


The implementation had two main challenges: measures with missing values
  and different scales
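
A small sketch of the underlying idea, with loud caveats: it uses ordinary batch PCA through an SVD, not the incremental algorithm from the paper, and the handling of the two challenges (standardizing each signal, replacing missing values with the signal's mean) is an assumption, not the presenters' implementation. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def fit_pca(X: np.ndarray, k: int = 1):
    """Standardize each column (ignoring NaNs) and return mean, std, top-k components."""
    mean = np.nanmean(X, axis=0)
    std = np.nanstd(X, axis=0)
    std[std == 0] = 1.0                      # avoid dividing by zero on constant signals
    Z = (X - mean) / std                     # put signals on the same scale
    Z = np.where(np.isnan(Z), 0.0, Z)        # missing value -> column mean (0 after scaling)
    _, _, vt = np.linalg.svd(Z, full_matrices=False)
    return mean, std, vt[:k]


def anomaly_score(x: np.ndarray, mean: np.ndarray, std: np.ndarray,
                  components: np.ndarray) -> float:
    """Distance between a sample and its reconstruction from the principal components."""
    z = (x - mean) / std
    z = np.where(np.isnan(z), 0.0, z)
    reconstruction = components.T @ (components @ z)
    return float(np.linalg.norm(z - reconstruction))


# Three synthetic signals driven by one hidden load, on very different scales.
rng = np.random.default_rng(0)
load = rng.normal(size=500)
history = np.column_stack([
    load,
    2.0 * load + rng.normal(scale=0.01, size=500),
    -100.0 * load + rng.normal(scale=0.5, size=500),
])

model = fit_pca(history, k=1)
normal = anomaly_score(np.array([1.0, 2.0, -100.0]), *model)    # follows the correlation
broken = anomaly_score(np.array([2.0, -4.0, -200.0]), *model)   # second signal breaks it
```

A sample that respects the learned correlation reconstructs almost perfectly, while one that breaks it leaves a large residual; an incremental variant would update the components as each new sample arrives instead of refitting.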
Machine Learning

60 input signals
Machine Learning

Summarized in 1 principal component + generation matrix
Machine Learning

[Plot: second principal component, annotated with the sensitivity
 setting and three anomalies]
Project

Status

 All functionality developed

 Algorithms being validated through tests with
  RRDs and meetings with the operations team

 Performance tests ongoing

 System running in the live environment with reduced scope
Project at Globo.com – Next challenges


Scale
    Event “sharding”
    Rule balancing
    Cache

Optimize the algorithm
    Adaptive control of memory and sensitivity parameters
    Add a supervised layer
    Other algorithms to cooperate
Intelligent Monitoring

      Final considerations
References




       http://delicious.com/fisl10
Questions

 Contacts
   Denis A. Vieira Jr
   denis@corp.globo.com (www.globo.com)
   Ricardo Clemente
   ricardo@intelie.com.br (www.intelie.com.br)

 Globo.com stand
    This afternoon

 Raise your hand!

Intelligent Monitoring

  • 1. Intelligent Monitoring Denis A. Vieira Jr. Ricardo Clemente
  • 2. Intelligent Monitoring Summary: Motivation; Where are we?; Where are we going?; Action Plan; Event Correlation
  • 4. Intelligent Monitoring Motivation: Only point-in-time (spot-check) monitoring available; Decrease time to repair incidents; Proactive monitoring; Realistic view of the live environment
  • 5. Intelligent Monitoring Motivation: Learn (identify patterns); Automation; Store historical data with no loss; Improve credibility and situational awareness
  • 7. Intelligent Monitoring Where are we?: Lots of information (1200 servers with more than 14000 monitors), more than 40000 graphs being plotted; Many monitoring tools running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, logs); Difficulties with specific customizations, performance and cost; Alarms have no credibility (lots of emails), but the situation is much better than before.
  • 9. Intelligent Monitoring Where are we going?: Use of events, e.g. appenders for log frameworks to integrate information from applications; Knowledge to anticipate undesired situations; Unified interface for monitoring; Root cause detection
  • 11. Intelligent Monitoring Action Plan: Unify the monitoring tools with Nagios (scalability and integration); Integrate Nagios with the correlation system using NEB (Nagios Event Broker), available at: code.google.com/p/neb2activemq; Map events and systems to correlate (manual and analytical task)
  • 12. Intelligent Monitoring Summary: Motivation; Where are we?; Where are we going?; Action Plan; Event Correlation: Overview and system architecture, Event Bus, Correlation technique, Correlation engine, Visualization, Machine Learning, Project
  • 13. Overview and system architecture: Modular and event-driven architecture. [Diagram: COLLECTOR, CORRELATION ENGINE, MACHINE LEARNING and VISUALIZATION modules connected through the EVENT BUS]
  • 14. Overview and system architecture What is the system architecture? A single bus for message exchange; Modules are separate operating system processes and can run on different machines; Modules can publish / subscribe to queues / topics on the bus. Why an Event Driven Architecture? Loosely coupled and distributed; Less intrusive for monitored systems; Modules are independent
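The publish / subscribe idea on this slide can be sketched in a few lines of Python. This is a toy, in-process stand-in for a broker, not the system described in the talk: the `EventBus` class and its method names are hypothetical, and the queue behaviour (one consumer per message, round-robin) is an assumption about typical broker semantics.

```python
from collections import defaultdict, deque


class EventBus:
    """Toy in-process message bus (illustrative only, hypothetical API)."""

    def __init__(self):
        self._topic_subs = defaultdict(list)   # topic name -> list of callbacks
        self._queue_subs = defaultdict(deque)  # queue name -> rotating consumers

    def subscribe(self, channel_type, name, callback):
        if channel_type == "topic":
            self._topic_subs[name].append(callback)
        elif channel_type == "queue":
            self._queue_subs[name].append(callback)
        else:
            raise ValueError(f"unknown channel type: {channel_type}")

    def publish(self, channel_type, name, event):
        if channel_type == "topic":
            # Topic: every subscriber receives its own delivery of the event.
            for callback in self._topic_subs[name]:
                callback(event)
        elif channel_type == "queue":
            # Queue: exactly one consumer receives the event (round-robin).
            # A real broker would buffer the event if nobody is listening;
            # this toy simply drops it.
            consumers = self._queue_subs[name]
            if consumers:
                callback = consumers[0]
                consumers.rotate(-1)
                callback(event)
        else:
            raise ValueError(f"unknown channel type: {channel_type}")


bus = EventBus()
engine_inbox, console_inbox = [], []
bus.subscribe("topic", "CPU", engine_inbox.append)
bus.subscribe("topic", "CPU", console_inbox.append)
bus.publish("topic", "CPU", {"host": "ws122", "idle": 70})
```

Because the modules only share channel names, the collector does not need to know which modules consume its events, which is the loose coupling the slide refers to.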
  • 15. Event bus Open source project. Chose Apache ActiveMQ: Stable; Performance; Active community; Connectivity: JMS, STOMP, REST, XMPP (...)
  • 16. Event Bus Message format: JSON (not XML): simplicity. Structure: Header: channel type (queue or topic) and event type; Body: data. Example: $ curl -d 'type=queue&body={"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}&eventtype=CPU' http://barramento/message/events
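A minimal sketch of how a collector might assemble the same form-encoded message in Python. The field names `type`, `eventtype` and `body` come from the curl example on the slide; the helper name `build_event_payload` is hypothetical, and the sketch only builds the payload, it does not send it.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_event_payload(channel_type, event_type, data):
    """Return a form-encoded message: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError(f"unknown channel type: {channel_type}")
    return urlencode({
        "type": channel_type,      # header: queue or topic
        "eventtype": event_type,   # header: kind of event, e.g. CPU
        "body": json.dumps(data),  # body: the measurement itself
    })


payload = build_event_payload(
    "queue", "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}
)

# Round trip: a consumer can recover the header fields and the JSON body.
fields = parse_qs(payload)
body = json.loads(fields["body"][0])
```

Using `urlencode` escapes the braces and quotes in the JSON body, which avoids shell and HTTP quoting problems when the body grows beyond a few simple fields.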
  • 17. Correlation Technique CEP (Complex Event Processing): Technology that enables processing multiple events in real time with the goal of identifying meaningful events; Based on rules or queries ("SQL-like"); Queries can be created at execution time. History: In 1995, Professor David Luckham of Stanford, working on the Rapide project, coined the term CEP; Database research topic: Data Stream Management Systems (DSMS)
  • 18. Correlation technique: "upside-down database". [Diagram contrasting a database (query in, answer out, query processing over persistent relations) with stream processing (data stream in, continuous answer out, queries held in memory)]
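The "upside-down database" idea can be illustrated with a small Python sketch: the query is registered once and kept in memory, and every arriving event produces a fresh answer over a sliding time window. The `ContinuousQuery` class is hypothetical and assumes events arrive with non-decreasing timestamps in seconds; it is not how Esper or any DSMS is implemented.

```python
from collections import deque


class ContinuousQuery:
    """A standing query evaluated over a sliding time window (toy sketch)."""

    def __init__(self, window_seconds, aggregate):
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self.aggregate = aggregate  # e.g. max, min, len, sum
        self._window = deque()      # (timestamp, value) pairs, oldest first

    def on_event(self, timestamp, value):
        """Feed one event through the query and return the updated answer."""
        self._window.append((timestamp, value))
        cutoff = timestamp - self.window_seconds
        # Expire events that fell out of the window. The event just added
        # is always newer than the cutoff, so the deque never empties.
        while self._window[0][0] <= cutoff:
            self._window.popleft()
        return self.aggregate([v for _, v in self._window])


# "Maximum value seen in the last 60 seconds", answered continuously.
query = ContinuousQuery(60, max)
answers = [query.on_event(t, v) for t, v in [(0, 5), (30, 9), (70, 3), (95, 4)]]
```

The data is never stored beyond the window; only the query and its small working set stay in memory, which is the inversion the slide's diagram describes.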
  • 19. Correlation Technique Market trend (buzz): The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society, April 2009). Useful wherever there are data streams and a need to extract information from that data in real time: Financial markets; Logistics processes (RFID); Airport control; ICUs; Datacenters
  • 21. Correlation Technique Open source players. Academic projects: STREAM, Stanford, 2003 (officially deprecated); TelegraphCQ, Berkeley, 2003 (based on PostgreSQL 7.3.2, no activity); Cayuga, Cornell. From the industry: Esper, a Codehaus project, complete in terms of features: Compact and flexible syntax; Excellent documentation; Performance; Our choice!
  • 22. Correlation Engine Application: If sessions rose more than 10% in the last 3 min, and the average server CPU did not rise 5%, and MySQL slow queries are above 10, then there is a database bottleneck causing users to queue
  • 23. Correlation Engine Application [Timeline diagram: Vip session and Server cpu_usr are each compared between t - 3 min and t; Mysql slow_query is evaluated at t]
  • 24. Correlation Engine Application SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session, Vip_PAST.session, Mysql.slow_query FROM Server.win:time(1 min) as Server, Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST, Vip.win:time(1 min) as Vip, Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST, Mysql.win:time(1 min) as Mysql HAVING Vip.session > Vip_PAST.session * 1.10 AND avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND Mysql.slow_query > 10
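For readers unfamiliar with EPL, the three conditions of this rule can be restated as a plain Python predicate. This is only a sketch of the rule's logic under the assumption that the current and 3-minute-old values have already been collected; the function and parameter names are hypothetical, and the windowing that Esper performs is left out.

```python
from statistics import mean


def database_bottleneck(sessions_now, sessions_past,
                        cpu_now, cpu_past, slow_queries):
    """Return True when the slide's three conditions hold together.

    sessions_now / sessions_past: VIP session counts at t and t - 3 min.
    cpu_now / cpu_past: per-server cpu_usr readings at t and t - 3 min.
    slow_queries: current MySQL slow query count.
    """
    sessions_rose = sessions_now > sessions_past * 1.10
    cpu_stayed_flat = mean(cpu_now) < mean(cpu_past) * 1.05
    many_slow_queries = slow_queries > 10
    return sessions_rose and cpu_stayed_flat and many_slow_queries


# Sessions up 20%, CPU roughly unchanged, 15 slow queries: rule fires.
alert = database_bottleneck(1200, 1000, [40, 42], [40, 41], 15)
```

The value of the correlation is in the conjunction: each condition alone is unremarkable, but together they point to users piling up while the web servers wait on the database.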
  • 25. Correlation Engine Identifying an outlier: select host, free, avg(free) from Memory.win:time(240 sec) group by host having free < avg(free). Event sequence: select * from pattern [every Memory(free < 10) -> (timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]. Schedule and extensions: select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec), CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
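The outlier query on this slide (group by host, keep rows whose free memory is below the host's own 240-second average) can be approximated in Python as follows. The `MemoryOutlierDetector` class is a hypothetical illustration, not Esper's implementation, and it assumes timestamps in seconds that never decrease for a given host.

```python
from collections import defaultdict, deque


class MemoryOutlierDetector:
    """Flag readings below the host's own sliding-window average (sketch)."""

    def __init__(self, window_seconds=240):
        self.window_seconds = window_seconds
        self._by_host = defaultdict(deque)  # host -> (timestamp, free) pairs

    def on_event(self, timestamp, host, free):
        """Return (host, free, average) for an outlier, otherwise None."""
        window = self._by_host[host]
        window.append((timestamp, free))
        cutoff = timestamp - self.window_seconds
        # Drop readings older than the window; the newest one always stays.
        while window[0][0] <= cutoff:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        if free < average:
            return (host, free, average)
        return None


detector = MemoryOutlierDetector(window_seconds=240)
results = [
    detector.on_event(0, "ws122", 100),
    detector.on_event(60, "ws122", 100),
    detector.on_event(120, "ws122", 40),   # drops below its own average
    detector.on_event(120, "ws123", 40),   # first reading of another host
]
```

Comparing each host against its own recent history, rather than a fixed threshold, is what lets the same rule work for machines with very different memory sizes.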
  • 26. Correlation engine: Esper performance. Hardware (Esper server): 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM. VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon. Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds. Number of queries: 1000; throughput: 519 728 evt/s; latency: 99.66% < 10us; average latency: 2.8us; note: CPU at 85%, 70 Mbit/s. Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
  • 27. Correlation engine: Process inside the correlation engine
  • 28. Visualization – Console: Querying the live environment
  • 29. Visualization – Troubleshooting: Anticipating and solving incidents quicker
  • 31. What about unseen problems?
  • 32. Machine Learning: Choice of unsupervised and incremental algorithms. Incremental PCA: Transforms a number of possibly correlated variables into a smaller number of uncorrelated variables, the principal components; A change in the principal components means a broken correlation, or anomaly; Can be used for data compression. Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006). Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf. The implementation had two main challenges: measurements with missing values and different scales
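To make the idea concrete, here is a minimal pure-Python sketch of tracking the first principal component online and using the reconstruction residual as an anomaly score. It is not the algorithm from the cited paper or from the talk: it swaps in a simple running covariance (Welford-style update) followed by power iteration, the class name `OnlinePCA` is hypothetical, and it deliberately ignores the two challenges named on the slide (missing values and different scales).

```python
import math


class OnlinePCA:
    """Running mean/covariance plus power iteration (illustrative sketch)."""

    def __init__(self, dim):
        self.dim = dim
        self.n = 0
        self.mean = [0.0] * dim
        # Accumulated co-moments; dividing by (n - 1) gives the covariance.
        self._m2 = [[0.0] * dim for _ in range(dim)]

    def update(self, x):
        """Fold one measurement vector into the running statistics."""
        self.n += 1
        before = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + bi / self.n for mi, bi in zip(self.mean, before)]
        after = [xi - mi for xi, mi in zip(x, self.mean)]
        for i in range(self.dim):
            for j in range(self.dim):
                self._m2[i][j] += before[i] * after[j]

    def principal_component(self, iterations=100):
        """Approximate the dominant eigenvector by power iteration."""
        v = [1.0 / math.sqrt(self.dim)] * self.dim
        for _ in range(iterations):
            w = [sum(self._m2[i][j] * v[j] for j in range(self.dim))
                 for i in range(self.dim)]
            norm = math.sqrt(sum(wi * wi for wi in w))
            if norm == 0.0:
                break  # no variance seen yet (or start vector is orthogonal)
            v = [wi / norm for wi in w]
        return v

    def residual(self, x):
        """Distance from x to the line spanned by the first component."""
        v = self.principal_component()
        centered = [xi - mi for xi, mi in zip(x, self.mean)]
        score = sum(ci * vi for ci, vi in zip(centered, v))
        return math.sqrt(sum((ci - score * vi) ** 2
                             for ci, vi in zip(centered, v)))


# Two perfectly correlated metrics: the second is always twice the first.
pca = OnlinePCA(dim=2)
for t in range(50):
    pca.update([float(t), 2.0 * t])
component = pca.principal_component()
```

A measurement that respects the learned correlation, such as `[100.0, 200.0]`, has a residual near zero, while one that breaks it, such as `[100.0, 0.0]`, has a large residual; that gap is the "broken correlation" signal the slide describes.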
  • 34. Machine Learning: Summarized in 1 principal component + generation matrix
  • 35. Machine Learning: Second principal component sensitivity: three anomalies
  • 36. Project Status: All functionality developed; Algorithms being validated through tests with RRDs and meetings with the operations team; Performance tests ongoing; System running in the live environment with reduced scope
  • 37. Project at Globo.com – Next challenges: Scale: event "sharding", rule balancing, cache. Optimize the algorithm: adaptive control of memory and sensitivity parameters; insert a supervised layer; other algorithms to cooperate
  • 38. Intelligent Monitoring Final considerations
  • 39. References http://delicious.com/fisl10
  • 40. Questions Contacts Denis A. Vieira Jr denis@corp.globo.com (www.globo.com) Ricardo Clemente ricardo@intelie.com.br (www.intelie.com.br) Globo.com stand This afternoon Raise your hand!