Effective AIOps using
Open Source Software
...building an AIOps solution in a week.
Murali Kaundinya
Keith Kroculick
Suresh Veda Keith Kroculick Sandeep Mehandru
Sarath Parvathaneni Javin Sword Murali Kaundinya
Agenda
§ Overview
§ AIOps Quick Overview
§ Building an Open Source AIOps Platform
§ AIOps Operational Events
§ Legacy Operations Platform Architecture
§ New AIOps Platform “Open” Architecture
§ AIOps Predictive Modeling & Analytics
§ AIOps Data Collection & Visualization
Effective AIOps
What is AIOps?
AIOps Quick Overview
• AIOps is the evolution of Operations and uses Artificial Intelligence (AI) and Machine
Learning (ML) technologies to see and predict the health of a company’s mission critical
systems
• AIOps tools harvest data such as logs, alerts, traces and other data from applications, servers
and infrastructure to build datasets for analytics, event correlation and to find relationships
between events
• AIOps technologies can show the dependency graph across all environments and analyze
that data to extract significant events related to slowdowns, outages, etc.
• Notifications and alerts can be generated to inform IT staff and predict problems, provide root
causes and recommend solutions
Building an Open Source AIOps Platform
Reasons for building a custom solution
• Needed to build a solution rapidly to solve operational issues with our mission critical
systems
• A solution was tailored to solve the company’s specific “Ops” related problems
• Almost all COTS AIOps solutions need to be tailored and modified regardless of vendor
• Same amount of work as customizing a COTS AIOps product
• Components can be modified, extended or replaced with other solutions
• Maintain (own) the source code and allow people to contribute to it
Custom AIOps Solution vs Vendor Solution
AIOps & Open Source Technologies
Technologies used to rapidly build our AIOps Solution
Technologies
• Apache Flink Distributed Stream Processing Framework
• Python Keras Deep Learning Framework
• MySQL Database
• Grafana Observability Platform
• Prometheus Monitoring & Time Series Database (Optional)
AIOps & Operational Events and Data
Operational Events & Data from Applications, Web Servers, Network Devices & Hardware
AIOps platforms ingest “Ops” data from applications, web servers, network devices and
hardware
• Logs
• Traces
• Alerts
• Configuration Management
• Reference & Lookup Details
• Summarized Metrics
Legacy Operations Platform Architecture
Point to Point Operations (Logging & Monitoring) Architecture
[Diagram: Application Landscape. Labels: Application, Application Monitoring, Monitoring Agents (Splunk, IBM Tivoli Netcool, Other Monitoring Agents), Monitoring Systems (Splunk, IBM Netcool), Remedy; flows labeled Logs, Monitoring, Alert, Incident, Ticket]
• Current Eventing State
Problem, Incident & Ticket
• Complex “Point to Point” monitoring & eventing are replicated 1000s of times
• Tickets may take several days to resolve due to limited correlation between systems
• Logs are sent to tools like Splunk with limited logic on what gets ingested
• Continuing growth of applications and monitoring make this difficult to support
• Multiple monitoring and logging agents are installed for an application depending on
application type
New AIOps Platform “Open” Architecture
Building An AIOps Solution using Apache Flink, Python Keras & Grafana
Apache Flink Job
AIOps Event Correlation
Apache Flink and Event Correlation
Example of “Ops” Event(s) Correlation & Logic
Operational Events: Common Data Sources
• App Logs: AppName, LogLevel, LogMessage, Hostname, IPAddress, Timestamp
• Server Logs: ServerLogMessage, LogLevel, HostType, Process, Datacenter, Hostname, IPAddress, Timestamp
• Alerts: AlertErrorMessage, Actions, Hostname, IPAddress, Timestamp
• Configuration Management: ConfigurationItems, Hostname, IPAddress
Combined “Ops” Event Dataset
✓ AppName
✓ AppLogLevel
✓ AppLogMessage
✓ ServerLogMessage
✓ HostType
✓ Hostname
✓ Process
✓ Datacenter
✓ IP Address
✓ Timestamp
✓ AlertErrorMessage
✓ Actions
✓ ConfigurationItems
**Events can be correlated (joined) on common fields like IP Address or Hostname and Timestamps to create a unified
dataset
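The join described in the note above can be sketched in plain Python. This is only an illustration of the correlation logic, not the Flink job itself: the sample values, the `correlate` function and the one-minute `window` are assumptions, while the field names follow the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical sample events; field names follow the slide, values are invented.
app_logs = [
    {"AppName": "billing", "LogLevel": "ERROR", "LogMessage": "DB timeout",
     "Hostname": "web01", "IPAddress": "10.0.0.5",
     "Timestamp": datetime(2021, 5, 1, 12, 0, 5)},
]
alerts = [
    {"AlertErrorMessage": "High CPU", "Actions": "page on-call",
     "Hostname": "web01", "IPAddress": "10.0.0.5",
     "Timestamp": datetime(2021, 5, 1, 12, 0, 40)},
    {"AlertErrorMessage": "Disk full", "Actions": "open ticket",
     "Hostname": "db01", "IPAddress": "10.0.0.9",
     "Timestamp": datetime(2021, 5, 1, 12, 0, 10)},
]
config_items = [
    {"ConfigurationItems": "billing-web-tier",
     "Hostname": "web01", "IPAddress": "10.0.0.5"},
]


def correlate(app_logs, alerts, config_items, window=timedelta(minutes=1)):
    """Inner-join logs and alerts on (Hostname, IPAddress) within a time window."""
    alerts_by_host = defaultdict(list)
    for alert in alerts:
        alerts_by_host[(alert["Hostname"], alert["IPAddress"])].append(alert)
    config_by_host = {(c["Hostname"], c["IPAddress"]): c for c in config_items}

    combined = []
    for log in app_logs:
        key = (log["Hostname"], log["IPAddress"])
        for alert in alerts_by_host.get(key, []):
            # Keep only alerts raised close in time to the log line.
            if abs(alert["Timestamp"] - log["Timestamp"]) <= window:
                row = dict(log)
                row["AlertErrorMessage"] = alert["AlertErrorMessage"]
                row["Actions"] = alert["Actions"]
                # Configuration data has no timestamp, so it joins on host only.
                row["ConfigurationItems"] = config_by_host.get(key, {}).get(
                    "ConfigurationItems")
                combined.append(row)
    return combined


combined = correlate(app_logs, alerts, config_items)
```

Because this is an inner join, log lines with no matching alert are dropped; a real pipeline might prefer to keep them with empty alert fields.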
AIOps Predictive Modeling & Analytics
Correlation & Predictive Analytics
• Events from applications, servers & infrastructure are combined to create a “view” of the
health of the platform at any point in time (i.e. graph of events)
• Our “event” sources (application & server logs, alerts, traces) are combined and transformed
into individual datasets that can be used in ML models (e.g. training, validation, test)
• We use Apache Flink stateful functions with PyFlink and the Python Keras TensorFlow
libraries to apply ML models to process our “Ops” datasets
• We start with a simple Linear Regression approach to modeling the relationships between
applications, middleware, servers based on the events we ingest
• Linear Regression is a simple starting point for predicting behaviors
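As a rough illustration of the linear regression starting point, the sketch below fits a straight line by ordinary least squares in plain Python. It is a stand-in for the Keras model mentioned above, not the deck's implementation; the feature and target (error log lines versus alerts per minute) and all values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict y for a new x using the fitted line."""
    return slope * x + intercept


# Hypothetical data: error log lines per minute vs. alerts raised per minute.
error_logs_per_min = [0, 2, 4, 6, 8]
alerts_per_min = [1, 2, 3, 4, 5]

slope, intercept = fit_line(error_logs_per_min, alerts_per_min)
expected_alerts = predict(slope, intercept, 10)
```

A single dense layer with one output unit and a mean squared error loss would learn the same kind of relationship in Keras; the closed-form fit is used here only to keep the sketch dependency-free.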
Apache Flink with AIOps & Machine Learning
AIOps Predictive Modeling and Analytics
Apache Flink & Machine Learning
Using Python Keras & Linear Regression Logic in our Apache Flink Application
Using Supervised Learning Models for ML
[Diagram: Apache Flink Job with a Python AI ML Step. Boxes: Combined “Ops” Event Dataset, Ops Events Training Dataset, Ops Events Validation Dataset, Ops Events Test Dataset, Ops Events Final Dataset]
1. Our combined dataset is used to create our training dataset, validation dataset and test dataset
2. Compare the training dataset to the validation set. The validation set contains values with expected ranges
3. Tune training dataset models based on previous results
4. Compare final dataset results to sample test dataset
5. Repeat the process as necessary to harden your models
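The split of the combined dataset into training, validation and test datasets on this slide could look like the following minimal sketch. The function name `split_dataset`, the 70/15/15 proportions and the fixed seed are assumptions for illustration; the deck does not state the ratios it used.

```python
import random


def split_dataset(rows, train_pct=70, validation_pct=15, seed=42):
    """Shuffle rows reproducibly and split into train/validation/test lists.

    Percentages are integers; whatever remains after the training and
    validation shares becomes the test set.
    """
    if train_pct <= 0 or validation_pct < 0 or train_pct + validation_pct >= 100:
        raise ValueError("percentages must leave a non-empty share for the test set")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * train_pct // 100
    n_val = len(shuffled) * validation_pct // 100
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


# Hypothetical combined "Ops" events, reduced to an id for brevity.
events = [{"event_id": i} for i in range(100)]
train_set, validation_set, test_set = split_dataset(events)
```

For time-ordered operational events, a chronological split (oldest events for training, newest for testing) may be preferable to a random shuffle, since it avoids training on events that occur after the ones being predicted.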
AIOps Data Collection and Visualization
Grafana & MySQL
• Grafana provides dashboarding, visualization and alerting capabilities for our AIOps platform
• Data produced by our Apache Flink jobs, i.e. summarized application & server “Ops” event
details as well as our Python Keras ML outputs, was loaded into a simple MySQL database
• Aggregated metrics were exposed to Grafana using common SQL VIEWs and SQL idioms, e.g.
GROUP BY, COUNT(*), MAX, AVG, with Grafana’s MySQL connector
• The database used a very simple schema for our “Ops” event data to allow us to create
queries to show relationships between applications, app components, servers, etc. based on
the details we ingested from our Flink job
• Grafana allowed us to quickly use various visualizations to display our data using histograms,
pie charts, scatter plots
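The kind of aggregated SQL VIEW described above can be sketched as follows. To keep the example self-contained it uses Python's built-in `sqlite3` module as a stand-in for MySQL; the `ops_events` table, its columns, the `app_health` view and the sample rows are all hypothetical, not the deck's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ops_events (
    app_name   TEXT,
    hostname   TEXT,
    log_level  TEXT,
    latency_ms REAL
);

-- Aggregated view that a dashboard panel could query directly.
CREATE VIEW app_health AS
SELECT app_name,
       hostname,
       COUNT(*)        AS event_count,
       MAX(latency_ms) AS max_latency_ms,
       AVG(latency_ms) AS avg_latency_ms
FROM ops_events
WHERE log_level = 'ERROR'
GROUP BY app_name, hostname;
""")

conn.executemany(
    "INSERT INTO ops_events VALUES (?, ?, ?, ?)",
    [
        ("billing", "web01", "ERROR", 120.0),
        ("billing", "web01", "ERROR", 80.0),
        ("billing", "web01", "INFO", 10.0),
        ("search", "web02", "ERROR", 300.0),
    ],
)

rows = conn.execute("SELECT * FROM app_health ORDER BY app_name").fetchall()
```

Keeping the aggregation in a view means the dashboard query stays a simple `SELECT`, and the grouping logic can change without touching the dashboard.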
AIOps Dashboards & Reporting
AIOps Data Collection and Visualization
Final AIOps Dashboard with Grafana
AIOps Data Collection and Visualization
Grafana with Prometheus Time Series Database
• MySQL was used to house our data for our AIOps solution due to its ease of use and
availability
• A time series database, e.g. Prometheus, is a better fit for reporting on operational events
• Aggregated metrics were exposed to Grafana using common SQL VIEWs and SQL idioms, e.g.
GROUP BY, COUNT(*), MAX, AVG, with Grafana’s MySQL connector
• Apache Flink can expose metrics to Prometheus via its Prometheus metrics reporter
• Prometheus also supports a native MySQL exporter which can use our existing MySQL
database and not require us to change our application logic
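To give a feel for what a time series view of these events looks like, the sketch below renders an aggregated count as a gauge in the Prometheus text exposition format. It is an illustration only: the metric name `aiops_error_events`, the labels and the helper functions are hypothetical, and serving the text on an HTTP endpoint for Prometheus to scrape is not shown.

```python
def escape_label_value(value):
    """Escape backslash, double quote and newline as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name, help_text, samples):
    """Render [(labels_dict, value), ...] as Prometheus text exposition lines."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        # Sort labels so the output is stable between runs.
        label_str = ",".join(
            f'{key}="{escape_label_value(str(val))}"'
            for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


text = render_gauge(
    "aiops_error_events",
    "Number of ERROR events per application and host (hypothetical metric).",
    [({"app": "billing", "host": "web01"}, 2)],
)
```

Each labeled sample becomes its own time series once scraped, which is what makes per-application and per-host trends straightforward to chart.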
In closing….
AIOps Overview
• Legacy Operations vs. a new AIOps Solution
• Building our Solution with Open Source Tools, Frameworks and Libraries
• Using Event Correlation & Machine Learning
• Building Dashboards and Reporting
Thanks!
• Thanks to the Databricks team for providing us the opportunity to speak about AIOps!
Discussion Summary & Thanks
Comments/Questions:
Murali Kaundinya Murali.Kaundinya@gmail.com
Keith Kroculick kkroculick@gmail.com
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.

More Related Content

What's hot

Cloud-Native Observability
Cloud-Native ObservabilityCloud-Native Observability
Cloud-Native Observability
Tyler Treat
 
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
Amazon Web Services Korea
 
Introduction to Azure monitor
Introduction to Azure monitorIntroduction to Azure monitor
Introduction to Azure monitor
Praveen Nair
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
GetInData
 
Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost Management
Amazon Web Services
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
Snowflake Computing
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
Ana Medina
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications
MoovingON
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Soluciones Dynatrace
Soluciones DynatraceSoluciones Dynatrace
Soluciones Dynatrace
Innovation Strategies
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
Amazon Web Services
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
Elasticsearch
 
Observability
ObservabilityObservability
Observability
Ebru Cucen Çüçen
 
Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)
Szabolcs Zajdó
 
FinOps introduction
FinOps introductionFinOps introduction
FinOps introduction
Alexander Tokarev
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
Brett VanderPlaats
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
confluent
 
Migrating to the Cloud
Migrating to the CloudMigrating to the Cloud
Migrating to the Cloud
Amazon Web Services
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
gjuljo
 
Using AWS Control Tower to govern multi-account AWS environments at scale - G...
Using AWS Control Tower to govern multi-account AWS environments at scale - G...Using AWS Control Tower to govern multi-account AWS environments at scale - G...
Using AWS Control Tower to govern multi-account AWS environments at scale - G...
Amazon Web Services
 

What's hot (20)

Cloud-Native Observability
Cloud-Native ObservabilityCloud-Native Observability
Cloud-Native Observability
 
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
데이터 분석플랫폼을 위한 데이터 전처리부터 시각화까지 한번에 보기 - 노인철 AWS 솔루션즈 아키텍트 :: AWS Summit Seoul ...
 
Introduction to Azure monitor
Introduction to Azure monitorIntroduction to Azure monitor
Introduction to Azure monitor
 
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
Best Practices for ETL with Apache NiFi on Kubernetes - Albert Lewandowski, G...
 
Introduction to AWS Cost Management
Introduction to AWS Cost ManagementIntroduction to AWS Cost Management
Introduction to AWS Cost Management
 
Introducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data WarehouseIntroducing the Snowflake Computing Cloud Data Warehouse
Introducing the Snowflake Computing Cloud Data Warehouse
 
Introduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft AzureIntroduction to Chaos Engineering with Microsoft Azure
Introduction to Chaos Engineering with Microsoft Azure
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Soluciones Dynatrace
Soluciones DynatraceSoluciones Dynatrace
Soluciones Dynatrace
 
Building-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWSBuilding-a-Data-Lake-on-AWS
Building-a-Data-Lake-on-AWS
 
Combining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observabilityCombining logs, metrics, and traces for unified observability
Combining logs, metrics, and traces for unified observability
 
Observability
ObservabilityObservability
Observability
 
Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)Cloud cost optimization (AWS, GCP)
Cloud cost optimization (AWS, GCP)
 
FinOps introduction
FinOps introductionFinOps introduction
FinOps introduction
 
Zero to Snowflake Presentation
Zero to Snowflake Presentation Zero to Snowflake Presentation
Zero to Snowflake Presentation
 
Data In Motion Paris 2023
Data In Motion Paris 2023Data In Motion Paris 2023
Data In Motion Paris 2023
 
Migrating to the Cloud
Migrating to the CloudMigrating to the Cloud
Migrating to the Cloud
 
Azure Monitoring Overview
Azure Monitoring OverviewAzure Monitoring Overview
Azure Monitoring Overview
 
Using AWS Control Tower to govern multi-account AWS environments at scale - G...
Using AWS Control Tower to govern multi-account AWS environments at scale - G...Using AWS Control Tower to govern multi-account AWS environments at scale - G...
Using AWS Control Tower to govern multi-account AWS environments at scale - G...
 

Similar to Effective AIOps with Open Source Software in a Week

Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
Karl Ots
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
DataWorks Summit
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Value Association
 
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
VMware Tanzu
 
DevOps in the Cloud with Microsoft Azure
DevOps in the Cloud with Microsoft AzureDevOps in the Cloud with Microsoft Azure
DevOps in the Cloud with Microsoft Azure
gjuljo
 
Avanttic tech dates - de la monitorización a la 'observabilidad'
Avanttic tech dates - de la monitorización a la 'observabilidad'Avanttic tech dates - de la monitorización a la 'observabilidad'
Avanttic tech dates - de la monitorización a la 'observabilidad'
avanttic Consultoría Tecnológica
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
confluent
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
Splunk
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
Michael Stephenson
 
Modernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoringModernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoring
ManageEngine, Zoho Corporation
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
Splunk
 
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Adin Ermie
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Dataconomy Media
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the Cloud
Mark Kromer
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
Nisha Talagala
 
ADDO Open Source Observability Tools
ADDO Open Source Observability Tools ADDO Open Source Observability Tools
ADDO Open Source Observability Tools
Mickey Boxell
 
Getting Started with Innoslate
Getting Started with InnoslateGetting Started with Innoslate
Getting Started with Innoslate
Elizabeth Steiner
 
The State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWSThe State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWS
Trevor Parsons
 
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Lucas Jellema
 

Similar to Effective AIOps with Open Source Software in a Week (20)

Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
Monitoring advanced Azure PaaS workloads in the enterprise - Level: 200
 
Visualizing Big Data in Realtime
Visualizing Big Data in RealtimeVisualizing Big Data in Realtime
Visualizing Big Data in Realtime
 
Big Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICSBig Data Analytics Platforms by KTH and RISE SICS
Big Data Analytics Platforms by KTH and RISE SICS
 
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
Real-time Analysis of Data Processing Pipelines with Spring Cloud Data Flow a...
 
DevOps in the Cloud with Microsoft Azure
DevOps in the Cloud with Microsoft AzureDevOps in the Cloud with Microsoft Azure
DevOps in the Cloud with Microsoft Azure
 
Avanttic tech dates - de la monitorización a la 'observabilidad'
Avanttic tech dates - de la monitorización a la 'observabilidad'Avanttic tech dates - de la monitorización a la 'observabilidad'
Avanttic tech dates - de la monitorización a la 'observabilidad'
 
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
Kafka: Journey from Just Another Software to Being a Critical Part of PayPal ...
 
SplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding OverviewSplunkLive! Munich 2018: Data Onboarding Overview
SplunkLive! Munich 2018: Data Onboarding Overview
 
LeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration ServicesLeedsSharp May 2023 - Azure Integration Services
LeedsSharp May 2023 - Azure Integration Services
 
Modernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoringModernizing Cloud and Hyperconverged Infrastructure monitoring
Modernizing Cloud and Hyperconverged Infrastructure monitoring
 
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding OverviewSplunkLive! Frankfurt 2018 - Data Onboarding Overview
SplunkLive! Frankfurt 2018 - Data Onboarding Overview
 
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
Global Azure Bootcamp 2017 - Performance and Health Management for Modern App...
 
Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex Big Data Berlin v8.0 Stream Processing with Apache Apex
Big Data Berlin v8.0 Stream Processing with Apache Apex
 
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
Thomas Weise, Apache Apex PMC Member and Architect/Co-Founder, DataTorrent - ...
 
Microsoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the CloudMicrosoft Azure BI Solutions in the Cloud
Microsoft Azure BI Solutions in the Cloud
 
Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017Strata parallel m-ml-ops_sept_2017
Strata parallel m-ml-ops_sept_2017
 
ADDO Open Source Observability Tools
ADDO Open Source Observability Tools ADDO Open Source Observability Tools
ADDO Open Source Observability Tools
 
Getting Started with Innoslate
Getting Started with InnoslateGetting Started with Innoslate
Getting Started with Innoslate
 
The State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWSThe State of Log Management & Analytics for AWS
The State of Log Management & Analytics for AWS
 
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
Live Application and Infrastructure Monitoring and Root Cause Log Analysis wi...
 

More from Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Databricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
Databricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
Databricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
Databricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
Databricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
Databricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Databricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
Databricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Databricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
Databricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Databricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
Databricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
Databricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
Databricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
Databricks
 

More from Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
Effective AIOps with Open Source Software in a Week

  • 1. Effective AIOps using Open Source Software ...building an AIOps solution in a week. Murali Kaundinya Keith Kroculick
  • 2. Suresh Veda Keith Kroculick Sandeep Mehandru Sarath Parvathaneni Javin Sword Murali Kaundinya
  • 3. Agenda § Overview § AI Ops Quick Overview § Building an Open Source AIOps Platform § AIOps Operational Events § Legacy Operations Platform Architecture § New AIOps Platform “Open” Architecture § AIOps Predictive Modeling & Analytics § AIOps Data Collection & Visualization
  • 4. Effective AIOps What is AIOps? AIOps Quick Overview • AIOps is the evolution of Operations and uses Artificial Intelligence (AI) and Machine Learning (ML) technologies to see and predict the health of a company’s mission-critical systems • AIOps tools harvest logs, alerts, traces and other data from applications, servers and infrastructure to build datasets for analytics and event correlation, and to find relationships between events • AIOps technologies can show the dependency graph across all environments and analyze that data to extract significant events related to slowdowns, outages, etc. • Notifications and alerts can be generated to inform IT staff, predict problems, provide root causes and recommend solutions
  • 5. Building an Open Source AIOps Platform Reasons for building a custom solution • We needed to build a solution rapidly to solve operational issues with our mission-critical systems • The solution was tailored to solve the company’s specific “Ops” related problems • Almost all COTS AIOps solutions need to be tailored and modified regardless of vendor • Building our own took the same amount of work as customizing a COTS AIOps product • Components can be modified, extended or replaced with other solutions • We maintain (own) the source code and allow people to contribute to it Custom AIOps Solution vs Vendor Solution
  • 6. AIOps & Open Source Technologies Technologies used to rapidly build our AIOps Solution Technologies • Apache Flink Distributed Stream Processing Framework • Python Keras Deep Learning Framework • MySQL Database • Grafana Observability Platform • Prometheus Monitoring & Time Series Database (Optional)
  • 7. AIOps & Operational Events and Data Operational Events & Data from Applications, Web Servers, Network Devices & Hardware AIOps platforms ingest “Ops” data from applications, web servers, network devices and hardware • Logs • Traces • Alerts • Configuration Management • Reference & Lookup Details • Summarized Metrics
  • 8. Legacy Operations Platform Architecture Point to Point Operations (Logging & Monitoring) Architecture [Diagram labels: Application Landscape, Application Monitoring, Monitoring Agents, Other Monitoring Agents, Splunk, IBM Tivoli Netcool, Monitoring Systems, Remedy, Logs, Alert, Incident, Ticket] Current Eventing State (Problem, Incident & Ticket) • Complex “Point to Point” monitoring & eventing are replicated 1000s of times • Tickets may take several days to resolve due to limited correlation between systems • Logs are sent to tools like Splunk with limited logic on what gets ingested • Continuing growth of applications and monitoring makes this difficult to support • Multiple monitoring and logging agents are installed for an application depending on application type
  • 9. New AIOps Platform “Open” Architecture Building An AIOps Solution using Apache Flink, Python Keras & Grafana
  • 10. Apache Flink Job AIOps Event Correlation Apache Flink and Event Correlation Example of “Ops” Event(s) Correlation & Logic Operational Events (Common Data Sources) • App Logs: AppName, LogLevel, LogMessage, Hostname, IPAddress, Timestamp • Server Logs: ServerLogMessage, LogLevel, HostType, Process, Datacenter, Hostname, IPAddress, Timestamp • Alerts: AlertErrorMessage, Actions, Hostname, IPAddress, Timestamp • Configuration Management: ConfigurationItems, Hostname, IPAddress Combined “Ops” Event Dataset • AppName • AppLogLevel • AppLogMessage • ServerLogMessage • HostType • Hostname • Process • Datacenter • IP Address • Timestamp • AlertErrorMessage • Actions • ConfigurationItems **Events can be correlated (joined) on common fields like IP Address or Hostname and Timestamps to create a unified dataset
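
The join described on slide 10 can be sketched in plain Python. This is only an illustration of the correlation logic, not the presenters' Flink job: the `correlate` function, the five-minute window and the sample records are all assumptions, while the field names are taken from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def correlate(app_logs, alerts, config_items, window=timedelta(minutes=5)):
    """Join app logs to alerts on (Hostname, IPAddress) within a time window.

    Configuration items carry no timestamp on the slide, so they are attached
    by host key only. The window size is a hypothetical choice.
    """
    alerts_by_host = defaultdict(list)
    for alert in alerts:
        alerts_by_host[(alert["Hostname"], alert["IPAddress"])].append(alert)

    config_by_host = {
        (item["Hostname"], item["IPAddress"]): item["ConfigurationItems"]
        for item in config_items
    }

    combined = []
    for log in app_logs:
        key = (log["Hostname"], log["IPAddress"])
        for alert in alerts_by_host.get(key, []):
            # Keep only alerts raised close in time to the log line.
            if abs(log["Timestamp"] - alert["Timestamp"]) <= window:
                combined.append({
                    "AppName": log["AppName"],
                    "AppLogLevel": log["LogLevel"],
                    "AppLogMessage": log["LogMessage"],
                    "AlertErrorMessage": alert["AlertErrorMessage"],
                    "Actions": alert["Actions"],
                    "ConfigurationItems": config_by_host.get(key),
                    "Hostname": key[0],
                    "IPAddress": key[1],
                    "Timestamp": log["Timestamp"],
                })
    return combined


# Made-up sample events for illustration only.
t0 = datetime(2021, 5, 1, 12, 0, 0)
app_logs = [
    {"AppName": "billing", "LogLevel": "ERROR", "LogMessage": "db timeout",
     "Hostname": "web01", "IPAddress": "10.0.0.1", "Timestamp": t0},
    {"AppName": "search", "LogLevel": "WARN", "LogMessage": "slow query",
     "Hostname": "web02", "IPAddress": "10.0.0.2", "Timestamp": t0},
]
alerts = [
    {"AlertErrorMessage": "high latency", "Actions": "page on-call",
     "Hostname": "web01", "IPAddress": "10.0.0.1",
     "Timestamp": t0 + timedelta(minutes=2)},
    {"AlertErrorMessage": "disk full", "Actions": "open ticket",
     "Hostname": "web02", "IPAddress": "10.0.0.2",
     "Timestamp": t0 + timedelta(hours=1)},
]
config_items = [
    {"ConfigurationItems": "billing-db-pool", "Hostname": "web01",
     "IPAddress": "10.0.0.1"},
]

combined = correlate(app_logs, alerts, config_items)
```

Only the `web01` log and alert fall inside the window, so `combined` holds a single unified record. In a streaming job the same idea would typically be expressed as a keyed, time-bounded join rather than an in-memory loop.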
  • 11. AIOps Predictive Modeling & Analytics Correlation & Predictive Analytics • Events from applications, servers & infrastructure are combined to create a “view” of the health of the platform at any point in time (i.e. a graph of events) • Our “event” sources (application & server logs, alerts, traces) are combined and transformed into individual datasets that can be used in ML models (e.g. training, validation, test) • We use Apache Flink stateful functions with PyFlink and the Python Keras TensorFlow libraries to apply ML models to process our “Ops” datasets • We start with a simple Linear Regression approach to modeling the relationships between applications, middleware and servers based on the events we ingest • Linear Regression is a simple starting point for predicting behaviors Apache Flink with AIOps & Machine Learning
  • 12. AIOps Predictive Modeling and Analytics Apache Flink & Machine Learning Using Python Keras & Linear Regression Logic in our Apache Flink Application Using Supervised Learning Models for ML 1. Our combined dataset is used to create our training dataset, validation dataset and test dataset 2. Compare the training dataset to the validation set. The validation set contains values with expected ranges 3. Tune training dataset models based on previous results 4. Compare final dataset results to the sample test dataset 5. Repeat the process as necessary to harden your models [Diagram labels: Combined “Ops” Event Dataset, Apache Flink Job, Python AI ML Step, Ops Events Training Dataset, Ops Events Validation Dataset, Test Dataset, Ops Events Final Dataset]
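
A minimal sketch of the split-fit-validate loop on slide 12 follows. It swaps Keras for ordinary least squares in plain Python so that it runs without dependencies, and it is not the presenters' model. The feature pair (error count versus latency in milliseconds), the 60/20/20 split and the synthetic data are all assumptions made for illustration.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=0):
    """Shuffle rows and split them into training, validation and test sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def fit_line(points):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(points, slope, intercept):
    """Average squared gap between predicted and observed values."""
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


# Synthetic "Ops" rows: (error_count, latency_ms). Hypothetical features.
rows = [(x, 2 * x + 10) for x in range(20)]

# Step 1: create the training, validation and test datasets.
train_set, validation_set, test_set = split_dataset(rows)

# Steps 2-3: fit on training data, check against validation, tune if needed.
slope, intercept = fit_line(train_set)
validation_error = mean_squared_error(validation_set, slope, intercept)

# Step 4: compare the final model against the held-out test dataset.
test_error = mean_squared_error(test_set, slope, intercept)
```

Because the synthetic data is exactly linear, both error values come out at essentially zero. With real event data the validation error is what would drive the tuning in step 3, and the test set would be touched only at the end.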
  • 13. AIOps Data Collection and Visualization Grafana & MySQL • Grafana provides dashboarding, visualization and alerting capabilities for our AIOps platform • Data produced by our Apache Flink jobs (summarized application & server “Ops” event details) as well as our Python Keras ML outputs were loaded into a simple MySQL database • Aggregated metrics were exposed to Grafana using common SQL VIEWs and SQL idioms (e.g. GROUP BY, COUNT(*), MAX, AVG) with Grafana’s MySQL connector • The database used a very simple schema for our “Ops” event data so that we could write queries showing relationships between applications, app components, servers, etc. based on the details we ingested from our Flink job • Grafana allowed us to quickly display our data using various visualizations such as histograms, pie charts and scatter plots AIOps Dashboards & Reporting
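
The SQL VIEW approach on slide 13 might look like the sketch below. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs anywhere. The `ops_events` table, the `app_health` view and their columns are hypothetical and do not come from the presenters' schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ops_events (
        app_name    TEXT,
        hostname    TEXT,
        log_level   TEXT,
        response_ms REAL
    );

    -- Aggregated view of the kind a dashboard panel could query directly.
    CREATE VIEW app_health AS
    SELECT app_name,
           COUNT(*)         AS event_count,
           MAX(response_ms) AS max_response_ms,
           AVG(response_ms) AS avg_response_ms
    FROM ops_events
    GROUP BY app_name;
""")

# Made-up rows representing summarized events written by a stream job.
conn.executemany(
    "INSERT INTO ops_events VALUES (?, ?, ?, ?)",
    [
        ("billing", "web01", "ERROR", 120.0),
        ("billing", "web02", "WARN", 80.0),
        ("search", "web03", "ERROR", 200.0),
    ],
)

rows = conn.execute("SELECT * FROM app_health ORDER BY app_name").fetchall()
for row in rows:
    print(row)
```

Keeping the aggregation in a view means the dashboard query stays a simple `SELECT`, and the grouping logic can change without touching any panel definitions.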
  • 14. AIOps Data Collection and Visualization Final AIOps Dashboard with Grafana
  • 15. AIOps Data Collection and Visualization Grafana with Prometheus Time Series Database • MySQL was used to house the data for our AIOps solution due to its ease of use and availability • A time series database such as Prometheus is a better fit for reporting on operational events • Aggregated metrics were exposed to Grafana using common SQL VIEWs and SQL idioms (e.g. GROUP BY, COUNT(*), MAX, AVG) with Grafana’s MySQL connector • Apache Flink can write directly to Prometheus via the Prometheus Reporter library • Prometheus also supports a native MySQL exporter which can use our existing MySQL database and not require us to change our application logic
  • 16. In closing… AIOps Overview • Legacy Operations vs. a new AIOps Solution • Building our Solution with Open Source Tools, Frameworks and Libraries • Using Event Correlation & Machine Learning • Building Dashboards and Reporting Thanks! • Thanks to the Databricks team for providing us the opportunity to speak about AIOps! Discussion Summary & Thanks Comments/Questions: Murali Kaundinya Murali.Kaundinya@gmail.com Keith Kroculick kkroculick@gmail.com
  • 17. Feedback Your feedback is important to us. Don’t forget to rate and review the sessions.