ELK
Not just for InfoSec any more
Who are we?
Russel Havens
● Monitoring Architect for
Adobe Digital Marketing
Business Unit
● Adjunct Professor, BYU
● 25 years in IT Operations, 2
years consulting, 5 years in
software development
Hayden Panike
● Former log analytics
undergraduate researcher
● Storage Engineer
Tanner Lund
● Former log analytics
undergraduate researcher
● Service Engineer
ELK
● As a developer, I write logs
● As an operations engineer, I live in the log
files while troubleshooting
● However: in most organizations where I’ve
worked, formal log aggregation and analysis
is owned by InfoSec, and access to those
tools (Splunk or Syslog) was limited
ELK background and approach
● Monitoring is a broad area
○ Up/Down
○ Historical trending
○ Log Analysis
● Using ELK for managing logs of 60+ Nagios
servers, web servers, etc.
● Worked with a BYU Capstone team, 2013-2014
ELK approach
Hayden & Tanner
- Took over log management project from
previous team
- Expanded partnership with BYU OIT to
gather logs from servers, SANs, and network
equipment (including Wi-Fi hotspots)
- Expanded cluster size massively
Our ELK History
We do what we must...
Partnership with BYU OIT
Production ELK
Deployment
Every 60 seconds at
BYU we ingest event
logs:*
-2,431 network
-430,000 IDS
-59,000 wireless
62,300,000 total a
day*
Simplest Architecture
Common Architecture
Highly Available Enterprise
Architecture
This process ought to be formalized
Proactive monitoring
Logs and Enterprise Monitoring
Monitoring as a Discipline
SOURCES OF DATA
-SNMP -stdout
-/proc *stat
-Logs
SYSTEM PIECES
-Collector agents
-Aggregator nodes
-Analysis platform
PRINCIPLES
-Aggregation
-Cause Analysis
(reactive)
-Behavior Analysis
(proactive)
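One way to picture the aggregator step is as turning raw syslog text into structured events that an analysis platform can search and count. The sketch below is only an illustration under stated assumptions: it assumes the classic BSD-style syslog line layout, the regular expression and the `parse_syslog_line` helper are hypothetical, and the sample line is made up. It is not the parsing configuration used in the deployment described here.

```python
import re

# Assumed layout (classic BSD syslog): "Mon DD HH:MM:SS host program[pid]: message"
SYSLOG_LINE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog_line(line):
    """Return the line's fields as a dict, or None if the line does not match."""
    match = SYSLOG_LINE.match(line)
    return match.groupdict() if match else None


# Made-up sample line, for illustration only.
sample = "Mar  3 14:07:55 web01 nagios[2210]: SERVICE NOTIFICATION: example text"
event = parse_syslog_line(sample)
print(event["host"], event["program"], event["message"])
```

Once each line is a dictionary of named fields, both reactive digging and proactive counting become simple operations on those fields.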
Queries: Simple Yet Elegant
● As operations
engineers, we
already know lots of
keywords. These can
be easily leveraged
for simple queries.
● error
● failure
● root
● port flapping
● memory
Simple Yet Elegant
● A simple search for the word “memory” over a 30-day
period.
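A search like this can be written as an Elasticsearch query body. The following is a minimal sketch, not the presenters' actual query: it assumes the Logstash default `@timestamp` field and an Elasticsearch version that accepts a `bool` query with a `filter` clause, and the helper name `keyword_search_body` is hypothetical.

```python
import json


def keyword_search_body(keyword, days=30, timestamp_field="@timestamp"):
    """Build a query body: full-text match on `keyword`, limited to the last `days` days."""
    return {
        "query": {
            "bool": {
                # Free-text match against the indexed log fields.
                "must": [{"query_string": {"query": keyword}}],
                # Date-math range: from `days` days ago until now.
                "filter": [{"range": {timestamp_field: {"gte": f"now-{days}d"}}}],
            }
        }
    }


body = keyword_search_body("memory")
print(json.dumps(body, indent=2))
```

The same helper covers the other keywords on the previous slide by swapping the search term, for example `keyword_search_body("failure", days=7)`.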
Simple search
● Nagios 4 can be set for a maximum number of concurrent processes. Here we
see 2 overloaded monitoring servers.
Simple Yet Elegant
Simple Yet Elegant
Simple Bin and Count
Simple Bin and Count
Simple Bin and Count
● Campus-wide phone OS stats
Simple Bin and Count
Business School
Administrative Building
Simple Bin and Count
Simple Bin and Count
Simple Bin and Count
Simple Bin and Count
-We are at the tip of a
transformative iceberg
-Machine Learning and
Statisticians needed
A Call to Arms!

Elk for Sysadmins

Editor's Notes

  1. Presentation ideas
     - Who are we
     - What is ELK (everybody there knows this, but this seems obligatory)
     - Background of our approach to the topic, including a short history of what led to this work
     - SysAdmins use logs for detailed troubleshooting, but usually, formalized log collection and analysis is owned by the InfoSec team. SysAdmins should collect and formally analyze logs as well! (This topic will be large-ish.)
     - RH - Capstone team 2013-2014
     - HP/TL project this year
     - 1-4: 3 mins
     - OIT partnership: give scope of log collection - number & types of logs, teams utilizing the system, etc.
     - Benefits of Simple Search <lots of examples>
     - Simple Bin and Count metrics <some more examples>
     - 5-7: 7 mins
     - Call to use more advanced statistical techniques <one example?>
     - Apply machine learning.
     - Wrap this all in a process
     - 8-9: remaining time
     - Ideas: Lire, Lots of Kibana Screenshots, Dashboards, Wiki/Knowledge Base, Statistical Crunching of Data
  2. This slide should tailor to Prof. Havens and his background with ELK.
  3. You can start with something as simple as this. ...okay, maybe not quite THIS simple, but simple. One laptop running the whole system.
  4. This is what you are more likely to deploy as you start implementing ELK.
  5. And this is what a big, highly available enterprise architecture might look like. You can be as simple or minimalist or as professional/robust as you choose.
  6. Let’s address monitoring as a discipline.
     SOURCES: You have your common SNMP data, which is the core of most modern monitoring. There are also more anecdotal or specific sources such as stdout and /proc. Lastly there are logs, which in fact already include some of the information in these other sources.
     - ELK allows us to look at log data AT SCALE, like we can do for SNMP.
     SYSTEM PIECES: You must collect data, put it all in one place, and then use some sort of tool to make sense of it.
     - Maybe it sends out alerts on certain thresholds. Maybe it has a nice GUI. Maybe it just lets you poke around the data.
     The nice thing about ELK is PASSIVE COLLECTION.
     - You don’t need to install agents in many cases (though you can); rsyslog will do the trick.
     - No requests required; logs are collected constantly in the background.
     PRINCIPLES: You must gather your data, and then analyze it.
     - When something goes wrong, we dig into whatever sources of data we have to find the root cause. This is reactive.
     - ELK encourages and enables proactive analysis, since we have a wealth of data at all times. At any given moment, we can open up ELK and look for existing problems, as well as the EARLY WARNING SIGNS of what might be future problems, helping catch issues before they get out of hand or trigger alerts.
  7. These two monitoring servers are in a data center which has been growing faster than expected. Seeing that they were heavily loaded, we were able to order hardware in to deal with that growth.
  8. This is a simple count of Nagios NOTIFICATION log entries by nagios_status (OK, WARNING, CRITICAL, UNKNOWN). Digging into this spike (not shown for company security reasons), we found that almost 75% were from a London data center, one of our smaller ones. Filtering on that data center (with one click) and slicing by alert service, we found that 72% of the alerts in the time frame were SSL certificate related. It turned out that a name server had failed, and its redundant backup was responding too slowly.
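The count, filter, and slice steps described in that last note can be sketched in a few lines of Python. This is only an illustration of the bin-and-count idea, not the Kibana workflow itself: the events are hand-made, the `datacenter` and `service` field names are assumptions (only `nagios_status` comes from the note), and `bin_and_count` and `shares` are hypothetical helpers.

```python
from collections import Counter

# Hand-made sample events for illustration; not real monitoring data.
events = [
    {"nagios_status": "CRITICAL", "datacenter": "london", "service": "ssl_cert"},
    {"nagios_status": "CRITICAL", "datacenter": "london", "service": "ssl_cert"},
    {"nagios_status": "WARNING", "datacenter": "london", "service": "disk"},
    {"nagios_status": "OK", "datacenter": "other", "service": "ping"},
]


def bin_and_count(records, field):
    """Count how many records fall into each distinct value of `field`."""
    return Counter(record[field] for record in records)


def shares(counts):
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


# Step 1: count notifications by status.
by_status = bin_and_count(events, "nagios_status")

# Step 2: filter on one data center, then slice by alerting service.
london = [e for e in events if e["datacenter"] == "london"]
by_service = bin_and_count(london, "service")

print(by_status.most_common())
print(shares(by_service))
```

In Elasticsearch the same binning is typically done server-side with an aggregation, so the client only receives the counts rather than every event.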