IT Status Radiators Communicate System Health

The document discusses using information radiators to communicate IT system status at the University of Manitoba. It describes setting up a public dashboard to provide executives and managers with easy visibility into the current state of IT systems without requiring logins. The proof of concept dashboard displays real-time monitoring information on service availability and performance, team metrics, upcoming changes, and other IT-related data through widgets on a mobile-friendly website. It aims to roll up detailed technical monitoring into a more consumable format for non-technical audiences.

Using Information Radiators to Communicate IT Status
William Moore, Senior Systems Analyst, University of Manitoba

Why?
• Wanted an easy way for executives and
managers to see the current state of our IT
systems
• Roll up the detailed monitoring information into
something more consumable
• Offer the information without requiring an
account on the monitoring system, while still
protecting confidential information

Current
(https://umanitoba.ca/ist/status)

Service Monitoring (Left Panel)
• Primary focus of the display
• IT Service is a Nagios Host
• Service Component is a Nagios Service
• Service status represents worst component
status

Types of checks we perform
• Selenium Test Script
• SSL Certificate Check
• Oracle table check
• File modification age
• Port check
• LDAP Authentication Check
• Redundancy Rollup Check (BPI)
• Process running checks

Selenium Test Scripts
SELENIUM OK: All tests passed
(Check took 7.96s) |
Visit SignUM=0.59s,
Visit Access Manager=1.53s,
Login=4.27s,
Logout=1.57s

Right Panel
(IT Centric Information Panes)

Future Thoughts
(IT Centric Information Panes)

Change Requests

Recent Changes
Service | Date | Status | Change ID
JUMP | April 32, 2019 | Success | CR123456789

Upcoming Changes
Service | Date | Classification | Change ID
JUMP | June 32, 2019 | Normal Medium | CR987654321

Inter System Communication
• SignUM
• Banner
• VIP
• Active Directory
• JUMP

Questions?
• URL: https://umanitoba.ca/ist/status
• E-Mail: William.Moore@umanitoba.ca
• LinkedIn: /in/william-j-moore/
• Presentations: On Slideshare
https://www.slideshare.net/WilliamMoore22/

1. Using Information Radiators to Communicate IT Status (William Moore, Senior Systems Analyst, University of Manitoba)

3. Why? • Wanted an easy way for executives and managers to see the current state of our IT systems • Roll up the detailed monitoring information into something more consumable • Offer the information without requiring an account on the monitoring system, while still protecting confidential information

4. Proof of Concept

5. Black Friday Edition

6. Current (https://umanitoba.ca/ist/status)

7. Mobile Friendly

8. Service Monitoring (Left Panel) • Primary focus of the display • IT Service is a Nagios Host • Service Component is a Nagios Service • Service status represents worst component status

9. Types of checks we perform • Selenium Test Script • SSL Certificate Check • Oracle table check • File modification age • Port check • LDAP Authentication Check • Redundancy Rollup Check (BPI) • Process running checks

10. Selenium Test Scripts SELENIUM OK: All tests passed (Check took 7.96s) | Visit SignUM=0.59s, Visit Access Manager=1.53s, Login=4.27s, Logout=1.57s

11. Redundancy Rollup (BPI)

12. Weather Info

13. Right Panel (IT Centric Information Panes)

14. Service Availability

15. JIRA Project Status

16. JIRA Project Status

17. JIRA Project Status

18. Twitter Feeds

19. Fireplace (aka Youtube support)

20. Images

21. Future Thoughts (IT Centric Information Panes)

22. Service Performance Stats

23. Team Metrics

24. Change Requests Recent Changes Service Date Status Change ID JUMP April 32, 2019 Success CR123456789 Upcoming Changes Service Date Classification Change ID JUMP June 32, 2019 Normal Medium CR987654321

25. Important Dates

26. Inter System Communication SignUM Banner VIP Active Directory JUMP

27. Questions? • URL: https://umanitoba.ca/ist/status • E-Mail: William.Moore@umanitoba.ca • LinkedIn: /in/william-j-moore/ • Presentations: On Slideshare https://www.slideshare.net/WilliamMoore22/

Editor's Notes

Hi, I am Bill. I have been in IT for over 20 years, with the majority of that time spent in Higher Education. In that time I have had many different roles, everything from web developer to looking after middleware to systems analyst. Though my roles have changed, my core interests from an IT perspective have not changed that much. I believe that computers should help automate many of our activities in IT, allowing us to focus on other needs. Two such areas are testing and the monitoring of IT managed services. For quite a while I have been interested in improving the automated monitoring of IT managed services and communicating that status. One of my early attempts is shown behind me in an old screenshot. I had built a small widget that would query our monitoring system and display the status of a number of systems that I was responsible for. This saved me from having to stay logged in or open multiple web pages; I could get all the info I needed at a glance. When I am not thinking about IT, my other areas of enjoyment are photography, astronomy, and being a Scouter. The image of me in the corner is from earlier this year during a Scout winter camp. When I woke that day it was -37 without the wind chill. This picture was taken after 2 hours of cross-country skiing with Scouts and Cubs. The image on the far left, of a dandelion gone to seed, was recently taken in my backyard. I thought it looked rather cool and tried to capture it with my phone. I am a very amateur astronomer; what I like most is being able to see things from my backyard and sharing that with my family and friends. Last year I obtained a camera for my telescope and was successful in capturing Jupiter. I suppose this last point is one other thing I enjoy, whether in IT or in personal life: I like looking for solutions to problems.
In the fall of 2017 a group of us were wrapping up a meeting, and as the day was almost over the conversation turned to system monitoring while we prepared to head home. We acknowledged that we do perform various monitoring checks on all our servers; however, this information is not presented in a way that is easily consumable. Often you had to have in-depth knowledge to understand how an alert actually impacts the IT managed service. I should perhaps mention that at this time the monitoring was the typical hardware type of monitoring: is the data volume full, is the CPU or memory being maxed out, is process X still running. These are all important checks for an IT managed service, but it is very hard to explain to anyone outside of the Infrastructure and Application support teams what they mean to the Students or the Administration. We began to spitball an idea that we should have a web page we could point our Executive to so they could get a bird's-eye view of the state of the services we in IT manage for the institution. The thought was also to save our Executive and other stakeholders from having to call around to understand what is going on. We had a small checklist:
• No login required
• Low-level checks rolled up into a general status of the service (Are we good?)
As I like the idea of monitoring and making this information more readily available, my manager and I decided to make this a higher priority for me to work on. He worked to acquire a computer and monitor while I reached out to our Infrastructure team to see what we could do with our monitoring application to meet our needs. The tool we use for monitoring our systems is Nagios. With it we can set up a host definition and the services on that host we want to monitor. For each service we can indicate thresholds for warning and critical alerts, as well as how frequently we should check that service. Most of the checks being performed were hardware based and did not equate exactly to what a triggered threshold means for the users of the service. Fortunately Nagios had a feature, Business Process Intelligence (BPI), which allowed you to group a number of service checks into a single reporting item. This allowed us to roll up the hardware checks into a simpler "Is the service OK or is it broken?". My manager and I were able to release our proof of concept within a couple of days, and it was placed on display in our work area by the printers a day before Black Friday in 2017. It was a very simple setup, and it met the small checklist we had come up with earlier in the week. The image on the left is what we put together: an old laptop and monitor that used a Tiny URL to access the Nagios reporting screen. We opted to have the monitor in portrait mode as it helped the display stand out more and allowed us to have a longer list of services without requiring a scroll bar.
The proof of concept was well received, as it gave our executive something they had been requesting for years. Our central IT has three main locations on this campus, and it was requested that we have this display present in the other main locations. The only criticism at the time was that we needed a larger display and should find some TVs to use instead of an old desktop monitor. My manager and I were able to acquire some TVs the next day and swapped out the display on the proof of concept before we closed for the weekend. The other locations would have to wait a bit, as we had to acquire another computer and arrange for the TV to be hung. However, as I looked at the larger display I found that the simple Nagios summary page did not look as good on it. Part of this was that we had to put the TV in landscape mode rather than portrait; the other part was that the TV offered so much more real estate. Prior to heading home on Friday I received access to a web server with PHP and began to learn how to use the Nagios API to extract the data so that we could offer a different type of display. By the following Sunday I had what you see behind me ready to display on Monday morning. Though it was only marginally better than the Nagios display, it gave us far more possibilities than just reporting on the status of the IT managed services. What you see in addition to the service checks was filler I put together on the weekend to show what could be possible. Fortunately this enhancement was also well received. After seeing this, a couple of directors began to ask how they could get some of their data displayed there.
A lot has changed since the fall of 2017. We no longer have a laptop powering the display; we swapped the computers out for two Raspberry Pis. The Pis were a great change as we could place them behind the TV, making the area look less messy. With the Pis I am also able to restart them remotely, which is very handy when we push out a site update and want the displays to pick up the latest code. Prior to this I would have to run from building to building after a web site change. Aside from the Pis, we have also attempted to make the web site more modern. We got rid of the old frames that were used before and are now using JavaScript to dynamically update the HTML as the content changes. By the way, if any of you want to see the web site yourself, the URL is there and it is publicly available. The layout of our site consists of three information panes:
• Service Status, on the left
• Weather & Time, across the top
• IT Centric information, on the right
As I was redesigning the interface for the site I wanted to be able to view it easily on my phone. I still look after some services, and there are times when I am away from work or on the way to a meeting and I just want to confirm that everything is still good. As such, the redesign not only moved away from frames, it also adopted Bootstrap to lay out the HTML elements. This ensured that the new design was mobile-first from the start. When laying out the new site I made sure the Service check pane would be at the top of the screen when viewed on a small device. Recently, though, it has begun to take a couple of seconds for the service data to be delivered, so on first launch from your phone you will see the weather at the top. This is something I will be addressing in the coming weeks.
For the service checks we are redefining the Nagios terms Host and Service. For those who are not familiar with Nagios, in simple terms you set up a Host in Nagios, which is typically a server. On that host you set up the service checks to be performed on that server. However, when we talk about a Service that is offered, such as your Student Information System, that is not likely a single server. For the purpose of this display we equate that Service to a Nagios Host, and the Nagios service checks then monitor "all" the parts of that Service known to cause issues for the end users. I say "all", but truthfully you will never know all the critical checks on day 1; as such, we are constantly updating our list of checks for a service as we discover previously unseen issues. The PHP script connects in real time to Nagios and obtains the list of services we are reporting on and the status of each check for those systems. It then summarizes the status checks into a green, yellow, or red face to visually show you whether the service is operating as expected. The roll-up is a very simple algorithm: if any check is worse than the current status, then that check's status becomes the new status.
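The roll-up described above can be sketched in a few lines. This is only an illustration, written in Python rather than the PHP the site uses. The numeric codes are the standard Nagios plugin return codes, but the severity ranking of UNKNOWN, the face colour mapping, and the function names `roll_up` and `face_for` are assumptions made for this sketch.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking (higher is worse). Treating UNKNOWN as worse
# than WARNING but better than CRITICAL is a choice made for this sketch.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}

# Hypothetical mapping from a rolled-up state to the face colour shown.
FACE = {OK: "green", WARNING: "yellow", UNKNOWN: "yellow", CRITICAL: "red"}


def roll_up(check_states):
    """Return the worst state among a service's component checks."""
    status = OK
    for state in check_states:
        if SEVERITY[state] > SEVERITY[status]:
            status = state  # a worse check becomes the new status
    return status


def face_for(check_states):
    """Map the rolled-up status to a face colour."""
    return FACE[roll_up(check_states)]
```

A service with no failing components stays green, and a single critical component turns the whole service red regardless of how many other checks pass.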
We monitor a large number of items on our systems; however, not all will directly correlate to a service disruption. For the purpose of this dashboard we attempt to determine what causes a failure for the user, or a degradation of the service. Some of the checks we often use are:
• Selenium Test Scripts
• SSL Certificate checks
• File modification age
• Port checks
• Process running checks
• LDAP authentication checks
• Business Process Intelligence checks
Because we want to be alerted (and alert others) when end users may encounter a problem, one of the more commonly used checks is a Selenium test script. Where possible we have a test script that will log into the web application as a normal user, look for some text, and then log out. We can handle a variety of login sequences, such as a simple login form or a service which redirects you to a central login service and back. We often write the scripts so that we test the different steps in the process, such as the example behind me for logging into our Identity Management System. An advantage of breaking the script up like this is that if the process takes longer than normal we can quickly see which step was behaving abnormally. We can also share this information on the status screen to better explain to our audience what the issue appears to be. Nagios keeps these metrics, which also allows us to run a report for a particular service to see how well it has been performing over a period of time. We can use that data to adjust our thresholds or, if it is a hosted service, have a conversation with the hosting outfit about what we have been noticing.
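To illustrate how per-step timings make the slow step easy to spot, here is a small Python sketch that splits check output of the shape shown on the Selenium slide into a summary and a timing per step. The parsing rules (a "|" separator followed by comma-separated `step=N.NNs` items) are inferred from that one example line, and the function names are hypothetical; this is not the parser the site uses.

```python
import re


def parse_step_timings(plugin_output):
    """Split check output into (summary, {step name: seconds})."""
    summary, _, perf = plugin_output.partition("|")
    timings = {}
    for item in perf.split(","):
        # Each item is expected to look like "Visit SignUM=0.59s".
        match = re.fullmatch(r"\s*(.+?)=([0-9.]+)s\s*", item)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return summary.strip(), timings


def slowest_step(timings):
    """Return the name of the step that took the longest."""
    return max(timings, key=timings.get)


example = (
    "SELENIUM OK: All tests passed (Check took 7.96s) | "
    "Visit SignUM=0.59s, Visit Access Manager=1.53s, Login=4.27s, Logout=1.57s"
)
summary, timings = parse_step_timings(example)
```

For the example line, `slowest_step(timings)` picks out the Login step, which is the kind of detail that can be surfaced on the status screen.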
To provide the best possible experience to our Business Colleagues we have set up some of the components to be redundant. If you recall, in the Proof of Concept we used the Nagios feature Business Process Intelligence (BPI) to help summarize the state of the whole service, something we replaced with our web application as we found the Nagios feature hard to configure to meet our needs. However, this feature is ideal for monitoring a redundant component. We first create a service check for that component (e.g. an LDAP auth check, or a web page check which fetches a graphic). Once this is set up for each member of the redundant group, we can group them all together within BPI and set thresholds for warning and critical. In the example behind me, the portal tier will issue a warning if there are only 2 nodes online and a critical if there is only 1 node online. This service can run fine on 3 nodes; the fourth gives us a live spare.
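The threshold logic for the portal tier example can be expressed as a short Python sketch. This is not BPI configuration syntax; the function name and parameters are hypothetical, and only the numbers (warning at 2 nodes online, critical at 1) come from the example above.

```python
def redundancy_status(nodes_online, warning_at=2, critical_at=1):
    """Roll up a redundant group by counting how many members are online."""
    if nodes_online <= critical_at:
        return "CRITICAL"
    if nodes_online <= warning_at:
        return "WARNING"
    return "OK"
```

With four nodes in the group, losing one node still reports OK because three nodes can carry the service; the group only degrades to WARNING once a second node is lost.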
We have kept the weather information and date/time from the Black Friday version of the dashboard. The background of the weather information screen changes depending on what the current temperature is.
The next set of slides will showcase some of the content options we have for the Right Panel. There is a JSON file that controls which of the screens are enabled (as they are not enabled all the time) and how long they should be displayed for before switching to the next item in the list.
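As a rough illustration of how such a JSON file could drive the rotation, the Python sketch below filters the enabled screens and cycles through them. The schema (`module`, `enabled`, `seconds`) and the module names are invented for this example; the real file's structure is not shown in the slides.

```python
import itertools
import json

# Hypothetical configuration; field and module names are assumptions.
EXAMPLE_CONFIG = """
[
  {"module": "availability", "enabled": true,  "seconds": 30},
  {"module": "twitter",      "enabled": true,  "seconds": 45},
  {"module": "fireplace",    "enabled": false, "seconds": 1800}
]
"""


def enabled_screens(config_text):
    """Return (module, seconds) pairs for the screens that are switched on."""
    screens = json.loads(config_text)
    return [(s["module"], s["seconds"]) for s in screens if s["enabled"]]


def rotation(config_text):
    """Endless iterator over the enabled screens, in file order."""
    return itertools.cycle(enabled_screens(config_text))
```

Disabling a screen is then just a matter of flipping its `enabled` flag, and the display time is adjusted per screen without touching the code.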
As mentioned, we use Nagios to monitor our Services. One of the features of Nagios is the ability to produce a report representing the Availability of a service. That is what this option showcases. As the reporting engine in Nagios can be rather slow, we snapshot the report data once a month for the previous months, and once a day for the current state. The module then outputs this data for all services we are exposing in the left panel.
Last fall we upgraded the Admin side of Banner to version 9. Prior to the Go-Live Weekend we were asked if we could display the implementation progress on our screens. We put together a module that uses the data stored in JIRA to report on the overall progress of the implementation. This is done by creating a release in JIRA and associating the tasks of the implementation with that release. By using the state of each task, the time tracking fields, and some labels in JIRA, we are able to give the overall progress of the implementation, as well as how each specific step in the implementation is progressing. The screen behind me shows that we are currently on schedule, and we have highlighted the currently active task. It is possible to have multiple tasks in progress showing different states.
Here the project manager has indicated that the Banner Application Upgrade task is in a critical state (using a label on the task). When this label appears on any task that is ‘In Progress’ the overall status switches to critical and the task that is causing this state is highlighted in red.
Here we have gotten past the Banner Application Upgrade task, but now the Deploy Extensions task is behind schedule, which has caused the overall status to be behind. As this was our first attempt at using JIRA to track the implementation plan and display it to our community, the project manager managed these tasks rather than the actual worker. Despite this being our first attempt, it was appreciated by the executives as they could check the overall progress on their phones in between e-mail notifications.
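The status rules from the three JIRA slides can be summarized in a small Python sketch. It does not call the JIRA API; the `Task` fields, the `critical` label name, and the `behind_schedule` flag (assumed to be derived from the time tracking fields) are all hypothetical stand-ins, and restricting the behind-schedule rule to in-progress tasks is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    state: str                      # e.g. "To Do", "In Progress", "Done"
    labels: set = field(default_factory=set)
    behind_schedule: bool = False   # assumed to come from time tracking


def overall_status(tasks):
    """Derive the overall implementation status from the active tasks."""
    in_progress = [t for t in tasks if t.state == "In Progress"]
    # A critical label on any in-progress task wins over everything else.
    if any("critical" in t.labels for t in in_progress):
        return "critical"
    if any(t.behind_schedule for t in in_progress):
        return "behind"
    return "on schedule"
```

Because the rule only looks at in-progress tasks, a critical label left on a completed task would no longer affect the overall status in this sketch.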
We can add any number of Twitter feeds as part of the rotating content. This allows us to cycle through social media information which may be of interest to those walking by the screen. In the event the Twitter feed is too long for the display, it will start to automatically scroll up and down.
In December, before we break for the holidays, we add to the mix of informational items a YouTube video of Christmas music with a fireplace crackling away. This is not up 24/7 but usually for 30 minutes or so as part of the overall rotation. We can also put up other YouTube videos.
One of the early pieces of content we offered in this section was an image display. We could handle any number of images and would keep track of which image was last shown so that during the next iteration we would show the next image in sequence. If the image was larger than the viewport it would be resized automatically to fit. The original thought was to use this as an easy way for another area within IST to share information without having to worry about creating HTML. They could create a poster image of their information and we could display it with minimal effort. We have used this to give a visual description of the smiley face as well as to cheer on the Winnipeg Jets when they were doing well in the 2018 Playoffs. This option is not used as much anymore, but it is still available.
For most of the Nagios checks we use, we are able to produce charts showing the performance of that check over a given time period. Eventually there will be a page that displays detailed service metrics on rotation. The trick is to determine which monitoring events would be of value for showing long-term performance.
We utilize Cherwell to track Incidents, Service Requests, and Changes. The thought with this display is to show Cherwell performance metrics for the various IST Teams.
We are considering adding a module that will display recent IT system changes as well as upcoming ones. The information displayed will not be overly detailed but would allow an IST person to see if anything has changed recently which may be the cause of the problem they are currently troubleshooting.
We have an Event Calendar (Active Data) that we can use to fetch a list of upcoming important dates, such as University closures or the last day of voluntary withdrawal.
Something we want to try to put together is along the lines of a subway train map. A number of our systems communicate with each other in order to fulfill our Business Colleagues' needs. We want to be able to highlight some of those bigger systems and display whether the communication paths are working appropriately. The data on this screen would be real-time as opposed to historical. In the crude mockup behind me we can see that SignUM is communicating okay with Banner and VIP; however, it is having intermittent issues with Active Directory and serious issues with JUMP.