The document discusses using information radiators to communicate IT system status at the University of Manitoba. It describes setting up a public dashboard to provide executives and managers with easy visibility into the current state of IT systems without requiring logins. The proof of concept dashboard displays real-time monitoring information on service availability and performance, team metrics, upcoming changes, and other IT-related data through widgets on a mobile-friendly website. It aims to roll up detailed technical monitoring into a more consumable format for non-technical audiences.
IT Status Radiators Communicate System Health
1. Using Information Radiators to Communicate IT Status
William Moore, Senior Systems Analyst, University of Manitoba
3. Why?
• Wanted an easy way for executives and managers to see the current state of our IT systems
• Roll up the detailed monitoring information into something more consumable
• Offer the information without requiring an account on the monitoring system, while still protecting confidential information
8. Service Monitoring (Left Panel)
• Primary focus of the display
• IT Service is a Nagios Host
• Service Component is a Nagios Service
• Service status represents worst component status
9. Types of checks we perform
• Selenium Test Script
• SSL Certificate Check
• Oracle table check
• File modification age
• Port check
• LDAP Authentication Check
• Redundancy Rollup Check (BPI)
• Process running checks
10. Selenium Test Scripts
SELENIUM OK: All tests passed
(Check took 7.96s) |
Visit SignUM=0.59s,
Visit Access Manager=1.53s,
Login=4.27s,
Logout=1.57s
24. Change Requests
Recent Changes
Service   Date             Status    Change ID
JUMP      April 32, 2019   Success   CR123456789
Upcoming Changes
Service   Date             Classification   Change ID
JUMP      June 32, 2019    Normal Medium    CR987654321
Hi, I am Bill.
I have been in IT for over 20 years, with the majority of that time spent in Higher Education. In this time I have had many different roles, everything from web developer to looking after middleware to systems analyst. Though my roles have changed, my core interests from an IT perspective have not changed that much. I believe that computers should help automate many of our activities in IT, allowing us to focus on other needs. Two such areas are the monitoring of IT managed services and testing.
For quite a while I have been interested in improving the automated monitoring of IT managed services and communicating that status. One of my early attempts is shown behind me in an old screenshot. I had built a small widget that would query our monitoring system and display the status of a number of systems that I was responsible for. This saved me having to stay logged in or open up multiple web pages. I could get all the info I needed at a glance.
However, when I am not thinking about IT, my other areas of enjoyment are photography, astronomy, and being a Scouter.
The image of me in the corner is from earlier this year during a Scout winter camp. When I woke on that day it was -37 without the wind-chill. This picture was taken after 2 hours of x-country skiing with Scouts and Cubs.
The image on the far left was recently taken from my backyard of a dandelion gone to seed. I thought it looked rather cool and tried to capture it with my phone.
I am a very amateur astronomer. What I like most is being able to see things from my backyard and sharing that with my family and friends. Last year I obtained a camera for my telescope and was successful in capturing Jupiter.
I suppose this last point is one other thing I enjoy either in IT or in personal life. I like looking for solutions to problems.
In the fall of 2017 a group of us were wrapping up a meeting, and as the day was almost over the conversation turned to system monitoring as we prepared to head home. We acknowledged that we do perform various monitoring checks on all our servers; however, this information is not presented in a way that is easily consumable. Often you had to have in-depth knowledge to understand how an alert actually impacts the IT managed service.
I should perhaps mention that at this time the monitoring was the typical hardware type of monitoring: is the data volume full, is the CPU or memory being maxed out, is process X still running? These are all important checks for an IT managed service, but it is very hard to explain to anyone outside of the Infrastructure and Application support teams what they mean to the Students or the Administration.
We began to spitball an idea that we should have a web page we could point our Executive to so they could get a bird's-eye view of the state of the services we in IT manage for the institution. The thought was also to help save our Executive and other stakeholders from having to call around to understand what is going on.
We had a small checklist:
No login required
Low-level checks rolled up into a general status of the service (Are we good?)
As I like the idea of monitoring and making this information more readily available, my manager and I decided to make this a higher priority for me to work on. He worked to acquire a computer and monitor while I reached out to our Infrastructure team to see what we could do with our monitoring application to meet our needs.
The tool we use for monitoring our systems is Nagios. With it we can set up a host definition and the services on that host we want to monitor. For each service we can indicate thresholds for warning and critical alerts, as well as how frequently we should check that service.
Now, most of the checks being performed were hardware based and did not equate exactly to what it means for the service users if a threshold was triggered. Fortunately, Nagios had a feature, Business Process Intelligence (BPI), which allowed you to apply some business logic to all the service checks. You could group a number of service checks into a single reporting item. This allowed us to roll up the hardware checks into a simpler “Is the service OK or is it broken?”
My Manager and I were able to release our proof of concept within a couple of days and it was placed on display in our work area by the printers a day before Black Friday in 2017.
It was a very simple setup, and it met the small checklist we came up with earlier in the week.
The image on the left is what we put together, an old laptop and monitor that used a Tiny URL to access the Nagios reporting screen. We opted to have the monitor in portrait mode as it helped the display stand out more and allowed us to have a longer list of services without requiring a scroll bar.
The proof of concept was well received, as it gave our executive something they had been requesting for years. Our central IT has three main locations on this campus, and it was requested that we have this display present in the other main locations. The only criticism at the time was that we needed a larger display and that we should find some TVs we could use instead of an old desktop monitor.
My manager and I were able to acquire some TVs the next day and swapped out the display on the proof of concept before we closed for the weekend. The other locations would have to wait a bit as we had to acquire another computer and arrange for the TV to be hung.
However, as I looked at the larger display I found the simple Nagios summary page did not look as good on it. Part of this was that we had to put the TV in landscape mode rather than portrait; the other part was that the TV offered so much more real estate.
Prior to heading home on Friday I received access to a web server with PHP and began to learn how to use the Nagios API to extract the data so that we could offer a different type of display. By the following Sunday I had what you see behind me on display for Monday morning. Though it was only marginally better than the Nagios display it gave us far more possibilities than just reporting on the status of the IT managed services.
What you see in addition to the service check was filler I put together on the weekend to show what could be possible. Fortunately this enhancement was also well received. A couple of directors after seeing this began to ask questions on how they could get some of their data displayed there.
A lot has changed since the fall of 2017. We no longer have a laptop powering the display; we swapped the computers out for two Raspberry Pis. The Pis were a great change, as we could place them behind the TV, making the area look less messy. With the Pis I am also able to remotely restart them, which is very handy when we push out a site update and want the displays to pick up the latest code. Prior to this I would have to run from building to building after a web site change.
Aside from the Pis we have also attempted to make the web site more modern. We got rid of the old frames that were used prior and are now using JavaScript to dynamically update the HTML as the content changes.
BTW if any of you want to see the web site yourself the URL is there and it is publicly available.
The layout of our site consists of three information panes:
Service Status – on the left
Weather & Time – across the top
IT Centric information – on the right
As I was redesigning the interface for the site I wanted the ability to easily view it on my phone. I still look after some services, and there are times when I am away from work or on the way to a meeting and I just want to confirm that everything is still good. As such, the redesign not only moved away from frames, it also adopted Bootstrap to lay out the HTML elements. This ensured that the new design was mobile-first from the start.
When laying out the new site I made sure the Service check pane would be at the top of the screen when viewed on a small device. Recently, though, it has begun to take a couple of seconds for the data to be delivered, so on first launch from your phone you will see the weather at the top. This is something I will be addressing in the coming weeks.
For the service checks we are redefining the Nagios terms Host and Service. For those that are not familiar with Nagios, in simple terms you set up a Host in Nagios, which is typically a server. On that host you set up the service checks to be performed on that server.
However, when we talk about a Service that is offered such as your Student Information System, that is not likely a single server. For the purpose of this display we equate that Service to a Nagios Host, and then the Nagios service checks will be to monitor “all” the parts of that Service known to cause issues for the end users.
I say “ALL”, but truthfully you will never know all the critical checks on day 1. As such, we are constantly updating our list of checks for a service as we discover previously unseen issues.
The PHP script will connect in real time to Nagios and obtain the list of services we are reporting on and the status of each check for those systems. It will then summarize the status checks into a green, yellow, or red face to visually show you if the service is operating as expected. The roll-up is a very simple algorithm: if any check is worse than the current status, then that becomes the new status.
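The roll-up described above can be sketched in a few lines. This is only an illustration in Python, not the PHP used on the site. The component names are made up, and how the Nagios UNKNOWN state is ranked and drawn is an assumption, since the talk only mentions green, yellow, and red.

```python
# Nagios plugin return codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumption: UNKNOWN ranks between WARNING and CRITICAL for the roll-up.
SEVERITY = {OK: 0, WARNING: 1, UNKNOWN: 2, CRITICAL: 3}

# Assumption: UNKNOWN is drawn with the yellow face.
FACE = {OK: "green", WARNING: "yellow", UNKNOWN: "yellow", CRITICAL: "red"}


def roll_up(component_states):
    """Return the worst state among a service's component checks."""
    status = OK
    for state in component_states.values():
        # If this check is worse than the current status, it becomes the new status.
        if SEVERITY[state] > SEVERITY[status]:
            status = state
    return status


# Hypothetical components for one IT service.
components = {"Login test": OK, "SSL certificate": WARNING, "LDAP auth": OK}
print(FACE[roll_up(components)])
```

Starting from OK means a service with no failing components shows the green face, and a single failing component is enough to change it.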
We monitor a large number of items on our systems; however, not all of them directly correlate to a service disruption. For the purpose of this dashboard we attempt to determine what causes a failure for the user or a degradation of the service.
Some of the checks we often use are:
Selenium Test Scripts
SSL Certificate checks
File modification age
Port checks
Process running checks
LDAP authentication checks
Business Process Intelligence checks
Because we want to be alerted (and alert others) when end users may encounter a problem, one of the more commonly used checks is a Selenium test script. Where possible we have a test script that will log into the web application as a normal user, look for some text, and then log out. We can handle a variety of login sequences, such as a simple login form or a service which redirects you to a central login service and back.
We often write the scripts so that we test the different steps in the process, such as the example behind me for logging into our Identity Management System. An advantage of breaking the script up like this is if the process takes longer than normal we can quickly see which step in the process was behaving abnormally. We can also share this information on the status screen to better explain to our audience what the issue appears to be.
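To illustrate how per-step timings make the slow step easy to spot, here is a minimal Python sketch that parses the sample output shown on the Selenium Test Scripts slide. The parsing approach is an assumption based on that one sample (step timings after the `|`, separated by commas); it is not the script used on the site.

```python
import re

# Sample plugin output from the slide, joined onto one line.
OUTPUT = (
    "SELENIUM OK: All tests passed (Check took 7.96s) | "
    "Visit SignUM=0.59s, Visit Access Manager=1.53s, Login=4.27s, Logout=1.57s"
)


def step_timings(plugin_output):
    """Return {step name: seconds} parsed from the text after the '|'."""
    _, _, perfdata = plugin_output.partition("|")
    timings = {}
    for label, seconds in re.findall(r"([^=,]+)=([0-9.]+)s", perfdata):
        timings[label.strip()] = float(seconds)
    return timings


timings = step_timings(OUTPUT)
slowest = max(timings, key=timings.get)
print(f"Slowest step: {slowest} ({timings[slowest]:.2f}s)")
```

For the sample above this reports Login at 4.27s, which accounts for more than half of the 7.96s total.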
Nagios keeps these metrics, which also allows us to run a report for a particular service to see how well it has been performing over a given period of time. We can use that data to adjust our thresholds or, if it is a hosted service, have a conversation with the hosting provider about what we have been noticing.
To provide the best possible experience to our Business Colleagues we have set up some of the components to be redundant. If you recall from the Proof of Concept, we used the Nagios feature Business Process Intelligence (BPI) to help summarize the state of the whole service, something we replaced with our web application as we found the Nagios feature hard to configure to meet our needs.
However, this feature is ideal for monitoring a redundant component. We first create a service check for that component (e.g. an LDAP auth check, or a web page check which fetches a graphic). Once this is set up for each member of the redundancy group, we can group them all together within BPI and set up thresholds for warning and critical.
In the example behind me, the portal tier will issue a warning if there are only 2 nodes online and critical if there is only 1 node online. This service can run fine on 3 nodes, the fourth gives us a live spare.
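The threshold logic for the portal tier can be expressed as a short Python sketch. This is not BPI configuration syntax, only the decision it encodes; the function and parameter names are hypothetical.

```python
def redundancy_state(nodes_online, warning_at=2, critical_at=1):
    """Classify a redundant tier by how many member checks are passing."""
    if nodes_online <= critical_at:
        return "CRITICAL"
    if nodes_online <= warning_at:
        return "WARNING"
    return "OK"


# Four-node portal tier: fine on 3 nodes, the fourth is a live spare.
for online in (4, 3, 2, 1):
    print(online, redundancy_state(online))
```

With these defaults, losing the spare leaves the tier at OK, a second failure raises a warning, and a third raises a critical alert.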
We have kept the weather information and date/time from the Black Friday version of the dashboard. The background of the weather information screen changes depending on what the current temperature is.
The next set of slides will showcase some of the content options we have for the Right Panel. There is a JSON file that controls which of the screens are enabled (as they are not enabled all the time) and how long they should be displayed for before switching to the next item in the list.
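As a rough sketch of how such a file could drive the rotation, the Python below filters the enabled screens and works out which one is showing at a given moment. The key names (`screen`, `enabled`, `seconds`) and the screen names are hypothetical; the real file's layout is not shown in the talk.

```python
import json

# Hypothetical configuration; the real file's keys are not shown in the talk.
CONFIG_JSON = """
[
  {"screen": "availability", "enabled": true,  "seconds": 30},
  {"screen": "twitter",      "enabled": true,  "seconds": 20},
  {"screen": "youtube",      "enabled": false, "seconds": 1800},
  {"screen": "image",        "enabled": true,  "seconds": 15}
]
"""


def rotation(config_text):
    """Return (screen, seconds) pairs for enabled screens, in file order."""
    screens = json.loads(config_text)
    return [(s["screen"], s["seconds"]) for s in screens if s["enabled"]]


def screen_at(schedule, elapsed_seconds):
    """Return which screen is showing after elapsed_seconds of looping."""
    cycle = sum(seconds for _, seconds in schedule)
    offset = elapsed_seconds % cycle
    for name, seconds in schedule:
        if offset < seconds:
            return name
        offset -= seconds


schedule = rotation(CONFIG_JSON)
print(schedule)
print(screen_at(schedule, 40))
```

Keeping the list in a data file means a screen can be switched on or off, or given more time, without touching the display code.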
As mentioned, we use Nagios to monitor our Services. One of the features of Nagios is the ability to produce a report representing the Availability of a service. That is what this option showcases.
As the reporting engine in Nagios can be rather slow, we snapshot the report data once a month for the previous months, and once a day for the current state. The module then outputs this data for all services we are exposing in the left panel.
Last fall we upgraded the Admin side of Banner to version 9. Prior to the Go-Live Weekend we were asked if we could display the implementation progress on our screens. We put together a module that would use the data stored in JIRA to report on the overall progress of the implementation.
This is done by creating a release in JIRA, and associating the tasks of the implementation to that release. By using the state of the task, time tracking fields and some labels in JIRA we are able to give an overall progress of the implementation, as well as how each specific step in the implementation is progressing.
The screen behind me shows that we are currently on Schedule and we have highlighted the currently active task. It is possible to have multiple tasks in-progress showing different states.
Here the project manager has indicated that the Banner Application Upgrade task is in a critical state (using a label on the task). When this label appears on any task that is ‘In Progress’ the overall status switches to critical and the task that is causing this state is highlighted in red.
Here we have got past the Banner Application Upgrade task, but now the Deploy Extensions task is behind schedule, which has caused the overall status to be behind.
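The three overall states shown on these screens could be derived roughly as follows. This Python sketch uses hypothetical field names and a hypothetical `critical` label; how the module actually reads JIRA's state, time tracking, and labels is not shown in the talk, and letting critical take precedence over behind is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    state: str                     # e.g. "To Do", "In Progress", "Done"
    behind_schedule: bool = False  # assumed to be derived from time tracking
    labels: set = field(default_factory=set)


def overall_status(tasks):
    """Summarize the release from the tasks that are currently in progress."""
    active = [t for t in tasks if t.state == "In Progress"]
    if any("critical" in t.labels for t in active):
        return "Critical"
    if any(t.behind_schedule for t in active):
        return "Behind"
    return "On Schedule"


tasks = [
    Task("Banner Application Upgrade", "Done"),
    Task("Deploy Extensions", "In Progress", behind_schedule=True),
]
print(overall_status(tasks))
```

Only tasks that are in progress influence the result, which matches the behaviour described for the critical label.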
As this was our first attempt at using JIRA to track the implementation plan and display it to our community, the project manager managed these tasks rather than the people doing the work. Despite this being our first attempt, it was appreciated by the executives, as they could check the overall progress on their phones in between e-mail notifications.
We can add any number of Twitter feeds as part of the rotating content. This allows us to cycle through social media information which may be of interest to those walking by the screen.
In the event the Twitter feed is too long for the display, it will start to automatically scroll up and down.
In December, before we break for the holidays, we add to the mix of informational items a YouTube video of Christmas music with a fireplace crackling away. This is not up 24/7, but usually for 30 minutes or so as part of the overall rotation. We can also put up other YouTube videos.
One of the early pieces of content we offered in this section was an image display. We could handle any number of images and would keep track of which image was last shown so that during the next iteration we would show the next image in sequence. If the image was larger than the viewport it would be resized automatically to fit.
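Both behaviours, stepping through the images in sequence and shrinking oversized ones, are simple to sketch. The Python below is an illustration with hypothetical function names and file names; keeping the aspect ratio when resizing is an assumption, as the talk only says the image is resized to fit.

```python
def next_image(images, last_shown):
    """Return the image after last_shown, wrapping to the start of the list."""
    if last_shown not in images:
        return images[0]
    return images[(images.index(last_shown) + 1) % len(images)]


def fit_to_viewport(width, height, view_width, view_height):
    """Scale an image down to fit the viewport, keeping its aspect ratio."""
    scale = min(view_width / width, view_height / height, 1.0)
    return round(width * scale), round(height * scale)


posters = ["smiley-legend.png", "go-jets.png", "maintenance.png"]  # hypothetical
print(next_image(posters, "maintenance.png"))
print(fit_to_viewport(3840, 2160, 1920, 1080))
```

Capping the scale at 1.0 means images smaller than the viewport are left at their original size.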
The original thought was to use this as an easy way for another area within IST to share information and not have to worry about creating HTML. They could create a poster image of their information and we could display this with minimal effort.
We have used this to give a visual description of the smiley face as well as cheer on the Winnipeg Jets when they were doing well in the 2018 Playoffs.
This option is not used as much anymore, but it is still available.
For most of the Nagios checks we use, we are able to produce charts showing the performance of that check over a given time period. Eventually there will be a page that displays detailed service metrics on rotation. The trick is to determine which monitoring events would be of value for showing long-term performance.
We utilize Cherwell to track Incidents, Service Requests, and Changes. The thought with this display is to show Cherwell performance metrics for the various IST Teams.
We are considering adding a module that will display recent IT system changes as well as upcoming ones. The information displayed will not be overly detailed but would allow an IST person to see if anything has changed recently which may be the cause of the problem they are currently troubleshooting.
We have an Event Calendar (Active Data) that we can use to fetch a list of upcoming important dates, such as University closures or the last day of voluntary withdrawal.
Something we want to try to put together is along the lines of a subway train map. A number of our systems communicate with each other in order to fulfill our Business Colleagues' needs. We want to be able to highlight some of those bigger systems and display whether the communication paths are working appropriately. The data on this screen would be real-time as opposed to historical.
In the crude mockup behind me we can see that SignUM is communicating okay with Banner and VIP; however, it is having intermittent issues with Active Directory and serious issues with JUMP.