Developing a Service Dashboard: keeping an eye on things

This session walks through the process of developing a custom service dashboard for the Blackboard Learn platform, which was deployed in production in December 2011. It displays automatically updating metrics such as disk space, database table sizes, processor load, connections and the number of users.
The author shares the measures used and how they were obtained, the APIs used to display the content and manage access, and the use of AJAX and Google Charts to provide live updates. Some time is spent explaining the design philosophy so that viewers aren’t dazzled by an array of blinking lights.
It concludes by showing how we have incorporated other monitoring tools into the dashboard and our plans for the future.

Delivered at BbWorld 2012 in New Orleans

Published in: Education, Technology

  • Developing a Service Dashboard: keeping an eye on things. This session walks the user through the process of developing a custom service dashboard for the Blackboard Learn platform, which was deployed in production in December 2011. It displays metrics such as disk space, database tables, processor load, connections, the number of users, etc., automatically updating. Malcolm Murray will share the measures used and how they were obtained, the APIs used to display the content and manage access, and the use of AJAX and Google Charts to provide live updates. Some time is spent explaining the design philosophy so that viewers aren’t dazzled by an array of blinking lights. It will conclude by showing how we have incorporated other monitoring tools into the dashboard and our plans for the future. Audience: Sys Admins? Managers? Developers? Eyeball from: http://www.clker.com/cliparts/q/K/E/M/8/C/green-eye-md.png
  • The key issue here is that Blackboard is a complicated service to manage. There’s a lot to it, and some parts are essentially black boxes (closed systems), but we need to keep it running smoothly. Out of the box, there aren’t many management tools (though the Admin Console is a good start). Image source: http://myhometheaterbuild.wordpress.com/2011/08/03/feeling-ocd-how-to-clean-your-car-engine/
  • The team managing a service don’t need a report from every switch or transaction. They don’t need to know everything, but they do want to check quickly that all seems well, or be alerted to any problem. What’s more, this needs to be done in a way that doesn’t get in the way of their other activities. We can’t assume that all the team are Linux/database/Java/whatever gurus, so the tool needs to indicate where the concern is in plain English. Dashboard icon: http://www.veryicon.com/icons/system/smoothicons-5/dashboard-17.html
  • So what should the Blackboard dashboard look like? Borrowing terms from automotive design is appropriate – the “dashboard” needs to convey the data we need without distracting us from the road ahead. The second photo is an example of a poorly designed dashboard – your focus is on the steering column, not the windscreen. Image source: http://tutsplus.com/tutorial/creating-a-car-dashboard-using-the-brush-tool/ Photo source: http://static.stomp.com.sg/site/servlet/linkableblob/stomp/1148306/data/a98211_d13jpg1338954675095-data.jpg
  • So just why is it so complicated?
  • For most institutions, Blackboard is not a stand-alone service running on a single box. From our own experience, as demand increased, we brought in a load balancer; we currently run on four virtual servers, have a dedicated collab server and a dedicated database, store data on a filestore (a NetApp appliance), etc. Each server has different partitions, and the database spans multiple volumes, all with their own quota. There are a lot of interdependencies, and it is essentially a meta-service composed of lots of discrete units.
  • When we started, each part had (if we were lucky) its own monitoring interface. Each had a different URL, some required a password, some were only accessible on-site, etc. Some of the pages were definitely not designed for quick inspection – key information such as server load may take careful scrutiny. Bringing these together introduces a further tension – the need to keep things simple, yet provide access to lots of interfaces in one place. Image icon source: http://www.caradvice.com.au/20229/cadillac-introduces-2010-srx-crossover/
  • At this point there is a real danger of scope creep – what do we need the dashboard to do? The initial aim was some form of (semi) real-time monitoring, which can provide the metrics needed for service measurement if we persist them somewhere. Some hoped it would provide alerts – e.g. triggering emails if a measure exceeded a critical threshold. Personally I think this is the wrong place for that – don’t expect a failing server to email you that it is going down. It is hard to escape the ever-growing demand for KPIs – could this provide some? Should it? Could KPIs help shape what we measure? The scoping stage needs a lot of discipline or you end up trying to spec the impossible. My advice is to start small, and be prepared to learn, borrow, share and sometimes even start again from scratch (but wiser)! Image: http://img.ehowcdn.com/article-new/ehow/images/a05/ms/99/wire-switch-panel-race-cars-800x800.jpg
  • A dashboard needs to provide some measurements
  • The obvious candidates were used and available disk space, ditto for database tables, some measure of processor load and an indicator of users to help us understand any sudden changes in the figures. We started with the easy things to measure and added more as we could. An important activity is verifying the results – do your numbers add up? If you say a disk is approaching 75% capacity, is there really 25% free? Image source: http://www.moates.net/innovate-auxbox-lma-3.html
  • Our database is split over several logical volumes. Although the tables are set to autogrow, in the past we have sometimes come very close to running out of space. Thus an early task was to get these figures onto the screen. Here the figures for disk data03 are shown in red – indicating they had exceeded a warning threshold. (Panic not, new data are no longer written to data03.) As these figures change relatively slowly (and there is a non-trivial overhead in calculating them), we decided they should be updated when the page is refreshed, drawing on data updated hourly. Image source: http://thinkinginrails.com/wp-content/uploads/2010/05/database-integration.jpg
  • If you want more detail about the database – e.g. to try and understand why one volume is filling up – click on the database tab. This lists the volumes, tables and indexes, complete with sparklines showing the measures for the last 48 hours. These trends help to stop panics if a temp folder suddenly starts filling up. Clicking on the sparkline opens a new window…
  • Here the disk usage is shown in two graphs, generated using the Google chart APIs. The left-hand graph (red) shows relative usage – how near to 100% of the disk space allocated to the system did we get? The right-hand graph (blue) shows disk space in absolute terms – the horizontal grey line along the top shows that this disk has been sitting at 250GB throughout the monitoring period. The highlighted section along the bottom allows you to zoom in on a particular section of the graph – all thanks to Google’s code!
  • The performance of the app servers was another early candidate that made it to the dashboard. We wanted similar measures of disk usage and capacity, but also processor load and the number of connections to the load balancer. We also wanted some JVM stats, but that has to wait until version 2.01. As the load and connections are volatile, we needed these data to be regularly refreshed. Image source: http://www.7l.com/images/large-SL2600-Multi-Servers-icon.png
  • For users, we can harvest the existing session data stored in the Blackboard database. Given that not everyone in the sessions table is likely to be an active user – we don’t all log out – we also plotted the number of logins. This tab provides a graph that the user can query, to see whether current usage is high, low or normal.
  • What happened at 4.30 yesterday? Did someone try and download the entire content collection? Did something go viral? Or was there a denial of service attack? On the front page we plot a summary of current users, updated using a Prototype query every minute. A more detailed version of these data is available from the Online Now tab…
  • The logic used to gather these data is based on the Who’s Online building block by Santo Nucifora at Seneca College. Knowing who is actually online doing something can help diagnose strange events (load spikes) or provide a list of people to inform if things are about to go wrong!
  • I’ve sort of skipped ahead, showing you the end result without really explaining how. The next section gives you a flavour of the scripts we use to generate and record the data. Note they may not be the best way of doing them, or the most efficient. They are in-house solutions that work, and that’s good enough for us just now!
  • Most of the data are gathered using a cron job that triggers a set of shell scripts running on one server – we chose the collab server as it is under-used. They follow a set pattern. First, invoke a Linux command to generate a set of measures (e.g. df -P or netstat), possibly redirecting the output to a file. Second, massage the file using commands such as grep and awk to get just the bits we need, in the format we want. Third, append these to a text file in the form of SQL insert statements. Finally, once all the measures are done, run the SQL to persist the data. Credit here goes to my colleague Stephen Applegarth.
  • This example shows the query used to generate free, used and total disk space figures for the database tables.
  • The query on the previous slide generates output like this.
  • This project had zero budget and limited time. At the developers’ conference this year, Noriaki Tatsumi from Blackboard gave a great presentation showing how to use the free (GPL2 licence) tool Zabbix for system monitoring, which links to a custom building block, allowing JMX calls, security checks and lots of other goodness. I am sure this is the way to go – we will be investigating this when I get home! His presentation was recorded and should be available with all the other devcon2012 materials when they are released.
  • OK – time to think more about the UI design decisions we made when developing the system dashboard.
  • Hard to argue with Einstein (and win). But what does this mean for our dashboard? Image source: http://www.wallchan.com/wallpaper/19591/
  • One of the design constraints/requirements was that the product needed to fit on this old 30” (75 cm) monitor (running 1360 x 768) that we had lying around the office. I am a firm believer that the front page of a dashboard shouldn’t need to scroll.
  • This wasn’t something that was going to be right under my nose all the time. It has to work from a distance.
  • We now look at a selection of dashboard designs culled from the internet – there are many more, just google ‘system dashboard’. http://www.dashboardinsight.com/dashboards/product-demos/altosoft-insight-dashboard-for-system-center.aspx
  • Interesting example created using Excel – lots of VBA so no good for Mac users. http://dashboardspy.wordpress.com/2010/12/08/excel-dashboard-tutorial/
  • Two of the more classic analogue control panel designs – do you like these? http://www.designvsart.com/blog/2008/08/14/designing-information-dashboards/#.T_YAi3BzWw8
  • Red, amber, green lights. http://dashboardspy.wordpress.com/2006/11/02/business-analysis-monitoring-dashboards-bam-rolling-up-application-kpis-for-a-system-status-dashboard/
  • Which is your favourite? No right answer!
  • This example shows the way this dashboard appears to the 5% of men who suffer from deuteranopia (the most common form of colour blindness). Can you tell which services are now in a state of alert? http://dashboardspy.wordpress.com/2006/11/02/business-analysis-monitoring-dashboards-bam-rolling-up-application-kpis-for-a-system-status-dashboard/
  • My thinking has been informed by reading around the subject. I have found these two authors particularly informative and thought-provoking. N.B. That is not the same as saying that I agree with everything they say! Stephen Few has over 20 years of experience as an innovator, consultant, and educator in the fields of business intelligence (a.k.a. data warehousing and decision support) and information design. Through his company, Perceptual Edge, he focuses on the effective analysis and presentation of quantitative business information. Stephen is recognized as a world leader in the field of data visualization. He teaches regularly at conferences such as those presented by The Data Warehousing Institute (TDWI) and DCI, and also in the MBA program at the Haas School of Business at U. C. Berkeley. He is also the author of the book "Show Me the Numbers: Designing Tables and Graphs to Enlighten" (Analytics Press). Edward Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University. He is noted for his writings on information design and as a pioneer in the field of data visualization.
  • When designing our dashboard we considered the phrase “Don’t bother me, I am busy!” It is designed so that if, after a quick glance, all is grey, all is well and we can go back to our day job. If something is amiss it appears red and in bold font, drawing your attention to the issue. Consider how this would look if we had used red and green…
  • Sparklines are clever data-rich graphics that manage to impose a low cognitive load on the viewer. Tufte’s examples list start and end values in the sequence, plus low and high points. In our case we simply show the last 48 results, plotting the start and end values. We felt that was enough for this application.
  • They are rendered using a delightfully simple bit of JavaScript – simply enclosing the sequence of numbers in a custom span. This means that the browser builds the graph on the fly – delayed auto build. It doesn’t work in a grumpy browser like IE that won’t support the canvas element, but that is not a problem for our implementation – Safari, Firefox or Chrome would do nicely.
  • Now let’s turn our attention to the deployment process.
  • 2,996,403 visits from 1st Oct to 31 Dec 2011 (Google Analytics). The system is busy. We had to make sure that our monitoring wouldn’t tip it over the edge during busy periods. Image source: http://misterysnake.homepage24.de/bilder/indieluftgehen.jpg
  • An AJAX query updates key data every minute – the number of live sessions, the app server sparklines and the number of connections. Database and app server disk space figures are pulled from the database when the page is first loaded, then cached for this user until a refresh is forced. The bottom performance graphic is read from a file – this is generated automatically by another building block.
  • The rest of the graphing uses the Google chart API – this makes the browser do the work. All the data is stored on the page and the API selects which portion to graph, so no JSP refresh is needed. The initial graph shows relative usage on the left-hand side and actual usage on the right-hand side – steps indicate physical growth in available disk space. It is then overlain with an interactive graph where we can change the metric displayed.
  • Standard building block interface – a degree of future proofing. 1. Thresholds – match our current risk appetite: how full should a disk get before we colour it red? 2. URLs – allows us to change the location of external tools at will. 3. Access control – use institutional roles and tabs to manage who sees what. Some of the pages that allow dynamic reconfiguration are visible to sys admins but not senior managers!
  • Standard building block interface – a degree of future proofing. Information provided here is displayed on individual server reports to help understand any differences. The server diagram is generated on the fly using Google chart APIs – this helps to ensure categorisation is correct. We can edit these data from here without a restart.
  • This page allows you to provide a friendly name for the various tables and disks – so everyone knows who to contact and what to ask about if they see an alert. This information is echoed on graphs of individual disks/servers.
  • Use tabs to easily switch views – incl. links to pages shown earlier. Key information is on the front page, but useful links are collected in one place on the others.
  • This is only the start… Image source: http://www.tuvie.com/wp-content/uploads/solid-future-car-concept1.jpg
  • This slide shows our first attempt at replacing some of the old apache load balancer pages with custom reports generated by querying the F5 appliance. Still very much a work in progress. This page updates automatically.
  • So what have we learned?
  • Keep it simple: one screen, grey is good, red is bad news. Lightweight: use AJAX calls to update the screen, and display content pulled from other sites. Dynamic reconfiguration: it is important that it stays up to date and has all the information you need. Use tabs to organise content and control access. Learn from others: there are plenty of books, and look at other dashboards around – which ones do you like/hate? It is doable – have a go yourself!
  • Developing a Service Dashboard: keeping an eye on things
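
The collection pattern described in the implementation notes (run a command, trim the output, append SQL insert statements) can be sketched as follows. This is a minimal illustration in Python, not the shell and awk scripts the team actually ran. The table and column names come from the script on slide 19, and the sample figures are copied from the same slide; the function names are hypothetical, and quoting every value as a string simply mirrors what that script appears to do. The sketch also applies the "do your numbers add up?" check by recomputing the percentage that df reports, assuming df rounds the figure up.

```python
"""Sketch: turn `df -P` output into SQL insert statements (illustrative only)."""

# Sample `df -P` output, copied from the figures shown on slide 19.
SAMPLE_DF = """\
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 16246428 3231304 12176772 21% /
/dev/mapper/vg0-data01 258030980 181684776 63239004 75% /data01
/dev/mapper/vg0-data03 309637120 284903840 9005280 97% /data03
"""


def parse_df(output):
    """Parse POSIX-format df output into a list of dicts, skipping the header."""
    rows = []
    for line in output.strip().splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[1] == "1024-blocks":
            continue  # header row or malformed line
        rows.append({
            "filesystem": fields[0],
            "blocks": int(fields[1]),
            "used": int(fields[2]),
            "available": int(fields[3]),
            "reported_percent": int(fields[4].rstrip("%")),
            "mount": " ".join(fields[5:]),
        })
    return rows


def percent_used(row):
    """Recompute the usage percentage, rounding up (assumed to match df)."""
    total = row["used"] + row["available"]
    return -(-100 * row["used"] // total)  # integer ceiling division


def sql_quote(value):
    """Wrap a value in single quotes, doubling any embedded quote."""
    return "'" + str(value).replace("'", "''") + "'"


def to_insert(name, row):
    """Build one insert statement for the dur_dashboard_data table."""
    values = [name, row["filesystem"], row["blocks"], row["used"], row["available"]]
    return ("insert into dur_dashboard_data "
            "(when, name, space, capacity, used, available) "
            "values (sysdate, " + ", ".join(sql_quote(v) for v in values) + ");")


if __name__ == "__main__":
    for row in parse_df(SAMPLE_DF):
        # Sanity check: does our arithmetic agree with what df printed?
        if percent_used(row) != row["reported_percent"]:
            print("-- mismatch for", row["mount"])
        print(to_insert("duocontent", row))
```

On a live server the sample text could be replaced with the output of `subprocess.run(["df", "-P"], capture_output=True, text=True).stdout`, with the statements appended to a file for a later SQL run, as in the original workflow.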

    1. Developing a Service Dashboard: keeping an eye on things 12th July 2012 Dr Malcolm Murray
    2. Bb: a complex system to manage
    3. People need a simple dashboard
    4. What should it look like?
    5. Complexity
    6. A complex system to manage F5
    7. Many interfaces. Too many interfaces
    8. What should it do? Monitor Measure Alert Email Report Make Coffee?
    9. Measurement
    10. What should we measure? Disk Space, Table Size, Load, Users
    11. Database Measures: Disks used, Table Size, Trends
    12. Database Measures
    13. Detail from the Sparkline
    14. Application Servers: Disk space, Load, Connections
    15. Users
    16. Digg Effect or DOS?
    17. Who’s Online Now? With thanks to Santo Nucifora at Seneca College
    18. Implementation
    19. A lot of shell scripts
        cd /local/bboard/blackboard/content
        df -P . | grep -v 1024-blocks | awk '{print "insert into dur_dashboard_data (when,name, space, capacity, used, available) values(sysdate,?duocontent?,?"$1"?,?"$2"?,?"$3"?,?"$4"?);"}' | sed "s^?^'^g" >> /local/home/bbuser/sc/intotable.sql

        Filesystem              1024-blocks       Used  Available  Capacity  Mounted on
        /dev/sda1                  16246428    3231304   12176772       21%  /
        tmpfs                      32995944    2401084   30594860        8%  /dev/shm
        /dev/mapper/vg0-s01        51606140    3914360   45070340        8%  /s01
        /dev/mapper/vg0-s02        51606140   14266484   34718216       30%  /s02
        /dev/mapper/vg0-data01    258030980  181684776   63239004       75%  /data01
        /dev/mapper/vg0-data02    258030980   88587208  156336572       37%  /data02
        /dev/mapper/vg0-data03    309637120  284903840    9005280       97%  /data03

        ssh bbuser@duoapp1 w
        10:23:05 up 3 days, 23:44, 0 users, load average: 0.02, 0.04, 0.00
        USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
    20. Database Tablespaces
        spool /local/home/bbuser/sc/tablespacereturn;
        SELECT Total.name "Tablespace Name",
               nvl(Free_space, 0) Free_space,
               nvl(total_space-Free_space, 0) Used_space,
               total_space
        FROM (select tablespace_name, sum(bytes/1024) Free_Space
              from sys.dba_free_space dfs
              group by tablespace_name) Free,
             (select b.name, sum(bytes/1024) TOTAL_SPACE
              from sys.v_$datafile a, sys.v_$tablespace B
              where a.ts# = b.ts#
              group by b.name) Total
        WHERE Free.Tablespace_name(+) = Total.name
        ORDER BY Total.name
        /
        spool off;
    21. Database Tablespaces
        Tablespace Name                FREE_SPACE USED_SPACE TOTAL_SPACE
        ------------------------------ ---------- ---------- -----------
        BBADMIN_DATA                        48384       2816       51200
        BBADMIN_INDX                        18048       2432       20480
        BB_BB60_DATA                      1766336  102734880   104501216
        BB_BB60_INDX                      1128512   29519808    30648320
        BB_BB60_STATS_DATA                3545344   76160752    79706096
        BB_BB60_STATS_INDX                6362880  141035728   147398608
        CMS_DATA                             4736      81664       86400
        CMS_DOC_DATA                        42368     467712      510080
        CMS_DOC_INDX                        23040     453184      476224
        CMS_FILES_COURSES_DATA             139392    2779776     2919168
        CMS_FILES_COURSES_INDX              93760    1867072     1960832
        CMS_FILES_INST_DATA                204928    4090752     4295680
        CMS_FILES_INST_INDX                414720    3681280     4096000
        CMS_FILES_LIBRARY_DATA              41600     299904      341504
        CMS_FILES_LIBRARY_INDX               9728     177664      187392
        CMS_FILES_ORGS_DATA                 44160     430592      474752
        CMS_FILES_ORGS_INDX                 13376     249792      263168
        CMS_FILES_USERS_DATA                28672     557056      585728
        CMS_FILES_USERS_INDX                25408     503424      528832
        CMS_INDX                             3136      60096       63232
        SYSAUX                              75392    1235328     1310720
        SYSTEM                               7616     832064      839680
        UNDOTBS1                          4676480     341120     5017600
        USERS                             1445056    1880384     3325440
    22. Stored in custom tables
    23. Is there a better way? Almost definitely! e.g. Zabbix. More details available: http://tinyurl.com/bbzabbix
    24. Simplicity
    25. Deployment Target
    26. Must work from my desk Hey folks, duoapp2 is in trouble… … and you need to tidy your desk
    27. 28
    28. 29
    29. 30
    30. 31
    31. Favourite? 1 2 3 4 5
    32. 33
    33. Learn from others: Stephen Few, Edward Tufte
    34. Careful use of colour
    35. Informative Graphics: Edward Tufte’s Sparklines
        Series                    1991.1.1  (65 months)  2004.4.28     low    high
        Euro foreign exchange $     1.1608                  1.1907   .8252  1.2858
        Euro foreign exchange ¥     121.32                  130.17   89.30  140.31
        Euro foreign exchange £     0.7111                  0.6665   .5711  0.7235
    36. Simple Code  <span class="sparkline"> 10 14 15 4.5 3.4 16 </span> http://code.google.com/p/js-sparklines/
    37. Deployment
    38. Light touch on production 3 million visits Oct – Dec 2011
    39. Update only the key metrics
    40. JavaScript Graphing http://code.google.com/apis/chart/
    41. Easy live reconfiguration
    42. Tailor to the Live System
    43. Use Meaningful Labels
    44. Tabs provide easy links
    45. What next? Better integration with the F5; more on NetApp disk usage; Java memory utilization; number of downloads per user; decide what to make public!
    46. F5 Reporting
    47. Summary
    48. Summary: Keep it simple; light touch on system being monitored; allow dynamic reconfiguration; manage access using tabs & roles; learn from others. Slides available at: http://db.tt/rp2D88Nt
    49. @malcolmmurray malcolm.murray@durham.ac.uk malcolm.murray@gmail.com We value your feedback! Please fill out a session evaluation.
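
As a rough illustration of how the tablespace figures on slides 20 and 21 might feed the "grey is good, red is bad" display, here is a small Python sketch. It is not the dashboard's actual code: the function names are hypothetical and the 90% warning threshold is an assumed value (the talk only says thresholds are configurable to match the team's risk appetite). The sample rows are copied from slide 21, and the check that free plus used equals total follows the advice to verify that your numbers add up.

```python
"""Sketch: flag tablespaces that exceed a warning threshold (illustrative only)."""

WARN_PERCENT = 90.0  # assumed threshold; the real value is site-specific

# Rows copied from slide 21: name, free, used, total (all in KB).
SAMPLE_TABLESPACES = """\
BBADMIN_DATA 48384 2816 51200
BB_BB60_DATA 1766336 102734880 104501216
UNDOTBS1 4676480 341120 5017600
"""


def parse_tablespaces(text):
    """Split each whitespace-separated row into (name, free, used, total)."""
    rows = []
    for line in text.strip().splitlines():
        name, free, used, total = line.split()
        rows.append((name, int(free), int(used), int(total)))
    return rows


def check(row, warn_percent=WARN_PERCENT):
    """Return (name, percent used, colour), verifying the figures add up."""
    name, free, used, total = row
    if free + used != total:
        raise ValueError("figures for %s do not add up" % name)
    percent = 100.0 * used / total
    colour = "red" if percent >= warn_percent else "grey"
    return name, round(percent, 1), colour


if __name__ == "__main__":
    for row in parse_tablespaces(SAMPLE_TABLESPACES):
        print("%s: %.1f%% used (%s)" % check(row))
```

Keeping the threshold as a parameter reflects the configuration page described in the notes, where thresholds can be changed without redeploying.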
