
Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests

The use of web accessibility evaluation tools is a widespread practice. Evaluation tools are heavily employed because they reduce the burden of identifying accessibility barriers. However, overreliance on automated tests often leads to setting aside further testing that entails expert evaluation and user tests. In this paper we empirically show the capabilities of current automated evaluation tools. To do so, we investigate the effectiveness of 6 state-of-the-art tools by analysing their coverage, completeness and correctness with regard to WCAG 2.0 conformance. We corroborate that relying on automated tests alone has negative effects and can have undesirable consequences. Coverage is very narrow: at most, 50% of the success criteria are covered. Similarly, completeness ranges between 14% and 38%; however, some of the tools that exhibit higher completeness scores produce lower correctness scores (66-71%), because catching as many violations as possible tends to increase false positives. Therefore, relying on automated tests alone means that 1 out of 2 success criteria will not even be analysed and that, among those analysed, only 4 out of 10 violations will be caught, at the further risk of generating false positives.



  1. Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests
     10th International Cross-Disciplinary Conference on Web Accessibility, W4A2013
     Markel Vigo, University of Manchester (UK); Justin Brown, Edith Cowan University (Australia); Vivienne Conway, Edith Cowan University (Australia)
     http://dx.doi.org/10.6084/m9.figshare.701216
  2. Problem & Fact (W4A2013, 13 May 2013)
     The WWW is not accessible.
  3. Evidence
     Webmasters are familiar with accessibility guidelines.
     Lazar et al., 2004. Improving web accessibility: a study of webmaster perceptions. Computers in Human Behavior 20(2), 269–288.
  4. Hypothesis I
     Assuming guidelines do a good job...
     H1: awareness of accessibility guidelines is not that widespread.
  5. Evidence II
     Webmasters put compliance logos on non-compliant websites.
     Gilbertson and Machin, 2012. Guidelines, icons and marketable skills: an accessibility evaluation of 100 web development company homepages. W4A 2012.
  6. Hypothesis II
     Assuming webmasters are not trying to cheat...
     H2: a lack of awareness of the negative effects of overreliance on automated tools.
  7. Expanding on H2: Why we rely on automated tests
     • It's easy
     • In some scenarios it seems like the only option: web observatories, real-time evaluation...
     • We don't know how harmful these tools can be
  8. Expanding on H2: Knowing the limitations of tools
     • If we are able to measure these limitations, we can raise awareness
     • Inform developers and researchers
     • We ran a study with 6 tools
     • We computed coverage, completeness and correctness with regard to WCAG 2.0
  9. Method: Computed metrics
     • Coverage: whether a given Success Criterion (SC) is reported at least once
     • Completeness: true_positives / actual_violations
     • Correctness: true_positives / (true_positives + false_positives)
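The three metric definitions above can be sketched directly in Python. This is an illustrative sketch, not the authors' code; the function and argument names simply follow the slide's notation:

```python
def coverage(reported_sc, all_sc):
    """Fraction of Success Criteria (SC) that a tool reports at least once."""
    return len(set(reported_sc) & set(all_sc)) / len(all_sc)

def completeness(true_positives, actual_violations):
    """Share of the actual violations (ground truth) that the tool caught."""
    return true_positives / actual_violations

def correctness(true_positives, false_positives):
    """Share of the tool's reported violations that are real violations."""
    return true_positives / (true_positives + false_positives)
```

For example, a tool that catches 7 of 20 ground-truth violations has completeness 0.35; one that reports 93 real and 7 spurious violations has correctness 0.93.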
  10. Method: Stimuli
     • Vision Australia (www.visionaustralia.org.au): non-profit, non-government, accessibility resource
     • Prime Minister (www.pm.gov.au): federal government, should abide by the Transition Strategy
     • Transperth (www.transperth.wa.gov.au): government affiliated, used by people with disabilities
  11. Method: Obtaining the "Ground Truth"
     Ad-hoc sampling → manual evaluation → agreement → ground truth
  12. Method: Computing metrics
     For every page in the sample, each tool (T1-T6) evaluates the page and produces a report (R1-R6); each report is compared with the ground truth (GT) to compute that tool's metrics (M1-M6).
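The comparison step can be modelled as a set operation per page: a tool's reported violations are split into true and false positives against the ground truth. A hypothetical sketch, where reports and the GT are sets of (SC, element) findings:

```python
def score_report(report, ground_truth):
    """Compare one tool's report against the manually built ground truth (GT).

    Both arguments are sets of findings, e.g. ("1.1.1", "img#logo").
    Returns (completeness, correctness) for this page.
    """
    true_positives = report & ground_truth    # real violations the tool caught
    false_positives = report - ground_truth   # reported, but not actual violations
    completeness = len(true_positives) / len(ground_truth)
    correctness = len(true_positives) / len(report) if report else 1.0
    return completeness, correctness
```

Averaging these per-page scores over the sample yields one metrics vector per tool.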
  13. Accessibility of stimuli
     [Three bar charts: frequency (0-80) of violated WCAG 2.0 success criteria (1.1.1 through 4.1.2) for Vision Australia (www.visionaustralia.org.au), Prime Minister (www.pm.gov.au) and Transperth (www.transperth.wa.gov.au)]
  14. Results: Coverage
     • 650 WCAG Success Criteria violations (levels A and AA)
     • 23-50% of SC are covered by automated tests
     • Coverage varies across guidelines and tools
  15. Results: Completeness per tool
     • Completeness ranges between 14% and 38%
     • Variable across tools and principles
  16. Results: Completeness per type of SC
     • How conformance levels influence completeness
     • Wilcoxon Signed Rank: W=21, p<0.05
     • Completeness levels are higher for level A SC
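The Wilcoxon signed-rank test pairs, for each tool, its completeness on level A versus level AA success criteria. A minimal pure-Python sketch of the W+ statistic with made-up scores (no tie correction, unlike a full implementation such as scipy.stats.wilcoxon); with 6 paired observations and every A score above its AA counterpart, W+ reaches its maximum of 1+2+...+6 = 21:

```python
def wilcoxon_w_plus(x, y):
    """Signed-rank statistic W+ for paired samples.

    Ranks the absolute differences (smallest = rank 1) and sums the
    ranks of the positive differences. Ties are ranked naively.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    return sum(rank + 1 for rank, i in enumerate(order) if diffs[i] > 0)

# Hypothetical completeness per tool: level A vs level AA criteria
a_scores = [0.30, 0.35, 0.20, 0.40, 0.25, 0.30]
aa_scores = [0.10, 0.20, 0.15, 0.30, 0.20, 0.10]
```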
  17. Results: Completeness vs. accessibility
     • How accessibility levels influence completeness
     • ANOVA: F(2,10)=19.82, p<0.001
     • The less accessible a page is, the higher the completeness
  18. Results: Tool similarity on completeness
     • Cronbach's α = 0.96
     • Multidimensional Scaling (MDS)
     • Tools behave similarly
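Cronbach's α measures how consistently the "items" (here, treating each tool as an item and its completeness scores over the pages as observations) vary together: α = k/(k-1) · (1 − Σs²ᵢ / s²_total). A minimal sketch with this assumed setup; identical tools would give α = 1:

```python
from statistics import variance  # sample variance

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item (here, per tool),
    aligned over the same observations (here, the sampled pages)."""
    k = len(item_scores)
    totals = [sum(obs) for obs in zip(*item_scores)]  # sum across items, per page
    sum_item_var = sum(variance(scores) for scores in item_scores)
    return k / (k - 1) * (1 - sum_item_var / variance(totals))
```

A value near 1 (such as the reported 0.96) means the tools' completeness profiles rise and fall together across pages, i.e. the tools behave similarly.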
  19. Results: Correctness
     • Tools with lower completeness scores exhibit higher correctness (93-96%)
     • Tools with higher completeness yield lower correctness (66-71%): the most complete tools are also the most incorrect ones
  20. Implications: Coverage
     • We corroborate that 50% is the upper limit for automating guidelines
     • Natural Language Processing?
       - Language: 3.1.2 Language of Parts
       - Domain: 3.3.4 Error Prevention
  21. Implications: Completeness I
     • Automated tests do a better job on non-accessible sites and on level A success criteria
     • Automated tests aim at catching stereotypical errors
  22. Implications: Completeness II
     • Strengths of tools can be identified across WCAG principles and SC
     • A method to inform decision making
     • Maximising completeness in our sample of pages:
       - On all tools: 55% (+17 percentage points)
       - On non-commercial tools: 52%
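The combined-completeness figures come from pooling several tools: each tool contributes the real violations it catches, and the union's share of the ground truth is the completeness of the combination. A hypothetical sketch, again modelling reports and the GT as sets of findings:

```python
def combined_completeness(reports, ground_truth):
    """Completeness of the union of several tools' reports against the GT."""
    caught = set()
    for report in reports:
        caught |= report & ground_truth  # keep only each tool's true positives
    return len(caught) / len(ground_truth)
```

Because different tools catch partly disjoint violations, the union can exceed the best single tool's completeness, which is how pooling lifts the figure from 38% to 55% in the paper's sample.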
  23. Conclusions
     • Coverage: 23-50%
     • Completeness: 14-38%
     • Higher completeness leads to lower correctness
  24. Follow up
     Contact: @markelvigo | markel.vigo@manchester.ac.uk
     Presentation DOI: http://dx.doi.org/10.6084/m9.figshare.701216
     Datasets: http://www.markelvigo.info/ds/bench12/index.html
