
Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests


The use of web accessibility evaluation tools is a widespread practice. Evaluation tools are heavily employed because they reduce the burden of identifying accessibility barriers. However, overreliance on automated tests often leads to setting aside further testing that entails expert evaluation and user tests. In this paper we empirically show the capabilities of current automated evaluation tools. To do so, we investigate the effectiveness of 6 state-of-the-art tools by analysing their coverage, completeness and correctness with regard to WCAG 2.0 conformance. We corroborate that relying on automated tests alone has negative effects and can have undesirable consequences. Coverage is very narrow: at most, 50% of the success criteria are covered. Similarly, completeness ranges between 14% and 38%; however, some of the tools that exhibit higher completeness scores produce lower correctness scores (66-71%), because catching as many violations as possible tends to increase false positives. Therefore, relying on automated tests alone entails that 1 of 2 success criteria will not even be analysed and, among those analysed, only 4 out of 10 violations will be caught, at the further risk of generating false positives.


  1. Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests. 10th International Cross-Disciplinary Conference on Web Accessibility (W4A 2013). Markel Vigo, University of Manchester (UK); Justin Brown, Edith Cowan University (Australia); Vivienne Conway, Edith Cowan University (Australia). http://dx.doi.org/10.6084/m9.figshare.701216
  2. Problem & Fact (W4A 2013, 13 May 2013): the WWW is not accessible.
  3. Evidence I: webmasters are familiar with accessibility guidelines. Lazar et al., 2004. Improving web accessibility: a study of webmaster perceptions. Computers in Human Behavior 20(2), 269–288.
  4. Hypothesis I: assuming guidelines do a good job... H1: awareness of accessibility guidelines is not that widespread.
  5. Evidence II: webmasters put compliance logos on non-compliant websites. Gilbertson and Machin, 2012. Guidelines, icons and marketable skills: an accessibility evaluation of 100 web development company homepages. W4A 2012.
  6. Hypothesis II: assuming webmasters are not trying to cheat... H2: there is a lack of awareness of the negative effects of overreliance on automated tools.
  7. Expanding on H2 — Why we rely on automated tests: • It's easy • In some scenarios it seems like the only option: web observatories, real-time monitoring... • We don't know how harmful they can be
  8. Expanding on H2 — Knowing the limitations of tools: • If we can measure these limitations, we can raise awareness • Inform developers and researchers • We ran a study with 6 tools • We computed coverage, completeness and correctness w.r.t. WCAG 2.0
  9. Method — Computed Metrics: • Coverage: whether a given Success Criterion (SC) is reported at least once • Completeness: true_positives / actual_violations • Correctness: 1 − false_positives / (true_positives + false_positives), i.e. true_positives / (true_positives + false_positives)
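The three metric definitions on slide 9 can be sketched as plain functions. This is a hypothetical helper, not the authors' code; only the formulas come from the slide.

```python
def coverage(reported_sc, all_sc):
    """Fraction of success criteria the tool reports at least once."""
    return len(set(reported_sc) & set(all_sc)) / len(set(all_sc))

def completeness(true_positives, actual_violations):
    """Share of the real violations that the tool catches."""
    return true_positives / actual_violations

def correctness(true_positives, false_positives):
    """Share of reported violations that are real (1 minus the FP ratio)."""
    return true_positives / (true_positives + false_positives)

# Illustrative numbers only (not from the study):
print(round(completeness(91, 650), 2))  # 0.14
print(round(correctness(66, 34), 2))    # 0.66
```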
  10. Method — Stimuli: • Vision Australia (www.visionaustralia.org.au): non-profit, non-government, accessibility resource • Prime Minister (www.pm.gov.au): Federal Government, should abide by the Transition Strategy • Transperth (www.transperth.wa.gov.au): government affiliated, used by people with disabilities
  11. Method — Obtaining the "Ground Truth": ad-hoc sampling → manual evaluation → agreement → ground truth
  12. Method — Computing Metrics: for every page in the sample, evaluate it with each tool (T1–T6), get the reports (R1–R6), compare each report with the ground truth (GT), and compute the metrics (M1–M6).
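The pipeline on slide 12 can be sketched as a loop over pages and tools. The data shapes below are assumptions (the actual tool reports and ground truth are not in this deck): each report and the ground truth map a page to a set of (success criterion, location) violations.

```python
# Hypothetical ground truth and tool reports, for illustration only.
ground_truth = {
    "page1": {("1.1.1", "img#logo"), ("2.4.4", "a#next")},
}
reports = {
    "T1": {"page1": {("1.1.1", "img#logo")}},
    "T2": {"page1": {("1.1.1", "img#logo"), ("1.1.1", "img#spacer")}},
}

metrics = {}
for tool, pages in reports.items():
    tp = fp = actual = 0
    for page, found in pages.items():
        gt = ground_truth[page]
        tp += len(found & gt)   # violations the tool caught
        fp += len(found - gt)   # reported but not in the ground truth
        actual += len(gt)       # all real violations on the page
    metrics[tool] = {
        "completeness": tp / actual,
        "correctness": tp / (tp + fp),
    }

print(metrics["T2"])
```

Note how T2 in this toy example catches no more real violations than T1 but pays for its extra report with a false positive, mirroring the completeness/correctness trade-off in the results.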
  13. Accessibility of Stimuli: [bar charts of violated success criteria (1.1.1–4.1.2, frequency 0–80) for each site — Vision Australia (www.visionaustralia.org.au), Prime Minister (www.pm.gov.au), Transperth (www.transperth.wa.gov.au)]
  14. Results — Coverage: • 650 WCAG Success Criteria violations (levels A and AA) • 23–50% of SC are covered by automated tests • Coverage varies across guidelines and tools
  15. Results — Completeness per tool: • Completeness ranges between 14% and 38% • Variable across tools and principles
  16. Results — Completeness per type of SC: • How conformance levels influence completeness • Wilcoxon signed-rank test: W=21, p<0.05 • Completeness is higher for level A success criteria
  17. Results — Completeness vs. accessibility: • How accessibility levels influence completeness • ANOVA: F(2,10)=19.82, p<0.001 • The less accessible a page is, the higher the completeness
  18. Results — Tool Similarity on Completeness: • Cronbach's α = 0.96 • Multidimensional Scaling (MDS) • Tools behave similarly
  19. Results — Correctness: • Tools with lower completeness scores exhibit higher correctness (93–96%) • Tools with higher completeness yield lower correctness (66–71%) • That is, the tools with the highest completeness are also the most incorrect ones
  20. Implications — Coverage: • We corroborate that 50% is the upper limit for automating guideline checks • Natural Language Processing could push this further: language (3.1.2 Language of Parts), domain (3.3.4 Error Prevention)
  21. Implications — Completeness I: • Automated tests do a better job on non-accessible sites and on level A success criteria • Automated tests aim at catching stereotypical errors
  22. Implications — Completeness II: • Strengths of tools can be identified across WCAG principles and SC • A method to inform decision making • Maximising completeness on our sample of pages: all tools combined, 55% (+17 percentage points); non-commercial tools only, 52%
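The tool-combination idea on slide 22 — pooling the true positives of several tools and recomputing completeness over their union — can be sketched as below. The violation identifiers and per-tool sets are made up; the deck only gives the resulting 55% figure.

```python
# Illustrative data: which real violations each tool catches.
actual_violations = {"v1", "v2", "v3", "v4", "v5", "v6", "v7", "v8"}
caught_by = {
    "T1": {"v1", "v2", "v3"},
    "T2": {"v2", "v3", "v4"},
    "T3": {"v1", "v5"},
}

def combined_completeness(tools):
    """Completeness of the union of true positives across the given tools."""
    union = set().union(*(caught_by[t] for t in tools))
    return len(union & actual_violations) / len(actual_violations)

best_single = max(combined_completeness([t]) for t in caught_by)
all_tools = combined_completeness(list(caught_by))
print(best_single, all_tools)  # the union beats any single tool
```

Because the tools' true-positive sets only partially overlap, the union outperforms the best single tool, which is the effect behind the +17 percentage points reported on the slide.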
  23. Conclusions: • Coverage: 23–50% • Completeness: 14–38% • Higher completeness leads to lower correctness
  24. Follow up — Contact: @markelvigo | markel.vigo@manchester.ac.uk • Presentation DOI: http://dx.doi.org/10.6084/m9.figshare.701216 • Datasets: http://www.markelvigo.info/ds/bench12/index.html
