Supporting PDF accessibility evaluation: Early results from the FixRep project

This presentation presents results from a pilot study exploring automated formal metadata extraction in accessibility evaluation. We demonstrate a prototype created during the FixRep project that aims to support capture, storage and reuse of accessibility information where available, and to approach the problem of reconstructing required data from available sources.

  1. Supporting PDF accessibility evaluation: early results from the FixRep project
     Andrew Hewson & Emma Tonkin ([email_address], e.tonkin@ukoln.ac.uk)
  2. Introduction
     • UKOLN, based at the University of Bath in the UK, is “a centre of excellence in digital information management, providing advice and services to the library, information and cultural heritage communities.”
     • FixRep is an 18-month project that examines existing techniques and implementations for automated formal metadata extraction, with the aim of enabling metadata triage.
  3. What is formal metadata?
     • Formal metadata includes information such as filetype, title, author and image captions.
     • It is mostly intrinsic to the document and its citation.
     • Could it include information of relevance to accessibility?
  4. What is accessibility?
     • Web accessibility means that people with disabilities can use the Web. (http://www.w3.org/WAI/intro/accessibility.php)
     • capable of being reached;
     • capable of being read with comprehension;
     • easily obtained;
     • easy to get along with or talk to; friendly;
     • (http://wordnetweb.princeton.edu/perl/webwn)
  5. PDF format
     • Web-based uses of relevance to digital libraries include, for example:
       • forms
       • printable versions of resources
       • pre-prints of papers and articles.
     • A very common format found in institutional repositories.
  6. Document accessibility
     • Can we aspire to a perfectly accessible repository?
     • Careful editing / repository management takes time and is labour intensive for administrators and users.
     • Finding a balance between quantity and quality, i.e. maximising usability of repository content, is the realistic goal.
     • Not strict validation, but support for user level review / triage.
  7. Research questions
     • What span of content appears in a document repository that enables user deposit?
     • Does this variation in document format imply a reduction in accessibility? If so, what sort of reduction, to whom, and to what extent?
     • Is it possible for us to automatically identify issues that may be of particular concern, or for us to identify good practice where it is used?
     • Separating non-optimal features from show-stopper problems.
  8. Methodology #1: Prototype
     • A prototype has been developed for analysis of PDFs. This extracts information about the document in a number of ways:
       • Header and formatting analysis
       • Information from the body of the document
       • Information from the originating filesystem
     • Based on Unix tools, the prototype has been developed in Perl using pdfinfo, pdftotext, and pdfimages, as well as a number of CPAN modules.
     • It uses a REST service API.
  9. Methodology #2: Pilot Case Study
     • OPUS Repository (University of Bath)
     • Spidered site to identify PDFs
     • PDFs cached offline
     • Analysed via batch process
     • Responses placed in MySQL database
     • Data analysis process completed manually via SQL queries.
     • Automation of the analysis process is a goal for future iterations of the project.
  10. Results
      • Proportion of documents successfully processed
      • 80% were successfully batch processed with the results stored in the database
      • The 20% that failed exhibited two categories of errors:
        • No metadata was available for extraction
        • Format of file unsupported by toolset
  11. Results
      • XML Tag use
      • Small number of tags used (26)
      • Usage was consistent (average 21, mode 21)
      • Some ‘traditional’ tags were absent in most cases (author, title, etc.)
  12. Results
      • PDF Versions
      • Most popular version seems to be 1.4 – however this might be attributable to the ‘Creator’ software used to generate the PDFs in the sample: in particular due to the addition of a ‘cover sheet’ before being added to the OPUS repository.
  13. Results
      • ‘Producer’ and ‘Creator’
      • These two tags both show disproportionate favouritism for two applications (compared with an expected normal distribution)
      • It is likely, as with the favoured PDF version, that this is an artefact of the cover sheet addition to the PDFs.
  14. Discussion
      • The ‘cover sheet’ issue
      • As mentioned, a cover sheet has been prepended to many of the PDFs examined.
      • This might not seem to be an issue; however, as can be seen here, it might confuse automated systems, rendering the metadata virtually useless.
  15. Conclusions
      • Good news! More tagged PDFs around than expected.
      • Bad news! We may be ‘shooting ourselves in the foot’ with additions like after-the-fact cover sheets. This may remove original metadata that could have been utilised for machine learning.
      • This prototype tool has already proved very useful and we plan to develop it further.
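
The header analysis step described in slide 8 (Methodology #1) can be illustrated with a small sketch. The FixRep prototype itself was written in Perl and its code is not shown in the slides, so the Python below is only an assumed approximation: it runs the `pdfinfo` command-line tool and parses its "Key: value" report into a dictionary. The function names `parse_pdfinfo` and `pdf_header_metadata` and the sample report are hypothetical and serve only as illustration.

```python
import subprocess


def parse_pdfinfo(text: str) -> dict[str, str]:
    """Turn pdfinfo's 'Key:   value' lines into a dictionary."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as dates keep theirs.
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def pdf_header_metadata(path: str) -> dict[str, str]:
    """Run pdfinfo on one file and parse its report (needs pdfinfo installed)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Hypothetical pdfinfo report, for illustration only (not data from the study).
sample = """Title:          Example paper
Creator:        Writer
Producer:       ExampleLib 1.0
Tagged:         yes
Pages:          12
PDF version:    1.4
"""
info = parse_pdfinfo(sample)
print(info["PDF version"], info["Tagged"])
```

Keeping the parser separate from the subprocess call means the parsing logic can be checked without `pdfinfo` installed. A failed `pdfinfo` run raises `subprocess.CalledProcessError`, which a batch process could catch and record as an unprocessable document.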

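The storage and query steps of the pilot study (slide 9), which fed the version and ‘Producer’/‘Creator’ results, might look roughly like the sketch below. It is not the project's actual schema or queries: Python's built-in `sqlite3` stands in for the MySQL database named in the slides, and the table `pdf_metadata`, its columns, and the sample rows are all made up for illustration.

```python
import sqlite3

# In-memory SQLite stands in for the MySQL database named in the slides.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pdf_metadata ("
    " url TEXT PRIMARY KEY, pdf_version TEXT, producer TEXT, tagged INTEGER)"
)

# Made-up rows for illustration; these are not results from the study.
rows = [
    ("doc1.pdf", "1.4", "ToolA", 1),
    ("doc2.pdf", "1.4", "ToolA", 0),
    ("doc3.pdf", "1.6", "ToolB", 1),
    ("doc4.pdf", "1.4", "ToolA", 1),
]
conn.executemany("INSERT INTO pdf_metadata VALUES (?, ?, ?, ?)", rows)


def distribution(column: str) -> list[tuple[str, int]]:
    """Count documents per distinct value of one metadata column."""
    # Column names cannot be bound as SQL parameters, so allow known ones only.
    if column not in {"pdf_version", "producer"}:
        raise ValueError(f"unsupported column: {column}")
    query = (
        f"SELECT {column}, COUNT(*) AS n FROM pdf_metadata "
        f"GROUP BY {column} ORDER BY n DESC, {column}"
    )
    return conn.execute(query).fetchall()


print(distribution("pdf_version"))
print(distribution("producer"))
```

A grouped count like this makes a skew towards one PDF version or one producing application easy to spot, which is the kind of pattern the slides attribute to the cover sheet.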