QQML presentation



  1. Supporting PDF accessibility evaluation: early results from the FixRep project
     Andrew Hewson & Emma Tonkin [email_address] [email_address]
  2. Introduction
     • UKOLN, based at the University of Bath in the UK, is "A centre of excellence in digital information management, providing advice and services to the library, information and cultural heritage communities."
     • FixRep is an 18-month project examining existing techniques and implementations for automated formal metadata extraction, with the aim of enabling metadata triage.
  3. What is formal metadata?
     • Formal metadata:
       • Includes information such as file type, title, author and image captions.
       • Is mostly intrinsic to the document and its citation.
     • Could it include information of relevance to accessibility?
  4. What is accessibility?
     • Web accessibility means that people with disabilities can use the Web. (http://www.w3.org/WAI/intro/accessibility.php)
     • capable of being reached;
     • capable of being read with comprehension;
     • easily obtained;
     • easy to get along with or talk to; friendly;
     • (http://wordnetweb.princeton.edu/perl/webwn)
  5. PDF format
     • Web-based uses of relevance to digital libraries include, for example:
       • forms
       • printable versions of resources
       • pre-prints of papers and articles.
     • A very common format found in institutional repositories.
  6. Document accessibility
     • Can we aspire to a perfectly accessible repository?
     • Careful editing and repository management take time and are labour-intensive for administrators and users.
     • Finding a balance between quantity and quality, i.e. maximising the usability of repository content, is the realistic goal.
     • Not strict validation, but support for user-level review and triage.
  7. Research questions
     • What span of content appears in a document repository that enables user deposit?
     • Does this variation in document format imply a reduction in accessibility? If so, what sort of reduction, for whom, and to what extent?
     • Is it possible for us to automatically identify issues that may be of particular concern, or to identify good practice where it is used?
     • Separating non-optimal features from show-stopper problems.
  8. Methodology #1: Prototype
     • A prototype has been developed for the analysis of PDFs. It extracts information about the document in a number of ways:
       • Header and formatting analysis
       • Information from the body of the document
       • Information from the originating filesystem
     • The prototype is based on Unix tools: it has been developed in Perl using pdfinfo, pdftotext and pdfimages, as well as a number of CPAN modules.
     • It uses a REST service API.
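The header analysis step can be illustrated with a minimal sketch. It is written in Python purely for illustration (the prototype itself was written in Perl), and the helper names `parse_pdfinfo` and `extract_fields`, as well as the sample text, are assumptions rather than anything taken from the FixRep code. It relies only on the fact that `pdfinfo` prints one `Key: value` pair per line.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo-style 'Key: value' lines into a dict.

    Hypothetical helper: splits each line on the first colon only,
    so values that contain colons (such as dates) stay intact.
    """
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines that contain no colon at all
            fields[key.strip()] = value.strip()
    return fields


def extract_fields(path):
    """Run the pdfinfo command-line tool on one file and parse its output.

    Assumes pdfinfo is installed and on PATH; raises CalledProcessError
    if the tool cannot process the file.
    """
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_pdfinfo(result.stdout)


# Invented sample in the shape of pdfinfo output, for demonstration only.
SAMPLE_OUTPUT = """\
Title:          Example pre-print
Creator:        Example Word Processor
Producer:       Example PDF Library
CreationDate:   Mon Jan  4 10:15:00 2010
Tagged:         no
Pages:          12
PDF version:    1.4
"""

if __name__ == "__main__":
    for key, value in parse_pdfinfo(SAMPLE_OUTPUT).items():
        print(f"{key} = {value}")
```

Keeping the parsing separate from the subprocess call means the parser can be exercised on stored output without the tool being installed.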
  9. Methodology #2: Pilot Case Study
     • OPUS Repository (University of Bath)
     • Spidered site to identify PDFs
     • PDFs cached offline
     • Analysed via batch process
     • Responses placed in a MySQL database
     • Data analysis completed manually via SQL queries.
     • Automating the analysis process is a goal for future iterations of the project.
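The store-then-query step of the pilot can be sketched as follows. This is an assumption-laden illustration, not the project's schema: it uses Python's built-in `sqlite3` module as a stand-in for MySQL, and the table name `pdf_results`, its columns, and the sample rows are all invented.

```python
import sqlite3

# Invented sample rows: (url, status, pdf_version, producer).
SAMPLE_ROWS = [
    ("http://example.org/a.pdf", "ok", "1.4", "Producer A"),
    ("http://example.org/b.pdf", "ok", "1.4", "Producer A"),
    ("http://example.org/c.pdf", "ok", "1.5", "Producer B"),
    ("http://example.org/d.pdf", "ok", "1.4", "Producer B"),
    ("http://example.org/e.pdf", "failed", None, None),
]


def build_db(rows):
    """Create an in-memory database holding one row per analysed PDF."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE pdf_results (
            url TEXT PRIMARY KEY,
            status TEXT NOT NULL,
            pdf_version TEXT,
            producer TEXT
        )
        """
    )
    conn.executemany("INSERT INTO pdf_results VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def version_counts(conn):
    """Count successfully processed documents per PDF version."""
    query = """
        SELECT pdf_version, COUNT(*) AS n
        FROM pdf_results
        WHERE status = 'ok'
        GROUP BY pdf_version
        ORDER BY n DESC, pdf_version
    """
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    for version, n in version_counts(build_db(SAMPLE_ROWS)):
        print(version, n)
```

The same `GROUP BY` pattern would apply to any other extracted field, and wrapping such queries in functions is one plausible route towards the automated analysis mentioned as a future goal.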
  10. Results
     • Proportion of documents successfully processed
       • 80% were successfully batch processed, with the results stored in the database.
       • The 20% that failed exhibited two categories of errors:
         • No metadata was available for extraction
         • Format of file unsupported by toolset
  11. Results
     • XML tag use
       • Small number of tags used (26)
       • Usage was consistent (average 21, mode 21)
       • Some ‘traditional’ tags were absent in most cases (author, title, etc.)
  12. Results
     • PDF versions
       • The most popular version seems to be 1.4; however, this might be attributable to the ‘Creator’ software used to generate the PDFs in the sample, in particular to the addition of a ‘cover sheet’ before documents were added to the OPUS repository.
  13. Results
     • ‘Producer’ and ‘Creator’
       • The values of both tags are disproportionately concentrated on two applications (compared with an expected normal distribution).
       • It is likely, as with the favoured PDF version, that this is an artefact of the cover sheet addition to the PDFs.
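One simple way to quantify this kind of concentration is to measure what share of documents the most frequent values account for. The sketch below is illustrative only: the function name `top_share` and the input values are hypothetical, and it is written in Python rather than the Perl and SQL used in the project.

```python
from collections import Counter


def top_share(values, k=2):
    """Return the k most common values and their combined share.

    Missing values (None or empty strings) are ignored, so the share
    is relative to documents that actually carry the tag.
    """
    counts = Counter(v for v in values if v)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    top = counts.most_common(k)
    return top, sum(n for _, n in top) / total


if __name__ == "__main__":
    # Invented 'Producer' values standing in for one database column.
    producers = ["A", "A", "A", "B", "B", "C", None, ""]
    top, share = top_share(producers)
    print(top, f"{share:.0%}")
```

A share close to 1 for `k=2` would correspond to the pattern described on the slide, where two applications account for nearly all documents.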
  14. Discussion
     • The ‘cover sheet’ issue
       • As mentioned, a cover sheet has been prepended to many of the PDFs examined.
       • This might not seem to be an issue; however, as can be seen here, it might confuse automated systems, rendering the metadata virtually useless.
  15. Conclusions
     • Good news! More tagged PDFs around than expected.
     • Bad news! We may be ‘shooting ourselves in the foot’ with additions like after-the-fact cover sheets. These may remove original metadata that could have been utilised for machine learning.
     • This prototype tool has already proved very useful and we plan to develop it further.