Document Delivery Formats for the World Wide Web


  1. Document Delivery Formats for the Web and Legal Digital Collections
     Kevin Reiss, June 18th, 2004
     Law Library, Rutgers-Newark School of Law
  2. Delivery Formats & Issues
     • Delivery format: the type of file a user receives when accessing a document in a digital collection
     • Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing
     • No single format is right for every type of collection.
     • Important issues to consider:
         • Open v. Closed Formats
         • Usability and Accessibility
         • Subject Specific Concerns for Legal Materials
  3. Open v. Closed Formats
     • Who is "in control" of the document format you choose? A standards body? A single company or organization?
     • Can you count on something that one entity controls to be supported over time?
     • Advantages of Open Formats (a.k.a. Standards)
         • Interoperability and support over time
         • Integrate well with open-source or low-cost processing and IR tools
         • Help web content providers who need to support an increasing variety of devices and platforms
  4. Usability & Accessibility
     • What software do users need to view a particular format?
     • Can a web browser natively display it?
     • If the format requires a browser plug-in:
         • Is it free? Are users likely to have it installed?
         • Does it work on all computing platforms?
     • Do public search engines index the format?
     • Can dial-up modem users access the material in the collection?
  5. Subject Specific Concerns for Legal Materials
     • Legal digital projects usually manage texts, not images.
     • Some types of legal materials are harder to maintain, e.g., codified material.
     • Legal documents are almost exclusively printed in black & white.
     • Preservation of the page structure is important for citation purposes.
     • Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.
  6. Possible Delivery Formats
     • Pure image formats: TIFF, JPEG
     • Open encoded formats: XML, HTML, ASCII, and Unicode
     • Hybrid formats: PDF, DjVu (can contain both image and text)
     • Proprietary formats: Microsoft Word, WordPerfect
  7. Pure Images: TIFF, JPEG
     • Raster (pixel-based) formats, used exclusively for scanned collections
     • TIFF is the best choice for archival scanned images
     • Pros
         • Web browsers display them natively
         • Both are open formats
     • Cons
         • Large file sizes make viewing on slow connections problematic
         • Text of the documents is available only through OCR (Optical Character Recognition)
         • Weak support for multi-page documents
         • Text in JPEGs becomes hard to read when they are compressed to levels appropriate for the web
         • Contain metadata about the physical file itself, not the contents of the file
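To make the file-size concern on slide 7 concrete, here is a rough back-of-the-envelope sketch. Every figure in it is an assumption chosen for illustration, not a number from the slides: a letter-size page, scanned at 300 dpi in 1-bit black and white, stored uncompressed, and fetched over a nominal 56 kbit/s modem with no protocol overhead. Bitonal TIFFs are commonly stored with CCITT Group 4 compression, so real files are usually much smaller than this upper bound.

```python
# Rough transfer-time estimate for one scanned page.
# All inputs below are illustrative assumptions, not values from the slides.

def uncompressed_page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Size in bytes of an uncompressed raster image of one page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def transfer_seconds(size_bytes, link_bits_per_second):
    """Idealized transfer time, ignoring latency and protocol overhead."""
    return size_bytes * 8 / link_bits_per_second


# Letter-size page, 300 dpi, 1 bit per pixel (black and white).
page_bytes = uncompressed_page_bytes(8.5, 11, 300, 1)
# Nominal 56 kbit/s dial-up modem.
seconds = transfer_seconds(page_bytes, 56_000)

print(f"{page_bytes} bytes per page, about {seconds:.0f} s at 56 kbit/s")
```

Under these assumptions a single page is about 1 MB and takes roughly two and a half minutes to download, which is why the compression built into the hybrid formats matters for dial-up users.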
  8. Image Formats Cont.
     • OCR is an important consideration:
         • A 5% error rate doesn't affect traditional IR measures
         • A 20% error rate significantly degrades the performance of traditional IR techniques [Doermann 98]
         • High quality OCR is now available for relatively low cost
             • ABBYY FineReader ($300)
             • Table and page layout recognition supported
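One way to check where a collection falls relative to the 5% and 20% thresholds on slide 8 is to compare OCR output against a hand-corrected sample. The slide does not say whether its rates are counted per character or per word; the sketch below assumes a character-level rate, computed as edit distance divided by the length of the reference text. The function names and the sample strings are made up for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance as a fraction of the reference text's length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: a corrected line and its raw OCR output.
reference = "the court held that the statute applies"
ocr_output = "the c0urt held that tbe statute appl1es"
rate = character_error_rate(reference, ocr_output)
print(f"character error rate: {rate:.1%}")
```

Running this over a few sampled pages gives a quick estimate of whether uncorrected OCR is likely to be good enough for full-text indexing.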
  9. Open Encoded Formats: XML, HTML, ASCII, Unicode
     • Typically easier to integrate into digital libraries [Baird 2004]
     • Created in 3 ways:
         • Born digital documents
         • Manually keyed documents
         • Corrected OCR
     • IR applications are easy to build; open source support is strong
     • International standards or W3C recommendations
     • Accessible with all current web technologies
     • Metadata easily embedded in XML/HTML documents
     • Can be created with any text editor
     • Improvements in OCR make encoding scanned collections feasible
  10. Open Encoded Formats Cont.
     • Cons:
         • These documents can be expensive for staff to create
             • Encoding in XML may have to be done by hand
             • Manual correction of OCR errors
         • Need technical expertise on staff (e.g., a Perl programmer) to get the full benefits of these formats
         • These don't necessarily preserve the "look" of printed documents
  11. Hybrid Formats: PDF, DjVu
     • PDF and DjVu are proprietary technologies that have substantial support in the open source community.
     • Both can contain a layer of the document’s text and an image of each page in a document.
     • Both utilize cross-platform, freely available web browser plug-ins.
     • Both try to preserve the look of print documents.
     • Easy to export born digital documents to these formats using printer drivers (e.g., “print to PDF”)
  12. Adobe PDF
     • Pros:
         • PDF has strong market acceptance in the legal community
         • PDF-Archive (PDF/A), a standard for using PDF as an archival format, is in development by AIIM (Association for Information and Image Management)
         • Adobe makes the PDF reference manual and software development kit freely available to developers.
         • Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
     • Cons:
         • Plug-in performance is poor for long documents
         • PDFs composed of scanned images can be very large, even for short documents
  13. DjVu
     • Designed to be a scan-to-web technology.
     • Pros:
         • Best compression of any image format on the web
         • Users can load lengthy documents very quickly
         • The DjVu plug-in can be manipulated via CGI-style arguments
         • Use the Any2DjVu server to try out the format.
     • Cons:
         • DjVu does not yet have great market acceptance in the legal community.
         • DjVu does not have a standard method for embedding metadata within documents.
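The CGI-style arguments mentioned on slide 13 can be illustrated with a small URL builder. This is a sketch under assumptions: as I recall the DjVu plug-in documentation, viewer options follow a `djvuopts` keyword in the query string, and `page` and `zoom` are among the options. Check both against the documentation of the viewer you actually deploy. The document URL and the helper name `djvu_view_url` are hypothetical.

```python
from urllib.parse import urlencode


def djvu_view_url(document_url, **options):
    """Append viewer options after the 'djvuopts' keyword (assumed syntax)."""
    if not options:
        return document_url
    return f"{document_url}?djvuopts&{urlencode(options)}"


# Hypothetical document: open at page 12, zoomed to fit the window width.
url = djvu_view_url("http://example.org/statutes.djvu", page=12, zoom="width")
print(url)
```

A link built this way could take a user from a search result straight to the cited page, which fits the emphasis on page structure for citation on slide 5.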
  14. Proprietary Formats
     • Word processing formats: MS Word, WordPerfect
     • Not a good choice for document delivery on the web
     • Cons:
         • These formats are completely closed
         • Poor cross-platform support
         • It is often problematic to index these documents using inexpensive or open source IR tools.
  15. The New Jersey Digital Legal Library
     • URL: http://njlegallib.rutgers.edu
     • Digitizes New Jersey legal materials not currently available online.
     • Available to users in two formats: DjVu and PDF
     • Current workflow:
         • Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
         • Extract OCR text from the DjVu to XHTML using XSL stylesheets and DjVuLibre (the open source DjVu library)
         • Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
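The extraction step in the workflow on slide 15 can be sketched as follows. This is not the project's actual code: the slides describe XSL stylesheets, and this example swaps in Python's standard `xml.etree.ElementTree` to show the same idea. The input is a simplified, made-up fragment modeled on my recollection of the hidden-text XML that DjVuLibre can export (`OBJECT` per page, then `PARAGRAPH`, `LINE`, `WORD`); real output carries more elements and attributes. The metadata names and values are also placeholders.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for DjVu hidden-text XML (structure is an assumption).
SAMPLE_DJVU_XML = """
<DjVuXML><BODY>
  <OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
    <LINE><WORD>An</WORD><WORD>Act</WORD></LINE>
    <LINE><WORD>concerning</WORD><WORD>roads</WORD></LINE>
  </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
  <OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
    <LINE><WORD>Section</WORD><WORD>1.</WORD></LINE>
  </PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>
</BODY></DjVuXML>
"""


def djvu_xml_to_xhtml(djvu_xml, metadata):
    """Build an XHTML string with Dublin Core <meta> tags and one <div> per page."""
    source = ET.fromstring(djvu_xml)
    html = ET.Element("html", xmlns="http://www.w3.org/1999/xhtml")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = metadata.get("DC.title", "")
    for name, value in metadata.items():
        ET.SubElement(head, "meta", name=name, content=value)
    body = ET.SubElement(html, "body")
    # Keep page boundaries so search hits can be tied back to a citable page.
    for number, page in enumerate(source.iter("OBJECT"), start=1):
        div = ET.SubElement(body, "div", {"class": "page", "id": f"page-{number}"})
        for paragraph in page.iter("PARAGRAPH"):
            lines = [
                " ".join(word.text or "" for word in line.iter("WORD"))
                for line in paragraph.iter("LINE")
            ]
            ET.SubElement(div, "p").text = " ".join(lines)
    return ET.tostring(html, encoding="unicode")


# Placeholder metadata; a real record would come from the catalog.
xhtml = djvu_xml_to_xhtml(SAMPLE_DJVU_XML, {"DC.title": "Sample Act", "DC.date": "1900"})
print(xhtml)
```

The resulting XHTML files are plain, open encoded text, so an indexer such as swish-e can be configured to pick up both the body text and the `meta` fields.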
  16. References
     • Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto, CA, 2004.
     • Doermann, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998, pp. 287-298.
