Document Delivery Formats for the World Wide Web
Usage Rights

© All Rights Reserved

Presentation Transcript

  • Document Delivery Formats for the Web and Legal Digital Collections. Kevin Reiss. June 18th, 2004. Law Library, Rutgers-Newark School of Law
  • Delivery Formats & Issues
    • Delivery Format: the type of file a user receives when accessing a document in a digital collection
    • Important not just for viewing, but also for Information Retrieval (IR) tasks like full-text indexing
    • There is no one format that is right for every type of collection.
    • Important issues to consider:
      • Open v. Closed Formats
      • Usability and Accessibility
      • Subject Specific Concerns for Legal Materials
  • Open v. Closed Formats
    • Who is "in control" of the document format you choose? A standards body? A single company or organization?
    • Can you count on something that one entity controls to be supported over time?
    • Advantages of Open Formats (a.k.a. Standards)
      • Interoperability and support over time.
      • Integrate well with open-source or low cost processing and IR tools
      • Help web content providers who need to support an increasing variety of devices and platforms
  • Usability & Accessibility
    • What software do users need to view a particular format?
    • Can a web browser natively display it?
    • If the format requires a browser plug-in:
      • Is it free? Are users likely to have it installed?
      • Does it work on all computing platforms?
    • Do public search engines index the format?
    • Can dial-up modem users access the material in the collection?
  • Subject Specific Concerns for Legal Materials
    • Legal digital projects usually manage texts, not images.
    • Some types of legal materials are harder to maintain, e.g. codified material.
    • Legal documents are almost exclusively printed in black & white.
    • Preservation of the page structure is important for citation purposes.
    • Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.
  • Possible Delivery Formats
    • Pure image formats: TIFF, JPEG
    • Open encoded formats: XML, HTML, ASCII, and Unicode
    • Hybrid formats: PDF, DjVu – can contain both image and text
    • Proprietary formats: Microsoft Word, WordPerfect
  • Pure Images: TIFF, JPEG
    • Raster (pixel-based) formats, used exclusively for scanned collections
    • TIFF is the best choice for archival scanned images
    • Pros
      • Web browsers display them natively
      • Both are open formats
    • Cons
      • Large file sizes make viewing on slow connections problematic
      • Text of the documents available only through OCR (Optical Character Recognition)
      • Weak support for multi-page documents
      • JPEGs have trouble displaying text when they are compressed to levels appropriate for the web
      • Contain metadata about the physical file itself, not the contents of the file
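To make the file-size concern above concrete, here is a rough back-of-the-envelope sketch. The page size (8.5 x 11 in), the 300 dpi resolution, and the 56 kbit/s link speed are illustrative assumptions, not figures from the presentation. Real TIFFs are usually compressed (for example with CCITT Group 4 for black & white pages), so actual files are smaller than these uncompressed estimates.

```python
# Rough size and transfer-time estimate for one uncompressed scanned page.
# All inputs used below are illustrative assumptions, not data from the slides.

def page_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Uncompressed raster size in bytes for a single page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def download_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and latency."""
    return size_bytes * 8 / link_bits_per_second


# Assumed: US Letter page scanned at 300 dpi.
bitonal = page_bytes(8.5, 11, 300, 1)   # 1 bit per pixel (black & white)
gray = page_bytes(8.5, 11, 300, 8)      # 8 bits per pixel (grayscale)

# Assumed: a nominal 56 kbit/s dial-up link.
print(bitonal, round(download_seconds(bitonal, 56_000), 1))
print(gray, round(download_seconds(gray, 56_000), 1))
```

Under these assumptions, an uncompressed black & white page is about 1 MB and needs roughly 150 seconds on an ideal 56 kbit/s link; an 8-bit grayscale page is eight times larger.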
  • Imaged Formats Cont.
    • OCR is an important consideration:
      • A 5% error rate doesn't have an impact on traditional IR measures
      • A 20% error rate significantly degrades the performance of traditional IR techniques [Doermann 98].
      • High quality OCR is now available for relatively low cost
        • ABBYY FineReader ($300)
        • Table and page layout recognition supported
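The error rates above can be measured on a sample of pages by comparing OCR output against a hand-corrected transcription. The following is a minimal sketch, assuming "error rate" means character-level edit distance divided by the length of the reference text; the slides do not say how the cited rates were defined, and the function names and sample strings are hypothetical.

```python
# Character error rate (CER) of OCR output against a corrected reference.
# Assumption: error rate = Levenshtein distance / length of the reference.

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_error_rate(reference, ocr_output):
    """Fraction of reference characters that the OCR got wrong."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: two substituted characters in a 19-character line.
cer = char_error_rate("New Jersey Statutes", "New Jcrsey Statutcs")
print(f"{cer:.1%}")
```

A collection whose sampled pages fall near the 5% figure can likely be indexed as-is, while pages approaching 20% are candidates for rescanning or manual correction.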
  • Open Encoded Formats: XML, HTML, ASCII, Unicode
      • Typically easier to integrate into digital libraries [Baird 2004]
      • Created in 3 ways:
        • Born digital documents
        • Manually keyed documents
        • Corrected OCR
      • IR applications easy to build, open source support strong
      • International standards or W3C recommendations
      • Accessible with all current web technologies
      • Metadata easily embedded in XML and HTML documents
      • Can be created with any text editor
      • Improvements in OCR make encoding scanned collections feasible
  • Open Encoded Formats Cont.
    • Cons:
      • These documents can be expensive for staff to create
        • Encoding in XML may have to be done by hand
        • Manual correction of OCR errors
      • Technical expertise on staff (e.g. a Perl programmer) is needed to get the full benefits of these formats
      • These don't necessarily preserve the "look" of printed documents
  • Hybrid Formats: PDF, DjVu
    • PDF and DjVu are proprietary technologies that have substantial support in the open source community.
    • Both can contain a layer of the document’s text and an image of each page in a document.
    • Both utilize cross-platform, freely available web browser plug-ins.
    • Both try to preserve the look of print documents
    • Easy to export born digital documents to these formats using printer drivers (e.g. “print to PDF”)
  • Adobe PDF
    • Pros:
      • PDF has strong market acceptance in the legal community
      • PDF-Archive, a standard for using PDF as an archival format, is in development by AIIM (Association for Information and Image Management)
      • Adobe makes the PDF reference manual and software development kit freely available to developers.
      • Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
    • Cons:
      • Plug-in performance is poor for long documents
      • PDFs composed of scanned images can be very large in size, even for short documents
  • DjVu
    • Designed to be a scan-to-web technology.
    • Pros:
      • Best compression of any image format on the web
      • Users can load lengthy documents very quickly
      • The DjVu plug-in can be manipulated via CGI-style arguments
      • Use the Any2DjVu server to try out the format.
    • Cons:
      • DjVu does not yet have great market acceptance in the legal community.
      • DjVu does not have a standard method for embedding metadata within documents.
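As an illustration of the CGI-style arguments mentioned above, the sketch below builds a link that asks the plug-in to open a document at a given page. The `djvuopts` marker and the `page` and `zoom` option names follow the convention used by DjVu browser plug-ins as best recalled here; treat them as assumptions and check them against the plug-in documentation. The helper name and the example URL are hypothetical.

```python
from urllib.parse import urlencode


def djvu_view_url(document_url, **options):
    """Append plug-in viewing options to a DjVu document URL.

    Assumption: the plug-in reads options that follow a 'djvuopts'
    marker in the query string (e.g. page=12&zoom=100).
    """
    if not options:
        return document_url
    return f"{document_url}?djvuopts&{urlencode(options)}"


# Hypothetical document address; open it at page 12 with 100% zoom.
print(djvu_view_url("http://example.org/statutes.djvu", page=12, zoom=100))
```

Links of this kind are useful for legal materials because a citation to a specific page can point the reader directly at that page of the scanned volume.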
  • Proprietary Formats
    • Word Processing Formats: MS Word, WordPerfect
    • Not a good choice for document delivery on the web
    • Cons:
      • These formats are completely closed
      • Poor cross platform support
      • It is often problematic to index these documents using inexpensive or open source IR tools.
  • The New Jersey Digital Legal Library
    • URL:
    • Digitize New Jersey Legal materials not currently available online.
    • Available for users in two formats: DjVu and PDF
    • Current Workflow:
      • Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
      • Extract OCR text from the DjVu to XHTML using XSL Stylesheets and DjVuLibre (The Open Source DjVu Library)
      • Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
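The last two workflow steps can be sketched as follows. This is not the project's actual stylesheet or schema: it assumes the OCR text is already available as one string per page, that Dublin Core fields are stored as `<meta name="DC.…">` elements in the XHTML head, and that each page is kept in its own `div` so page boundaries survive for citation. The function names, the `class`/`id` values, and the sample metadata are hypothetical, and the indexer itself is not invoked here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_xhtml(title, dc_fields, page_texts):
    """Wrap per-page OCR text in XHTML with Dublin Core <meta> elements."""
    ET.register_namespace("", XHTML_NS)  # serialize without an ns0: prefix
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title
    for name, content in dc_fields.items():
        ET.SubElement(head, f"{{{XHTML_NS}}}meta",
                      {"name": f"DC.{name}", "content": content})
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    for number, text in enumerate(page_texts, start=1):
        # One div per page keeps the page structure needed for citations.
        div = ET.SubElement(body, f"{{{XHTML_NS}}}div",
                            {"class": "page", "id": f"page-{number}"})
        div.text = text
    return ET.tostring(html, encoding="unicode")


def read_dc_fields(xhtml):
    """Recover the Dublin Core fields, as an indexer would see them."""
    root = ET.fromstring(xhtml)
    fields = {}
    for meta in root.iter(f"{{{XHTML_NS}}}meta"):
        name = meta.get("name", "")
        if name.startswith("DC."):
            fields[name[3:]] = meta.get("content", "")
    return fields


# Hypothetical sample document with two pages of OCR text.
doc = build_xhtml("Sample Report",
                  {"title": "Sample Report", "date": "1900"},
                  ["Text of page one.", "Text of page two."])
print(read_dc_fields(doc))
```

Keeping the metadata inside the document itself means a single file carries both the searchable text and its description, which is the property the workflow relies on when handing XHTML to the indexer.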
  • References
    • Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto CA, 2004.
    • Doermann, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998, pp. 287-298.