Legal digital projects usually manage texts, not images.
Some types of legal materials, e.g. codified material, are harder to maintain.
Legal documents are almost exclusively printed in black & white.
Preservation of the page structure is important for citation purposes.
Maintaining the original appearance of digitized print documents is not important; archival and rare materials are potential exceptions.
Possible Delivery Formats
Pure image formats: TIFF, JPEG
Open encoded formats: XML, HTML, ASCII, and Unicode
Hybrid formats: PDF, DjVu – can contain both image and text
Proprietary formats: Microsoft Word, WordPerfect
Pure Images: TIFF, JPEG
Raster (pixel-based) formats, used exclusively for scanned collections
TIFF is the best choice for archival scanned images
Web browsers display JPEG natively; TIFF usually requires a plug-in or an external viewer
Both are open formats
Large file sizes make viewing on slow connections problematic
Text of the documents available only through OCR (Optical Character Recognition)
Weak support for multi-page documents
JPEGs have trouble displaying text when they are compressed to levels appropriate for the web
Contain metadata about the physical file itself, not the contents of the file
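To put the file-size point in numbers, here is a rough sketch of the uncompressed size of one scanned page. The page dimensions and the 300 dpi resolution are illustrative assumptions, not figures from the slides, and real TIFF or JPEG files will be smaller once compression is applied.

```python
def uncompressed_size_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw raster size: pixel count times bit depth, converted to bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed example: a US Letter page (8.5 x 11 in) scanned at 300 dpi.
bitonal = uncompressed_size_bytes(8.5, 11, 300, 1)   # black & white, 1 bit per pixel
color = uncompressed_size_bytes(8.5, 11, 300, 24)    # 24-bit color

print(f"bitonal: {bitonal} bytes, 24-bit color: {color} bytes")
```

Since legal documents are almost exclusively black & white, a bitonal scan avoids the 24x overhead of color before any compression is considered.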
Imaged Formats Cont.
OCR is an important consideration:
A 5% error rate does not have an impact on traditional IR measures
A 20% error rate significantly degrades the performance of traditional IR techniques [Doerman 98].
High quality OCR is now available for relatively low cost
ABBYY FineReader ($300)
Table and page layout recognition supported
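The slides do not say how the OCR error rate is measured. One common choice, assumed here, is the character error rate: the edit distance between a hand-checked reference transcription and the OCR output, divided by the length of the reference. A minimal sketch, with made-up sample strings:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference, ocr_output):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr_output) / len(reference)


# Hypothetical sample: two misread characters in a 14-character reference.
reference = "the court held"
ocr_output = "tbe c0urt held"
print(f"character error rate: {character_error_rate(reference, ocr_output):.1%}")
```

Measuring a small hand-corrected sample this way gives a rough idea of whether a collection sits nearer the 5% or the 20% end of the range discussed above.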
Open Encoded Formats: XML, HTML, ASCII, Unicode
Typically easier to integrate into digital libraries [Baird 2004]
Created in 3 ways:
Born digital documents
Manually keyed documents
OCR of scanned documents
IR applications easy to build, open source support strong
International standards or W3C recommendations
Accessible with all current web technologies
Metadata easily embedded in XML and HTML documents
Can be created with any text editor
Improvements in OCR make encoding scanned collections feasible
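As an illustration of how easily metadata can be embedded, the sketch below builds a small XHTML document with Dublin Core elements in the head, following the common `DC.`-prefixed `meta` convention. The function name and the sample values are hypothetical, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
DC_NS = "http://purl.org/dc/elements/1.1/"


def build_xhtml(title, body_text, dublin_core):
    """Return an XHTML string with one <meta name="DC.x"> per metadata item."""
    html = ET.Element("html", {"xmlns": XHTML_NS})
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    # Declare which schema the "DC." prefix refers to.
    ET.SubElement(head, "link", {"rel": "schema.DC", "href": DC_NS})
    for name, value in dublin_core.items():
        ET.SubElement(head, "meta", {"name": f"DC.{name}", "content": value})
    body = ET.SubElement(html, "body")
    ET.SubElement(body, "p").text = body_text
    return ET.tostring(html, encoding="unicode")


# Hypothetical sample values, for illustration only.
doc = build_xhtml(
    "Sample Opinion",
    "Text of the opinion goes here.",
    {"title": "Sample Opinion", "type": "Text", "language": "en"},
)
print(doc)
```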
Open Encoded Formats Cont.
These documents can be expensive for staff to create
Encoding in XML may have to be done by hand
Manual correction of OCR errors
Need technical expertise on staff to get the full benefits of these formats, e.g. a Perl programmer
These don't necessarily preserve the "look" of printed documents
Hybrid Formats: PDF, DjVu
PDF and DjVu are proprietary technologies that have substantial support in the open source community.
Both can contain a layer of the document’s text and an image of each page in a document.
Both utilize cross-platform, freely available web browser plug-ins.
Both try to preserve the look of print documents
Easy to export born digital documents to these formats using printer drivers, e.g. “print to PDF”
PDF has strong market acceptance in the legal community
PDF-Archive (PDF/A), a standard for using PDF as an archival format, is in development by AIIM (Association for Information and Image Management)
Adobe makes the PDF reference manual and software development kit freely available to developers.
Standard methodology for embedding metadata in documents: XMP (Extensible Metadata Platform), which seeks compatibility with semantic web technologies
Plug-in performance is poor for long documents
PDFs composed of scanned images can be very large in size, even for short documents
DjVu is designed to be a scan-to-web technology.
Best compression of any image format on the web
Users can load lengthy documents very quickly
The DjVu plug-in can be manipulated via CGI-style arguments
Use the Any2DjVu server to try out the format.
DjVu does not yet have great market acceptance in the legal community.
DjVu does not have a standard method for embedding metadata within documents.
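The CGI-style arguments mentioned above can be generated programmatically, for example to link a citation straight to the page it refers to. This sketch assumes the `djvuopts` keyword followed by `page` and `zoom` options, as used by the DjVuLibre plug-in. Treat those names as assumptions and check them against the documentation of the plug-in actually deployed. The URL and function name are hypothetical.

```python
from urllib.parse import urlencode


def djvu_view_url(document_url, **options):
    """Append plug-in viewing options to a DjVu document URL.

    Assumption: the plug-in reads options placed after a 'djvuopts' marker
    in the query string, e.g. ?djvuopts&page=12&zoom=100.
    """
    if not options:
        return document_url
    return f"{document_url}?djvuopts&{urlencode(options)}"


# Hypothetical document URL: open page 12 at 100% zoom.
print(djvu_view_url("http://example.org/statutes.djvu", page=12, zoom=100))
```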
Word Processing Formats: MS Word, WordPerfect
Not a good choice for document delivery on the web
These formats are completely closed
Poor cross-platform support
It is often problematic to index these documents using inexpensive or open source IR tools.
The New Jersey Digital Legal Library
Digitize New Jersey Legal materials not currently available online.
Available for users in two formats: DjVu and PDF
Scan -> TIFF; then TIFF -> PDF and TIFF -> DjVu
Extract OCR text from the DjVu to XHTML using XSL Stylesheets and DjVuLibre (The Open Source DjVu Library)
Use swish-e to index the XHTML documents with embedded extended Dublin Core metadata
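The last two steps can be pictured with a small stand-in. The project itself uses swish-e for indexing. The sketch below is not swish-e, only a toy inverted index in plain Python that shows the idea of pulling Dublin Core `meta` fields and body text out of an XHTML file and making both searchable. The sample document and all function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

XHTML = "{http://www.w3.org/1999/xhtml}"

# Hypothetical XHTML page, standing in for text extracted from a DjVu file.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title>
<meta name="DC.title" content="Sample Session Law" />
</head>
<body><p>An act concerning riparian lands.</p></body>
</html>"""


def extract_fields(xhtml_source):
    """Return (Dublin Core metadata dict, body text) for one XHTML document."""
    root = ET.fromstring(xhtml_source)
    metadata = {
        m.get("name"): m.get("content", "")
        for m in root.iter(XHTML + "meta")
        if (m.get("name") or "").startswith("DC.")
    }
    body = root.find(XHTML + "body")
    text = " ".join(body.itertext()) if body is not None else ""
    return metadata, text


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, source in documents.items():
        metadata, text = extract_fields(source)
        searchable = text + " " + " ".join(metadata.values())
        for word in re.findall(r"[a-z0-9]+", searchable.lower()):
            index[word].add(doc_id)
    return index


index = build_index({"doc1": SAMPLE})
print(sorted(index))
```

A real indexer adds ranking, stemming and fielded search on the metadata, which is the reason to use a dedicated tool rather than a script like this one.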
Baird, Henry. Difficult and Urgent Open Problems in Document Image Analysis for Libraries. Proceedings of the First International Workshop on Document Image Analysis for Libraries. Palo Alto CA, 2004.
Doerman, David. The Indexing and Retrieval of Document Images: A Survey. Computer Vision and Image Understanding 70 (3), 1998, pp. 287-298.