Many serial titles still exist in print only, and major commercial digitizing efforts often overlook titles that are not widely held. If out of copyright, these titles can be digitized by libraries, giving this old scholarship new life. Many libraries do some sort of digitization of textual materials, but too often serials experts are not involved. The titles may not be presented in ways that pull the serial together while also allowing article-level linking. Serials experts can be valuable contributors to these digitization projects. This presentation will provide information on how to digitize text efficiently and how serials are being presented in digital collections. Serials specialists will learn ways that they can contribute to local digitization efforts to help ensure these titles are presented as effectively as possible.
Presenter: Wendy C Robertson
6. As the traditional collectors and preservers of content, libraries should ensure their content remains accessible to a wide audience.
6 NASIG • St. Louis, MO 6/4/2011
35. several "thousand dollars. [When]. Mother announced to her . her intentions of marrying father
after she came of age .'. . the stepmother skipped out with all the funds, simply vanished, and
mother was left penniless.,, 1 This itiformation was most welcome, for I had been able to say
very little, in my edition of Mattie's letters some years earlier, about her life before she married
into the Whitman family, and could only speculate on what was here confirmed: that she had
been an orphan whose connections with kin had been largely if not entirely severed. 2 The
interview in which this small but helpful revelation comes.is the most interesting part of the
little-noticed Fansler' Collection of Whitman materials at Northwestern University.3 The forty-
eight page hadwritten 'transcript, supplemented by a number of Miss Jessie's letters to the
Adding tags & soft hyphens
69. • Copyright Term and the Public Domain in the United States - http://copyright.cornell.edu/resources/publicdomain.cfm
• U.S. Copyright Office - http://www.copyright.gov/records/
• Stanford's Copyright Renewal Database - http://collections.stanford.edu/copyrightrenewals/bin/page?forward=home
• Automated Bibliographic Rights Determination - http://www.hathitrust.org/bib_rights_determination
70. • HathiTrust Rights Database - http://www.hathitrust.org/rights_database
• Smith, Kevin. Copyright Risk and Reward in Mass Digitization. Presented at ARL annual meeting, May 2011. http://www.arl.org/bm~doc/mm11sp-smith.pdf
• Ockerbloom, John Mark. The Next Mother Lode for Large-scale Digitization? Historic Serials, Copyrights, and Shared Knowledge. Presented at Digital Library Federation Spring Forum, Apr. 2006. http://works.bepress.com/john_mark_ockerbloom/5/
71. • First copyright renewals for periodicals - http://onlinebooks.library.upenn.edu/cce/firstperiod.html
• Information about the Catalog of Copyright Entries - http://onlinebooks.library.upenn.edu/cce/
• DLF/OCLC Registry of Digital Masters - http://www.oclc.org/digitalregistry/
• Northeast Document Conservation Center. Reformatting: Preservation and Selection for Digitization - http://www.nedcc.org/resources/leaflets/6Reformatting/06PreservationAndSelection.php
72. • PREMIS (Preservation Metadata Maintenance Activity) - http://www.loc.gov/standards/premis/
• NARA's Technical Guidelines for Digitizing Archival Materials for Electronic Access - http://www.archives.gov/preservation/technical/guidelines.pdf
• DLF's Benchmark for Faithful Digital Reproductions of Monographs and Serials - http://old.diglib.org/standards/bmarkfin.htm
• University of Michigan DLPS digitization specifications - http://www.hathitrust.org/documents/UMDigitizationSpecs20100827.pdf
73. • ABBYY FineReader - http://finereader.abbyy.com/
• OmniPage - http://www.nuance.com/for-business/by-product/omnipage/index.htm
• OCRopus - http://code.google.com/p/ocropus/
74. Thank you!
Questions?
wendy-robertson@uiowa.edu
http://ir.uiowa.edu/lib_pubs/78/
Editor's Notes
How many of your institutions are digitizing text? Are entire serial runs digitized? Are you involved in any way?
Do you read on a mobile device? How about on a screen of any sort? What type of content do you read and does the size of the device matter (fiction vs. scholarship)? How about the format of the content (powerpoint, doc, pdf, e-pub)? Do you print out articles to read? Do you print out blog posts?
As the traditional collectors and preservers of content, libraries should continue to be involved with ensuring their content remains widely accessible and does not become relegated to a limited audience in a physical library. Serials specialists can contribute their perspective to ensure these digitization efforts represent serials as usably as possible. Five years ago, John Mark Ockerbloom from U Penn presented at the Digital Library Federation on historic serials as the next mother lode for large-scale digitization. I think this is still an under-tapped resource.
Deciding what to digitize is one of the first issues.
First, is it legal for you to digitize the content? Is the content now public domain? Does your institution have the copyright or can you get permission from the copyright holder?
I like the Cornell copyright chart, but there are several other sources for a quick review as well.
For post-1978 items, you can search the Copyright Office database, limiting your results to serials.
For 1923-1963, you can check Stanford’s copyright renewal database.
Another useful item to review is how Michigan is approaching copyright for HathiTrust. First, they look at details of the MARC record, using the year from their item-specific information, if available (Z30 refers to an Aleph table). Catalogers in the room may appreciate seeing their quality metadata used like this.
After the automated check, U Michigan staff manually checks remaining items, looking for copyright notices. They have been piloting this for several years, and they are expanding the process to involve more libraries to identify orphan works.
Serials have added complexities. Do you have rights for some volumes but not all volumes? Older volumes may be out of copyright while newer are covered. Or do you have rights for some articles but not all articles? This is probably more of an issue for relatively recent content. Some small scholarly journals let the authors keep their copyright so now the content can’t be easily digitized as a whole.
Penn has a listing of first copyright renewals for some periodicals and a page with additional copyright checking resources.
The next thing to consider is whether the item has already been digitized. If it's already done, you won't want to do it again, unless it is extremely important. Check WorldCat, HathiTrust, the Internet Archive, and of course Google.
WorldCat records may include details of the digitization as part of DLF/OCLC Registry of Digital Masters.
If another institution has digitized a title, see if there are any gaps you can fill in. If you have the missing volumes, collaborate with the other institution so that all volumes are pulled together. You should at least let them know of your volumes once they are posted so they can add a link.
Even after you have restricted yourself to legal and not yet digitized content you will have selection to do. At Iowa, we take the approach that selection was made at the time content was acquired, so now we are really prioritizing the work. We focus on content that is needed by faculty. After that we focus on material that is not widely held or is a specific collection strength. We don’t take on digitization of a serial lightly because to do it well requires a commitment of resources.
We have digitized back content as part of cooperative ventures. These projects were conceived of as the whole run of the title, so we borrowed volumes as needed to complete the run. Special projects like this can attract grants.
A substantial number of our texts have been digitized because of preservation issues. Brittle items from our general stacks and materials from our special collections requested through interlibrary loan have been digitized. While this makes sense for the individual object, it makes it difficult to treat serials properly. Most of these volumes are not even publicly accessible; we will be contributing them to HathiTrust. If we weren't part of a larger initiative, I don't know what we would do with them. Some volumes are challenging because we need to send a MARC record with them to HathiTrust, but we may not have a good record for the individual volume. For example, one volume of a serial was made up of 12 items with distinctive titles, which we analyzed locally; a collected title page was inserted when the volume was bound. Our catalogers made an additional record using this collected title page so that the record will be as useful as possible in HathiTrust.
Scanning can be done for preservation, but not all scanning meets preservation standards. Preservation is not my area of expertise, and I do almost none of our digitization. Following preservation practices means the item won't need to be scanned again, taking further staff time to handle the item and subjecting fragile materials to the scanner more than necessary. One thing to keep in mind is the difference between the master and the deliverable. The master images will be large files, probably TIFFs, and should rarely be accessed. The access copies should be much smaller for fast download.
There are plenty of sources that detail preservation scanning. For the most part nothing about text digitization is serials-specific, so I'll give just a couple of tips. When we scanned back issues of serials published within the last 40 years, we cut the spine off of extra issues. The pages could then go through a sheet feeder. This was efficient, but obviously not the approach for most old titles. The scan size was set to the page size so cropping was not needed; this helps make the files more consistent when the pages are put together into articles or issues. There are batch processes in Photoshop that can rotate, darken, and crop images if necessary. Many libraries have their documentation or their specifications for outsourcing online, which can give detailed suggestions for dealing with missing pages and how to structure and name the files.
Naming or structuring files with serials in mind can be helpful so that the files group easily. We use voluming as part of the name, making each file name unique, whereas Michigan’s outsourcing specs use a file structure, making no attempt to correlate the name with the numbering. Michigan’s method scales better than ours, but ours makes it easier for us to work with the files.
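As a small sketch of the voluming-in-the-name approach, a file name can embed title, volume, issue, and page so that each name is unique and plain alphabetical sorts keep numeric order. The title code and field widths here are hypothetical, not Iowa's actual scheme.

```python
def serial_filename(title_code: str, volume: int, issue: int, page: int) -> str:
    """Build a unique, sortable file name that embeds the voluming.

    Zero-padding each number keeps a plain string sort in numeric order.
    """
    return f"{title_code}_v{volume:03d}_i{issue:02d}_{page:04d}.tif"

print(serial_filename("palimpsest", 12, 3, 1))  # -> palimpsest_v012_i03_0001.tif
```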
Digitization takes time to do well. We want library projects to be useful. People's expectations are set by commercial products, and we need to do what we can to ensure serials are represented well by our libraries. Because our content is free, the bar for acceptability is lower. Share your understanding of what is normal for scholarly journals and magazines, and for historic content, to influence local projects. Consider how you expect people to read the content. Will they be printing it out? Assuming people will read the digital version, will it be read all the way through, like a novel or a newsletter? Or is it made up of individual components that will be accessed separately, like articles? Is only the text important, or is the layout of the page important, such as for seeing older content in context, for something with a lot of illustrations, or for knowing a specific page number for citations? With the growth of mobile access and ebooks, I think this is an area we really need to look at.
Too often content is presented as the bound object, which does a disservice to serials. We all know that the way one library bound a serial won't necessarily match how another library bound it, and in either case what really matters is the article.
Many serials are presented online as PDFs. I’m going to give some specific suggestions for how to make a better PDF. First of all, I am considering relatively modern titles. Older typography and ads do not work as well.
In order to make the text searchable, you will need to run OCR. Reading on a mobile device raises the bar, and poor-quality OCR is becoming less acceptable, so I think this is one area where we should consider investing in better products rather than simply making do. We generally rely on Adobe Acrobat Pro 9, but this is not meant as an endorsement of the product. You can also use other programs, such as ABBYY FineReader, OmniPage, and OCRopus, which is open source. When we want to correct the OCR we use FineReader, which appears to be widely used for Internet Archive content. You may be able to add the text into a large-scale discovery tool, or it might be useful for researchers doing text mining.
In the last few months I have been learning about making PDFs accessible. Almost none of our content meets these standards yet. I’ll highlight a few things.
If you OCR in Acrobat, use the ClearScan option. This will allow text to reflow well.
Reflowing started as an accessibility feature, but it is also useful on a mobile device: you can make the font bigger without needing to scroll right and left, because the text will fit the screen.
Add tags to the document. This is needed for accessibility and has the added benefit that when you copy text you don't get a paragraph mark at the end of each line. You can see here that the paragraph is a whole unit, not separate lines. You can also see that Adobe will remove optional hyphens, which is good. With the hyphen removed, you could search for "intentions" and find the word. The OCR is not perfect, so a search for "handwritten" fails. Making a single, recent article accessible can be done fairly quickly, but a whole serial will take time. An old title could take a very long time. We will focus on more recent scholarly titles, where people probably want the content but not necessarily the look and feel of the original.
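A similar end-of-line hyphen and line-break cleanup can be approximated on raw OCR text outside Acrobat. This is a minimal sketch of the idea, not what Acrobat's tagging actually does internally:

```python
import re

def reflow(ocr_text: str) -> str:
    """Join OCR lines into flowing text, removing end-of-line hyphens."""
    # Join words hyphenated across line breaks: "inten-\ntions" -> "intentions"
    text = re.sub(r"-\n(?=\w)", "", ocr_text)
    # Replace remaining single newlines with spaces, keeping blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text

sample = "her inten-\ntions of marrying\nfather"
print(reflow(sample))  # -> her intentions of marrying father
```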
I would love to crowdsource cleanup. Australia's historic newspaper project, started several years ago, shows it can be done. An impressive number of corrections are made each day. Cleanup would be slow going, but texts that readers find important would be done first.
Technology does keep improving, so PDFs that are not reflowable on their own can be made that way with an app. This example uses GoodReader, but you can see the soft-hyphens remain. I found it terribly irritating, but it does show how technology is helping our less than stellar PDFs.
Many serials have been digitized and are freely available. However, the display and usability of these serials vary widely. I hope that in these examples you can see the many ways that digitized serials often fail to have basic functionality. I'm hoping that you can think about these issues with your content and work with your digitization staff. I would be remiss if I did not mention JSTOR. We are all familiar with it, and in a sense it sets the bar for library efforts. However, I am focusing on open access projects. The biggest collection is the Google Books project. Despite the word books, serials are digitized too: basically, if it is textual material in a codex, it is included. However, each volume is treated as its own thing. Links between volumes in a series are not made, and it is difficult to browse all the content of a title.
The physical item is the smallest unit, so articles are just pages within the book. They may be listed on the about page, but that list may also include subsections or diagrams. Note that when you download the PDF you cannot select text.
When full text is not available, the contents may not be listed. The entry makes it clear there is more than one issue and hints at the contents.
Some Google books are available as EPUB/reflowable text. Those that come from Project Gutenberg are good; I have read several 19th-century novels on my iPod.
I have been unimpressed by Google's OCR; months back I tried to read Charlotte Brontë's novel Shirley on my iPod and gave up.
Since HathiTrust is a library project, I have higher expectations for its treatment of serials. Its display has noticeably improved over the last couple of years, but it is still awkward. It has the classic sorting problem where numbers are not sorted as integers. The voluming is still connected to the physical volume, which means it reflects however institutions happened to scan something. Even with all these duplicate volumes, no. 58 is still not scanned. The sort appears to be based on whatever institutions included in a 955 subfield v, and there is no specific guidance for how this should be done; the instructions merely request inclusion of a volume description (enumeration/chronology).
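The integer-sorting problem has a standard fix, often called a natural sort: split each volume label into text and number chunks and compare the number chunks as integers. A sketch of the idea:

```python
import re

def volume_key(label: str):
    """Split a volume label into text and integer chunks so numbers compare numerically."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", label)]

volumes = ["no.10", "no.2", "no.58", "no.9", "no.1"]
print(sorted(volumes))                  # plain string sort puts no.10 before no.2
print(sorted(volumes, key=volume_key))  # -> ['no.1', 'no.2', 'no.9', 'no.10', 'no.58']
```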
I can download a PDF, but only if I log in, and I think you need to be from a partner institution. It was slow to create and download the PDF, and when I tried to get a book onto my iPod I was not successful. Google Books and HathiTrust are searchable and allow you to FIND materials. Neither really promotes browsing of articles, and neither provides decent article access. If you don't mind reading a PDF on a screen, you can read the content, and I know people do this.
As a reader on a mobile device, I prefer Internet Archive texts. Godey's Lady's Book uses the keyword field so that you can pull together items in a volume or year, which I think is a fairly clever workaround.
IA offers several reading options, including EPUB and PDF. The OCR was not without problems, but the story could be read. The music and image captions were unusable in EPUB. However, I don't know who would actually try to read Godey's on an iPod. I think many people looking at Godey's would be interested in the whole issue, with formatting, illustrations, etc., so the PDF probably meets needs well. I haven't found a scholarly article yet in IA.
A less well known title displays the year or volume as part of the title. Keywords have not been added, but there is only one issue a year. Note the number of downloads – not high, but they are getting used.
Another approach is a smaller-scale collaboration. This gives you more control and makes it more obvious to your funding organization how you are participating. The collections can also be more focused by topic or geography. Illinois Harvest includes several serials. Again, each volume is its own record. The volume number is in the title. It indicates how many volumes there are, and you can click "more volumes" on the individual record. Then they have the sorting problem. You can jump by page number to an article, but can't link directly to an article.
Welsh Journals Online is another example. The journals are presented together and with issues in order. Each article is a separate PDF. Note the absence of an ISSN, so even here a serials specialist could give advice.
Many libraries choose to post content on their own site. As long as it is well indexed by Google and the URLs are in WorldCat, readers should be able to find the content. Make sure your journals are easily found by search engines, since it will be unusual for people to begin their search on your site. With the growth of ebooks, I think large collections, like the Internet Archive, will be used by more people to find materials.
Several platforms exist for local content. CONTENTdm is particularly good for collections of images. This interface can be good for pamphlets and ephemera, things that are very visual in nature.
Pages can be displayed as individual page images, and issues can be structured. In this case pages are images, not PDFs, so the content can only be viewed online.
Collections can also display as lists of issues. The articles can be structured. Again, the page-by-page image display means you need to read online, but you can at least easily find the start of an article. If a title is its own collection, there can be a homepage for the journal. Otherwise, to pull all the issues together you need to do a search. CONTENTdm could also have PDFs.
If your institution uses Digital Commons or Open Journals Systems or other software to host current journals, you can also include back content. We have done this for two titles. In the case of Education Weekly, we made each newsletter issue a PDF. In order to sort correctly, we added leading zeros to the numbering.
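The leading-zero step can be scripted rather than done by hand when renumbering many issues; a small sketch, with an illustrative label format:

```python
import re

def pad_numbering(label: str, width: int = 3) -> str:
    """Zero-pad every number in an issue label so string sorts match numeric order."""
    return re.sub(r"\d+", lambda m: m.group().zfill(width), label)

print(pad_numbering("v.2 no.14"))  # -> v.002 no.014
```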
For Iowa Geological Survey Annual Report, we posted each issue with articles split out. Despite the title annual report, this title includes geological investigations and these articles still appear in bibliographies.
Open Journal Systems can be used in a similar fashion.
Repository software can also be used, even if not specifically designed for journals. For example, the University of Illinois put back content of Library Trends in their DSpace repository. They created a collection for the title, and then one for each issue.
Each article is in a collection for an issue, and there is a page to browse all issues. An individual article has good metadata for citation and a handle, but no OpenURL linking is possible.
An institution may choose to post a serial as an unstructured series of web pages. This example successfully gets the content to people but completely breaks the serial up and is impossible to cite by page number.
There are a few things I think serials people can bring to the discussion.
Making individual PDFs and their metadata takes longer, but the end product is so much more useful that we do it when we can. We do not try to do this for very short pieces in newsletters or non-scholarly content. If you have chosen to split content into articles, then you have increased your metadata requirements tremendously. We put the data in a spreadsheet and fill common values down. If the title is local, you may have a listing of articles in a document or web page that could give you a start on the data entry. Splitting content into articles opens the possibility for OpenURL linking.
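The spreadsheet fill-down step can also be scripted when the article list arrives as delimited data; this sketch assumes hypothetical column names:

```python
import csv
import io

def fill_down(rows, columns):
    """Copy a value down from the previous row when a cell in `columns` is blank,
    mimicking a spreadsheet fill-down for values shared across articles."""
    previous = {}
    for row in rows:
        for col in columns:
            if row.get(col, "").strip():
                previous[col] = row[col]
            else:
                row[col] = previous.get(col, "")
        yield row

data = io.StringIO(
    "journal,volume,article_title\n"
    "Iowa Geological Survey,12,First article\n"
    ",,Second article\n"
)
rows = list(fill_down(csv.DictReader(data), ["journal", "volume"]))
print(rows[1]["journal"], rows[1]["volume"])  # -> Iowa Geological Survey 12
```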
Search engines may lead directly to a PDF, and people may save the PDF to their computer. The PDF should ideally have enough information in it that people can properly cite the content and can find where they got it. This could mean adding a new cover page, like JSTOR does (especially since older articles often didn't include the citation in a header or footer). It could also be done by adding the author, title, rights, and URL to the document properties. Another nice feature is to adjust the PDF pagination so that you can type in the actual page number and go to that page of a scan. This isn't hard to do, but it can make a long article much easier to use.
One other thing to remember: you can request an ISSN for an old title. You may want to request one for the electronic version once it is online, if you have a journal site. You may also want to get the title into a link resolver knowledge base. Hopefully it is standard practice locally to notify cataloging when a title is online so that it can be cataloged.
If the title has changed, the digitization staff may not really know how to deal with this. Make sure the change is clear on the site.
Pay attention to your local efforts to make sure your serials are as accessible and usable as possible. There are lots of opportunities, from metadata to advice about making sure the digitized titles are in the e-journal flow. But even if you are in the department, serials still may not be represented as you would like. Digitization of back volumes seems like a series of compromises. We want it done well so that it doesn't need to be done again, but we have limited resources and we are not charging for the content. We try to make the best decisions we can based on how we think people will use the content and what we can afford to do. We have been striving for good enough, but as expectations rise, I'm not sure I always know what good enough is any longer. Remember: make sure your content is where the users are, and keep your data standard and open.