Outbound Harvesting with Encore as a Library Space-Saving Strategy : The Case of HathiTrust Docs

Brown, Christopher C. “Outbound Harvesting with Encore as a Library Space-Saving Strategy : The Case of HathiTrust Docs.” Presentation given at the Innovative Users Group at ALA Midwinter, 7 January 2011, San Diego, CA.


  1. Outbound Harvesting with Encore as a Library Space-Saving Strategy: The Case of HathiTrust Docs
     Christopher C. Brown, University of Denver, Penrose Library
     (303) 871-3404, cbrown@du.edu
     Friday, April 15, 2011
  2. [Diagram labels: DR, IR, Digital Texts; Inbound Harvesting; Outbound Harvesting]
     This presentation will show how Encore harvesting can be used to mitigate a space problem in a library by substituting online access for physical access to the collection. The government documents collection will be the primary focus.
  3. About University of Denver
     • Depository since 1909
     • Historically a 70-75% selective
     • Now a 4.8% selective, but we receive 100% of online cataloging
     • Adding URLs to historic documents
  4. The Problem
     • Currently 80% of our paper documents are in storage
     • We will be remodelling our library and will be totally displaced for at least 18 months; 100% of documents will be in storage
     • Government documents will remain in storage after renovation
  5. Partial Solution: Using Encore for Outbound Harvesting
     • Our users are accustomed to using electronic documents
     • Need to divert attention away from physical collection holdings
     • Encore harvesting of HathiTrust can do this
     • OCLC report: 15% of HathiTrust public domain materials are government docs*
     * Malpas, Constance. 2011. Cloud-sourcing Research Collections: Managing Print in the Mass-digitized Library Environment. Dublin, Ohio: OCLC Research. http://www.oclc.org/research/publications/library/2011/201101.pdf.
  6. OAI-PMH Harvesting
     • http://www.openarchives.org/
     • Promotes interoperability standards for dissemination of content
     • HathiTrust allows harvesting of its records
     • Innovative Interfaces' Encore catalog allows records to be harvested (with the purchase of a harvester connection)
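To make the harvesting step concrete, here is a minimal sketch of an OAI-PMH `ListRecords` exchange. It is not Encore's harvester, whose internals the slides do not describe. The base URL, the set name, and the sample response are hypothetical placeholders. Only the verb, the `metadataPrefix`, `set`, and `resumptionToken` arguments, and the XML namespace come from the OAI-PMH 2.0 protocol.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace defined by OAI-PMH 2.0 for protocol-level elements.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    ids = [header.findtext(OAI_NS + "identifier")
           for header in root.iter(OAI_NS + "header")]
    token_el = root.find(f"{OAI_NS}ListRecords/{OAI_NS}resumptionToken")
    token = token_el.text if token_el is not None and token_el.text else None
    return ids, token


# Hypothetical response page, trimmed to the elements parsed above.
SAMPLE_PAGE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header><identifier>oai:example.org:rec1</identifier></header></record>
    <record><header><identifier>oai:example.org:rec2</identifier></header></record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(list_records_url("https://example.org/oai", set_spec="hypothetical_set"))
    ids, token = parse_page(SAMPLE_PAGE)
    print(ids, token)
    if token:
        print(list_records_url("https://example.org/oai", resumption_token=token))
```

A harvester repeats the request with each returned token until the repository sends an empty or missing `resumptionToken`, which marks the end of the list.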
  7. Encore Model
     [Diagram labels: Local Site; Classic OPAC; Traditional III Millennium ILS; Harvester; Encore (III) (next-gen catalog outside the ILS box); Remote Site with Digital Content (repeated)]
     • Harvested records appear only in Encore, not in the "classic" catalog
     • Harvested records update on a periodic schedule; in our case, daily
  8. HathiTrust Attributes
     From: http://www.hathitrust.org/rights_database
     PD = where docs generally live
  9. PD vs. PDUS
     Mass identification of copyright status based on bibliographically-derived information:
     a) As texts are loaded, a set query in Mirlyn identifies those texts that are:
        • US federal government documents, or
        • published in the US prior to 1923, or
        • published outside of the US before 1870
        These are treated as public domain (ATTRIBUTE name=pd) based on bibliographically-derived information (REASON name=bib). We do not restrict access to these materials.
     b) Those texts that do not meet these criteria (e.g., US post-1923 and not a government document) are treated as in-copyright (i.e., ATTRIBUTE name=ic and REASON name=bib).
     c) An additional attribute is used to represent works published outside the United States between 1870 and 1923 because copyright status for these works depends on the location of the user. Works published outside the US prior to 1923 are in the public domain; however, due to the variations in copyright law in countries outside the US, it is estimated that 1870 is the earliest date works published in these countries may still be under copyright. Therefore, users accessing the volume from US IP addresses will have access to the works published outside the US between 1870 through 1923; however, users with non-US IP addresses will not (ATTRIBUTE name=pdus and REASON name=bib).
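The rights rules quoted on this slide can be read as a small decision procedure. The sketch below is one reading of that text, not HathiTrust's actual code. The function and parameter names are hypothetical. The boundary handling (1923 itself treated as in-copyright, 1870 itself as `pdus`) and the assumption that `ic` volumes are not viewable are interpretations of the quoted wording.

```python
def rights_attribute(pub_year, published_in_us, is_us_federal_doc):
    """Return 'pd', 'pdus', or 'ic' following the bibliographic rules on the slide."""
    if is_us_federal_doc:
        return "pd"                      # US federal documents are treated as public domain
    if published_in_us:
        return "pd" if pub_year < 1923 else "ic"
    if pub_year < 1870:
        return "pd"                      # published outside the US before 1870
    if pub_year < 1923:
        return "pdus"                    # public domain only for US-based users
    return "ic"


def can_view(attribute, user_in_us):
    """Whether a user may open the volume (assumes 'ic' means no full view)."""
    if attribute == "pd":
        return True
    if attribute == "pdus":
        return user_in_us
    return False


if __name__ == "__main__":
    # (year, published in US, US federal document)
    examples = [
        (1985, True, True),    # recent government document
        (1985, True, False),   # recent commercial US book
        (1900, False, False),  # foreign work from 1900
        (1850, False, False),  # foreign work from 1850
    ]
    for year, in_us, fed in examples:
        attr = rights_attribute(year, in_us, fed)
        print(year, in_us, fed, attr, can_view(attr, user_in_us=False))
```

The first rule is why documents "generally live" in `pd`: a federal document is public domain regardless of its date, so government documents dominate the recent decades of a public-domain-only harvest.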
  10. Public Domain Distribution
  11. Sampling Method
      I wanted to see how many government documents were in our HathiTrust harvest
      • Limit to HathiTrust for a given year
      • Examine first result on each page of 25 results (4% of results) [limitation: Encore only displays first 1,000 results]
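The arithmetic behind this sampling method can be sketched as follows. This is one plausible reading of the slide, not the author's actual worksheet. The function names are hypothetical, and scaling the sampled share up to the full result count is an assumption about how the estimates were produced.

```python
def sample_positions(total_results, page_size=25, display_cap=1000):
    """Zero-based positions of the first result on each displayed page."""
    visible = min(total_results, display_cap)  # Encore shows only the first 1,000
    return list(range(0, visible, page_size))


def estimate_docs(total_results, sampled_is_doc):
    """Scale the share of sampled records that are documents to the full result set."""
    if not sampled_is_doc:
        return 0.0, 0
    share = sum(sampled_is_doc) / len(sampled_is_doc)
    return share, round(share * total_results)


if __name__ == "__main__":
    positions = sample_positions(13369)
    print(len(positions), "records sampled")
    # Hypothetical tally: 30 of the 40 sampled records were government documents.
    share, estimate = estimate_docs(13369, [True] * 30 + [False] * 10)
    print(f"{share:.2%} of sample, about {estimate} documents")
```

One consequence of the display cap: for any year with more than 1,000 results, the sample never exceeds 40 records, so the effective sampling rate falls below the nominal 4%.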
  12. Harvesting Hathi Docs: The Stats

      Date Range   Hathi Totals   Hathi All Pub Domain   Hathi pdus   DU pd Harvest   Docs Sampling   Docs Sampling
                                  (pdus + pd)                                         (est. docs)     (est. %)
      2000-2009         505,682                 14,140          726          13,369          13,340          99.78%
      1990-1999         709,214                 29,163          880          28,164          26,662          94.67%
      1980-1989         723,657                 33,753        1,204          32,321          31,370          97.06%
      1970-1979         631,110                 28,633        2,046          26,189          25,607          97.78%
      1960-1969         546,914                 21,244        1,987          18,991           7,668          40.38%
      1950-1959         281,615                 20,861          863          19,893           3,888          19.54%
      1940-1949         184,755                 17,096          600          16,253           3,771          23.21%
      1930-1939         175,103                 16,237          654          15,317           2,600          16.97%
      1920-1929         175,226                 66,563       27,108          28,854           1,529           5.30%
      1910-1919         175,148                169,923       75,955          61,230           4,124           6.73%
      1900-1909         179,018                153,284       70,900          47,999           2,265           4.72%
      1890-1899         112,295                110,605       50,502          34,742             596           1.72%
      1880-1889          83,950                 82,809       38,928          23,855             699           2.93%
      1870-1879          58,624                 57,826       27,202          17,751             319           1.80%
      1860-1869          50,907                 50,337        2,273          45,790             248           0.54%
      Total           4,593,218                872,474      301,828         430,718         124,686          28.95%

      Statistics as of mid-March, 2011
      The Docs Sampling columns show the estimated numbers of docs per year and the estimated percentage of docs per year from the Harvest
  13. Hathi Docs Usage in Proportion to Docs Distribution
      [Chart: Total Docs and Hathi Docs by year, 1895-2009; vertical axis 0 to 30,000]
      Sources: 1895-1976 data: Monthly Catalog, 1895-1976 (ProQuest); 1976 onward data: CGP
  14. Hathi Harvest in Perspective
      [Chart: Harvested Records and Harvested Docs; vertical axis up to 600,000]
      Tracking of daily harvesting since harvesting began, April 16, 2010 through January 1, 2011
  15. Inclusion of Serials
      Although serial holdings do not sort properly, users can figure out what they need.
  16. Access to Older Serials
  17. And Very Old Serials
  18. Multivolume Works
  19. Duplicate Holdings
      U. of Michigan and U. of California holdings both show in this record
  20. Now, the Bad News: Records are Stripped Down
      “Lumber, Lumber, Lumber”
  21. Harvested Record from our Catalog
      Notice the multiple duplications of subject headings
  22. Original Record in HathiTrust
      Same record, but subject heading subfields are present
  23. Stripped-Out Fields
      • 008 fixed field data
      • 650 subfields other than “a”
      • 500 notes
      • 5xx shipping list info
      • 300 subfields after “a”
      • 086 SuDocs number
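Dropping every 650 subfield except “a” explains the “Lumber, Lumber, Lumber” effect shown earlier: headings that differed only in their subdivisions become identical. The sketch below illustrates this with plain Python tuples rather than a MARC library. The three sample headings and the function names are hypothetical, not taken from the record on the slides.

```python
from collections import Counter

# Each 650 field is modeled as a list of (subfield code, value) pairs.
# These sample headings are hypothetical illustrations.
ORIGINAL_650S = [
    [("a", "Lumber"), ("x", "Grading")],
    [("a", "Lumber"), ("x", "Standards"), ("z", "United States")],
    [("a", "Lumber"), ("x", "Prices")],
]


def strip_to_subfield_a(field):
    """Keep only subfield a, as the harvested records appear to do."""
    return [(code, value) for code, value in field if code == "a"]


def display_heading(field):
    """Join subfield values the way catalogs commonly display subject strings."""
    return " -- ".join(value for _, value in field)


if __name__ == "__main__":
    full = [display_heading(f) for f in ORIGINAL_650S]
    stripped = [display_heading(strip_to_subfield_a(f)) for f in ORIGINAL_650S]
    print(full)
    print(stripped)
    print(Counter(stripped))  # shows how many headings collapsed into one
```

Three distinct access points in the original record collapse into one repeated term, which is the loss of description the slides describe.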
  24. Use Stats for Regular Online Docs
      Represents clickthroughs from the catalog record to individual government documents over 7+ years.
  25. Use Stats for Hathi Trust?
      Statistics from Google Analytics
      • Statistics for all HathiTrust records accessed, not just documents
      • Spikes in usage are docs librarian (my) testing, not real users
  26. Conclusions
      • Encore provides an easy way to add external content to a library catalog experience
      • HathiTrust records are freely available and are easy to harvest
      • The Encore-harvested records are stripped-down and inadequate, providing too few access points and inadequate descriptions
      • The content is superb, containing monographic and serial document holdings over a span of about 150 years
      • Overall the project is worth having in our Encore catalog, especially since our legacy documents are all in storage and will remain there
      • We are considering adding other external collections using Encore, such as Center for Research Libraries digital holdings
