Development & Practice in the CyberCemetery


                                     Starr Hoffman
                 Head, Government Documents Dept.
                   University of North Texas Libraries
                                  25 September 2011
•   Intro: What is the CyberCemetery?
•   Purpose: Why create a CyberCemetery?
•   Development
•   Archiving Process
•   Technical Details
•   User Demographics: Who uses the CyberCemetery?
•   Conclusion
http://digital.library.unt.edu/explore/collections/GDCC/
• online archive of websites from U.S. government agencies
  or commissions that are no longer operating




  • “snapshot” of each website as it existed before “pulling the plug”
• maintained by the University of North Texas Libraries
• freely accessible world-wide
• affiliated NARA archive (National Archives and Records
  Administration)




1997 - present   2008 - present
• Protect At-Risk Information:
  • 1990’s: U.S. government information = online
  • born-digital
  • edited or removed without warning

• Federal Depository Library Program (FDLP)
    • administered by U.S. Government Printing Office (GPO)
    • mission: to provide free, permanent public access to
      government information
    • online information complicates this mission
    • University of North Texas is a federal depository library
• 1995: e-docs at risk
  • Government Printing Office (GPO) publishes report stating the need to preserve electronic government publications
• 1997: GPO + UNT
  • University of North Texas (UNT) talks to GPO about forming a partnership
• 1997: ACIR archived
  • UNT archives website of the Advisory Commission on Intergovernmental Relations (ACIR)
• 1999: GPO + UNT = expanded
  • permanent public access, expanded to multiple websites, & any agency or commission no longer operating
• 1999: CyberCemetery
  • archive is named “CyberCemetery” because websites are from “dead” agencies & commissions
• 2006: GPO + UNT + NARA
  • partnership now includes the U.S. National Archives and Records Administration (NARA)
• 2011: 73+ websites archived
1. Identify at-risk government agencies and commissions
  •     contacted directly by agency/commission
  •     contacted by GPO
  •     read/listen to news
  •     read government-related websites & blogs
  •     targeted search-engine queries
      •    (“final report” + .gov)
  •     referrals from other librarians, patrons
2. Evaluate the website
   • must be an official government website
   • the agency or commission must:
     •   be closing
     •   have issued a final report
     •   show other indication that the website is at-risk
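The evaluation criteria above can be sketched as a small check. This is only an illustration: the `SiteCandidate` record and its field names are hypothetical, and the sketch assumes that any one of the three at-risk signals is enough, which the slide does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class SiteCandidate:
    """Hypothetical record for a website under evaluation (not from the slides)."""
    url: str
    is_official_government_site: bool
    agency_is_closing: bool = False
    final_report_issued: bool = False
    other_at_risk_indication: bool = False


def is_eligible(site: SiteCandidate) -> bool:
    """Return True for an official site with at least one at-risk signal.

    Assumption: the three signals are alternatives, not joint requirements.
    """
    at_risk = (
        site.agency_is_closing
        or site.final_report_issued
        or site.other_at_risk_indication
    )
    return site.is_official_government_site and at_risk
```

For example, an official site whose commission has issued a final report would pass, while a non-government site would not, whatever its status.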
2.       Evaluate the website (continued)
              Questions for website administrator:

                   • What operating system was used to host this website?
                   • What webserver software was used for the hosting of this website?
                   • Are server side includes (SSI) used in this website?
                   • Was this website static html or a dynamic site?
                        • If dynamic, what scripting languages were used for this website (php, perl, python)?
                        • Was a database used for this website?
                             • If so, what database was used for this website?
                             • What methods were used to connect to the database?
                   • Is there streaming media associated with this website?
                   • Are there proprietary content types used in this website?
                   • Are there any comments you would like to add?
3.       Harvest the website
     •       software: Heritrix (from Internet Archive)
           •     http://crawler.archive.org/
           •     downloads content
           •     bundles all content into WARC file
           •     WARC = website in a single file
           •     no manipulation of code or content
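To make "WARC = website in a single file" concrete, here is a minimal sketch of how several harvested resources could be written back to back as WARC records. It is not Heritrix's code or its exact output: Heritrix stores full HTTP responses, while this sketch uses simplified `resource` records, and the function names `warc_record` and `bundle` are hypothetical.

```python
import io
import uuid
from datetime import datetime, timezone


def warc_record(uri: str, content_type: str, payload: bytes) -> bytes:
    """Serialize one simplified WARC 'resource' record (illustrative only)."""
    now = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {uri}",
        f"WARC-Date: {now}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # The payload is stored byte for byte; two CRLFs terminate the record.
    return head + payload + b"\r\n\r\n"


def bundle(pages: dict[str, tuple[str, bytes]]) -> bytes:
    """Concatenate one record per harvested URL into a single WARC byte stream."""
    out = io.BytesIO()
    for uri, (content_type, body) in pages.items():
        out.write(warc_record(uri, content_type, body))
    return out.getvalue()
```

Because each payload is copied unchanged, a bundle like this keeps the "no manipulation of code or content" property described above.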

4.       Access archived website
     •       software: Wayback (from Internet Archive)
           •     http://archive-access.sourceforge.net/projects/wayback/
           •     retrieves content from WARC
           •     add banner notifying archived status
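The banner step can be pictured as inserting a notice into each page as it is served. The sketch below is not how Wayback itself does it; the banner text, the `archive-banner` id, and the `add_banner` function are all assumptions made for illustration.

```python
import re

# Hypothetical banner wording and element id, chosen for this example only.
BANNER = (
    '<div id="archive-banner">'
    "This is an archived website; it is no longer maintained or updated."
    "</div>"
)

# Matches an opening <body> tag with or without attributes, in any letter case.
_BODY_OPEN = re.compile(r"<body\b[^>]*>", re.IGNORECASE)


def add_banner(html: str, banner: str = BANNER) -> str:
    """Insert the banner right after the opening <body> tag.

    If the page has no <body> tag, the banner is prepended instead.
    """
    match = _BODY_OPEN.search(html)
    if match is None:
        return banner + html
    return html[: match.end()] + banner + html[match.end():]
```

In this sketch the banner is applied to the copy being displayed, so the bytes stored in the WARC stay untouched.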
5. Harvesting alternative: Donated content
  •       directly receive files from agency or commission

      •      Why not donated content?
             •   Content could be altered
             •   Harvesting = exact copy of online published content


      •      Why donated content?
             •   If content cannot be accessed by harvesting
             •   flash video, large amounts of media
             •   rarely necessary now
6. Link Checking
  •     Manual:
      •    manually navigate original & archived sites
  •     Automated:
      •    Xenu Link Checker
      •    http://home.snafu.de/tilman/xenulink.html
      •    compare reports of original & archived sites
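Comparing the two link reports can be sketched as a comparison of two mappings. This assumes each report has already been reduced to a mapping from site-relative path to HTTP status code; that reduced form and the `compare_reports` function are hypothetical and do not reflect Xenu's actual export format.

```python
def _is_ok(status: int) -> bool:
    """Treat 2xx and 3xx responses as working links."""
    return 200 <= status < 400


def compare_reports(
    original: dict[str, int], archived: dict[str, int]
) -> dict[str, list[str]]:
    """List paths that are absent from, or newly broken in, the archived site."""
    missing = sorted(path for path in original if path not in archived)
    broken = sorted(
        path
        for path, status in original.items()
        if _is_ok(status) and path in archived and not _is_ok(archived[path])
    )
    return {"missing_in_archive": missing, "broken_in_archive": broken}
```

Links that were already broken on the original site are deliberately not flagged, since the archive is meant to match the original rather than repair it.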
7. Load to UNT Server
  •    Upload archived website
  •    Add navigation
  •    Notify GPO (or agency/commission) that archived version is
       live
• Backup
  • full backups to magnetic tape
  • performed each weekend
  • shipped to offsite storage company
     • Iron Mountain
     • http://www.ironmountain.com
• web files (HTML, XML)
• text documents (.txt, .pdf, .doc)
• spreadsheets & statistics (.xls)
• presentations (.ppt)
• media files:
  • images & photographs (.jpg, .gif, .png, .tiff)
  • audio (.mp3)
  • video (.wm, .mov, .rp)
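As a rough illustration, the file types listed above could be used to inventory a harvested site by category. The extensions are copied from the list as written; mapping "HTML, XML" to `.html` and `.xml`, the `CATEGORIES` table, and the `categorize` function are assumptions for this sketch.

```python
from collections import Counter
from pathlib import PurePosixPath

# Extensions as they appear on the slide; ".html"/".xml" are assumed
# stand-ins for the "HTML, XML" web files entry.
CATEGORIES = {
    "web files": {".html", ".xml"},
    "text documents": {".txt", ".pdf", ".doc"},
    "spreadsheets & statistics": {".xls"},
    "presentations": {".ppt"},
    "images & photographs": {".jpg", ".gif", ".png", ".tiff"},
    "audio": {".mp3"},
    "video": {".wm", ".mov", ".rp"},
}


def categorize(paths):
    """Count files per category by extension; unknown extensions go to 'other'."""
    counts = Counter()
    for path in paths:
        ext = PurePosixPath(path).suffix.lower()
        category = next(
            (name for name, exts in CATEGORIES.items() if ext in exts),
            "other",
        )
        counts[category] += 1
    return counts
```

A large "other" count for a given site could be one hint that it uses content types outside this list.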
•   researchers
•   historians
•   students
•   government employees
•   general public




• avg. 1,000,000+ hits per month
• peak visits in one day:
   • 9,996 on 11.03.2011
• most popular site: 9/11 Commission
•   provides permanent public access
•   archive of “dead” government information
•   freely, globally available
•   73 websites and growing

• partnership between:
    • University of North Texas Libraries
    • U.S. Government Printing Office
    • National Archives and Records Administration
FOR FURTHER INFORMATION:
http://www.library.unt.edu/govinfo/
http://digital.library.unt.edu/explore/collections/GDCC/


   Starr Hoffman
  Head, Government Documents Dept.
  University of North Texas Libraries
  govinfo@unt.edu


  starr.hoffman@gmail.com
  http://geekyartistlibrarian.com


Editor's Notes

  • #14 to #21 (the same note accompanies each timeline slide)
    • 1995: Government Printing Office (GPO) publishes report stating need to preserve electronic government publications
    • 1997: University of North Texas (UNT) talks to GPO about forming a partnership; UNT archives website of the Advisory Commission on Intergovernmental Relations
    • 1999: UNT/GPO partnership is expanded: permanent public access; multiple government websites; any government agency or commission which is no longer operating (and/or has issued a final report). The collection is named “CyberCemetery” due to its collection of websites from “dead” government agencies and commissions
    • 2006: UNT/GPO partnership is expanded; now includes the U.S. National Archives and Records Administration (NARA)
    • 2011: 73 websites archived, and more on the way!