Gone today, here tomorrow: the future of government information and the digital FDLP
James R. Jacobs email@example.com lockss-usdocs.stanford.edu UW i-school Thursday January 24, 2013
Wednesday, January 23, 2013
I’d like to thank Cass Hartnett, the Northwest Government Information Network, the UW Information School, the UW Association of Library and Information Science Students (ALISS), and the University of Washington Libraries for inviting me to talk with you today. I hope it’ll be worth your while :-)
Aaron Swartz 11.8.86 - 1.11.13
PD haiku notice: do what you feel like / since the work is abandoned / the law doesn’t care
http://www.aaronsw.com/weblog/000360 http://www.rememberaaronsw.com/
I dedicate today’s talk to my friend and internet activist Aaron Swartz for his progressive ideals and dedication to free and open information access.
Agenda
• Historical ideals of the FDLP
• Collection strategies:
• Everyday Electronic Materials (EEMs) “Water droplets”
• Archive-it “Oceans”
• LOCKSS-USDOCS “Waterfalls”
• Collaboration “Reservoirs”
• Reflection
• [[slides available at slideshare.net/freegovinfo]]
Introduction and agenda.
We’re at the very beginning of the digital era, where tools, policies, best practices, etc. are all in flux. In many ways, we’re at an age where new metaphors are needed to describe what it is that we as librarians do on a daily basis.
I’d like to talk about the underlying historical ideals of the FDLP, discuss how those ideals have been under fire from both within and without the library community, and argue that those ideals, applied to today’s new information metaphors, give us the best chance at access to and long-term preservation and assurance of govt information.
Then I’ll talk about some of the digital collection strategies that I’ve found to be successful, and conclude with a bit about collaboration and to-dos.
Librarians ... ... Explore ... Collect ... Describe ... Share ... Preserve
... But basically, we explore, collect, describe, share and preserve the world of information. In my humble estimation, format does not change what it is that we do as librarians! Today I aim to show that the shift to digital does not preclude us from exploring, collecting, describing, sharing, and preserving government information.
Right up front: I’m a librarian and a collaborator in the LOCKSS-USDOCS distributed digital preservation project (Lots of Copies Keep Stuff Safe). I’ve been in academia/education my whole life as a student, teacher, librarian and technologist. I’ve worked in libraries since high school, have been a government information/FDLP librarian since 2002, and served a 3-year term on the Depository Library Council, the body which informs and advises the Govt Printing Office regarding issues of the Federal Depository Library Program. So my mindset/perspective/bias is from one who assists in the scholarly communication process, one who believes that libraries have a place in the digital information landscape, and one who believes strongly in the idea that public access to govt information is a fundamental right.
In the print era (which is not over!) we had rules and processes in place to do the things that we do as librarians.
In the digital realm (which is just beginning and will continue to overlap with the print era for the foreseeable future) we are just beginning to figure out the rules and processes. But the concepts remain the same.
Government documents are the DNA of the democratic process – Carl Malamud would call it the “source code.” And so we must find ways to continue to give access to and preserve this content for the long term.
FDLP principles
• Forward democratic ideals
• Serve public interest / public access / public control / public preservation
• Serve the information needs of your community
• Forward the long-term institutional viability of libraries
• Promote and leverage collective action
Are you:
--forwarding democratic ideals?
--serving public interest / public access / public control / public preservation?
--serving the information needs of your community?
--forwarding the long-term institutional life of libraries?
--promoting and leveraging collective action?
These are the principles that we as govt information librarians (and librarians in general!) hold dear. Best practices = practices in which these principles (the same principles embedded in the FDLP) are embedded.
If you too believe in these ideals (I hope!), then you already take actions in support of these values – and probably one of the main reasons you all have stayed in the field of librarianship is that you believe the following:
--libraries are critical as memory organizations
--local control of collections is imperative (e.g., a large network of libraries resists accidents and natural disasters and is self-healing. A large network of FDLP libraries can help alleviate and ameliorate the damage and rebuild collections when those accidents inevitably occur. Just ask my friend Rebecca Blakeley, who gave a wonderful presentation at the 2008 fall Depository Library Conference about the steps that McNeese State University Library took in rebuilding their documents collection after heavy damage suffered from Hurricane Rita.)
--distributed system is crucial to meet local needs (spread responsibility for content among various locations and administrations)
--public interest (affirms FDLP libraries’ role in ensuring permanent public access!)
--value of library community
--shared preservation responsibilities
While I talk about the following projects, please keep these principles and ideals in mind.
So let’s get to the case study part of the discussion.
“There seems to be an inverse relationship between convenience of dissemination and preservation standards.” -- Chuck Humphrey, data librarian, U of Alberta
Over the last 20-30 years, developments in publishing and Internet technologies have affected the way government information is produced, disseminated, controlled, and preserved. These changes have affected the policies and procedures of the GPO and, in turn, have affected the depository library program. Despite the often-heard promises that Web technologies will bring more information to more people more quickly and easily, the actual effects have been decidedly mixed. The highly visible, short-term successes of rapid dissemination of single titles directly to citizens (e.g., the large number of downloads of the 9/11 report) mask the loss of a secure infrastructure for long-term preservation of and access to government information (GPO’s Federal Digital System (FDsys) notwithstanding). More and more agencies publish content on their own Web sites rather than using the GPO conduit (producing what we in the govt info world call "fugitive documents"), and very few agencies publish to any standards or have policies in place that deal with archiving and preservation. As Chuck Humphrey, a data librarian friend of mine, once said, “there seems to be an inverse relationship between convenience of dissemination and preservation standards.”
In addition to this lack of a secure infrastructure, there is the growing din of the call for digitization of historic govt publications (I refuse to use the term “legacy”!) from some of the large library associations like ARL, ASERL and CIC. While no doubt a boon for access today (though with its own unique issues in terms of metadata, provenance, findability, usability etc.), this call is somewhat of a red herring that makes library administrators believe that they will soon be able to dispose of their physical collections – not to mention their documents staffs! – and use that space for this week’s buzz word. This call for digitization may instead have the deleterious effect of damaging the long-term preservation of govt publications.
Lastly, the growing trend toward privatization of govt information has actually caused a decrease in public access despite its digital nature. This is not a new trend. Herbert Schiller noted this in 1986 in his book "Information and the Crisis Economy." Speaking of machine-readable formats, he wrote that, "Library information capability is greatly enhanced. Yet this benefit is accompanied by the abandonment of libraries’ historical free access policy. User charges are introduced. The public character of the library is weakening as its commercial connection deepens. No less important, the composition and character of its holdings change as the clientele shifts from general public to the ability-to-pay user."
GAO/Thomson contract
Carl Malamud. Public.resource.org. 1/23/13 http://sn.im/gao-contract
We’ve seen over the last several years a disturbing rise in federal agencies entering into contracts with private companies whereby public domain govt documents are digitized and then taken out of the commons via licensing agreements. See, for example, the Government Accountability Office’s (GAO) deal with Thomson-West, whereby Thomson-West digitized the GAO’s 20,597 legislative histories of most public laws from 1915-1995 and in return received exclusive license to sell access to the content. GAO received nothing in return but an account on Thomson’s service, while the public received nothing at all.
Last year, NARA entered into a contract with Ancestry.com to serve out the 1940 census schedules (aka enumerators’ notebooks) that were released in 2012 after 72 years. Ancestry agreed to be NARA’s digital infrastructure, offering free access for 1 year (until April 2013), but thereafter the public would need an Ancestry subscription in order to access the schedules. And don’t get me started about IBM and Census’ American Factfinder.
Rapid technological change and the misplaced assumption that "it’s all in google" have caused some in the FDLP community to question the need for the FDLP and some others to drop out of the program altogether. I believe that the inherent nature of digital information actually increases the need for a distributed network of dedicated, legislatively authorized libraries and librarians. It would be prudent to draw upon the existing infrastructure of FDLP libraries and the 200 years of cumulative experience of these institutions in assuring preservation of and access to government information. We must reinforce FDLP’s traditional mission of selection, collection, free access, and preservation in the digital era in order to assure free access to this information into the foreseeable future.
FDLP ecosystem
Nobody knows for sure how to preserve digital content for the long term. This means to me that a loosely coupled, independently administered, distributed ecosystem is the best way to assure long-term preservation -- many organizations with many funding models and distributed technical infrastructures have a better shot at preservation than 1 or 2 organizations -- especially if one of those organizations has a tenuous budget, or is a private corporation etc. David Weinberger described the Web in this way in his book “Small pieces loosely joined” and I think that metaphor holds equally true for libraries. Here’s a back-of-the-napkin sketch of how I imagine the FDLP ecosystem to look.
How would each of these scenarios deal with or react to different stress situations or threat models (directly out of the OAIS handbook, e.g., reduced budgets, increased demand for privatization, increased demand for censorship or control or removal of information, media/hardware/software/network failure, natural disaster, organizational failure etc.)? It’s easy to see that a highly replicated, distributed FDLP model of preservation based on common open digital standards and OAIS would deal with these situations much better than a centralized model. A web is much stronger than a silo. This holds true for all information, not just govt info of course.
Thus ends the soapbox portion of my talk. I’m sure to get back on it later, but for now I’d like to shift gears a bit and talk about practical matters and about my strategy for collection development and long-term preservation in the FDLP ecosystem. I’ll run through a few examples of how to conceptualize and actually do digital collection development of govt information. I like to use a water metaphor to describe my processes. In the digital realm, we have to collect drops of water, waterfalls, and the ocean.
First the droplets:
EEMs
• Everyday Electronic Materials
• serendipitous collection
• Collecting the Web a drop at a time
• Flickr photo by Elle Is Oneirataxic. Attribution-NonCommercial-ShareAlike 2.0 Generic Creative Commons license
EEMs – or Everyday Electronic Materials – is a Mellon Foundation grant-funded project here at Stanford to build infrastructure and a workflow to support the collection, description, preservation and public access of digital objects by bibliographers and subject specialists.
EEMs are those digital materials that are serendipitously referenced in news reports, distributed by posting on Web sites, or sent through email notification to scholars and bibliographers; those items that selectors come across in the course of doing their everyday work. In the past, librarians may have downloaded documents to their desktops and perhaps printed them out and had them bound (if their administrations were amenable!). Now we’ve got digital stacks in which to collect, preserve and give access!
**For those interested in more, I’ve got a citation and link at the end of the presentation to my colleague Katherine Kott’s report on the project. For those champing at the bit now, just Google Kott, EEM, CNI.
Subject specialist workflow is pretty simple:
1. identify a document (*only pdfs and only monographs at this time)
2. drag url of doc to the EEMs browser widget
3. determine copyright status. Request permission from the copyright owner to harvest/preserve if need be (I can usually skip this step with public domain govt documents!)
4. describe the document (title, author, rights status, notes)
5. submit to acq and cataloging workflow.
6. EEM is locally stored in our digital repository and accessible through our catalog (searchworks)
7. My EEMs workflow also includes reporting fugitive documents to GPO, but I’ll describe that momentarily.
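To make steps 1 through 5 of the workflow above a little more concrete, here is a minimal Python sketch of what a selector-side record might look like. It is not the actual EEMs tool: the `EEMRecord` class, its field names, the `PUBLIC_DOMAIN` label and the example URL are all hypothetical, chosen only to mirror the description fields (title, author, rights status, notes) and the PDF-only and copyright checks mentioned above.

```python
from dataclasses import dataclass

# Hypothetical rights label; the real EEMs tool may use different values.
PUBLIC_DOMAIN = "public domain"


@dataclass
class EEMRecord:
    """One document a selector wants to collect (illustrative only)."""
    url: str
    title: str
    author: str
    rights_status: str
    notes: str = ""
    permission_granted: bool = False

    def is_pdf(self) -> bool:
        # Step 1: only PDFs are accepted at this time.
        return self.url.lower().split("?")[0].endswith(".pdf")

    def needs_permission(self) -> bool:
        # Step 3: public domain documents can skip the permission request.
        return self.rights_status != PUBLIC_DOMAIN and not self.permission_granted

    def ready_to_submit(self) -> bool:
        # Step 5: hand off to acquisitions/cataloging only when described and cleared.
        return self.is_pdf() and bool(self.title) and not self.needs_permission()


# Made-up example document on a placeholder domain.
record = EEMRecord(
    url="https://agency.example.gov/reports/annual-report.pdf",
    title="Annual Report (hypothetical)",
    author="Example Agency",
    rights_status=PUBLIC_DOMAIN,
)
print(record.ready_to_submit())
```

A copyrighted document would report `needs_permission()` as true until `permission_granted` is set, which is the one place where the govt documents case is easier than everyone else's.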
Agencies tracked for EEMs
• Bureau of Land Management CA field office
• Department of Justice
• Bureau of Ocean Energy Management, Regulation and Enforcement (BOEMRE) (including Minerals Management Service)
• NOAA
• National Cancer Institute
• National Institutes of Health
• USDA
• Office of Management and Budget
• **Harvesting with archive-it:
• EPA
• GAO
• Census current industrial reports
• Thanks lost docs blog! http://lostdocs.freegovinfo.info
My use of the EEMs workflow and tool grew out of 2 other projects focusing on fugitive govt documents – fugitive documents are a particular passion of mine!
Particularly through the work of the lostdocs blog (lostdocs.freegovinfo.info) – which tracks fugitive document submissions to the GPO in order to provide a public listing of fugitive documents – I’ve been able to target several agencies that generally are the worst offenders in terms of fugitive documents (the first eight agencies on the slide).
We also found that 3 other agencies that were top fugitive offenders published too many documents to make the EEMs workflow feasible. So I’m harvesting those 3 agencies (EPA, GAO and Census current industrial reports) with Archive-it, which I’ll describe later.
I have an acquisitions staff person working about 3 hrs per month to 1) check the agency publications pages for new publications; 2) check the CGP (http://catalog.gpo.gov) to see if the document has made it into the GPO catalog; and 3) if it has not, submit a fugitive document report to GPO and upload the PDF to the EEMs tool.
Besides these federal agencies, I also scour the news – and have a google alert set – for leaked and newsworthy govt documents like the recently debunked LoC report on Iranian intelligence written about on ProPublica. This is sort of like reverse engineering the collection development process.
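The monthly check described above (look at agency pages, compare against the CGP, report what is missing) can be sketched in a few lines of Python. This is only an illustration under loud assumptions: in practice the CGP lookup is a manual catalog search, so here it is stood in for by a plain set of titles, and the `Publication` class, the `find_fugitive_candidates` function, the titles and the URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Publication:
    """A document spotted on an agency publications page (illustrative)."""
    agency: str
    title: str
    url: str


def find_fugitive_candidates(
    found: Iterable[Publication], cataloged_titles: Iterable[str]
) -> List[Publication]:
    """Return publications whose titles do not appear in the catalog list.

    Matching on a normalized title is an assumption made for this sketch;
    a real check against the CGP would be done by a person or a proper search.
    """
    normalized = {title.strip().lower() for title in cataloged_titles}
    return [p for p in found if p.title.strip().lower() not in normalized]


# Made-up data standing in for one month's check.
found_this_month = [
    Publication("Example Agency", "Field Office Plan 2012", "https://agency.example.gov/plan.pdf"),
    Publication("Example Agency", "Budget Circular A", "https://agency.example.gov/circular.pdf"),
]
already_in_cgp = ["budget circular a"]

to_report = find_fugitive_candidates(found_this_month, already_in_cgp)
for pub in to_report:
    print(f"Report to GPO and upload to EEMs: {pub.title} ({pub.url})")
```

Anything left in `to_report` is a candidate for a fugitive document report and an EEMs upload; anything already cataloged drops out of the list.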
EEM: http://searchworks.stanford.edu/view/8707790
Through the EEMs workflow, to date we’ve been able to collect over 400 documents like this one (notice the Stanford PURL), preserve them locally in the Stanford digital repository (SDR) and give access to them through our catalog, searchworks. Think what we could do if 100 libraries – or 1000! – instituted this workflow. Collectively, we could cover all federal agencies to assure that no born-digital document within scope of the FDLP falls through the cracks and becomes fugitive.
Next I’ll talk about the ocean:
Archive-it
• collecting the Web in bulk
• Archive-it.org/home/ssrg
• Fotopedia image by Marcus Revertegat. Creative Commons Attribution 3.0 Unported license.
Archive-it is a subscription service from the Internet Archive – which by the way has many digital copies of historic govt documents and digitized microfilm available in its text collection. It’s an easy collection-building tool whereby you give the software a list of urls (called “seeds”), schedule the crawler to harvest the seeds, and then give public access to the content collected. It’s a good way to contextualize or make sense of the ocean of content on the open Web.
Since 2007 we’ve harvested:
Documents Crawled: 58,590,127 (anything from a spacer gif to an mp4 file is considered a “document”)
Data Archived: 4,616.5 GB (4.6 TB!)
SULAIR archive-it home: http://www.archive-it.org/home/SSRG
What I’m collecting with Archive-It:
• CRS Reports
• FOIA documents and Agency FOIA reading rooms
• Fugitive US agencies: EPA, GAO etc (shout-out to lostdocs.freegovinfo.info)
• Bay Area governments
• Climate change and environmental policy
• G-20
• CA Dept of education curriculum and instruction
• US budget
• FRUS
Collection seeds https://archive-it.org/public/collection.html?id=1078
Metadata: one of our catalogers has created Dublin Core metadata at the collection and seed level. Archive-it allows for metadata at the document level, but we have not done that. We are in the planning stage to index the metadata for our catalog. We’re also planning to feed archive-it collections into our LOCKSS caches for redistribution and long-term preservation.
search and discover http://snipurl.com/crs-energyefficiency
We give access to the collections via full text search from the archive-it site and from our databases page. Our crawled seeds also are brought into the wayback machine for public access.
Search can also be embedded into other Web pages (feel free to copy/paste this code!)
Paste this into your HTML:
<form action="http://www.archive-it.org/public/search">
  <input type="hidden" name="collection" value="***COLLECTIONID***" />
  <input type="text" name="query" />
  <input type="submit" name="go" value="Go" />
</form>
***COLLECTIONID*** = 1078 (CRS reports collection)
add search to other pages </gratuitous_code>
For example, with the collection ID filled in:
<form action="http://www.archive-it.org/public/search">
  <input type="hidden" name="collection" value="1078" />
  <input type="text" name="query" />
  <input type="submit" name="go" value="Go" />
</form>
Lastly, I’ll mention the waterfall that is LOCKSS-USDOCS.
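A side note on the embedded search form: because it declares no `method`, a browser submits it as a GET request, so the same search can be expressed as a plain link. The Python sketch below builds that link from the form's own field names (`collection`, `query`, `go`). The `search_url` helper is a hypothetical name, and whether the endpoint still answers the way it did at the time of this talk is an assumption.

```python
from urllib.parse import urlencode

# Endpoint and field names are copied from the form above.
SEARCH_ENDPOINT = "http://www.archive-it.org/public/search"


def search_url(collection_id: int, query: str) -> str:
    """Build the URL a browser would request when the form is submitted."""
    params = {"collection": collection_id, "query": query, "go": "Go"}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# 1078 is the CRS reports collection ID given on the slide.
url = search_url(1078, "energy efficiency")
print(url)
```

A link built this way can be dropped into a subject guide or a blog post where pasting a form is not practical.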
LOCKSS-USDOCS
• Targeted Web collection and distributed preservation
• Lots of Copies Keep Stuff Safe
• lockss-usdocs.stanford.edu
• Flickr waterfall picture by discordia1967. That’s actually me at Hanakapi`ai falls in Kauai :-)
Combines the best of targeted Web harvesting with collaboration and distributed preservation.
LOCKSS – Lots of Copies Keep Stuff Safe – began at Stanford in 1999. The LOCKSS software was built to solve the problem of long-term preservation of digital content. It is an open-source distributed digital preservation system based on open standards (OAIS, OpenURL, HTTP, WARC). Originally LOCKSS was focused on journal literature, but over the last 10 years it has been used by other projects focusing on government information, theses and dissertations, numeric data, state records etc.
The goals of LOCKSS are to spread out the economic cost and responsibility of digital preservation and to use off-the-shelf hardware and open-source software, so that libraries and content publishers can easily and affordably create, preserve, and archive local electronic collections, and readers can access archived and newly published content transparently at its original URLs through link resolvers like SFX.
Think of a LOCKSS box as a digitally distributed depository library!
SLIDE 16: DECENTRALIZED PRESERVATION (NEED?)
How does LOCKSS work?
There are 2 parts to the LOCKSS software: harvest and content collection; and content checking and replication.
1) Any site – for example FDsys.gov – that gives LOCKSS permission to harvest can be collected by the LOCKSS Web harvester -- the state of the art in Web harvesting!
2) And this is the cool part: LOCKSS goes through a process of checking and polling all digital content in all of the LOCKSS boxes on a network. If 1 box has content that is different from all of the other boxes, the software will fix the content, assuring that all content in the whole network is exactly the same. It is for all intents and purposes injecting stem cells into the network to replicate and fix content that’s become corrupted over time.
That’s it. LOCKSS is elegant in its simplicity and proven effective in keeping LOCAL(!) digital content safely preserved over time. This is as close as it gets to the Unix maxim of “doing one thing, doing it well.”
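For the programmers in the room, the checking-and-repair idea in part 2 can be illustrated with a toy Python sketch. This is emphatically not the LOCKSS protocol, which is considerably more involved; it only shows the underlying idea of comparing copies by hash, letting the majority decide, and repairing the odd one out. The `poll_and_repair` function, the box names and the sample content are all made up for illustration.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def digest(content: bytes) -> str:
    """Fingerprint a copy so boxes can compare content without trusting names."""
    return hashlib.sha256(content).hexdigest()


def poll_and_repair(boxes: Dict[str, bytes]) -> List[str]:
    """Repair copies that disagree with the majority; return the repaired box names."""
    digests = {name: digest(content) for name, content in boxes.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes <= len(boxes) / 2:
        # Without a clear majority this toy version refuses to guess.
        raise ValueError("no majority; cannot decide which copy is good")
    good_copy = next(boxes[name] for name, d in digests.items() if d == winner)
    repaired = []
    for name, d in digests.items():
        if d != winner:
            boxes[name] = good_copy  # overwrite the damaged copy
            repaired.append(name)
    return repaired


# Four hypothetical library boxes; one copy has suffered simulated bit rot.
boxes = {
    "library_a": b"Federal Register, vol. 1",
    "library_b": b"Federal Register, vol. 1",
    "library_c": b"Federal Register, vol. 1",
    "library_d": b"Federal Register, v0l. 1",
}
repaired = poll_and_repair(boxes)
print("Repaired:", repaired)
```

The sketch also shows why lots of copies matter: with only 1 or 2 boxes there is no majority to appeal to, so a damaged copy cannot be told apart from a good one.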
LOCKSS-USDOCS
• LOCKSS for US Documents
• Replicates FDLP in the digital environment
• “digital deposit” (for more on “digital deposit,” see http://freegovinfo.info/taxonomy/term/3)
• Tamper evident
• 36 libraries and GPO participating
So now you can see why some of us in the documents community are so excited about LOCKSS and why we decided to implement LOCKSS-USDOCS. Portland State and Simon Fraser Universities are the closest partners, but I’m always looking for more.
Using the LOCKSS software we are re-implementing a tamper-evident distributed preservation system for digital documents. Rather than a central silo on a .gov server, digital govt documents reside on 36 servers at 36 different libraries (and counting!).
LOCKSS-USDOCS is ...
Federal register, code of federal regulations, congressional record, congressional bills, congressional reports, US Code, Public & Private laws, Public Papers of the President, historic supreme court decisions, US Statutes at Large, GAO Reports, US Budget ... and more!!
http://www.gpo.gov/fdsys/browse/collectiontab.action
GPO has been instrumental in this process by putting LOCKSS permission statements on all 44 FDsys collections. This includes:
Federal register, code of federal regulations, congressional record, congressional bills, congressional reports, US Code, Public & Private laws, Public Papers of the President, historic supreme court decisions, US Statutes at Large, GAO Reports, US Budget, etc., many of these going back to the early 1990s when they first went digital.
In the 2008 Blue Ribbon Task Force on Sustainable Digital Preservation and Access, Abby Smith Rumsey wrote, “Access to valuable digital materials tomorrow depends upon preservation actions taken today; and, over time, access depends on ongoing and efficient allocation of resources to preservation.”
With LOCKSS-USDOCS we’re taking collective responsibility today for long-term preservation of digital depository materials.
Collaboration
• Farmington Plan Redux
• Summer digital FDLP Institute
• Adopt a federal agency
• Join LOCKSS-USDOCS, TRAIL and other digitization/digital preservation projects
• Seed the cloud:
• Start blogging your Q&As and editing Wikipedia articles http://snipurl.com/qa-average-tariff-levels
• Catalog, catalog, catalog!
Ok, here’s James getting back on his soapbox!
As you can see, the technological tools are there. But there’s a need for a “Farmington Plan redux”:
The Farmington Plan, which lasted from 1948 - 1972, was an innovative ARL program of collaborative collection development whereby subscribing libraries would have responsibility for collecting and cataloging research materials in certain subject and/or linguistic areas and would then distribute records (in the form of cards) to the National Union Catalog.
Moving forward, here are some things that we need to do as a community to realize this Farmington Plan Redux and build the digital FDLP reservoir!:
--First and foremost, set up a summer digital FDLP institute modeled on the ICPSR data library workshop which has trained a generation of data librarians. Cass and I have talked about this before and I think this is one of the most critical pieces of the Farmington Plan Redux. The institute would train govt information librarians (and those interested in govt information) on the ins and outs of the Open Archival Information System (OAIS) and other open digital library tools and standards – including a proposed standard that my friend and FGI co-conspirator Jim Jacobs and I have written about in a soon to be published D-Lib article called the "Digital-Surrogate Seal of Approval" (DSSOA), a simple way of describing and guaranteeing to end-users the quality and accuracy of existing digital surrogates created from printed books and other non-digital originals.
The institute would teach techniques for expanding access to both digital and paper collections, give librarians a framework for updating their understanding, increase their awareness of digital archival concepts, and build and expand their digital toolboxes to include Web harvesting, digital information collection and organization, building and utilizing Web tools, and the semantic Web.
--Adopt a federal agency (or better yet, a local/regional office of a federal agency). Submit fugitive documents to GPO for inclusion in the CGP and distribution out to other depositories.
--Join LOCKSS, the Technical Reports Archive and Information Library (TRAIL) – shout-out to Mel DeSart who’s been instrumental in building up TRAIL! – and other digitization/digital preservation projects.
--Seed the cloud:
Start blogging your Q&As and editing Wikipedia articles with library resources. Your users are online and using Google and other search engines to find stuff. This is an easy way to highlight your collections and your library’s resources and services. Highlighting your collections online brings users to your library.
Shoutout to Ann Lally and Carolyn Dunford for their 2007 D-Lib article about seeding Wikipedia articles!
“...let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.” -- Thomas Jefferson, February 18, 1791
Digital changes a lot of things about information, but it doesn’t change the need to collect it, share it, preserve it, and give access to it. As my friend, mentor and FGI co-conspirator Jim Jacobs recently stated, "lots of collections keep stuff safe!" (yes, there are 2 of us working on FGI!)
“...let us save what remains: not by vaults and locks which fence them from the public eye and use in consigning them to the waste of time, but by such a multiplication of copies, as shall place them beyond the reach of accident.”
Or in other words:
Thanks!
Thanks everyone!
Further reading
• Future of the Federal Depository Library Program. Free Government Information. http://freegovinfo.info/taxonomy/term/1087
• “Open Government Publications” Letter to Deputy CTO Noveck. http://freegovinfo.info/node/2970
• “Digital Deposit.” Free Government Information. http://freegovinfo.info/taxonomy/term/3
• Preservation for all: LOCKSS-USDOCS and our digital future. James Jacobs and Victoria Reich. Documents to the People (DttP) Volume 38:3 (Fall 2010). http://freegovinfo.info/system/files/lockssusdocs-dttp38%283%29.pdf
• Everyday Electronic Materials in Policy and Practice. Coalition for Networked Information (CNI) project briefing. Fall 2010. Katherine Kott. http://sn.im/eems-report
• A Guide to Distributed Digital Preservation. K. Skinner and M. Schultz, Eds. (Atlanta, GA: Educopia Institute, 2010). http://www.metaarchive.org/GDDP
• http://lockss-usdocs.stanford.edu