SlideShare a Scribd company logo
1 of 35
Web and Twitter Archiving at
the Library of Congress
Web Archive Globalization Workshop
June 16, 2011




Nicholas Taylor (@nullhandle)
Web Archiving Team
Library of Congress
why archive the web?

• preserve our nation’s
  history and culture
• identify and preserve at-
  risk digital content
• develop tools, models,
  and methods for digital
  preservation




                                  “World Wide Web 1997: 2 Terabytes in 63 Inches”

                              2                   LIBRARY OF CONGRESS
various collection strategies

• entire web domain—Internet Archive
• national domain—Sweden, Denmark, others
• selective (individual URLs) and thematic—
  Australia
• thematic or event-based—Library of Congress

http://netpreserve.org/about/archiveList.php



                       3           LIBRARY OF CONGRESS
web archiving at LC

• began in 2000 with MINERVA
  pilot
• identify policy issues,
  establish best practices,
  build tools (internally and
  w/ partners)
• broaden expertise and
  understanding of Web
  Archiving within LC
• collect, manage and sustain
  at-risk digital content
                                LC Prints & Photographs: design for Minerva Congressional Library

                           4                          LIBRARY OF CONGRESS
Curators/Recommending Officers
In Library Services, Congressional Research
   Service, and the Law Library pick the                Web Archiving Team and Web
   collections and what URLs to archive,                Preservation Engineering Team
 and research who to contact for permission.        In the Office of Strategic Initiatives (OSI).
                                                    We are project managers and technical staff
                                                    focused on capture, tools, and permissions.
                   Bibliographic Access
            MODS records are created in Library
           Services: the Network Development and         Information Technology Office and
            MARC Standards Office (NetDev) and              Technical Architecture Team
            Acquisitions and Bibliographic Access   Also in OSI. Supports Wayback Machine, Heritrix,
                (ABA) staff do the cataloging.         repository and tools development, and data
                                                     transfers. Contractors are also used in this area.


                                                                          LIBRARY OF CONGRESS
LC collections: over 245 TB + 5 TB/month
    – ongoing collections, including:
        • Congress/Legislative websites
        • Legal Blawgs
        • Public Policy Topics
    – event-based collections, including:
        • U.S. National Elections—2000, 2002, 2004, 2006, 2008, 2010
        • Iraq War 2003-2009
        • September 11 2001 and September 11 Remembrance 2002
        • Civil War Sesquicentennial
        • Olympics 2002
        • Supreme Court Nominations
        • Papal Transition
        • Case Studies: health care, terrorism, visual image content,
          organizational Web sites, Crisis in Darfur, “single site”
    – Overseas Operations collections, including: Egypt 2008; Brazilian, Indian,
      Indonesian, Philippine, and Thai Elections; Afghanistan Government;
      Pakistan Nationalisms

http://www.loc.gov/webarchiving/collections.html
                                         6                   LIBRARY OF CONGRESS
web archives access: loc.gov/lcwa




                 7        LIBRARY OF CONGRESS
essential tools

• capture: Heritrix (contract crawling w/ IA and
  in-house)
• replay: Wayback
• permissions/seed management; capture quality
  review; reporting; transfer tracking: custom
  apps built on LAMP stack
• transfer: BagIt Library (based on BagIt spec);
  *nix ingest/staging/storage/access servers;
  Internet2 connection
                       8            LIBRARY OF CONGRESS
other useful tools

• web archiving workflow management:
  NetArchiveSuite, Web Curator Tool
• small-scale web archiving: HTTrack
• Firefox add-ons: Firebug, Web Developer




                       9           LIBRARY OF CONGRESS
cataloging for access

• collection-level metadata
• site-level bibliographic metadata
  – nominators provide subject heading
  – HTML metadata extraction via cURL
  – cataloger assigns keywords
• cataloging metadata stored in MODS
• assisted keyword assignment: HIVE


                       10             LIBRARY OF CONGRESS
collection-level record example




                          LIBRARY OF CONGRESS
MODS bibliographic record example




                12       LIBRARY OF CONGRESS
Wayback Machine Resource Page




               13       LIBRARY OF CONGRESS
example of an archived site




                 14       LIBRARY OF CONGRESS
search and discovery

•   bibliographic metadata search
•   (not yet) Memento-enabled
•   full-text search based on NutchWAX unfeasible
•   Lucene/Solr looks promising




                         15          LIBRARY OF CONGRESS
Challenges for

WEB ARCHIVES

                 16   LIBRARY OF CONGRESS
challenges for web archives
 • technical
    – large, deep, dynamic, interlinked
    – continuous transformation, simultaneously growing and
      disappearing
 • intellectual property laws and regulations
    – legal deposit laws, mandates for preservation, laws that
      do not address web content
 • economic environment
    – few good business models for sustaining web collections
 • social environment
    – who is responsible and how is responsibility shared?

                               17              LIBRARY OF CONGRESS
capture, replay, and preservation

• capturing websites – Heritrix
   – “form-fronted” databases (i.e., “deep web”)
   – URLs the crawler can’t see that we want
   – …and URLs the crawler can see that we don’t
   – web 2.0 and other “new” web technologies
• replaying archived versions – Wayback
   – non-rigorous website coding
   – live site “leakage”
   – significant interactivity may be lost
• preserving access to our archives
   – billions of files
   – thousands of file types
   – how do we ensure content is accessible in 10, 25, 50 or more
      years?

                                                    LIBRARY OF CONGRESS
scaling capacity

• budgetary pressures
• limited access server disk
  space
   – competing w/ other big
     data projects
• new infrastructure for
  new capabilities



                                    photo by Henrik Bennetsen under CC BY-SA 2.0

                               19                           LIBRARY OF CONGRESS
when the only tool you have is a library…




         LC Prints & Photographs: exterior view of the LC Jefferson Building

                                                20                             LIBRARY OF CONGRESS
…many things look like collections

• archive behaves more like discrete records than
  web
  – archived sites not contiguously navigable
  – data doesn’t readily allow for downloading
• Twitter archive may prompt re-thinking web
  archive data access




                        21            LIBRARY OF CONGRESS
web archiving and U.S. copyright law

• legal deposit requirement
  only applies to “published
  works” (§ 407)
• § 108 of the Copyright Act
  provides library exceptions
   – doesn’t address digital
     preservation and web
     archiving


                                photo by Gabriel de Urioste under CC BY 2.0



                                                    LIBRARY OF CONGRESS
why not rely on robots.txt?

• unreliable proxy for
  copyright permissions
• archival crawler ≠ search
  crawler
• LC disregards robots.txt
  but leaves contact info




                              last.fm: robots.txt


                                                    LIBRARY OF CONGRESS
capture and access permissions

• permissions-based approach
                                                          access
  began in 2002                              capture
                                                          offsite
• permission plans for each
  collection developed w/
                               government   no notice    no notice
  counsel
• permission requirements
  depend on site type          advocacy/
                                              notice     permission
                                 policy
• more liberal about capture
  than about offsite access
                                 news       permission   permission



                                               LIBRARY OF CONGRESS
implications of opt-in permissions

• no response treated as denial
  – very few denials
  – many non-responsive
• case study: September 11, 2001
  – 2300 cataloged, 30000 uncataloged URLs
  – many news sites (“high risk” permissions
    category)
  – no permissions sought
  – very few takedown requests
                          25          LIBRARY OF CONGRESS
the future of permissions

• risk of more liberal
  approach appears low
• hope to move to
  more notice-based,
  opt-out policy
• may affect
  previously-captured
  sites as well
                              photo by RJ Sangosti, Denver Post under © (fair use)


                         26                                    LIBRARY OF CONGRESS
Challenges for the

TWITTER ARCHIVE

                     27   LIBRARY OF CONGRESS
why archive Twitter?

• historical record of communication, news reporting,
  and social trends
• complements collections and mission




        http://twitter.com/#!/klerner/status/64895357355704320
                                                                 LIBRARY OF CONGRESS
Twitter Archive FAQs

• currently receiving Tweets through Gnip
• includes only the public archive
  – deletions will propagate to archive
• access limitations
  – 6 month embargo on new Tweets
  – no bulk distribution
• downstream users
  – no commercial use
  – no substantial re-distribution
                                          LIBRARY OF CONGRESS
a (literally) growing challenge

• ~3 years: time it took from 1st Tweet to
  billionth
• 1 week: time it now takes users to send a
  billion Tweets
• average Tweets/day in 3/10: 50 million
• average Tweets/day in 3/11: 149 million

http://blog.twitter.com/2011/03/numbers.html

                       30           LIBRARY OF CONGRESS
questions to consider

• how does archive fit in w/ existing collections?
• how are agreement guidelines interpreted and
  implemented technically?
• what kind(s) of access can we provide?
• what context do we provide for content?




                                      LIBRARY OF CONGRESS
additional goals

• justify value to Congress and public
• understand and respond to researcher needs
• push the institution beyond existing curatorial
  models




                                     LIBRARY OF CONGRESS
for more information
•   Library of Congress Web Archiving Program:
    http://www.loc.gov/webarchiving/
•   Library of Congress Web Archives:
    http://loc.gov/lcwa/
•   National Digital Information Infrastructure and
    Preservation Program:
    http://www.digitalpreservation.gov/
•   Library of Congress – Twitter FAQ:
    http://blogs.loc.gov/loc/2010/04/the-library-and-twitter
•   Section 108 Study Group:
    http://www.section108.gov/


                           33             LIBRARY OF CONGRESS
for more information
•   “Legal Issues in Building Social Media Collections:”
    http://www.arl.org/bm~doc/mm11sp-okeeffe.pdf
•   “How the Library of Congress is building the Twitter
    archive:”
    http://radar.oreilly.com/2011/06/library-of-congress-tw
•   “Web Archives: The Future(s):”
    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=183




                           34            LIBRARY OF CONGRESS
questions?

Nicholas Taylor
@nullhandle
ntay@loc.gov

More Related Content

Viewers also liked

الفصل الأول مكتبات
الفصل الأول مكتباتالفصل الأول مكتبات
الفصل الأول مكتباتEman Eldaboly
 
الارشيف الالكتروني
الارشيف الالكترونيالارشيف الالكتروني
الارشيف الالكترونيabedelaziz benzine
 
pdfالتسير الالكتروني للوثائق
 pdfالتسير الالكتروني للوثائق pdfالتسير الالكتروني للوثائق
pdfالتسير الالكتروني للوثائقabedelaziz benzine
 
أنواع المكتبات الوطنية والجامعية والمدرسية
أنواع المكتبات الوطنية والجامعية والمدرسيةأنواع المكتبات الوطنية والجامعية والمدرسية
أنواع المكتبات الوطنية والجامعية والمدرسيةأ.د حسين عبد الباسط
 
درس كامل عن محركات البحث
درس كامل عن محركات البحثدرس كامل عن محركات البحث
درس كامل عن محركات البحثdalirym
 
أساسيات البحث على الإنترنت
أساسيات البحث على الإنترنتأساسيات البحث على الإنترنت
أساسيات البحث على الإنترنتMostafa Gawdat
 
The Library of Congress Classification
The Library of Congress ClassificationThe Library of Congress Classification
The Library of Congress ClassificationDaryl Superio
 
مشروع المتصفحات ومحركات البحث نهائي
مشروع المتصفحات ومحركات البحث نهائيمشروع المتصفحات ومحركات البحث نهائي
مشروع المتصفحات ومحركات البحث نهائيDr.Mohammed AlMutahher
 
ادوات البحث في شبكة الانترنت
ادوات البحث في شبكة الانترنتادوات البحث في شبكة الانترنت
ادوات البحث في شبكة الانترنتسامر باخت
 
Library of Congress Classification
Library of Congress ClassificationLibrary of Congress Classification
Library of Congress ClassificationDenise Garofalo
 

Viewers also liked (12)

الفصل الأول مكتبات
الفصل الأول مكتباتالفصل الأول مكتبات
الفصل الأول مكتبات
 
الارشيف الالكتروني
الارشيف الالكترونيالارشيف الالكتروني
الارشيف الالكتروني
 
pdfالتسير الالكتروني للوثائق
 pdfالتسير الالكتروني للوثائق pdfالتسير الالكتروني للوثائق
pdfالتسير الالكتروني للوثائق
 
CTDA MODS Implementation Guidelines
CTDA MODS Implementation GuidelinesCTDA MODS Implementation Guidelines
CTDA MODS Implementation Guidelines
 
ديوي العشري
ديوي العشريديوي العشري
ديوي العشري
 
أنواع المكتبات الوطنية والجامعية والمدرسية
أنواع المكتبات الوطنية والجامعية والمدرسيةأنواع المكتبات الوطنية والجامعية والمدرسية
أنواع المكتبات الوطنية والجامعية والمدرسية
 
درس كامل عن محركات البحث
درس كامل عن محركات البحثدرس كامل عن محركات البحث
درس كامل عن محركات البحث
 
أساسيات البحث على الإنترنت
أساسيات البحث على الإنترنتأساسيات البحث على الإنترنت
أساسيات البحث على الإنترنت
 
The Library of Congress Classification
The Library of Congress ClassificationThe Library of Congress Classification
The Library of Congress Classification
 
مشروع المتصفحات ومحركات البحث نهائي
مشروع المتصفحات ومحركات البحث نهائيمشروع المتصفحات ومحركات البحث نهائي
مشروع المتصفحات ومحركات البحث نهائي
 
ادوات البحث في شبكة الانترنت
ادوات البحث في شبكة الانترنتادوات البحث في شبكة الانترنت
ادوات البحث في شبكة الانترنت
 
Library of Congress Classification
Library of Congress ClassificationLibrary of Congress Classification
Library of Congress Classification
 

Similar to Web and Twitter Archiving at the Library of Congress

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012lljohnston
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...nullhandle
 
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...The Frick Collection
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...Micah Altman
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunitiesAhmed AlSum
 
Putting the Pieces Together: Creating a National Educational Television Catalog
Putting the Pieces Together: Creating a National Educational Television CatalogPutting the Pieces Together: Creating a National Educational Television Catalog
Putting the Pieces Together: Creating a National Educational Television CatalogWGBH Media Library and Archives
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3Essam Obaid
 
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사Chris
 
Building the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsBuilding the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsWGBH Media Library and Archives
 
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...ABES
 
Unlocking LOCKSS with APIs
Unlocking LOCKSS with APIsUnlocking LOCKSS with APIs
Unlocking LOCKSS with APIsnullhandle
 
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...Boiling the Ocean, Together: Web Archive Collection Development in a Global C...
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...nullhandle
 
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...Anna Perricci
 
Interlibrary Loans and Copyright
Interlibrary Loans and CopyrightInterlibrary Loans and Copyright
Interlibrary Loans and CopyrightLisa Redlinski
 
Digital Library
 Digital Library Digital Library
Digital LibraryShiv Kumar
 
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...Marcus Smith
 

Similar to Web and Twitter Archiving at the Library of Congress (20)

Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...
Lots More LOCKSS for Web Archiving: Boons from the LOCKSS Software Re-Archite...
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
Making the Black Hole Gray: Implementing the Web Archiving of Specialist Art ...
 
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
SAFETY NETS: RESCUE AND REVIVAL FOR ENDANGERED BORN-DIGITAL RECORDS- Program ...
 
Web archiving challenges and opportunities
Web archiving challenges and opportunitiesWeb archiving challenges and opportunities
Web archiving challenges and opportunities
 
Putting the Pieces Together: Creating a National Educational Television Catalog
Putting the Pieces Together: Creating a National Educational Television CatalogPutting the Pieces Together: Creating a National Educational Television Catalog
Putting the Pieces Together: Creating a National Educational Television Catalog
 
The development of web archiving 3
The development of web archiving 3The development of web archiving 3
The development of web archiving 3
 
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사
Wikimedia 재단과 MediaWiki 위키 소프트웨어 조사
 
Building the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access WorkflowsBuilding the AAPB: Inter-Institutional Preservation and Access Workflows
Building the AAPB: Inter-Institutional Preservation and Access Workflows
 
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
Jabes 2008 - Conférence inaugurale, la grande révélation : penser les ressour...
 
Unlocking LOCKSS with APIs
Unlocking LOCKSS with APIsUnlocking LOCKSS with APIs
Unlocking LOCKSS with APIs
 
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...Boiling the Ocean, Together: Web Archive Collection Development in a Global C...
Boiling the Ocean, Together: Web Archive Collection Development in a Global C...
 
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
Progress Made and Lessons Learned through Collaborative Web Archiving Proj...
 
Interlibrary Loans and Copyright
Interlibrary Loans and CopyrightInterlibrary Loans and Copyright
Interlibrary Loans and Copyright
 
Digital Library
 Digital Library Digital Library
Digital Library
 
SirsiDynix
SirsiDynixSirsiDynix
SirsiDynix
 
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
Linked Open Data and The Digital Archaeological Workflow at the Swedish Natio...
 

More from nullhandle

Understanding Legal Use Cases for Web Archives
Understanding Legal Use Cases for Web ArchivesUnderstanding Legal Use Cases for Web Archives
Understanding Legal Use Cases for Web Archivesnullhandle
 
Interoperability and Technical Collaboration for Web and Social Media Archiving
Interoperability and Technical Collaboration for Web and Social Media ArchivingInteroperability and Technical Collaboration for Web and Social Media Archiving
Interoperability and Technical Collaboration for Web and Social Media Archivingnullhandle
 
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...nullhandle
 
2015 NDSA Web Archiving Survey Report Highlights
2015 NDSA Web Archiving Survey Report Highlights2015 NDSA Web Archiving Survey Report Highlights
2015 NDSA Web Archiving Survey Report Highlightsnullhandle
 
Collection Development for Selective Web Archiving
Collection Development for Selective Web ArchivingCollection Development for Selective Web Archiving
Collection Development for Selective Web Archivingnullhandle
 
Why Not Lots of Copies Keep(ing) Software Safe?
Why Not Lots of Copies Keep(ing) Software Safe?Why Not Lots of Copies Keep(ing) Software Safe?
Why Not Lots of Copies Keep(ing) Software Safe?nullhandle
 
WASAPI Web Archive Data Transfer APIs
WASAPI Web Archive Data Transfer APIsWASAPI Web Archive Data Transfer APIs
WASAPI Web Archive Data Transfer APIsnullhandle
 
Building Web Archiving Technology, Together
Building Web Archiving Technology, TogetherBuilding Web Archiving Technology, Together
Building Web Archiving Technology, Togethernullhandle
 
Outreach to Campus Webmasters for a Better Web, and Better Web Archiving
Outreach to Campus Webmasters for a Better Web, and Better Web ArchivingOutreach to Campus Webmasters for a Better Web, and Better Web Archiving
Outreach to Campus Webmasters for a Better Web, and Better Web Archivingnullhandle
 
Measure All the (Web Archiving) Things!
Measure All the (Web Archiving) Things!Measure All the (Web Archiving) Things!
Measure All the (Web Archiving) Things!nullhandle
 
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...nullhandle
 
Campaign Web Archives to Support Multi-Institutional Research
Campaign Web Archives to Support Multi-Institutional ResearchCampaign Web Archives to Support Multi-Institutional Research
Campaign Web Archives to Support Multi-Institutional Researchnullhandle
 
2013 NDSA Web Archiving Survey Report Highlights
2013 NDSA Web Archiving Survey Report Highlights2013 NDSA Web Archiving Survey Report Highlights
2013 NDSA Web Archiving Survey Report Highlightsnullhandle
 
Considerations for Strategic Web Archive Collection Development
Considerations for Strategic Web Archive Collection DevelopmentConsiderations for Strategic Web Archive Collection Development
Considerations for Strategic Web Archive Collection Developmentnullhandle
 
Advocating for Web Archivability
Advocating for Web ArchivabilityAdvocating for Web Archivability
Advocating for Web Archivabilitynullhandle
 
Building Archivable Websites
Building Archivable WebsitesBuilding Archivable Websites
Building Archivable Websitesnullhandle
 
Link Persistence, Website Persistence
Link Persistence, Website PersistenceLink Persistence, Website Persistence
Link Persistence, Website Persistencenullhandle
 
From Seed to Harvest: Web Archiving Program Considerations for SUL
From Seed to Harvest: Web Archiving Program Considerations for SULFrom Seed to Harvest: Web Archiving Program Considerations for SUL
From Seed to Harvest: Web Archiving Program Considerations for SULnullhandle
 
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...A Survey of Research Prospects for more Manageable Personal Digital Photo Col...
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...nullhandle
 
Tool Academy: Web Archiving
Tool Academy: Web ArchivingTool Academy: Web Archiving
Tool Academy: Web Archivingnullhandle
 

More from nullhandle (20)

Understanding Legal Use Cases for Web Archives
Understanding Legal Use Cases for Web ArchivesUnderstanding Legal Use Cases for Web Archives
Understanding Legal Use Cases for Web Archives
 
Interoperability and Technical Collaboration for Web and Social Media Archiving
Interoperability and Technical Collaboration for Web and Social Media ArchivingInteroperability and Technical Collaboration for Web and Social Media Archiving
Interoperability and Technical Collaboration for Web and Social Media Archiving
 
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...
Rethinking Web Archiving Quality Assurance for Impact, Scalability, and Susta...
 
2015 NDSA Web Archiving Survey Report Highlights
2015 NDSA Web Archiving Survey Report Highlights2015 NDSA Web Archiving Survey Report Highlights
2015 NDSA Web Archiving Survey Report Highlights
 
Collection Development for Selective Web Archiving
Collection Development for Selective Web ArchivingCollection Development for Selective Web Archiving
Collection Development for Selective Web Archiving
 
Why Not Lots of Copies Keep(ing) Software Safe?
Why Not Lots of Copies Keep(ing) Software Safe?Why Not Lots of Copies Keep(ing) Software Safe?
Why Not Lots of Copies Keep(ing) Software Safe?
 
WASAPI Web Archive Data Transfer APIs
WASAPI Web Archive Data Transfer APIsWASAPI Web Archive Data Transfer APIs
WASAPI Web Archive Data Transfer APIs
 
Building Web Archiving Technology, Together
Building Web Archiving Technology, TogetherBuilding Web Archiving Technology, Together
Building Web Archiving Technology, Together
 
Outreach to Campus Webmasters for a Better Web, and Better Web Archiving
Outreach to Campus Webmasters for a Better Web, and Better Web ArchivingOutreach to Campus Webmasters for a Better Web, and Better Web Archiving
Outreach to Campus Webmasters for a Better Web, and Better Web Archiving
 
Measure All the (Web Archiving) Things!
Measure All the (Web Archiving) Things!Measure All the (Web Archiving) Things!
Measure All the (Web Archiving) Things!
 
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...
A Snapshot of the U.S. Web Archiving Landscape through the 2013 NDSA Survey R...
 
Campaign Web Archives to Support Multi-Institutional Research
Campaign Web Archives to Support Multi-Institutional ResearchCampaign Web Archives to Support Multi-Institutional Research
Campaign Web Archives to Support Multi-Institutional Research
 
2013 NDSA Web Archiving Survey Report Highlights
2013 NDSA Web Archiving Survey Report Highlights2013 NDSA Web Archiving Survey Report Highlights
2013 NDSA Web Archiving Survey Report Highlights
 
Considerations for Strategic Web Archive Collection Development
Considerations for Strategic Web Archive Collection DevelopmentConsiderations for Strategic Web Archive Collection Development
Considerations for Strategic Web Archive Collection Development
 
Advocating for Web Archivability
Advocating for Web ArchivabilityAdvocating for Web Archivability
Advocating for Web Archivability
 
Building Archivable Websites
Building Archivable WebsitesBuilding Archivable Websites
Building Archivable Websites
 
Link Persistence, Website Persistence
Link Persistence, Website PersistenceLink Persistence, Website Persistence
Link Persistence, Website Persistence
 
From Seed to Harvest: Web Archiving Program Considerations for SUL
From Seed to Harvest: Web Archiving Program Considerations for SULFrom Seed to Harvest: Web Archiving Program Considerations for SUL
From Seed to Harvest: Web Archiving Program Considerations for SUL
 
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...A Survey of Research Prospects for more Manageable Personal Digital Photo Col...
A Survey of Research Prospects for more Manageable Personal Digital Photo Col...
 
Tool Academy: Web Archiving
Tool Academy: Web ArchivingTool Academy: Web Archiving
Tool Academy: Web Archiving
 

Recently uploaded

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 

Recently uploaded (20)

unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 

Web and Twitter Archiving at the Library of Congress

  • 1. Web and Twitter Archiving at the Library of Congress Web Archive Globalization Workshop June 16, 2011 Nicholas Taylor (@nullhandle) Web Archiving Team Library of Congress
  • 2. why archive the web? • preserve our nation’s history and culture • identify and preserve at- risk digital content • develop tools, models, and methods for digital preservation “World Wide Web 1997: 2 Terabytes in 63 Inches” 2 LIBRARY OF CONGRESS
  • 3. various collection strategies • entire web domain—Internet Archive • national domain—Sweden, Denmark, others • selective (individual URLs) and thematic— Australia • thematic or event-based—Library of Congress http://netpreserve.org/about/archiveList.php 3 LIBRARY OF CONGRESS
  • 4. web archiving at LC • began in 2000 with MINERVA pilot • identify policy issues, establish best practices, build tools (internally and w/ partners) • broaden expertise and understanding of Web Archiving within LC • collect, manage and sustain at-risk digital content LC Prints & Photographs: design for Minerva Congressional Library 4 LIBRARY OF CONGRESS
  • 5. Curators/Recommending Officers In Library Services, Congressional Research Service, and the Law Library pick the Web Archiving Team and Web collections and what URLs to archive, Preservation Engineering Team and research who to contact for permission. In the Office of Strategic Initiatives (OSI). We are project managers and technical staff focused on capture, tools, and permissions. Bibliographic Access MODS records are created in Library Services: the Network Development and Information Technology Office and MARC Standards Office (NetDev) and Technical Architecture Team Acquisitions and Bibliographic Access Also in OSI. Supports Wayback Machine, Heritrix, (ABA) staff do the cataloging. repository and tools development, and data transfers. Contractors are also used in this area. LIBRARY OF CONGRESS
  • 6. LC collections: over 245 TB + 5 TB/month – ongoing collections, including: • Congress/Legislative websites • Legal Blawgs • Public Policy Topics – event-based collections, including: • U.S. National Elections—2000, 2002, 2004, 2006, 2008, 2010 • Iraq War 2003-2009 • September 11 2001 and September 11 Remembrance 2002 • Civil War Sesquicentennial • Olympics 2002 • Supreme Court Nominations • Papal Transition • Case Studies: health care, terrorism, visual image content, organizational Web sites, Crisis in Darfur, “single site” – Overseas Operations collections, including: Egypt 2008; Brazilian, Indian, Indonesian, Philippine, and Thai Elections; Afghanistan Government; Pakistan Nationalisms http://www.loc.gov/webarchiving/collections.html 6 LIBRARY OF CONGRESS
  • 7. web archives access: loc.gov/lcwa 7 LIBRARY OF CONGRESS
  • 8. essential tools • capture: Heritrix (contract crawling w/ IA and in-house) • replay: Wayback • permissions/seed management; capture quality review; reporting; transfer tracking: custom apps built on LAMP stack • transfer: BagIt Library (based on BagIt spec); *nix ingest/staging/storage/access servers; Internet2 connection 8 LIBRARY OF CONGRESS
  • 9. other useful tools • web archiving workflow management: NetArchiveSuite, Web Curator Tool • small-scale web archiving: HTTrack • Firefox add-ons: Firebug, Web Developer 9 LIBRARY OF CONGRESS
  • 10. cataloging for access • collection-level metadata • site-level bibliographic metadata – nominators provide subject heading – HTML metadata extraction via cURL – cataloger assigns keywords • cataloging metadata stored in MODS • assisted keyword assignment: HIVE 10 LIBRARY OF CONGRESS
  • 11. collection-level record example LIBRARY OF CONGRESS
  • 12. MODS bibliographic record example 12 LIBRARY OF CONGRESS
  • 13. Wayback Machine Resource Page 13 LIBRARY OF CONGRESS
  • 14. example of an archived site 14 LIBRARY OF CONGRESS
  • 15. search and discovery • bibliographic metadata search • (not yet) Memento-enabled • full-text search based on NutchWAX unfeasible • Lucene/Solr looks promising 15 LIBRARY OF CONGRESS
  • 16. Challenges for WEB ARCHIVES 16 LIBRARY OF CONGRESS
  • 17. challenges for web archives • technical – large, deep, dynamic, interlinked – continuous transformation, simultaneously growing and disappearing • intellectual property laws and regulations – legal deposit laws, mandates for preservation, laws that do not address web content • economic environment – few good business models for sustaining web collections • social environment – who is responsible and how is responsibility shared? 17 LIBRARY OF CONGRESS
  • 18. capture, replay, and preservation • capturing websites – Heritrix – “form-fronted” databases (i.e., “deep web”) – URLs the crawler can’t see that we want – …and URLs the crawler can see that we don’t – web 2.0 and other “new” web technologies • replaying archived versions – Wayback – non-rigorous website coding – live site “leakage” – significant interactivity may be lost • preserving access to our archives – billions of files – thousands of file types – how do we ensure content is accessible in 10, 25, 50 or more years? LIBRARY OF CONGRESS
  • 19. scaling capacity • budgetary pressures • limited access server disk space – competing w/ other big data projects • new infrastructure for new capabilities photo by Henrik Bennetsen under CC BY-SA 2.0 19 LIBRARY OF CONGRESS
  • 20. when the only tool you have is a library… LC Prints & Photographs: exterior view of the LC Jefferson Building 20 LIBRARY OF CONGRESS
  • 21. …many things look like collections • archive behaves more like discrete records than web – archived sites not contiguously navigable – data doesn’t readily allow for downloading • Twitter archive may prompt re-thinking web archive data access 21 LIBRARY OF CONGRESS
  • 22. web archiving and U.S. copyright law • legal deposit requirement only applies to “published works” (§ 407) • § 108 of the Copyright Act provides library exceptions – doesn’t address digital preservation and web archiving photo by Gabriel de Urioste under CC BY 2.0 LIBRARY OF CONGRESS
  • 23. why not rely on robots.txt? • unreliable proxy for copyright permissions • archival crawler ≠ search crawler • LC disregards robots.txt but leaves contact info last.fm: robots.txt LIBRARY OF CONGRESS
  • 24. capture and access permissions • permissions-based approach access began in 2002 capture offsite • permission plans for each collection developed w/ government no notice no notice counsel • permission requirements depend on site type advocacy/ notice permission policy • more liberal about capture than about offsite access news permission permission LIBRARY OF CONGRESS
  • 25. implications of opt-in permissions • no response treated as denial – very few denials – many non-responsive • case study: September 11, 2001 – 2300 cataloged, 30000 uncataloged URLs – many news sites (“high risk” permissions category) – no permissions sought – very few takedown requests 25 LIBRARY OF CONGRESS
  • 26. the future of permissions • risk of more liberal approach appears low • hope to move to more notice-based, opt-out policy • may affect previously-captured sites as well photo by RJ Sangosti, Denver Post under © (fair use) 26 LIBRARY OF CONGRESS
  • 27. Challenges for the TWITTER ARCHIVE 27 LIBRARY OF CONGRESS
  • 28. why archive Twitter? • historical record of communication, news reporting, and social trends • complements collections and mission http://twitter.com/#!/klerner/status/64895357355704320 LIBRARY OF CONGRESS
  • 29. Twitter Archive FAQs • currently receiving Tweets through Gnip • includes only the public archive – deletions will propagate to archive • access limitations – 6 month embargo on new Tweets – no bulk distribution • downstream users – no commercial use – no substantial re-distribution LIBRARY OF CONGRESS
  • 30. a (literally) growing challenge • ~3 years: time it took from 1st Tweet to billionth • 1 week: time it now takes users to send a billion Tweets • average Tweets/day in 3/10: 50 million • average Tweets/day in 3/11: 149 million http://blog.twitter.com/2011/03/numbers.html 30 LIBRARY OF CONGRESS
  • 31. questions to consider • how does archive fit in w/ existing collections? • how are agreement guidelines interpreted and implemented technically? • what kind(s) of access can we provide? • what context do we provide for content? LIBRARY OF CONGRESS
  • 32. additional goals • justify value to Congress and public • understand and respond to researcher needs • push the institution beyond existing curatorial models LIBRARY OF CONGRESS
  • 33. for more information • Library of Congress Web Archiving Program: http://www.loc.gov/webarchiving/ • Library of Congress Web Archives: http://loc.gov/lcwa/ • National Digital Information Infrastructure and Preservation Program: http://www.digitalpreservation.gov/ • Library of Congress – Twitter FAQ: http://blogs.loc.gov/loc/2010/04/the-library-and-twitter • Section 108 Study Group: http://www.section108.gov/ 33 LIBRARY OF CONGRESS
  • 34. for more information • “Legal Issues in Building Social Media Collections:” http://www.arl.org/bm~doc/mm11sp-okeeffe.pdf • “How the Library of Congress is building the Twitter archive:” http://radar.oreilly.com/2011/06/library-of-congress-tw • “Web Archives: The Future(s):” http://papers.ssrn.com/sol3/papers.cfm?abstract_id=183 34 LIBRARY OF CONGRESS

Editor's Notes

  1. Preserve cultural heritage. Online content is ephemeral. Improve digital preservation tools and methods. Average lifespan of website is 44-75 days See http://jiscpowr.jiscinvolve.org/wp/2009/08/12/whats-the-average-lifespan-of-a-web-page/.
  2. Web archiving is conducted at different scales. Why is anything more than IA needed? IA’s collecting is broad but shallow.
  3. MINERVA pilot started w/ limited staffing and 1 curator. Goal of developing web archiving expertise. WAT formed in 2004; now 6 FTE, 2 contractors, and 80+ curators.
  4. Web archiving staff are technical experts. Curators nominate collections and select URLs. They’re familiarized w/ crawl technology and terminology. NetDev and ABA perform cataloging to facilitate access. ITS supports server environments, repository and tools development.
  5. Web is too big for us not to be selective.
  6. Collections publicly available at http://www.loc.gov/lcwa. Metadata available for all records, but some Web sites restricted to onsite access only (based on permissions obtained).
  7. LC uses custom web archiving workflow management apps b/c of permissions.
  8. This site is restricted to on-site access only.
  9. Links to dated snapshots of site in the archive.
  10. Branded Wayback banner lets visitor know they’re in the archive.
  11. Wayback 1.6 deployment on hold pending resolution of some configuration issues. Under NutchWAX, indexing was too resource-intensive. Internet Archive and British Library are exploring Lucene/Solr.
  12. Web archiving is in state of flux. Lot of content to preserve and it changes all the time. Site you set out to archive is different from one you get. Long time frame and diffuse stakeholder make economics difficult. Producers of web content best positioned but least incented to preserve.
  13. When educating nominators, we analogize to traditional materials acquisition.
  14. Legal deposit is mechanism by which LC has built collections. Ambiguity about meaning of “published” in web context. Lack of clear legal mandate prompted exploration of alternatives. Section 108 Study Group recommendation to allow libraries to collect “publicly disseminated online content,” but recommendations not (yet) adopted legislatively.
  15. Robots.txt is voluntary standard variously used to communicate copyright permissions and/or site load management. Case law suggest that robots.txt not legally binding but may have some legal relevance. Directory exclusions tailored to search bot may yield poor web archive captures.
  16. Nominators identify and solicit permissions from site owners.
  17. Average Tweet size: 2 Kb.
  18. “ Fit” – both in terms of “integrate” and determining what extant curatorial approaches are/aren’t relevant. Agreement guidelines are general; development work intersects with evolving policy discussions. Access determined both by policy and technical capabilities.
  19. Need to countermand notions that Twitter is trivial (e.g., the clichéd critique that it’s just people talking about what they ate for lunch) and, by extension, our acquisition isn’t worthwhile. Need to allow researchers to inform decision-making.