Archiving Web News
     A Case Study




                    ©2012 Sven Aas, Mount Holyoke College
Archiving Web News
Sven Aas
Web Team Lead
Mount Holyoke College

saas@mtholyoke.edu
sven.aas@gmail.com
@svenaas

#archwn


The Early Years (’98–’01)
            Many years ago Mount Holyoke College published news stories as static web pages.








                                         (Wayback Machine)

The Portal (’01–’08)
          Then a custom portal system was built with the capability to publish news stories.




                                        (Wayback Machine)

The CMS (’06–’12)
          Later a CMS was purchased, but not for News, so the portal was still employed.




                                      (Wayback Machine)

The News System (’08–)
          Then I built a custom system to house the portal news content and we've been running happily ever since.




Enter the Archivist
          Archives received an NHPRC grant to pilot electronic records archiving projects, including news.


          LITS                    (Library, Information, and Technology Services)



          DAPS                    (Digital Assets and Preservation Services)



          ASC                     (Archives & Special Collections)



          NHPRC                   (National Historical Publications and Records Commission)




What We Had



Numbers
            There are two kinds of news items: stories and links. I’ll get to those in a minute.



          1998–2010:

           • 34 News Channels
           • 6,111 News Items
            • 916 Links
            • 5,195 Stories
           • 5,260 News Images
News Item
          Every news item has

           • Title
           • Thumbnail
           • Summary
          Some have

           • Long Title
Stories and Links

          Stories have

           • Body
           • Sidebar Image
          Links have
           • URL

Metadata
          • Creation time
          • Modification time
          • Creator
          • Channel assignments
          • Display start/end dates
          • URLs where stories were available
Goals

          • Preserve each news item.
          • Maintain characters and markup in
            text as they appeared.

          • Preserve thumbnail and sidebar
            images as they were presented.



Out of Scope



          • Original (or current) presentation
          • Higher resolution source images




Challenges



Consolidation


          Challenge: Not all items were in the
          Newswire system.
          Approach: Staff and student
          programmers worked to import them.




September 12, 2001
      These strays were probably from a bulk operation or import into the portal system which took place on that date.




              Challenge: A large number of news
              items shared a single “creation” date.
              Approach: Students determined real
              dates where possible. An annotation
              was provided in the export warning
              about approximated dates.


Organization
              Should we export each channel and let duplication of items take place?




          Challenge: News items may be
          crosslisted in various channels.
          Approach: Consider channels as
          secondary to items. Export each item
          once, and group by year.
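
A minimal sketch of the "export each item once, and group by year" approach. The `NewsItem` struct, its fields, and the sample titles are made up for illustration; the real News System schema is not shown in these slides.

```ruby
# Hypothetical item record; the field names are assumptions.
NewsItem = Struct.new(:id, :title, :created_at, :channels)

# Treat channels as secondary: keep one copy of each item (by id),
# sort chronologically, then group by creation year.
def group_by_year(items)
  items
    .uniq(&:id)
    .sort_by(&:created_at)
    .group_by { |item| item.created_at.year }
end

items = [
  NewsItem.new(3, "Commencement", Time.utc(1999, 5, 23), ["Campus", "Events"]),
  NewsItem.new(1, "New Science Center", Time.utc(1998, 3, 1), ["Campus"]),
  NewsItem.new(2, "Fall Lecture", Time.utc(1998, 9, 15), ["Events"]),
  # The same item reached through a second channel listing.
  NewsItem.new(3, "Commencement", Time.utc(1999, 5, 23), ["Campus", "Events"])
]

group_by_year(items).each do |year, year_items|
  puts "#{year}: #{year_items.map(&:id).join(', ')}"
end
```

Channel names stay on each item as metadata, so nothing is lost by not exporting per channel.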



Preservation and Access
           Where do we put this stuff? We had multiple digital asset management systems in place.




          Challenge: Preserve news items and
          make them available in an appropriate
          digital asset management system.
          Approach: Use our existing dSPACE
          server.



Data


          Challenge: Save news items in an
          archivally appropriate and useful
          format.
          Approach: Render each item as a PDF.




Metadata


          Challenge: Make key metadata
          available to the digital asset
          management system.
          Approach: Export Dublin Core
          metadata XML file for each news item.
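
A rough sketch of what generating a Dublin Core file per item could look like. It uses plain string building rather than the Builder library the project used, and the element/qualifier choices and sample values are illustrative assumptions, not the project's actual mapping.

```ruby
# Build a dublin_core.xml document for one news item.
# `fields` is a list of [element, qualifier, value] triples (hypothetical mapping).
def dublin_core_xml(fields)
  lines = fields.map do |element, qualifier, value|
    escaped = value.to_s.encode(xml: :text) # escapes &, < and >
    %(  <dcvalue element="#{element}" qualifier="#{qualifier}">#{escaped}</dcvalue>)
  end
  header = ['<?xml version="1.0" encoding="UTF-8"?>', "<dublin_core>"]
  (header + lines + ["</dublin_core>"]).join("\n") + "\n"
end

puts dublin_core_xml([
  ["title", "none", "Science & Society Lecture"],
  ["date", "issued", "1998-03-01T00:00:00-05:00"],
  ["contributor", "author", "News Office"]
])
```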



Import
      Since I was exporting from an app I controlled I was able to favor choices which made things easier for dSPACE.




              Challenge: Facilitate bulk rather than
              individual import into dSPACE.
              Approach: Cater to dSPACE’s
              expectations in export folder and file
              names and structure.



Output structure



Items


          For each item three files are generated:

             • contents
             • dublin_core.xml
             • [item_id].pdf


Sample contents

          5682676.pdf




Sample XML




Sample PDF




Folders
          Exported items are ordered chronologically and grouped by year. dSPACE looks for these numbered folders.




                Each set of item files is in its own folder:

                        • item_000/
                        • item_001/
                        • item_002/
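
The folder layout can be sketched roughly as follows. The per-year parent folder, the hash keys, and the placeholder file bodies are assumptions for illustration; only the three file names and the `item_NNN` folder pattern come from the slides.

```ruby
require "fileutils"
require "tmpdir"

# Write one year's items into numbered folders (item_000, item_001, ...).
# Numbering restarts for every year because each call starts at index 0.
# `items` is a list of hashes with hypothetical keys :id, :xml and :pdf.
def write_year(export_root, year, items)
  year_dir = File.join(export_root, year.to_s)
  items.each_with_index.map do |item, index|
    item_dir = File.join(year_dir, format("item_%03d", index))
    FileUtils.mkdir_p(item_dir)
    pdf_name = "#{item[:id]}.pdf"
    File.binwrite(File.join(item_dir, pdf_name), item[:pdf])
    File.write(File.join(item_dir, "dublin_core.xml"), item[:xml])
    # The contents file lists the files that belong to the item.
    File.write(File.join(item_dir, "contents"), "#{pdf_name}\n")
    item_dir
  end
end

Dir.mktmpdir do |root|
  dirs = write_year(root, 1998, [
    { id: 5682676, xml: "<dublin_core/>\n", pdf: "placeholder PDF bytes" },
    { id: 5682677, xml: "<dublin_core/>\n", pdf: "placeholder PDF bytes" }
  ])
  dirs.each { |dir| puts "#{File.basename(dir)}: #{Dir.children(dir).sort.join(', ')}" }
end
```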



Details
          PDF Export:

           • pdfkit and wkhtmltopdf
          Date/Time format:

           • 1998-03-01T00:00:00-05:00
          Channel references:

           • By name rather than ID.
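
For the date/time format, a small sketch using Ruby's standard `time` library. The `export_timestamp` helper name is hypothetical, and passing the offset explicitly is an assumption about how the source timestamps were stored.

```ruby
require "time"

# Format a timestamp as ISO 8601 with an explicit UTC offset.
def export_timestamp(time)
  time.iso8601
end

created_at = Time.new(1998, 3, 1, 0, 0, 0, "-05:00")
puts export_timestamp(created_at)
```

Keeping the offset in the string means the value can be parsed back to the same instant with `Time.iso8601`.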
Problems



&lt;i&gt;citations&lt;/i&gt;
          HTML tags caused problems when they showed up in the XML metadata. So we had to strip tags.




           Yeah, that’s right.
           When titles in our Dublin Core XML
           contained tags they were XML-escaped
           on import.
           Solution: Strip tags from titles.
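
One way tag stripping might look, as a sketch only. The slides do not say which method was used; a regular expression is a deliberate simplification and not a general HTML parser. The `strip_tags` name and the sample title are made up.

```ruby
# Remove anything that looks like an HTML tag from a title.
# Requiring a letter after "<" leaves text such as "1 < 2" alone.
def strip_tags(text)
  text.gsub(/<\/?[a-zA-Z][^>]*>/, "")
end

puts strip_tags("Professor publishes <i>Collected Essays</i>")
```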



&#160;
          But tag stripping turned non-breaking spaces into problems, so we had to modify the effects of tag stripping.




                Uh-huh.
                Some summary fields contained only a
                &nbsp;. Tag stripping turned this into a
                &#160;, which showed up as ? in
                dSPACE.
                Solution: Patch Builder library to avoid
                exporting XML entities.


&
          But we overdid it. We NEEDED to encode ampersands and apostrophes, and now we had to do it manually.




               We overdid it. Ampersands and
               apostrophes weren’t getting encoded in
               the XML, and they needed to be.
               We had to encode these manually.
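
Manual encoding of just these two characters could be sketched like this. The constant and method names are hypothetical, and this is not the project's actual patch.

```ruby
# Encode ampersands and apostrophes by hand. A single gsub pass with a hash
# avoids re-escaping the "&" of an entity that was just produced.
MANUAL_ENTITIES = { "&" => "&amp;", "'" => "&apos;" }.freeze

def encode_amp_and_apos(text)
  text.gsub(/[&']/, MANUAL_ENTITIES)
end

puts encode_amp_and_apos("Arts & Sciences' new dean")
```

Note that input which already contains entities (for example `&amp;`) would be encoded a second time, so this only suits text known to be unescaped.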




'
           Of course, that wasn’t simple either. You have to figure out what your tools are assuming.




          Complication:

               • XML has five predefined entities: &amp; &lt; &gt; &apos; &quot;
               • HTML has 252
               • XHTML has 253
          Guess which one’s in XML but not in
          HTML ...

Export and Transfer



On December 19, 2011

          • All content was downloaded to a
            workstation and the export was run
            there.

          • The files were MD5 hashed and
            electronically transferred to DAPS
            and Archives.
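
The hashing step might look something like the sketch below, which builds an MD5 manifest for an export folder. The `md5_manifest` name and the manifest shape are assumptions; the slides only say the files were MD5 hashed.

```ruby
require "digest"
require "tmpdir"

# Map each file under `root` (by relative path) to its MD5 hex digest.
def md5_manifest(root)
  Dir.glob("**/*", base: root).sort.each_with_object({}) do |relative, manifest|
    path = File.join(root, relative)
    manifest[relative] = Digest::MD5.file(path).hexdigest if File.file?(path)
  end
end

Dir.mktmpdir do |root|
  File.write(File.join(root, "contents"), "5682676.pdf\n")
  md5_manifest(root).each { |relative, digest| puts "#{digest}  #{relative}" }
end
```

The recipient can run the same function over the received files and compare the two manifests to confirm the transfer was intact.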


And all was well.



But then I received an email from Erin Fahy in DAPS.




          But then ...



##

          • OK. So it turns out some old
            summaries contained ##some unusual
            markup##.
          • At one point this markup was used to
            locate a hyperlink, but for years it had
            been stripped on display.

          • So we decided to strip it too.
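
Stripping the markers could be as small as this sketch. It assumes only the `##` markers were removed and the enclosed text was kept; the method name and sample summary are made up.

```ruby
# Remove the legacy "##" markers while keeping the text they enclosed.
def strip_hash_markers(summary)
  summary.gsub("##", "")
end

puts strip_hash_markers("Read ##the full announcement## online.")
```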

But you know it’s
          never that simple ...


'

          Remember '?
          Well, something changed and PDF
          headers which contained an ' got
          screwed up.
          This wasn’t a problem back in
          December! What happened?



Approaches were tried.



                .gsub(/'/, "'")




Approaches were tried.
          "'"
          "'"
          "'"
          '''
          ''
          etc.

                    They didn’t work. None of them worked.


What worked?



                  "'''"
          Are you kidding me?

              It’s a long story. You can ask me later.


But it worked.
          And there was much rejoicing.




                    Monty Python and the Holy Grail
                    ©Python (Monty) Pictures Ltd.
Related Topics



Drxfer



          Digital Records Transfer System
          https://github.com/svenaas/drxfer


            Drxfer is a web application created to allow record owners to transfer their born-digital
           records into a dropbox filesystem in accordance with pre-existing transfer agreements.




I Can Convert!
          #TPR2 at HighEdWeb 2012
          9:30am, Monday, Oct. 8 in Milwaukee
          Jason Proctor and Sven Aas
          http://is.gd/TPR2atHEWEB12


          Jason and I will be talking about a series of related conversions and the tools and skills we
          employed to tackle them in our presentation at HighEdWeb 2012. We hope you can attend!




Questions?



Thank you!
          Please stay in touch!

          I’m at:

          saas@mtholyoke.edu
          sven.aas@gmail.com
          @svenaas







Editor's Notes

  28. Exported items are ordered chronologically and grouped by year. Each item is exported in a numbered folder. Remember that we grouped things by year. Note that the item numbering starts over for each year; this is done so that dSPACE can import the items separately.