Archiving Web News
     A Case Study




                    ©2012 Sven Aas, Mount Holyoke College
Archiving Web News
Sven Aas
Web Team Lead
Mount Holyoke College

saas@mtholyoke.edu
sven.aas@gmail.com
@svenaas

#archwn


The Early Years (’98–01)
            Many years ago Mount Holyoke College published news stories as static web pages.








                                         (Wayback Machine)

The Portal (’01–08)
          Then a custom portal system was built with the capability to publish news stories.




                                        (Wayback Machine)

The CMS (’06–12)
          Later a CMS was purchased, but not for News, so the portal was still employed.




                                      (Wayback Machine)

The News System (’08–)
          Then I built a custom system to house the portal news content and we've been running happily ever since.




Enter the Archivist
          Archives received an NHPRC grant to pilot electronic records archiving projects including news.


          LITS                    (Library, Information, and Technology Services)



          DAPS                    (Digital Assets and Preservation Services)



          ASC                     (Archives & Special Collections)



          NHPRC                   (National Historical Publications and Records Commission)




What We Had



Numbers
            There are two kinds of news items: stories and links. I’ll get to those in a minute.



          1998–2010:

           • 34 News Channels
           • 6,111 News Items
            • 916 Links
            • 5,195 Stories
           • 5,260 News Images
News Item
          Every news item has

           • Title
           • Thumbnail
           • Summary
          Some have

           • Long Title
Stories and Links

          Stories have

           • Body
           • Sidebar Image
          Links have
           • URL

Metadata
          • Creation time
          • Modification time
          • Creator
          • Channel assignments
          • Display start/end dates
          • URLs where stories were available
Goals

          • Preserve each news item.
          • Maintain characters and markup in
            text as they appeared.

          • Preserve thumbnail and sidebar
            images as they were presented.



Out of Scope



          • Original (or current) presentation
          • Higher resolution source images




Challenges



Consolidation


          Challenge: Not all items were in the
          Newswire system.
          Approach: Staff and student
          programmers worked to import them.




September 12, 2001
      These strays were probably from a bulk operation or import into the portal system which took place on that date.




              Challenge: A large number of news
              items shared a single “creation” date.
              Approach: Students determined real
              dates where possible. An annotation
              was provided in the export warning
              about approximated dates.


Organization
              Should we export each channel and let duplication of items take place?




          Challenge: News items may be
          crosslisted in various channels.
          Approach: Consider channels as
          secondary to items. Export each item
          once, and group by year.



Preservation and Access
           Where do we put this stuff? We had multiple digital asset management systems in place.




          Challenge: Preserve news items and
          make them available in an appropriate
          digital asset management system.
          Approach: Use our existing dSPACE
          server.



Data


          Challenge: Save news items in an
          archivally appropriate and useful
          format.
          Approach: Render each item as a PDF.




Metadata


          Challenge: Make key metadata
          available to the digital asset
          management system.
          Approach: Export a Dublin Core
          metadata XML file for each news item.
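A minimal sketch of what one of those per-item metadata files might look like. It assumes the `dublin_core.xml` layout of the Simple Archive Format used for bulk import (a `<dublin_core>` root holding `<dcvalue>` elements). The field mapping, the helper names `escape_text`, `dcvalue` and `dublin_core_xml`, and the sample item are all invented for illustration; this is not the exporter from the talk.

```ruby
# Hypothetical sketch: build dublin_core.xml for one news item.
# Field mapping, helper names, and the sample item are assumptions.

# In element content only & and < must be escaped; > is escaped too for safety.
def escape_text(value)
  value.to_s.gsub("&", "&amp;").gsub("<", "&lt;").gsub(">", "&gt;")
end

# One <dcvalue> line for a single metadata field.
def dcvalue(element, qualifier, value)
  format('  <dcvalue element="%s" qualifier="%s">%s</dcvalue>',
         element, qualifier, escape_text(value))
end

def dublin_core_xml(item)
  lines = ['<?xml version="1.0" encoding="UTF-8"?>', "<dublin_core>"]
  lines << dcvalue("title", "none", item[:title])
  lines << dcvalue("description", "abstract", item[:summary])
  lines << dcvalue("date", "created", item[:created_at])
  lines << dcvalue("contributor", "author", item[:creator])
  # Channels are referenced by name rather than ID.
  item[:channels].each { |name| lines << dcvalue("subject", "none", name) }
  lines << "</dublin_core>"
  lines.join("\n") + "\n"
end

# Invented sample data, for illustration only.
sample = { title: "Commencement 2001", summary: "Faculty & students gather.",
           created_at: "2001-05-27T00:00:00-04:00", creator: "News Office",
           channels: ["Campus News"] }
puts dublin_core_xml(sample)
```

Building the file line by line keeps the output predictable, which matters when a bulk importer is picky about what it reads.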



Import
      Since I was exporting from an app I controlled I was able to favor choices which made things easier for dSPACE.




              Challenge: Facilitate bulk rather than
              individual import into dSPACE.
              Approach: Cater to dSPACE’s
              expectations in export folder and file
              names and structure.



Output structure



Items


          For each item three files are generated:

             • contents
             • dublin_core.xml
             • [item_id].pdf
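The three files above could be written out along these lines. This is a sketch only: `write_item` is a made-up helper, the PDF bytes and XML string are placeholders, and it assumes the `contents` file simply lists the item's PDF by name, one file per line.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical helper: write the three files for one exported item
# and return the sorted list of filenames in the item folder.
def write_item(dir, item_id, dublin_core_xml, pdf_bytes)
  FileUtils.mkdir_p(dir)
  pdf_name = "#{item_id}.pdf"
  File.binwrite(File.join(dir, pdf_name), pdf_bytes)
  File.write(File.join(dir, "dublin_core.xml"), dublin_core_xml)
  # Assumption: contents names each file belonging to the item, one per line.
  File.write(File.join(dir, "contents"), "#{pdf_name}\n")
  Dir.children(dir).sort
end

# Try it in a throwaway directory with placeholder file bodies.
Dir.mktmpdir do |root|
  files = write_item(File.join(root, "item_000"), 5682676,
                     "<dublin_core></dublin_core>\n", "%PDF placeholder")
  puts files.inspect
end
```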


Sample contents

          5682676.pdf




Sample XML




Sample PDF




Folders
          Exported items are ordered chronologically and grouped by year. dSPACE looks for these numbered folders.




                Each set of item files is in its own folder:

                        • item_000/
                        • item_001/
                        • item_002/
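Here is one way the folder naming could be computed, as a sketch. It assumes each item carries an `:id` and a `:created` time, and that each year gets its own parent folder named after the year. That parent folder name is my guess, not something the export is known to do. `export_paths` and the sample items are invented. Numbering restarts at `item_000` for each year.

```ruby
# Hypothetical sketch: map each item id to "<year>/item_NNN".
# Items are ordered chronologically and numbering restarts every year.
def export_paths(items)
  items.group_by { |item| item[:created].year }
       .sort_by { |year, _| year }
       .flat_map do |year, year_items|
         year_items.sort_by { |item| item[:created] }
                   .each_with_index
                   .map { |item, index| [item[:id], format("%d/item_%03d", year, index)] }
       end.to_h
end

# Invented sample items, deliberately out of order.
items = [
  { id: 30, created: Time.new(1999, 2, 1) },
  { id: 10, created: Time.new(1998, 3, 1) },
  { id: 20, created: Time.new(1998, 9, 15) }
]
export_paths(items).each { |id, path| puts "#{id} -> #{path}" }
```

Because every item gets exactly one path, an item cross-listed in several channels is still exported only once.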



Details
          PDF Export:

           • pdfkit and wkhtmltopdf
          Date/Time format:

           • 1998-03-01T00:00:00-05:00
          Channel references:

           • By name rather than ID.
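The date/time format above is ISO 8601 with a numeric UTC offset. A small sketch of producing it with Ruby's standard library follows; `export_timestamp` is a made-up name, and the sample time is the one shown on the slide.

```ruby
require "time"

# Hypothetical helper: format a Time the way the export's date fields look.
def export_timestamp(time)
  time.iso8601
end

created = Time.new(1998, 3, 1, 0, 0, 0, "-05:00")
puts export_timestamp(created)
# The same layout spelled out with strftime:
puts created.strftime("%Y-%m-%dT%H:%M:%S%:z")
```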
Problems



<i>citations<i>
          HTML tags caused problems when they showed up in the XML metadata. So we had to strip tags.




           Yeah, that’s right.
           When titles in our Dublin Core XML
           contained tags they were XML-escaped
           on import.
           Solution: Strip tags from titles.
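Tag stripping can be sketched with a single regular expression. This assumes the titles only contain simple inline tags like `<i>`; it is not a general HTML sanitizer, and `strip_markup` and the sample title are invented for illustration.

```ruby
# Hypothetical sketch: drop anything that looks like an HTML tag from a title.
# Assumes simple inline markup; this is not a full HTML parser.
def strip_markup(text)
  text.gsub(/<[^>]*>/, "")
end

puts strip_markup("Professor publishes <i>New Book</i>")
```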



&#160;
          But tag stripping turned non-breaking spaces into problems, so we had to modify the effects of tag stripping.




                Uh-huh.
                Some summary fields contained only a
                &nbsp;. Tag stripping turned this into a
                &#160;, which showed up as ? in
                dSPACE.
                Solution: Patch Builder library to avoid
                exporting XML entities.


&
          But we overdid it. We NEEDED to encode ampersands and apostrophes, and now we had to do it manually.




               We overdid it. Ampersands and
               apostrophes weren’t getting encoded in
               the XML, and they needed to be.
               We had to manually encode these.




'
           Of course, that wasn’t simple either. You have to figure out what your tools are assuming.




          Complication:

               • XML has five core entities: &amp; &lt; &gt; &apos; &quot;
               • HTML has 252
               • XHTML has 253
          Guess which one’s in XML but not in
          HTML ...
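Encoding the five XML entities by hand can be sketched like this. `XML_ENTITIES` and `xml_escape` are made-up names and the sample string is invented. The thing to notice is that the apostrophe becomes `&apos;`, the one named entity XML defines that HTML 4 does not, so HTML-minded tools may not produce or understand it.

```ruby
# Hypothetical sketch: manually encode XML's five predefined entities.
XML_ENTITIES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  "'" => "&apos;",
  '"' => "&quot;"
}.freeze

# A single gsub pass with a hash, so an already-produced "&amp;" is
# never escaped a second time.
def xml_escape(text)
  text.gsub(/[&<>'"]/, XML_ENTITIES)
end

puts xml_escape("Alumnae & friends: it's <here>")
```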

Export and Transfer



On December 19, 2011

          • All content was downloaded to a
            workstation and the export was run
            there.

          • The files were MD5 hashed and
            electronically transferred to DAPS
            and Archives.
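The hashing step could look something like this sketch, which walks an export folder and records an MD5 checksum per file so the receiving side can verify the transfer. `md5_manifest` is a made-up helper and the demo file is a stand-in; the actual tooling used for the transfer is not described here.

```ruby
require "digest/md5"
require "tmpdir"

# Hypothetical sketch: map each file's path (relative to root) to its MD5 hex digest.
def md5_manifest(root)
  Dir.glob(File.join(root, "**", "*"))
     .select { |path| File.file?(path) }
     .sort
     .map { |path| [path.delete_prefix(root + "/"), Digest::MD5.file(path).hexdigest] }
     .to_h
end

# Demo on a throwaway directory holding one small stand-in file.
Dir.mktmpdir do |root|
  File.write(File.join(root, "contents"), "hello")
  md5_manifest(root).each { |name, sum| puts "#{sum}  #{name}" }
end
```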


And all was well.



But then I received an email from Erin Fahy in DAPS.




          But then ...



##

          • OK. So it turns out some old
            summaries contained ##some unusual
            markup##.
          • At one point this markup was used to
            locate a hyperlink, but for years it had
            been stripped on display.

          • So we decided to strip it too.
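A sketch of stripping that markup, under one big assumption: that "stripping" means removing the `##` delimiters while keeping the text between them. `strip_hash_markup` and the sample summary are invented for illustration.

```ruby
# Hypothetical sketch: remove paired ## delimiters, keeping the text between them.
# Assumption: the wrapped text itself should survive in the export.
def strip_hash_markup(summary)
  summary.gsub(/##(.*?)##/m, '\1')
end

puts strip_hash_markup("Read ##the full story## online.")
```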

But you know it’s
          never that simple ...


'

          Remember '?
          Well, something changed and PDF
          headers which contained an ' got
          screwed up.
          This wasn’t a problem back in
          December! What happened?



Approaches were tried.



                .gsub(/'/, "'")




Approaches were tried.
          "'"
          "'"
          "'"
          '''
          ''
          etc.

                    They didn’t work. None of them worked.


What worked?



                  "'''"
          Are you kidding me?

              It’s a long story. You can ask me later.


But it worked.
          And there was much rejoicing.




                    Monty Python and the Holy Grail
                    ©Python (Monty) Pictures Ltd.
Related Topics



Drxfer



          Digital Records Transfer System
          https://github.com/svenaas/drxfer


            Drxfer is a web application created to allow record owners to transfer their born-digital
           records into a dropbox filesystem in accordance with pre-existing transfer agreements.




I Can Convert!
          #TPR2 at HighEdWeb 2012
          9:30am, Monday, Oct. 8 in Milwaukee
          Jason Proctor and Sven Aas
          http://is.gd/TPR2atHEWEB12


          Jason and I will be talking about a series of related conversions and the tools and skills we
          employed to tackle them in our presentation at HighEdWeb 2012. We hope you can attend!




Questions?



Thank you!
          Please stay in touch!

          I’m at:

          saas@mtholyoke.edu
          sven.aas@gmail.com
          @svenaas






Editor's Notes

  28. Exported items are ordered chronologically and grouped by year. Each item is exported in a numbered folder. Remember that we grouped things by year. Note that the item numbering starts over for each year; this is done so that dSPACE can import the items separately.