20100401 정영임 da 전략 tft_0330

Considerations for data preservation across the information life cycle

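  1. 1. Considerations for Data Preservation across the Information Life Cycle<br />- Internal Seminar of the National Digital Archiving Strategy Research Task Force -<br />2010. 4. 1.<br />정영임<br />Korea Institute of Science and Technology Information (KISTI), Information Distribution Division, Knowledge Base Office<br />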
  2. 2. - 2 -<br />Table of Contents<br />Digital Archiving in the Framework of Information Life Cycle Management<br />Creation<br />Acquisition<br />Cataloging/Identification<br />Storage<br />Preservation<br />Access<br />
  3. 3. Digital Archiving in the Framework of Information Life Cycle Management<br />Digital archiving framework<br />Considered at all stages of the information life cycle management<br />Information life cycle<br />Creation<br />Acquisition<br />Cataloging/Identification<br />Storage<br />Preservation<br />Access<br />- 3 -<br />
  4. 4. Creation<br />Creation <br />Defined as the act of producing the information product in the broadest sense<br />Should be regarded as the starting point of long-term archiving and preservation<br />Proposal to provide a preservation indicator for creators <br />U.S. Department of Agriculture’s Digital Publications Preservation Steering Committee<br />Establishment of guidelines for creators <br />Oak Ridge National Laboratory, USA <br />A Guide To Record Series Supporting Epidemiological Studies Conducted for the Department of Energy<br />Limits on software<br />Format and layout of the documents<br />- 4 -<br />
  5. 5. Creation<br />Adoption of Standard Descriptive Languages<br />Standards groups incorporate XML and RDF architectures <br />Attachment of Metadata to Digital Content<br />- 5 -<br />
  6. 6. Acquisition and Collection Development<br />Three main aspects to acquisition of digital objects <br />Collection policies<br />Gathering methods<br />Intellectual Property Concerns<br />- 6 -<br />
  7. 7. Establishment of Collection Policies<br />Collection policies<br />Selecting What to Archive<br />Purpose<br />For Dark Archiving: Back issues<br />For Light Archiving: Current issues<br />Criteria <br />Ease of Content Acquisition<br />Quality of Content <br />Utilization<br />Ongoing access fee<br />Content Type Coverage: E-journals/R&D Reports/Patents/Scientific Data<br />Determining Extent<br />Archiving Links<br />Refreshing the Archived Content<br />- 7 -<br />
  8. 8. Considerations on Gathering Methods<br />Gathering methods<br />Hand selection<br />Value Judgment and Retention Scheduling (Edinburgh University Library)<br />Not preserved <br />Preserved for a defined period <br />Preserved indefinitely <br />Automatic selection<br />National Library of Sweden: Automatic acquisition without making value judgments (priority: periodicals, static documents, HTML pages >> conferences, usenet groups, ftp archives)<br />EVA project: Time limits established to avoid overloading<br />- 8 -<br />
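The three retention outcomes listed for hand selection can be sketched as a small data model. This is only an illustration: the class names, the `retain_years` field, and the review rule are assumptions made for the example and are not taken from the Edinburgh University Library report.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Retention(Enum):
    """The three retention outcomes named on the slide."""
    NOT_PRESERVED = "not preserved"
    DEFINED_PERIOD = "preserved for a defined period"
    INDEFINITE = "preserved indefinitely"


@dataclass
class ScheduledItem:
    title: str
    retention: Retention
    acquired: date
    retain_years: Optional[int] = None  # hypothetical field, used only for DEFINED_PERIOD


def is_due_for_review(item: ScheduledItem, today: date) -> bool:
    """Return True when the item may be reviewed for disposal (assumed rule)."""
    if item.retention is Retention.NOT_PRESERVED:
        return True
    if item.retention is Retention.INDEFINITE:
        return False
    if item.retain_years is None:
        raise ValueError("a defined-period item needs retain_years")
    # Compare (year, month, day) tuples so a 29 February acquisition date
    # never has to be turned into a possibly invalid calendar date.
    expiry = (item.acquired.year + item.retain_years,
              item.acquired.month, item.acquired.day)
    return (today.year, today.month, today.day) >= expiry
```

For instance, an item acquired on 2010-04-01 with a five-year period would first be flagged on 2015-04-01 under this assumed rule.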
  9. 9. Considerations on Intellectual Property Concerns<br />Reliance on Legislation<br />Freedom of Information Act 2001<br />The public may have unrestricted access to certain records. <br />(Consider what categories of information may need to be viewed by the public - these records need to remain accessible at all times.)<br />In general, practice varies because there is no international digital deposit legislation<br />The PANDORA project seeks permission from the copyright owner<br />The Swedish and Finnish national library projects do not contact the owners<br />Making Agreements with Content Providers<br />E-journals: Publishers or academic associations<br />CLIR/DLF draft model license, NESLi2 standard license model<br />Agreement of Cornell University with publishers<br />Government documents: Open to the public<br />Scientific data: Individual creators or data centers<br />The Arts and Humanities Data Service provides information on what is needed for a digital archive and what creators are likely to be willing to deposit<br />- 9 -<br />
  10. 10. Agreement of Cornell University with Publishers<br />Topics identified in the agreement (Thomas and Kroch, 2000)<br />The general responsibilities of the publishers and Cornell <br />Characteristics of the data, accompanying metadata, and any additional documentation that are to be deposited <br />Guidelines on transmission methods and media for deposit <br />Procedures for the deposit <br />Procedures and protocols Cornell will use to verify the arrival and completeness of the data <br />Rights of the depositing organizations to audit the repository <br />The respective roles, responsibilities, and rights of Cornell and the data producers with regard to the data <br />Articulation of Cornell's responsibilities and capabilities with regard to the accessioning, description, management, and even transformation of the deposited data <br />Access policies for users of the repository, and how they may vary over time <br />Conditions on the use of the data, and again how they may vary over time <br />Fees (if any) associated with the deposit <br />Cornell's ability to share the data with partners to create an agreed-upon level of redundancy <br />Clarification of issues surrounding copyright retained by authors <br />- 10 -<br />
  11. 11. Identification and Cataloging<br />Identification<br />Provision of a unique key for finding the digital object and linking it to other related objects<br />Cataloging in the form of metadata<br />Support for organization, access and curation<br />- 11 -<br />
  12. 12. Persistent Identification<br />Problems in using a URL as an identifier<br />Using the server as a location identifier can result in a lack of persistence over time, both for the source object and for any linked objects<br />Continuous use of URL<br />New approaches to persistent identification<br />OCLC: PURLs<br />ACS: Digital Object Identifier (DOI), MN (Manuscript Number)<br />DTIC: Handle® system<br />AAS: Bibcode, PubRef numbers<br />- 12 -<br />
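The idea shared by the persistent-identifier approaches listed above is a level of indirection: citations carry a stable identifier, and a resolver maps it to wherever the object currently lives. The sketch below illustrates only that idea. The `IdentifierResolver` class, the `example:0001` identifier, and the example.org URLs are all hypothetical, and the code does not reproduce the actual PURL, DOI, or Handle protocols.

```python
class IdentifierResolver:
    """Toy resolver: stable identifier -> current location (hypothetical)."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier: str, url: str) -> None:
        # Re-registering an identifier models moving the object to a new
        # server: the identifier stays the same, only the location changes.
        self._locations[identifier] = url

    def resolve(self, identifier: str) -> str:
        try:
            return self._locations[identifier]
        except KeyError:
            raise LookupError(f"unknown identifier: {identifier}") from None


resolver = IdentifierResolver()
resolver.register("example:0001", "http://old-server.example.org/paper.pdf")
# The object moves; links that cite "example:0001" do not need to change.
resolver.register("example:0001", "http://new-server.example.org/archive/paper.pdf")
current_url = resolver.resolve("example:0001")
```

A link that embedded the old server URL directly would have broken at the move, which is the persistence problem the slide describes.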
  13. 13. Creation of Metadata at Cataloging Stage (1/3)<br />Creation Method of Metadata<br />Manual creation of metadata<br />Automatic generation of metadata<br />A project by US Environmental Protection Agency<br />Defense Information Technology Testbed project<br />- 13 -<br />
  14. 14. Creation of Metadata at Cataloging Stage (2/3)<br />Formats of Descriptive Metadata<br />E-journal<br />Full MARC cataloging <br />Traditional library cataloging standards<br />NLA’s PANDORA Archive<br />Current development of descriptive metadata standards<br />MARCXML, MODS (Metadata Object Description Schema)<br />Web-based resources <br />Dublin Core-like format <br />EVA project<br />Non-textual data<br />Identification of metadata elements needed for non-textual data types such as images, video, multimedia and others<br />Z39.87 NISO/AIIM Technical metadata for digital still images<br />AES X089 core audio metadata<br />- 14 -<br />
  15. 15. Creation of Metadata at Cataloging Stage (3/3)<br />Management of Heterogeneous Metadata Formats<br />Translation between various metadata formats<br />Key to the development of networked, heterogeneous archives<br />Adoption of packaging metadata standards<br />Open Archival Information System (OAIS) Reference Model<br />Developed by the Consultative Committee for Space Data Systems (CCSDS) and published as an ISO standard<br />Encapsulates specific metadata as needed for each object type in a consistent data model<br />Metadata Encoding and Transmission Standard (METS) <br />Produced by the Library of Congress standards office and the Digital Library Federation<br />Provides a framework for holding all types of metadata for a digital object<br />Others<br />MPEG-21 Digital Item Declaration Language<br />IMS Global Learning Consortium Content Packaging Standards<br />Sharable Content Object Reference Model (SCORM)<br />CCSDS XML Packaging scheme<br />- 15 -<br />
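Translation between metadata formats is commonly done with a crosswalk, a table that maps each field of one scheme to a field of another. The following is a minimal sketch under stated assumptions: the source field names (`article_title`, `author`, `pub_date`, `doi`, `page_count`) are invented for the example, while the targets `title`, `creator`, `date`, and `identifier` are Dublin Core element names. It is not a complete or official mapping.

```python
# Hypothetical local field names mapped to Dublin Core element names.
CROSSWALK = {
    "article_title": "title",
    "author": "creator",
    "pub_date": "date",
    "doi": "identifier",
}


def translate(record: dict, crosswalk: dict) -> tuple:
    """Split a record into translated fields and fields with no mapping."""
    translated = {}
    unmapped = {}
    for field, value in record.items():
        target = crosswalk.get(field)
        if target is None:
            # Keep unmapped fields visible instead of silently dropping them,
            # since lossy translation is a known risk of crosswalks.
            unmapped[field] = value
        else:
            # Dublin Core elements are repeatable, so collect values in lists.
            translated.setdefault(target, []).append(value)
    return translated, unmapped


source_record = {
    "article_title": "Best Practices for Digital Archiving",
    "author": "Hodge",
    "pub_date": "2000",
    "page_count": 12,
}
dc_record, leftovers = translate(source_record, CROSSWALK)
```

Returning the unmapped fields separately is a design choice for the sketch: it lets an archive decide whether to store them in a packaging format or discard them.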
  16. 16. Development of Technical Model for Storage<br />Recommendations for developing a technical model for the repository (Cornell University)<br />Establishing a baseline of e-journal software and file format needs <br />Specifying the archival repository<br />Specifying monitoring tools that will flag documents within the repository that require migration<br />Specifying a baseline hardware and software infrastructure to house the repository<br />Exploring the need and implementation models for redundancy in the repository<br />- 16 -<br />
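A monitoring tool that flags documents requiring migration can be as simple as a check of each file's format against a list of formats the repository considers at risk. The sketch below assumes format can be inferred from the file extension and uses a made-up at-risk list (`.wpd`, `.doc`); neither assumption comes from the Cornell proposal, and a production tool would more likely identify formats from file contents.

```python
from pathlib import PurePosixPath

# Hypothetical policy: extensions the repository treats as at risk.
AT_RISK_EXTENSIONS = frozenset({".wpd", ".doc"})


def flag_for_migration(paths, at_risk=AT_RISK_EXTENSIONS):
    """Return the paths whose extension is on the at-risk list, in input order."""
    return [
        path for path in paths
        if PurePosixPath(path).suffix.lower() in at_risk
    ]


holdings = ["vol1/report.WPD", "vol1/paper.pdf", "vol2/notes.doc"]
needs_migration = flag_for_migration(holdings)
```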
  17. 17. Issues on Changing Storage Media<br />Problem of changing storage media<br />Block size, tape size and tape drive mechanisms have changed over time.<br />Common Solution<br />Data migration to new storage systems<br />High cost and imperfect transfer are still issues.<br />Check/validation algorithms are extremely important<br />Manual checking is still necessary.<br />Atmospheric Radiation Monitoring Center plans to migrate to new storage systems every 4-5 years<br />Each data migration will take 6-12 months<br />- 17 -<br />
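One common form of check/validation during migration is a fixity check: compute a cryptographic digest of the file before the copy and again after it, and reject the copy if the two differ. The sketch below uses SHA-256 from Python's standard library. The function names and the choice of SHA-256 are assumptions for illustration, not the procedure used by any of the centers mentioned, and a matching digest only shows the bits were copied intact, not that the content is still usable.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_file(source, destination):
    """Copy source to destination and verify that the copy is bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"fixity check failed for {destination}")
    # The digest can be stored as preservation metadata for later audits.
    return after
```

Reading in chunks keeps memory use flat for large files, which matters when a migration covers terabytes of data.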
  18. 18. Issues on Terabytes of Data Storage<br />Problem of dealing with large-scale data<br />Extensive validation routines are needed to ensure the quality of the information as it is migrated<br />NCBI has 30 Ph.D.s reviewing the information manually, even after it has passed a variety of validation algorithms<br />Similar costs are incurred for<br />Corrections and additions to particular records<br />Maintenance of a history of changes<br />Approval by the owner of all changes controlled by NCBI<br />Common Solution<br />Large-scale data can be stored in different file formats<br />Biological sequence data is held in simple ASCII files for preservation purposes.<br />Data in a structured database is provided for searching, reporting and maintenance<br />Extensive tasks can be transitioned to a non-profit consortium<br />Protein Data Bank: Research Collaboratory for Structural Bioinformatics (RCSB) <br />- 18 -<br />
  19. 19. Preservation<br />Long-term preservation<br />No common agreement on the definition of long-term preservation<br />Main aspects of preservation<br />Selection of digital preservation strategies/technologies<br />Cycle for hardware/software migration <br />No specific investigation of the cycle for hardware/software migration has been done.<br />Depending on the particular technologies and subject disciplines, it can vary from 2 to 10 years.<br />Preservation of the “look and feel” of digital content<br />- 19 -<br />
  20. 20. Digital Preservation Strategies<br />Bitstream Copying<br />Refreshing<br />Durable/Persistent Media<br />Technology Preservation<br />Digital Archaeology<br />Analog Backups<br />Migration (SW, HW migration)<br />Replication<br />Reliance on Standards<br />Normalization<br />Canonicalization<br />Emulation<br />Encapsulation<br />Universal Virtual Computer<br />- 20 -<br />
  21. 21. Hardware and Software Migration<br />Problems with Migration<br />Migration is not guaranteed to work for all data types<br />Migration of information products that use sophisticated software features is unreliable<br />Generally, there is no backward compatibility, and where it is possible, there is certainly a loss of integrity in the result.<br />Emulation as an alternative to migration<br />Encapsulates the behavior of the hardware/software with the objects<br />E.g., an MS Word 2000 document with metadata indicating how to reconstruct the document at the engineering level<br />Creates an emulation registry identifying the hardware/software environment and providing information on how to recreate the environment<br />- 21 -<br />
  22. 22. Advantages and Disadvantages of Preservation Strategies<br />- 22 -<br />
  23. 23. Selection of Preservation Strategies<br />A schematic diagram for selection of preservation techniques of digital information. <br />(Lee et al, 2002)<br />- 23 -<br />
  24. 24. Preservation of the Look and Feel<br />Format of materials <br />In order to save the “look and feel” of material<br />TIFF<br />The most prevalent format among organizations involved in converting paper back files<br />E.g., JSTOR<br />Does not allow embedded references to be active hyperlinks<br />SGML/HTML<br />Used by many large publishers after years of converting publication systems from proprietary formats to SGML<br />American Astronomical Society (AAS)<br />PDF<br />The most prevalent format for purely electronic documents, used for both formal publications and grey literature<br />National Library of Sweden<br />Concerns remain for long-term preservation<br />It may not be accepted as a legal deposit format because of its proprietary nature<br />- 24 -<br />
  25. 25. Normalization vs. Native Formats<br />Normalization<br />Process of converting the native format to a standard format<br />AAS and ACS transform incoming files into SGML-tagged ASCII format<br />The electronic master copy is able to serve as the robust electronic archival copy.<br />A well-tagged copy can be updated periodically, at very little cost.<br />It takes advantage of advances in both technology and standards.<br />Content remains unchanged, but the public electronic version can be updated to remain compatible with browsers and other access technology<br />Examples of data normalization from the data community<br />NASA Distributed Active Archive Centers (DAACs)<br />Transform incoming satellite and ground monitoring information into the standard Common Data Format<br />UK's National Digital Archive of Datasets<br />Transforms the native format into one of its own devising<br />Normalized formats are considered to be the archival versions<br />Intellectual property question<br />- 25 -<br />
  26. 26. Reliance on Standards<br />Emphasis on Standards<br />DOE OSTI <br />Limited the number of acceptable input formats<br />Text in SGML (and its relatives HTML and XML), PDF, WordPerfect and Word.<br />Image in TIFF Group4 and PDF Image<br />- 26 -<br />
  27. 27. Preservation Strategies Used in Major Projects<br />- 27 -<br />CSI: CISTI Csi, ECO: OCLC Electronic Collections Online, EJO: OhioLINK Electronic Journal Center <br />KB: KB e-Depot, KOP: Kopal DDB, LA: LOCKSS Alliance, LANL: Los Alamos National Laboratory Research Library, <br />NLA: National Library of Australia PANDORA, OSP: Ontario Scholars Portal, PMC: PubMed Central, PORT: Portico<br />
  28. 28. Issues on Access<br />Access Mechanisms<br />Access and display mechanisms<br />Providing access<br />Restricting access<br />Rights Management and Security Requirements<br />Security and version control<br />Creation of metadata to manage encryption, watermarks, and digital signatures<br />- 28 -<br />
  29. 29. Access Mechanisms<br />Providing Access <br />NLM’s Profiles in Science<br />Creates an electronic archive of the photographs, text, video, etc<br />Electronic archive is used to create new access versions as access mechanisms change<br /> Providing access technologies<br />Super Distribution<br />Value-chain support<br />Restricting Access<br />Usage rule<br />Persistent protection<br />- 29 -<br />
  30. 30. Access<br />Rights Management and Security Requirements<br />The most difficult access issues for digital archiving<br />Security and version control impact digital archiving<br />Rights management includes providing or restricting access as appropriate<br />Content protection technologies<br />Content Encryption<br />Trusted Environment<br />Metadata for managing encryption, watermarks, and digital signatures needs to be created.<br />- 30 -<br />
  31. 31. References<br />CLIR, 2002. The State of Digital Preservation: An International Perspective [online] [cited 2009-07-23] <br />Hodge, 2000. Best Practices for Digital Archiving: An Information Life Cycle Approach, D-Lib Magazine: 6(1) [online] [cited 2009-07-23] <http://www.dlib.org/dlib/january00/01hodge.html><br />Hodge et al, 2004. Digital Preservation and Permanent Access to Scientific Information, [online] [cited 2009-07-23] <br />ICPSR, 2009. Digital Preservation Management: Implementing Short-term Strategies for Long-term Problems [online] [cited 2009-12-03] <http://www.icpsr.umich.edu/dpm/index.html><br />Kenney, A. R., Entlich, R., Hirtle, P. B., McGovern, N. Y. and Buckley E. L., 2006. E-Journal Archiving Metes and Bounds: A Survey of the Landscape [online] [cited 2009-12-03] <br />Lee, K., Slattery, O., Lu, R., Tang, X. and McCrary, V. 2002. The State of the Art and Practice in Digital Preservation, Journal of Research of the National Institute of Standards and Technology: 107(1), 93-106.<br />Thomas, S. E. and Kroch, C. A. 2000, Project Harvest: The Cornell University Library's Proposal to The Andrew W. Mellon Foundation To Develop a Repository for E-Journals, [online] [cited 2010-03-26] <http://www.diglib.org/preserve/cornellprop.htm><br />Edinburgh University Library Digital Archives Research Project. A report and recommendations<br />- 31 -<br />
