Transcript of "Ensuring Data Integrity white paper"
Ensuring Data Integrity in a Digital Preservation Archive
Future Perfect Conference 2012: Digital Preservation by Design (March 2012)

Abstract

This paper discusses the challenges of, and some working solutions to, a key requirement of digital preservation: ongoing data integrity of the archive. The solutions were developed cooperatively by three vendors in conjunction with the Church of Jesus Christ of Latter-day Saints. State-of-the-art, in-drive data validation plays a key role in ensuring ongoing data integrity.

Introduction to the Church of Jesus Christ of Latter-day Saints

The Church of Jesus Christ of Latter-day Saints is a worldwide Christian church with more than 14.4 million members and 28,784 congregations. With headquarters in Salt Lake City, Utah (USA), the Church operates 136 temples, three universities, a business college, and thousands of seminaries and institutes of religion around the world that enroll more than 700,000 students in religious training.

The Church has a scriptural mandate to keep records of its proceedings and preserve them for future generations. Accordingly, the Church has been creating and keeping records since 1830, when it was organized. A Church Historian’s Office was formed in the 1840s, and in 1972 it was renamed the Church History Department.

Digital Preservation and the Church History Department

Today, the Church History Department has ultimate responsibility for preserving records of enduring value that originate from its ecclesiastical leaders and within the various Church departments, the Church’s educational institutions, and its affiliations.

To fulfill its responsibility, the Church History Department has implemented a Digital Records Preservation System (DRPS) that is based on Ex Libris Rosetta. Rosetta provides configurable preservation workflows and advanced preservation planning functions, but only writes a single copy of an Archival Information Package¹ (AIP) to a storage device for permanent storage. An appropriate storage layer must be integrated with Rosetta in order to provide the full capabilities of a digital preservation archive, including AIP replication.

After investigating a host of potential storage layer solutions, the Church History Department chose NetApp StorageGRID to provide the Information Lifecycle Management (ILM) capabilities that were desired. In particular, StorageGRID’s data integrity, data resilience, and data replication capabilities were attractive.

In order to support ILM migration of AIPs from disk to tape, StorageGRID utilizes IBM Tivoli Storage Manager (TSM) as an interface to tape libraries.
DRPS also employs software extensions developed by Church Information and Communications Services (shown in the red boxes below and described later).

[Figure: Architecture of the Church History Department’s Digital Records Preservation System (DRPS)]

Within a decade, the Church anticipates that it will have generated a cumulative archival capacity of more than 100 petabytes for a single copy of AIPs. Therefore, the total cost of ownership of DRPS archival storage must be minimized.

An internal study showed that the total cost of automated tape cartridges would be a third of the corresponding cost of disk arrays. Therefore, the Church History Department currently uses automated tape libraries for DRPS archival storage.

Internal departments of the Church generate multiple petabytes of records annually. Record format types range from documents and images to videos of the birth, life, death, and resurrection of the Lord Jesus Christ that were given to the world as a free gift last Christmas by the Church (available at biblevideos.lds.org).

[Figure: Nativity scene from biblevideos.lds.org]

Each department has a detailed records management plan that specifies which of its collections are appraised as having enduring value. Typically, less than a tenth of the collections are targeted for preservation. In the future, selected Church websites will also be preserved. In addition, a multi-petabyte backlog of audiovisual collections is presently being ingested into DRPS.

Data Corruption in a Digital Preservation Archive

A critical requirement of a digital preservation system is the ability to continuously ensure data integrity of its archive. This requirement differentiates a tape archive from other tape farms.

Modern IT equipment, including servers, storage, network switches and routers, incorporates advanced features to minimize data corruption. Nevertheless, undetected errors still occur for a variety of reasons.

Whenever data files are written, read, stored, transmitted over a network, or processed, there is a small but real possibility that corruption will occur. Causes range from hardware and software failures to network transmission failures and interruptions. Bit flips (also called bit rot) within data stored on tape also cause data corruption.
Recently, data integrity of the entire DRPS tape archive was validated. This validation run encountered a bit error rate of 3.3 × 10⁻¹⁴.

The USC Shoah Foundation Institute for Visual History and Education has observed a bit error rate of 2.3 × 10⁻¹⁴ within its tape archive, which required the preservation team to flip back 1500 bits per 8 petabytes of archive capacity² (1500 bits out of roughly 6.4 × 10¹⁶ bits, taking a petabyte as 10¹⁵ bytes).

These real-life measurements, one taken from a large archive and the other from a relatively small archive, provide a credible estimate of the amount of data corruption that will occur in a digital preservation tape archive. Therefore, working solutions must be implemented to detect and correct these bit errors.

DRPS Solutions to Data Corruption

In order to continuously ensure data integrity of its tape archive, DRPS employs fixity information.

Fixity information is a checksum (i.e., integrity value) calculated by a secure hash algorithm to ensure data integrity of an AIP file throughout preservation workflows and after the file has been written to the archive.

By comparing fixity values before and after records are written, transferred across a network, moved, or copied, DRPS can determine whether data corruption has taken place during the workflow or while the AIP is stored in the archive. DRPS uses a variety of hash values, cyclic redundancy check values, and error-correcting codes for such fixity information.

In order to implement fixity information as early as possible in the preservation process, and thus minimize data errors, DRPS provides ingest tools developed by Church Information and Communications Services (ICS) that create SHA-1 fixity information for producer files before they are transferred to DRPS for ingest (see the DRPS architecture shown previously).

Within Rosetta, SHA-1 fixity checks are performed three times: (i) when the deposit server receives a Submission Information Package¹ (SIP), (ii) during the SIP validation process, and (iii) when a file is moved to permanent storage.

Rosetta also provides the capability to perform fixity checks on files after they have been written to permanent storage, but the ILM features of StorageGRID do not utilize this capability. Therefore, StorageGRID must take over control of the fixity information once files have been ingested into the grid.

By collaborating on this process, ICS and Ex Libris have succeeded in handing off the fixity information from Rosetta to StorageGRID. This is accomplished with a web service developed by ICS that retrieves SHA-1 hash values generated independently by StorageGRID when the files are written to the StorageGRID gateway node. Ex Libris developed a Rosetta plug-in that calls this web service and compares the StorageGRID SHA-1 hash values with those in the Rosetta database, which are known to be correct.

Turning now to the storage layer of DRPS, StorageGRID is constructed around the concept of object storage. To ensure object data integrity, StorageGRID provides a layered and overlapping set of protection domains that guard against data corruption and alteration of files that are written to the grid.
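The SHA-1 fixity comparison described above can be illustrated with a minimal Python sketch. It is not the ICS web service or the Ex Libris plug-in, whose interfaces are not given in this paper. The function names `sha1_of_file` and `compare_fixity`, and the idea of holding each side's hash values in a dictionary keyed by file name, are assumptions made only for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Stream a file through SHA-1 so large files need not fit in memory."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_fixity(reference, candidate):
    """Return the names whose candidate hash is missing or differs.

    `reference` stands in for the hash values trusted as correct (here, the
    Rosetta side); `candidate` stands in for hash values computed
    independently by the storage layer. Both are hypothetical
    {file name: hex digest} dictionaries.
    """
    return sorted(
        name
        for name, expected in reference.items()
        if candidate.get(name) != expected
    )
```

An empty result from `compare_fixity` means that every file recorded on the reference side has a matching hash value on the storage side. Any name that is returned identifies a file that would need to be investigated or written again.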
The highest level domain utilizes the SHA-1 fixity information discussed above.

A SHA-1 hash value is generated for each AIP (or object) that Rosetta writes to permanent storage (i.e., to StorageGRID). Also called the Object Hash, the SHA-1 hash value is self-contained and requires no external information for verification.

Each object contains a SHA-1 object hash of the StorageGRID formatted data that comprise the object. The object hash is generated when the object is created (i.e., when the gateway node writes it to the first storage node).

To assure data integrity, the object hash is verified every time the object is stored and accessed. Furthermore, a background verification process uses the SHA-1 object hash to verify that the object, while stored on disk, has neither become corrupt nor been altered by tampering.

Underneath the SHA-1 object hash domain, StorageGRID also generates a Content Hash when the object is created. Since objects consist of AIP content data plus StorageGRID metadata, the content hash provides additional protection for AIP content files.

Because the content hash is not self-contained, it requires external information for verification, and therefore is checked only when the object is accessed.

Each StorageGRID object has a third and fourth domain of data protection applied, and two different types of protection are utilized.

First, a cyclic redundancy check (CRC) checksum is added that can be quickly computed to verify that the object has not been corrupted or accidentally altered. This CRC provides a verification process that minimizes resource use, but is not secure against deliberate alteration.

Second, a key-based hash value is appended. This value can be verified using the key that is stored as part of the metadata managed by StorageGRID. Although this hash value takes more resources to implement than the CRC checksum described above, it is secure against all forms of tampering as long as the key is protected.

The CRC checksum is verified during every StorageGRID object operation, i.e., store, retrieve, transmit, receive, access, and background verification. But, as with the content hash, the key-based hash value is only verified when the object is accessed.

Once a file has been correctly written to a StorageGRID storage node (i.e., its data integrity has been ensured through both SHA-1 object hash and CRC fixity checks), StorageGRID invokes the TSM Client running on the archive node server in order to write the file to tape.

As this happens, the SHA-1 (object hash) fixity information is not handed off to TSM. Rather, it is superseded with new fixity information composed of various cyclic redundancy check values and error-correcting codes that provide TSM end-to-end logical block protection when writing the file to tape.

Thus the DRPS fixity information chain of control is altered when StorageGRID invokes TSM. Nevertheless, validation of the file’s data integrity continues seamlessly until it is written to tape.
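The difference between the CRC checksum and the key-based hash value described above for StorageGRID objects can be made concrete with a short sketch. The paper does not say which CRC or which keyed hash StorageGRID uses, so CRC-32 (`zlib.crc32`) and HMAC-SHA1 are stand-ins chosen only for illustration, and the function names are hypothetical.

```python
import hashlib
import hmac
import zlib


def protect(payload, key):
    """Compute both protection values for an object payload (bytes)."""
    return {
        # Cheap to compute; catches accidental corruption only.
        "crc32": zlib.crc32(payload),
        # Costlier; cannot be recomputed without the secret key.
        "keyed_hash": hmac.new(key, payload, hashlib.sha1).hexdigest(),
    }


def crc_ok(payload, tags):
    """Check the payload against the stored CRC value."""
    return zlib.crc32(payload) == tags["crc32"]


def keyed_hash_ok(payload, key, tags):
    """Check the payload against the stored key-based hash value."""
    expected = hmac.new(key, payload, hashlib.sha1).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected, tags["keyed_hash"])
```

In this sketch an accidental bit flip fails both checks. Someone who alters the payload deliberately can recompute the CRC so that `crc_ok` still passes, but cannot produce a matching key-based hash value without the key, which is why the second check is the one that resists tampering.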
The process begins when the TSM server calculates and appends a CRC value to each logical block of the file before transferring it to a tape drive for writing. Each appended CRC is called the “original data CRC” for that logical block.

When the tape drive receives a logical block, it computes its own CRC for the data and compares it to the original data CRC. If an error is detected, a check condition is generated, forcing a re-drive or a permanent error, which effectively guarantees protection of the logical block during the transfer.

In addition, as the logical block is loaded into the drive’s main data buffer, two other processes occur:

(1) Data received at the buffer is cycled back through an on-the-fly verifier that once again validates the original data CRC. Any introduced error will again force a re-drive or a permanent error.

(2) In parallel, an error-correcting code (ECC) is computed and appended to the data. Referred to as the “C1 code,” this ECC protects data integrity of the logical block as it goes through additional formatting steps, including the addition of a second ECC, referred to as the “C2 code.”

As part of these formatting steps, the C1 code is checked every time data is read from the data buffer. Thus, the protection provided by the original data CRC is essentially handed over to the more powerful C1 code.

Finally, the data is read from the main buffer and is written to tape using a read-while-write process. During this process, the just-written data is read back from tape and loaded into the main data buffer so the C1 code can be checked once again to verify the written data.

A successful read-while-write operation assures that no data corruption has occurred from the time the file’s logical block was transferred from the TSM server until it is written to tape. And using these ECCs and CRCs, the tape drive can validate logical blocks at full line speed as they are being written!

During a read operation (i.e., when Rosetta accesses an AIP), data is read from the tape and all three codes (C1, C2, and the original data CRC) are decoded and checked. A read error is generated if any process indicates an error.

The original data CRC is then appended to the logical block when it is transferred to the TSM server so it can be independently verified by that server, thus completing the TSM end-to-end logical block protection cycle.

This advanced and highly efficient TSM end-to-end logical block protection is enabled with state-of-the-art functions available with IBM LTO-5 and TS1140 tape drives.

When the TSM server sends the data over the network to a TSM client, CRC checking is done once again to ensure integrity of the data as it is written to the StorageGRID storage node.

From there, StorageGRID fixity checking occurs, as explained previously for object access (including content hash and key-based hash value checking), until the data is transferred to Rosetta for delivery to its requestor, thus completing the DRPS data integrity validation cycle.
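The first stage of the write path described above, in which the server appends an original data CRC to each logical block and the drive recomputes it on receipt, can be sketched as follows. This is a simplified model, not TSM or tape drive code: CRC-32 stands in for the actual CRC, the block size and retry limit are arbitrary, the C1 and C2 codes are omitted, and `send` is a hypothetical callable that models the transfer and may return corrupted bytes.

```python
import zlib

BLOCK_SIZE = 256 * 1024  # hypothetical logical block size in bytes


class BlockTransferError(Exception):
    """Raised when a block still fails its CRC check after every re-drive."""


def split_with_crc(data, block_size=BLOCK_SIZE):
    """Server side: split data into logical blocks, each paired with its CRC."""
    return [
        (data[i:i + block_size], zlib.crc32(data[i:i + block_size]))
        for i in range(0, len(data), block_size)
    ]


def write_block(send, block, original_crc, max_attempts=3):
    """Drive side: recompute the CRC on receipt and re-drive on a mismatch.

    Returns the verified block and the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        received = send(block)
        if zlib.crc32(received) == original_crc:
            return received, attempt
    # Every attempt failed: report the equivalent of a permanent error.
    raise BlockTransferError("logical block failed its CRC check")
```

A transfer that corrupts a block once is repaired transparently by the second attempt, while a transfer that keeps corrupting the block ends in an explicit error instead of silently writing bad data.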
[Figure: DRPS Data Integrity Validation. Summary of the DRPS data integrity validation cycle]
  DRPS Ingest Tools (SHA-1 control): SHA-1 created for producer files
  Rosetta (SHA-1 control): SHA-1 checked upon ingest and write to permanent storage
  Storage Extensions: Web service retrieves StorageGRID SHA-1, then Rosetta plug-in compares with Rosetta SHA-1
  StorageGRID (SHA-1 control): SHA-1 created for ingested files; SHA-1 and other fixity checked during write to storage nodes
  IBM Tivoli Storage Manager (CRCs, ECCs): TSM end-to-end logical block protection

Ensuring Ongoing Data Integrity

Unfortunately, continuously ensuring data integrity of a DRPS AIP does not end once the AIP has been written correctly to tape. Periodically, the tape(s) containing the AIP need to be checked to uncover errors (i.e., bit flips) that may have occurred since the AIP was correctly written.

Fortunately, IBM LTO-5 and TS1140 tape drives can perform this check without having to stage the AIP to disk, which is clearly a resource-intensive task, especially for an archive with a capacity measured in hundreds of petabytes!

IBM LTO-5 and TS1140 drives can perform data integrity validation in-drive, which means a drive can read a tape and concurrently check the AIP logical block CRC and ECCs discussed above (C1, C2, and the original data CRC). Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources!

Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips that occur after AIPs are written correctly to tape.

Conclusion

The Church of Jesus Christ of Latter-day Saints is making a substantial investment to continuously ensure data integrity of its DRPS archive as described in this paper. The benefits of preserving the Church’s exalting and inspiring records cannot be measured in financial terms, however. Those benefits include building character and strengthening families, both of which are designed to foster personal and family happiness.

References

1 CCSDS 650.0-B-1, Blue Book, “Reference Model for an Open Archival Information System (OAIS),” Consultative Committee for Space Data Systems (2002)
2 Private conversation with Sam Gustman (CTO) at the USC Shoah Foundation Institute, August 19, 2009