This document summarizes a presentation on ensuring data integrity in the LDS Church's Digital Records Preservation System (DRPS). It describes the DRPS architecture, which maintains multiple copies across geographic locations with automatic replication, and applies integrity checks (SHA-1 fixity values, cyclic redundancy checks, and error-correcting codes) from ingest through permanent tape storage. It explains why tape offers better long-term preservation economics than disk despite access challenges, and covers ongoing verification of archive integrity through periodic reading and in-drive error checking of tapes.
Gabe Nault: Data Integrity

1. Ensuring Data Integrity in a Digital Preservation Archive
Gabe Nault, LDS Church
naultga@ldschurch.org
Future Perfect Conference, 2012
(image courtesy of IBM)
2. Introducing the LDS Church
• The Church of Jesus Christ of Latter-day Saints
• Global Christian church with 14 million members
• 3 universities, 1 college
• State-of-the-art audiovisual capabilities
• Scriptural mandate to keep and preserve records since 1830
(photo by Henok Montoya)
3. The Church History Department
• Preserves records of enduring value from Church leaders, departments, universities, and affiliations (more than 35 organizations)
• Typically, less than 10% of records are candidates for preservation
(Church History Library on Temple Square)
4. Granite Mountain Records Vault
• Bored into a solid granite mountain
• Stores large microfilm collections and valuable Church artifacts
• Plans recently developed to renovate the facility for digital preservation
5. The Media Services Department
• Audiovisual records will consume the majority of our archive capacity
• 100+ PB in a decade for a single copy!
(Photos: Mormon Tabernacle Choir and Orchestra; free Bible videos from biblevideos.lds.org; Conference Center on Temple Square)
6. DRPS System Architecture
[Architecture diagram: Digital Records → DRPS Ingest Tools (fixity creation) → Preservation Information System (preservation functions; Ex Libris Rosetta) → Storage Extensions (fixity bridge) → StorageGRID (lifecycle management) → IBM Tivoli Storage Manager (tape interface)]
8. DRPS Highlights
• Multiple copies in multiple geographic locations (eventually)
• Approximately 1 PB of spinning media
• Automatic replication to remote site(s)
• End-to-end data integrity
• Tape-based permanent storage
9. Why Tape for Preservation?
Total cost of storage ownership study
• TCO: over ten years, ownership and operating costs of tape are three to fifteen times less than the associated costs for disk arrays
• Cost advantages of tape are expected to increase over time
• Conclusion—for now, tape is required to sustain a multi-PB digital archive
• But . . . tape presents some challenges
(IBM TS3500 Tape Libraries; image courtesy of IBM)
10. Why Tape for Preservation?
Limitations
• Latency
• Limited to sequential access
• Limited number of read/writes
• Leads to greater system and operational complexity
(IBM TS3500 Tape Libraries; image courtesy of IBM)
11. Data Integrity
• Data integrity validation is provided by fixity checks when data is written, transferred, moved, or copied
• Fixity checking should be performed from file creation to permanent storage to delivery
• Periodic validation of the entire archive should also be performed to detect data corruption (bit rot, drive errors, tape degradation, etc.)
• DRPS uses a variety of integrity values for fixity
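To make the fixity concept concrete, here is a minimal sketch (not part of the original deck) of how a preservation workflow might generate a SHA-1 fixity value at file creation and re-verify it after a transfer; the file names are illustrative only.

```python
import hashlib

def sha1_fixity(path, chunk_size=1 << 20):
    """Compute a SHA-1 fixity value for a file, reading in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record fixity at creation, then re-check after any write/transfer/move/copy.
original = sha1_fixity("producer_file.mxf")   # illustrative file name
# ... transfer the file to the archive ...
if sha1_fixity("archived_copy.mxf") != original:
    raise RuntimeError("fixity mismatch: data corrupted in transit")
```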
13. DRPS Data Integrity Validation
[Chain-of-control diagram:
• DRPS Ingest Tools: SHA-1 created for producer files (SHA-1 control)
• Rosetta: SHA-1 checked upon ingest and on write to permanent storage
• Storage Extensions: web service retrieves the StorageGRID SHA-1, then a Rosetta plug-in compares it with the Rosetta SHA-1
• StorageGRID: SHA-1 created for ingested files]
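As a rough illustration of the hand-off step, the plug-in's comparison logic might look like the sketch below. The web-service URL, object identifier, and hash value are hypothetical stand-ins, not the actual DRPS interfaces.

```python
import urllib.request

# Known-good SHA-1 values from the Rosetta database (hypothetical IDs/values).
ROSETTA_SHA1 = {"IE1234": "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"}

def storagegrid_sha1(object_id):
    """Retrieve the SHA-1 that StorageGRID computed independently on write.
    The URL is a hypothetical stand-in for the custom fixity web service."""
    with urllib.request.urlopen(f"https://fixity.example/sha1/{object_id}") as r:
        return r.read().decode().strip()

def verify_handoff(object_id):
    # Compare StorageGRID's independently computed hash with Rosetta's record.
    return storagegrid_sha1(object_id) == ROSETTA_SHA1[object_id]
```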
14. StorageGRID Fixity Checking
• StorageGRID is constructed around the concept of object storage
• Provides a layered/overlapping set of protection domains to guard against object data corruption:
1. SHA-1 object hash—checked on store and access
2. Content hash—checked on access
3. CRC checksum—checked with every operation
4. Key-based hash value—checked on access
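A toy sketch (my own simplification, not StorageGRID code) of what layered, overlapping checks buy you: a cheap CRC runs on every operation, while the stronger SHA-1 is re-verified on store and access.

```python
import hashlib
import zlib

class StoredObject:
    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)              # cheap check, every operation
        self.sha1 = hashlib.sha1(data).digest()  # strong check, store/access

    def check_crc(self):
        # Fast per-operation protection domain.
        if zlib.crc32(self.data) != self.crc:
            raise IOError("CRC mismatch")

    def access(self) -> bytes:
        self.check_crc()
        # Stronger on-access protection domain.
        if hashlib.sha1(self.data).digest() != self.sha1:
            raise IOError("SHA-1 mismatch")
        return self.data
```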
15. DRPS Data Integrity Validation
[Chain-of-control diagram, extended to tape:
• DRPS Ingest Tools: SHA-1 created for producer files
• Rosetta: SHA-1 checked upon ingest and on write to permanent storage
• Storage Extensions: web service retrieves the StorageGRID SHA-1, then a Rosetta plug-in compares it with the Rosetta SHA-1
• StorageGRID: SHA-1 created for ingested files; SHA-1 and other fixity checked during write to storage nodes
• IBM Tivoli Storage Manager: CRCs and ECCs provide end-to-end logical block protection]
16. TSM End-to-End Logical Block Protection
• Supersedes SHA-1 fixity information with cyclic redundancy check values (CRCs) and error-correcting codes (ECCs)
• Enabled with new, state-of-the-art functionality of IBM LTO-5 and TS1140 tape drives
• Seamlessly extends validation of data integrity as data is written to tape
17. TSM End-to-End Logical Block Protection
1. TSM server calculates and appends an "original data CRC" to each logical data block
2. Tape drive computes its own CRC and compares it to the original data CRC
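The following toy sketch (my own illustration in Python, not TSM or drive firmware) mirrors steps 1 and 2: the sender appends a CRC to each logical block, and the receiver recomputes it to confirm the block survived the transfer.

```python
import zlib

def append_crc(block: bytes) -> bytes:
    """Step 1: append the 'original data CRC' to a logical block."""
    return block + zlib.crc32(block).to_bytes(4, "big")

def receive_block(framed: bytes) -> bytes:
    """Step 2: the receiver recomputes the CRC and compares."""
    block, original_crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(block) != original_crc:
        raise IOError("CRC mismatch: force a re-drive or permanent error")
    return block

framed = append_crc(b"logical block payload")
assert receive_block(framed) == b"logical block payload"
```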
18. TSM End-to-End Logical Block Protection
3. As the logical block is loaded into the drive data buffer, an on-the-fly verifier checks the original data CRC
4. In parallel, a "C1 code" (ECC) is computed and appended
19. TSM End-to-End Logical Block Protection
5. An additional ECC, referred to as the "C2 code," is added to the logical block
6. More powerful than the original data CRC, the C1 code is checked every time data is read from the buffer
20. TSM End-to-End Logical Block Protection
7. Data is written to tape at full line speed with a read-while-write process
8. Just-written data is loaded to the buffer and the C1 code is checked
A successful read-while-write operation assures no data corruption from the TSM server to tape
21. TSM End-to-End Logical Block Protection
9. When the tape is read, all codes (C1, C2, original data CRC) are checked by the drive
10. The original data CRC is appended to the logical block
11. TSM server verifies the original data CRC, completing the TSM end-to-end logical block protection cycle
22. Ongoing Archive Data Integrity
• We must assume that data may become corrupted after being written correctly to tape
• Therefore, tapes must be read periodically to identify and correct data errors
(image courtesy of IBM)
23. Ongoing Archive Data Integrity
• Staging IEs to disk to verify integrity is resource intensive!
• IBM LTO-5 and TS1140 tape drives provide a more efficient solution
• During a "SCSI Verify" operation, a tape is mounted and the drive checks all codes (C1, C2, original data CRC) as data is being read (at full line speed)
• Only status is reported as these internal checks are completed
(image courtesy of IBM)
24. Summary
• Fixity information is the key to data integrity
• SHA-1 values ensure data integrity to StorageGRID
• TSM end-to-end logical block protection ensures data integrity to tape
• In-drive validation enables ongoing integrity checks for the entire archive
[Diagram recap: DRPS Ingest Tools → Storage Extensions → StorageGRID → IBM Tivoli Storage Manager, with SHA-1 control at each hand-off and CRCs/ECCs at the tape layer]
26. Trademarks
The Ex Libris logo and Rosetta are trademarks of Ex Libris Group.
The IBM logo and Tivoli Storage Manager are trademarks of International Business Machines Corporation.
The NetApp logo and StorageGRID are trademarks of NetApp, Inc.
27. Rate of Bit Errors
• Preliminary validation of the DRPS archive resulted in a 3.3×10⁻¹⁴ bit error rate
• USC Shoah Foundation Institute visit:
  – 8 PB tape archive of videotaped interviews of Holocaust survivors and other witnesses
  – Experienced 1,500 bit flips in 8 PB (a 2.3×10⁻¹⁴ bit error rate)
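A quick back-of-the-envelope check (my own arithmetic, not from the deck) confirms the Shoah Institute figure: 8 PB is 8×10¹⁵ bytes, or 6.4×10¹⁶ bits, so 1,500 flipped bits gives roughly 2.3×10⁻¹⁴.

```python
bits = 8e15 * 8           # 8 PB expressed in bits (6.4e16)
bit_flips = 1500
print(f"{bit_flips / bits:.1e}")  # -> 2.3e-14, matching the reported rate
```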
Editor's Notes
Good morning! My presentation will cover the challenges of, and some working solutions to, a key requirement of digital preservation—ongoing data integrity of the archive. The solutions I will discuss were developed cooperatively by three vendors in conjunction with the Church of Jesus Christ of Latter-day Saints. By the way, you will be able to download a white paper that covers my presentation along with the presentation itself when the conference is over.
First let me introduce the Church. Its full name is the Church of Jesus Christ of Latter-day Saints. Headquarters are in Salt Lake City, Utah – a western state in the United States of America. The building shown here is the Salt Lake Temple, which has come to be a symbol for the Church. For your information, the Church operates 134 temples around the world. Temples are not weekly meeting places; rather, they are sacred places where families are sealed together forever – beyond life here on earth. We are a global Christian Church with more than 14 million members. The Church has more than 700,000 students enrolled in religious training around the world. It also operates three universities and a business college. Education is very important to members of the Church of Jesus Christ of Latter-day Saints. Over the last two decades, the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. I will talk more about this later. The Church has a scriptural mandate to keep records of its proceedings and preserve them for future generations. Accordingly, the Church has been creating and keeping records since 1830, when it was organized. A Church Historian’s Office was formed in the 1840s, and in 1972 it was renamed the Church History Department.
Today, the Church History Department has ultimate responsibility for preserving records of enduring value that originate from its ecclesiastical leaders and within the various Church departments, the Church’s educational institutions, and its affiliations. In order to carry out this responsibility, the Church History Department’s Records Management team helps each Church organization develop a records management plan. The plan identifies all records used by the organization and establishes a record retention and disposition schedule for each collection. Usually, less than 10% of the records have a final disposition of “archive.” Only these records are preserved for future generations.
I mentioned earlier that the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. The Media Services Department uses these capabilities to support the rest of the Church organizations in their audiovisual needs. Because of the average size of MSD audiovisual files, which is several hundred gigabytes so far, MSD audiovisual files will consume the vast majority of archive capacity in the Church History Department’s Digital Records Preservation System. One example of audiovisual records we preserve is weekly broadcasts of Music and the Spoken Word—the world’s longest continuous network broadcast (now in its 83rd year). Each broadcast features an inspirational message and music performed by the Mormon Tabernacle Choir, also known as “America’s Choir,” and the Orchestra at Temple Square. These National Radio Hall of Fame broadcasts are clearly a priceless treasure for the world that are being preserved for future generations. Another example of audiovisual records we preserve is semiannual broadcasts of General Conference, which is held in the remarkable Conference Center, shown here, that seats 21,000. The meetings are broadcast in high definition video via satellite to more than 7,400 Church buildings in 102 countries. The broadcasts are simultaneously translated into 76 languages. Ultimately, digital audio tracks for 96 languages are created and preserved to augment the digital videotaping of each meeting. Not surprisingly, the Church is the world’s largest language broadcaster. As a gift to the world, the Church launched a new website last Christmas that provides free Bible videos of the birth, life, death, and resurrection of the Lord Jesus Christ. Viewable with a free mobile app, these videos are faithful to the biblical account, and of course will be preserved for future generations. I encourage you to visit the website at biblevideos.lds.org. With audiovisual files such as these, we expect that our archive capacity within a decade will exceed 100 petabytes for a single copy!
The Church History Department’s Digital Records Preservation System, or DRPS, is based on Ex Libris Rosetta. Rosetta provides configurable preservation workflows and advanced preservation planning functions, but only writes a single copy of an AIP to a storage device for permanent storage. Therefore, an appropriate storage layer must be integrated with Rosetta in order to provide the full capabilities of a digital preservation archive, including AIP replication. After investigating a host of potential storage layer solutions, the Church History Department chose NetApp StorageGRID to provide the ILM capabilities that were desired. In particular, StorageGRID’s data integrity, data resilience, and data replication capabilities were attractive. In order to support ILM migration of AIPs from disk to tape, StorageGRID utilizes IBM Tivoli Storage Manager, or TSM, as an interface to tape libraries. DRPS also employs software extensions developed by my team, which is part of Church Information and Communications Services. The first is a set of ingest tools that help with fixity information creation, which I will discuss later. The second involves a fixity information bridge that will also be described later.
You may wonder why we chose to use tape libraries for the DRPS archive. In 2008, an internal study was performed to compare the costs of acquisition, maintenance, administration, data center floor space, and power to archive hundreds of petabytes of digital records using disk arrays, optical disks, virtual tape libraries, and automated tape cartridges. The model also incorporated assumptions about increasing storage densities of these different storage technologies over time. Calculating all costs over a ten year period, the study concluded that the total cost of ownership of automated tape cartridges would be 33.7% of the next closest storage technology, which was disk arrays. Based on discussions with major storage providers, we believe that the cost of power and the cost per terabyte advantages of tape will only increase over time. Therefore, we concluded that, at least for now, we should use tape libraries to sustain our digital archive that is expected to skyrocket to a multiple petabyte capacity in just a few years. When we made this decision, we were NOT naive to the challenges of tape we would be facing.
One of those challenges has to do with ensuring data integrity of the tape archive. This is a critical requirement for any digital preservation archive, and it differentiates a tape archive from other types of tape farms. Modern IT equipment, including servers, storage, and network switches and routers, incorporates advanced features to minimize data corruption. Nevertheless, undetected errors still occur for a variety of reasons. Whenever data files are written, read, stored, transmitted over a network, or processed, there is a small but real possibility that corruption will occur. Causes range from hardware and software failures to network transmission failures and interruptions. Bit flips within files stored on tape can also cause data corruption. Fixity information enables data integrity validation. Fixity information is a checksum, or integrity value, that is calculated by a secure hash algorithm to ensure data integrity of an AIP file throughout preservation workflows and after the file has been written to the archive. By comparing fixity hash values before and after records are written, transferred over a network, moved, or copied, a digital preservation system can determine if data corruption has taken place during its workflows or while the AIP is stored in the archive. To do data integrity validation correctly, end-to-end fixity checking should be performed from file ingest to storing the file on permanent storage to eventual access and delivery. Furthermore, data integrity validation of the entire archive should be performed periodically to detect and correct bit flips (also known as bit rot). DRPS uses a variety of hash values, cyclic redundancy check values, and error-correcting codes for such fixity information.
As mentioned earlier, DRPS employs a variety of hash values, cyclic redundancy check values, and error-correcting codes in order to ensure data integrity of its tape archive. The chain of control of this fixity information is illustrated here. In order to implement fixity information as early as possible in the preservation process, and thus minimize data errors, DRPS provides ingest tools developed by my team that create SHA-1 fixity information for producer files before they are transferred to DRPS for ingest. Control of this SHA-1 fixity information is transferred when a file is ingested into Rosetta. Within Rosetta, SHA-1 fixity checks are performed three times—(1) when the deposit server receives a Submission Information Package (SIP), (2) during the SIP validation process, and (3) when an AIP is moved to permanent storage. Rosetta also provides the capability to perform fixity checks on AIP files written to permanent storage, but the ILM features of StorageGRID do not utilize this capability. Therefore, StorageGRID must take over control of the SHA-1 fixity information once files have been written to it. By collaborating with Ex Libris on this process, ICS and Ex Libris have been successful in making the fixity information hand-off from Rosetta to StorageGRID. This is accomplished with a web service we developed that retrieves SHA-1 hash values generated independently by StorageGRID when the files are written to the StorageGRID gateway node. Ex Libris developed a Rosetta plug-in that calls this web service and compares the StorageGRID SHA-1 hash values with those in the Rosetta database, which are known to be correct.
Before I go any further with the DRPS data integrity validation chain, I’d like to discuss in some detail how StorageGRID handles fixity checking. First, StorageGRID is constructed around the concept of object storage, which enables it to provide advanced Information Lifecycle Management capabilities. To ensure object data integrity, StorageGRID provides a layered and overlapping set of protection domains that guard against object data corruption and alteration of files that are written to the grid. The first domain is called the SHA-1 object hash—this is the same SHA-1 hash value we just discussed with the previous slide. It is generated when the object (or AIP) is created (i.e., when the gateway node writes it to the first storage node), and it is verified every time the object is stored and accessed. The second domain is called the content hash. Because this hash is not self-contained, it requires external information for verification, and therefore is checked only when the object is accessed. The third domain is a cyclic redundancy check, or CRC, checksum. It is verified during every StorageGRID object operation—store, retrieve, transmit, receive, access, and background verification. And finally, the fourth domain is a key-based hash value. Using the hash key, this domain secures against all forms of tampering. As you see, StorageGRID provides very sophisticated and advanced fixity checking, which is a major reason we selected it for our DRPS storage layer.
Continuing with the DRPS data integrity validation chain . . . StorageGRID uses the four levels of fixity checking we just discussed to ensure integrity of AIPs that are written to the grid—from the gateway node to the storage nodes. Once a file has been correctly written to a storage node, StorageGRID invokes the TSM Client running on the archive node server in order to write the file to tape. As this happens, the SHA-1 fixity information is not handed off to TSM. Rather, TSM end-to-end logical block protection takes over.
TSM end-to-end logical block protection utilizes CRCs and ECCs that supersede the use of SHA-1 fixity information while TSM is in control of the file. This advanced protection is enabled with brand new, state-of-the-art functionality provided by IBM LTO-5 and TS1140 tape drives, which I will soon illustrate. While the DRPS fixity information chain of control is altered when StorageGRID invokes TSM, validation of the file’s data integrity continues seamlessly until it is correctly written to tape using TSM end-to-end logical block protection.
The TSM end-to-end logical block protection process begins when the TSM server calculates and appends a cyclic redundancy check value, or CRC, to each AIP logical block before transferring it to a tape drive for writing. Each appended CRC is called the “original data CRC” for that logical block. When the tape drive receives a logical block, it computes its own CRC for the data and compares it to the original data CRC. If an error is detected, a check condition is generated, forcing a re-drive or a permanent error. This step effectively guarantees protection of the logical block during transfer.
As the logical block is loaded into the drive’s main data buffer, two parallel processes occur. In one process, data is cycled back through an on-the-fly verifier that once again validates the original data CRC. Any introduced error will force a re-drive or a permanent error. In parallel, an error-correcting code, or ECC, is computed and appended to the data. Referred to as the “C1 code,” this ECC protects data integrity of the logical block as it goes through additional formatting steps . . .
. . . including the addition of an additional ECC, referred to as the “C2 code.” As part of these formatting steps, the C1 code is checked every time data is read from the data buffer. Thus, protection of the original data CRC is essentially transformed to protection from the more powerful C1 code.
Finally, the data is read from the main data buffer and is written to tape using a read-while-write process. During this process, the just-written data is read back from tape and loaded into the main data buffer so the C1 code can be checked once again to verify the written data. A successful read-while-write operation assures that no data corruption has occurred from the time the AIP logical block was transferred from the TSM server until it is written to tape. And using these ECCs and CRCs, the tape drive can validate AIP logical blocks at full line speed as they are being written!
During a read operation, data is read from the tape and all three codes (C1, C2, and the original data CRC) are decoded and checked, and a read error is generated if any process indicates an error. The original data CRC is then appended to the logical block. When the logical block is transferred to the TSM server, the original data CRC is independently verified by that server, thus completing the TSM end-to-end logical block protection cycle. I didn’t mention this previously, but TSM also performs data integrity validation during client sessions when data is sent between a client and the server, or vice versa.
Unfortunately, the task of ensuring data integrity of a DRPS AIP does not end once the AIP has been written correctly to tape. We must assume that bits will flip after being written correctly to tape. As we discussed earlier, the USC Shoah Foundation Institute has seen a 10⁻¹⁴ bit error rate, and we saw the same when we recently validated our entire tape archive. Therefore, we believe that all written tapes in the archive must be read periodically to find and correct bit flips that have occurred since the tapes were written correctly.
Unfortunately, reading all the tapes in the archive in order to stage AIPs to disk so servers can check the fixity information is clearly a resource intensive task—especially for an archive with a capacity measured in hundreds of petabytes! Fortunately, IBM LTO-5 and TS1140 tape drives provide a much more efficient solution. During a “Verify” operation, IBM LTO-5 and TS1140 drives perform data integrity validation in-drive, which means a drive reads a tape and concurrently checks the three logical block CRCs and ECCs discussed previously at full line speed. Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources! Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips after AIPs are written correctly to tape.
To summarize my presentation, fixity information is the key to archive data integrity. For the Church History Department’s Digital Records Preservation System, SHA-1 fixity values ensure data integrity all the way from the producer to StorageGRID archive nodes. From there, TSM end-to-end logical block protection takes over to ensure data integrity until the data is correctly written to tape. And finally, the new in-drive data integrity validation capability of IBM tape drives enables DRPS to perform periodic data integrity checks of the entire archive to provide continuous data integrity.
Sizing bit errors in your digital archive may be somewhat difficult. I heard from a tape vendor at a recent preservation conference that tape exhibits a 10⁻¹⁹ bit error rate. This rate is optimistic, however, compared to the results we recently encountered when we performed a data integrity validation of our entire DRPS archive. We realized a 3.3×10⁻¹⁴ bit error rate, which is five orders of magnitude higher than the vendor claim! The 10⁻¹⁹ figure is also optimistic compared to what I encountered when I visited the University of Southern California’s Shoah Foundation Institute in 2009. But first, some background on this Institute. It was established by Steve Spielberg, the great film producer, after he finished filming Schindler’s List. Shoah is the Hebrew term for Holocaust. More than 51,000 interviews of Holocaust survivors and other witnesses have been videotaped by the Shoah Institute. Currently, 87% of the 204,000+ Betacam SP master tapes have been converted to Motion JPEG 2000 preservation masters. When I visited the Institute in 2009, the tape archive capacity was 8 petabytes. Sam Gustman, the CTO of the Shoah Foundation Institute, told me that his team had encountered 1,500 bit flips in those 8 petabytes. This translates to a bit error rate of 2.3×10⁻¹⁴, also five orders of magnitude higher than the vendor claim! I believe these real-life measurements provide credible guidance for tape archives.