This document summarizes a presentation on ensuring data integrity in the LDS Church's Digital Records Preservation System (DRPS). It describes the DRPS architecture, which maintains multiple copies across geographic locations with automatic replication, and applies integrity checks (SHA-1 fixity values, cyclic redundancy checks, and error-correcting codes) from ingest through permanent tape storage. It explains why tape offers better long-term preservation economics than disk despite access challenges, and covers ongoing verification of archive integrity through periodic reading and in-drive error checking of tapes.
Gabe Nault: Data Integrity

1. Ensuring Data Integrity in a Digital Preservation Archive
Gabe Nault, LDS Church
naultga@ldschurch.org
Future Perfect Conference, 2012
(image courtesy of IBM)
2. Introducing the LDS Church
• The Church of Jesus Christ of Latter-day Saints
• Global Christian church with 14 million members
• 3 universities, 1 college
• State-of-the-art audiovisual capabilities
• Scriptural mandate to keep and preserve records since 1830
(photo by Henok Montoya)
3. The Church History Department
• Preserves records of enduring value from Church leaders, departments, universities, and affiliations (more than 35 organizations)
• Typically, less than 10% of records are candidates for preservation
(Church History Library on Temple Square)
4. Granite Mountain Records Vault
• Bored into a solid granite mountain
• Stores large microfilm collections and valuable Church artifacts
• Plans recently developed to renovate the facility for digital preservation
5. The Media Services Department
• Audiovisual records will consume the majority of our archive capacity
• 100+ PB in a decade for a single copy!
(Photos: Mormon Tabernacle Choir and Orchestra; free Bible videos from biblevideos.lds.org; Conference Center on Temple Square)
6. DRPS System Architecture
[Architecture diagram: Digital Records → DRPS Ingest Tools (fixity creation) → Preservation Information System (preservation functions; Ex Libris Rosetta) → Storage Extensions (fixity bridge) → StorageGRID (lifecycle management) → IBM Tivoli Storage Manager (tape interface)]
8. DRPS Highlights
• Multiple copies in multiple geographic locations (eventually)
• Approximately 1 PB of spinning media
• Automatic replication to remote site(s)
• End-to-end data integrity
• Tape-based permanent storage
9. Why Tape for Preservation?
Total cost of storage ownership study
• TCO: over ten years, ownership and operating costs of tape are three to fifteen times less than the associated costs for disk arrays
• Cost advantages of tape are expected to increase over time
• Conclusion—for now, tape is required to sustain a multi-PB digital archive
• But . . . tape presents some challenges
(IBM TS3500 Tape Libraries; image courtesy of IBM)
10. Why Tape for Preservation?
Limitations
• Latency
• Limited to sequential access
• Limited number of read/writes
• Leads to greater system and operational complexity
(IBM TS3500 Tape Libraries; image courtesy of IBM)
11. Data Integrity
• Data integrity validation is provided by fixity checks when data is written, transferred, moved, or copied
• Fixity checking should be performed from file creation to permanent storage to delivery
• Periodic validation of the entire archive should also be performed to detect data corruption (bit rot, drive errors, tape degradation, etc.)
• DRPS uses a variety of integrity values for fixity
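To make the fixity concept concrete, here is a minimal sketch (not part of the original deck) of how a preservation workflow might generate a SHA-1 fixity value at file creation and re-verify it after a transfer; the file names are illustrative only.

```python
import hashlib

def sha1_fixity(path, chunk_size=1 << 20):
    """Compute a SHA-1 fixity value for a file, reading in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Record fixity at creation, then re-check after any write/transfer/move/copy.
original = sha1_fixity("producer_file.mxf")   # illustrative file name
# ... transfer the file to the archive ...
if sha1_fixity("archived_copy.mxf") != original:
    raise RuntimeError("fixity mismatch: data corrupted in transit")
```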
13. DRPS Data Integrity Validation
[Chain-of-control diagram:
• DRPS Ingest Tools: SHA-1 created for producer files (SHA-1 control)
• Rosetta: SHA-1 checked upon ingest and on write to permanent storage
• Storage Extensions: web service retrieves the StorageGRID SHA-1, then a Rosetta plug-in compares it with the Rosetta SHA-1
• StorageGRID: SHA-1 created for ingested files]
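As a rough illustration of the hand-off step, the plug-in's comparison logic might look like the sketch below. The web-service URL, object identifier, and hash value are hypothetical stand-ins, not the actual DRPS interfaces.

```python
import urllib.request

# Known-good SHA-1 values from the Rosetta database (hypothetical IDs/values).
ROSETTA_SHA1 = {"IE1234": "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"}

def storagegrid_sha1(object_id):
    """Retrieve the SHA-1 that StorageGRID computed independently on write.
    The URL is a hypothetical stand-in for the custom fixity web service."""
    with urllib.request.urlopen(f"https://fixity.example/sha1/{object_id}") as r:
        return r.read().decode().strip()

def verify_handoff(object_id):
    # Compare StorageGRID's independently computed hash with Rosetta's record.
    return storagegrid_sha1(object_id) == ROSETTA_SHA1[object_id]
```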
14. StorageGRID Fixity Checking
• StorageGRID is constructed around the concept of object storage
• Provides a layered/overlapping set of protection domains to guard against object data corruption:
1. SHA-1 object hash—checked on store and access
2. Content hash—checked on access
3. CRC checksum—checked with every operation
4. Key-based hash value—checked on access
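A toy sketch (my own simplification, not StorageGRID code) of what layered, overlapping checks buy you: a cheap CRC runs on every operation, while the stronger SHA-1 is re-verified on store and access.

```python
import hashlib
import zlib

class StoredObject:
    def __init__(self, data: bytes):
        self.data = data
        self.crc = zlib.crc32(data)              # cheap check, every operation
        self.sha1 = hashlib.sha1(data).digest()  # strong check, store/access

    def check_crc(self):
        # Fast per-operation protection domain.
        if zlib.crc32(self.data) != self.crc:
            raise IOError("CRC mismatch")

    def access(self) -> bytes:
        self.check_crc()
        # Stronger on-access protection domain.
        if hashlib.sha1(self.data).digest() != self.sha1:
            raise IOError("SHA-1 mismatch")
        return self.data
```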
15. DRPS Data Integrity Validation
[Chain-of-control diagram, extended to tape:
• DRPS Ingest Tools: SHA-1 created for producer files
• Rosetta: SHA-1 checked upon ingest and on write to permanent storage
• Storage Extensions: web service retrieves the StorageGRID SHA-1, then a Rosetta plug-in compares it with the Rosetta SHA-1
• StorageGRID: SHA-1 created for ingested files; SHA-1 and other fixity checked during write to storage nodes
• IBM Tivoli Storage Manager: CRCs and ECCs provide end-to-end logical block protection]
16. TSM End-to-End Logical Block Protection
• Supersedes SHA-1 fixity information with cyclic redundancy check values (CRCs) and error-correcting codes (ECCs)
• Enabled with new, state-of-the-art functionality of IBM LTO-5 and TS1140 tape drives
• Seamlessly extends validation of data integrity as data is written to tape
17. TSM End-to-End Logical Block Protection
1. TSM server calculates and appends an "original data CRC" to each logical data block
2. Tape drive computes its own CRC and compares it to the original data CRC
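The following toy sketch (my own illustration in Python, not TSM or drive firmware) mirrors steps 1 and 2: the sender appends a CRC to each logical block, and the receiver recomputes it to confirm the block survived the transfer.

```python
import zlib

def append_crc(block: bytes) -> bytes:
    """Step 1: append the 'original data CRC' to a logical block."""
    return block + zlib.crc32(block).to_bytes(4, "big")

def receive_block(framed: bytes) -> bytes:
    """Step 2: the receiver recomputes the CRC and compares."""
    block, original_crc = framed[:-4], int.from_bytes(framed[-4:], "big")
    if zlib.crc32(block) != original_crc:
        raise IOError("CRC mismatch: force a re-drive or permanent error")
    return block

framed = append_crc(b"logical block payload")
assert receive_block(framed) == b"logical block payload"
```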
18. TSM End-to-End Logical Block Protection
3. As the logical block is loaded into the drive data buffer, an on-the-fly verifier checks the original data CRC
4. In parallel, a "C1 code" (ECC) is computed and appended
19. TSM End-to-End Logical Block Protection
5. An additional ECC, referred to as the "C2 code," is added to the logical block
6. More powerful than the original data CRC, the C1 code is checked every time data is read from the buffer
20. TSM End-to-End Logical Block Protection
7. Data is written to tape at full line speed with a read-while-write process
8. Just-written data is loaded to the buffer and the C1 code is checked
A successful read-while-write operation assures no data corruption from the TSM server to tape
21. TSM End-to-End Logical Block Protection
9. When the tape is read, all codes (C1, C2, original data CRC) are checked by the drive
10. The original data CRC is appended to the logical block
11. TSM server verifies the original data CRC, completing the TSM end-to-end logical block protection cycle
22. Ongoing Archive Data Integrity
• We must assume that data may become corrupted after being written correctly to tape
• Therefore, tapes must be read periodically to identify and correct data errors
(image courtesy of IBM)
23. Ongoing Archive Data Integrity
• Staging IEs to disk to verify integrity is resource intensive!
• IBM LTO-5 and TS1140 tape drives provide a more efficient solution
• During a "SCSI Verify" operation, a tape is mounted and the drive checks all codes (C1, C2, original data CRC) as data is being read (at full line speed)
• Only status is reported as these internal checks are completed
(image courtesy of IBM)
24. Summary
• Fixity information is the key to data integrity
• SHA-1 values ensure data integrity to StorageGRID
• TSM end-to-end logical block protection ensures data integrity to tape
• In-drive validation enables ongoing integrity checks for the entire archive
[Diagram recap: DRPS Ingest Tools → Storage Extensions → StorageGRID → IBM Tivoli Storage Manager, with SHA-1 control at each hand-off and CRCs/ECCs at the tape layer]
26. Trademarks
The Ex Libris logo and Rosetta are trademarks of Ex Libris Group.
The IBM logo and Tivoli Storage Manager are trademarks of International Business Machines Corporation.
The NetApp logo and StorageGRID are trademarks of NetApp, Inc.
27. Rate of Bit Errors
• Preliminary validation of the DRPS archive resulted in a 3.3×10⁻¹⁴ bit error rate
• USC Shoah Foundation Institute visit:
  – 8 PB tape archive of videotaped interviews of Holocaust survivors and other witnesses
  – Experienced 1,500 bit flips in 8 PB (a 2.3×10⁻¹⁴ bit error rate)
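A quick back-of-the-envelope check (my own arithmetic, not from the deck) confirms the Shoah Institute figure: 8 PB is 8×10¹⁵ bytes, or 6.4×10¹⁶ bits, so 1,500 flipped bits gives roughly 2.3×10⁻¹⁴.

```python
bits = 8e15 * 8           # 8 PB expressed in bits (6.4e16)
bit_flips = 1500
print(f"{bit_flips / bits:.1e}")  # -> 2.3e-14, matching the reported rate
```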
Editor's Notes
Good morning! My presentation will cover the challenges of, and some working solutions to, a key requirement of digital preservation—ongoing data integrity of the archive. The solutions I will discuss were developed cooperatively by three vendors in conjunction with the Church of Jesus Christ of Latter-day Saints. By the way, you will be able to download a white paper that covers my presentation along with the presentation itself when the conference is over.
First let me introduce the Church. Its full name is the Church of Jesus Christ of Latter-day Saints. Headquarters are in Salt Lake City, Utah – a western state in the United States of America. The building shown here is the Salt Lake Temple, which has come to be a symbol for the Church. For your information, the Church operates 134 temples around the world. Temples are not weekly meeting places; rather, they are sacred places where families are sealed together forever – beyond life here on earth. We are a global Christian Church with more than 14 million members. The Church has more than 700,000 students enrolled in religious training around the world. It also operates three universities and a business college. Education is very important to members of the Church of Jesus Christ of Latter-day Saints. Over the last two decades, the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. I will talk more about this later. The Church has a scriptural mandate to keep records of its proceedings and preserve them for future generations. Accordingly, the Church has been creating and keeping records since 1830, when it was organized. A Church Historian’s Office was formed in the 1840s, and in 1972 it was renamed the Church History Department.
Today, the Church History Department has ultimate responsibility for preserving records of enduring value that originate from its ecclesiastical leaders and within the various Church departments, the Church’s educational institutions, and its affiliations. In order to carry out this responsibility, the Church History Department’s Records Management team helps each Church organization develop a records management plan. The plan identifies all records used by the organization and establishes a record retention and disposition schedule for each collection. Usually, less than 10% of the records have a final disposition of “archive.” Only these records are preserved for future generations.
I mentioned earlier that the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. The Media Services Department uses these capabilities to support the rest of the Church organizations in their audiovisual needs. Because of the average size of MSD audiovisual files, which is several hundred gigabytes so far, MSD audiovisual files will consume the vast majority of archive capacity in the Church History Department’s Digital Records Preservation System. One example of audiovisual records we preserve is weekly broadcasts of Music and the Spoken Word—the world’s longest continuous network broadcast (now in its 83rd year). Each broadcast features an inspirational message and music performed by the Mormon Tabernacle Choir, also known as “America’s Choir,” and the Orchestra at Temple Square. These National Radio Hall of Fame broadcasts are clearly a priceless treasure for the world that are being preserved for future generations. Another example of audiovisual records we preserve is semiannual broadcasts of General Conference, which is held in the remarkable Conference Center, shown here, that seats 21,000. The meetings are broadcast in high definition video via satellite to more than 7,400 Church buildings in 102 countries. The broadcasts are simultaneously translated into 76 languages. Ultimately, digital audio tracks for 96 languages are created and preserved to augment the digital videotaping of each meeting. Not surprisingly, the Church is the world’s largest language broadcaster. As a gift to the world, the Church launched a new website last Christmas that provides free Bible videos of the birth, life, death, and resurrection of the Lord Jesus Christ. Viewable with a free mobile app, these videos are faithful to the biblical account, and of course will be preserved for future generations. I encourage you to visit the website at biblevideos.lds.org. With audiovisual files such as these, we expect that our archive capacity within a decade will exceed 100 petabytes for a single copy!
The Church History Department’s Digital Records Preservation System, or DRPS, is based on Ex Libris Rosetta. Rosetta provides configurable preservation workflows and advanced preservation planning functions, but only writes a single copy of an AIP to a storage device for permanent storage. Therefore, an appropriate storage layer must be integrated with Rosetta in order to provide the full capabilities of a digital preservation archive, including AIP replication. After investigating a host of potential storage layer solutions, the Church History Department chose NetApp StorageGRID to provide the ILM capabilities that were desired. In particular, StorageGRID’s data integrity, data resilience, and data replication capabilities were attractive. In order to support ILM migration of AIPs from disk to tape, StorageGRID utilizes IBM Tivoli Storage Manager, or TSM, as an interface to tape libraries. DRPS also employs software extensions developed by my team, which is part of Church Information and Communications Services. The first is a set of ingest tools that help with fixity information creation, which I will discuss later. The second involves a fixity information bridge that will also be described later.
You may wonder why we chose to use tape libraries for the DRPS archive. In 2008, an internal study was performed to compare the costs of acquisition, maintenance, administration, data center floor space, and power to archive hundreds of petabytes of digital records using disk arrays, optical disks, virtual tape libraries, and automated tape cartridges. The model also incorporated assumptions about increasing storage densities of these different storage technologies over time. Calculating all costs over a ten year period, the study concluded that the total cost of ownership of automated tape cartridges would be 33.7% of the next closest storage technology, which was disk arrays. Based on discussions with major storage providers, we believe that the cost of power and the cost per terabyte advantages of tape will only increase over time. Therefore, we concluded that, at least for now, we should use tape libraries to sustain our digital archive that is expected to skyrocket to a multiple petabyte capacity in just a few years. When we made this decision, we were NOT naive to the challenges of tape we would be facing.
One of those challenges has to do with ensuring data integrity of the tape archive. This is a critical requirement for any digital preservation archive, and it differentiates a tape archive from other types of tape farms. Modern IT equipment, including servers, storage, and network switches and routers, incorporates advanced features to minimize data corruption. Nevertheless, undetected errors still occur for a variety of reasons. Whenever data files are written, read, stored, transmitted over a network, or processed, there is a small but real possibility that corruption will occur. Causes range from hardware and software failures to network transmission failures and interruptions. Bit flips within files stored on tape can also cause data corruption. Fixity information enables data integrity validation. Fixity information is a checksum, or integrity value, that is calculated by a secure hash algorithm to ensure data integrity of an AIP file throughout preservation workflows and after the file has been written to the archive. By comparing fixity hash values before and after records are written, transferred over a network, moved, or copied, a digital preservation system can determine if data corruption has taken place during its workflows or while the AIP is stored in the archive. To do data integrity validation correctly, end-to-end fixity checking should be performed from file ingest to storing the file on permanent storage to eventual access and delivery. Furthermore, data integrity validation of the entire archive should be performed periodically to detect and correct bit flips (also known as bit rot). DRPS uses a variety of hash values, cyclic redundancy check values, and error-correcting codes for such fixity information.
As mentioned earlier, DRPS employs a variety of hash values, cyclic redundancy check values, and error-correcting codes in order to ensure data integrity of its tape archive. The chain of control of this fixity information is illustrated here. In order to implement fixity information as early as possible in the preservation process, and thus minimize data errors, DRPS provides ingest tools developed by my team that create SHA-1 fixity information for producer files before they are transferred to DRPS for ingest. Control of this SHA-1 fixity information is transferred when a file is ingested into Rosetta. Within Rosetta, SHA-1 fixity checks are performed three times—(1) when the deposit server receives a Submission Information Package (SIP), (2) during the SIP validation process, and (3) when an AIP is moved to permanent storage. Rosetta also provides the capability to perform fixity checks on AIP files written to permanent storage, but the ILM features of StorageGRID do not utilize this capability. Therefore, StorageGRID must take over control of the SHA-1 fixity information once files have been written to it. By collaborating with Ex Libris on this process, ICS and Ex Libris have been successful in making the fixity information hand-off from Rosetta to StorageGRID. This is accomplished with a web service we developed that retrieves SHA-1 hash values generated independently by StorageGRID when the files are written to the StorageGRID gateway node. Ex Libris developed a Rosetta plug-in that calls this web service and compares the StorageGRID SHA-1 hash values with those in the Rosetta database, which are known to be correct.
Before I go any further with the DRPS data integrity validation chain, I’d like to discuss in some detail how StorageGRID handles fixity checking. First, StorageGRID is constructed around the concept of object storage, which enables it to provide advanced Information Lifecycle Management capabilities. To ensure object data integrity, StorageGRID provides a layered and overlapping set of protection domains that guard against object data corruption and alteration of files that are written to the grid. The first domain is called the SHA-1 object hash—this is the same SHA-1 hash value we just discussed with the previous slide. It is generated when the object (or AIP) is created (i.e., when the gateway node writes it to the first storage node), and it is verified every time the object is stored and accessed. The second domain is called the content hash. Because this hash is not self-contained, it requires external information for verification, and therefore is checked only when the object is accessed. The third domain is a cyclic redundancy check, or CRC, checksum. It is verified during every StorageGRID object operation—store, retrieve, transmit, receive, access, and background verification. And finally, the fourth domain is a key-based hash value. Using the hash key, this domain secures against all forms of tampering. As you see, StorageGRID provides very sophisticated and advanced fixity checking, which is a major reason we selected it for our DRPS storage layer.
Continuing with the DRPS data integrity validation chain . . . StorageGRID uses the four levels of fixity checking we just discussed to ensure integrity of AIPs that are written to the grid—from the gateway node to the storage nodes. Once a file has been correctly written to a storage node, StorageGRID invokes the TSM Client running on the archive node server in order to write the file to tape. As this happens, the SHA-1 fixity information is not handed off to TSM. Rather, TSM end-to-end logical block protection takes over.
TSM end-to-end logical block protection utilizes CRCs and ECCs that supersede the use of SHA-1 fixity information while TSM is in control of the file. This advanced protection is enabled with brand new, state-of-the-art functionality provided by IBM LTO-5 and TS1140 tape drives, which I will soon illustrate. While the DRPS fixity information chain of control is altered when StorageGRID invokes TSM, validation of the file’s data integrity continues seamlessly until it is correctly written to tape using TSM end-to-end logical block protection.
The TSM end-to-end logical block protection process begins when the TSM server calculates and appends a cyclic redundancy check value, or CRC, to each AIP logical block before transferring it to a tape drive for writing. Each appended CRC is called the “original data CRC” for that logical block. When the tape drive receives a logical block, it computes its own CRC for the data and compares it to the original data CRC. If an error is detected, a check condition is generated, forcing a re-drive or a permanent error. This step effectively guarantees protection of the logical block during transfer.
As the logical block is loaded into the drive’s main data buffer, two parallel processes occur. In one process, data is cycled back through an on-the-fly verifier that once again validates the original data CRC. Any introduced error will force a re-drive or a permanent error. In parallel, an error-correcting code, or ECC, is computed and appended to the data. Referred to as the “C1 code,” this ECC protects data integrity of the logical block as it goes through additional formatting steps . . .
. . . including the addition of an additional ECC, referred to as the “C2 code.” As part of these formatting steps, the C1 code is checked every time data is read from the data buffer. Thus, protection of the original data CRC is essentially transformed to protection from the more powerful C1 code.
Finally, the data is read from the main data buffer and is written to tape using a read-while-write process. During this process, the just-written data is read back from tape and loaded into the main data buffer so the C1 code can be checked once again to verify the written data. A successful read-while-write operation assures that no data corruption has occurred from the time the AIP logical block was transferred from the TSM server until it is written to tape. And using these ECCs and CRCs, the tape drive can validate AIP logical blocks at full line speed as they are being written!
During a read operation, data is read from the tape and all three codes (C1, C2, and the original data CRC) are decoded and checked, and a read error is generated if any process indicates an error. The original data CRC is then appended to the logical block. When the logical block is transferred to the TSM server, the original data CRC is independently verified by that server, thus completing the TSM end-to-end logical block protection cycle. I didn’t mention this previously, but TSM also performs data integrity validation during client sessions when data is sent between a client and the server, or vice versa.
Unfortunately, the task of ensuring data integrity of a DRPS AIP does not end once the AIP has been written correctly to tape. We must assume that bits will flip after being written correctly to tape. As we discussed earlier, the USC Shoah Foundation Institute has seen a 10⁻¹⁴ bit error rate, and we saw the same when we recently validated our entire tape archive. Therefore, we believe that all written tapes in the archive must be read periodically to find and correct bit flips that have occurred since the tapes were written correctly.
Unfortunately, reading all the tapes in the archive in order to stage AIPs to disk so servers can check the fixity information is clearly a resource intensive task—especially for an archive with a capacity measured in hundreds of petabytes! Fortunately, IBM LTO-5 and TS1140 tape drives provide a much more efficient solution. During a “Verify” operation, IBM LTO-5 and TS1140 drives perform data integrity validation in-drive, which means a drive reads a tape and concurrently checks the three logical block CRCs and ECCs discussed previously at full line speed. Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources! Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips after AIPs are written correctly to tape.
To summarize my presentation, fixity information is the key to archive data integrity. For the Church History Department’s Digital Records Preservation System, SHA-1 fixity values ensure data integrity all the way from the producer to StorageGRID archive nodes. From there, TSM end-to-end logical block protection takes over to ensure data integrity until the data is correctly written to tape. And finally, the new in-drive data integrity validation capability of IBM tape drives enables DRPS to perform periodic data integrity checks of the entire archive to provide continuous data integrity.
Sizing bit errors in your digital archive may be somewhat difficult. I heard from a tape vendor at a recent preservation conference that tape exhibits a 10⁻¹⁹ bit error rate. This rate is optimistic, however, compared to the results we recently encountered when we performed a data integrity validation of our entire DRPS archive. We realized a 3.3×10⁻¹⁴ bit error rate, which is five orders of magnitude higher than the vendor claim! The 10⁻¹⁹ figure is also optimistic compared to what I encountered when I visited the University of Southern California’s Shoah Foundation Institute in 2009. But first, some background on this Institute. It was established by Steve Spielberg, the great film producer, after he finished filming Schindler’s List. Shoah is the Hebrew term for Holocaust. More than 51,000 interviews of Holocaust survivors and other witnesses have been videotaped by the Shoah Institute. Currently, 87% of the 204,000+ Betacam SP master tapes have been converted to Motion JPEG 2000 preservation masters. When I visited the Institute in 2009, the tape archive capacity was 8 petabytes. Sam Gustman, the CTO of the Shoah Foundation Institute, told me that his team had encountered 1,500 bit flips in those 8 petabytes. This translates to a bit error rate of 2.3×10⁻¹⁴, also five orders of magnitude higher than the vendor claim! I believe these real-life measurements provide credible guidance for tape archives.