INTELLIGENCE. INNOVATION. INTEGRITY
Database Archiving
Managing Data for Long Term Retention
Craig S. Mullins
craig.mullins@neonesoft.com
Agenda
Emergence of Data Management Functions
The Long Term Data Storage Problem
Long Term Data Storage Solutions
Database Archiving Capabilities
Difference between DBA and DM
 Database Administration (managing the database environment)
— Backup/Recovery
— Disaster Recovery
— Reorganization
— Performance Monitoring
— Application Call Level Tuning
— Data Structure Tuning
— Capacity Planning
 Data Management (managing the content and uses of data)
— Database Security
— Data Privacy Protection
— Data Quality Improvement
— Data Quality Monitoring
— Database Archiving
— Data Extraction
— Metadata Management
Data Management Functions
 Database Security
— Authorization Auditing
— Access Auditing
— Intrusion Detection
— Replication Auditing
 Data Quality
— Data Profiling
— Data Quality Assessment
— Data Cleansing
— Data Quality Filtering
— Data Profile Monitoring
 Data Archiving
— Short term Reference Database
— Long Term Database Archiving
 Data Extraction
— Maintain Privacy
— Maintain Security
 Metadata Management
— Complete Encapsulation
— Change History Auditing
Database Administration Functions
 Very well-defined tasks
 Very well-defined Job Title and Description
 Overwhelming vendor support
 DBMS architectures fully supportive
 Functions fall entirely in IT
 Must be done well to support an efficient operational environment
Data Management Functions
 Task definitions are emerging
 No standard Job Titles or Descriptions
 More aligned with business units than IT
 IT management has not been supportive (NMP)
 Executive management has not been supportive
 DBMS architectures built without consideration of DM
 Little Vendor Support
 Companies have accrued many penalties for not
paying attention to DM requirements
Emerging Data Management Drivers
 Recent Regulations:
— Corporate Governance
— Data Privacy
— Data Retention
— Data Accuracy
 Increasing Data Quality Costs
 Increasing Data Volumes
 Increasing uses and users of data
Together, these drivers produce more emphasis and spending on data management functions, which in turn yield significant tangible benefits.
Drivers' Impacts on Functions
[Matrix: the drivers (Compliance, Quality Costs, Expanded Uses, Increased Volumes) mapped against the functions (Database Security, Data Quality, Data Archiving, Data Extraction, Metadata Management); each driver affects most or all of the functions]
Long Term Database Archiving
Trends Impacting Archive Needs
[Chart: amount of data vs. time required, from 0 to 30+ years, with compliance and protection pressures both growing]
Data Retention Issues:
— Volume of data
— Length of retention requirement
— Varied types of data
— Security issues
Archiving All Types of Data
— Paper: blueprints, forms, claims
— Office documents: Word, Excel, PDF, XML
— Databases: IMS, DB2, ORACLE, SYBASE, SQL Server, IDMS, VSAM
— Files: programs, UNIX files
— E-mail: Outlook, Lotus Notes, attachments
— Rich media: sound, pictures, video
Database Archiving
Database Archiving:
The process of removing selected data records that are not expected to be referenced again from operational databases, and storing them in an archive data store where they can be retrieved if needed. The selected records are then purged from the operational database.
Data Archive and ILM
Data lifecycle: Create → Operational → Reference → Archive → Discard. The mandatory retention period spans the operational, reference, and archive states.
— Operational: needed for completing business transactions
— Reference: needed for reporting or expected queries
— Archive: needed for compliance and business protection
Data Retention: Database Archiving
 We need to keep more data: a lot more data (125% CAGR)
 For more years: a lot more years, with retention periods stretching from the old norm of months or a few years to new requirements measured in decades
 We need to preserve original content and meaning
Data retention requirements refer to the length of time you must keep data:
— Determined by laws: external regulations
— Determined by business needs: internal needs for analytic applications
Some Sample Regulations Impacting Data Retention
E-Discovery
Electronic evidence is the predominant form of discovery today.
(Gartner, Research Note G00136366)
Electronic evidence could encompass anything that is stored anywhere.
(Gartner, Research Note G00133224)
When data is being collected (for e-discovery) it is imperative that it is not
changed in any way. Metadata must be preserved…
(Gartner, Research Note G00133224)
Gartner Strategic Planning Assumption
 Through 2007, more than half of IT organizations and in-house legal
departments will lack the people and the appropriate skills to handle
electronic discovery requirements (0.8 probability).
(Gartner, Research Note G00131014)
E-Discovery
Federal Rules of Civil Procedure, Rule 34(b)
 Took effect December 2006
 "A party who produces documents for inspection shall produce them . . . as they are kept in the usual course of business."
 The amended rules state that requested information must be turned over within 120 days after a complaint has been served.
So data stored in database systems must be producible in electronic form.
What Does It All Mean?
Enterprises must recognize that there is business value in organizing their information and data.
Organizations that fail to respond run the risk
of seeing more of their cases decided on
questions of process rather than merit.
(Gartner, 20-April-2007, Research Note G00148170:
Cost of E-Discovery Threatens to Skew Justice System)
Operational Efficiency
Database Archiving can also be undertaken to
improve operational efficiency
 Large volumes of data can interfere with
production operations
– efficiency of transactions
– efficiency of utilities: COPY, REORG, etc.
– storage
» Gartner: databases copied an average of 6 times!
What Solutions Are Out There?
 Keep Data in Operational Database
— Problems with authenticity of large amounts of data over long
retention times
 Store Data in UNLOAD files (or backups)
— Problems with schema change and reading archived data;
using backups poses even more serious problems
 Move Data to a Parallel Reference Database
— Combines problems of the previous two
 Move Data to a Database Archive
Components of a Database Archiving Solution
— Data extract: move selected data out of the operational databases into the archive
— Data recall: return archived data to a database when requested
— Archive data store and retrieve
— Metadata capture, design, and maintenance
— Archive data query and access
— Archive administration: metadata, policies, history
The archive data store holds both the data and its metadata.
What Do You Need to Support
Database Archiving?
 Policy-based archiving: logical selection
 Keep data for very long periods of time
 Store very large amounts of data in archive
 Maintain archives for ever-changing operational systems
 Become independent from Applications/DBMS/Systems
 Become independent from Operational Metadata
 Protect authenticity of data
 Access data when needed; as needed
 Discard data after retention period
Policy-based archiving
 Why:
— Business objects are archived, not files
— Rules for when something is ready can be complex
— Data ready to be archived is distributed across the database
 Implications:
— User must provide policies for when something is moved
 How:
— Full metadata description of the data
— Flexible specification of policy: a "WHERE clause" (see the sketch below)
— Support accessing data outside the archive set
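
To make the "WHERE clause" idea concrete, here is a minimal sketch of how an archiving policy might be expressed and applied through JDBC. The table and column names come from the parts example on the next slide; the cutoff parameter and the method itself are illustrative assumptions, not features of any particular archiving product.

import java.sql.*;

public class ArchivePolicy {
    // Illustrative policy: an order row is ready to archive once it has
    // been received and its order date is older than a supplied cutoff.
    static final String POLICY_WHERE =
        "DATE_RECEIVED IS NOT NULL AND DATE_ORDERED < ?";

    // Select the keys of the business objects that satisfy the policy.
    public static void selectArchivable(Connection con, Date cutoff)
            throws SQLException {
        String sql = "SELECT PART_NUMBER, PO_NUMBER FROM ORDER_INFO WHERE "
                   + POLICY_WHERE;
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setDate(1, cutoff);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // Each row identifies one business object to be moved,
                    // with its related child rows, into the archive.
                    System.out.println(rs.getString(1) + "/" + rs.getString(2));
                }
            }
        }
    }
}

The point of the design is that the policy is pure data (a predicate), so the business user who knows the data can change it without touching code.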
For Example…
PARTS MASTER (the parent table to all other tables):
Part Number, Type, Description, Unit Type, Cost, Price, Substitute Parts
ORDER INFO:
Part Number, PO Number, Vendor ID, Quantity Ordered, Unit Cost, Date Ordered, Date Received
DISBURSEMENT:
Part Number, Dept. ID, CHIT ID, Qty Disbursed, Date Disbursed
STORAGE INFO:
Part Number, Bin Number, Qty on Hand, Qty on Order, Qty Backorder
SUMMARY BY QUARTER:
Part Number, Year, Q1 Disbursed, Q2 Disbursed, Q3 Disbursed, Q4 Disbursed
A complete business object spans all five tables; the sketch below shows what moving one part entails.
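
A sketch, against the same hypothetical schema, of why the business object rather than the table is the unit of archiving: moving one part means copying and purging its rows from every table that carries its key, inside a single transaction. The ARCHIVE_-prefixed target tables and the exact SQL are assumptions for illustration only.

import java.sql.*;

public class ArchiveMove {
    // Child tables that hang off PARTS_MASTER in the example schema.
    private static final String[] CHILD_TABLES =
        {"ORDER_INFO", "DISBURSEMENT", "STORAGE_INFO", "SUMMARY_BY_QUARTER"};

    // Move one complete business object -- a part plus all of its child
    // rows -- into hypothetical ARCHIVE_* tables, then purge it, in one
    // transaction so the object is never left half-archived.
    public static void archivePart(Connection con, String partNumber)
            throws SQLException {
        con.setAutoCommit(false);
        try {
            for (String table : CHILD_TABLES) {
                copyThenPurge(con, table, partNumber);
            }
            copyThenPurge(con, "PARTS_MASTER", partNumber); // parent last
            con.commit();
        } catch (SQLException e) {
            con.rollback();
            throw e;
        }
    }

    private static void copyThenPurge(Connection con, String table,
                                      String partNumber) throws SQLException {
        String copy = "INSERT INTO ARCHIVE_" + table
                    + " SELECT * FROM " + table + " WHERE PART_NUMBER = ?";
        String purge = "DELETE FROM " + table + " WHERE PART_NUMBER = ?";
        try (PreparedStatement ps = con.prepareStatement(copy)) {
            ps.setString(1, partNumber);
            ps.executeUpdate();
        }
        try (PreparedStatement ps = con.prepareStatement(purge)) {
            ps.setString(1, partNumber);
            ps.executeUpdate();
        }
    }
}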
Keep Data for a Long Time
 Why: retention requirements in decades
 Implications:
— The archive will outlive the applications/DBMS/systems that generated its data
— The archive will outlive the people who designed and managed the operational systems
— The archive will outlive the media we store it on
 How:
— Unique data store
— Application/DBMS/system independence
— Metadata independence
— Continuous management of storage
— Continuous management of archive content
Keep Very Large Amounts of Data
 Why:
— Large volumes of data today
— Increasing rates of data volume growth
— Long retention periods
 Implications:
— Archive won’t fit in DBMS solutions
— Must partition contents
— Cannot read all of archive to satisfy queries
— Must support management functions at
partition level
 How:
— Unique data store
– Supports partitioning of data
– Unlimited number of partitions
– Manages partitions independently
— Indexing and scoping (see the partition-pruning sketch below)
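
One way to picture indexing and scoping: each partition records the minimum and maximum value of every field it holds, so a query can skip partitions whose ranges cannot match. The classes below are a hypothetical sketch of that bookkeeping (numeric fields only, for brevity), not any product's actual structure.

import java.util.*;

// Hypothetical per-partition scope: the min/max of every field, kept so
// a query can rule a partition in or out without reading its data.
class PartitionScope {
    final Map<String, long[]> range = new HashMap<>(); // field -> {min, max}

    boolean mightContain(String field, long value) {
        long[] r = range.get(field);
        return r == null || (value >= r[0] && value <= r[1]);
    }
}

class ArchiveQueryPlanner {
    // Return only the partitions that a point query must actually read.
    static List<PartitionScope> prune(List<PartitionScope> partitions,
                                      String field, long value) {
        List<PartitionScope> hits = new ArrayList<>();
        for (PartitionScope p : partitions) {
            if (p.mightContain(field, value)) hits.add(p);
        }
        return hits;
    }
}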
Maintain Archive for
Changing Operational Systems
 Why:
— Metadata changes frequently
— Applications are re-engineered periodically
– Change DBMS platform
– Change System platform
– Replace with new application
– Consolidate after M&A
 Implications:
— Archive must support multiple variations of an application
— Archive must deal with metadata changes
 How:
— Manage applications as major archive streams, each having multiple minor streams with metadata differences (sketched below)
— Achieve independence from operating environment
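
A sketch of the major-stream/minor-stream idea: one major stream per application, and a new minor stream whenever the metadata changes, so archived units stay tied to the metadata version under which they were written. The class and field names are illustrative assumptions.

import java.util.*;

// Hypothetical structure: archived data is never rewritten when metadata
// changes; instead a new minor stream (metadata version) simply begins.
class ArchiveStream {
    final String application;                                 // major stream
    final List<MetadataVersion> versions = new ArrayList<>(); // minor streams

    ArchiveStream(String application) { this.application = application; }

    // Called when the operational schema changes; archive units written
    // from now on are tagged with this version.
    MetadataVersion openNewVersion(String schemaDescription) {
        MetadataVersion v =
            new MetadataVersion(versions.size() + 1, schemaDescription);
        versions.add(v);
        return v;
    }
}

class MetadataVersion {
    final int number;
    final String schemaDescription; // full metadata, stored with the data
    MetadataVersion(int number, String schemaDescription) {
        this.number = number;
        this.schemaDescription = schemaDescription;
    }
}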
Achieve Application Independence
 Why:
— Operational applications will not be available
— Operational systems will not be available
 Implications:
— Archive must satisfy all query requirements from within
— Archive data must include metadata needed for
interpretation of data
— Archive system will be moved to new systems from time to
time
 How:
— Store metadata and data in archive together
— Implement archive system on multiple systems
— Implement archive system on new systems
Achieve Metadata Independence
 Why:
— Operational metadata is inadequate
— Operational metadata changes
— Operational systems keep only the “current” metadata
— Data in archive often does not mirror data in operational
structures
 Implications:
— Archive must encapsulate metadata
— Metadata must be improved
 How:
— Metadata Capture, Validate, and Enhance capabilities
— A storage structure that encapsulates the metadata with the data
— Keep multiple versions of metadata
Protect Authenticity of Data
 Why:
— Potential use in lawsuits/investigations
— Potential use in business analysis
 Implications:
— Protect from unwanted changes
— Show original input
— Cannot be managed in operational environment
 How:
— SQL Access that does not support I/U/D
— Do not modify archive data on metadata changes
— Encryption as stored
— Checksum for detection of sabotage (see the sketch below)
— Limit access to functions
— Audit use of functions
— Maintain offsite backup copies for restore if sabotaged
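
For the checksum bullet, a minimal sketch using the standard java.security API: hash each archive unit as it is written, store the digest separately, and re-hash on every read to detect tampering. How and where the digests are stored is left as an assumption.

import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ArchiveChecksum {
    // Compute a SHA-256 digest over an archive unit's bytes at write time.
    public static byte[] digest(byte[] archiveUnit)
            throws NoSuchAlgorithmException {
        return MessageDigest.getInstance("SHA-256").digest(archiveUnit);
    }

    // At read time, recompute and compare; a mismatch means the stored
    // bytes no longer match what was originally archived.
    public static boolean isAuthentic(byte[] archiveUnit, byte[] storedDigest)
            throws NoSuchAlgorithmException {
        return MessageDigest.isEqual(digest(archiveUnit), storedDigest);
    }
}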
Access Data Directly From Archive
 Why:
— Cannot depend on the application environment
 Implications:
— Full access capability within the archive system
 How:
— Industry standard interface (e.g., JDBC; see the sketch below)
— LOAD format output
– for loading into a database
– possibly a different database than the data came from
— Recall format output
– shows the original input bit-for-bit
— Requires full and accurate metadata
— Ability to review metadata
— Ability to function across metadata changes
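
Since the slide names JDBC as the industry-standard interface, here is a sketch of what read-only archive access could look like through it. The driver URL is a placeholder for whatever driver an archive system would supply; the point is that retrieval is an ordinary SQL SELECT, while INSERT/UPDATE/DELETE are simply not supported.

import java.sql.*;

public class ArchiveQuery {
    public static void main(String[] args) throws SQLException {
        // Placeholder URL: whatever JDBC driver the archive system supplies.
        String url = "jdbc:archive://archive-host/parts";
        try (Connection con = DriverManager.getConnection(url);
             Statement st = con.createStatement();
             // An ordinary SELECT; the archive rejects I/U/D statements.
             ResultSet rs = st.executeQuery(
                 "SELECT PART_NUMBER, DATE_ORDERED FROM ORDER_INFO" +
                 " WHERE DATE_ORDERED < DATE '2000-01-01'")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "  " + rs.getDate(2));
            }
        }
    }
}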
Discard Function
 Why:
— Legal exposure for data kept too long
 Implications:
— Data cannot be kept in archive beyond retention period
— Must be removed with no exposure to forensic software
 How:
— Policy based discard
— System level function
— Tightly controlled and audited
— True "zero out" capability (see the sketch below)
— Discard from backups as well
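
A sketch of what true zero-out means at the file level: overwrite the bytes in place before deleting, so the content cannot be recovered by forensic tools. A real archive product would do this inside its own storage layer; the RandomAccessFile calls here just illustrate the idea.

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiscardZeroOut {
    // Overwrite a discarded partition file with zeros, forcing each write
    // to disk ("rws" mode), and only then delete the file itself.
    public static void zeroOutAndDelete(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rws")) {
            byte[] zeros = new byte[8192];
            long remaining = raf.length();
            while (remaining > 0) {
                int n = (int) Math.min(zeros.length, remaining);
                raf.write(zeros, 0, n);
                remaining -= n;
            }
        }
        Files.delete(file); // remove the name only after the bytes are gone
    }
}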
So Where Do We Store the Archive?
 NOT a Relational Database!
— Only supports one definition of the data
— Problems with very large amounts of data
— Cannot protect from unwanted changes
— Requires excessive administration
 New Database Archive Structure
— Stores data and metadata
— Partitions data by metadata groupings
— Unlimited number of partitions
— Does not support INSERT/UPDATE/DELETE functions
— Manages by partitions
— Indexed and scoped
Summary Points
 Keeping data in operational systems is a bad idea
 Putting data in UNLOAD files is a bad idea
 Putting data in a parallel reference database is a bad idea
 Using a DBMS to store the archive does not work
 Database archiving requires a great deal of data design
— Establishing and maintaining metadata
— Designing how data looks in the archive
— Achieving application independence
 Database archives must be continuously managed
— Copying data for storage problems (e.g. media rot)
— Copying data for system changes
— Copying data for data encoding standard changes
— Logging, auditing, and monitoring
– Archive events
– Partition management
– Accesses
 Must staff up for database archiving
Craig S. Mullins
NEON Enterprise Software
craig.mullins@neonesoft.com
www.neonesoft.com
www.craigsmullins.com
www.DB2portal.com
My Books: http://www.craigsmullins.com/booklnks.htm
My Blogs: http://www.craigsmullins.com/blogs.htm
Intelligent Solutions for Enterprise Data. Depend On It.
http://www.neonesoft.com


Editor's Notes

  • #4 The agenda today covers four areas. First we’ll talk about data management functions and outline, at a high level, the different data management job roles and duties. Then we’ll describe the problems of long-term data storage: what is driving the need and the difficulties that will be encountered. Next we’ll describe several “solutions” that have been attempted. And finally we’ll outline the components and capabilities of a database archiving solution.
  • #6 Database security is probably the fastest growing of these categories. (As many of you probably know, analysts say that as much as 70% of data security violations are internal to a company, rather than from external threats.) The data breaches in the news, along with government regulations, drive the need for companies to better secure their data. Authorization auditing answers “who can do what?” (BIG PROBLEM WITH IMS because of granularity issues) whereas access auditing answers “who is doing what, when?” Replication auditing produces a record of every data extraction (no product yet). Data archiving has two high-level forms: short-term and long-term. And we’ll cover more on archiving throughout the course of this pitch. I’m not going to cover everything on this slide in great detail; suffice it to say that all of the topics on this slide, in the end, go back to metadata management. It is not really possible to perform any of these duties accurately without first tackling metadata management.
  • #8 Now contrast the DBA side of things with the DM side of things. There are new and emerging tasks within this realm and there are no standard job titles or descriptions. This job is more business-related (rather than technology-focused, though there is a tech component). But these jobs can fall within IT, the business unit, or in some middle ground. Data Management capabilities are not well-supported by management. Even though the cost of poor data quality has been estimated at 15% of operating profits, execs are willing to trade that off for other initiatives. DM professionals are always complaining that even though they build a case for data quality, implementation never gets approved. But the world seems to be changing here (driven by government regulations). Though some penalties might seem steep, in reality, they are not when compared to the cost of fixing the problems. Of course, that equation never factors in the benefits that will accrue from doing things (like data quality) right.
  • #9 Obviously, the number one driver of data management initiatives is likely to be government regulations. But once again, the fines levied are likely to be far less than the cost of a solution. Perhaps an even greater driver than a financial one is “embarrassment.” What company wants to be covered in the press as the subject of a data breach or as being prosecuted under Sarbanes-Oxley? This, in turn, can evolve into financial loss as customers stop doing business with such a company. Regulations, IMHO, are a good thing. Sometimes they can cause people and companies to do things that they never would have done otherwise. SEAT BELTS example (they didn’t even exist, then no one used them, then a law was passed; now people are afraid of a $50 citation so they use them – wouldn’t you think fear of death would be worse than fear of losing $50? Evidently not.) At any rate, there are other drivers that can be used to encourage more spending on DM functions: the cost of bad quality data, the cost of gathering and storing more data, and the increasing uses for and users of that data. All of these factors require more attention to be paid to data management and quality.
  • #10 The point of this slide is that the data management drivers do not just impact a single function. For example, “Compliance” has an impact on all of the functions. “Quality Costs” do not really impact Database Security, but there is an impact on all of the other functions. The bottom line is that business needs will continue to drive data management functionality across the spectrum.
  • #11 OK, at this point let’s focus in on only one of the Data Management functions we’ve discussed – Long-Term Database Archiving. This is the main topic of this presentation and everything to this point has been set up to get us here.
  • #12 Organizations are processing and storing more and more data every year; the average yearly rate of growth is 125%, and large volumes of data interfere with operations (the more data in the operational database, the slower all processes may run). Regulations (as well as business practices) dictate that data, once stored, be retained for longer periods of time – no longer months or years, but in some cases multiple decades. More varied types of data are being stored in databases: not just (structured) characters, numbers, dates, and times, but also (unstructured) large text, images, video, and more; unstructured data greatly expands storage needs. Finally, the retained data must be protected from modification – it must represent the authentic business transactions at the time the business was conducted. That means better protection from modification and isolation of content from changes.
  • #13 Database Archiving is part of a larger topic, namely Data Archiving. This particular graph is a representation of the data archiving classification put together by one of the analyst groups. Each of these “things” needs to be archived to fulfill regulatory, legal, and business requirements. Each of these areas, though, requires different archival processing requirements due to its form and nature. What works to archive e-mail is not sufficient for archiving database data, and so on. In other words, each box commands its own technology. Now there may be a need for integration across these areas. On one flight I was sitting next to a guy who was an energy company executive. He was responsible for capturing data after explosions. Any time there was an explosion he and his team immediately flew in and gathered as much data as possible before it got contaminated or destroyed. Such information is imperative in determining cause, as well as for possible future legal claims. At any rate, he was interested in a single product or solution that could store and archive all of these items. Such a beast does not yet exist.
  • #14 At any rate, we are focusing today on database archiving.
  • #15 This slide delineates the various states of data over its useful life. This could be three separate databases or a single database or any combination thereof. Don’t think about data warehousing in this context – here we are talking about the single, official store of data. We start out, of course, with creation of data. After creating the data it moves into its operational state. This is where it serves its primary business purpose. Transactions are enacted upon data in this state. Then we move to the reference state. Data in this form is still needed for reporting and querying purposes. After this state we move into an area where the data is no longer needed for completing business transactions and the chance of it being needed for querying and reporting is small to none. However, the data still needs to be saved for regulatory compliance and other legal purposes, particularly if it pertains to a financial transaction. Finally, after a period of time, the data is no longer needed at all and it can be discarded. This actually should have a much stronger emphasis: the data must be discarded. In many cases the only reason old data is being kept is to enable lawsuits. When there is no legal requirement to maintain such data, companies demand that such data be destroyed – why enable anyone to sue you if it is not a legal requirement?
  • #16 OK, now let’s take a closer look at the requirements for data retention. First thing to keep in mind is that laws NEVER say you have to archive data, just that you must retain data. So if you can keep it in the operational database, OK. But if not, you have to archive it. Data may need to be retained for both internal and external reasons. Internal because the company needs it for some business purpose – or external because the law demands it. We need to keep a lot more data than ever before. First off, this is so because the amount of data in general is growing. According to industry analysts, enterprise databases are growing at the rate of 125% annually. Even more interesting is that as much as 80% of the information in those databases is not actively used (in other words, it is ready for archiving). As we have mentioned, regulatory compliance is driving data retention. In the USA we have Sarbanes-Oxley and HIPAA; in Europe there are laws governing how long telecommunications data must be retained (http://news.bbc.co.uk/2/hi/europe/4527840.stm); and it will keep getting worse. According to Gartner there are more than 150 laws that determine how long data must be retained. Consultants are making entire practices out of reviewing and interpreting these laws for businesses. (The type of data, along with where it was collected, determines the mandatory retention period.) And data that is retained must be authentic. What good does retaining the data do if it can be modified? Let’s consider an example, in this case an airplane manufacturer: Each plane that rolls off the assembly line is unique so the specs for each plane need to be retained for the life of that plane. And many planes are flying these days for more than 50 years. The bottom line is this: We are moving from retention periods of months to maybe 2 or 3 years to decades, perhaps 50 to 70 years or more.
  • #17 Here we have some examples of regulations that impact data retention. This is just a sampling of the more than 150 different regulations (at the local, state, national, and international levels) that impact data retention. Regulations Drive Retention and Discovery Source: www.domains.cio.com/symantec/wp/ebs_ediscovery_ev_wp.pdf
  • #18 According to Gartner: As the custodians of corporate information systems, IT departments must understand the legal ramifications surrounding the information stored in those systems. Understanding the legal process of discovery as it pertains to electronically stored information is essential.
  • #19 The regulations dictate the need to retain the data; legal discovery rules dictate how that data must be produced in a court of law. The two work hand-in-hand. Today, most discovery requests are for electronic data, not paper documents. The FRCP changes which took effect in December impact how retained data must be produced for legal purposes. Most organizations are not prepared to handle an e-discovery request, which requires a marriage of IT, business, and legal.
  • #21 Archiving is a process used to move data from the operational database to another data store to be kept for the duration of the retention period when it is unacceptable to keep the data in the operational database for that long. So, why might it be unacceptable? - large volumes of data interfering with operations (the more data in the operational database, the slower all processes may run) - need for better protection from modification - need for isolation of content from changes
  • #22 OK, so if we have to archive data, how can we go about doing it? Well, there is always bullet number 1. If you can store it in the operational database then you save a lot of work. But it does not work well for large amounts of data or very long retention periods. And the more data in the database the less efficient transaction processing and reporting will be. So we could take a simple approach of using UNLOAD files. But this is not really useful in today’s environment. It assumes the same data schema will be used if you want to ever RELOAD the data (and that is not likely over long periods of time). When they run into audit situations, where they're asked to produce transaction records spanning multiple years, they have to slog through the work of restoring the data. This is a very labor-intensive, manual process. And it is costly: some sites have spent hundreds of thousands of dollars to respond to audit requests when they have to resort to such manual data restoration. It might be feasible for 2- to 3-year timespans, but not 20 or more years. In situation number 3, we consider cloning and moving data. But this has a lot of the same problems as UNLOADing the data. Think about what happens as the database schema changes – and how you would go about accessing the data. Finally, we come to the proper approach for retaining large amounts of data for long periods of time – database archiving.
  • #23 This diagram depicts the necessary components of a database archiving solution. Starting with the databases, down the left side is the Extract portion, and up the right side is the data recall portion. Several sites have been very vocal about needing this capability, but I think it is a dubious requirement. If you can query from the archive, recalling data into the operational database is not really a necessity, is it? And what if the operational database changes (e.g. Oracle to SAP)? The whole process requires metadata to operate. You must capture, validate and enhance the metadata to drive the archive process. You need to know the structure of the operational database and the structure of the archive. There is also the case of the metadata about data retention for the archive. This policy-based metadata must be maintained and monitored against the archive, for example, to determine if data needs to be discarded. And, as you can see off to the bottom left, we also need to be able to query the archived data. This will not necessarily be the most efficient access because of differences in the metadata over time, but for queries against archived data performance is not a paramount concern. Finally, we will also have on-going maintenance of the archive: security, access audit, administration of the structures (REORG), backup/recovery, etc.
  • #24 OK, so let’s take a look at the actual functionality needed to support database archiving. This slide outlines the various features and functions needed and we will discuss each of these bullets in separate ensuing slides.
  • #25 Think about this: we are not just archiving pieces of data, but logical “things” that we may need to access later. For example, you will archive a purchase order cherry-picked out of the database – perhaps across multiple tables or structures. This implies that there are policies, perhaps complex ones, that will drive what data is archived and when it is archived. This needs to be driven by the business user knowledgeable about the data’s characteristics.
  • #26 This slide shows the data stored in each table. The archive design process is going to reflect a lot of decision making: which data items need to be archived; in some cases, which table actually contains the archive record; how the database relationships should be reflected in the archive; and whether there are other relationships (perhaps coded in the application) that need to be addressed in the archive.
  • #27 As we discussed already, data retention requirements are stated in decades. This means that our Archive will outlive the applications/DBMS/systems that generated the data – there may be a DB2 35 years from now but it will look a LOT different than it does today. It also means that our Archive will outlive the people who designed and managed the operational systems. And the Archive will outlive the media we store it on, so we will need to manage the data over time and re-platform (no media is guaranteed to exist over long periods of time). Media rot destroys tape, and most tapes are not guaranteed for more than 7 years. For example, only 6 minutes of tape remain of the first Super Bowl. How do we resolve these issues? Using a unique data store, because a standard DBMS is not the right answer. The archive must be immune from DBMS changes over time. Application/DBMS/system independence, because you don’t want your archive to prevent you from improving your operational environment. Metadata independence, because we don’t want to rely on reading the DB2 Catalog or some copybook for metadata. Continuous management of storage, because of media rot. Continuous management of archive content – administration of the archive to ensure its viability.
  • #28 Current levels of data storage are larger than ever before, and when combined with increasing rates of data growth and long retention period requirements we have an explosive combination. [Visa, for example, has had billion-plus transaction weeks over the holidays for the past 2 years.] The implication is that the Archive won’t fit in today’s existing DBMS solutions. We over-reach the limits when we talk about retaining data for such long periods of time. And though the limits may change with time we don’t want to rely on those limits changing before we require them to. It also implies that we’ll need to partition the archive and have an intelligent query capability to understand what data is in what archive partition. Once again, this means a unique data store is the best method. It can be designed to support an unlimited number of partitions – each managed independently – with indexes scoped by partition, where each partition knows the range of values for every field (or column) in the partition.
  • #29 Because the operational system WILL change over such long retention periods, the archive needs to be independent from the operational system. We constantly change our databases, applications, computing platforms, and so on, to support new technology, M&A, or even management whim. The Archive needs to be able to support multiple variations of the application as it changes. This includes keeping up to date with the metadata changes within each variation of the system for which we are archiving data. The HOW here is interesting. The Archive will never change metadata in any partition. The Archive data is static; unchanging. This means we flip the normal operating mode, which is to modify the existing data to the new format when metadata changes. Instead, when we query against the Archive, data can come back in multiple formats – basically, whatever the format of the data was when it was archived (because it can NOT change once archived).
  • #30 The Archive needs to function independently from the operational application(s). We may need to retain data past the life of the application itself. Or perhaps the operational application has changed significantly (SAP to Oracle). The Archive must allow querying of its data. It needs to have the structural metadata to allow the Archive to make sense of and process queries submitted against its data. And think about our long retention period again. Over 50 years do you think your computing platform will change? Fifty years ago was 1956 – the age of batch, tape, and punched cards. What will 2056 look like? We need to be able to re-platform the archive to support this. Basically, the Archive needs the metadata about its archived data in order to operate effectively.
  • #31 The Archive needs its own metadata – it cannot depend on metadata in application programs. For the Archive to be effective we must be able to interpret the archived data from nothing but what is in the archive. And remember, some of the metadata can change over time so it must be attached to the data in the partitions of the archive as of that state in time.
  • #32 The data must be authentic. If it can change after it is archived then it is not of any value or use. This is a requirement for lawsuits and regulatory compliance. Furthermore, if the data is available to the business it can be used for business analysis, too. Do you really think it will NOT be used once it is there? There are many HOWs that are important here: the query language must not support modification, the data can be encrypted, checksum can be used to verify data accuracy, and backup copies can be used for verification, too. Remember, the archive data NEVER changes, so one backup (maybe two to be safe) is all that is needed.
  • #33 We’ve hinted at this in previous slides, but we make the requirement explicit here: we must be able to query data directly from the Archive itself. This means a query language with an SQL-like interface (for compatibility purposes). The query is the portion of the Archive that is actually most important. If you cannot access the data that is retained, then it is as if it is not retained at all. The answers to the queries can be fuzzy though. A query may span multiple partitions representing many years and states of the data. So the query processing must keep this in mind as it delivers an answer. The answer may fall into several categories: guaranteed accurate, fuzzy because of changes over time, or perhaps even no answer at all that is useful because of metadata changes. So we’ll need experts able to compose and re-compose queries against an Archive taking such constraints into account.
  • #34 The ability to discard the data exactly when it must be gotten rid of is important. If the legal requirement is to retain the data for 35 years, then 35 years + 1 day is not acceptable! The data is there, in many cases, to support lawsuits against your company. If there is no legal mandate to keep it then why go to the expense of keeping it to help others sue you? Of course, you must be able to deliver on discovery in a lawsuit. This means querying the Archive to deliver the requested data. This is a potentially HUGE issue. If you lose once in court because you cannot produce data for discovery, then lawyers will descend upon your company, filing suits that hinge on the same data they know you don’t have. And you will lose. Now keep in mind that Discard does not just mean delete. It means REALLY and TRULY delete forever. Zero out the data by writing over top of it. It must not be able to be recovered by forensic experts or software.
  • #35 (Note: XML satisfies all of these requirements except for volume.)