Data Migration White Paper

Technical Data Migration in Six Steps
by Wael O. Elrifai, July 2011
London - New York - Dubai - Hong Kong - Mumbai
Abstract

Data migration is a field that is often overlooked as a simple step between retiring an existing platform and bringing a new one to life. The reality is that we are constantly migrating data between systems, implementing new applications, setting up co-existence scenarios and sharing information in ways that could be thought of as data migration.

The actual task of data migration is a complex one, requiring experience and a detailed understanding of the steps to be taken, both in terms of business process and technical implementation. This includes a clear delineation of responsibility between IT and the business, since far too often the job of manipulating data is relegated to IT, well outside of their core competency. Enabling the migration requires a team experienced in abstracting the technical details so that the business team can implement their rules for migration without ever writing a single line of code, a task for IT.

This document focuses on technical architecture rather than process. We describe an architecture that we've found to be ideal in dozens of large data migration scenarios at organizations ranging from global Fortune 500 companies to local small businesses. This architecture is portable between different software toolsets; we've implemented it in hand-coded systems as well as open source and best-of-breed ETL/DQ solutions.

This document is aimed at anyone who is about to embark on a data migration task, or who is trying to rescue an existing project that is "off-plan". To gain the most from it, you need only a rudimentary understanding of the terminology of data handling, such as ETL (Extract, Transform, and Load) and reference data (e.g. ISO currency codes, lists of possible postcodes).
Introduction

The Peak solution for data migration involves a six-stage process designed to isolate and validate elemental data changes at the lowest level, hierarchy changes, referential integrity checks, duplicate checks and metadata capture as distinct steps, such that each can be easily tracked and reversed and a complete lineage exists for audit purposes. The diagram at the end of this document describes the process at a high level; a more detailed description follows. Please bear in mind that a "stage" refers to the staging area plus the technical ETL/DQ process that leads from the previous stage to it. This document is by no means a complete description of the systems and processes required, but it is a good indicator of the technical elements and requirements of a good data migration system.

This architecture is not tied to a particular ETL tool, but Peak does recommend the use of tools to aid re-usability and visibility of process. Some tools are better suited to the "pivot" tasks required at certain points, though all major commercial and open source ETL tools can provide the necessary functionality. No data is deleted at any point within the stages; any data that does not pass a particular stage is excluded via the error logging mechanism, with a "warning" or "error" flag attached to the record by stage and Surrogate ID (SID).

Stage 1: Data is read directly from source systems in the native code page and loaded into database staging areas in a format and ER model that exactly match the source systems, whether they are ERPs, bespoke systems, flat files, or semi-structured data. Metadata is added to aid tracking of records, in the form of a surrogate ID (SID) as well as a timestamp, source identifier, and other metadata as appropriate for the system.

Stage 2: Data is moved from the Stage 1 area to the Stage 2 area. Only the subset of columns considered necessary in the data map to the target systems is moved, including any fields needed to perform calculations but not necessarily destined to be loaded to the target, as well as all metadata. SIDs are maintained to provide easy "back-tracking". Any filtering rules that remove data known to be irrelevant to the migration are applied. Duplicates are removed as deemed appropriate (e.g. duplicate customer master data might be removed while transactions are persisted). Data is still maintained in the entity-relationship model of the source systems, less the columns that have been deemed extraneous. All actions are logged in the audit and error data model, with error data being excluded from the target.

Stage 3: Data is mapped from the old entity-relationship model to the new one, metadata is transferred, SIDs are maintained, and all actions are logged, while error records are excluded from the target. It is important to note that no data translations or calculations have taken place yet. The data is in the target ER model but not necessarily in the target data formats; all data types are defined as VARCHAR (MAX LENGTH) to avoid rejecting any source data.
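To make the first two stages concrete, here is a minimal, hypothetical sketch in Python, assuming a CSV extract landed into a SQLite staging database. It is not the Peak implementation; the staging table naming convention, the `keep_cols` parameter and the filter expression are illustrative assumptions.

```python
import csv
import sqlite3
from datetime import datetime, timezone

def stage1_land(conn, source_name, extract_path):
    """Stage 1: copy a source extract 1:1 into a staging table whose columns
    mirror the source, adding a surrogate ID (SID), source name and load timestamp."""
    loaded_at = datetime.now(timezone.utc).isoformat()
    with open(extract_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        cols = reader.fieldnames
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS stg1_{source_name} ("
            "sid INTEGER PRIMARY KEY AUTOINCREMENT, source TEXT, loaded_at TEXT, "
            + ", ".join(f"{c} TEXT" for c in cols)
            + ")"
        )
        for row in reader:
            conn.execute(
                f"INSERT INTO stg1_{source_name} (source, loaded_at, "
                + ", ".join(cols)
                + ") VALUES ("
                + ", ".join(["?"] * (len(cols) + 2))
                + ")",
                [source_name, loaded_at] + [row[c] for c in cols],
            )
    conn.commit()

def stage2_subset(conn, source_name, keep_cols, row_filter_sql="1=1"):
    """Stage 2: carry only the mapped columns (plus SID and metadata) forward.
    Rows failing the filter are simply not copied; nothing is deleted from Stage 1."""
    col_list = ", ".join(["sid", "source", "loaded_at"] + keep_cols)
    conn.execute(f"DROP TABLE IF EXISTS stg2_{source_name}")
    conn.execute(
        f"CREATE TABLE stg2_{source_name} AS "
        f"SELECT {col_list} FROM stg1_{source_name} WHERE {row_filter_sql}"
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
# stage1_land(conn, "crm", "crm_customers.csv")   # hypothetical extract file
# stage2_subset(conn, "crm", ["cust_id", "name", "country"], "country <> ''")
```

Keeping every column as plain text at this point mirrors the "accept everything, reject nothing" intent of the early stages; datatype enforcement is deferred until Stage 4.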
Stage 4: Data is translated to meet target system rules and implement functional requirements, and is checked against target system reference data. All metadata and SIDs are maintained, and all actions are logged as both audit and error information. A second duplicate check is performed here to account for the potential for "new" duplicates created by data translation; of course, the concept of a duplicate is one requiring both a business and a technical definition. The target data model, including data types, has now been put in place, and a column-by-column check is enforced to ensure that no data violates the datatype constraints in the target system. Error records are excluded from the target and logged.

"Dictionary" translations, those where a given input from a given source system translates to a particular output (e.g. "Source System A" uses a 3-character country code while "Target System X" uses a 2-character one), have the code abstracted such that the translation is stored in a value mapping table. These tables can be easily maintained by the business users via a spreadsheet that is loaded into the value mapping table. This ensures that the business retains responsibility for the data translation while the development team is responsible for correctly coding the abstract rule.

Stage 5: Referential integrity checks are performed to ensure that the data structures are completely ready to be written to the target system. Audit and error data is logged and metadata is maintained. At this point the data should be error-free: any "child" records that did not pass error checks have been excluded, and any parent records that did not pass error checks have also been excluded along with their related child records, such that orphan records cannot be written to the target system. Alteration of the parent-child exclusion rules is possible but heavily dependent on the nature of the target systems (e.g. SAP will not allow creation of a parent-child hierarchy if the parent and child legal entities do not already exist in the system).

Stage 6: This stage involves invoking any APIs necessary and passing data in the structure required by those methods. It also includes writing data directly in cases requiring a direct write rather than API, IDoc, or similar calls.

Reconciliation: The Stage 6 reconciliation is performed by reading the data structures back from the target system and comparing them against the Stage 5 data, performing logical DELETEs against exact matches.

Fault, fix, and follow-through: Using our internally developed audit and error data model along with the technical steps detailed here provides a single view of all excluded data, why and where it was excluded, and also exposes a simple interface in which a business user can change the data (with further audit information logged, of course). This system can be built to log exclusions based on minor differences (e.g. variations of '-' vs. '/' in dates) or to indicate only major failures (e.g. variations in transaction values). Ultimately the system allows data to be "fixed" in situ and "re-migrated" to the new system.

This system is meant to support test data migrations such that all rules can be verified and enforced, with the goal of zero errors in the stages by the time of the final cutover.
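As an illustration of the Stage 4 "dictionary" translations and the exclude-rather-than-delete error handling described above, here is a small Python sketch. The mapping contents, field names and error-log structure are hypothetical; in practice the value mapping table would be loaded from the business-maintained spreadsheet.

```python
# Hypothetical value mapping table, keyed by (source system, field):
# it would normally be loaded from a spreadsheet maintained by the business.
VALUE_MAP = {
    ("SourceSystemA", "country_code"): {"GBR": "GB", "USA": "US", "ARE": "AE"},
}

def translate(record, source, field, errors):
    """Apply a dictionary translation to one field. Unmapped values are logged
    by SID and stage so the record can be excluded (never deleted) and fixed later."""
    mapping = VALUE_MAP.get((source, field), {})
    raw = record[field]
    if raw in mapping:
        record[field] = mapping[raw]
        return True
    errors.append({"sid": record["sid"], "stage": 4, "field": field,
                   "value": raw, "reason": "no dictionary mapping"})
    return False

errors = []
records = [{"sid": 101, "country_code": "GBR"},
           {"sid": 102, "country_code": "U.K."}]
stage4 = [r for r in records if translate(r, "SourceSystemA", "country_code", errors)]
# stage4 now holds the translated record for SID 101; SID 102 sits in the error log.
```

The point of the abstraction is that changing a mapping is a data change made by the business, not a code change made by the development team.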
Added value

The occasion of a data migration often involves work that can be used to support other initiatives, bringing new value to the enterprise.

Master Data Management

As part of the data and application migration, it is often necessary to capture and validate various "master data", including:

- Customer data
- Product data
- Reference data
- Value mappings for translation: the "dictionary" translations described in Stage 4

Having developed the data model and captured the requisite information, the possibility of implementing a full-featured Master Data Hub can be considered. Master data could be created, maintained, or simply exposed via this central repository, with all systems synchronized to this Single Version of Truth in real-time or batch modes.

Co-existence

As the translations and transformations of data and hierarchies are written in the migration work stream, "reverse functions" are written for those transformations that are easily reversed (e.g. a VAT-added calculation can be easily reversed). Transformations that are not easily reversible (e.g. concatenations) are "reversed" by writing the rule as though it were to support a migration from the target system(s) back to the source system(s), with the added benefit of being able to examine the audit trails programmatically. Because all functions are written in an object-oriented fashion, they are easily re-used to support a co-existence scenario in which data is replicated between the legacy and new systems on a real-time or batch basis, as appropriate. These transformations can be configured to be called via an Enterprise Service Bus (ESB) or other methods, as required.
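The following is a minimal sketch of such a forward/reverse pair, assuming a 20% VAT rate and a simple name concatenation purely as examples; the function names are illustrative and not part of any particular toolset.

```python
VAT_RATE = 0.20   # assumed rate, for illustration only

def add_vat(net: float) -> float:
    """Forward rule: the legacy system stores net amounts, the target stores gross."""
    return round(net * (1 + VAT_RATE), 2)

def remove_vat(gross: float) -> float:
    """True inverse, reused when replicating target data back to the legacy system."""
    return round(gross / (1 + VAT_RATE), 2)

def full_name(first: str, last: str) -> str:
    """Forward rule with no strict inverse (the original split is lost)."""
    return f"{first} {last}"

def split_name(full: str) -> tuple[str, str]:
    """'Reverse' written as its own target-to-source rule: a best-effort split."""
    first, _, last = full.partition(" ")
    return first, last

assert remove_vat(add_vat(100.00)) == 100.00
assert split_name(full_name("Ada", "Lovelace")) == ("Ada", "Lovelace")
```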
Archiving

During any data migration, business requirements should be captured around the availability of data: what data should be available under which SLAs in order to ensure that the operational system continues to operate at maximum efficiency. A typical plan might envision segregating former customers to a secondary system outside of the main physical database (for performance reasons) 12 months after they terminate the client relationship. The same could be true of older transactions and other data. The ideal technical system would have the ability to treat data in an "object-oriented" form, archiving off such "objects" as a customer, automatically capturing all information related to that customer and moving it to the secondary system. The data can still be interrogated but is no longer a cause of performance degradation as data volumes expand. Use of this tactic ensures that the sizing of databases, and therefore software and hardware requirements, can be easily predicted and is variable only insofar as operational behaviors vary or the client base expands, at which time additional hardware or software can become necessary.

This function can also support a full Information Lifecycle Management system that includes creation of data for system testing, with features such as object movement (with referential integrity maintained) and data masking for security. Additional work required to support this would include creation of a security model that defines how datasets can be created, who can view unmasked data (at the branch, column, customer, etc. level), and how the datasets can be extracted or moved.

There are existing software products that can best leverage the work that will already have been performed to support the Information Lifecycle Management system.
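A small, self-contained sketch of this object-level archiving idea is shown below. The in-memory "databases" and table names are assumptions for illustration; a real implementation would move rows between the operational database and the secondary archive store.

```python
from datetime import date, timedelta

# Toy "live" and "archive" stores; in practice these would be two databases.
live = {
    "customers": [{"id": 1, "name": "Acme Ltd", "terminated_on": date(2010, 3, 1)}],
    "orders":    [{"id": 10, "customer_id": 1, "amount": 250.0}],
    "invoices":  [{"id": 77, "customer_id": 1, "amount": 250.0}],
}
archive = {table: [] for table in live}

def archive_customer(customer_id):
    """Move the customer row and every child row that references it, as one unit."""
    for table, rows in live.items():
        key = "id" if table == "customers" else "customer_id"
        moving = [r for r in rows if r[key] == customer_id]
        archive[table].extend(moving)
        live[table] = [r for r in rows if r[key] != customer_id]

def archive_due(today):
    """Archive customers whose client relationship ended more than 12 months ago."""
    cutoff = today - timedelta(days=365)
    due = [c["id"] for c in live["customers"] if c["terminated_on"] < cutoff]
    for customer_id in due:
        archive_customer(customer_id)

archive_due(date(2011, 7, 1))
# Acme Ltd plus its orders and invoices now sit in the archive store;
# the live store no longer carries them, and no orphan rows remain behind.
```

Because the parent and all of its child rows move together, referential integrity holds on both sides of the archive boundary.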
Getting Started

Our experience shows that if a well-defined business process is implemented along with the technical steps described above, you can significantly reduce the cost of your data migration and eliminate many sources of risk, especially when that migration is repetitive in nature (for example, multiple countries migrated independently to a new system over a number of months or years). A typical cost curve levels out at around 30% of the initial cost per migration after the second migration. Furthermore, you can leverage the artifacts created to gain new value, as described in "Added value" above.

You can contact me directly via email at WAEL@peakconsulting.eu or via our website at www.peakconsulting.eu to learn more about how Peak Consulting can help you successfully complete your data migration or other "big data" initiative.
Diagram 1: Data Migration Architecture
