PPDM data loading best practice

A review of data loading best practice, based on PPDM data from the Oil & Gas industry.

  • To keep things simple, we'll discuss loading data into PPDM, but much of this applies to generic data loading: moving data out of PPDM, or not involving PPDM at all. Data transformation is mundane from a business perspective, but very important to get right. The less time and trouble it causes, the more time you can spend on more interesting things that directly benefit your business. Badly loaded data, by definition, reduces the quality of the data in your MDM store.
  • How I came to be here giving this talk: I'm a data migration expert who has worked with ETL for more than ten years. My first PPDM conference was last October in Houston. We come from a data loading and integration background, and it was a bit of an eye opener to listen to the data manager's view of things and of PPDM; it seemed to me that companies such as ours are part of the problem. One aspect of the conference that caught my attention was the education, certification and, for want of a better word, professionalisation of petroleum data management – moving away from compartmentalised expertise within companies. I had the impression that one of the industry's concerns is that a lot of this expertise is held in people's heads, and that the experienced people are looking forward more to retirement than to their careers. At PNEC it was clear that some companies are very well advanced in this respect, but for standards bodies like PPDM there is a lot of work to be done. There are similar rumblings within the data migration and integration industry, so today's talk is an introduction to data migration best practice, followed by a quick look at how it links up with data management best practice. The focus at those events on understanding the data – lineage, quality – also led me to think about how we as data loaders don't really consider ongoing data management, and whether we are a cause of, or a solution to, that problem. Hence a talk on data migration good practices and process.
  • We’ll first look at 3 scenarios for moving data about
  • And yet, because data migration happens at the end of the programme, it is often started late and treated as a purely technical issue – moving the filing cabinet. It runs just before the old system is turned off, so when it starts late it runs late. It may be an exaggeration, but the business view is of a forklift truck picking up the filing cabinet and dropping it into the new system – perhaps a new filing cabinet, with a check that all the papers are in there before the move. In reality, you need to read, understand and translate the documents into the new system, and discard obsolete papers. Only the business knows what's important.
  • Data migration methodologies, certifications and training are sprouting up in the same way as in data management. What's available? PDM (Practical Data Migration) – v2 is coming, though it's a bit abstract – and "Project Management for Data Conversions and Data Management" by Charles Scott.
  • And workflow parts. A key to success is the old system being turned off – this gets people interested. Concentrate on legacy decommissioning and see what light it sheds on the process; it turns out to be a lot. Sometimes a process can seem like a sea of required steps, and it's hard to get people to buy into it. Focus on something everyone can agree with: the legacy system will be turned off. Surprisingly, this is often not considered in much detail, but it gives you leverage and grabs people's attention if you can get them to believe you.
  • WHAT: you need to know what you're turning off. The programme aims for a single version of truth and for data quality. Model the systems and relationships you have; as you go out and start talking to the people who use these systems, and find out how, you'll discover more connections and systems – including little empires and satellite systems used for operations – that you want brought into the new system. You'll also discover parts of the legacy system that nobody uses and that don't need to be migrated. This is landscape analysis; typically data discovery is done here, not data profiling.
  • Data migration is unusual in many ways. There are several teams of people involved: users, new system providers, old system experts, migration experts, project management and "the business". You write code that can correctly interpret the relevant information in the old system, then moves and checks it into the new system – it's complicated. At the end of the programme you run it once; it's tested to ensure the new system will allow business to continue confidently, because you don't want them or yourself lying awake worried. Then you throw it away. There's a lot of nostalgia around at the moment – you can buy your childhood memories on eBay – but if you're looking for a 70s-style waterfall methodology, you'd be hard pushed to find a services or product company that doesn't have the word "agile" in its process. Don't despair: with data migration you can create your own waterfall method despite each stage being demonstrably agile. Describing the diagram: requirements are light – just move the data. Then some truly agile development, perhaps asking some requirements questions, unit testing and so on – let's pretend that all the different parts are developed together and there are no nasty surprises when you try to join them up in your week's worth of integration testing. Testers know how to write tests, right? So alongside the agile development they're writing tests, starting with the big pebbles and filling up the jar; let's assume they test incrementally. Then sign-off – a big step: security, data stakeholders, huge documents. Then migrate – users have hardly noticed you up to now; they're told to test and to think of lots of fiendish little corner cases (e.g. postcodes). Sign-off might fail here – a big jump back at the end of the project. Users might reject the migration – an even bigger jump.
  • In the three scenarios we looked at, one area of commonality is the innocuous arrows linking the source and target. A great deal of work is involved in realising them: design, requirements, testing, implementation and documentation. What we really want is the implementation – the automated loading of external data into PPDM. Let's look at best practice for implementing these.
  • So for each of these data loaders, to relate back to a mantra we've heard at previous data management conferences, we ideally want a single version of truth. Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work. We also want to remove boilerplate components that are only indirectly related to the business rules by which data is loaded. Let's look at what goes into a data loader and where the duplication and unnecessary work come from.
  • PPDM comes to us as a physical projection rather than a logical model; it maps directly to a relational database. Access is therefore via SQL and PL/SQL, and the low-level detail is important, i.e. how relationships are implemented (e.g. well header to borehole). Considerations for access: primary keys, foreign keys, data types (conversions, maximum lengths), and the load order required by foreign keys – the PPDM "Load of the Rings" – plus relationship cardinality and so on. With SQL you only find problems at runtime, so turnaround can be slow. All of this metadata is available in machine-readable format, so we should use it – see the sketch below.
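One way to act on that last point (a minimal sketch, not anything PPDM publishes): read the foreign-key relationships from Oracle's standard data dictionary and let a topological sort give you the load order, rather than maintaining it by hand. The driver, schema owner and the sample table names are assumptions for illustration.

```python
# A minimal sketch of using machine-readable metadata instead of hand-tracking
# load order. It reads foreign-key relationships from Oracle's standard data
# dictionary (ALL_CONSTRAINTS) and topologically sorts the tables so parents
# load before children. Connection details are assumptions; the same idea
# works with any catalog, or with PPDM's own metadata tables.
from collections import defaultdict

FK_QUERY = """
SELECT c.table_name AS child_table,
       p.table_name AS parent_table
  FROM all_constraints c
  JOIN all_constraints p
    ON p.owner = c.r_owner
   AND p.constraint_name = c.r_constraint_name
 WHERE c.constraint_type = 'R'
   AND c.owner = :schema_owner
"""

def load_order(fk_edges):
    """Return table names ordered so referenced (parent) tables come first."""
    children = defaultdict(set)   # parent -> set of child tables
    pending = defaultdict(int)    # child  -> number of unsatisfied parents
    tables = set()
    for child, parent in fk_edges:
        tables.update((child, parent))
        if parent != child and child not in children[parent]:
            children[parent].add(child)
            pending[child] += 1
    ready = sorted(t for t in tables if pending[t] == 0)
    order = []
    while ready:
        table = ready.pop()
        order.append(table)
        for child in sorted(children[table]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    return order

if __name__ == "__main__":
    # Illustrative edges only -- in practice fetch them with FK_QUERY.
    sample_edges = [("WELL_LOG", "WELL"), ("WELL_LOG_CURVE", "WELL_LOG")]
    print(load_order(sample_edges))  # ['WELL', 'WELL_LOG', 'WELL_LOG_CURVE']
```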
  • Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, databases, etc. The data model is unlikely to be as rich as PPDM, but there is some definition of the content: the LAS 2.0 specification, Excel workbooks with a layout (e.g. tabular with column titles and named worksheets), and so on. It can be hard to find people with the relevant skills, and you can end up with ad hoc, non-standardised implementations because the developer used whatever skills he or she had: Perl, Python, XSLT, SQL. So the next clue is that we should use the model information – what elements, attributes and relationships are defined – rather than the details of how we access it. Abstract out the data access layer; don't mix data access with the business rules required to move data into PPDM. A small sketch of this separation follows below.
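As a rough illustration of that separation (all function and field names here are my own, not from the talk), the reader below knows only how to get records out of one kind of tabular source, while the business rule sees plain dictionaries and never touches files, worksheets or SQL.

```python
# A minimal sketch of abstracting the data access layer: the reader turns a
# tabular source (a CSV export here, standing in for Excel or LAS) into
# generic records, and the business rule only ever sees those records.
# All names are illustrative.
import csv
from typing import Dict, Iterator

def read_records(path: str) -> Iterator[Dict[str, str]]:
    """Data access only: turn one tabular source into generic records."""
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            yield {key.strip(): (value or "").strip() for key, value in row.items()}

def to_well_record(record: Dict[str, str]) -> Dict[str, str]:
    """Business rule only: rename and normalise fields for the target model."""
    return {
        "well_name": record["Well Name"].upper(),
        "spud_date": record["Spud Date"],          # format conversion would go here
        "operator": record.get("Operator", "UNKNOWN"),
    }

# Usage: the pipeline composes the two layers without mixing their concerns.
# for record in read_records("wells.csv"):
#     target_row = to_well_record(record)
#     ...hand target_row to whatever writes into PPDM...
```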
  • A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel. This isn't concerned with data access, but some details will creep in, e.g. specifying an XPath. When lookups, merging or splitting values, string manipulation and conditional logic come in, the description can become ambiguous. Also note the duplication: the model metadata is being written into the spreadsheet, so if the model changes, the spreadsheet needs to be manually updated.
  • A developer then implements those rules in code. The pseudo-code on the slide shows typical things that are undesirable. First, duplication: it reiterates the Excel rules, and the two need to match up – while a domain expert might follow the simple example above, low-level code can be tricky to discuss. Second, the metadata is duplicated again: the names of the tables and columns appear in the SQL statements, the maximum length of the name column is checked, and there is an explicit looping construct. Third, boilerplate code: select/update/insert conditional logic. Fourth, data access code appears in the rules. I've made it explicit here, and the code probably wouldn't pass a code inspection, but it does illustrate the type of duplication that can arise. In particular, the developer reads the specifications, the knowledge is stored in the developer's head and regurgitated as code; the developer becomes valuable, and the code becomes hard to maintain as the talented developer moves on. A hedged sketch of this anti-pattern follows below.
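The slide's pseudo-code isn't reproduced in this transcript, so the following is only an illustrative reconstruction of the style being criticised; the table name, column names and length limit are stand-ins. It shows the duplication the note describes: metadata repeated in SQL strings, a hand-rolled length check, select/update/insert boilerplate, and data access mixed into the business rule.

```python
# Not the slide's code (that isn't reproduced in this transcript) -- just an
# illustrative reconstruction of the hand-coded style the talk warns about.
# Table names, column names and the 30-character limit are hard-coded, the
# select-then-insert-or-update boilerplate is spelled out by hand, and data
# access is tangled up with the business rule, so every model change means
# editing this code and the Excel spec in step.
def load_wells(cursor, source_rows):
    for row in source_rows:                          # explicit looping construct
        well_name = row["Well Name"].upper()         # business rule buried in plumbing
        if len(well_name) > 30:                      # duplicates the column's max length
            well_name = well_name[:30]

        cursor.execute(                              # metadata duplicated in SQL text
            "SELECT uwi FROM well WHERE well_name = :1", [well_name]
        )
        existing = cursor.fetchone()

        if existing:                                 # hand-rolled upsert boilerplate
            cursor.execute(
                "UPDATE well SET operator = :1 WHERE uwi = :2",
                [row.get("Operator"), existing[0]],
            )
        else:
            cursor.execute(
                "INSERT INTO well (uwi, well_name, operator) VALUES (:1, :2, :3)",
                [row["UWI"], well_name, row.get("Operator")],
            )
```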
  • Tools are a recognised best practice; they're better than trying to do things by hand, e.g. hand-coding migration scripts, workflow or profiling. But you do need skills in the toolsets.
  • Graphical tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach. You can see the metadata loaded on the left and right (source and target), with lines connecting them. In addition to logic gates for more complex processing, you can add code snippets in the background to implement most business logic. Issues: is it really easy to read? (The example above is a simple mapping; imagine PPDM well log curves, reference data tables, etc.) It isn't easy to see what really happens: a+b versus an "adder" – e.g. follow the equal() to Customers: what does that actually do? But you can generate documentation and an executable from that single definitive mapping definition, and typing errors are mostly eliminated.
  • An alternative is to use a textual DSL. Again, you can see the metadata has been loaded [switch to Transformation Manager live to enable interaction]. There is no data access code. The metadata is used extensively – for example for warnings, primary keys for identification, relationships and cardinality – and there is no explicit iteration. Typing errors are checked at design time, and model or element changes that affect the code are quickly detected (imagine moving from PPDM 3.8 to 3.9). Relationships are used to link transforms, giving a more logical view with no need to understand the underlying constraints; the complexity of the model doesn't matter, as the project becomes structured naturally. Foreign key constraints are used to determine load order. Metadata is pulled in directly from the source metadata, e.g. PPDM, along with comments and customisations – making use of all the hard work put in by PPDM. A tool-agnostic sketch of the idea appears below.
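Transformation Manager's actual DSL isn't shown in this transcript, so the sketch below is deliberately tool-agnostic: it only illustrates the general idea of keeping the mapping as one declarative definition that a generic engine executes and that documentation can be generated from. The structure and all names are assumptions, not TM syntax.

```python
# A tool-agnostic sketch of holding mapping rules as a single declarative
# definition. A generic engine applies the rules, and the same structure can
# be rendered as documentation, so the spreadsheet and the code no longer
# drift apart. All names are illustrative.
WELL_MAPPING = {
    "target_table": "WELL",
    "rules": [
        {"target": "UWI",       "source": "UWI"},
        {"target": "WELL_NAME", "source": "Well Name", "transform": str.upper},
        {"target": "OPERATOR",  "source": "Operator",  "default": "UNKNOWN"},
    ],
}

def apply_mapping(mapping, source_record):
    """Generic engine: turn one source record into a target row per the rules."""
    row = {}
    for rule in mapping["rules"]:
        value = source_record.get(rule["source"], rule.get("default"))
        if value is not None and "transform" in rule:
            value = rule["transform"](value)
        row[rule["target"]] = value
    return row

def document_mapping(mapping):
    """The same definition doubles as human-readable documentation."""
    lines = [f"Mapping into {mapping['target_table']}:"]
    for rule in mapping["rules"]:
        lines.append(f"  {rule['source']!r} -> {rule['target']}")
    return "\n".join(lines)

print(document_mapping(WELL_MAPPING))
print(apply_mapping(WELL_MAPPING, {"UWI": "100/01-01-001-01W5/0", "Well Name": "Demo 1"}))
```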
  • From the same source you can generate code to execute the project
  • As a data manager looking at your single version of truth (logical – it may physically be many databases), you want to be able to ask questions about it. How much confidence do we have in the data? Legal questions: should we be looking at archived data, where is it, and how did that data map to our current data view? Is there a link between instances of common errors, e.g. a problem with data loaded from a particular source? Looking back at data loading, migration and integration, they have something in common: the arrows, or the rules by which data is loaded. On the diagram they look like nothing, but a lot of work goes into them. Extending the single-version-of-truth analogy, you want a single version of truth for each of these arrows – yet look into them and they are generally very ad hoc and poorly documented. Business rules live in Excel, which is not well version-controlled and uses woolly, vague language; they are then implemented in code or with a graphical tool. Duplication. Difficult to communicate between developer and domain expert. From a data management perspective, where are these rules? Well, PPDM has done a great job in this respect by providing tables specifically for that.
  • PPDM provides tables to let us record this: PPDM_SCHEMA, PPDM_TABLE and PPDM_COLUMN to describe the schemas, and PPDM_MAP_RULE and PPDM_MAP_RULE_DETAIL to record the mappings. PPDM_SCHEMA doesn't just let you store details about your PPDM schema – it's pretty much the same as the Oracle catalog tables, with a few additions, for example for recording units of measure. So you can record schema information there about the legacy system, or about XML schemas such as WITSML or PRODML. The PPDM_MAP tables let you record the mapping rules: how a particular element or attribute goes from the source schema to the target schema and is stored in PPDM. This makes you, the data manager, happy, because you can query the database using your finely honed SQL skills to create reports for the business users who need this information – hopefully not the legal department. PPDM_MAP_RULE is used to contain lower-level code, e.g. PL/SQL or Python rules. It's hard to populate these tables, though, and their use is not standardised. A hedged query sketch follows below.
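As a hedged illustration of what that buys you: once the rules are in the database, lineage becomes an ordinary query. The table names below come from the talk, but the column names are hypothetical placeholders; check the PPDM 3.8 data dictionary before running anything like this against a live schema.

```python
# The table names (PPDM_MAP_RULE, PPDM_MAP_RULE_DETAIL) come from the talk;
# the column names below are hypothetical placeholders, NOT the real PPDM 3.8
# definitions -- check the PPDM data dictionary and adjust before use. The
# point is only that lineage becomes a plain SQL question once the mapping
# rules live in the database.
LINEAGE_QUERY = """
SELECT r.map_id,                      -- hypothetical column names
       d.source_schema,
       d.source_attribute,
       d.target_attribute,
       d.rule_description
  FROM ppdm_map_rule r
  JOIN ppdm_map_rule_detail d
    ON d.map_id = r.map_id
 WHERE d.target_attribute = :target   -- e.g. 'WELL.WELL_NAME'
"""

def where_did_this_come_from(cursor, target_attribute):
    """Report every mapping rule that writes into the given target attribute."""
    cursor.execute(LINEAGE_QUERY, {"target": target_attribute})
    return cursor.fetchall()
```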
  • So: how about a tool that can already generate documentation generating the same for the PPDM metadata module? The picture above is a bit simplistic – hopefully PPDM_SYSTEM, PPDM_TABLE, etc. are already populated for the actual PPDM instance, and we only want to publish mappings when they are actually used. Switch to a demonstration of the Transformation Manager prototype to show how they can be populated.
  • You can store code, but is it readable? The unit of code is usually a block, perhaps describing how to migrate an entire table.
  • So how about the tools used in data loading populating these tables automatically? Most, though not all, have some metadata representation. With hand coding, the metadata is often encapsulated in the code itself: the developer read the documentation and created the queries and updates that run against the data stores. Beyond that, you would expect a tool to show you what you are moving data between. You'd also hope for a higher-level representation of the mapping rules – lines and boxes, or a domain-specific language – and possibly a reporting capability. So what we did was take our reporting capabilities and look at how we can "report" to PPDM: export the metadata and mapping rules into the relevant module. I want to emphasise that this was an investigation – we don't have production code and we may have got some of the details wrong – but I'm going to fire up our tool, show you some simple mapping rules as developed, then show you what we populated in the PPDM schema and mapping tables at the push of a button.
  • A typical tool – using ours because I get a discount from our sales guy. [Better picture required]
  • At the end of these initial phases, you might realise that the big bang approach is not going to work and that you need to change your approach significantly. It's much better to realise this early on. For example, you might migrate bit by bit, maintaining both systems and using a data highway to move data between the old and new systems. [DVLA]
  • You need people to agree to turning the old system off; identify those people and bring them with you. Decide which documents they will sign off – if you require design documents, these will need to be understood; are user stories better than low-level business rules? And when? Say two weeks before you run the final project you have all the documents ready for sign-off – what do you do, hand them all over then for the stakeholders to sign? These documents will be large and hard to understand, and you…
  • WHEN: Need to ensure business continuity

    1. Data Loading Best Practice – PPDM Association 2012 Data Symposium – www.etlsolutions.com
    2. Agenda: Data loading tools and challenges; best practices in methodology. We'll be taking a look at loading data into PPDM, but much of this applies to generic data loading too
    3. Agenda: Data loading tools and challenges; best practices in methodology
    4. We've been listening to the Data Manager's perspective • PPDM Conference, Houston • PNEC, Houston • Data Managers' challenges: • Education • Certification • Preserving knowledge • Process
    5. Data management is difficult and important
    6. Different data movement scenarios: migration, data integration, loading. But all require mapping rules for best practice
    7. The business view of data migration can be an issue • Often started at the end of a programme • Seen as a technical issue (moving the filing cabinet) rather than a business one • However, the documents in the filing cabinet need to be read, understood and translated to the new system; obsolete files need to be discarded
    8. Different data migration methodologies are available • PDM (Practical Data Migration): Johny Morris; training course; PDM certification; abstract; v2 due soon • Providers: most companies providing data migration services/products have a methodology; ours is PDM-like, but more concrete and less abstract
    9. Agenda: Data loading tools and challenges; best practices in methodology – Methodology
    10. As an example, our methodology (process diagram): Project scoping, Core migration, Configuration, Landscape analysis, Data assurance, Migration design, Migration development, Requirements analysis, Data discovery, Data review, Testing design, Testing development, Data modelling, Data cleansing, Execution, Review, Legacy decommissioning
    11. Firstly, review the legacy landscape (diagram): Satellites, Archive, SAP, Legacy Application, Report, Excel, Access DB, VBA
    12. Eradicate failure points – beware the virtual waterfall process (diagram: Requirements, Agile Development, Migrate, Signoff)
    13. Agenda: Data loading tools and challenges; best practices in methodology – Rules
    14. Rules are required • In data migration, integration or loading, one area of commonality is the link between source and target • This requires design, definition, testing, implementation and documentation • The aim is automated loading of external data into a common store (PPDM) • This requires best practice
    15. Best practice: a single version of truth • So for each of these data loaders we want a single version of truth • Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work • We want to remove boiler plate components that are only indirectly related to the business rules by which data is loaded (diagram: PPDM 3.8) • Let's look at what goes into a data loader and where the duplication and unnecessary work comes from...
    16. The PPDM physical model • PPDM comes to us as a physical projection, rather than a logical model – maps directly to a relational database • Access therefore via SQL, PL/SQL; low level detail is important, i.e. how relationships are implemented (e.g. well header to borehole) • Considerations to access: primary keys, foreign keys, data types – conversions, maximum lengths; load order required by FKs – PPDM Load of the Rings; relationships – cardinality etc • SQL: only known at runtime, so turnaround can be slow • All of this metadata is available in machine readable format, so we should use it
    17. External data sources • Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database • The data model is unlikely to be as rich as PPDM, but there is some definition of the content, e.g. Excel workbooks have a tabular layout with column titles, worksheets are named • It can be hard to find people with the relevant skills – you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: perl, python, xslt, sql (diagram: PPDM 3.8) • So the next clue is that we should use the model information: what elements, attributes and relationships are defined, rather than details of how we access it • Abstract out the data access layer; don't mix data access with the business rules required to move them into PPDM
    18. Challenges with domain expert mapping rules • A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel • Not concerned with data access, but some details will creep in, e.g. specifying an xpath • When lookups, merging/splitting values, string manipulation and conditional logic appear, the description can become ambiguous • Also note the duplication: the model metadata is being written in the spreadsheet; if the model changes, the spreadsheet needs to be manually updated
    19. Challenges with developer mapping rules • The example here probably wouldn't pass a code inspection, but it does illustrate the type of issues that can arise • Firstly, duplication: this is reiterating the Excel rules – they need to match up, but while a domain expert might follow the simple example previously, low level code can be tricky to discuss • Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, the max length of the name column is checked • Thirdly, boiler plate code: select/update/insert conditional logic • Fourthly, data access code appears in the rules • Finally, the code becomes hard to maintain as the developer moves on to other roles
    20. Documentation of mapping rules: Word document for sign-off; data management record of how data was loaded; stored in your MDM data store; can be queried; PPDM mapping tables
    21. Test artifacts • Here is where you do require some duplication • Tests are stories: • they define what the system should do • if it does, the system is good enough, provided the tests are complete • If we use a single version of truth to generate tests, the tests will duplicate errors, not find them
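A minimal sketch of what such an independent test might look like (record layouts assumed for illustration): it restates the expectation from the specification – every source well loaded exactly once, names upper-cased – rather than being generated from the mapping it is meant to check.

```python
# A minimal sketch of a reconciliation test written independently of the
# mapping definition: it re-derives the expectation from the specification
# instead of being generated from the same rules it is meant to check.
# The record structures are illustrative.
def test_every_source_well_is_loaded_once(source_rows, target_rows):
    source_keys = [row["UWI"] for row in source_rows]
    target_keys = [row["UWI"] for row in target_rows]

    assert len(target_keys) == len(set(target_keys)), "duplicate wells in target"
    assert sorted(source_keys) == sorted(set(target_keys)), "missing or extra wells"

    target_names = {row["UWI"]: row["WELL_NAME"] for row in target_rows}
    for row in source_rows:
        assert target_names[row["UWI"]] == row["Well Name"].upper(), \
            f"name mismatch for {row['UWI']}"
```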
    22. Agenda: Data loading tools and challenges; best practices in methodology – Tools
    23. Use tools • Use available metadata • Abstract out data access layer • Higher level DSL for the mapping rules: • increase team communication – developer/business • reduce boiler plate code • One definition: • replace Excel and code • generate documentation
    24. An example of a graphical tool: Altova MapForce • Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach • The metadata is loaded on the left and right (source/target) with connecting lines • In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic • Issues: • Is it really very easy to read? The example here is a simple mapping; imagine PPDM well log curves, reference data tables etc • It isn't easy to see what really happens: a+b versus an "adder" – e.g. follow the equal() to Customers: what does that actually do? • But: can generate documentation and an executable from that single definitive mapping definition • Typing errors etc are mostly eliminated
    25. ETL Solutions' Transformation Manager • An alternative is to use a textual DSL: again the metadata has been loaded • No data access code • Metadata is used extensively: for example warnings, primary keys for identification, relationships • Typing errors are checked at design time, and model or element changes affecting the code are quickly detected, e.g. PPDM 3.8 to 3.9 • Relationships are used to link transforms: a more logical view with no need to understand underlying constraints; the complexity of the model doesn't matter, as the project becomes structured naturally • FK constraints are used to determine load order • Metadata is pulled in directly from the source, e.g. PPDM, making use of all the hard work put in by the PPDM Association
    26. Generated documentation
    27. Keeping the PPDM data manager happy • One of the many questions a data manager has about the data he/she manages – data lineage: how did this data get here? (diagram: PPDM 3.8)
    28. PPDM provides tables to record data lineage
    29. Transformation Manager can generate documentation for the PPDM metadata module
    30. Agenda: Data loading tools and challenges; best practices in methodology – Project management
    31. Key points • Be aware: • look at data migration methodologies • select appropriate components • Look for and remove large risky steps • Start early: • ensure correct resources will be available • no nasty budget surprises • Use tools • Build a happy virtual team
    32. Questions • Did you know about these tables? • Who uses them? • How do you use them? • What features would be truly useful in a data loader tool?
    33. Contact us for more information: Karl Glenn, Business Development Director – kg@etlsolutions.com – +44 (0) 1912 894040. Read more on our website: http://www.etlsolutions.com/what-we-do/oil-and-gas/ • Raising data management standards • www.etlsolutions.com • Images from Free Digital Photos (freedigitalphotos.net)
