Hand Coding ETL Scenarios and Challenges


Overview of some of the scenarios that lead one to hand-coding over tools, a description of the challenges faced, and some practices for dealing with the problems.

Published in: Business, Technology

  • Hand Coding ETL Scenarios and Challenges

    1. Challenges and Techniques When Hand-coding ETL
        TDWI Webcast Series, June 26, 2007
        Mark Madsen, http://ThirdNature.net
    2. Course Outline and Overview
        • What drives people to hand-coding?
        • Challenges faced in the absence of tools
        • Some practices to use when rolling your own ETL
    3. One Fact: ETL Market Share
        Sources: Forrester Research, TDWI, private survey
    4. Why Hand Coding?
        • Money…
            • vs. time
            • vs. resources
            • vs. realities of budgeting
    5. Why Hand Coding?
        • Pre-existing integration
        • Inherited data mart or warehouse that’s been in production
            • Do you really want to rebuild things that are working fine?
    6. Why Hand Coding?
        • Legacy sources
    7. Why Hand Coding?
        • Intrusiveness
            • contention with the source application or the source database
            • platform integration
        • Performance of the non-native extracts
    8. Why Hand Coding?
        • Custom applications as data sources
        • Data that’s hard to deal with in a database or using ETL tools
    9. Why Hand Coding?
        • EAI, ESB or messaging as a data source
        • Streaming data feeds
        • Real-time data extracts
    10. Why Hand Coding?
        • Sometimes an ETL tool is excessive given the problem you’re trying to solve
    11. Some of the Big Challenges
        • Performance
        • Access to non-relational sources, particularly legacy sources
        • Incremental extracts
        • Maintenance problems
        • Dealing with exceptions
        • Operational integration
    12. Performance
        • Leverage the database
            • Push work into SQL
            • Set processing is usually much faster than row-by-row processing
            • Less data movement
            • Takes advantage of database parallelism
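
To make the set-versus-row contrast concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`staging_sales`, `fact_sales_slow`, `fact_sales_fast`) are hypothetical and not from the slides. Both functions produce the same result, but the set-based one never pulls the data out of the database.

```python
import sqlite3

def load_row_by_row(conn):
    """Row-by-row: fetch every row into the program, then insert one at a time."""
    rows = conn.execute("SELECT id, qty, unit_price_cents FROM staging_sales").fetchall()
    for row_id, qty, unit_price_cents in rows:
        conn.execute(
            "INSERT INTO fact_sales_slow (id, total_cents) VALUES (?, ?)",
            (row_id, qty * unit_price_cents),
        )

def load_set_based(conn):
    """Set-based: one INSERT ... SELECT, so the work stays inside the database."""
    conn.execute(
        "INSERT INTO fact_sales_fast (id, total_cents) "
        "SELECT id, qty * unit_price_cents FROM staging_sales"
    )

# Hypothetical schema and sample data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_sales (id INTEGER PRIMARY KEY, qty INTEGER, unit_price_cents INTEGER);
    CREATE TABLE fact_sales_slow (id INTEGER PRIMARY KEY, total_cents INTEGER);
    CREATE TABLE fact_sales_fast (id INTEGER PRIMARY KEY, total_cents INTEGER);
""")
conn.executemany(
    "INSERT INTO staging_sales VALUES (?, ?, ?)",
    [(i, i % 7 + 1, 199) for i in range(1, 1001)],
)

load_row_by_row(conn)
load_set_based(conn)

slow = conn.execute("SELECT id, total_cents FROM fact_sales_slow ORDER BY id").fetchall()
fast = conn.execute("SELECT id, total_cents FROM fact_sales_fast ORDER BY id").fetchall()
```

With an in-memory SQLite database the speed difference is small. The gap usually matters most when each row-level statement is a network round trip to a separate database server.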
    13. Performance
        • Learn new SQL development techniques to better leverage the database (if you don’t already know them):
            • E.g. if-then logic in SQL with no separate language wrapper for control logic
            • Subselects / correlated subqueries instead of a sequence of queries landing data between each step (in moderation)
            • Only move data when required
        • But what about files?
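
As one possible illustration of the two SQL techniques named on this slide, the sketch below uses a `CASE` expression for the if-then logic and a correlated subquery for a lookup that might otherwise be landed in an intermediate table. The schema (`src_orders`, `src_customers`, `tgt_orders`) and the size bands are assumptions made up for the example.

```python
import sqlite3

# Hypothetical source and target tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE src_orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER);
    CREATE TABLE src_customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE tgt_orders (order_id INTEGER PRIMARY KEY, region TEXT, size_band TEXT);
    INSERT INTO src_customers VALUES (1, 'EMEA'), (2, 'APAC');
    INSERT INTO src_orders VALUES (10, 1, 50), (11, 2, 500), (12, 3, 5000);
""")

# One statement does the lookup (correlated subquery) and the branching (CASE),
# with no procedural wrapper and no intermediate table.
conn.execute("""
    INSERT INTO tgt_orders (order_id, region, size_band)
    SELECT o.order_id,
           COALESCE(
               (SELECT c.region FROM src_customers c WHERE c.customer_id = o.customer_id),
               'UNKNOWN'
           ),
           CASE
               WHEN o.amount < 100  THEN 'small'
               WHEN o.amount < 1000 THEN 'medium'
               ELSE 'large'
           END
    FROM src_orders o
""")

result = conn.execute(
    "SELECT order_id, region, size_band FROM tgt_orders ORDER BY order_id"
).fetchall()
```

Order 12 refers to a customer that does not exist in `src_customers`, so the subquery returns NULL and `COALESCE` substitutes `'UNKNOWN'` rather than dropping the row.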
    14. Performance
        • If you are doing row-by-row processing:
            • Try to trickle-feed instead of batch processing
            • Create multiple instances of a single process and split your data (poor man’s parallelism)
            • Write real parallel code either using shared memory techniques or shared-nothing techniques
        [Diagram: four parallel processes, Process 1 through Process 4]
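
The "poor man's parallelism" idea can be sketched as a deterministic split on a key: each instance of the same program is started with its own worker index and processes only the rows that hash to it. The function names, the `customer_id` key, and the worker count of four below are illustrative assumptions.

```python
import zlib

def partition_of(key: str, worker_count: int) -> int:
    """Map a key to a worker index in the range 0..worker_count-1.

    zlib.crc32 is used because it gives the same answer in every process.
    Python's built-in hash() is salted per process for strings, so separate
    instances would disagree about which rows belong to which worker.
    """
    return zlib.crc32(key.encode("utf-8")) % worker_count

def rows_for_worker(rows, worker_index, worker_count, key_field="customer_id"):
    """Return only the rows that this instance is responsible for."""
    return [
        row for row in rows
        if partition_of(str(row[key_field]), worker_count) == worker_index
    ]

# Hypothetical input; in practice each instance would read the same source
# and receive its worker_index as a command-line argument.
rows = [{"customer_id": i, "amount": i * 10} for i in range(100)]
WORKER_COUNT = 4
slices = [rows_for_worker(rows, i, WORKER_COUNT) for i in range(WORKER_COUNT)]
```

Every row lands in exactly one slice, so the instances can run side by side without coordinating with one another.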
    15. Legacy and Non-relational Sources
        • Come in many different flavors and often have restrictions on accessibility
        • Usually have no native manipulation tools like a database does, or they are tied to a language
        • Typical strategies to deal with these:
            • Write programs to dump data to files and ftp it to the target environment
            • Get (much less expensive) data replication tools that can copy the data easily
            • Leverage code libraries and database features, e.g. XML shredding
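
The slide mentions XML shredding as a database feature. The sketch below takes the code-library route instead, using Python's standard `xml.etree.ElementTree` to flatten a nested document into relational rows. The XML layout and the `order_lines` table are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document, for illustration only.
XML_DOC = """
<orders>
  <order id="1001" customer="ACME">
    <line sku="A-1" qty="2"/>
    <line sku="B-7" qty="1"/>
  </order>
  <order id="1002" customer="Globex">
    <line sku="A-1" qty="5"/>
  </order>
</orders>
"""

def shred_orders(xml_text):
    """Yield one flat (order_id, customer, sku, qty) tuple per <line> element."""
    root = ET.fromstring(xml_text.strip())
    for order in root.findall("order"):
        for line in order.findall("line"):
            yield (
                int(order.get("id")),
                order.get("customer"),
                line.get("sku"),
                int(line.get("qty")),
            )

rows = list(shred_orders(XML_DOC))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_lines (order_id INTEGER, customer TEXT, sku TEXT, qty INTEGER)"
)
conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)", rows)
loaded = conn.execute("SELECT COUNT(*) FROM order_lines").fetchone()[0]
```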
    16. Source Impacts and Incremental Extracts
        • Resource drain on the source application is always a consideration, ETL tool or not.
            • Leverage the database features, if applicable:
                • database replication
                • change data capture facilities
            • Leverage capabilities of the application, e.g. capture messages if using EAI / ESB tools.
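
Native change data capture differs by database product, so the sketch below is only a stand-in: it emulates a change log with triggers in SQLite and shows the extract side reading whatever changed since the last run. The table names and the `extract_changes` helper are hypothetical. Triggers add work to every source transaction, which is exactly the kind of source impact this slide warns about, so a log-based facility is usually preferable where one exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);

    -- Change log filled by the triggers below: I = insert, U = update, D = delete.
    CREATE TABLE customers_changes (
        change_id INTEGER PRIMARY KEY AUTOINCREMENT,
        op TEXT, id INTEGER, name TEXT
    );
    CREATE TRIGGER customers_ins AFTER INSERT ON customers BEGIN
        INSERT INTO customers_changes (op, id, name) VALUES ('I', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customers_upd AFTER UPDATE ON customers BEGIN
        INSERT INTO customers_changes (op, id, name) VALUES ('U', NEW.id, NEW.name);
    END;
    CREATE TRIGGER customers_del AFTER DELETE ON customers BEGIN
        INSERT INTO customers_changes (op, id, name) VALUES ('D', OLD.id, OLD.name);
    END;
""")

def extract_changes(conn, last_change_id):
    """Return the changes after last_change_id, plus the new position to remember."""
    rows = conn.execute(
        "SELECT change_id, op, id, name FROM customers_changes "
        "WHERE change_id > ? ORDER BY change_id",
        (last_change_id,),
    ).fetchall()
    new_last = rows[-1][0] if rows else last_change_id
    return rows, new_last

# First run picks up the initial inserts.
conn.execute("INSERT INTO customers VALUES (1, 'Ann')")
conn.execute("INSERT INTO customers VALUES (2, 'Bob')")
first_batch, mark = extract_changes(conn, 0)

# Second run sees only what happened since the first run.
conn.execute("UPDATE customers SET name = 'Robert' WHERE id = 2")
conn.execute("DELETE FROM customers WHERE id = 1")
second_batch, mark = extract_changes(conn, mark)
```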
    17. Maintenance
        • Log everything, all the time
            • Do it in a standard way
            • Make it queryable
            • Useful data to keep:
                • Start and end times of entire batch, components / subprocesses
                • Wait times (if you have things that can’t start until other things are complete)
                • Data volumes
                • Details for every error and exception
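
One way to log in a standard, queryable way is to write every step's outcome to a database table through a single shared helper. The sketch below is an assumption about how that could look, not the presenter's design: the `etl_log` table, its columns, and the `logged_step` context manager are all hypothetical. It records start and end times, data volume, and error detail. Wait times are left out for brevity.

```python
import sqlite3
import time
from contextlib import contextmanager

def create_log_table(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_log (
               log_id       INTEGER PRIMARY KEY AUTOINCREMENT,
               batch_id     TEXT,
               step         TEXT,
               started_at   REAL,
               ended_at     REAL,
               row_count    INTEGER,
               status       TEXT,
               error_detail TEXT)"""
    )

def write_log_row(conn, batch_id, step, started_at, row_count, status, error_detail):
    conn.execute(
        "INSERT INTO etl_log "
        "(batch_id, step, started_at, ended_at, row_count, status, error_detail) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (batch_id, step, started_at, time.time(), row_count, status, error_detail),
    )

@contextmanager
def logged_step(conn, batch_id, step):
    """Wrap one ETL step so that success and failure are both always logged."""
    stats = {"row_count": 0}          # the step updates this as it works
    started_at = time.time()
    try:
        yield stats
    except Exception as exc:
        write_log_row(conn, batch_id, step, started_at, stats["row_count"], "FAILED", repr(exc))
        raise                          # logging must not swallow the error
    write_log_row(conn, batch_id, step, started_at, stats["row_count"], "OK", None)

conn = sqlite3.connect(":memory:")
create_log_table(conn)

with logged_step(conn, "batch-001", "load_customers") as stats:
    stats["row_count"] = 42            # a real step would count rows as it loads

try:
    with logged_step(conn, "batch-001", "load_orders"):
        raise ValueError("bad date in row 7")
except ValueError:
    pass

log_rows = conn.execute(
    "SELECT step, row_count, status, error_detail FROM etl_log ORDER BY log_id"
).fetchall()
```

Because the log is a table, questions such as "which steps failed last week" or "how has the row count of this load changed over time" become ordinary queries.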
    18. Maintenance
        • Don’t forget programming discipline:
            • Document everything
            • Standardize common functions like logging and dependency checking
            • Pay attention to the development support environment
            • Modularize and make code readable
            • Try scripting languages instead of 3GLs
    19. Dealing With Exceptions and Errors
        • Some practices to compensate for things the tools do:
            • Use row status and process timestamp fields
            • Flag exception data as it’s processed and save it
            • Save the input data for errors, going back to production keys so you can trace data back
            • Clean data or isolate and save it as part of the process
            • Exception management is different with row-by-row processing vs. set-based processing; make sure you don’t lose data as you go through the process (it’s easier than you think)
            • Catch program errors and save the program state with the data
        Handling exceptions in code vs. in an ETL tool is different, but the general patterns are roughly the same.
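
A minimal row-by-row sketch of several of these practices follows: a row status and process timestamp on the target, and a reject table that keeps the raw input, the source key, and the error. All names (`tgt_sales`, `rejected_rows`, `order_no`) are hypothetical. The counts returned at the end give a simple reconciliation check that no input row was lost.

```python
import json
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tgt_sales (
        source_key TEXT PRIMARY KEY, amount INTEGER,
        row_status TEXT, processed_at TEXT
    );
    CREATE TABLE rejected_rows (
        source_key TEXT, raw_input TEXT, error_detail TEXT, processed_at TEXT
    );
""")

def load_rows(conn, raw_rows):
    """Load good rows; save bad rows with their source key and raw input."""
    now = datetime.now(timezone.utc).isoformat()
    loaded = rejected = 0
    for raw in raw_rows:
        source_key = raw.get("order_no")
        try:
            amount = int(raw["amount"])
            conn.execute(
                "INSERT INTO tgt_sales VALUES (?, ?, 'LOADED', ?)",
                (source_key, amount, now),
            )
            loaded += 1
        except (KeyError, ValueError, sqlite3.IntegrityError) as exc:
            # Keep the original input and the error so the row can be traced and replayed.
            conn.execute(
                "INSERT INTO rejected_rows VALUES (?, ?, ?, ?)",
                (source_key, json.dumps(raw), repr(exc), now),
            )
            rejected += 1
    return loaded, rejected

# Hypothetical input: one good row, one bad amount, one duplicate key, one missing field.
raw_rows = [
    {"order_no": "A1", "amount": "100"},
    {"order_no": "A2", "amount": "n/a"},
    {"order_no": "A1", "amount": "250"},
    {"order_no": "A3"},
]
loaded, rejected = load_rows(conn, raw_rows)
rejected_keys = [
    r[0] for r in conn.execute("SELECT source_key FROM rejected_rows ORDER BY rowid")
]
```

With set-based loads the same pattern usually becomes two statements over the same input, one inserting the rows that pass validation and one inserting those that fail, and the reconciliation check matters even more because no per-row error is raised.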
    20. Integrating Into IT Operations
        • Tying into scheduling is usually easier with code since it’s an executable
        • Problems come when you want to monitor and alert.
        • There are a few techniques to use depending on the environment.
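
The slides do not say which techniques are meant. One common convention, shown here purely as an assumption, is to have the job report failure through its process exit code and standard error, which most schedulers and monitoring tools can act on. The `run_job` function and the step names are hypothetical.

```python
import sys
import traceback

def run_job(steps):
    """Run the steps in order; return 0 on success or 1 on the first failure."""
    for name, step in steps:
        try:
            step()
        except Exception:
            # stderr plus a non-zero exit code is something a scheduler can detect.
            print(f"ETL step failed: {name}", file=sys.stderr)
            traceback.print_exc()
            return 1
    return 0

def extract():
    pass  # placeholder for real extract logic

def load():
    raise RuntimeError("target database unavailable")

ok_code = run_job([("extract", extract)])
fail_code = run_job([("extract", extract), ("load", load)])

# A real entry point would finish with: sys.exit(run_job(steps))
```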
    21. Alternatives to Hand-coding
    22. Strategies for Dealing With Hand-coded ETL
        • Replacement
    23. Strategies for Dealing With Hand-coded ETL
        • Containment
    24. Strategies for Dealing With Hand-coded ETL
        • Coexistence
    25. Creative Commons
        • This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
    26. Creative Commons Image Attributions
        • The following Creative Commons licensed images were used in this presentation:
        • Futuristic device: http://www.flickr.com/photos/tiptoe/2060918/
        • Whirlpool: http://www.flickr.com/photos/astrovinni/3002669/
        • Open air market: http://flickr.com/photos/baboon/309793875/
        • Great wall: http://www.flickr.com/photos/webel/64438599/
        • London: http://www.flickr.com/photos/cc_chapman/299509390/