Hand Coding ETL Scenarios and Challenges

    Hand Coding ETL Scenarios and Challenges - Presentation Transcript

    1. Challenges and Techniques When Hand-coding ETL. TDWI Webcast Series, June 26, 2007. Mark Madsen, http://ThirdNature.net
    2. Course Outline and Overview
      • What drives people to hand-coding?
      • Challenges faced in the absence of tools
      • Some practices to use when rolling your own ETL
    3. One Fact: ETL Market Share [chart; sources: Forrester Research, TDWI, private survey]
    4. Why Hand Coding?
      • Money…
        • vs. time
        • vs. resources
        • vs. realities of budgeting
    5. Why Hand Coding?
      • Pre-existing integration
      • Inherited data mart or warehouse that’s been in production
        • Do you really want to rebuild things that are working fine?
    6. Why Hand Coding?
      • Legacy sources
    7. Why Hand Coding?
      • Intrusiveness
        • contention with the source application or the source database
        • platform integration
      • Performance of the non-native extracts
    8. Why Hand Coding?
      • Custom applications as data sources
      • Data that’s hard to deal with in a database or using ETL tools
    9. Why Hand Coding?
      • EAI, ESB or messaging as a data source
      • Streaming data feeds
      • Real-time data extracts
    10. Why Hand Coding?
      • Sometimes an ETL tool is excessive given the problem you’re trying to solve
    11. Some of the Big Challenges
      • Performance
      • Access to non-relational sources, particularly legacy sources
      • Incremental extracts
      • Maintenance problems
      • Dealing with exceptions
      • Operational integration
    12. Performance
      • Leverage the database
        • Push work into SQL
        • Set processing is usually much faster than row-by-row processing
        • Less data movement
        • Takes advantage of database parallelism
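A minimal sketch of pushing work into SQL, assuming an in-memory SQLite database and hypothetical table names (`stg_orders`, `fact_customer_sales`) that do not come from the presentation. A single set-based statement aggregates and loads inside the database, so no rows are fetched into the program and written back one at a time.

```python
import sqlite3

# Hypothetical staging and target tables, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_orders (order_id INTEGER, customer_id INTEGER, amount REAL);
    CREATE TABLE fact_customer_sales (
        customer_id INTEGER PRIMARY KEY,
        total_amount REAL,
        order_count INTEGER
    );
    INSERT INTO stg_orders VALUES (1, 10, 25.0), (2, 10, 75.0), (3, 20, 40.0);
""")

# Set-based load: the database reads, aggregates, and writes in one statement.
# The row-by-row alternative would SELECT every order into the program,
# accumulate totals in a loop, and issue one INSERT per customer.
conn.execute("""
    INSERT INTO fact_customer_sales (customer_id, total_amount, order_count)
    SELECT customer_id, SUM(amount), COUNT(*)
    FROM stg_orders
    GROUP BY customer_id
""")
conn.commit()

loaded = conn.execute(
    "SELECT customer_id, total_amount, order_count "
    "FROM fact_customer_sales ORDER BY customer_id"
).fetchall()
print(loaded)
```

On a server database the same statement shape also lets the optimizer apply its own parallelism, which client-side loops cannot use.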
    13. Performance
      • Learn new SQL development techniques to better leverage the database (if you don’t already know them):
        • E.g. if-then logic in SQL with no separate language wrapper for control logic
        • Subselects / correlated subqueries instead of a sequence of queries landing data between each step (in moderation)
        • Only move data when required
      • But what about files?
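The SQL techniques listed on this slide might look like the following sketch, which assumes SQLite and hypothetical `customers` and `orders` tables. A `CASE` expression carries the if-then logic, and a correlated subquery replaces a separate "total the orders, land the result, then join" step.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex'), (3, 'Initech');
    INSERT INTO orders VALUES (100, 1, 900.0), (101, 1, 300.0), (102, 2, 50.0);
""")

rows = conn.execute("""
    SELECT t.name,
           t.total,
           -- If-then logic expressed in SQL rather than in a wrapper language.
           CASE WHEN t.total >= 1000 THEN 'high'
                WHEN t.total > 0     THEN 'low'
                ELSE 'none'
           END AS tier
    FROM (
        SELECT c.name,
               -- Correlated subquery: evaluated per customer, no intermediate table.
               COALESCE((SELECT SUM(o.amount)
                         FROM orders o
                         WHERE o.customer_id = c.customer_id), 0) AS total
        FROM customers c
    ) AS t
    ORDER BY t.name
""").fetchall()

for row in rows:
    print(row)
```

The tier thresholds are arbitrary illustration values. As the slide cautions, nesting of this kind is best used in moderation, since deeply nested subqueries become hard to read and tune.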
    14. Performance
      • If you are doing row-by-row processing:
        • Try to trickle-feed instead of batch processing
        • Create multiple instances of a single process and split your data (poor man’s parallelism)
        • Write real parallel code either using shared memory techniques or shared-nothing techniques
      [Diagram: Process 1, Process 2, Process 3, Process 4 running side by side]
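One way to sketch the "multiple instances, split your data" approach: every instance runs the same code, is told its instance number and the total instance count, and keeps only the rows whose key hashes to its number. The function names and the `customer_id` key below are hypothetical. CRC32 is used because Python's built-in `hash()` for strings differs between processes, which would make the split inconsistent.

```python
import zlib


def partition_of(key: str, instance_count: int) -> int:
    """Map a key to a stable partition number in [0, instance_count)."""
    return zlib.crc32(key.encode("utf-8")) % instance_count


def rows_for_instance(rows, key_field, instance, instance_count):
    """Yield only the rows that this instance is responsible for."""
    for row in rows:
        if partition_of(str(row[key_field]), instance_count) == instance:
            yield row


# Simulate four instances splitting the same input of 100 rows.
rows = [{"customer_id": i, "amount": i * 1.5} for i in range(1, 101)]
parts = [list(rows_for_instance(rows, "customer_id", n, 4)) for n in range(4)]
print([len(p) for p in parts])
```

In practice each instance would be started as its own operating-system process by the scheduler or a shell script, with the instance number passed as an argument. Because every key lands in exactly one partition, no row is processed twice or skipped.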
    15. Legacy and Non-relational Sources
      • Come in many different flavors and often have restrictions on accessibility
      • Usually have no native manipulation tools like a database does, or they are tied to a language
      • Typical strategies to deal with these:
        • Write programs to dump data to files and ftp it to the target environment
        • Get data replication tools (much less expensive than ETL tools) that can copy the data easily
        • Leverage code libraries and database features, e.g. XML shredding
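As a sketch of shredding XML with a code library (here Python's standard `xml.etree.ElementTree`, rather than a database's own XML features), the example below flattens a nested document into rows for a staging table. The document layout and the `stg_order_lines` table are made up for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical source document: orders with nested line items.
xml_doc = """
<orders>
  <order id="1001" customer="ACME">
    <line sku="A-1" qty="2" price="9.50"/>
    <line sku="B-7" qty="1" price="20.00"/>
  </order>
  <order id="1002" customer="GLOBEX">
    <line sku="A-1" qty="5" price="9.50"/>
  </order>
</orders>
""".strip()


def shred_orders(xml_text):
    """Yield one flat tuple per line item, repeating the parent order's keys."""
    root = ET.fromstring(xml_text)
    for order in root.iter("order"):
        for line in order.iter("line"):
            yield (
                int(order.get("id")),
                order.get("customer"),
                line.get("sku"),
                int(line.get("qty")),
                float(line.get("price")),
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg_order_lines "
    "(order_id INTEGER, customer TEXT, sku TEXT, qty INTEGER, price REAL)"
)
conn.executemany("INSERT INTO stg_order_lines VALUES (?, ?, ?, ?, ?)", shred_orders(xml_doc))
conn.commit()

line_count = conn.execute("SELECT COUNT(*) FROM stg_order_lines").fetchone()[0]
print(line_count)
```

Once the data is in a relational staging table, the rest of the processing can be pushed into SQL.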
    16. Source Impacts and Incremental Extracts
      • Resource drain on the source application is always a consideration, ETL tool or not.
        • Leverage the database features, if applicable:
          • database replication
          • change data capture facilities
        • Leverage capabilities of the application, e.g. capture messages if using EAI / ESB tools.
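Where replication or change data capture is not available, a common hand-coded fallback (an assumption here, not something the slides prescribe) is a high-water-mark extract: remember the latest change timestamp already extracted and pull only newer rows. The sketch assumes the source table has a reliable `updated_at` column, which many legacy sources lack.

```python
import sqlite3

# Hypothetical source table with a last-modified timestamp.
src = sqlite3.connect(":memory:")
src.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT, updated_at TEXT);
    INSERT INTO customers VALUES
        (1, 'Acme',    '2007-06-01 08:00:00'),
        (2, 'Globex',  '2007-06-20 09:30:00'),
        (3, 'Initech', '2007-06-25 17:45:00');
""")


def extract_changes(conn, last_watermark):
    """Return rows changed after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT customer_id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # Keep the old watermark if nothing changed, so the next run starts from the same point.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


changed, watermark = extract_changes(src, "2007-06-15 00:00:00")
print(changed, watermark)
```

This approach does not see deleted rows, and it still queries the source, so the resource-drain concern above applies; the watermark would normally be persisted between runs.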
    17. Maintenance
      • Log everything, all the time
        • Do it in a standard way
        • Make it queryable
        • Useful data to keep:
          • Start and end times of entire batch, components / subprocesses
          • Wait times (if you have things that can’t start until other things are complete)
          • Data volumes
          • Details for every error and exception
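A minimal sketch of standardized, queryable logging, assuming a hypothetical `etl_log` table and a `logged_step` helper invented for this example. Each step records its start and end time, row count, status, and error detail in one place, so run history can be examined with ordinary SQL.

```python
import sqlite3
import time
from contextlib import contextmanager

log_db = sqlite3.connect(":memory:")
log_db.execute("""
    CREATE TABLE etl_log (
        batch_id TEXT, step TEXT, started_at REAL, ended_at REAL,
        row_count INTEGER, status TEXT, error_detail TEXT
    )
""")


@contextmanager
def logged_step(batch_id, step):
    """Log one row per step, whether the step succeeds or fails."""
    stats = {"row_count": 0}
    started = time.time()
    status, detail = "ok", None
    try:
        yield stats
    except Exception as exc:
        status, detail = "error", repr(exc)
        raise  # the caller still sees the failure
    finally:
        log_db.execute(
            "INSERT INTO etl_log VALUES (?, ?, ?, ?, ?, ?, ?)",
            (batch_id, step, started, time.time(), stats["row_count"], status, detail),
        )
        log_db.commit()


with logged_step("batch-001", "load_customers") as stats:
    stats["row_count"] = 3  # a real step would count the rows it processed

try:
    with logged_step("batch-001", "load_orders"):
        raise ValueError("bad date in input row")
except ValueError:
    pass

log_rows = log_db.execute(
    "SELECT step, row_count, status FROM etl_log ORDER BY rowid"
).fetchall()
print(log_rows)
```

Wait times are not captured in this sketch; they could be derived by comparing a step's start time with the end time of the step it depends on.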
    18. Maintenance
      • Don’t forget programming discipline:
        • Document everything
        • Standardize common functions like logging and dependency checking
        • Pay attention to the development support environment
        • Modularize and make code readable
        • Try scripting languages instead of 3GLs
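Dependency checking is one of the functions worth standardizing. The sketch below assumes each step writes its outcome to a hypothetical `etl_log` table; a shared `dependencies_met` helper (also hypothetical) then lets any step ask whether its prerequisites finished successfully before it starts.

```python
import sqlite3


def dependencies_met(conn, batch_id, required_steps):
    """Return True only if every required step finished OK in this batch."""
    placeholders = ", ".join("?" for _ in required_steps)
    done = conn.execute(
        "SELECT COUNT(DISTINCT step) FROM etl_log "
        f"WHERE batch_id = ? AND status = 'ok' AND step IN ({placeholders})",
        [batch_id, *required_steps],
    ).fetchone()[0]
    return done == len(set(required_steps))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_log (batch_id TEXT, step TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO etl_log VALUES (?, ?, ?)",
    [
        ("batch-001", "load_customers", "ok"),
        ("batch-001", "load_products", "error"),
    ],
)

# Orders need both dimensions loaded; segments need only customers.
can_load_orders = dependencies_met(conn, "batch-001", ["load_customers", "load_products"])
can_load_segments = dependencies_met(conn, "batch-001", ["load_customers"])
print(can_load_orders, can_load_segments)
```

Keeping this check in one shared function means every job interprets "upstream finished" the same way.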
    19. Dealing With Exceptions and Errors
      • Some practices to compensate for things the tools do:
        • Use row status and process timestamp fields
        • Flag exception data as it’s processed and save it
        • Save the input data for errors, going back to production keys so you can trace data back
        • Clean data or isolate and save it as part of the process
        • Exception management is different with row-by-row processing vs. set-based processing; make sure you don’t lose data as you go through the process (it’s easier than you think)
        • Catch program errors and save the program state with the data
      Handling exceptions in code vs. in an ETL tool is different, but the general patterns are roughly the same.
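A sketch of several of these practices in row-by-row code, assuming hypothetical `dim_customer` and `etl_reject` tables: loaded rows carry a row status and process timestamp, and any row that fails is saved with its production key, raw input, and error detail instead of being dropped.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (
        source_key TEXT, name TEXT, credit_limit REAL,
        row_status TEXT, processed_at TEXT
    );
    CREATE TABLE etl_reject (
        source_key TEXT, raw_input TEXT, error_detail TEXT, processed_at TEXT
    );
""")


def load_rows(rows):
    """Load good rows; divert bad rows to the reject table with full context."""
    now = datetime.now(timezone.utc).isoformat()
    for source_key, name, credit_limit in rows:
        try:
            conn.execute(
                "INSERT INTO dim_customer VALUES (?, ?, ?, 'loaded', ?)",
                (source_key, name.strip(), float(credit_limit), now),
            )
        except (ValueError, AttributeError) as exc:
            # Keep the production key and the untouched input so the row can be traced back.
            conn.execute(
                "INSERT INTO etl_reject VALUES (?, ?, ?, ?)",
                (source_key, repr((source_key, name, credit_limit)), repr(exc), now),
            )
    conn.commit()


load_rows([("C-1", " Acme ", "5000"), ("C-2", "Globex", "n/a"), ("C-3", None, "100")])

loaded_count = conn.execute("SELECT COUNT(*) FROM dim_customer").fetchone()[0]
rejected_keys = [r[0] for r in conn.execute("SELECT source_key FROM etl_reject ORDER BY rowid")]
print(loaded_count, rejected_keys)
```

A useful check after each run is that loaded rows plus rejected rows equals input rows. With set-based SQL the equivalent pattern is usually a pair of statements, one selecting valid rows into the target and one selecting invalid rows into the reject table, which is where rows are most easily lost if the two conditions are not exact complements.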
    20. Integrating Into IT Operations
      • Tying into scheduling is usually easier with code since it’s an executable
      • Problems come when you want to monitor and alert.
      • There are a few techniques to use depending on the environment.
    21. Alternatives to Hand-coding
    22. Strategies for Dealing With Hand-coded ETL
      • Replacement
    23. Strategies for Dealing With Hand-coded ETL
      • Containment
    24. Strategies for Dealing With Hand-coded ETL
      • Coexistence
    25. Creative Commons
      • This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.
    26. Creative Commons Image Attributions
      • The following Creative Commons licensed images were used in this presentation:
      • Futuristic device: http://www.flickr.com/photos/tiptoe/2060918/
      • Whirlpool: http://www.flickr.com/photos/astrovinni/3002669/
      • Open air market: http://flickr.com/photos/baboon/309793875/
      • Great wall: http://www.flickr.com/photos/webel/64438599/
      • London: http://www.flickr.com/photos/cc_chapman/299509390/
