Hand Coding ETL Scenarios and Challenges

From mrm0, 1 year ago

Overview of some of the scenarios that lead one to hand-coding ove…

Slideshow transcript

Slide 1: Challenges and Techniques When Hand-coding ETL. TDWI Webcast Series, June 26, 2007. Mark Madsen, http://ThirdNature.net Slide 1 TDWI August 2007 Mark Madsen

Slide 2: Course Outline and Overview
• What drives people to hand-coding?
• Challenges faced in the absence of tools
• Some practices to use when rolling your own ETL

Slide 3: One Fact: ETL Market Share
[Bar chart of market share by product]
Custom: 45.0%
Informatica Corporation PowerCenter & PowerMart: 12.7%
SAS Institute SAS Enterprise ETL Server: 9.9%
Ascential Software DataStage: 8.0%
DataMirror Constellar Hub & Transformation Server: 3.0%
IBM DB2 Warehouse Manager: 1.9%
Microsoft Data Transformation Services: 1.9%
Business Objects Data Integrator: 1.7%
Ab Initio Software Co>Operating System: 1.7%
Computer Associates Advantage DT: 1.7%
Oracle Warehouse Builder: 1.7%
Cognos DecisionStream: 1.4%
Evolutionary Technology The ETI Solution: 1.1%
Group 1 Software Data Flow Server: 1.1%
Hummingbird Genio: 1.1%
Pervasive Software DJ Cosmo: 1.1%
iWay Software ETL Manager: 0.6%
Teradata Warehouse Builder: 0.6%
Embarcadero Software DT/Studio: 0.6%
Sunopsis: 0.3%
Other: 3.3%
Sources: Forrester Research, TDWI, private survey

Slide 4: Why Hand Coding?
Money…
• vs. time
• vs. resources
• vs. realities of budgeting

Slide 5: Why Hand Coding?
• Pre-existing integration
• Inherited data mart or warehouse that’s been in production
Do you really want to rebuild things that are working fine?

Slide 6: Why Hand Coding?
Legacy sources

Slide 7: Why Hand Coding?
• Intrusiveness
  • contention with the source application or the source database
  • platform integration
• Performance of the non-native extracts

Slide 8: Why Hand Coding?
Custom applications as data sources
Data that’s hard to deal with in a database or using ETL tools

Slide 9: Why Hand Coding?
• EAI, ESB or messaging as a data source
• Streaming data feeds
• Real-time data extracts

Slide 10: Why Hand Coding?
Sometimes an ETL tool is excessive given the problem you’re trying to solve

Slide 11: Some of the Big Challenges
• Performance
• Access to non-relational sources, particularly legacy sources
• Incremental extracts
• Maintenance problems
• Dealing with exceptions
• Operational integration

Slide 12: Performance
Leverage the database
• Push work into SQL
• Set processing is usually much faster than row-by-row processing
• Less data movement
• Takes advantage of database parallelism
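To make the set-versus-row point on this slide concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (src_orders, tgt_orders_rbr, tgt_orders_set) and the dollars-to-cents transform are assumptions made up for illustration; they are not from the presentation. Both functions produce the same rows, but the second keeps all the work inside the database.

```python
import sqlite3

def build_demo():
    """Create an in-memory database with a small, hypothetical source table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE src_orders (order_id INTEGER PRIMARY KEY, amount REAL);
        CREATE TABLE tgt_orders_rbr (order_id INTEGER, amount_cents INTEGER);
        CREATE TABLE tgt_orders_set (order_id INTEGER, amount_cents INTEGER);
    """)
    conn.executemany("INSERT INTO src_orders VALUES (?, ?)",
                     [(1, 10.5), (2, 3.25), (3, 7.0)])
    return conn

def load_row_by_row(conn):
    """Pull every row into the program, transform it there, insert it back."""
    rows = conn.execute("SELECT order_id, amount FROM src_orders").fetchall()
    for order_id, amount in rows:
        conn.execute(
            "INSERT INTO tgt_orders_rbr (order_id, amount_cents) VALUES (?, ?)",
            (order_id, round(amount * 100)),
        )

def load_set_based(conn):
    """One INSERT ... SELECT: the rows never leave the database engine."""
    conn.execute(
        "INSERT INTO tgt_orders_set (order_id, amount_cents) "
        "SELECT order_id, CAST(ROUND(amount * 100) AS INTEGER) FROM src_orders"
    )
```

With three rows the difference is invisible; the slide's claim is about large volumes, where the per-row round trips and data movement in the first version tend to dominate.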

Slide 13: Performance
Learn new SQL development techniques to better leverage the database (if you don’t already know them):
• E.g. if-then logic in SQL with no separate language wrapper for control logic
• Subselects / correlated subqueries instead of a sequence of queries landing data between each step (in moderation)
• Only move data when required
But what about files?
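One way to read the first two bullets is a CASE expression for the if-then logic and a correlated subquery in place of an intermediate staging table. The sketch below assumes that reading; the schema (customers, orders, customer_summary) and the tier thresholds are hypothetical, and sqlite3 is used only because it ships with Python.

```python
import sqlite3

SUMMARIZE_SQL = """
INSERT INTO customer_summary (customer_id, order_total, tier)
SELECT totals.customer_id,
       totals.order_total,
       CASE                                    -- if-then logic evaluated by the database
           WHEN totals.order_total >= 1000 THEN 'gold'
           WHEN totals.order_total >= 100  THEN 'silver'
           ELSE 'bronze'
       END
FROM (
    SELECT c.customer_id,
           (SELECT COALESCE(SUM(o.amount), 0)  -- correlated subquery, no landing table
              FROM orders AS o
             WHERE o.customer_id = c.customer_id) AS order_total
    FROM customers AS c
) AS totals
"""

def build_demo():
    """Create a small in-memory database with made-up customers and orders."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY);
        CREATE TABLE orders (customer_id INTEGER, amount INTEGER);
        CREATE TABLE customer_summary (customer_id INTEGER, order_total INTEGER, tier TEXT);
        INSERT INTO customers VALUES (1), (2), (3);
        INSERT INTO orders VALUES (1, 600), (1, 500), (2, 150);
    """)
    return conn

def summarize_customers(conn):
    """Total, classify, and load in a single statement."""
    conn.execute(SUMMARIZE_SQL)
    return conn.execute(
        "SELECT customer_id, order_total, tier FROM customer_summary ORDER BY customer_id"
    ).fetchall()
```

The whole step is one statement with no wrapper loop and no data landed between steps, which is the trade the slide describes (and qualifies with "in moderation").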

Slide 14: Performance
If you are doing row-by-row processing:
• Try to trickle-feed instead of batch processing
• Create multiple instances of a single process and split your data (poor man’s parallelism)
• Write real parallel code either using shared memory techniques or shared-nothing techniques
[Diagram: Process 1, Process 2, Process 3, Process 4]
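A minimal sketch of the "multiple instances, split your data" idea, assuming each instance is started with its own number and the total instance count and keeps only the keys that hash to its number. The command-line convention, the function names, and the stand-in source data are all assumptions for illustration; launching the instances (from a shell or a scheduler) is left out.

```python
import sys
import zlib

def owner_of(key, instance_count):
    """Map a key to an instance number using a hash that is stable across runs."""
    return zlib.crc32(str(key).encode("utf-8")) % instance_count

def rows_for_instance(rows, instance_id, instance_count, key_index=0):
    """Yield only the rows this instance is responsible for."""
    for row in rows:
        if owner_of(row[key_index], instance_count) == instance_id:
            yield row

def main(argv):
    # Hypothetical invocation: python load.py <instance_id> <instance_count>
    instance_id, instance_count = int(argv[1]), int(argv[2])
    source = [(n, "customer-%d" % n) for n in range(1, 21)]  # stand-in for a real extract
    for row in rows_for_instance(source, instance_id, instance_count):
        print(instance_id, row)  # the real transform and load would go here

if __name__ == "__main__" and len(sys.argv) == 3:
    main(sys.argv)
```

zlib.crc32 is used rather than the built-in hash() because Python randomizes string hashes per process, so separately started instances would not agree on who owns which key. Every row lands in exactly one instance, so the instances need no coordination beyond agreeing on the count.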

Slide 15: Legacy and Non-relational Sources
• Come in many different flavors and often have restrictions on accessibility
• Usually have no native manipulation tools like a database does, or they are tied to a language
• Typical strategies to deal with these:
  • Write programs to dump data to files and ftp it to the target environment
  • Get (much less expensive) data replication tools that can copy the data easily
  • Leverage code libraries and database features, e.g. XML shredding
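As an illustration of the last strategy, here is a sketch of XML shredding with a standard code library rather than a database feature: it flattens a nested document into relational rows using Python's xml.etree.ElementTree. The document shape, the order_lines table, and the function names are invented for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical input document: orders with nested line items.
SAMPLE_XML = """
<orders>
  <order id="1001" customer="ACME">
    <line sku="A-1" qty="2"/>
    <line sku="B-7" qty="1"/>
  </order>
  <order id="1002" customer="Globex">
    <line sku="A-1" qty="5"/>
  </order>
</orders>
"""

def shred_orders(xml_text):
    """Yield one flat tuple per line item, repeating the parent order's fields."""
    root = ET.fromstring(xml_text.strip())
    for order in root.findall("order"):
        for line in order.findall("line"):
            yield (int(order.get("id")), order.get("customer"),
                   line.get("sku"), int(line.get("qty")))

def load_order_lines(conn, xml_text):
    """Create the target table if needed and load the shredded rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS order_lines "
        "(order_id INTEGER, customer TEXT, sku TEXT, qty INTEGER)"
    )
    conn.executemany("INSERT INTO order_lines VALUES (?, ?, ?, ?)",
                     shred_orders(xml_text))
```

Some databases can do this shredding natively; which route is better depends on the platform, and the slide lists both.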

Slide 16: Source Impacts and Incremental Extracts
Resource drain on the source application is always a consideration, ETL tool or not.
Leverage the database features, if applicable:
• database replication
• change data capture facilities
Leverage capabilities of the application, e.g. capture messages if using EAI / ESB tools.
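The slide names replication, change data capture, and message capture. Where none of those is available, one common hand-coded fallback (not mentioned in the presentation, so treat it as a swapped-in technique) is a high-water mark: remember the latest change timestamp already extracted and ask the source only for newer rows. The sketch assumes the source table carries an updated_at column; all table and column names are hypothetical.

```python
import sqlite3

def build_demo():
    """Create a hypothetical source database and a separate control database."""
    source = sqlite3.connect(":memory:")
    source.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
        (1, 10.0, "2007-06-01T08:00:00"),
        (2, 25.0, "2007-06-02T09:30:00"),
    ])
    control = sqlite3.connect(":memory:")
    control.execute(
        "CREATE TABLE extract_watermark (table_name TEXT PRIMARY KEY, last_seen TEXT)"
    )
    return source, control

def extract_incremental(source, control):
    """Return rows changed since the last run and advance the stored watermark."""
    row = control.execute(
        "SELECT last_seen FROM extract_watermark WHERE table_name = 'orders'"
    ).fetchone()
    last_seen = row[0] if row else ""  # empty string sorts before any ISO timestamp
    changed = source.execute(
        "SELECT order_id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    if changed:
        control.execute(
            "INSERT OR REPLACE INTO extract_watermark (table_name, last_seen) "
            "VALUES ('orders', ?)",
            (changed[-1][2],),
        )
    return changed
```

This keeps the query against the source small, which speaks to the resource-drain concern, but it has known gaps: it does not see deletes, and rows committed late with a timestamp at or below the watermark can be missed. Those gaps are part of why the slide points to replication and change data capture first.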

Slide 17: Maintenance
Log everything, all the time
• Do it in a standard way
• Make it queryable
• Useful data to keep:
  • Start and end times of entire batch, components / subprocesses
  • Wait times (if you have things that can’t start until other things are complete)
  • Data volumes
  • Details for every error and exception
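One possible shape for "standard" and "queryable" logging is a single log table that every component writes to through the same helper. The sketch below is an assumption about how that could look, not the presenter's design: the etl_log table, its columns, and the logged_step helper are all hypothetical. It records start and end times, data volume, and error detail; wait times are not covered.

```python
import sqlite3
import time
from contextlib import contextmanager

LOG_DDL = """
CREATE TABLE IF NOT EXISTS etl_log (
    batch_id TEXT, component TEXT, started_at REAL, ended_at REAL,
    rows_processed INTEGER, status TEXT, error_detail TEXT
)
"""

@contextmanager
def logged_step(conn, batch_id, component):
    """Wrap one component of a batch; always write exactly one log row for it."""
    conn.execute(LOG_DDL)
    stats = {"rows": 0}          # the wrapped code reports its data volume here
    started = time.time()
    status, detail = "ok", None
    try:
        yield stats
    except Exception as exc:
        status, detail = "error", repr(exc)
        raise                    # log the failure, then let it propagate
    finally:
        conn.execute(
            "INSERT INTO etl_log VALUES (?, ?, ?, ?, ?, ?, ?)",
            (batch_id, component, started, time.time(), stats["rows"], status, detail),
        )
        conn.commit()
```

A component would then be written as `with logged_step(conn, "batch-1", "load_orders") as stats:` followed by its work and an assignment to `stats["rows"]`. Because the log is a table, questions such as "which components failed last night" become ordinary queries.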

Slide 18: Maintenance
Don’t forget programming discipline:
• Document everything
• Standardize common functions like logging and dependency checking
• Pay attention to the development support environment
• Modularize and make code readable
• Try scripting languages instead of 3GLs

Slide 19: Dealing With Exceptions and Errors
Handling exceptions in code vs. in an ETL tool is different, but the general patterns are roughly the same. Some practices to compensate for things the tools do:
• Use row status and process timestamp fields
• Flag exception data as it’s processed and save it
• Save the input data for errors, going back to production keys so you can trace data back
• Clean data or isolate and save it as part of the process
• Exception management is different with row-by-row processing vs. set-based processing; make sure you don’t lose data as you go through the process (it’s easier than you think)
• Catch program errors and save the program state with the data
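Several of these practices can be shown together in a small row-by-row loader: a row status and process timestamp on loaded rows, a reject table that keeps the raw input and the source key, and a final count check so rows cannot silently disappear. This is a sketch under assumed names (sales, sales_rejects, the "key" and "amount" fields) and an assumed validation rule; none of it comes from the presentation.

```python
import json
import sqlite3
from datetime import datetime, timezone

def load_with_rejects(conn, source_rows):
    """Load valid rows; save rejected rows whole, with source key and reason."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS sales
            (source_key TEXT, amount REAL, row_status TEXT, processed_at TEXT);
        CREATE TABLE IF NOT EXISTS sales_rejects
            (source_key TEXT, raw_input TEXT, error TEXT, processed_at TEXT);
    """)
    processed_at = datetime.now(timezone.utc).isoformat()
    loaded = rejected = 0
    for row in source_rows:
        try:
            amount = float(row["amount"])
            if amount < 0:
                raise ValueError("negative amount")  # hypothetical business rule
            conn.execute("INSERT INTO sales VALUES (?, ?, 'loaded', ?)",
                         (row["key"], amount, processed_at))
            loaded += 1
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the untouched input and the production key so it can be traced back.
            conn.execute("INSERT INTO sales_rejects VALUES (?, ?, ?, ?)",
                         (row.get("key"), json.dumps(row), repr(exc), processed_at))
            rejected += 1
    if loaded + rejected != len(source_rows):
        raise RuntimeError("row count mismatch: data was lost during the load")
    conn.commit()
    return loaded, rejected
```

In a set-based load the same idea would typically be expressed as two statements, one selecting the valid rows and one selecting the invalid rows, with the count check comparing both against the source.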

Slide 20: Integrating Into IT Operations
• Tying into scheduling is usually easier with code since it’s an executable
• Problems come when you want to monitor and alert.
• There are a few techniques to use depending on the environment.

Slide 21: Alternatives to Hand-coding

Slide 22: Strategies for Dealing With Hand-coded ETL
Replacement

Slide 23: Strategies for Dealing With Hand-coded ETL
Containment

Slide 24: Strategies for Dealing With Hand-coded ETL
Coexistence

Slide 25: Creative Commons
This work is licensed under the Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/us/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

Slide 26: Creative Commons Image Attributions
The following Creative Commons licensed images were used in this presentation:
Futuristic device: http://www.flickr.com/photos/tiptoe/2060918/
Whirlpool: http://www.flickr.com/photos/astrovinni/3002669/
Open air market: http://flickr.com/photos/baboon/309793875/
Great wall: http://www.flickr.com/photos/webel/64438599/
London: http://www.flickr.com/photos/cc_chapman/299509390/