Building the DW - ETL

  1. Extraction, Transformation and Loading Process
     Data Warehousing Technologies
     Dr. Ilieva I. Ageenko
  2. Extraction, Transformation and Loading
     - Data is extracted from the operational systems.
     - Before data is loaded into the DW, it must pass through a transformation process.
     - ETL tools are commonly used to extract, cleanse, transform and load data into the DW.
     - ETL tools operate as the central hub of incoming data to the DW (see the sketch below).
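To make the three phases concrete, here is a minimal end-to-end sketch in Python. The file, table and column names (orders_export.csv, orders, order_id, customer, amount) are invented for illustration; a real pipeline would read from the operational systems and load into the warehouse DBMS.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from an operational export (a CSV here)."""
    with open(path, newline="") as f:
        yield from csv.DictReader(f)

def transform(record):
    """Transform: cleanse and reshape one record for the warehouse."""
    return (
        int(record["order_id"]),
        record["customer"].strip().upper(),  # standardize text fields
        round(float(record["amount"]), 2),   # normalize the numeric format
    )

def load(rows, conn):
    """Load: insert the transformed rows into the warehouse table."""
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INT, customer TEXT, amount REAL)")
load((transform(r) for r in extract("orders_export.csv")), conn)
```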
  3. ETL and Metadata
     - Metadata is created by and updated from the load programs that move data into the DW or DM.
     - ETL tools generate and maintain centralized metadata.
       - Some ETL tools provide an open metadata exchange architecture that can be used to integrate all components of a DW architecture with the central metadata maintained by the ETL tool.
  4. Role of Metadata
     - Metadata allows the decision support user to find out where data is in the architected DW environment.
     - Metadata contains the following components (illustrated in the sketch below):
       - Identification of the source of the data
       - Description of the customization that occurs as the data passes from the DW to the DM
       - Descriptive information about the tables, attributes and relationships
       - Definitions
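As an illustration, the four components above could be captured in a structure like the following. Every name and value here is an invented example, not part of the original slides.

```python
# Hypothetical metadata entry for one warehouse table; all values are illustrative.
table_metadata = {
    "table": "dw.sales_fact",
    "source": {                      # identification of the source of the data
        "system": "ORDERS_LEGACY",
        "extract_job": "daily_sales_extract",
    },
    "customization": [               # transformations applied on the way to the DM
        "amounts converted from local currency to USD",
        "transaction codes remapped to summary reporting codes",
    ],
    "structure": {                   # descriptive information about attributes
        "sale_amount": "DECIMAL(12,2), NOT NULL",
        "customer_key": "FK -> dw.customer_dim",
    },
    "definitions": {                 # business definitions
        "sale_amount": "Net sale value after discounts, excluding tax",
    },
}
```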
  5. Role of Metadata
     - Three main layers of metadata exist in the DW:
       - Application-level (operational) metadata
       - Core warehouse metadata: a catalog of the data in the warehouse, based on abstractions of real-world entities such as project or customer
       - User-level metadata: maps the core warehouse metadata to useful business concepts
  6. Mistakes to Avoid
     - "Garbage in = garbage out": avoid loading dirty data into the DW.
     - Avoid building stovepipe data marts that do not integrate with central metadata definitions.
  7. ETL: Tools for Data Extraction, Transformation and Load
     - Techniques available:
       - Commercial off-the-shelf ETL tools (+)
       - Hand-written ETL in a procedural language (+/-)
       - Data replication of source data to the DW (-)
  8. Generations of ETL Tools
     - First generation: source code generation
       - Prism Executive Suite from Ardent Software, Inc.
       - Passport from Carleton Corporation
       - ETI-EXTRACT Tool Suite from Evolutionary Technologies, Inc.
       - Copy Manager from Information Builders, Inc.
       - SAS/Warehouse Administrator from SAS Institute, Inc.
     - Second generation: executable code generation
       - PowerMart from Informatica Corporation
       - Ardent DataStage from Ardent Software, Inc.
       - Data Mart Solution from Sagent Technology, Inc.
       - Tapestry from D2K, Inc.
       - Ab Initio from Ab Initio Software Corporation
       - Genio from LeoLogic
  9. Strengths and Limitations of First-Generation ETL Tools
     - Strengths
       - Tools are mature
       - Good at extracting data from legacy systems
     - Limitations
       - High product cost and complex training requirements
       - Extract programs must be compiled from source code
       - Single-threaded applications do not exploit parallel processors
       - Use of intermediate files limits throughput
       - Requirement to manage code, files, programs and JCL
       - Many transformations must be coded by hand
       - A significant amount of metadata must be generated manually
  10. Strengths and Limitations of Second-Generation ETL Tools
     - Strengths
       - Lower cost
       - Products are easy to learn
       - Generate extensible metadata
       - Generate directly executable code, which speeds up the extraction process
       - Parallel architecture
     - Limitations
       - Many tools are not mature
       - Some tools are limited to RDBMS data sources
       - Capacity problems when applied to large enterprise DW architectures
  11. ETL Process
     - Data transformation can be very complex and can include (several of these are sketched in code below):
       - Field translations (such as from mainframe to PC formats)
       - Data formatting (such as decimal to binary data formats)
       - Field formatting (truncating or padding field data when loaded into the DW)
       - Reformatting data structures (such as reordering table columns)
       - Replacing field data through table lookups (such as changing alpha codes to numeric IDs)
       - Remapping transaction codes to summary reporting codes
       - Applying logical data transformations based on user-defined business rules
       - Aggregating transaction-level values into "roll-up" balances
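A minimal sketch of a few of these transformations in Python. The lookup table, field widths, record layout and business-rule threshold are all hypothetical.

```python
# Hypothetical lookup table: alpha region codes -> numeric IDs.
REGION_LOOKUP = {"NE": 1, "SE": 2, "MW": 3, "W": 4}

def transform_record(rec):
    """Apply several of the transformation types listed above to one record."""
    return {
        # Field formatting: truncate/pad to a fixed 20-character width.
        "customer": rec["customer"][:20].ljust(20),
        # Table lookup: replace an alpha code with a numeric ID.
        "region_id": REGION_LOOKUP[rec["region_code"]],
        # Data formatting: normalize a string amount to a numeric value.
        "amount": round(float(rec["amount"]), 2),
        # Business rule: flag large transactions (invented threshold).
        "is_large": float(rec["amount"]) > 10_000,
    }

def roll_up(records):
    """Aggregate transaction-level amounts into roll-up balances per region."""
    balances = {}
    for rec in map(transform_record, records):
        balances[rec["region_id"]] = balances.get(rec["region_id"], 0.0) + rec["amount"]
    return balances
```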
  12. Steps of the Daily Production Extract
     - Primary extraction (reading the legacy format)
     - Identifying the changed records
     - Generalizing keys for changing dimensions
     - Transforming into load record images
     - Migration from the legacy system to the DW system
     - Sorting and building aggregates
     - Generalizing keys for aggregates
     - Loading
     - Processing exceptions
     - Quality assurance
     - Publishing
     A skeleton of such a pipeline appears below.
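One way to organize these steps is as an ordered pipeline whose stages run in sequence, with any failure halting that day's extract. The stage functions here are empty placeholders, not part of the original slides.

```python
def step(name):
    """Placeholder stage factory; a real pipeline would do work in each stage."""
    def run(data):
        print(f"running: {name}")
        return data
    return run

DAILY_EXTRACT = [
    step("primary extraction (read the legacy format)"),
    step("identify the changed records"),
    step("generalize keys for changing dimensions"),
    step("transform into load record images"),
    step("migrate from the legacy system to the DW system"),
    step("sort and build aggregates"),
    step("generalize keys for aggregates"),
    step("load"),
    step("process exceptions"),
    step("quality assurance"),
    step("publish"),
]

def run_pipeline(stages, data=None):
    """Run the stages in order; an exception in any stage halts the extract."""
    for stage in stages:
        data = stage(data)
    return data

run_pipeline(DAILY_EXTRACT)
```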
  13. Extraction Phase: Snapshots
     - Snapshots are taken periodically.
     - Identify what has changed since the last snapshot was built; the amount of effort depends on the scanning technique (the last one is sketched below):
       - Scan data that has been time-stamped
       - Scan a "delta" file
       - Scan an audit log
       - Modify the application code
       - Compare "before" and "after" image files
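A minimal sketch of the before/after image comparison, assuming each snapshot is a mapping from a record's primary key to its full image. The sample data is invented.

```python
def diff_snapshots(before, after):
    """Compare 'before' and 'after' images keyed by primary key."""
    inserted = {k: v for k, v in after.items() if k not in before}
    deleted = {k: v for k, v in before.items() if k not in after}
    updated = {k: after[k] for k in before.keys() & after.keys()
               if before[k] != after[k]}
    return inserted, updated, deleted

# Tiny illustrative snapshots.
yesterday = {1: ("ACME", 100.0), 2: ("GLOBEX", 250.0)}
today = {1: ("ACME", 120.0), 3: ("INITECH", 75.0)}
print(diff_snapshots(yesterday, today))
# -> ({3: ('INITECH', 75.0)}, {1: ('ACME', 120.0)}, {2: ('GLOBEX', 250.0)})
```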
  14. Loading Phase
     - All or part of the DW is taken off-line while new data is loaded.
     - One way to minimize the downtime due to loading is to MIRROR the DW.
     [Diagram: the DBMS sits on two mirrors (Mirror 1 and Mirror 2), each holding the data plus its aggregations]
     What are the benefits of using a mirrored DW?
  15. Mirrored Data Warehouse
     - High reliability during the day
     - During loading, the mirror is broken
     - Production queries and the loading process run in parallel against different mirrors
  16. Data Quality Assurance
     - At the end of the loading process, a quality assurance check is made on the data in the mirror that has been changed.
     - An all-disk to all-disk data transfer then takes place.
     [Diagram: two mirrored configurations; in each, ETL loads one mirror while the other serves the DBMS, QA runs on the loaded mirror, and a "QA fails" path is shown]
     A sketch of the kinds of checks a QA step might run appears below.
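The slides do not specify which checks the QA step runs; a minimal sketch of common row-count and required-field checks, with invented thresholds and field names, might look like this.

```python
def qa_checks(rows, expected_min_rows, required_fields):
    """Run basic QA on freshly loaded rows; return a list of failure messages."""
    failures = []
    if len(rows) < expected_min_rows:
        failures.append(f"row count {len(rows)} below expected {expected_min_rows}")
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) is None:
                failures.append(f"row {i}: missing required field '{field}'")
    return failures

loaded = [{"order_id": 1, "amount": 9.5}, {"order_id": 2, "amount": None}]
problems = qa_checks(loaded, expected_min_rows=2, required_fields=["order_id", "amount"])
if problems:
    print("QA failed; do not sync the mirrors:", problems)  # keep the mirrors split
```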
  17. Advanced Techniques
     - Use of three mirrors:
       - Two for data availability and redundancy
       - One for loading
     - Use of a segmentable fact table index:
       - Drop only the most recent section (segment) of the fact table's master index, rather than the index over the whole table.
       - This technique provides speed and allows the dropped portion of the index to be rebuilt quickly once the loading is complete (see the sketch below).
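On a DBMS with partitioned (segmentable) indexes, this maps onto marking just the current index partition unusable before the load and rebuilding only that partition afterwards. The sketch below issues Oracle-style DDL through a generic DB-API connection; the index, partition and table names are invented.

```python
def load_with_segmented_index(conn, rows):
    """Disable only the current index segment, bulk load, then rebuild it."""
    cur = conn.cursor()
    # Drop (disable) only the most recent index partition, not the whole index.
    cur.execute("ALTER INDEX sales_fact_idx MODIFY PARTITION p_current UNUSABLE")
    cur.executemany(
        "INSERT INTO sales_fact (day_key, product_key, amount) VALUES (:1, :2, :3)",
        rows,
    )
    # Rebuilding one partition is far faster than rebuilding the full index.
    cur.execute("ALTER INDEX sales_fact_idx REBUILD PARTITION p_current")
    conn.commit()
```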
  18. Loading, Indexing and Exception Processing
     - Loading data into a fact table should be done as a bulk loading operation with the master index turned off. A bulk load is much faster than processing the records one at a time with INSERT or UPDATE statements (see the sketch below).
     - The ability during a bulk data load to insert or overwrite, and the ability to insert or add to values, simplifies the logic of data loading.
     - Loading and indexing can gain enormous benefits from parallelization.
     - Loading data into a dimension table is quite different from loading data into a fact table:
       - The fact table typically has one or two large indexes built on a combination of dimension keys.
       - A dimension table typically has many indexes built on its textual attributes.
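A minimal illustration of the bulk-versus-row-at-a-time difference, using PostgreSQL's COPY interface through psycopg2. The connection string, table and file are hypothetical; on other platforms a dedicated bulk loader utility plays the same role.

```python
import psycopg2

conn = psycopg2.connect("dbname=warehouse")
cur = conn.cursor()

# Slow path: one round trip per record.
# for rec in records:
#     cur.execute("INSERT INTO sales_fact VALUES (%s, %s, %s)", rec)

# Fast path: stream the whole extract through the DBMS bulk-load interface.
with open("sales_extract.csv") as f:
    cur.copy_expert("COPY sales_fact FROM STDIN WITH (FORMAT csv)", f)
conn.commit()
```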
  19. Conforming Dimensions
     - Loading data from multiple sources raises the problem of incompatible granularity.
     - Conforming the dimensions means:
       - Forcing the two data sources to share identical dimensions
       - Expressing both data sources at the lowest common level of granularity
     A small worked example follows.
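For example, if one source reports daily figures and another reports monthly figures, the daily source can be rolled up to the coarser monthly grain so both share conformed (month, product) dimensions. All figures below are invented.

```python
from collections import defaultdict

# Source A at daily grain; source B at monthly grain (invented figures).
daily_sales = {("2001-01-05", "widget"): 100.0, ("2001-01-20", "widget"): 50.0}
monthly_forecast = {("2001-01", "widget"): 180.0}

def to_monthly(daily):
    """Roll the daily grain up to the lowest common grain (month)."""
    monthly = defaultdict(float)
    for (day, product), amount in daily.items():
        monthly[(day[:7], product)] += amount  # "YYYY-MM-DD" -> "YYYY-MM"
    return dict(monthly)

# Both facts now share identical (month, product) dimensions.
actuals = to_monthly(daily_sales)
for key in actuals.keys() & monthly_forecast.keys():
    print(key, "actual:", actuals[key], "forecast:", monthly_forecast[key])
```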
