IBM Ascential ETL Overview: DataStage and QualityStage
Part 1 of a 3-part presentation: PART 1: ETL Overview / PART 2: ETL QualityStage blocking and matching / PART 3: ETL QualityStage unduplication
Speaker notes:
  • Master data management – Reliably create and maintain consistent, complete, contextual, and accurate business information about entities such as customers and products across multiple systems.
  • Business intelligence – Take the guesswork out of important decisions by gathering, storing, analyzing, and providing access to diverse enterprise information.
  • Business transformation – Isolate users and applications from the underlying information completely to enable On Demand Business.
  • Infrastructure rationalization – Quickly and accurately streamline corporate information by repurposing and reconciling data whenever it is required.
  • Risk and compliance – Deliver a dependable information management foundation to any quality control, corporate reporting visibility, and data audit infrastructure.
  • DS Administrator is used for administration tasks such as setting up users, logging, creating and moving projects, and setting up purging criteria.
  • Administrator project property pages:
       • Permissions – Assign user categories to operating system user groups, or enable operators to view all the details of an event in a job log file.
       • Tracing – Enable or disable tracing on the server.
       • Schedule – Set up a user name and password to use for running scheduled DataStage jobs.
       • Mainframe – Set mainframe job properties and the default platform type.
       • Tunables – Configure cache settings for Hashed File stages.
       • Parallel – Set parallel job properties and defaults for date/time and number formats.
       • Sequence – Set compilation defaults for job sequences.
       • Remote – If you have specified that parallel jobs in the project are to be deployed on a USS system, this page allows you to specify deployment mode and USS machine details.
  • DataStage Designer – used to create DataStage applications (known as jobs). Each job specifies the data sources, the transformations required, and the destination of the data. Jobs are compiled to create executables that are scheduled by the Director and run on the server.
  • DataStage Director – used to validate, schedule, run, and monitor DataStage job sequences.
  • Constraint – Prevents data from getting into the processing piece of the ETL job (reject).
  • Derivation – Logic at the field level (example: is it “open”? (“click through”)).
Transcript of "ETL Overview (Extract, Transform, and Load)"

1. IBM Ascential ETL Overview: DataStage and QualityStage
2. More than ever, businesses today need to understand their operations, customers, suppliers, partners, employees, and stockholders. They need to know what is happening with the business, analyze their operations, react to market conditions, and make the right decisions to drive revenue growth, increase profits, and improve productivity and efficiency.
3. CIOs are responding to their organizations’ strategic needs by developing IT initiatives that align corporate data with business objectives. These initiatives include:
• Business intelligence
• Master data management
• Business transformation
• Infrastructure rationalization
• Risk and compliance
4. The IBM WebSphere Information Integration platform enables businesses to perform five key integration functions:
• Connect to any data or content, wherever it resides
• Understand and analyze information, including relationships and lineage
• Cleanse information to ensure its quality and consistency
• Transform information to provide enrichment and tailoring for its specific purposes
• Federate information to make it accessible to people, processes, and applications
5. The functions map to software as follows:
• Data Analysis: Define, annotate, and report on fields of business data. (Software: Profile stage in QualityStage)
• Data Quality: Standardize source data fields; match records across or within data sources and remove duplicate data; survive records from the best information across sources. (Software: QualityStage)
• Data Transformation & Movement: Move data and transform it to meet the requirements of its target systems. (Software: DataStage)
• Federation: Integrate data and content; provide views as if from a single source while maintaining source integrity. (Software: N/A, not used at NCEN)
This presentation will deal with ETL QualityStage and DataStage.
6. QualityStage = data cleansing
QualityStage is used to cleanse and enrich data to meet business needs and data quality management standards. Data preparation (often referred to as data cleansing) is critical to the success of an integration project. QualityStage provides a set of integrated modules for accomplishing data reengineering tasks, such as:
• Investigating
• Standardizing
• Designing and running matches
• Determining what data records survive
7. QualityStage: Main QS stages used in the BRM project:
• Investigate – gives you complete visibility into the actual condition of data (not used in the BRM project because the users really know their data)
• Standardize – allows you to reformat data from multiple systems to ensure that each data type has the correct and consistent content and format
• Match – helps to ensure data integrity by linking records from one or more data sources that correspond to the same real-world entity. Matching can be used to identify duplicate entities resulting from data entry variations or account-oriented business practices
• Survive – helps to ensure that the best available data survives and is correctly prepared for the target destination
8. QualityStage: Investigate → Standardize → Match → Survive
• Word Investigation parses freeform fields into individual tokens, which are analyzed to create patterns.
• In addition, Word Investigation provides frequency counts on the tokens.
9. QualityStage: Investigate → Standardize → Match → Survive
For example, to create the patterns in address data:
• Word Investigation uses a set of rules for classifying personal names, business names, and addresses.
• Word Investigation provides prebuilt rule sets for investigating patterns on names and postal addresses for a number of different countries.
For the United States, the address rule sets include:
• USPREP (parses name, address, and area if data was not previously formatted)
• USNAME (for individual and organization names)
• USADDR (for street and mailing addresses)
• USAREA (for city, state, ZIP code, and so on)
10. QualityStage: Investigate → Standardize → Match → Survive
Example: the test field “123 St. Virginia St.” would be analyzed in the following way:
• Field parsing breaks the address into the individual tokens “123”, “St.”, “Virginia”, and “St.”
• Lexical analysis determines the business significance of each piece:
   123 = Number
   St. = Street type
   Virginia = Alpha
   St. = Street type
• Context analysis identifies the variations in data structures and content of “123 St. Virginia St.”:
   123 = House number
   St. Virginia = Street address
   St. = Street type
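The parse-and-classify flow above can be sketched in a few lines of Python. This is an illustration only: the street-type table and the pattern letters (^ for a number, T for a street type, ? for an unclassified word) loosely mimic QualityStage conventions but are assumptions, not its actual rule-set engine.

# Hypothetical illustration of field parsing and lexical analysis.
STREET_TYPES = {"ST", "AVE", "BLVD", "RD"}   # assumed classification table

def investigate(field):
    tokens = [t.rstrip(".") for t in field.upper().split()]   # field parsing
    classes = []
    for tok in tokens:                                        # lexical analysis
        if tok.isdigit():
            classes.append("^")        # number
        elif tok in STREET_TYPES:
            classes.append("T")        # street type
        else:
            classes.append("?")        # generic alpha word
    return tokens, "".join(classes)

tokens, pattern = investigate("123 St. Virginia St.")
print(tokens, pattern)   # ['123', 'ST', 'VIRGINIA', 'ST'] ^T?T
# Context analysis would then map the ^T?T pattern to
# house number / street name / street type.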
11. QualityStage: Investigate → Standardize → Match → Survive
The Standardize stage allows you to reformat data from multiple systems to ensure that each data type has the correct and consistent content and format.
12. QualityStage: Investigate → Standardize → Match → Survive
Standardization is used to invoke specific standardization Rule Sets and standardize one or more fields using that Rule Set. For example, a Rule Set can be used so that “Boulevard” will always be “Blvd”, not “Boulevard”, “Blv.”, “Boulev”, or some other variation.
The list below shows some of the more commonly used Rule Sets:
• The USNAME rule set is used to standardize First Name, Middle Name, Last Name
• The USADDR rule set is used to standardize Address data
• The USAREA rule set is used to standardize City, State, Zip Code
• The VTAXID rule set is used to validate Social Security Number
• The VEMAIL rule set is used to validate Email Address
• The VPHONE rule set is used to validate Work Phone Number
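As a rough illustration of what a standardization rule set does, the Python sketch below maps spelling variants to one standard form. The variant dictionary is a made-up stand-in; real rule sets such as USADDR carry far richer, context-sensitive logic.

# Assumed variant table for illustration; not the actual USADDR rule set.
ADDR_STANDARDS = {
    "BOULEVARD": "BLVD", "BLV": "BLVD", "BOULEV": "BLVD",
    "STREET": "ST", "AVENUE": "AVE",
}

def standardize(field):
    # uppercase, tokenize, and replace any known variant with its standard form
    tokens = [t.rstrip(".") for t in field.upper().split()]
    return " ".join(ADDR_STANDARDS.get(t, t) for t in tokens)

print(standardize("100 Ocean Boulevard"))   # 100 OCEAN BLVD
print(standardize("100 Ocean Blv."))        # 100 OCEAN BLVD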
13. QualityStage: Investigate → Standardize → Match → Survive
Data matching is used to find records in a single data source or independent data sources that refer to the same entity (such as a person, organization, location, product, or material), regardless of the availability of a predetermined key.
14. QualityStage: Investigate → Standardize → Match → Survive
The QualityStage Matching stage basically consists of two steps:
• Blocking
• Matching
15. QualityStage: Investigate → Standardize → Match → Survive
Operations in the Matching module:
1. Unduplication (group records into sets having similar attributes), configured by:
   • Match Fields
   • Suspect Match Values by Match Pass
   • Vartypes
   • Cutoff Weights
2. Processing Files
Each record receives a match status:
• XA = master record (during the first pass, this was the first record found to match with another record)
• DA = duplicates
• CP = clerical procedure (records with a weighting within a set cutoff range)
• RA = residuals (those records that remain isolated)
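The sketch below illustrates the two steps and the status codes above under stated assumptions: records are blocked on ZIP code, the match weight is a crude string-similarity ratio rather than QualityStage's probabilistic field weights, and the cutoff values are invented.

from collections import defaultdict
from difflib import SequenceMatcher

MATCH_CUTOFF, CLERICAL_CUTOFF = 0.90, 0.75    # assumed cutoff weights

def weight(a, b):
    # crude stand-in for a real composite match weight
    return SequenceMatcher(None, a["name"] + a["addr"],
                           b["name"] + b["addr"]).ratio()

def unduplicate(records):
    blocks = defaultdict(list)
    for r in records:                          # step 1: blocking
        blocks[r["zip"]].append(r)
    status = {r["id"]: "RA" for r in records}  # residual until matched
    for block in blocks.values():              # step 2: matching within blocks
        for i, a in enumerate(block):
            for b in block[i + 1:]:
                w = weight(a, b)
                if w >= MATCH_CUTOFF:
                    if status[a["id"]] == "RA":
                        status[a["id"]] = "XA"   # first matched record = master
                    status[b["id"]] = "DA"       # duplicate
                elif w >= CLERICAL_CUTOFF and status[b["id"]] == "RA":
                    status[b["id"]] = "CP"       # route to clerical review
    return status

recs = [
    {"id": 1, "name": "JOHN SMITH", "addr": "123 VIRGINIA ST", "zip": "23220"},
    {"id": 2, "name": "JON SMITH",  "addr": "123 VIRGINIA ST", "zip": "23220"},
    {"id": 3, "name": "MARY JONES", "addr": "9 OAK AVE",       "zip": "23220"},
]
print(unduplicate(recs))   # {1: 'XA', 2: 'DA', 3: 'RA'}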
16. QualityStage: Investigate → Standardize → Match → Survive
Survivorship is used to create a ‘best record’ from all available information about an entity (such as a person, location, material, etc.). Survivorship and formatting ensure that the best available data survives and is correctly prepared for the target destination. Using the rules setup screen, it implements business and mapping rules, creating the necessary output structures for the target application and identifying fields that do not conform to load standards.
17. QualityStage: Investigate → Standardize → Match → Survive
The Survive stage does the following:
• Supplies missing values in one record with values from other records on the same entity
• Populates missing values in one record with values from corresponding records that have been identified as a group in the Matching stage
• Enriches existing data with external data
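A minimal survivorship sketch, assuming one simple rule applied per field across a matched group (“the longest non-empty value wins”); real Survive rules are user-defined per field and can be much richer.

def survive(group):
    # build the 'best record' field by field from the matched group
    best = {}
    for field in ("name", "addr", "phone"):
        values = [r[field] for r in group if r[field]]
        # assumed rule: the longest (most complete) non-empty value survives
        best[field] = max(values, key=len) if values else ""
    return best

group = [   # one matched set from the Matching stage (statuses XA/DA)
    {"status": "XA", "name": "JOHN SMITH", "addr": "123 VIRGINIA ST", "phone": ""},
    {"status": "DA", "name": "J SMITH",    "addr": "",                "phone": "804-555-0100"},
]
print(survive(group))
# {'name': 'JOHN SMITH', 'addr': '123 VIRGINIA ST', 'phone': '804-555-0100'}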
18. DataStage = data transformation
19. DataStage: In its simplest form, DataStage performs data transformation and movement from source systems to target systems in batch and in real time. The data sources may include indexed files, sequential files, relational databases, archives, external data sources, enterprise applications, and message queues.
20. DataStage: The DataStage client components are:
• DataStage Administrator
• DataStage Manager
• DataStage Designer
• DataStage Director
21. DataStage: Administrator → Manager → Designer → Director
Use DataStage Administrator to:
• Specify general server defaults
• Add and delete projects
• Set project properties
• Access the DataStage Repository by command interface
22. DataStage: Administrator → Manager → Designer → Director
23. DataStage: Administrator → Manager → Designer → Director
DataStage Manager is the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. It also allows us to move or copy ETL jobs from one project to another.
24. DataStage: Administrator → Manager → Designer → Director
Use DataStage Designer to:
• Specify how the data is extracted
• Specify data transformations
• Decode (denormalize) data going into the data mart using reference lookups
• Aggregate data
• Split data into multiple outputs on the basis of defined constraints
25. DataStage: Administrator → Manager → Designer → Director
Use DataStage Director to run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs, and examine job logs for debugging purposes.
26. DataStage: Getting Started
• Set up a project – Before you can create any DataStage jobs, you must set up your project by entering information about your data.
• Create a job – When a DataStage project is installed, it is empty, and you must create the jobs you need in DataStage Designer.
• Define Table Definitions
• Develop the job – Jobs are designed and developed using the Designer. Each data source, the data warehouse, and each processing step is represented by a stage in the job design. The stages are linked together to show the flow of data.
27. DataStage Designer: Developing a job
28. DataStage Designer: Developing a job
29. DataStage Designer: Input Stage
30. DataStage Designer: Transformer Stage
The Transformer stage performs any data conversion required before the data is output to another stage in the job design. After you are done, compile and run the job.
31. DataStage Designer
32. DataStage Designer
33. DataStage Designer
34. DataStage Designer
35. DataStage: An example – preventing the header row from inserting into MDM_Contact and MDM_Broker
• T10 takes .txt files from the Pre-event folder and transforms them into rows.
• Straight_moves moves the rows into the stg_file_contact table, the stg_file_broker table, or the reject file.
   • If a row says “lead source”, it goes to the reject file (constraint).
   • If it does not say “lead source”, the entire row is evaluated to determine whether it goes to the contact or broker table (derivation).
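In plain Python, the routing logic of that job might look like the sketch below. The “lead source” header test and the table names come from the slide; the pipe-delimited layout and the first-field broker/contact flag are hypothetical stand-ins for the real derivation.

def route_row(row):
    fields = row.split("|")                  # assumed pipe-delimited layout
    if "lead source" in row.lower():         # constraint: reject header rows
        return "reject_file", fields
    if fields[0].upper() == "B":             # derivation (assumed broker flag)
        return "stg_file_broker", fields
    return "stg_file_contact", fields

for line in ("lead source|name|phone",
             "B|ACME REALTY|804-555-0100",
             "C|JOHN SMITH|804-555-0199"):
    print(route_row(line))
# -> reject_file, then stg_file_broker, then stg_file_contact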
36. Questions?
37. Thank you for attending