1. DATA STAGE
2. DATA STAGE:
DataStage is an ETL tool that uses a client/server design: jobs are created and administered through Windows clients against a central repository on a server.
Purpose of using DataStage:
1. Design jobs for extraction, transformation and loading.
2. Schedule, run and monitor jobs, all within DataStage.
3. Create batch-controlling jobs.
3. DATA STAGE COMPONENTS:
[Architecture diagram] The DataStage client tools (Administrator, Manager, Designer, Director) work against the DataStage server, which moves data between the database, the data warehouse and the data marts.
4. DATA STAGE CLIENT COMPONENTS:
• Data Stage Administrator: used to create or delete projects and to clean up the metadata stored in the repository.
• Data Stage Manager: used to perform tasks such as:
a) Create table definitions.
b) Back up and restore metadata.
c) Create customized components.
5. DATA STAGE CLIENT AND SERVER COMPONENTS:
• Data Stage Director: used to validate, schedule, run and monitor DataStage jobs.
• Data Stage Designer: used to create DataStage applications, known as jobs. The following activities can be performed in the Designer window:
a) Create source definitions.
b) Create target definitions.
c) Develop transformation rules.
d) Design jobs.
• Data Stage Repository: a server-side component that stores the information needed to build the data warehouse.
• Data Stage Server: the server-side component that executes the jobs we create with the clients.
6. DATA STAGE JOBS:
Job: a job is an ordered series of individual stages linked together to describe the flow of data from source to target.
Types of jobs:
• Server jobs
• Parallel Jobs
• Mainframe Jobs
7. DATA STAGE JOBS:
Server jobs:
• Handle smaller volumes of data with lower performance.
• Have fewer components.
• Data processing is slower.
• Work on SMP (Symmetric Multi Processing).
• Rely heavily on the Transformer stage.
Parallel jobs:
• Handle high volumes of data.
• Work on parallel-processing concepts.
• Apply parallelism techniques.
• Follow MPP (Massively Parallel Processing).
• Have more components than server jobs.
• Are built on the Orchestrate framework.
8. STAGE:
A stage represents a database, a file or a processing step.
Built-in stages: these stages define the extraction, transformation and loading. They are divided into two types:
• Passive stages: stages that define read and write access.
EX: all database and file stages in the Designer palette.
• Active stages: stages that transform or filter the data.
EX: all processing stages.
9. PARALLELISM TECHNIQUES:
Parallelism is the process of performing the ETL tasks in a parallel fashion when building the data warehouse. Parallel jobs support hardware architectures such as SMP and MPP to achieve this parallelism.
There are two types of parallelism techniques:
• Pipeline parallelism.
• Partition parallelism.
10. TYPES OF PARALLELISM TECHNIQUES:
• Pipeline parallelism: the data flows continuously through the pipeline, and all stages in the job operate simultaneously.
EX: if the source has 4 records, as soon as the first record finishes in one stage it moves on to the next stage, so downstream stages process earlier records while upstream stages read the remaining ones.
• Partition parallelism: the same job is effectively run simultaneously by several processors, and each processor handles a separate subset of the total records.
EX: the source has 100 records and 4 partitions. The data is partitioned evenly, so each partition gets 25 records, and all four partitions are processed simultaneously and in parallel.
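As a rough illustration only (ordinary Python, not DataStage code), the sketch below splits the 100 records of the example into 4 partitions and processes them with a worker pool; the record contents and the upper-casing "stage" are assumptions made up for this example.

# Illustrative sketch only: partition parallelism with 100 records and 4 partitions.
from concurrent.futures import ProcessPoolExecutor

def process_partition(partition):
    # Stand-in for the real stage logic: here we just upper-case each record.
    return [rec.upper() for rec in partition]

if __name__ == "__main__":
    records = [f"record_{i}" for i in range(100)]        # 100 source records
    n_partitions = 4
    # Round-robin split: partition k gets records k, k+4, k+8, ...
    partitions = [records[k::n_partitions] for k in range(n_partitions)]

    # Each partition is processed simultaneously on its own worker.
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = list(pool.map(process_partition, partitions))

    print([len(p) for p in results])                     # -> [25, 25, 25, 25]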
11. DATA PARTITIONING TECHNIQUES IN DATA STAGE:
In data partitioning, the data is split into partitions that are distributed across the processors.
The data partitioning techniques are:
• Auto
• Hash
• Modulus
• Random
• Range
• Round Robin
12. DATA PARTITIONING TECHNIQUES:
• Round Robin: the first record goes to the first processing node, the second record goes to the second processing node, and so on. This method is useful for creating equal-sized partitions.
• Hash: records with the same values in the hash-key field go to the same processing node.
• Modulus: partitioning is based on the key column modulo the number of partitions; it is similar to hash partitioning but works on a numeric key.
• Random: records are randomly distributed across all processing nodes.
• Range: related records are distributed to the same node; the ranges are specified on a key column.
• Auto: the most common method; DataStage determines the best partitioning method to use depending on the type of stage.
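The following Python sketch (an illustration, not DataStage internals) shows how round-robin, hash and modulus partitioning might assign rows to processing nodes; the 4-node count and the sample key values are assumptions.

# Illustrative sketch only: assigning rows to nodes by round robin, hash and modulus.
N_NODES = 4

def round_robin_node(row_index: int) -> int:
    # Row 0 -> node 0, row 1 -> node 1, ... wrapping around.
    return row_index % N_NODES

def hash_node(key_value: str) -> int:
    # Rows with the same key value land on the same node.
    return hash(key_value) % N_NODES

def modulus_node(key_value: int) -> int:
    # Like hash partitioning, but defined directly on a numeric key column.
    return key_value % N_NODES

rows = [(i, f"CUST{i % 7}") for i in range(10)]          # (row_index, key)
for idx, key in rows:
    print(idx, key, round_robin_node(idx), hash_node(key), modulus_node(idx))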
13. FILE STAGES:
• Sequential File stage: a file stage used to read data from a flat file or write data to a flat file. It supports a single input link or a single output link, as well as a reject link.
14. DATASET:
• Dataset: a file stage used to store data in DataStage's internal format, which is tied to the operating system, so it takes less time to read or write the data.
15. FILE SET:
• File Set: a file stage used to read or write data as a file set. The file is saved with the extension '.fs', and it operates in parallel.
16. DATASET vs FILE SET:
Data set:
• Stores data in binary in the internal format of DataStage, so it takes less time to read/write from a dataset than from any other source/target.
• Preserves the partitioning scheme, so you don't have to partition the data again.
• You cannot view the data outside DataStage.
File set:
• Stores data in a format similar to a sequential file.
• Its only advantage over a sequential file is that it preserves the partitioning scheme.
• You can view the data, but only in the order defined by the partitioning scheme.
17. COMPLEX FLAT FILE:
• This stage is used to read data from mainframe files. Using CFF we can read ASCII or EBCDIC (Extended Binary Coded Decimal Interchange Code) data. We can select the required columns and omit the remaining ones. We can collect the rejects (badly formatted records) by setting the Reject property to Save (other options: Continue, Fail). We can also flatten arrays (COBOL files).
18. TYPES OF PROCESSING STAGES:
• Aggregator stage: a processing stage used to compute summaries for groups of input data. It supports a single input link, which carries the input data, and a single output link, which carries the aggregated data.
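A minimal Python sketch of what an Aggregator stage does conceptually, grouping rows on a key and producing one summary row per group; the column names (dept, salary) are invented for the example.

# Illustrative sketch only: grouping input rows and producing summary rows.
from collections import defaultdict

rows = [
    {"dept": "HR", "salary": 4000},
    {"dept": "HR", "salary": 5000},
    {"dept": "IT", "salary": 7000},
]

totals = defaultdict(lambda: {"count": 0, "sum": 0})
for row in rows:
    grp = totals[row["dept"]]            # group by the key column
    grp["count"] += 1                    # count of rows per group
    grp["sum"] += row["salary"]          # sum of the aggregated column

for dept, agg in totals.items():
    print(dept, agg["count"], agg["sum"])   # one output row per group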
19. • Copy stage: a processing stage used simply to copy the input data to a number of output links. It supports a single input link and multiple output links.
20. • Filter stage: a processing stage used to filter the data based on a given condition. It supports a single input link and any number of output links, and optionally one reject link.
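A small Python sketch of the Filter stage idea, routing each row to the output link or the reject link based on a condition; the condition and column names are assumptions.

# Illustrative sketch only: routing rows to an output link or a reject link.
rows = [{"name": "a", "salary": 7000}, {"name": "b", "salary": 3000}]

output_link, reject_link = [], []
for row in rows:
    if row["salary"] > 5000:     # filter condition
        output_link.append(row)
    else:
        reject_link.append(row)  # rows that fail the condition

print(output_link, reject_link)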
21. • Switch stage: a processing stage used to route the input data to different outputs based on the given conditions. It supports a single input link and up to 128 output links.
22. • Join stage: a processing stage used to combine two or more input datasets based on a key field. It supports two or more input datasets and one output dataset, and it does not support a reject link.
• The Join stage can perform inner, left-outer, right-outer and full-outer joins.
23. 1. Inner join: displays the matched records from both tables.
2. Left-outer join: shows the matched records from both sides as well as the unmatched records from the left table.
3. Right-outer join: shows the matched records from both sides as well as the unmatched records from the right table.
4. Full-outer join: shows the matched as well as the unmatched records from both sides.
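A short Python sketch of the four join types on a key column, using two made-up in-memory tables (an illustration, not DataStage code).

# Illustrative sketch only: inner, left-outer, right-outer and full-outer joins on a key.
left  = {1: "Alice", 2: "Bob"}            # key -> name
right = {2: "Sales", 3: "Finance"}        # key -> dept

inner       = {k: (left[k], right[k]) for k in left.keys() & right.keys()}
left_outer  = {k: (left[k], right.get(k)) for k in left}
right_outer = {k: (left.get(k), right[k]) for k in right}
full_outer  = {k: (left.get(k), right.get(k)) for k in left.keys() | right.keys()}

print(inner)        # matched rows only
print(left_outer)   # matched + unmatched rows from the left table
print(right_outer)  # matched + unmatched rows from the right table
print(full_outer)   # matched + unmatched rows from both tables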
25. • Merge stage: a processing stage used to merge multiple input datasets. It supports multiple input links; the first input link is called the master link and the remaining links are called update links.
• It can perform inner and left-outer joins only.
26. • Look-up stage: a processing stage used to look up values in reference tables (for example, relational tables). It supports multiple input links, a single output link and a single reject link.
• This stage can perform inner and left-outer joins.
27. The stage has two options:
• Condition not met.
• Lookup failure.
By default both are set to Fail. If "condition not met" = Drop, the stage behaves like an inner join; if "condition not met" = Continue, it behaves like a left-outer join. If the lookup-failure option is set to Reject, the unmatched rows are sent down the reject link.
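A hedged Python sketch of how these options map to join behaviour, assuming "drop" / "continue" / "reject" settings; the table contents and function name are invented for the example.

# Illustrative sketch only: how the lookup-failure setting changes the result.
source = [{"id": 1}, {"id": 2}]
lookup = {1: "Alice"}                     # reference table keyed on id

def do_lookup(rows, table, on_failure="continue"):
    out, rejects = [], []
    for row in rows:
        if row["id"] in table:
            out.append({**row, "name": table[row["id"]]})
        elif on_failure == "continue":    # left-outer-join behaviour
            out.append({**row, "name": None})
        elif on_failure == "drop":        # inner-join behaviour
            continue
        else:                             # "reject": unmatched rows go to the reject link
            rejects.append(row)
    return out, rejects

print(do_lookup(source, lookup, "drop"))
print(do_lookup(source, lookup, "continue"))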
28. • Funnel stage: an active processing stage used to combine multiple input datasets into a single output dataset.
29. • Remove Duplicates stage: a processing stage used to remove duplicate data based on a key field. It supports a single input link and a single output link.
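A small Python sketch of duplicate removal on a key column, keeping the first row seen for each key; the data is invented for the example.

# Illustrative sketch only: keeping one row per key value.
rows = [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}]

seen, deduped = set(), []
for row in rows:
    if row["id"] not in seen:    # key column for duplicate detection
        seen.add(row["id"])
        deduped.append(row)

print(deduped)                   # [{'id': 1, 'v': 'a'}, {'id': 2, 'v': 'c'}]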
30. • Sort stage: a processing stage used to sort data based on a key field, in either ascending or descending order. It supports a single input link and a single output link.
31. • Modify stage: a processing stage used for null handling and data type changes. It is used to change data types; for example, if the source column is a varchar and the target column is an integer, we use the Modify stage to convert it according to the requirement. We can also modify column lengths.
32. • Pivot stage: a processing stage that only CONVERTS COLUMNS INTO ROWS and nothing else. Some DataStage professionals refer to this as normalization.
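A small Python sketch of pivoting columns into rows; the quarter columns (q1..q3) and the sales data are invented for the example.

# Illustrative sketch only: turning a set of columns into rows.
wide_rows = [{"emp": "Alice", "q1": 10, "q2": 20, "q3": 30}]

long_rows = []
for row in wide_rows:
    for col in ("q1", "q2", "q3"):          # columns being pivoted into rows
        long_rows.append({"emp": row["emp"], "quarter": col, "sales": row[col]})

print(long_rows)   # one output row per (emp, quarter) pair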
33. • Surrogate Key Generator: an important processing stage used to generate sequence numbers when implementing slowly changing dimensions. A surrogate key is a system-generated key on dimension tables.
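A minimal Python sketch of surrogate-key generation as a simple increasing sequence; the column names are invented, and the real stage can also draw keys from a state file or a database sequence.

# Illustrative sketch only: generating surrogate keys as an increasing sequence.
import itertools

surrogate_key = itertools.count(start=1)   # 1, 2, 3, ...

new_dimension_rows = [{"cust_code": "C10"}, {"cust_code": "C11"}]
for row in new_dimension_rows:
    row["cust_sk"] = next(surrogate_key)   # system-generated key column

print(new_dimension_rows)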
34. • Transformer stage: an active processing stage that allows filtering the data based on a given condition and deriving new data definitions by writing expressions. The stage is compiled before it runs (the parallel Transformer requires a C++ compiler on the server). The Transformer stage can perform data-cleaning and data-scrubbing operations. It can have a single input link, any number of output links, and a reject link.
36. As the slide shows, the Transformer stage has stage variables, derivations and constraints.
• Stage variable: an intermediate processing variable that retains its value during processing and is not passed to a target column.
• Derivation: an expression that specifies the value to be passed on to a target column.
• Constraint: a condition that evaluates to true or false and controls the flow of data down a link.
Order of execution: stage variables are evaluated first, then the link constraints, then the column derivations.
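A rough Python sketch of that order of execution for each input row (an illustration, not DataStage code); the column names, the bonus stage variable and the salary constraint are assumptions.

# Illustrative sketch only: stage variables, then the constraint, then derivations.
rows = [{"salary": 4000}, {"salary": 8000}]

output_link = []
for row in rows:
    # 1. Stage variable: an intermediate value, not written to the target.
    sv_bonus = row["salary"] * 0.10
    # 2. Constraint: decides whether the row flows down this output link.
    if row["salary"] > 5000:
        # 3. Derivations: expressions that build the target columns.
        output_link.append({"salary": row["salary"], "bonus": sv_bonus})

print(output_link)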
37. • Change Capture stage: an active processing stage used to capture the changes between two sources, a "before" and an "after" dataset. The source used as the reference for the change is called the after dataset; the source we are checking for changes is called the before dataset. A change code is added to the output dataset, and this change code identifies whether a record is a delete, an insert or an update.
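A hedged Python sketch of change capture: compare a before and an after dataset on a key and tag each output row with a change code. The insert (1) and update (3) codes match the values quoted later in this deck; the copy (0) and delete (2) codes are assumptions for this example.

# Illustrative sketch only: tagging rows with a change code by comparing two datasets.
before = {1: "Alice", 2: "Bob", 3: "Carol"}
after  = {1: "Alice", 2: "Robert", 4: "Dave"}

changes = []
for key in before.keys() | after.keys():
    if key not in before:
        changes.append((key, after[key], 1))          # insert
    elif key not in after:
        changes.append((key, before[key], 2))         # delete
    elif before[key] != after[key]:
        changes.append((key, after[key], 3))          # edit / update
    else:
        changes.append((key, after[key], 0))          # copy (unchanged)

print(sorted(changes))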
38. • Row Generator stage: produces a set of data that fits the specified metadata. It is useful when you want to test a job but have no real data available to process. It has no input links and a single output link.
39. • Column Generator stage: adds columns to the incoming data and generates mock data for those columns for each row processed. It can have a single input link and a single output link.
• EX: if the input file has some records and we need to generate a unique id for each record, we can add that unique id column with the Column Generator stage.
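A tiny Python sketch of generating an extra column (a unique id) for each incoming row, as in the example above; the data is invented.

# Illustrative sketch only: adding a generated column to every incoming row.
import itertools

incoming = [{"name": "Alice"}, {"name": "Bob"}]

row_id = itertools.count(start=1)
outgoing = [{**row, "unique_id": next(row_id)} for row in incoming]

print(outgoing)   # each row now carries a generated unique_id column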
40. • Head stage: helpful for testing and debugging applications with large datasets. It selects the TOP N rows from the input dataset and copies them to the output dataset. It can have a single input link and a single output link.
41. • Tail stage: helpful for testing and debugging applications with large datasets. It selects the BOTTOM N rows from the input dataset and copies them to the output dataset. It can have a single input link and a single output link.
42. • Peek stage: can have a single input link and any number of output links. It is used to print record column values to the job log.
43. • The Slowly Changing Dimension (SCD) stage is a processing stage that works within the context of a star schema database. The SCD stage has a single input link, a single output link, a dimension reference link, and a dimension update link.
• The SCD stage reads source data on the input link, performs a dimension table lookup
on the reference link, and writes data on the output link. The output link can pass data
to another SCD stage, to a different type of processing stage, or to a fact table. The
dimension update link is a separate output link that carries changes to the dimension.
You can perform these steps in a single job or a series of jobs, depending on the number
of dimensions in your database and your performance requirements.
46. SCD type-2:
• Adds a new row to a dimension table.
• Each SCD stage processes a single dimension and performs lookups by using an equality
matching technique. If the dimension is a database table, the stage reads the database
to build a lookup table in memory. If a match is found, the SCD stage updates rows in
the dimension table to reflect the changed data. If a match is not found, the stage
creates a new row in the dimension table. All of the columns that are needed to create a new dimension row must be present in the source data.
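A minimal Python sketch of SCD type-2 behaviour as described above, assuming an in-memory dimension keyed on a business key with a current-row flag; the real stage works against database tables and its in-memory lookup, and the column names here are invented.

# Illustrative sketch only: SCD type-2 logic on a tiny in-memory dimension.
import itertools

dimension = []                               # list of dimension rows
surrogate_key = itertools.count(start=1)

def apply_scd2(source_row):
    # Find the current version of this business key, if any.
    current = next((d for d in dimension
                    if d["cust_code"] == source_row["cust_code"] and d["current"]), None)
    if current and current["city"] == source_row["city"]:
        return                               # no change -> nothing to do
    if current:
        current["current"] = False           # expire the existing version
    dimension.append({"sk": next(surrogate_key),
                      "cust_code": source_row["cust_code"],
                      "city": source_row["city"],
                      "current": True})      # new version becomes current

apply_scd2({"cust_code": "C1", "city": "Pune"})
apply_scd2({"cust_code": "C1", "city": "Delhi"})   # change -> new row added
print(dimension)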
48. • EX: the source is an EMP table with 100 records. When the job is run for the first time, the records are loaded through the target insert path: the two input datasets are compared, and on the first run none of the source records exist in the target.
• The Change Capture stage therefore assigns change code = 1 (insert). The Transformer stage moves the records from source to target, generating a sequence number for each record (for example via a stored-procedure stage).
• If any record is updated at the source, that updated record is written to the target update path (TGT_UPDATE). The two input datasets are compared again; because a change occurred at the source, the Change Capture stage assigns change code = 3 (update), and based on this code the Transformer stage sends the records to the Join stage through the update link.
• The Join stage joins the updated records with the target records, duplicates are removed with a Remove Duplicates stage, and the output of the Join stage is connected to a Transformer stage that moves the updated records to the target update stage.