Datastage Parallel Jobs vs Datastage Server Jobs
Transcript

Datastage Parallel Jobs vs Datastage Server Jobs:

1) The basic difference between server and parallel jobs is the degree of parallelism. Server job stages do not have a built-in partitioning and parallelism mechanism for extracting and loading data between different stages. What we can do to enhance speed and performance in server jobs is to enable inter-process row buffering through the Administrator; this helps stages exchange data as soon as it is available in the link. We can also use the IPC stage, which lets one passive stage read data from another as soon as data is available. In other words, stages do not have to wait for the entire set of records to be read first and then transferred to the next stage. The Link Partitioner and Link Collector stages can be used to achieve a certain degree of partitioning parallelism. These features, which have to be set up explicitly in server jobs, are built into Datastage PX.

2) The PX engine runs on a multiprocessor system and takes full advantage of the processing nodes defined in the configuration file. Both SMP and MPP architectures are supported by Datastage PX.

3) PX takes advantage of both pipeline parallelism and partitioning parallelism. Pipeline parallelism means that as soon as data is available between stages (in pipes or links), it can be exchanged between them without waiting for the entire record set to be read. Partitioning parallelism means that the entire record set is partitioned into smaller sets and processed on different nodes (logical processors). For example, if there are 100 records and 4 logical nodes, each node processes 25 records. This greatly increases loading speed; imagine situations where billions of records have to be loaded daily. This is where Datastage PX comes as a boon for the ETL process. (A small sketch of these two parallelism styles appears after point 9 below.)

4) In parallel jobs we have the Data Set, which acts as intermediate data storage between linked jobs. It is the best storage option because it stores data in Datastage's internal format.

5) In parallel jobs we can choose to display the OSH, which gives information about how the job works.

6) In the parallel Transformer there is no reference link; in server jobs a reference link can be given to the Transformer. A parallel job can use both BASIC and parallel-oriented functions.

7) Server jobs are executed by the Datastage server engine, while parallel jobs are executed under the control of the Datastage parallel runtime environment.

8) Server jobs are compiled into BASIC (interpreted pseudo-code), while parallel jobs are compiled into OSH (the Orchestrate scripting language).

9) Debugging and testing stages are available only in the Parallel Extender.
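For illustration only, here is a minimal Python sketch (not DataStage or OSH code) of the two parallelism styles described in point 3: the extract step yields rows one at a time so downstream work can start immediately (pipeline parallelism), and the record set is split round-robin across four workers, 25 rows each (partitioning parallelism). All function names and the 100-row example are assumptions made for this sketch.

# Minimal Python sketch (not DataStage/OSH) of pipeline vs partitioning parallelism.
from multiprocessing import Pool


def extract():
    # Pipeline parallelism idea: yield rows one at a time so the next
    # "stage" can start working before the whole record set is read.
    for i in range(100):
        yield {"id": i, "amount": i * 10}


def transform(row):
    # Trivial per-row transformation, standing in for a Transformer stage.
    row["amount_with_tax"] = round(row["amount"] * 1.2, 2)
    return row


def transform_partition(partition):
    # Work done by one "logical node" on its slice of the data.
    return [transform(row) for row in partition]


def partition_round_robin(rows, nodes):
    # Partitioning parallelism idea: 100 rows over 4 nodes -> 25 rows each,
    # matching the example in point 3.
    buckets = [[] for _ in range(nodes)]
    for i, row in enumerate(rows):
        buckets[i % nodes].append(row)
    return buckets


if __name__ == "__main__":
    partitions = partition_round_robin(extract(), nodes=4)
    print([len(p) for p in partitions])      # [25, 25, 25, 25]
    with Pool(processes=4) as pool:          # one process per "logical node"
        results = pool.map(transform_partition, partitions)
    print(sum(len(r) for r in results))      # 100 rows processed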
10) Several processing stages are not available in Server jobs, for example Join, CDC, Lookup, etc.

11) Among the file stages, the Hashed File is available only in Server jobs, while the Complex Flat File, Data Set, and Lookup File Set are available only in parallel jobs.

12) The server Transformer supports the BASIC language, while the parallel Transformer is based on C++.

14) Lookup against a sequential file is possible in parallel jobs.

15) In parallel jobs we can specify several file paths to fetch data from by using a file pattern, similar to the Folder stage in Server, while in server jobs we can specify only one file name per input link.

16) In Server we can simultaneously give both an input and an output link to a Sequential File stage. In parallel, an additional output link means a reject link, that is, a link that collects records that fail to load into the sequential file for some reason. (A sketch of the file-pattern and reject-link ideas appears after point 18 below.)

17) There is also a difference in the file size restriction: the sequential file size limit in Server is 2 GB, while in parallel there is no such limitation.

18) The parallel Sequential File stage has filter options too, where we can specify the file pattern.
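As a rough illustration of points 15 and 16, here is a small Python sketch (not DataStage) that reads every file matching a pattern and routes rows that fail parsing to a reject output instead of aborting. The file names and the two-column layout (id, amount) are assumptions made only for this example.

# Python sketch (not DataStage): file-pattern input plus a reject output.
import csv
import glob


def read_with_rejects(pattern, reject_path):
    good = []
    with open(reject_path, "w", newline="") as reject_file:
        reject_writer = csv.writer(reject_file)
        for path in glob.glob(pattern):              # e.g. "input/sales_*.csv"
            with open(path, newline="") as source:
                for row in csv.reader(source):
                    try:
                        good.append({"id": int(row[0]), "amount": float(row[1])})
                    except (IndexError, ValueError):
                        reject_writer.writerow(row)  # the "reject link" equivalent
    return good


if __name__ == "__main__":
    rows = read_with_rejects("input/sales_*.csv", "sales_rejects.csv")
    print(f"loaded {len(rows)} rows")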
Introduction to Datastage Enterprise Edition (EE)

Datastage Enterprise Edition, formerly known as Datastage PX (Parallel Extender), has recently become part of IBM InfoSphere Information Server, and its official name is IBM InfoSphere DataStage. With the recent versions of Datastage (7.5, 8, 8.1), IBM does not release any updates to Datastage Server Edition (although it is still available in Datastage 8), and they seem to put the biggest effort into developing and enriching the Enterprise Edition of the InfoSphere product line.

Key Datastage Enterprise Edition concepts.

Project environment:

1. We work with flat files and an Oracle database as the source.

2. We get data in two ways: the push technique and the pull technique.

3. Most of the time we get the data using the push technique (with the push technique, the client himself sends the data to our server environment).

4. If the situation is such that it is our responsibility to fetch the data from the client's server (the client gives us properly authenticated privileges to access his server), then we go for the pull technique.

5. In our Unix (server) environment we have a particular file structure.

6. Whatever files we receive from the client are placed in the drop box.

7. Then we move the received files to the input files folder.

8. From there we dump the files into the staging area, where we cleanse the data.

9. After applying the required business logic (transformations) we move the data to the ODS (operational data store). From there we apply SCDs on the data we got from the ODS.

10. The resulting data is then sent to the data warehouse.

11. Whatever files we had in the input files folder are then moved to the archive folder (for backup and future purposes). (A sketch of this file movement appears after this list.)

12. While running some jobs, if we want to send the resulting data to the output files folder, we specify the path of the output files folder (i.e. the data file generated after execution).

13. For dataset files we give the path of the dataset folder where we want to store the datasets related to our project.

14. The reject file folder contains files from staging and the ODS. These files are generally produced as part of cleansing and transformation.
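Purely for illustration, here is a minimal Python sketch of the file movement described in steps 6, 7, and 11 above: files arrive in the drop box, are moved to the input files folder for processing, and are archived after the load. The base path, directory names, and functions are hypothetical; a real project would use its own paths, scheduling, and error handling.

# Minimal sketch (assumed paths) of the drop box -> input files -> archive flow.
import shutil
from pathlib import Path

BASE = Path("/project")                  # hypothetical project root on the Unix server
DROPBOX = BASE / "dropbox"
INPUT_FILES = BASE / "input_files"
ARCHIVE = BASE / "archive"


def collect_from_dropbox():
    # Step 7: move newly received client files from the drop box to the input folder.
    for f in DROPBOX.iterdir():
        shutil.move(str(f), str(INPUT_FILES / f.name))


def archive_inputs():
    # Step 11: after the load, move processed input files to the archive folder
    # for backup and future reference.
    for f in INPUT_FILES.iterdir():
        shutil.move(str(f), str(ARCHIVE / f.name))


if __name__ == "__main__":
    for d in (DROPBOX, INPUT_FILES, ARCHIVE):
        d.mkdir(parents=True, exist_ok=True)
    collect_from_dropbox()
    # ... run the staging / ODS / warehouse loads here ...
    archive_inputs()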
About the project:

1. Flat files and an Oracle database are our sources.

2. It is a sales domain, and the main intention of this project is to get the total sales information based on location.

3. Because of the U.S. recession, Publix is facing bad sales and bad revenue in particular locations, while at the same time doing very well in terms of revenue in certain other places.

4. To identify the total revenue and the bad-sales information, Publix kicked off this project.

5. In our project we have 18 dimension tables and 11 fact tables.

6. Of those, I was involved in developing 4 dimensions and 2 fact tables.

7. That is 20 DS jobs for the 4 dimensions and 9 jobs for the 2 fact tables.

8. Our data warehouse size is 1.5 TB.

9. This project follows a top-down approach.

10. We load the data into the data warehouse; there are no data marts in our project.