DS314SVR STUDENT GUIDE
Copyright

This document and the software described herein are the property of Ascential Software Corporation and its licensors and contain confidential trade secrets. All rights to this publication are reserved. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system or translated into any language, in any form or by any means, without prior permission from Ascential Software Corporation.

Copyright © 2002 Ascential Software Corporation. All Rights Reserved.

Ascential Software Corporation reserves the right to make changes to this document and the software described herein at any time and without notice. No warranty is expressed or implied other than any contained in the terms and conditions of sale.

Ascential Software Corporation
50 Washington Street
Westboro, MA 01581-1021 USA
Phone: (508) 366-3888
Fax: (508) 389-8749

Ardent, Axielle, DataStage, Iterations, MetaBroker, MetaStage, and uniVerse are registered trademarks of Ascential Software Corporation. Pick is a registered trademark of Pick Systems. Ascential Software is not a licensee of Pick Systems. Other trademarks and registered trademarks are the property of the respective trademark holder.
Table of Contents

Module 1: Introduction to DataStage
Module 2: Installing DataStage
Module 3: Configuring Projects
Module 4: Designing and Running Jobs
Module 5: Working with Metadata
Module 6: Working with Relational Data
Module 7: Constraints and Derivations
Module 8: Creating BASIC Expressions
Module 9: Troubleshooting
Module 10: Defining Lookups
Module 11: Aggregating Data
Module 12: Job Control
Module 13: Working with Plug-Ins
Module 14: Scheduling and Reporting
Module 15: Optimizing Job Performance
Module 16: Putting It All Together
Module 1: Introduction to DataStage
Ascential Software provides the enterprise with a full-featured data integration platform that can take data from any source and load it into any target. Sources can range from customer relationship systems to legacy systems to data warehouses -- in fact, any system that houses data. Target systems, likewise, can consist of data in warehouses, real-time systems, Web services -- any application that houses data.

Depending on your needs, source data can undergo scrutiny and transformation through several stages:
1. Data profiling -- a discovery process where relevant information for target enterprise applications is gathered
2. Data quality -- a preparation process where data can be cleansed and corrected
3. Extract, Transform, Load -- a transformation process where data is enriched and loaded into the target

Underlying these processes is an application framework that allows you to:
1. Utilize parallel processing for maximum performance
2. Manage and share metadata amongst all the stages

Overlaying all of this is a command and control structure that allows you to tailor your environment to your specific needs.
A data warehouse is a central database that integrates data from many operational sources within an organization. The data is transformed, summarized, and organized to support business analysis and report generation.

• Repository of data
• Optimized for analysis
• Supports business:
  − Projections
  − Comparisons
  − Assessments
• Extracted from operational sources:
  − Integrated
  − Summarized
  − Filtered
  − Cleansed
  − Denormalized
  − Historical
Data marts are like data warehouses but smaller in scope. Frequently an organization will have both an enterprise-wide data warehouse and data marts that extract data from it for specialized purposes.

Data marts:
• Are like data warehouses but smaller in scope
• Organize data from a single subject area or department
• Solve a small set of business requirements
• Are cheaper and faster to build than a data warehouse
• Distribute data away from the data warehouse
DataStage is a comprehensive tool for the fast, easy creation and maintenance of data marts and data warehouses. It provides the tools you need to build, manage, and expand them. With DataStage, you can build solutions faster and give users access to the data and reports they need.

With DataStage you can:
• Design the jobs that extract, integrate, aggregate, load, and transform the data for your data warehouse or data mart.
• Create and reuse metadata and job components.
• Run, monitor, and schedule these jobs.
• Administer your development and execution environments.
DataStage is client/server software. The server stores all DataStage objects and metadata in a repository, which consists of the UniVerse RDBMS. The clients interface with the server.

The clients run on Windows 95 or later (Windows 98, NT, 2000). The server runs on Windows NT 4.0 and Windows 2000. Most versions of UNIX are supported. See the installation release notes for details.

The DataStage client components are:

Component      Description
Administrator  Administers DataStage projects and conducts housekeeping on the server
Designer       Creates DataStage jobs that are compiled into executable programs
Director       Used to run and monitor the DataStage jobs
Manager        Allows you to view and edit the contents of the repository
True or False? The DataStage Server and clients must be running on the same machine.

True: Incorrect. Typically, there are many client machines, each accessing the same DataStage Server running on a separate machine. The Server can be running on Windows NT or UNIX. The clients can be running on a variety of Windows platforms.

False: Correct! Typically, there are many client machines, each accessing the same DataStage Server running on a separate machine. The Server can be running on Windows NT or UNIX. The clients can be running on a variety of Windows platforms.
Use the Administrator to specify general server defaults, add and delete projects, and to set project properties. The Administrator also provides a command interface to the UniVerse repository.

Use the Administrator Project Properties window to:
• Set job monitoring limits and other Director defaults on the General tab.
• Set user group privileges on the Permissions tab.
• Enable or disable server-side tracing on the Tracing tab.
• Specify a user name and password for scheduling jobs on the Schedule tab.
• Specify hashed file stage read and write cache sizes on the Tunables tab.

General server defaults can be set on the Administrator DataStage Administration window (not shown):
• Change license information.
• Set server connection timeout.

The DataStage Administrator is discussed in detail in a later module.
Use the Manager to store and manage reusable metadata for the jobs you define in the Designer. This metadata includes table and file layouts and routines for transforming extracted data.

Manager is also the primary interface to the DataStage repository. In addition to table and file layouts, it displays the routines, transforms, and jobs that are defined in the project. Custom routines and transforms can also be created in Manager.
The DataStage Designer allows you to use familiar graphical point-and-click techniques to develop processes for extracting, cleansing, transforming, integrating, and loading data into warehouse tables.

The Designer provides a "visual data flow" method to easily interconnect and configure reusable components.

Use Designer to:
• Specify how the data is extracted.
• Specify data transformations.
• Decode (denormalize) data going into the data mart using reference lookups.
  − For example, if the sales order records contain customer IDs, you can look up the name of the customer in the CustomerMaster table.
  − This avoids the need for a join when users query the data mart, thereby speeding up the access.
• Aggregate data.
• Split data into multiple outputs on the basis of defined constraints.

You can easily move between the Director, Designer, and Manager by selecting commands in the Tools menu.
Use the Director to validate, run, schedule, and monitor your DataStage jobs. You can also gather statistics as the job runs.
The overall development workflow is as follows:
• Define your project's properties: Administrator
• Open (attach to) your project
• Import metadata that defines the format of data stores your jobs will read from or write to: Manager
• Design the job: Designer
  − Define data extractions (reads)
  − Define data flows
  − Define data integration
  − Define data transformations
  − Define data constraints
  − Define data loads (writes)
  − Define data aggregations
• Compile and debug the job: Designer
• Run and monitor the job: Director
All your work is done in a DataStage project. Before you can do anything, other than some general administration, you must open (attach to) a project.

Projects are created during and after the installation process. You can add projects after installation on the Projects tab of Administrator.

A project is associated with a directory. The project directory is used by DataStage to store your jobs and other DataStage objects and metadata.

You must open (attach to) a project before you can do any work in it.

Projects are self-contained. Although multiple projects can be open at the same time, they are separate environments. You can, however, import and export objects between them.

Multiple users can be working in the same project at the same time. However, DataStage will prevent multiple users from accessing the same job at the same time.
True or False? DataStage Designer is used to build and compile your Extraction, Transformation, and Load (ETL) jobs.

True: Correct! With Designer you can graphically build your job by placing graphical components (called "stages") on a canvas. After you build it, your job is compiled in Designer.

False: Incorrect. With Designer you can graphically build your job by placing graphical components (called "stages") on a canvas. After you build it, your job is compiled in Designer.

True or False? DataStage Manager is used to execute your jobs after you build them.

True: Incorrect. DataStage Manager is your primary interface to the DataStage repository. Use Manager to manage metadata and other DataStage objects.

False: Correct! DataStage Manager is your primary interface to the DataStage repository. Use Manager to manage metadata and other DataStage objects.
True or False? DataStage Director is used to execute your jobs after they have been built.

True: Correct! Use Director to validate and run your jobs. You can also monitor the job while it is running.

False: Incorrect. Use Director to validate and run your jobs. You can also monitor the job while it is running.

True or False? DataStage Administrator is used to set global and project properties.

True: Correct! You can set some global properties, such as connection timeout, as well as project properties, such as permissions.

False: Incorrect. You can set some global properties, such as connection timeout, as well as project properties, such as permissions.
Module 2: Installing DataStage
The DataStage server should be installed before the DataStage clients are installed. The server can be installed on Windows NT (including Workstation and Server), Windows 2000, or UNIX. This module describes the Windows NT installation.

The exact system requirements depend on your version of DataStage. See the installation CD for the latest system requirements.

To install the server you will need the installation CD and a license for the DataStage server. The license contains the following information:
• Serial number
• Project count
  − The maximum number of projects you can have installed on the server. This includes new projects as well as previously created projects to be upgraded.
• Expiration date
• Authorization code
  − This information must be entered exactly as written in the license.
The installation wizard guides you through the following steps:
• Enter license information
• Specify server directories
• Select program folder
• Create new projects and/or upgrade existing projects
The DataStage services must be running on the server machine in order to run any DataStage client applications. To start or stop the DataStage services in Windows 2000, open the DataStage Control Panel window in the Windows 2000 Control Panel. Then click Start All Services (or Stop All Services). These services must be stopped when installing or reinstalling DataStage.

UNIX note: In UNIX, these services are started and stopped using the uv.rc script with the stop or start command options. The exact name varies by platform. For SUN Solaris, it is /etc/rc2.d/S99uv.rc.
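For example, on SUN Solaris the services could be stopped before a reinstall and restarted afterwards along these lines (the script name and path vary by platform, so check your own installation):

    /etc/rc2.d/S99uv.rc stop
    /etc/rc2.d/S99uv.rc start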
The DataStage clients should be installed after the DataStage server is installed. The clients can be installed on Windows 95, Windows 98, Windows NT, or Windows 2000.

There are two editions of DataStage:
• The Developer's edition contains all the client applications (in addition to the server).
• The Operator's edition contains just the client applications needed to run and monitor DataStage jobs (in addition to the server), namely, the Director and Administrator.

To install the Developer's edition you need a license for DataStage Developer. To install the Operator's edition you need a license for DataStage Director. The license contains the following information:
• Serial number
• User limit
• Expiration date
• Authorization code
  − This information must be entered exactly as written in the license.
Module 3: Configuring Projects
In DataStage all development work is done within a project. Projects are created during installation and after installation using Administrator.

Each project is associated with a directory. The directory stores the objects (jobs, metadata, custom routines, etc.) created in the project.

Before you can work in a project you must attach to it (open it).

You can set the default properties of a project using DataStage Administrator.
Click Properties on the DataStage Administration window to open the Project Properties window. There are five active tabs. (The Mainframe tab is only enabled if your license supports mainframe jobs.) The default is the General tab.

If you select the Enable job administration in Director box, you can perform some administrative functions in Director without opening Administrator.

When a job is run in Director, events are logged describing the progress of the job. For example, events are logged when a job starts, when it stops, and when it aborts. The number of logged events can grow very large. The Auto-purge of job log box allows you to specify conditions for purging these events. You can limit the logged events either by number of days or number of job runs.
Use this page to set user group permissions for accessing and using DataStage. All DataStage users must belong to a recognized user role before they can log on to DataStage. This helps to prevent unauthorized access to DataStage projects.

There are three roles of DataStage user:
• DataStage Developer, who has full access to all areas of a DataStage project.
• DataStage Operator, who can run and manage released DataStage jobs.
• <None>, who does not have permission to log on to DataStage.

UNIX note: In UNIX, the groups displayed are defined in /etc/group.
This tab is used to enable and disable server-side tracing.

The default is for server-side tracing to be disabled. When you enable it, information about server activity is recorded for any clients that subsequently attach to the project. This information is written to trace files. Users with in-depth knowledge of the system software can use it to help identify the cause of a client problem. If tracing is enabled, users receive a warning message whenever they invoke a DataStage client.

Warning: Tracing causes a lot of server system overhead. This should only be used to diagnose serious problems.
Use the Schedule tab to specify a user name and password for running scheduled jobs in the selected project. If no user is specified here, the job runs under the same user name as the system scheduler.
On the Tunables tab, you can specify the sizes of the memory caches used when reading rows in hashed files and when writing rows to hashed files. Hashed files are mainly used for lookups and are discussed in a later module.

Active-to-Active link performance settings will be covered in detail in a later module in this course.
Module 4: Designing and Running Jobs
A job is an executable DataStage program. In DataStage, you can design and run jobs that perform many useful data warehouse tasks, including data extraction, data conversion, data aggregation, data loading, etc.

DataStage jobs are:
• Designed and built in Designer.
• Scheduled, invoked, and monitored in Director.
• Executed under the control of DataStage.
In this module, you will go through the whole process with a simple job, except for the first bullet. In this module you will manually define the metadata.
In the center right is the Designer canvas. On it you place stages and links from the Tools Palette on the right. On the bottom left is the Repository window, which displays the branches in Manager. Items in Manager, such as jobs and table definitions, can be dragged to the canvas area. Click View>Repository to display the Repository window.

Click View>Property Browser to display the Property Browser window. This window displays the properties of objects selected on the canvas.
The toolbar at the top contains quick access to the main functions of Designer.
The tool palette contains icons that represent the components you can add to your job design.

Most of the stages shown here are automatically installed when you install DataStage. You can also install additional stages called plug-ins for special purposes. For example, there is a plug-in called sort that can be used to sort data. Plug-ins are discussed in a later module.
There are two kinds of stages:

Passive stages define read and write access to data sources and repositories.
• Sequential
• ODBC
• Hashed

Active stages define how data is filtered and transformed.
• Transformer
• Aggregator
• Sort plug-in
True or False? The Sequential stage is an active stage.

True: Incorrect. The Sequential stage is considered a passive stage because it is used to extract or load sequential data from a file. It is not used to transform or modify data.

False: Correct! The Sequential stage is considered a passive stage because it is used to extract or load sequential data from a file. It is not used to transform or modify data.
The Sequential stage is used to extract data from a sequential file or to load data into a sequential file.

The main things you need to specify when editing the Sequential file stage are the following:
• Path and name of the file
• File format
• Column definitions
• If the Sequential stage is being used as a target, the write action: overwrite the existing file or append to it.
Defining a sequential target stage is similar to defining a sequential source stage. You are defining the format of the data flowing into the stage, that is, from the input links. Define each input link listed in the Input name box.

You are defining the file the job will write to. If the file doesn't exist, it will be created. Specify whether to overwrite or append the data in the Update action set of buttons.

General tab, Filter command: Here you can specify a filter program for processing the file you are extracting data from. This feature can be used, for example, to unzip a compressed file before reading it. You can type in or browse for the filter program, and specify any command line arguments it requires in the text box. This text box is enabled only if you have selected the Stage uses filter commands checkbox on the Stage page General tab. Note that, if you specify a filter command, data browsing is not available, so the View Data button is disabled.

On the Format tab, you can specify a different format for the target file than you specified for the source file.

If the target file doesn't exist, you will not (of course!) be able to view its data until after the job runs. If you click the View data button, DataStage will return a "Failed to open …" error.
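As a sketch of the Filter command feature described above: assuming the source file were gzip-compressed, a standard decompression command such as

    gzip -dc

could be entered as the filter program, so that the stage reads the uncompressed rows the filter writes to its standard output. (The choice of gzip here is purely illustrative; any program that writes the processed data to standard output could serve.)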
The column definitions you defined in the source stage for a given (output) link will appear already defined in the target stage for the corresponding (input) link.

Think of a link as like a pipe. What flows in one end flows out the other end. The format going in is the same as the format going out.
The Transformer stage is the primary active stage. Other active stages perform more specialized types of transformations.

In the Transformer stage you can specify:
• Column mappings
• Derivations
• Constraints

A column mapping maps an input column to an output column. Values are passed directly from the input column to the output column.

Derivations calculate the values to go into output columns based on values in zero or more input columns.

Constraints specify the conditions under which incoming rows will be written to output links.
Notice the following elements of the Transformer:

The top, left pane displays the columns of the input links. If there are multiple input links, multiple sets of columns are displayed.

The top, right pane displays the contents of the output links. We haven't defined any fields here yet. If there are multiple output links, multiple sets of columns are displayed.

For now, ignore the Stage Variables window in the top, right pane. This will be discussed in a later module.

The bottom area shows the column definitions (metadata) for the input and output links. If there are multiple input and/or output links, there will be multiple tabs.
Add one or more Annotation stages to the canvas to document your job.

An Annotation stage works like a text box with various formatting options. You can optionally show or hide the Annotation stages by pressing a button on the toolbar.

There are two Annotation stages. The Description Annotation stage is discussed in a later module.
Type the text in the box. Then specify the various options, including:
• Text font and color
• Text box color
• Vertical and horizontal text justification
Before you can run your job, you must compile it. This generates executable code that can be run by the DataStage Server engine. To compile a job, click File>Compile or click the Compile button on the toolbar. The Compile Job window displays the status of the compile.

If an error occurs:
• Click Show Error to identify the stage where the error occurred.
• Click More to retrieve more information about the error.
As you know, you run your jobs in Director. You can open Director from within Designer by clicking Tools>Run Director. In a similar way, you can move between Director, Manager, and Designer.

There are two methods for running a job:
• Run it immediately.
• Schedule it to run at a later time or date.

To run a job immediately:
• Select the job in the Job Status view. The job must have been compiled.
• Click Job>Run Now or click the Run Now button in the toolbar. The Job Run Options window is displayed.
This shows the Director Job Status view. To run a job, select it and then click Job>Run Now.

Other views available:
• Job log – view messages from a job run
• Schedule – view dates and times the job is scheduled to run
The Job Run Options window is displayed when you click Job>Run Now. This window allows you to stop the job after:
• A certain number of rows.
• A certain number of warning messages.

You can validate your job before you run it. Validation performs some checks that are necessary in order for your job to run successfully. These include:
• Verifying that connections to data sources can be made.
• Verifying that files can be opened.
• Verifying that SQL statements used to select data can be prepared.

Click Run to run the job after it is validated. The Status column displays the status of the job run.
Click the Log button in the toolbar to view the job log. The job log records events that occur during the execution of a job.

These events include control events, such as the starting, finishing, and aborting of a job; informational messages; warning messages; error messages; and program-generated messages.
Module 5: Working with Metadata
DataStage Manager is a graphical tool for managing the contents of your DataStage project repository, which contains metadata and other DataStage components such as jobs and routines.

Metadata is "data about data" that describes the formats of sources and targets. This includes general format information, such as whether the record columns are delimited and, if so, the delimiting character. It also includes the specific column definitions.
The left pane contains the project tree. There are eight main branches, but you can create subfolders under each. Select a folder in the project tree to display its contents. In this example, a folder named DS304 has been created that contains some of the jobs in the project.

Data Elements branch: Lists the built-in and custom data elements. (Data elements are extensions of data types, and are discussed in a later module.)

Jobs branch: Lists the jobs in the current project.

Routines branch: Lists the built-in and custom routines. Routines are blocks of DataStage BASIC code that can be called within a job. (Routines are discussed in a later module.)

Shared Containers branch: Shared Containers encapsulate sets of DataStage components into a single stage. (Shared Containers are discussed in a later module.)

Stage Types branch: Lists the types of stages that are available within a job. Built-in stages include the Sequential and Transformer stages you used in Designer.

Table Definitions branch: Lists the table definitions available for loading into a job.
Transforms branch: Lists the built-in and custom transforms. Transforms are functions you can use within a job for data conversion. Transforms are discussed in a later module.
DataStage Manager manages two different types of objects:
• Metadata describing sources and targets:
  − Called table definitions in Manager. These are not to be confused with relational tables. DataStage table definitions are used to describe the format and column definitions of any type of source: sequential, relational, hashed file, etc.
  − Table definitions can be created in Manager or Designer, and they can also be imported from the sources or targets they describe.
• DataStage components:
  − Every object in DataStage (jobs, routines, table definitions, etc.) is stored in the DataStage repository. Manager is the interface to this repository.
  − DataStage components, including whole projects, can be exported from and imported into Manager.
Any set of DataStage objects, including whole projects, which are stored in the Manager repository, can be exported to a file. This export file can then be imported back into DataStage.

Import and export can be used for many purposes, including:
• Backing up jobs and projects.
• Maintaining different versions of a job or project.
• Moving DataStage objects from one project to another. Just export the objects, move to the other project, then re-import them into the new project.
• Sharing jobs and projects between developers. The export files, when zipped, are small and can be easily emailed from one developer to another.
Click Export>DataStage Components in Manager to begin the export process. Any object in Manager can be exported to a file. Use this procedure to back up your work or to move DataStage objects from one project to another.

Select the types of components to export. You can select either the whole project or a portion of the objects in the project.

Specify the name and path of the file to export to. By default, objects are exported to a text file in a special format, and the extension is dsx. Alternatively, you can export the objects to an XML document.

The directory you export to is on the DataStage client, not the server.
True or False? You can export DataStage objects such as jobs, but you can't export metadata, such as field definitions of a sequential file.

True: Incorrect. Metadata describing files and relational tables is stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can.

False: Correct! Metadata describing files and relational tables is stored as "Table Definitions". Table definitions can be exported and imported as any DataStage objects can.
True or False? The directory you export to is on the DataStage client machine, not on the DataStage server machine.

True: Correct! The directory you select for export must be addressable by your client machine.

False: Incorrect. The directory you select for export must be addressable by your client machine.
To import DataStage components, click Import>DataStage Components. Select the file to import. Click Import all to begin the import process or Import selected to view a list of the objects in the import file. You can import selected objects from the list. Select the Overwrite without query button to overwrite objects with the same name without warning.
Table definitions define the formats of a variety of data files and tables. These definitions can then be used and reused in your jobs to specify the formats of data stores.

For example, you can import the format and column definitions of the Customers.txt file. You can then load this into the sequential source stage of a job that extracts data from the Customers.txt file.

You can load this same metadata into other stages that access data with the same format. In this sense the metadata is reusable. It can be used with any file or data store with the same format. If the column definitions are similar to what you need, you can modify the definitions and save the table definition under a new name.

You can also use the same table definition for different types of data stores with the same format. For example, you can import a table definition from a sequential file and use it to specify the format for an ODBC table. In this sense the metadata is "loosely coupled" with the data whose format it defines.

You can import and define several different kinds of table definitions, including: sequential files, ODBC data sources, UniVerse tables, and hashed files.
To start the import, click Import>Table Definitions>Sequential File Definitions. The Import Meta Data (Sequential) window is displayed.

Select the directory containing the sequential files. The Files box is then populated with the files you can import. Select the file to import.

Select or specify a category (folder) to import into:
• The format is: <Category><Sub-category>
• <Category> is the first-level sub-folder under Table Definitions.
• <Sub-category> is (or becomes) a sub-folder under the type.
In Manager, select the category (folder) that contains the table definition. Double-click the table definition to open the Table Definition window.

Click the Columns tab to view and modify any column definitions. Select the Format tab to edit the file format specification.
Module 6: Working with Relational Data
You can perform the same tasks with relational data that you can with sequential data. You can extract, filter, and transform data from relational tables. You can also load data into relational tables.

Although you can work with many relational databases through native drivers (including UniVerse, UniData, and Oracle), you can access many more relational databases using ODBC.

In the ODBC stage, you can specify your query against one or more tables in the database interactively, type the query yourself, or paste in an existing query.
Before you can access data through ODBC you must define an ODBC data source. In Windows NT, this can be done using the (32 bit) ODBC Data Source Administrator in the Control Panel.

The ODBC Data Source Administrator has several tabs. For use with DataStage, you should define your data sources on the System DSN tab (not User DSN). You can install drivers for most of the common relational database systems from the DataStage installation CD.

Click Add to define a new data source. When you click Add, a list of available drivers is displayed. Select the appropriate driver and then click Finish.

Different relational databases have different requirements. As an example, we will define a Microsoft Access data source:
• Type the name of the data source in the Data Source Name box.
• Click Select to define a connection to an existing database. Type the name and location of the database.
• Click Create to define a connection to a new database.
Importing table definitions from ODBC databases is similar to importing sequential file definitions. Click Import>Table Definitions>ODBC Table Definitions in Manager to start the process.

The DSN list displays the data sources that are defined for the DataStage Server. Select the data source you want to import from and, if necessary, provide a user name and password.

The Import Metadata window is displayed. It lists all tables in the database that are available for import. Select one or more tables and a category to import to, and then click OK.
Extracting data from a relational table is similar to extracting data from a sequential file, except that you use an ODBC stage instead of a Sequential stage. In this example, we'll extract data from a relational table and load it into a sequential file.
Specify the ODBC data source name in the Data source name box on the General tab of the ODBC stage.

You can click the Get SQLInfo button to retrieve the quote character and schema delimiters from the ODBC database.
Specify the table name on the General tab of the Outputs tab.

Select Generated query to define the SQL SELECT statement interactively using the Columns and Selection tabs. Select User-defined SQL query to write your own SQL SELECT statement to send to the database.
Load the table definitions from Manager on the Columns tab. The procedure is the same as for sequential files.

When you click Load, the Select Columns window is displayed. Select the columns data is to be extracted from.
Optionally, specify a WHERE clause and other additional SQL clauses on the Selection tab.

Other clauses can be anything else you wish to add to the SELECT statement, such as ORDER BY.
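For instance, entering min_lvl >= 100 as the WHERE clause and ORDER BY job_desc as an other clause would produce a generated query along these lines (the jobs table and its columns are illustrative, not taken from any particular sample database):

    SELECT job_id, job_desc, min_lvl, max_lvl
    FROM jobs
    WHERE min_lvl >= 100
    ORDER BY job_desc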
The View SQL tab enables you to view the SELECT statement that will be used to select the data from the table.

The SQL displayed is read-only. Click View Data to test the SQL statement against the database.
If you want to define your own SQL query, click User-defined SQL query on the General tab and then write or paste the query into the SQL for primary inputs box on the SQL Query tab.
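A user-defined query can include anything the database accepts, such as a join of multiple tables within the data source (a capability noted in the review questions later in this module). A hypothetical example, again with illustrative table and column names:

    SELECT e.emp_id, e.lname, e.fname, j.job_desc
    FROM employee e, jobs j
    WHERE e.job_id = j.job_id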
Editing an ODBC target stage is similar to editing an ODBC source stage. It includes the following tasks:
• Specify the data source containing the target table.
• Specify the name of the table.
• Select the update action. You can choose from a variety of INSERT and/or UPDATE actions.
• Optionally, create the table.
• Load the column definitions from the Manager table definition.
Some of the options are different in the ODBC stage when it is used as a target.

Select the type of action to perform from the Update action list. You can optionally have DataStage create the target table, or you can load to an existing table.

On the View SQL tab you can view the SQL statement used to insert the data into the target table.
On the Edit DDL tab you can generate and modify the CREATE TABLE statement used to create the target table.

If you make any changes to column definitions, you need to regenerate the CREATE TABLE statement by clicking the Create DDL button.
Transaction Handling: Allows you to specify a transaction isolation level for read data. The isolation level specifies how potential conflicts between transactions (i.e., dirty reads, nonrepeatable reads, and phantom reads) are handled.

By default, all the rows are written to the target table before a COMMIT. In the Rows per transaction box, you can specify a specific number of rows to write before the COMMIT.
True or False? Using a single ODBC stage, you can only extract data from a single table.

True: Incorrect. You can join data from multiple tables within a single data source.

False: Correct! You can join data from multiple tables within a single data source.
The ORAOCI8 plug-in lets you rapidly and efficiently prepare and load streams of tabular data from any DataStage stage (for example, the ODBC stage, the Sequential File stage, and so forth) to and from tables of the target Oracle database. The Oracle client on Windows NT or UNIX uses SQL*Net to access an Oracle server on Windows NT or UNIX.
The plug-in appears as any other stage on the Designer work area. It can extract or write data contained in Oracle tables.

Features:
• Each ORAOCI8 plug-in stage is a passive stage that can have any number of input, output, and reference output links.
• Input links specify the data you are writing, which is a stream of rows to be loaded into an Oracle database. You can specify the data on an input link using an SQL statement constructed by DataStage or a user-defined SQL statement.
• Output links specify the data you are extracting, which is a stream of rows to be read from an Oracle database. You can also specify the data on an output link using an SQL statement constructed by DataStage or a user-defined SQL statement.
• Each reference output link represents a row that is key read from an Oracle database (that is, it reads the record using the key field in the WHERE clause of the SQL SELECT statement).
General Tab

This tab is displayed by default. It contains the following fields:

Table name. This required field is editable when the update action is not User-defined SQL (otherwise, it is read-only). It is the name of the target Oracle table the data is written to, and the table must exist or be created by choosing Generate DDL from the Create table action list. You must have insert, update, or delete privileges, depending on input mode. You must specify Table name if you do not specify User-defined SQL. There is no default. Click … (Browse button) to browse the Repository to select the table.

Update action. Specifies which SQL statements are used to update the target table. Some update actions require key columns to update or delete rows. There is no default. Choose the option you want from the list:

Clear table then insert rows. Deletes the contents of the table and adds the new rows, with slower performance because of transaction logging.

Truncate table then insert rows. Truncates the table with no transaction logging and faster performance.
Insert rows without clearing. Inserts the new rows in the table.

Delete existing rows only. Deletes existing rows in the target table that have identical keys in the source files.

Replace existing rows completely. Deletes the existing rows, then adds the new rows to the table.

Update existing rows only. Updates the existing data rows. Any rows in the data that do not exist in the table are ignored.

Update existing rows or insert new rows. Updates the existing data rows before adding new rows. It is faster to update first when you have a large number of records.

Insert new rows or update existing rows. Inserts the new rows before updating existing rows. It is faster to insert first if you have only a few records.

User-defined SQL. Writes the data using a user-defined SQL statement, which overrides the default SQL statement generated by the stage. If you choose this option, you enter the SQL statement on the SQL tab.

User-defined SQL file. Reads the contents of the specified file to write the data.

Transaction Isolation. Provides the necessary concurrency control between transactions in the job and other transactions. Use one of the following transaction isolation levels:
• Read committed. Takes exclusive locks on modified data and sharable locks on all other data. Each query executed by a transaction sees only data that was committed before the query (not the transaction) began. Oracle queries never read dirty (uncommitted) data. This is the default.
• Serializable. Takes exclusive locks on modified data and sharable locks on all other data. Serializable transactions see only the changes that were committed at the time the transaction began. Note: If Enable transaction grouping is selected on the Transaction Handling tab, only the Transaction Isolation value for the first link is used for the entire group.

Array size. Specifies the number of rows to be transferred in one call between DataStage and Oracle before they are written. Enter a positive integer to indicate how many rows Oracle writes at a time to the database. The default value is 1; that is, each row is written in a separate statement. Larger numbers use more memory on the client to cache the rows.
This minimizes server round trips and maximizes performance by executing fewer statements. If this number is too large, the client may run out of memory.

Transaction size. This field exists for backward compatibility, but it is ignored for version 3.0 and later of the plug-in. The transaction size for new jobs is now handled by Rows per transaction on the Transaction Handling tab.

Create table action. Creates the target table in the specified database if Generate DDL is selected. It uses the column definitions in the Columns tab and the table name and the TABLESPACE and STORAGE properties for the target table. The generated CREATE TABLE statement includes the TABLESPACE and STORAGE keywords, which indicate the location where the table is created and the storage expression for the Oracle storage_clause. You must have CREATE TABLE privileges on your schema. You can also specify your own CREATE TABLE SQL statement. You must enter the storage clause in Oracle format. (Use the User-defined DDL tab on the SQL tab for a complex statement.)
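As a rough sketch, a generated CREATE TABLE statement with the TABLESPACE and STORAGE keywords might look like the following (the table, columns, tablespace name, and storage expression are all illustrative):

    CREATE TABLE jobs (
        job_id   NUMBER(5)    NOT NULL,
        job_desc VARCHAR2(50)
    )
    TABLESPACE users
    STORAGE (INITIAL 64K NEXT 64K)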
Module 7: Constraints and Derivations
A constraint specifies the condition under which data flows through a link. For example, suppose you want to split the data in the jobs file into separate files based on the job level. We need to define a constraint on each link so that only jobs within a certain level range are written to each file.
Click the Constraints button in the toolbar at the top of the Transformer Stage window to open the Transformer Stage Constraints window.

The Transformer Stage Constraints window lists all the links out of the transformer. Double-click on the cell next to a link to create the constraint.
• Rows that are not written out to previous links are written to a rejects link.
• A row of data is sent down all the links it satisfies.
• If there is no constraint on a (non-rejects) link, all rows will be sent down the link.
This shows the Constraints window. Constraints are defined for each of the top three links. The Reject Row box is selected for the last link. All rows that fail to satisfy the constraints on the top three links will be sent down this link.
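Continuing the jobs-file example, the constraints on the top three links might look something like this in DataStage BASIC (the link name JobsIn and the min_lvl column are illustrative):

    JobsIn.min_lvl <= 100                              low-level link
    JobsIn.min_lvl > 100 AND JobsIn.min_lvl <= 200     mid-level link
    JobsIn.min_lvl > 200                               high-level link

A row satisfying none of these is sent down the fourth link, the one with the Reject Row box selected.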
True or False? A constraint specifies a condition under which incoming rows of data will be written to an output link.

True: Correct! You can separately define a constraint for each output link. If no constraint is written for a particular output link, then all rows will be written to that link.

False: Incorrect. You can separately define a constraint for each output link. If no constraint is written for a particular output link, then all rows will be written to that link.
True or False? A Rejects link can be placed anywhere in the link ordering.

True: Incorrect. A Rejects link should be placed last in the link ordering if it is to get every row that doesn't satisfy any of the other constraints.

False: Correct! A Rejects link should be placed last in the link ordering if it is to get every row that doesn't satisfy any of the other constraints.
A derivation is an expression that specifies the value to be moved into a target column (field). Every target column must have a derivation. The simplest derivation is an input column: the value in the input column is moved to the target column.

To construct a derivation for a target column, double-click on the derivation cell next to the target column.

Derivations are constructed in the same way that constraints are constructed:
• Type constants.
• Type or enter operators from the Operator shortcut menu.
• Type or enter operands from the Operand shortcut menu.

What's the difference between derivations and constraints?
• Constraints apply to links; derivations apply to columns.
• Constraints are conditions, either true or false; derivations specify a value to go into a target column.
In this example the concatenation of several fields is moved into the FullName target field. The colon (:) is the concatenation operator. You can insert it from the Operator menu or type it in.
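A derivation of this kind would look something like the following (assuming input columns named lname and fname on a link named EmpIn):

    EmpIn.lname : ", " : EmpIn.fname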
True or False? If the constraint for a particular link is not satisfied, then the derivations defined for that link are not executed.

True: Correct! Constraints have precedence over derivations. Derivations in an output link are only executed if the constraint is satisfied.

False: Incorrect. Constraints have precedence over derivations. Derivations in an output link are only executed if the constraint is satisfied.
You can create stage variables for use in your column derivations and constraints. Stage variables store values without writing them out to a target file or table. They can be used in expressions just like constants, input columns, and other operands.

Stage variables retain their values across reads. This allows them to be used as counters and accumulators. You can also use them to compare a current input value to a previous input value.

To create a new stage variable, click the right mouse button over the Stage Variables window and then click Append New Stage Variable (or Insert New Stage Variable).

After you create it, you specify a derivation for it in the same way as for columns.
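For example, here is a sketch of two stage variables used together (the names svIsNewKey and svPrevId, and the input column, are hypothetical). Stage variables are evaluated in the order they are listed, so svIsNewKey is evaluated while svPrevId still holds the key from the previous row:

   svIsNewKey:  CustomersIn.cust_id <> svPrevId
   svPrevId:    CustomersIn.cust_id

A constraint or column derivation can then reference svIsNewKey to react to a change of key, and a simple counter can be kept with a derivation such as svCount + 1.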
This lists the execution order:
• Derivations in stage variables are executed before constraints. This allows them to be used in constraints.
• Next, constraints are executed.
• Then column derivations are executed.
• Derivations in higher columns are executed before those in lower columns.
Note the output link reordering icon available on the toolbar from within the Transformer stage.
To get to the link ordering screen, open the Transformer stage, then click on the output link execution order icon. The above screen will appear. Select a link and use the arrow buttons to reposition it in the execution order.
True or False? Derivations for stage variables are executed before derivations for any output link columns.

True: Correct! So you can be sure that the derivations for any of the stage variables referenced in column derivations will have already been executed.

False: Incorrect. The derivations for stage variables are executed first. So you can be sure that the derivations for any of the stage variables referenced in column derivations will have already been executed.
Module 8
Creating BASIC Expressions
DataStage BASIC is a form of BASIC that has been customized to work with DataStage.

In the previous module you learned how to define constraints and derivations. Derivations and constraints are written using DataStage BASIC.

Job control routines, which are discussed in a later module, are also written in DataStage BASIC.

This module will not attempt to teach you BASIC programming. Our focus is on what you need to know in order to construct complex DataStage constraints and derivations.
For more information about BASIC operators than is provided here, search for "BASIC Operators" in Help. You can insert these operators from the Operators menu (except for the IF operator, which is on the Operands menu).
• Arithmetic operators: -, +, *, /
• Relational operators: =, <, >, <=, >=
• Logical operators: AND, OR, NOT
• IF operator:
  − IF min_lvl < 0 THEN "Out of Range" ELSE "In Range"
• Concatenation operator (:)
  − "The employee's name is " : lname : ", " : fname
• Substring operator ([start, length]). The first character is 1 (not 0).
  − "APPL3245"[1, 4] → "APPL"
  − "APPL3245"[5, 2] → "32"
For more information about BASIC functions than is provided here, look up the Alphabetical List of BASIC Functions and Statements in Help. BASIC functions include the standard Pick BASIC functions. Click Function from the Operands menu to insert a function.

Here are a few of the more common functions:
• TRIM(string), TRIM(string, character), TRIMF, TRIMB
  − TRIM("  xyz  ") → "xyz"
• LEN(string)
• UPCASE(string), DOWNCASE(string)
• ICONV, OCONV
  − ICONV is used to convert values to an internal format.
  − OCONV is used to convert values from an internal format.
  − Very powerful functions, often used for date and time conversions and manipulations.
  − These functions are discussed later in the module.
For more information about BASIC system variables than is provided here, look up System Variables in Help. Click System Variable from the Operands menu to insert a system variable.
• @DATE, @TIME: date/time the job started
  − @YEAR, @MONTH, @DAY: extracted from @DATE
• @INROWNUM: row counter, incoming link
• @OUTROWNUM: row counter, outgoing link
• @LOGNAME: user logon name
• @NULL: NULL value
• @TRUE, @FALSE
• @WHO: name of the current project
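As a small illustrative example (the link name is hypothetical), system variables can be used anywhere an operand is allowed. A constraint that passes only the first 1000 rows, and a derivation that stamps each output row with a sequence number, might be:

   Constraint: @INROWNUM <= 1000
   Derivation: @OUTROWNUM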
True or False? TRIM is a system variable.

True: Incorrect. TRIM is a DataStage function that removes surrounding spaces in a character string.

False: Correct! TRIM is a DataStage function that removes surrounding spaces in a character string.
True or False? @INROWNUM is a DataStage function.

True: Incorrect. System variables all begin with the @-sign. @INROWNUM is a system variable that contains the number of the last row read from the input link.

False: Correct! System variables all begin with the @-sign. @INROWNUM is a system variable that contains the number of the last row read from the input link.
DataStage is supplied with a number of functions you can use to obtain information about your jobs and projects. You can insert these functions into derivations.

DS functions and macros are discussed in a later module.
DS (DataStage) routines are defined in DataStage Manager. There are several types of DS routines. The type you can insert into your derivations and constraints is the Transform Function type. A DS Transform Function routine consists of a predefined block of BASIC statements that takes one or more arguments and returns a single value.

You can define your own routines, but there are also a number of pre-built routines that are supplied with DataStage. The pre-built routines include a number of routines for manipulating dates, such as ConvertMonth, QuarterTag, and Timestamp.
Data elements are extended data types. For example, a phone number is a kind of string. You could define a data element called PHONE.NUMBER to precisely define this type.

Data elements are defined in DataStage Manager. A number of built-in types are supplied with DataStage. For example, MONTH.TAG represents a string of the form "YYYY-MM".
DS Transforms are similar to DS Transform Function routines. They take one or more arguments and return a single value. There are two primary differences:
• The argument(s) and return value have specific data elements associated with them. In this sense, they transform data from one data element type to another data element type.
• Unlike DS routines, they do not consist of blocks of BASIC statements. Rather, they consist of a single (though possibly very complex) BASIC expression.

You can define your own DS Transforms, but there are also a number of pre-built transforms that are supplied with DataStage. The pre-built transforms include a number of routines for manipulating strings and dates.
Date manipulation in DataStage can be done in several ways:
• Using the Iconv and Oconv functions with the "D" conversion code.
• Using the built-in date Transforms.
• Using the built-in date routines.
• Using routines in the DataStage Software Development Kit (SDK).

Using the SDK routines is covered in another DataStage course; your instructor can provide further details. The SDK routines are installed in the Manager Routines\sdk folder.
For detailed help on Iconv and Oconv, see their entries in the Alphabetical List of BASIC Functions and Statements in Help.

Use Iconv to convert a string date in a variety of formats to the internal DataStage integer format. Use Oconv to convert an internal date to a string date in a variety of formats. Use these two functions together to convert a string date from one format to another.

The internal format for a date is based on a reference date of December 31, 1967, which is day 0. Dates before are negative integers; dates after are positive integers.

Use the "D" conversion code to specify the format of the date to be converted to an internal date by Iconv, or the format of the date to be output by Oconv.
For detailed help (more than you probably want), see D Code under Iconv or Oconv in Help.

"D4-MDY[2,2,4]"
• D: date conversion code
• 4: number of digits in the year
• -: separator
• MDY: ordering is month, day, year
• [2,2,4]: number of digits for M, D, Y, respectively

Note:
• The number in brackets for "Y" (namely 4) overrides the number following "D".
• Iconv ignores some of the characters:
  − Any separator will do.
  − The number of characters is ignored if there are separators.
• Iconv("12-31-67", "D2-MDY[2,2,2]") → 0
• Iconv("12311967", "D MDY[2,2,4]") → 0
• Iconv("31-12-1967", "D-DMY[2,2,4]") → 0
• Oconv(0, "D2-MDY[2,2,4]") → "12-31-1967"
• Oconv(0, "D2/DMY[2,2,2]") → "31/12/67"
• Oconv(10, "D/YDM[4,2,A10]") → "1968/10/JANUARY"
  − This example illustrates the use of an additional formatting option. The "A10" option says to express the name alphabetically, length 10 characters.
• Oconv( Iconv("12-31-67", "D2-MDY[2,2,2]"), "D/YDM[4,2,A10]") → "1967/31/DECEMBER"
  − This example shows how to convert from one string representation to another.
DataStage provides a number of built-in transforms you can use for date conversions.

The following data elements are used with the built-in transforms:

  Data element    String format   Example
  DATE.TAG        YYYY-MM-DD      1999-02-24
  WEEK.TAG        YYYYWnn         1999W06
  MONTH.TAG       YYYY-MM         1999-02
  QUARTER.TAG     YYYYQn          1999Q4
  YEAR.TAG        YYYY            1999
True or False? You can use Oconv to convert a string date from one format to another.

True: Incorrect. Oconv by itself can't do this. You would first use Iconv to convert the input string into a day integer. Then you can use Oconv to convert the day integer into the output string.

False: Correct! Oconv by itself can't do this. You would first use Iconv to convert the input string into a day integer. Then you can use Oconv to convert the day integer into the output string.
The transforms can be grouped into the following categories:
• String to day number
  − Formatted string → internal date integer
• Day number to date string
  − Internal date integer → formatted string
• Date string to date string
  − DATE.TAG string → formatted string
The following transforms convert strings of the specified format (MONTH.TAG, QUARTER.TAG, …) to an internal date representing the first or last day of the period.

  Function                      Tag           Description
  MONTH.FIRST, MONTH.LAST       MONTH.TAG     Returns a numeric internal date corresponding to the first/last day of a month
  QUARTER.FIRST, QUARTER.LAST   QUARTER.TAG   Returns a numeric internal date corresponding to the first/last day of a quarter
  WEEK.FIRST, WEEK.LAST         WEEK.TAG      Returns a numeric internal date corresponding to the first day (Monday) / last day (Sunday) of a week
  YEAR.FIRST, YEAR.LAST         YEAR.TAG      Returns a numeric internal date corresponding to the first/last day of a year
Examples:

MONTH.FIRST("1993-02") → 9164
MONTH.LAST("1993-02") → 9191
The following functions convert internal dates to strings in various formats (DATE.TAG, MONTH.TAG, …).

  Function      Argument type   Description
  DATE.TAG      Internal date   Converts internal date to string in DATE.TAG format
  MONTH.TAG     Internal date   Converts internal date to string in MONTH.TAG format
  QUARTER.TAG   Internal date   Converts internal date to string in QUARTER.TAG format
  WEEK.TAG      Internal date   Converts internal date to string in WEEK.TAG format

Examples:

MONTH.TAG(9177) → "1993-02"
DATE.TAG(9177) → "1993-02-14"
The following functions convert strings in DATE.TAG format to strings in various other formats (DAY.TAG, MONTH.TAG, …).

  Function         Tag        Description
  TAG.TO.MONTH     DATE.TAG   Convert DATE.TAG to MONTH.TAG
  TAG.TO.QUARTER   DATE.TAG   Convert DATE.TAG to QUARTER.TAG
  TAG.TO.WEEK      DATE.TAG   Convert DATE.TAG to WEEK.TAG
  TAG.TO.DAY       DATE.TAG   Convert DATE.TAG to DAY.TAG

Examples:

TAG.TO.MONTH("1993-02-14") → "1993-02"
TAG.TO.QUARTER("1993-02-14") → "1993Q1"
Module 9
Troubleshooting
Events are logged to the job log file when a job is validated, run, or reset. You can use the log file to troubleshoot jobs that fail during validation or a run.

Various entries are written to the log, including when:
• The job starts
• The job finishes
• An active stage starts
• An active stage finishes
• Rows are rejected (yellow icons)
• Errors occur (red icons)
• DataStage informational reports are logged
• User-invoked messages are displayed
The event window shows the events that are logged for a job during its run.

The job log contains the following information:

  Column Name   Description
  Occurred      Time the event occurred.
  On date       Date the event occurred.
  Type          One of the following:
                Info: Informational. No action required.
                Warning: An error occurred. Investigate the cause of the warning, as this may indicate a serious error.
                Fatal: A fatal error occurred.
                Control: The job starts and finishes.
                Reject: Rejected rows are output.
                Reset: A job or the log is reset.
  Event         A message describing the event. The system displays the first line of the message. If a message has an ellipsis (…) at the end, it contains more than one line. You can view the full message in the Event Detail window.
Clearing the log

To clear the log, click Job>Clear Log.
Double-click on an event to open the Event Detail window. This window gives you more information.

When an active stage finishes, DataStage logs an informational message that describes how many rows were read in to the stage and how many were written. This provides you with valuable information that can indicate possible errors.
The Monitor can be used to display information about a job while it is running. To start the Monitor, click Tools>New Monitor. Once in the Monitor, click the right mouse button and then select Show links to display information about each of the input and output links.
When you are testing a job, you can save time by limiting the number of rows and warnings.
Server side tracing is enabled in Administrator. It is designed to be used to help customer support analysts troubleshoot serious problems. When enabled, it logs a record to a trace file whenever DataStage clients interact with the server.

Caution: Because of the overhead caused by server side tracing, it should only be used when working with customer support.
DataStage provides a debugger for testing and debugging your job designs. The debugger runs within Designer. With the DataStage debugger you can:
• Set breakpoints on job links, including conditional breakpoints.
• Step through your job link-by-link or row-by-row.
• Watch the values going into link columns.
To begin debugging, click View>Debug Bar to display the debug toolbar. The toolbar provides access to all of the debugging functions: Go, Stop, Next Link, Next Row, Toggle Breakpoint, Edit Breakpoints, Clear All Breakpoints, Debug Job Parameters, Debug Window, and View Job Log.

  Button                  Description
  Go                      Start/continue debugging.
  Next Link               The job continues until the next action occurs on any link.
  Next Row                The job continues until the next row is processed or until another link with a breakpoint is encountered.
  Stop Job                Stops the job at the point it is at. Click Go to continue.
  Job Parameters          Set limits on rows and warnings.
  Edit Breakpoints        Displays the Edit Breakpoints window, in which you can edit existing breakpoints.
  Toggle Breakpoint       Set or clear a breakpoint on a selected link.
  Clear All Breakpoints   Removes breakpoints from all links.
  View job log            Open Director and view the job log.
  Debug Window            Show/hide the Debug Window, which displays link column values.
To set a breakpoint on a link, select the link and then click the Toggle Breakpoint button. A black circle appears on the link.
Click the Edit Breakpoints button to open the Edit Breakpoints window. Existing breakpoints are listed in the lower pane.

To set a condition for a breakpoint, select the breakpoint and then specify the condition in the pane above. You can either specify the number of rows before breaking or specify an expression to break upon when it's true.
Click the Debug Window button to open the Debug Window.
• The top pane lists all the columns defined for all links.
• The Local Data column lists the data currently in the column.
• The Current Break box at the top of the window lists the link where execution stopped.
• To add a column to the lower pane (where it is isolated), select the column and then click Add Watch.
• If a breakpoint is set, execution stops at that link when a row is written to the link.
You can step through link-by-link or row-by-row.
• Next Row extracts a row of data and stops at the next link with a breakpoint that the row is written to.
  − For example, if a breakpoint is set on the MexicoCustomersOut link, execution stops at the MexicoCustomersOut link when a Mexican customer is read.
  − If a breakpoint is not set on the MexicoCustomersOut link, execution will not stop at the MexicoCustomersOut link when a Mexican customer is read.
  − Execution will stop at the CustomersIn link (even if there is no breakpoint set on it) because all rows are read through that link.
• Next Link stops at the next link that data is written to.
Module 10
Defining Lookups
A hashed file is a file that distributes records among one or more evenly-sized groups based on a primary key. The primary key value is processed by a "hashing algorithm" to determine the location of the record. The number of groups in the file is referred to as its modulus. In this example, there are 5 groups (modulus 5).

Hashed files are used for reference lookups in DataStage because of their fast performance. The hashing algorithm determines the group the record is in. The groups contain a small number of records, so the record can be quickly located within the group.

If write caching is enabled, DataStage does not write hashed file records directly to disk. Instead it caches the records in memory, and writes the cached records to disk when the cache is full. This improves performance. You can specify the size of the cache on the Tunables tab in Administrator.
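Conceptually (this is an illustrative sketch, not the actual DataStage hashing algorithm), the group a record lands in is derived from its key and the file's modulus, along the lines of:

   group = MOD(hash(primary key value), modulus) + 1

so a lookup only has to search the handful of records in that one group.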
To create and load a hashed file, create a job that has the hashed file stage as a target.

For example, here's a simple job that will create and load the StoresHashed hashed file, which will contain a list of stores and their addresses keyed by stor_id.

Loading a hashed file with data is similar to loading a sequential file with data.
Properties for the hashed file stage are used to provide the physical location for the hashed file. If you use the "Use account name" checkbox and leave the account name box blank, the hashed file will be created in the same project in which the job creating it is executed. This provides flexibility for jobs that are moved from a development to a production environment. Alternatively, you can opt to specify the exact directory in which the hashed file will be created; however, if you place the file outside the area controlled by the DataStage engine (the project), you will not be able to back up the file using the DataStage Manager export project function.
You should use the Key checkboxes to identify the key columns. If you don't, the first column definition is taken as the hashed file's key field. The remaining columns dictate the order in which data will be written to the hashed file. Do not reorder the column definitions in the grid unless you are certain you understand the consequences of your action.
Once you have created a hashed file (by running the job in Director) you may want to import that hashed file's metadata. Like all metadata imports, this is performed in DataStage Manager (Import>Table Definitions>UniVerse File Definitions). Note that hashed files are known as UniVerse File Definitions.
You can extract data from a hashed file:
• As a stream.
• As a lookup.

The process of extracting data from a hashed file as a stream is similar to extracting data from a sequential file.

A hashed file stage used as a source has an additional tab called the Selection tab. Use it to specify a condition for filtering the data from the hashed file.
Your job can delete a hashed file and then recreate it. To delete the file and then recreate it, select the Delete file before create box in the Create file options window on the hashed file target stage.

To delete a hashed file without recreating it in a job, you can execute the DELETE.FILE command. To execute this command, log onto Administrator, select the project (account) containing the hashed file, and then click Command to open the Command Interface window. In the Command box, type DELETE.FILE followed by the name of the hashed file. Then click Execute.

The DELETE.FILE command can also be executed in a Before/After routine. (Before/After routines are discussed in a later module.)
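For example, to delete the StoresHashed file created earlier (assuming it was created in the project account), you would enter:

   DELETE.FILE StoresHashed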
Extracting data from a hashed file as a lookup involves several steps. The dashed line indicates reference input as opposed to stream input.

Keep the following in mind:
• Reference lookups from hashed files must go into a Transformer stage.
• Multiple lookups can be done at the same Transformer. To specify an ordering, open the transformer and then click the Input Link Execution Order icon at the top. The procedure is the same as defining target link execution order.
• Lookups cannot be done from sequential files.
• Lookups can also be done from relational (ODBC) tables.
The lookup expression, or join, is defined in the Transformer stage. Click the right mouse button over the hashed file key column and select Edit Key Expression.
• This defines the value to look up in the hashed file.
• Any valid expression can be specified (not just a column mapping).
• You can drag input columns to the key column, as when defining derivations for target columns.

Output from the lookup file is mapped to fields in the target link. If the lookup fails (the result of the expression is not found in the hashed file), NULLs are returned in all the lookup link columns. You can test for NULLs in a derivation to determine whether the lookup succeeded.
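As an illustrative sketch (the link and column names are hypothetical), a target column derivation that substitutes a default value when the lookup fails might be:

   IF IsNull(StoresLookup.stor_name) THEN "UNKNOWN" ELSE StoresLookup.stor_name

IsNull is a standard BASIC function; you could equally compare the column against the @NULL system variable.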
Module 11
Aggregating Data
The data sources you're extracting data from can contain many thousands of rows of data. You can summarize groups of data in each column using the functions listed below.

  Function             Description
  Minimum              Returns the lowest value in the column.
  Maximum              Returns the highest value in the column.
  Count                Counts the number of values in the column.
  Sum                  Sums the values in the column.
  Average              Returns the average of the values in the column.
  Standard deviation   Returns the standard deviation for the values in the column.

The first three functions (minimum, maximum, count) apply to non-numeric as well as numeric data. The last three only make sense when applied to numeric data.
Here's an example of some data you might want to summarize. The Sales file contains a list of sales. You may want the following aggregations, for example:
• Total sales amount by store (by product, by year, by month, etc.)
• Average sales amount by store (by product, etc.)
• Total (average) quantity sold by product
In this example, we will determine the average sales amount for each product sold.
• The Sales stage is used to read the data.
• The transformer performs some initial calculations on the data. For instance, the sales amount for each order (qty * price) is calculated, as shown in the sketch after this list.
  − Calculations can't be defined in the Aggregator stage.
• The Aggregator stage can have at most one input link, and it can't be a reference link.
• In this example, the columns coming in include the product ID (product_id) and the sales amount (sales_amount) for each order.
• The output link will define the aggregations.
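A minimal sketch of the two pieces involved (the link and column names are hypothetical): in the Transformer, the sales_amount output column gets a derivation such as

   SalesIn.qty * SalesIn.price

and in the Aggregator stage's output link, the derivation for the summary column is then built from that incoming column and an aggregate function (for example, the Average of sales_amount, grouped by product_id).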
This lists the main tasks in defining aggregations.
True or False? Suppose you want to aggregate over derived values. For example, you want to SUM(qty * unit_price). You can perform this derivation within the Aggregator stage.

True: Incorrect. You cannot perform derivations within the Aggregator stage. If you want to aggregate derived values, perform the derivation in an output column in a prior Transformer stage. Then aggregate over that incoming column in the Aggregator stage.

False: Correct! You cannot perform derivations within the Aggregator stage. If you want to aggregate derived values, perform the derivation in an output column in a prior Transformer stage. Then aggregate over that incoming column in the Aggregator stage.
The Inputs Columns tab specifies the incoming columns.

Aggregations are performed in memory. If the data is presorted before it is aggregated, this can greatly improve the way in which the Aggregator stage handles the data. Use the Sort and Sort Order columns to specify whether and how the data is sorted.

The Sort column is used to specify which columns are sorted and their order. For example, if the incoming data is sorted by stor_id and product_id, in that order, then stor_id would be column 1 and product_id would be column 2.

In the Sort Order column, specify the sort order, that is, whether the data is sorted in ascending or descending order or some other more complex ordering.

The Aggregator stage does not itself sort the data. Sorting must be performed in an earlier stage, for example, using an ODBC stage or the sort plug-in.
Define the aggregation for each output column.
• Select the column(s) to group by. You will not be able to specify an aggregate function for the group-by column(s).
Double-click on the Derivation cell to open the Derivation window. This window is special in the Aggregator stage: it allows you to select a column and an aggregate function for the column.
Module 12
Job Control
Job parameters allow you to design flexible, reusable jobs. If you want to process data based on a particular file, file location, time period, or product, you can include these settings as part of your job design. However, if you do this, when you want to use the job again for a different file, file location, time period, or product, you must edit the design and recompile the job.

Instead of entering inherently variable factors as part of the job design, you can set up parameters which represent processing variables. When you run or schedule a job with parameters, DataStage prompts for the required information before continuing.

Job parameters can be used in many places in DataStage Designer, including:
• Passive stage file and table names.
• Passive stage directory paths.
• Account names for hashed files.
• Transformer stage constraints.
• Transformer stage derivations.
Recall this job. Customers from different countries are written out to separate files. The problem here is that the countries are "hard-coded" into the job design. What if we want a file containing, for example, Canadian customers? We can add a new output stage from the transformer, define a new constraint, and then recompile and run the job.

A more flexible method is to use a parameter in the constraint in place of a specific country string such as "USA". Then at runtime, the user can specify the particular country.
To define job parameters for a job, open the job in Designer and then click Edit>Job Properties. Click the Parameters tab on the Job Properties window.
Once a job parameter has been defined, it can be used in various components of a job design to add flexibility. Candidate uses for a parameter are enumerated on this slide.

Two methods are used to reference a job parameter:
• If the value will be used in DataStage-specific functions (such as a value used within a constraint or derivation), simply supply the name of the parameter.
• If the value will be used in system functions (such as the location of a file), the name of the parameter should be enclosed in # marks.
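A small illustrative sketch of both methods (the parameter names FilePath and CountryParam are hypothetical):

   File name in a Sequential File stage:  #FilePath#customers.txt
   Transformer constraint:                CustomersIn.country = CountryParam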
Since the file name is a value that is passed to the operating system for handling, the parameter is enclosed within # marks.
Since the value of the parameter will be used within a DataStage function (a constraint), the parameter name is used without enclosing # marks. In this example the developer simply right-clicked where the parameter should be placed, chose Job Parameter, and will select from the dropdown list of parameters available in this job.
True or False? When job parameters are used in passive stages such as Sequential File stages, they must be surrounded with pound (#) signs.

True: Correct! You must surround the name of the job parameter with pound signs. Otherwise, DataStage won't recognize it as a job parameter.

False: Incorrect. You must surround the name of the job parameter with pound signs. Otherwise, DataStage won't recognize it as a job parameter.
Before and after routines are DS routines that run before or after a job, and before or after a transformer. DS Before/After routines are defined in Manager. Three built-in Before/After routines are supplied with DataStage: ExecDOS, ExecSH, and ExecTCL. These routines can be used to execute Windows DOS, UNIX, and UniVerse commands, respectively. The command, together with any output, is added to the job log as an informational message.

You can also define custom Before/After routines. They are similar to other routines except that they have only two arguments: an input argument and an error code argument.
Click Edit>Job Properties on the Designer window, or the Stage Properties button in a transformer or other active stage. In either case, a window is displayed in which you can select a Before/After routine and specify an input parameter. Input parameters can contain job parameters.

In this example, the target file is copied to a temporary directory after the job runs, using the standard Windows DOS copy command.
DataStage is supplied with a number of functions you can use to control jobs and obtain information about jobs. For detailed information about these functions, see Job Control in Help.

These functions can be executed in the Job control tab of the Job Properties window, within DS routines, and within column derivations.
Here are some of the job control functions.

  BASIC Function     Description
  DSAttachJob        Specify the job you want to control
  DSSetParam         Set parameters for the job you want to control
  DSSetJobLimit      Set limits for the job you want to control
  DSRunJob           Request that a job is run
  DSWaitForJob       Wait for a called job to finish
  DSGetProjectInfo   Get information about the current project
  DSGetJobInfo       Get information about the controlled job or current job
  DSGetStageInfo     Get information about a stage in the controlled job or current job
  DSGetLinkInfo      Get information about a link in a controlled job or current job
  DSGetParamInfo     Get information about a controlled job's parameters
  DSGetLogEntry      Get the log event from the job log
  DSGetLogSummary    Get a number of log events on the specified subject from the job log
  DSGetNewestLogId   Get the newest log event, of a specified type, from the job log
  DSLogEvent         Log an event to the job log of a different job
  DSLogInfo          Log an informatory message to the job log
  DSStopJob          Stop a controlled job
  DSDetachJob        Return a job handle previously obtained from DSAttachJob
  DSSetUserStatus    Set a status message to return as a termination message when the job finishes
The job control routines and other BASIC statements written in the Job control tab are executed after the job in which they are defined runs. This enables you to run a job that controls other jobs; in fact, this can be all the job does.

For example, suppose you want a job that first loads a hashed file and then uses that hashed file in a lookup. You can define this as a single job. Alternatively, you can define this as two separate jobs (as we did earlier) and then define a "master" controlling job that first runs the load and then runs the lookup.
Create an empty job and then click Edit>Job Properties. Click the Job control tab. Select the jobs you want to run, one at a time, in the Add Job box and then click Add. The job control functions and other BASIC statements are added to the edit box. Add and modify the statements as necessary.

In this example:
• DSRunJob is used to run the load job.
• DSWaitForJob waits for the job to finish. You don't want the lookup to be performed until the hashed file is fully loaded.
• DSGetJobInfo gets information about the status of the job. If an error occurs, the job is aborted before the lookup job is run.
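A minimal sketch of such a controlling routine, assuming the two jobs are named LoadHashedFile and DoLookup (the job names and the abort message are illustrative; the code generated in the Job control tab follows this same pattern):

   * Attach and run the load job
   hJob1 = DSAttachJob("LoadHashedFile", DSJ.ERRFATAL)
   ErrCode = DSRunJob(hJob1, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob1)

   * Check how the load finished before running the lookup
   Status = DSGetJobInfo(hJob1, DSJ.JOBSTATUS)
   If Status = DSJS.RUNFAILED Or Status = DSJS.CRASHED Then
      Call DSLogFatal("Load job failed - lookup not run", "JobControl")
   End

   * The load succeeded: run the lookup job
   hJob2 = DSAttachJob("DoLookup", DSJ.ERRFATAL)
   ErrCode = DSRunJob(hJob2, DSJ.RUNNORMAL)
   ErrCode = DSWaitForJob(hJob2)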
The Job Sequencer enables you to graphically create controlling jobs, without using the job control functions. Job control code is automatically generated from your graphical design.

Job Sequences resemble standard DataStage jobs. They consist of stages and links, like DataStage jobs; however, it is a different set of stages and links. Among the stages are Job Activity stages, which are used to run DataStage jobs. Links are used to specify the sequence of execution. Triggers can be defined on these links to specify the condition under which control passes through the link.

There are other Activity stages, including:
• Routine Activity stages for executing a DataStage routine.
• Execute Command stages for executing Windows, UNIX, or DataStage commands.
• Notification stages for sending email notifications.
Here is an example of a Job Sequence. The stages are Job Activity stages. The first stage validates a job that loads a lookup hashed file. The second stage runs the job, if the validation succeeded. The third stage runs a job that does a lookup from the hashed file.

The links execute these three stages in sequence. Triggers are defined on each of the links, so that control is passed to the next stage only if the previous stage executed without errors.

To create a new Job Sequence, click the New button and then select Job Sequence.
This shows a Job Activity stage. Select the job to run in the Job name box. Select how you want to run it in the Execution action box.

The Parameters box lists all the parameters defined for the job. Select a parameter and then click Insert Parameter Value to specify a value to be passed to the parameter.
Triggers specify the condition under which control passes through a link. Select the type of trigger in the Expression Type box. The types include:
• Unconditional: Pass control unconditionally.
• Otherwise: Pass control if none of the triggers on the other links are executed.
• OK: Pass control if the job ran without errors or warnings.
• Failed: Pass control if the job failed.
• Warning: Pass control if the job ran with warnings.
• UserStatus: Pass control if the User Status variable contains the specified value. The User Status variable can be set in a job or routine using the DSSetUserStatus job control function.
• Custom: Specify your own condition in DataStage BASIC, as in the sketch below.
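As a small sketch of a Custom trigger (the activity name LoadJob is hypothetical, and the exact variable syntax should be checked against your release), a custom expression commonly tests an upstream activity's status against one of the documented job status constants, for example:

   LoadJob.$JobStatus = DSJS.RUNOK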
True or False? Triggers can be defined on the Job Activity Triggers tab for each input link.

True: Incorrect. Triggers are defined on output links. They determine whether execution will continue down the link.

False: Correct! Triggers are defined on output links. They determine whether execution will continue down the link.
DataStage Containers encapsulate a set of job design components (stages and links) into a single stage icon.

There are two kinds of Containers: Local and Shared. Local Containers only exist within the single job in which they are used. Use Shared Containers to simplify complex job designs.

Shared Containers exist outside of any specific job. They are listed in the Shared Containers branch in Manager. These Shared Containers can be added to any job. Shared Containers are frequently used to share a commonly used set of job components.

A Job Container contains two unique stages. The Container Input stage is used to pass data into the Container. The Container Output stage is used to pass data out of the Container.
This shows the components that make up an example Container. The same job components are used, with the exception of the Container Input stage, shown on the left, and the Container Output stage, shown on the right.
This shows a job with a Job Container (the stage in the middle). Data is passed into the Container from the link on the left. Data is retrieved from the Container on the link on the right. The Container processes the data using the set of stages and links it is designed with.
Module 13
Working with Plug-Ins
A plug-in is a custom-built stage (active or passive) that you can install and use in DataStage in addition to the built-in stages. Plug-ins provide additional functionality without the need for new versions of DataStage to be released. Plug-ins can be written in either C or C++. Sample code is loaded in the /sample directory when DataStage is installed.

A number of plug-ins are provided by Ascential. These include:
• Plug-in stages pre-installed with DataStage
  − Found in the Stage Types/PlugIn branch in Manager.
  − Includes the Oracle bulk loader.
• Plug-in stages on the installation CD. These include:
  − Additional bulk loaders.
  − An ftp plug-in for accessing data using the ftp protocol.
  − A sort plug-in for sorting data.
  − A merge plug-in for integrating data.
  − Plug-ins for accessing RDBMSs, such as Oracle, through native drivers.
• Chargeable plug-in stages available from Ascential. These include:
  − External Data Access (EDA) for access to mainframe systems.
  − Change Data Capture (CDC) for obtaining only changed records.
• Plug-in stages written to the DataStage C API may also be available from third-party vendors.

Once a plug-in is installed you can use it in your jobs just as you can the built-in stages.
You can view installed plug-ins for a project in Manager.
Documentation for the plug-ins that come with DataStage is provided in PDF format on the DataStage installation CD.

In addition, open the plug-in in Manager. The Stage Type window provides a variety of information in its four tabs:
• A description of the plug-in.
• Plug-in creator information.
• Plug-in dependencies.
• Plug-in properties.

Most of what you need to do when you use a plug-in in a job is to set its properties correctly. Plug-ins provide online documentation for each property when you open the Properties tab in Designer.

The sort plug-in can have one input link and one output link. The input link specifies the records of data to be sorted. The output link outputs the data in sorted order.
This lists the main tasks involved in defining a sort using the DataStage sort plug-in.
The sort stage has three tabs:
• Inputs tab: Specify the format of the data to be sorted.
• Outputs tab: Specify the format of the data after it is sorted.
• Stage tab: On the Properties sub-tab, set the properties that define the sort.
True or False? Job parameters can be used in the Sort plug-in stage.

True: Correct! As when using job parameters in sequential stages, surround the job parameters with # signs.

False: Incorrect. As when using job parameters in sequential stages, surround the job parameters with # signs.
Module 14
Scheduling and Reporting
Jobs are scheduled in Director. A job can be scheduled to run in a number of different ways:
• Once today at a specified time.
• Once tomorrow at a specified time.
• On a specific day and at a particular time.
• Daily at a particular time.
• On the next occurrence of a particular date and time.

Each job can be scheduled to run on any number of occasions, and can be run with different job parameter values on the different occasions.

Jobs run on the DataStage server under the user name specified on the Schedule tab in Administrator. If no user name is specified, they run under the same name as the Windows NT Schedule service.

If DataStage is running on Windows NT, DataStage uses the Windows NT Schedule service to schedule jobs. If you intend to use the DataStage scheduler, be sure to start or verify that the Windows NT Schedule service is running. To start the NT Scheduler, open the Windows NT Control Panel and then open the Services icon. You can then manually start the service or set the service to start automatically each time the computer is started.
True or False? When a scheduled job runs, it runs under the user ID of the person who scheduled it.

True: Incorrect. When a user manually runs a job in Director, the job runs under the user ID of the person who manually started it. When a scheduled job runs, it runs under the user ID specified in Administrator.

False: Correct! When a user manually runs a job in Director, the job runs under the user ID of the person who manually started it. When a scheduled job runs, it runs under the user ID specified in Administrator.
In addition to the simple reports you can generate in Designer and Director using File>Print, DataStage provides a flexible and powerful reporting tool. The DataStage Reporting Assistant is invoked from DataStage Manager. You can generate reports at various levels within a project, including:
• The entire project
• Selected jobs
• Selected table definitions
• Selected routines and transforms
• Selected plug-in stages

Information generated for reporting purposes is stored in an ODBC database on the DataStage client. You can use this information for printing a report, writing a report to a file, or for browsing.

By default, DataStage stores the reporting information in a Microsoft Access data source named DSReporting that is defined when the Reporting Assistant is installed.
This shows an example of a report created for a job.
This lists the main tasks involved in generating a report.
True or False? The DataStage Reporting Assistant stores the data it uses in its reports in an ODBC database.

True: Correct! This data source is set up on your client machine when the DataStage clients are installed.

False: Incorrect. This data source is set up on your client machine when the DataStage clients are installed.
Module 15
Optimizing Job Performance
Collection of performance statistics for a particular job run is controlled from the DataStage Director client. Some overhead is consumed by the collection; therefore, job component percentages may not sum to 100% for every job run.

Statistics are written to the job log and may be viewed as long as that log is preserved.
The collection of performance statistics can be turned on and off for each active stage in a DataStage job. This is done via the Tracing tab of the Job Run Options dialog box: select the stage you want to monitor and select the Performance statistics check box. Use shift-click to select multiple active stages to monitor from the list.
The first pane of the above frame contains a sample of a job log. When performance tracing is turned on, a special log entry is generated immediately after the stage completion message. This is identified by the first line: job.stage.DSD.StageRun Performance statistics.

The second pane contains the detailed view of the statistics message; it is displayed in tabular form. You can cut these statistics and paste them into a spreadsheet if required to make further analysis possible.

The performance statistics relate to the per-row processing cycle of an active stage, and of each of its input and output links. The information shown is:
• Percent. The percentage of overall execution time that this part of the process used.
• Count. The number of times this part of the process was executed.
• Minimum. The minimum elapsed time in microseconds that this part of the process took for any of the rows processed.
• Average. The average elapsed time in microseconds that this part of the process took for the rows processed.

You need to take care interpreting these figures. For example, when in-process active stage to active stage links are used, the percent column will not add up to 100%. Also be aware that, in these circumstances, if you collect statistics for the first active stage, the entire cost of the downstream active stage is included in the active-to-active link (as shown in our example diagram). This distortion remains even where you are running the active stages in different processes (by having inter-process row buffering enabled) unless you are actually running on a multi-processor system.
You can improve the performance of most DataStage jobs by turning in-process row buffering on and recompiling the job. This allows connected active stages to pass data via buffers rather than row by row.

You can turn in-process row buffering on for the whole project using the DataStage Administrator. Alternatively, you can turn it on for individual jobs via the Performance tab of the Job Properties dialog box.

Note: You cannot use in-process row buffering if your job uses COMMON blocks in transform functions to pass data between stages. This is not recommended practice, and it is advisable to redesign your job to use row buffering rather than COMMON blocks.
Row buffering can be turned on in DataStage Administrator, either as in-process or inter-process. Using Administrator to perform this function will set up row buffering as the project default; however, this can be overridden at the individual job level.
  • Module 15 – Optimizing Performance DataStage 314Svr15 - 10Copyright © 2002 Ascential Software Corporation09/01/02
When you design a job you see it in terms of stages and links. When it is compiled, the DataStage engine sees it in terms of processes that are subsequently run on the server.

How does the DataStage engine define a process? It is here that the distinction between active and passive stages becomes important. Active stages, such as the Transformer and Aggregator, perform processing tasks, while passive stages, such as the Sequential File stage and ODBC stage, read or write data sources and provide services to the active stages. At its simplest, active stages become processes. But the situation becomes more complicated when you connect active stages together, or passive stages together.
What happens when you have a job that links two passive stages together? Obviously there is some processing going on. Under the covers DataStage inserts a cut-down Transformer stage between the passive stages, which simply passes data straight from one stage to the other and becomes a process when the job is run.

What happens when you have a job that links two or more active stages together? By default this will all be run in a single process. Passive stages mark the process boundaries, and all adjacent active stages between them run in a single process.
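The boundary rule can be stated mechanically. The Python function below is a toy model of the description above -- each maximal run of adjacent active stages collapses to one process, and a direct passive-to-passive link gets the implicit cut-down Transformer as its own process:

def count_processes(stages):
    # stages: list of "active"/"passive" stage kinds, in link order.
    processes = 0
    run = 0                          # current run of adjacent active stages
    for i, kind in enumerate(stages):
        if kind == "active":
            run += 1
        else:
            if run > 0:
                processes += 1       # the run of actives is one process
                run = 0
            elif i > 0 and stages[i - 1] == "passive":
                processes += 1       # implicit cut-down Transformer
    if run > 0:
        processes += 1
    return processes

# Sequential -> Transformer -> Aggregator -> Sequential: one process
print(count_processes(["passive", "active", "active", "passive"]))   # 1
# Sequential -> Transformer -> Sequential -> Transformer -> Sequential
print(count_processes(["passive", "active", "passive", "active", "passive"]))  # 2
# Sequential -> Sequential: one implicit process
print(count_processes(["passive", "passive"]))                       # 1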
This job comprises two processes because of the second Sequential stage.
SMP – Symmetric Multi-Processors: multiple CPUs with a single shared operating system and shared memory. Examples: Sun Starfire, IBM S80, Compaq GS Series, HP Superdome.
MPP – multiple independent systems connected by a high-speed network; some nodes may be SMP. Each node has dedicated resources.
Clusters – a group of connected computers.
NUMA – Non-Uniform Memory Access: a tightly coupled cluster which uses both shared memory and local memory.

To achieve multi-processing you need not only multi-processing hardware but also applications that were built to multi-process.
Data partitioning requires the same transform on all partitions; this can easily be accomplished using containers. Data pipelining and data partitioning can occur simultaneously within a job.
If you split processes in your job design by writing data to a Sequential file and then reading it back again, you can use an Inter Process (IPC) stage in place of the Sequential stage. This splits the process and reduces I/O and elapsed time, because the reading process can start reading data as soon as it is available rather than waiting for the writing process to finish.
Normally this would be a single process, but with the introduction of the IPC stage this job splits the read operation into one process and the transform and write into another. Meta data must be the same on both the input and output links to the IPC stage.
The Properties tab allows you to specify two properties for the IPC stage:
• Buffer Size. Defaults to 128 Kb. The IPC stage uses two blocks of memory; one block can be written to while the other is read. This property defines the size of each block, so by default 256 Kb is allocated in total.
• Timeout. Defaults to 10 seconds. This specifies a time limit for how long the stage will wait for a process to connect to it before timing out. This normally will not need changing, but may be important where you are prototyping multi-processor jobs on single-processor platforms and there are likely to be delays.
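The two-block arrangement behaves like a bounded, double-buffered pipe. The Python class below is a rough sketch of the idea, not DataStage code; the 128 Kb block size and 10-second timeout mirror the defaults described above, while everything else is invented for illustration:

import queue
import threading

BUFFER_SIZE = 128 * 1024    # default block size; two blocks = 256 Kb total
TIMEOUT = 10                # seconds to wait for the peer, per the default

class IPCBuffer:
    def __init__(self):
        # maxsize=2 models the two memory blocks: the writer can be at most
        # one full block ahead of the reader.
        self._blocks = queue.Queue(maxsize=2)

    def write(self, rows):
        block, size = [], 0
        for row in rows:
            block.append(row)
            size += len(row)
            if size >= BUFFER_SIZE:
                self._blocks.put(block, timeout=TIMEOUT)  # Full -> "timed out"
                block, size = [], 0
        if block:
            self._blocks.put(block, timeout=TIMEOUT)
        self._blocks.put(None, timeout=TIMEOUT)           # end of data

    def read(self):
        while True:
            block = self._blocks.get(timeout=TIMEOUT)     # Empty -> "timed out"
            if block is None:
                return
            yield from block

buf = IPCBuffer()
writer = threading.Thread(
    target=buf.write, args=((f"row-{i:08d}".encode() for i in range(50_000)),))
writer.start()
print(sum(1 for _ in buf.read()))   # the reader starts before the writer ends
writer.join()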
With the introduction of enhanced multi-processor support at Release 6, there are opportunities to further enhance the performance of server jobs by partitioning data.

The Partitioner stage allows you to partition the data you are reading so it can be processed by individual processes running on multiple processors. The Collector stage (at the right of the diagram) allows you to collect the partitioned data together again for writing to a single data target. Inter-process row buffering should be turned on.
This example shows the Partitioner stage depicted in the previous frame. Note: meta data on the output and input links must be identical.
Algorithms:
• Round-Robin. This is the default method. Using the round-robin method the stage writes each incoming row to one of its output links in turn.
• Random. Using this method the stage uses a random number generator to distribute incoming rows evenly across all output links.
• Hash. Using this method the stage applies a hash function to one or more input column values to determine which output link the row is passed to.
• Modulus. Using this method the stage applies a modulus function to an integer input column value to determine which output link the row is passed to.

Partitioning Key. This property is only significant where you have chosen a partitioning algorithm of Hash or Modulus. For the Hash algorithm, specify one or more column names separated by commas; these keys are concatenated and a hash function is applied to determine the destination output link. For the Modulus algorithm, specify a single column name that identifies an integer numeric column; the value of this column determines the destination output link.
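Purely for illustration, the four link-selection rules can be sketched in a few lines of Python. This is not DataStage code, and DataStage's actual hash function is not documented here, so CRC-32 stands in for it; key handling follows the text (Hash keys are concatenated before hashing, Modulus reads a single integer column):

import random
import zlib

def round_robin(links):
    # Yields the next output link for each incoming row, in turn.
    i = 0
    while True:
        yield i % links
        i += 1

def random_link(links):
    return random.randrange(links)

def hash_link(row, keys, links):
    concatenated = "".join(str(row[k]) for k in keys)   # keys joined, then hashed
    return zlib.crc32(concatenated.encode()) % links

def modulus_link(row, key, links):
    return row[key] % links      # key must name an integer numeric column

rows = [{"store_key": n, "product": f"P{n % 7}"} for n in range(10)]
rr = round_robin(3)
for row in rows:
    print(row["store_key"], next(rr),
          hash_link(row, ["product", "store_key"], 3),
          modulus_link(row, "store_key", 3))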
The Collector stage is an active stage that takes up to 64 inputs and allows you to collect data from these links and route it along a single output link. The stage expects the output link to use the same meta data as the input links.

The Collector stage can be used in conjunction with a Partitioner stage to enable you to take advantage of a multi-processor system and have data processed in parallel: the Partitioner stage partitions the data, it is processed in parallel, and the Collector stage then collects it together again before writing it to a single target.
The Properties tab allows you to specify two properties for the Collector stage:
• Collection Algorithm. Use this property to specify the method the stage uses to collect data. Choose from:
  o Round-Robin. This is the default method. Using the round-robin method the stage reads a row from each input link in turn.
  o Sort/Merge. Using the sort/merge method the stage reads multiple sorted inputs and writes one sorted output.
• Sort Key. This property is only significant where you have chosen a collecting algorithm of Sort/Merge. It defines how each of the partitioned data sets is known to be sorted and how the merged output will be sorted.
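Again for illustration only, the two collection methods look roughly like this in Python; sort/merge depends on every input link already being sorted on the sort key:

import heapq
from itertools import chain, zip_longest

def collect_round_robin(inputs):
    # inputs: one row iterable per input link (up to 64 links).
    sentinel = object()
    interleaved = chain.from_iterable(zip_longest(*inputs, fillvalue=sentinel))
    return [row for row in interleaved if row is not sentinel]

def collect_sort_merge(inputs, key):
    # Each input link must already be sorted on the sort key.
    return list(heapq.merge(*inputs, key=key))

link1 = [{"id": 1}, {"id": 4}, {"id": 9}]
link2 = [{"id": 2}, {"id": 3}, {"id": 7}]
print(collect_round_robin([link1, link2]))
print(collect_sort_merge([link1, link2], key=lambda r: r["id"]))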
The Partitioner stage can support up to 64 output links and the Collector can support up to 64 input links. Meta data should be identical on the input and output links to the Partitioner; similarly for the Collector stage.

This configuration can also use a container in place of the Transformer.
Module 16
Putting It All Together
Our final application will use a business data mart for source data. Data will be extracted, undergo little or no transformation, be summarized, and be loaded into a target database. The reason little or no data transformation will occur is that the data was already cleansed before being loaded into the data mart.
The existing data mart is a star schema: a sales fact table surrounded by promotion, store, product, and time dimensions. The time dimension has been renamed to timex because of naming convention limitations within Microsoft Access; similarly, the date field has been renamed to datex.

A star schema, such as the one depicted above, is an ideal data structure for ad hoc queries, and many vendor tools are available in the marketplace to support this type of query building.

Note that the dimensions are linked to the fact table by surrogate keys. We will use the surrogate keys to build DataStage jobs that denormalize the data.
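To make the denormalization concrete, the Python sketch below resolves each fact row's surrogate keys against in-memory dimension lookups, then summarizes by a dimension attribute. The table and column names are invented for illustration; they are not the exact schema of the course data mart:

# Two toy dimensions keyed by surrogate key (names are hypothetical).
store_dim   = {101: {"store_name": "Downtown"}, 102: {"store_name": "Mall"}}
product_dim = {7: {"product_name": "Widget"}, 8: {"product_name": "Gadget"}}

sales_fact = [
    {"store_key": 101, "product_key": 7, "sales": 250.0},
    {"store_key": 102, "product_key": 8, "sales": 120.0},
    {"store_key": 101, "product_key": 8, "sales": 310.0},
]

# Denormalize: replace each surrogate key with the dimension attributes.
denormalized = [
    {**store_dim[f["store_key"]], **product_dim[f["product_key"]],
     "sales": f["sales"]}
    for f in sales_fact
]

# Summarize by store -- the kind of table end users query directly.
summary = {}
for row in denormalized:
    summary[row["store_name"]] = summary.get(row["store_name"], 0) + row["sales"]
print(summary)   # {'Downtown': 560.0, 'Mall': 120.0}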
The desired end-user result is a table, or set of tables, that summarizes the data by specific dimensions. This video will briefly demonstrate the actions a typical end-user might employ with the summary tables.