Excellent DataStage Documentation and Examples in New 660 Page IBM RedBook
Vincent McBurney | May 20, 2008
There is a new IBM draft Redbook seeking community feedback called IBM WebSphere DataStage Data Flow and Job Design, with a whopping 660 pages of guidelines, tips and examples. An IBM RedBook such as IBM InfoSphere DataStage Data Flow and Job Design brings together a team of researchers from around the world to an IBM lab to spend 2-6 weeks researching a practical use of an IBM product. It's kind of like Big Brother but they are doing something useful and don't have quite as many spa parties (so I'm told). IBM is seeking peer review and feedback on this draft.
There are a few bonuses in this book:
• 17 pages of DataStage architecture overview.
• 5 pages of best practices, standards and guidelines.
• 100 pages describing the most popular stages in parallel jobs.
• A sneak peek at the new DataStage 8.1 Distributed Transaction Stage for XA transactions from MQ.
• Several hundred pages on a retail processing scenario.
• A download of DataStage export files and scripts available from the Redbook website.
• It also lifts the lid on some product rebranding: goodbye WebSphere DataStage, hello InfoSphere DataStage.
I've heard a few complaints (some of them from me) on the lack of DataStage documentation over the years. "Where can I download the PDFs?" "Are there any books about DataStage?" "Are there any DataStage standards?" "Where can I get example jobs?" "Please send me the materials for DataStage certification." Well we can all stop complaining! You can't ask for more than over a thousand pages of documentation with screenshots and examples in this RedBook and the one from last year I profiled in Everything you wanted to know about SOA on the IBM Information Server but were too disinterested to ask. Not to mention IBM WebSphere QualityStage Methodologies, Standardization, and Matching. This one belongs on my list of The Top 7 Online DataStage Tutorials.
The team that put this one together:
• Nagraj Alur was the project leader and works at the San Jose centre.
• Celso Takahashi is a technical sales expert from IBM Brazil.
• Sachiko Toratani is an IT support specialist from IBM Japan.
• Denis Vasconcelos is a data specialist from IBM Brazil.
The team was supported by the DataStage development team from IBM's Silicon Valley Labs in San Jose.
It's a whopping RedBook weighing in at 660 pages and 19.7 MB as it's chock full of screenshots. Because not all readers want to download a 19.7 MB file or wade through a PDF to find out if they want it, I have taken a deeper look at a couple of sections and included the full table of contents.
There are a few pages of standards and guidelines that are handy for beginner programmers and cover overall setup and specific stage setup:
• DataStage Data Types
• Stage-specific guidelines
An example of some stage-specific guidelines:
Take precautions when using expressions or derivations on nullable columns within the parallel Transformer:
– Always convert nullable columns to in-band values before using them in an expression or derivation.
– Always place a reject link on a parallel Transformer to capture / audit rows rejected by failed derivations.
Be particularly careful to observe the nullability properties for input links to any form of Outer Join. Even if the source data is not nullable, the non-key columns must be defined as nullable in the Join stage input in order to identify unmatched records.
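The null-handling advice carries over to any tool: substitute an in-band default before a nullable field enters an expression, and capture the rows that still carry nulls on a reject path. A minimal Python sketch of the pattern (the data and function names here are mine for illustration, not DataStage APIs):

```python
def null_to_value(value, default):
    """Convert a nullable field to an in-band value before it enters
    an expression (the spirit of DataStage's NullToValue-style derivations)."""
    return default if value is None else value

rows = [
    {"first": "Ada", "last": "Lovelace"},
    {"first": "Grace", "last": None},  # a nullable column with a null in it
]

good, rejects = [], []
for row in rows:
    # Without the in-band conversion, concatenating None would blow up,
    # which is the parallel-Transformer failure the guideline warns about.
    full_name = row["first"] + " " + null_to_value(row["last"], "<unknown>")
    if row["last"] is None:
        rejects.append(row)  # the "reject link": capture null rows for audit
    else:
        good.append({"full_name": full_name})

print(len(good), len(rejects))
```

In a parallel Transformer the equivalent is a null-to-value derivation on each nullable input column plus a reject link on the stage itself.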
When you add to this all the sample jobs you have a great data warehouse example. Personally I'd like to see this entire DataStage standards and guidelines section lifted out and plonked in a wiki - perhaps over on LeverageInformation.
Distributed Transaction Stage
A Distributed Transaction Stage accepts multiple input links in a DataStage job representing rows of data for various database actions and makes sure they are all applied as a single unit of work. This stage is coming in release 8.1, and dsRealTime blog author Ernie Ostic talks about it (and about how to achieve this in a Server Job) in his post MQSeries… Ensuring Message Delivery from Queue to Target:
Using MQSeries in DataStage as a source or target is very easy… but ensuring delivery from queue to queue is a bit more tricky. Even more difficult is trying to ensure delivery from queue to database without dropping any messages…
The best way to do this is with an XA transaction, using a formal transaction coordinator, such as MQSeries itself. This is typically done with the Distributed Transaction Stage, which works with MQ to perform transactions across resources… deleting a message from the source queue, INSERTing a row to the target, and then committing the entire operation. This requires the most recent release of DataStage, and the right environment, releases, and configuration of MQSeries and a database that it supports for doing such XA activity…
In the example in the Redbook a series of messages is read from an MQSeries queue, transformed ETL style, and then passed to the Distributed Transaction Stage (DTS) to be written to various database tables.
This job looks like Napoleon's troop movements at Waterloo but shows how the job takes a complex message from MQ, flattens it out into customer, product and store rows, does a bit of fancy shmancy transformation using DataStage stages, and sends insert, update and delete commands for all three types of data to a Distributed Transaction Stage. A Unit of Work is a bundle of up to nine database commands plus the removal of the message they all came from, all with a single rollback on failure.
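The unit-of-work idea itself is easy to sketch: apply all of a message's database commands and the message removal inside one transaction, and roll everything back together on failure. A hedged Python illustration using sqlite3 as a stand-in single resource - a real DTS coordinates a true two-phase XA commit between MQ and the database, and the function and table names here are mine, not the stage's:

```python
import sqlite3

def process_message_as_unit_of_work(conn, queue, commands):
    """Apply every database command derived from one message, plus the
    removal of that message, as a single all-or-nothing unit of work."""
    msg = queue.pop(0)  # destructive read; a real DTS takes this from MQ
    try:
        for sql, params in commands(msg):
            conn.execute(sql, params)
        conn.commit()          # everything lands together...
    except Exception:
        conn.rollback()        # ...or nothing does
        queue.insert(0, msg)   # and the message stays on the queue
        raise

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
insert = lambda m: [("INSERT INTO customer VALUES (?, ?)", (m["id"], m["name"]))]

queue = [{"id": 1, "name": "ACME"}]
process_message_as_unit_of_work(conn, queue, insert)  # succeeds: row in, message gone

queue.append({"id": 1, "name": "DUP"})  # duplicate key: the unit of work will fail
try:
    process_message_as_unit_of_work(conn, queue, insert)
except sqlite3.IntegrityError:
    pass  # nothing committed; the message is back at the head of the queue

print(len(queue), conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0])
```

The failure path is the whole point: the bad message survives on the queue for reprocessing and the target tables are untouched, which is what "single rollback on failure" buys you.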
There are some handy functions in this design:
• You can read from the queue in read-only mode so the messages stay on there, or in destructive mode so handled messages are removed.
• You can choose to write the messages out in the order they were placed on the queue, handy for parallel jobs where rows can otherwise arrive out of sequence.
• You can configure the job to finish after reading a certain number of transactions or after a defined period of time.
• You can treat different messages with a shared key field as a single unit.
So if you are like me you look at the job and wonder how the hell you set the properties of the DTS stage when it has nine input links and nine different sets of database commands. Well, that's one of the surprises in release 8.1: there is a nifty diagram in the property window (kind of like a Google map) that shows you which link you are modifying at any point in time.
You can click on a link in this little map to change to the properties for that link - so you can click on Product_Delete to see the properties for the delete command on the product table and then click on Store_Update to change to a different set of properties. I wonder how many DataStage 8.1 stages are going to have this feature? Could be handy. You can also see the new look and feel of the property window, which is a lot more like a standard GUI property window now - kind of what you see in tools like Visual Basic.
Slowly Changing Dimension Stage
The RedBook also gives the new 8.0.1 Slowly Changing Dimension Stage a thorough going over, in a lot more detail than any of the documentation or tutorials we have seen before. The retail scenario shows a very complex series of SCD updates out of a single complex flat file.
I've done this type of dimension load before, and before the SCD stage arrived the same functionality could have taken ten jobs with up to ten stages in each. The SCD stage performs the same functionality as four stages under the old version: a surrogate key generator, a surrogate key lookup, a change data capture, and a transformer for setting dates, flags and values. The SCD stage does all this in one stage so it's a lot easier for inexperienced programmers and those new to SCD functionality.
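Those four functions the stage rolls together - surrogate key generation, surrogate key lookup, change data capture, and the transformer work of stamping dates and flags - amount to classic Type 2 dimension logic. A hedged Python sketch of the idea (my own simplified illustration, not the stage's implementation):

```python
from datetime import date
from itertools import count

surrogate_keys = count(1)   # surrogate key generator
dimension_table = []        # holds every version of every dimension row

def apply_scd2(business_key, attrs, load_date):
    # Surrogate key lookup: find the current version for this business key.
    current = next((r for r in dimension_table
                    if r["bk"] == business_key and r["current_flag"] == "Y"), None)
    # Change data capture: no new version if nothing changed.
    if current and current["attrs"] == attrs:
        return current
    # Transformer duties: expire the old row, stamp dates and flags on the new.
    if current:
        current["expiry_date"] = load_date
        current["current_flag"] = "N"
    new_row = {
        "sk": next(surrogate_keys),
        "bk": business_key,
        "attrs": attrs,
        "effective_date": load_date,
        "expiry_date": None,
        "current_flag": "Y",
    }
    dimension_table.append(new_row)
    return new_row

apply_scd2("STORE01", {"region": "East"}, date(2008, 5, 1))
row = apply_scd2("STORE01", {"region": "West"}, date(2008, 5, 20))
print(row["sk"], len(dimension_table))
```

Strung across separate stages, each of those commented steps was its own lookup or transformer - which is why the pre-SCD versions of these jobs sprawled the way they did.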
The RedBook takes a look inside the properties screen; it looks a lot like a transformer, with some extra columns to define the purpose of the special SCD tagging fields.