(305) 4-1
Advanced DataStage
Module 2
Working with Hashed Files
Module Objectives
After this module, you will be able to:
 Explain what a hashed file is
 Describe the types of hashed files
 Define virtual (I-Descriptor) fields
 Create different types of dynamic and static hashed files
 Analyze hashed files using FILE.STAT and ANALYZE.FILE
 Access hashed files using UniVerse commands, UV
stages, and hashed file stages
 Create and access distributed files
What is a Hashed File?
 Records are stored in groups
 Each group is the same size, specified at creation
 Number of groups is called “modulus”
 Hashing algorithm applied to primary key
determines the group
 Fast access
 Preload into memory for fast reads
 Enable write-caching for fast writes
 Dynamic v. static hashed files
 Dynamic: Additional groups are added as needed
 Static: Fixed number of groups, specified at creation
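The group-assignment idea above can be sketched in a few lines. This is an illustrative hash (summing character codes), not UniVerse's actual GENERAL or SEQ.NUM algorithm; the function name and key values are hypothetical.

```python
# Illustrative sketch: a hashed file maps each primary key to one of a
# fixed number of groups. Here the "hash" is a simple character-code sum
# taken modulo the group count (the file's modulus).

MODULUS = 5  # number of groups, fixed at creation for a static file

def group_for(key: str, modulus: int = MODULUS) -> int:
    """Return the 1-based group number that would hold this key."""
    return sum(ord(c) for c in key) % modulus + 1

# The same key always hashes to the same group, so a lookup reads only
# that group's pages from disk -- the source of the fast access.
print(group_for("EMP1001"))
print(group_for("EMP1001") == group_for("EMP1001"))
```

The deterministic mapping is what makes keyed reads fast: one hash computation replaces a scan of the whole file.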
Uses of Hashed Files
 Provides fast access for lookups
 Remote data can be loaded into locally
stored hashed files for better performance
 Useful as a temporary or non-volatile
program storage area
 Can incorporate pre-compiled column
derivation expressions (“virtual columns”)
 Useful for normalizing UniVerse multi-
valued (nested) data
Inserting Records
[Diagram: incoming records pass through the hashing algorithm, which assigns each record to one of Groups 1-5]
Types of Hashed Files
There are two kinds of hashed files:
 Dynamic
 Automatically increase or decrease the
number of groups as necessary
 Static
 Number of groups does not change
automatically
Overflow
When there isn’t enough space left in a
group for the record, the group overflows:
[Diagram: a file header followed by Groups 1-5 at file addresses 2048, 4096, 6144, 8192, and 10240; an overflow buffer at address 12288 is chained to Group 2]
Hashing Algorithms
 Dynamic
 GENERAL or SEQ.NUM
 Static
 17 different choices
 Based on variation pattern of primary key
values
 Specified as "file type"
Static Hashed File Algorithms

                         most variation in key occurs
character type           right   middle   left   any
wholly numeric             2       6       10     14
numeric & separators       3       7       11     15
ASCII                      4       8       12     16
any                        5       9       13     17
Creating Hashed Files
Use Hashed File target stage
 Or create from UniVerse command line
 CREATE.FILE
 SQL CREATE TABLE
Location
 Account (Project)
 Directory
Enable / Disable write caching
Column definitions
Key column(s)
Create file options
Hashed File Location
Create in project (account)
Blank for current project
Create in server directory
Hashed File Write Caching
 Write records into a memory cache
 Writes multiple records to disk from the cache
 Reduces number of disk writes
 By default, write caching is disabled
 Do not read and write to the same hashed
file in a single transform stage group!
 Cached record may not be found
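The caching behavior and its hazard can be sketched as follows. This is a hypothetical model, not DataStage's implementation: the class name, flush threshold, and dict-based "disk" are all illustrative.

```python
# Hypothetical sketch of write caching: records accumulate in an in-memory
# cache and are flushed to disk in batches, reducing the number of physical
# writes. The slide's warning follows directly: a record still sitting in
# the cache is invisible to a reader that goes straight to disk.

class WriteCache:
    def __init__(self, flush_size: int = 3):
        self.pending: dict[str, str] = {}  # records not yet on disk
        self.disk: dict[str, str] = {}     # stands in for the hashed file
        self.flush_size = flush_size
        self.disk_writes = 0

    def write(self, key: str, record: str) -> None:
        self.pending[key] = record
        if len(self.pending) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        self.disk.update(self.pending)     # one batched physical write
        self.disk_writes += 1
        self.pending.clear()

cache = WriteCache()
cache.write("A", "rec1")
print("A" in cache.disk)   # False: the record is cached, a read misses it
```

Three cached records cost one physical write instead of three; the trade-off is the read-after-write window the slide warns about.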
Reading + Writing to Same Hashed File
[Job diagram: the same transform stage group writes to and reads from the UniqueEmps hashed file; this is the pattern to avoid when write caching is enabled]
Enable Write Cache
[Screenshot callout: the Enable write cache option on the Hashed File stage]
Specifying the Hashed File Key
 Must specify a key
 Composed of one or more columns
starting with first column
 Records written with the same key value
are overwritten
 There can be only one record with the same
key value
Using Character Fields as Keys
 Hashed files preserve blanks in both data
and key columns if they are derived from a
char data type.
 The TRIM routine can be used to remove
extra spaces when populating or referencing
the file.
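The padding pitfall can be shown with a tiny sketch, using Python's str.strip as a stand-in for the TRIM routine; the key and value are hypothetical.

```python
# CHAR columns are blank-padded, and hashed files preserve those blanks
# in the key, so a lookup with an untrimmed value misses. A sketch of the
# failure and the fix:

hashed_file = {"E01   ": "Eastern region"}   # key padded to CHAR(6)

lookup_key = "E01"
print(lookup_key in hashed_file)             # False: padding differs

# Trimming on both sides (load and lookup) restores the match.
trimmed_file = {k.strip(): v for k, v in hashed_file.items()}
print(trimmed_file.get(lookup_key.strip()))
```

The general rule: trim consistently when populating the file and when building the reference key, not just on one side.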
Exercise Part I: Working with Hashed Files
 Create a hashed file in the current
DataStage project
 Create a hashed file in a directory
 Use a hashed file to find duplicates
 Test write caching
I-Descriptors
 Define virtual fields
 Read-only
 Expression defines the virtual field
 Very similar to a BASIC expression
 Expression is evaluated when the virtual field is
accessed
 Must be compiled
 Example: Bonus = min_lvl * 10
 Uses
 Boost performance
 Create standard interface
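The evaluate-on-access idea can be modeled in a short sketch mirroring the slide's Bonus = min_lvl * 10 example. The dict-and-lambda mechanism is illustrative, not how UniVerse stores I-Descriptors (which live in the file dictionary and are compiled).

```python
# Sketch of a virtual (I-Descriptor) field: the stored record holds only
# real fields; the virtual field is an expression evaluated each time the
# field is accessed, and is read-only.

record = {"emp_id": "E01", "min_lvl": 14}

# Virtual field definitions: name -> expression over the record.
virtual_fields = {"Bonus": lambda rec: rec["min_lvl"] * 10}

def read_field(rec: dict, name: str):
    if name in virtual_fields:        # computed on access, never stored
        return virtual_fields[name](rec)
    return rec[name]

print(read_field(record, "Bonus"))    # 140, derived from min_lvl
```

Because the expression runs inside the database engine, the derivation does not have to be repeated in every job that reads the file, which is the "standard interface" benefit above.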
Creating I-Descriptor Fields
 Create from DataStage command line
 Use REVISE utility
 Add record specifying field to hashed file
dictionary
 Import metadata as a UniVerse table
 Access using UV stage
Displaying the Hashed File Dictionary
LIST DICT JOBCODES1
[Screenshot callouts: data descriptor records, the key alias, and the default display fields]
Using REVISE
REVISE DICT JOBCODES1 USING ENTER.DICT
[Screenshot callouts: the number of the field to change, and the field name]
Creating an I-Descriptor Field
[Screenshot callouts: the field name, type I, and the BASIC expression]
Displaying the New Field
[Screenshot callouts: the added field, and the command to display it]
Exercise Part II: Virtual Fields
 Define a virtual (I-Descriptor) field
 Import metadata for virtual fields
 Access data in virtual fields
Creating Static Hashed Files
[Screenshot callouts: the number of groups (modulus), and the group size in units of 512 bytes]
Creating Dynamic Hashed Files
[Screenshot: Create file options dialog for a dynamic hashed file]
Analysis Tools
FILE.STAT filename
 For static hashed files
 Reports statistics about record distribution across groups
 Use HASH.HELP or HASH.HELP.DETAIL for recommendations
 Use HASH.TEST or HASH.TEST.DETAIL to test different
algorithms (file types) and moduli (group counts)
ANALYZE.FILE filename STATISTICS
 For dynamic hashed files
 Reports sizing, data distribution, number of groups, etc.
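The core measurement these tools report, how evenly records are spread across groups, can be sketched as follows. The hash and key values are illustrative, not the UniVerse algorithms.

```python
# Sketch of FILE.STAT-style analysis: count records per group for a given
# modulus. A poor algorithm/modulus choice clusters records into a few
# groups, causing overflow; the analysis tools surface that skew so you
# can pick a better file type or modulus.

from collections import Counter

def distribution(keys, modulus):
    """Records per group under a simple illustrative hash."""
    groups = Counter(sum(ord(c) for c in k) % modulus for k in keys)
    return [groups.get(g, 0) for g in range(modulus)]

keys = [f"EMP{n:04d}" for n in range(100)]
counts = distribution(keys, 7)
print(counts, "min:", min(counts), "max:", max(counts))
```

A large gap between the fullest and emptiest groups is the signal that HASH.TEST would flag when trying alternative file types or moduli.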
Exercise Part III: Analysis
 Design, run, and monitor a job that creates
and loads a static hashed file
 Run FILE.STAT and HASH.TEST
 Modify the static file based on the analysis
 Rerun the job
 Design, run, and monitor a job that creates
and loads a dynamic hashed file
 Run ANALYZE.FILE to analyze a dynamic
file
Distributed Files
[Diagram: the distributed file STORES.USA consists of part files STORES.EAST, STORES.WEST, STORES.NORTH, and STORES.SOUTH, recorded in the &PARTFILES& field; the partitioning algorithm applied to a data record's key determines which part (1-4) receives it]
Why Create Distributed Files?
A way of physically partitioning data
E.g., Company orders can be physically
partitioned by division
A way of bringing disparate data together
E.g., Orders at several company divisions are
in separate hashed files
Define distributed file to query all orders
Can overcome size limitations on single
files
Can support faster lookups
Defining a Distributed File
Create part files
DEFINE.DF STORES.USA ADDING STORES.EAST 1
STORES.WEST 2 … [algorithm]
Algorithm
SYSTEM [s] Key is integer followed by separator
 E.g., 1-RECORDID
INTERNAL I-type expression
 “IF store_id[1,1] = ‘E’ THEN 1 ELSE IF store_id[1,1] = ‘W’ THEN 2 …”
 Deleting distributed files
 DEFINE.DF STORES.USA REMOVING STORES.EAST STORES.WEST
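The INTERNAL partitioning idea above, an expression over the key picks the part number, can be sketched like this. The part names follow the STORES example; the first-character mapping itself is illustrative.

```python
# Sketch of INTERNAL-style partitioning for a distributed file, mirroring
# the slide's IF store_id[1,1] = 'E' THEN 1 ELSE ... expression.

PARTS = {1: "STORES.EAST", 2: "STORES.WEST",
         3: "STORES.NORTH", 4: "STORES.SOUTH"}

def part_for(store_id: str) -> int:
    """Part number for a key; store_id[:1] plays store_id[1,1] in BASIC."""
    first = store_id[:1]
    return {"E": 1, "W": 2, "N": 3}.get(first, 4)

print(PARTS[part_for("E1023")])   # STORES.EAST
print(PARTS[part_for("S0007")])   # STORES.SOUTH
```

A query against STORES.USA can then be routed to exactly one part file for a keyed lookup, while an unkeyed query reads all parts.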
Exercise Part IV: Distributed Files
 Create a distributed file
 Lookup a value in a distributed file

Editor's Notes

  • #12 Since writes are done in group order, if multiple records are written to the same group, the number of physical disk I/O operations is reduced, because the writes are handled from data in the buffer cache. If the writes go to a pre-allocated hashed file, only one flush operation occurs. For releases 3.6.3 and later, multiple flushes will occur, depending on the DS_MMAPSIZE and DS_MAXWRITEHEAPS dsenv variable settings. Because data is written to the cache and then flushed to disk, you should not read from and write to the same hashed file within a single transform stage group. With release 3.6.3 it is possible to disable write caching either globally, by setting DS_MAXWRITEHEAPS=0 in the dsenv file, or for a single transformer process group, by executing ENV SET DS_MAXWRITEHEAPS=0 as a before-stage TCL command. It is also possible to allocate any number of heaps, either globally or per transformer process group. For all releases prior to 3.6.3, the UV stage can be used to write to hashed files for operations that read and write to the same file.
  • #16 If the file is small enough to fit within the available memory cache, the demands on the disk I/O subsystem will be reduced, resulting in improved performance.