(305) 4-2
Advanced DataStage
Module Objectives
After this module, you will be able to:
Explain what a hashed file is
Describe the types of hashed files
Define virtual (I-Descriptor) fields
Create different types of dynamic and static hashed files
Analyze hashed files using FILE.STAT and ANALYZE.FILE
Access hashed files using UniVerse commands, UV
stages, and hashed file stages
Create and access distributed files
What is a Hashed File?
Records are stored in groups
Each group is the same size, specified at creation
Number of groups is called “modulus”
Hashing algorithm applied to primary key
determines the group
Fast access
Preload into memory for fast reads
Enable write-caching for fast writes
Dynamic vs. static hashed files
Dynamic: Additional groups are added as needed
Static: Fixed number of groups, specified at creation
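The group-selection idea above can be sketched in Python. This is a conceptual illustration only; the hash function shown is made up and is not UniVerse's actual GENERAL or SEQ.NUM algorithm.

```python
# Conceptual sketch: a hashed file maps each primary key to one of
# "modulus" groups; records whose keys hash alike share a group.
MODULUS = 5  # number of groups (the "modulus"), fixed at creation for a static file

def group_for(key: str, modulus: int = MODULUS) -> int:
    # Illustrative hash: sum of character codes mod the group count.
    # UniVerse's real algorithms (GENERAL, SEQ.NUM, static types 2-17) differ.
    return sum(ord(c) for c in key) % modulus

groups = {g: [] for g in range(MODULUS)}
for key in ["1001", "1002", "1003", "2001"]:
    groups[group_for(key)].append(key)
```

Because the group is computed directly from the key, a lookup reads only that one group rather than scanning the whole file, which is what makes hashed-file access fast.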
Uses of Hashed Files
Provides fast access for lookups
Remote data can be loaded into locally
stored hashed files for better performance
Useful as a temporary or non-volatile
program storage area
Can incorporate pre-compiled column
derivation expressions (“virtual columns”)
Useful for normalizing UniVerse multi-
valued (nested) data
Types of Hashed Files
There are two kinds of hashed files:
Dynamic
Automatically increase or decrease the
number of groups as necessary
Static
Number of groups does not change
automatically
Overflow
When there isn't enough space left in a group for a record, the group overflows:
[Diagram: a hashed file header followed by Groups 1-5 at addresses 2048, 4096, 6144, 8192, and 10240; Group 2 has overflowed into an overflow buffer at address 12288.]
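The overflow behaviour can be sketched as follows. The capacity is a made-up record count for illustration; real groups are fixed-size disk buffers sized in bytes at creation.

```python
GROUP_SIZE = 3  # hypothetical capacity in records; real groups are byte-sized buffers

def write(groups: dict, overflow: dict, group_id: int, record: str) -> None:
    """Place a record in its group, spilling to an overflow buffer when the group is full."""
    if len(groups[group_id]) < GROUP_SIZE:
        groups[group_id].append(record)
    else:
        # group is full: the record lands in overflow space chained to the group
        overflow.setdefault(group_id, []).append(record)

groups = {1: [], 2: []}
overflow = {}
for rec in ["a", "b", "c", "d"]:
    write(groups, overflow, 2, rec)
# Group 2 now holds three records; "d" spilled into its overflow buffer.
```

Reads from an overflowed group must also scan the overflow chain, which is why heavy overflow degrades hashed-file performance.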
Hashing Algorithms
Dynamic
GENERAL or SEQ.NUM
Static
17 different choices
Based on variation pattern of primary key
values
Specified as "file type"
Static Hashed File Algorithms

  character type        | most variation in key occurs
                        | right | middle | left | any
  ----------------------+-------+--------+------+-----
  wholly numeric        |   2   |   6    |  10  |  14
  numeric & separators  |   3   |   7    |  11  |  15
  ASCII                 |   4   |   8    |  12  |  16
  any                   |   5   |   9    |  13  |  17
Hashed File Write Caching
Write records into a memory cache
Writes multiple records to disk from the cache
Reduces number of disk writes
By default, write caching is disabled
Do not read and write to the same hashed
file in a single transform stage group!
Cached record may not be found
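Why a cached record may not be found can be sketched with two dictionaries standing in for disk and cache (a simplified model, not DataStage's actual buffering):

```python
# Sketch of the read-after-write hazard when write caching is enabled
# and the same hashed file is read and written in one stage group.
disk = {}    # records already flushed to disk
cache = {}   # records written but not yet flushed

def cached_write(key: str, record: dict) -> None:
    cache[key] = record          # write goes to memory, not to disk

def read_from_disk(key: str):
    return disk.get(key)         # reference lookups see only flushed records

def flush() -> None:
    disk.update(cache)           # one bulk write replaces many small ones
    cache.clear()

cached_write("1001", {"qty": 5})
assert read_from_disk("1001") is None    # cached record is not found!
flush()
assert read_from_disk("1001") == {"qty": 5}
```

The same batching that reduces disk writes is what makes an uncommitted record invisible to a lookup against the same file.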
Specifying the Hashed File Key
Must specify a key
Composed of one or more columns, starting with the first column
Records written with the same key value
are overwritten
There can be only one record with the same
key value
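The destructive-write semantics behave like a dictionary keyed on the hash key, which can be sketched as:

```python
hashed_file = {}  # key value -> record, as in a hashed file

def write_record(key: str, record: dict) -> None:
    # Writing with an existing key value overwrites the old record;
    # there is never more than one record per key.
    hashed_file[key] = record

write_record("1001", {"name": "Smith"})
write_record("1001", {"name": "Jones"})
assert hashed_file["1001"] == {"name": "Jones"}   # first record was overwritten
```

This is also why hashed files are handy for finding duplicates: loading a stream into one leaves exactly one record per key.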
Using Character Fields as Keys
Hashed files preserve blanks in both data
and key columns if they are derived from a
char data type.
The TRIM routine can be used to remove
extra spaces when populating or referencing
the file.
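The blank-padding problem can be sketched with Python's `strip` standing in for DataStage's TRIM routine (sample key values are made up):

```python
# CHAR columns preserve trailing blanks, so a padded key from the source
# will not match an unpadded key already in the hashed file:
lookup = {"CA": "California"}            # hashed file keyed on trimmed codes
key_from_char_column = "CA   "           # blanks preserved from a CHAR column

assert lookup.get(key_from_char_column) is None      # lookup misses

# Trimming on both population and reference makes the keys comparable:
assert lookup.get(key_from_char_column.strip()) == "California"
```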
Exercise Part I: Working with Hashed Files
Create a hashed file in the current
DataStage project
Create a hashed file in a directory
Use a hashed file to find duplicates
Test write caching
I-Descriptors
Define virtual fields
Read-only
Expression defines the virtual field
Very similar to a BASIC expression
Expression is evaluated when the virtual field is
accessed
Must be compiled
Example: Bonus = min_lvl * 10
Uses
Boost performance
Create standard interface
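A virtual field evaluated on access maps naturally onto a read-only computed property; the sketch below mirrors the `Bonus = min_lvl * 10` example (the `Employee` class and field names are illustrative, not DataStage objects):

```python
# Sketch of an I-Descriptor (virtual) field: stored fields plus a
# read-only field whose expression is evaluated when it is accessed.
class Employee:
    def __init__(self, emp_id: str, min_lvl: int):
        self.emp_id = emp_id
        self.min_lvl = min_lvl       # stored field

    @property
    def bonus(self) -> int:
        # evaluated on access, cannot be assigned -- like Bonus = min_lvl * 10
        return self.min_lvl * 10

e = Employee("1001", 12)
assert e.bonus == 120
```

As with an I-Descriptor, the derivation lives with the data rather than being repeated in every job, giving a standard interface to the computed value.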
Creating I-Descriptor Fields
Create from DataStage command line
Use REVISE utility
Add record specifying field to hashed file
dictionary
Import metadata as a UniVerse table
Access using UV stage
Analysis Tools
FILE.STAT filename
For static hashed files
Statistics about record distribution in groups
Use HASH.HELP or HASH.HELP.DETAIL for
recommendations
Use HASH.TEST or HASH.TEST.DETAIL for testing
different algorithms (types), modulo, and group numbers
ANALYZE.FILE filename STATISTICS
For dynamic hashed files
Sizing, data distribution, number of groups, etc.
Exercise Part III: Analysis
Design, run, and monitor a job that creates
and loads a static hashed file
Run FILE.STAT and HASH.TEST
Modify the static file based on the analysis
Rerun the job
Design, run, and monitor a job that creates
and loads a dynamic hashed file
Run ANALYZE.FILE to analyze a dynamic
file
Distributed Files
[Diagram: the distributed file STORES.USA comprises the part files STORES.EAST, STORES.WEST, STORES.NORTH, and STORES.SOUTH (Parts 1-4, listed in &PARTFILES&). An algorithm applied to a data record's key determines the part that holds the record.]
Why Create Distributed Files?
A way of physically partitioning data
E.g., Company orders can be physically
partitioned by division
A way of bringing disparate data together
E.g., Orders at several company divisions are
in separate hashed files
Define distributed file to query all orders
Can overcome size limitations on single
files
Can support faster lookups
Defining a Distributed File
Create part files
DEFINE.DF STORES.USA ADDING STORES.EAST 1
STORES.WEST 2 … [algorithm]
Algorithm
SYSTEM [s] Key is integer followed by separator
E.g., 1-RECORDID
INTERNAL I-type expression
“IF store_id[1,1] = ‘E’ THEN 1 ELSE IF store_id[1,1] = ‘W’ THEN 2 …”
Deleting distributed files
DEFINE.DF STORES.USA REMOVING STORES.EAST STORES.WEST
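The INTERNAL partitioning expression above can be sketched in Python. Parts 3 and 4 for 'N' and 'S' are an assumption extrapolated from the slide's elided "…"; the expression itself shows only 'E' and 'W'.

```python
# Sketch of the INTERNAL I-type expression: route each store_id to a
# part number by its first character, mirroring
#   IF store_id[1,1] = 'E' THEN 1 ELSE IF store_id[1,1] = 'W' THEN 2 ...
def part_for(store_id: str) -> int:
    # 'N' -> 3 and 'S' -> 4 are assumed continuations of the pattern
    return {"E": 1, "W": 2, "N": 3, "S": 4}[store_id[0]]

assert part_for("E042") == 1   # routed to STORES.EAST
assert part_for("W017") == 2   # routed to STORES.WEST
```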
#12 Since writes are done in group order, writing multiple records to the same group reduces the number of physical disk I/O operations, because the writes are satisfied from data in the buffer cache.
If the writes are done to a pre-allocated hashed file, only one flush operation occurs. For releases 3.6.3 and later, multiple flushes may occur, depending on the DS_MMAPSIZE and DS_MAXWRITEHEAPS dsenv variable settings.
Because data is written to the cache and only later flushed to disk, you should not read from and write to the same hashed file within a single transform stage group.
With release 3.6.3 it is possible to disable write caching, either globally by setting DS_MAXWRITEHEAPS=0 in the dsenv file, or for one transformer process group by executing ENV SET DS_MAXWRITEHEAPS=0 as a before-stage TCL command. It is also possible to allocate any number of heaps, either globally or per transformer process group.
For all releases prior to 3.6.3, the UV stage can be used to write to hashed files for operations that read from and write to the same file.
#16 If the file is smaller than the memory available to cache it, the demands on the disk I/O subsystem will be reduced, resulting in improved performance.