1) The document discusses data discovery and extraction from source systems into a data warehouse. It covers identifying and documenting all relevant data sources, tracking changes, and analyzing data content and anomalies.
2) The key steps in data discovery are organizing data modeling sessions, ensuring all data points are collected, and documenting source system details. Tracking reports should maintain information on source systems.
3) Integrating heterogeneous data sources presents challenges in aligning data, designing conformed dimensions, and resolving collisions through survivor rules. Business rules must also be identified.
2. Building the logical data map
•Analyzing the source system has two steps:
1. Data discovery
2. Anomaly detection
3. Data discovery I
•Identify and examine the data sources
•Data modeling sessions should be organized to define the data
models and design the mapping details
•Not all sources may be covered in such sessions, so it is
important to ensure that all data points are collected, including
the external and supporting data points that can be used as
references.
•Documentation of the source systems, including details such as
purpose, current users, and frequency of updates, is important
•The data sources need to be tracked and kept in sync with
updates to them. Mechanisms to capture changes in the
sources should be understood well in advance and in depth
4. Data discovery II
•A source system tracking report should be
maintained. This should include
–Data mart into which the source feeds
–Interface name from the transaction application
–Common term used in business
–Priority of the data
–Purpose of the data
–Technical owner of the data (who generates it)
–Business owner of the data (who uses it)
–DBMS name
–Production system details where the data source
resides
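The tracking-report items above can be sketched as a simple record type. This is a minimal illustration, not a standard schema; all field and example values are assumptions.

```python
from dataclasses import dataclass, asdict

# Hypothetical record type; fields mirror the tracking-report items
# listed above (data mart, interface, owners, DBMS, and so on).
@dataclass
class SourceSystemRecord:
    data_mart: str          # data mart into which the source feeds
    interface_name: str     # interface name from the transaction application
    business_term: str      # common term used in business
    priority: int           # priority of the data
    purpose: str            # purpose of the data
    technical_owner: str    # who generates the data
    business_owner: str     # who uses the data
    dbms_name: str          # DBMS name
    production_system: str  # production system where the source resides

# Illustrative entry for one source feed
orders_src = SourceSystemRecord(
    data_mart="sales",
    interface_name="ORD_FEED_01",
    business_term="customer orders",
    priority=1,
    purpose="daily order facts",
    technical_owner="oms-team",
    business_owner="sales-analytics",
    dbms_name="Oracle",
    production_system="oms-prod-db01",
)
print(asdict(orders_src)["data_mart"])  # -> sales
```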
5. Data discovery III
•Track the system-of-record: the exact source where the
data originates. This helps to avoid duplication and
incompleteness in the data.
•Data that is derived (from one or more data
sources) should be tracked individually
•Analyze the source systems to understand their
content better. This is best tracked using ER diagrams,
which may require reverse engineering the
systems. Characteristics to consider here:
–Unique identifiers and keys
–Data types of all columns
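Keys and column data types can often be recovered from the catalog of the source database itself. A minimal sketch, assuming a SQLite source purely for illustration (real mainframe or Oracle sources would use their own catalog views):

```python
import sqlite3

# Build a tiny in-memory table to stand in for a source system
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT UNIQUE,
        created_at  TEXT
    )
""")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name}: {col_type}" + (" [primary key]" if pk else ""))
```

Output like this feeds directly into the ER diagram of the source: unique identifiers, keys, and the data types of all columns.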
6. Data content analysis and anomaly detection
•Handling of some common anomalies includes:
–NULL value
–Date fields
–Numeric fields
–Unique keys
This step also includes the collection of business
rules for the ETL process. These are much more
technical than other business rules in data
warehouse projects. The ETL architect is expected to
translate the user requirements into usable ETL
definitions
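A minimal sketch of content analysis over an extracted row set, checking the four anomaly classes listed above. The field names and sample rows are illustrative assumptions:

```python
from datetime import datetime

# Illustrative extract: one clean row, one with a bad date and a
# non-numeric amount, one with a NULL date and a duplicate key
rows = [
    {"order_id": 1, "order_date": "2024-01-15", "amount": "19.99"},
    {"order_id": 2, "order_date": "0000-00-00", "amount": "n/a"},
    {"order_id": 2, "order_date": None, "amount": "5.00"},
]

anomalies = []
seen_ids = set()
for i, row in enumerate(rows):
    # NULL values
    if row["order_date"] is None:
        anomalies.append((i, "NULL order_date"))
    else:
        # Date fields that do not parse
        try:
            datetime.strptime(row["order_date"], "%Y-%m-%d")
        except ValueError:
            anomalies.append((i, "bad date"))
    # Numeric fields that do not parse
    try:
        float(row["amount"])
    except ValueError:
        anomalies.append((i, "non-numeric amount"))
    # Unique key violations
    if row["order_id"] in seen_ids:
        anomalies.append((i, "duplicate key"))
    seen_ids.add(row["order_id"])

print(anomalies)
```

The decisions about what to do with each anomaly (reject, default, quarantine) are exactly the ETL business rules the architect must collect.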
7. Heterogeneous data sources I
•Challenges in integrating different data
sources
•Alignment of data points and KPIs
•Conformed dimensions are a cohesive design
that unifies disparate data systems scattered
across the enterprise
•Data sources should be identified during data
profiling; the fact and dimension tables in the
data warehouse should also be identified
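The conformed-dimension idea above can be sketched as follows: two source systems describe the same customers under different natural keys, and a single dimension assigns one surrogate key per customer. The email-based matching rule and all names here are illustrative assumptions, not a general solution:

```python
# Two sources with different natural keys for the same customer
crm_rows = [{"crm_id": "C-9", "email": "a@x.com", "name": "Ann"}]
shop_rows = [{"shop_id": 42, "email": "a@x.com", "country": "DE"}]

conformed = {}  # email -> conformed dimension row
next_key = 1
for row in crm_rows + shop_rows:
    email = row["email"]
    if email not in conformed:
        # assign one surrogate key shared by all sources
        conformed[email] = {"customer_key": next_key}
        next_key += 1
    # merge attributes, dropping the source-specific natural keys
    conformed[email].update({k: v for k, v in row.items()
                             if k not in ("crm_id", "shop_id")})

print(conformed["a@x.com"])
```

Fact tables from both systems can now join to the same `customer_key`, which is what makes KPIs comparable across sources.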
8. Heterogeneous data sources II
•Understanding the source system is essential to
be able to integrate multiple systems together
•Matching algorithms for joining data from
multiple sources
•If there is a collision in the ETL process, survivor
rules must be defined to resolve it. These
should be noted after the system-of-record is
created
•Business rules must be identified
•Load the conformed dimensions
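The collision-resolution step above can be sketched with a simple precedence rule: when sources supply conflicting values for the same attribute, the source ranked closest to the system-of-record wins, and lower-ranked sources only fill in gaps. Source names and ranks are illustrative assumptions:

```python
SOURCE_RANK = {"erp": 1, "crm": 2}  # erp is the system-of-record

def survive(records):
    """Merge colliding records field by field; best-ranked source wins."""
    merged = {}
    # Visit worst-ranked sources first so better sources overwrite them
    for rec in sorted(records, key=lambda r: SOURCE_RANK[r["source"]],
                      reverse=True):
        merged.update({k: v for k, v in rec.items() if k != "source"})
    return merged

# Two sources disagree on phone; only crm knows the tier
collision = [
    {"source": "crm", "customer_id": 7, "phone": "555-0100", "tier": "gold"},
    {"source": "erp", "customer_id": 7, "phone": "555-0199"},
]
print(survive(collision))  # erp's phone survives, crm contributes tier
```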
9. Handling multiple data source platforms - challenges
•Most commonly used connection is via ODBC
(open database connectivity)
•Mainframe sources present a different set of
integration issues due to their specialized
hardware architecture
–Most of the legacy code on mainframes is in
COBOL
–EBCDIC character sets need to be converted to
ASCII as required
–Data transfer across nodes and platforms over
network
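The EBCDIC-to-ASCII conversion above can be sketched with Python's standard codecs, which include several EBCDIC code pages such as `cp500`; the exact code page depends on the mainframe's locale, and the sample record here is an assumption:

```python
# Simulate a mainframe record by encoding text as EBCDIC (cp500)
ebcdic_bytes = "HELLO".encode("cp500")
print(ebcdic_bytes)                 # raw bytes are not ASCII values

# Decode EBCDIC into a Python string, then re-encode as ASCII
text = ebcdic_bytes.decode("cp500")
print(text.encode("ascii"))        # -> b'HELLO'
```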
10. Tracking the data changes
•Detect the changes happening on the source
•Pull versus push approach for tracking the
changes
•Sniffing the intermediate logs
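The pull approach above can be sketched as periodically re-reading the source and comparing per-row checksums against the last snapshot; row shapes and keys here are illustrative assumptions:

```python
import hashlib

def row_hash(row):
    """Stable checksum of a row's contents (order-independent)."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

# Snapshot of hashes taken on the previous pull, keyed by id
previous = {1: row_hash({"id": 1, "qty": 5}),
            2: row_hash({"id": 2, "qty": 3})}

current_rows = [{"id": 1, "qty": 5},   # unchanged
                {"id": 2, "qty": 4},   # updated
                {"id": 3, "qty": 9}]   # inserted

changes = []
for row in current_rows:
    if row["id"] not in previous:
        changes.append(("insert", row["id"]))
    elif previous[row["id"]] != row_hash(row):
        changes.append(("update", row["id"]))

print(changes)  # -> [('update', 2), ('insert', 3)]
```

A push approach would instead have the source publish changes (triggers, message queues), and log sniffing reads the DBMS's own transaction logs; the pull sketch trades timeliness for simplicity.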