1) The document discusses data discovery and extraction from source systems into a data warehouse. It covers identifying and documenting all relevant data sources, tracking changes, and analyzing data content and anomalies.
2) The key steps in data discovery are organizing data modeling sessions, ensuring all data points are collected, and documenting source system details. Tracking reports should maintain information on source systems.
3) Integrating heterogeneous data sources presents challenges in aligning data, designing conformed dimensions, and resolving collisions through survivor rules. Business rules must also be identified.
2. Building the logical data map
•Analyzing the source system has two steps:
1. Data discovery
2. Anomaly detection
3. Data discovery I
•Identify and examine the data sources
•Data modeling sessions should be organized to define the data
models and design the mapping details
•Not all sources may be covered in such sessions, so it is
important to ensure that all data points are collected, including
the external and supporting data points that can be used as
references.
•Documentation of the source systems, including details such as
purpose, current users, and frequency of updates, is important
•The data sources need to be tracked and kept in sync with
updates to them. Mechanisms to capture changes in the
sources should be understood well in advance and in depth
4. Data discovery II
•A source system tracking report should be
maintained. This should include
–Data mart into which the source feeds
–Interface name from the transaction application
–Common term used in business
–Priority of the data
–Purpose of the data
–Technical owner of the data (who generates it)
–Business owner of the data (who uses it)
–DBMS name
–Production system details where the data source
resides
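The tracking-report items above can be sketched as a simple record type. This is a minimal illustration, not a standard schema; all field and example values are assumptions.

```python
from dataclasses import dataclass, asdict

# Hypothetical record type; fields mirror the tracking-report items
# listed above (data mart, interface, owners, DBMS, and so on).
@dataclass
class SourceSystemRecord:
    data_mart: str          # data mart into which the source feeds
    interface_name: str     # interface name from the transaction application
    business_term: str      # common term used in business
    priority: int           # priority of the data
    purpose: str            # purpose of the data
    technical_owner: str    # who generates the data
    business_owner: str     # who uses the data
    dbms_name: str          # DBMS name
    production_system: str  # production system where the source resides

# Illustrative entry for one source feed
orders_src = SourceSystemRecord(
    data_mart="sales",
    interface_name="ORD_FEED_01",
    business_term="customer orders",
    priority=1,
    purpose="daily order facts",
    technical_owner="oms-team",
    business_owner="sales-analytics",
    dbms_name="Oracle",
    production_system="oms-prod-db01",
)
print(asdict(orders_src)["data_mart"])  # -> sales
```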
5. Data discovery III
•Track the system-of-record: the exact source where the
data originates. This helps to avoid duplication and
incompleteness in the data.
•Data that is derived (from one or more data
sources) should be tracked individually
•Analyze the source systems to understand their
content better. This is best tracked using ER diagrams,
which may require reverse engineering the
systems. Characteristics to consider here:
–Unique identifiers and keys
–Data types of all columns
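Keys and column data types can often be recovered from the catalog of the source database itself. A minimal sketch, assuming a SQLite source purely for illustration (real mainframe or Oracle sources would use their own catalog views):

```python
import sqlite3

# Build a tiny in-memory table to stand in for a source system
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        email       TEXT UNIQUE,
        created_at  TEXT
    )
""")

# PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk)
columns = conn.execute("PRAGMA table_info(customer)").fetchall()
for cid, name, col_type, notnull, default, pk in columns:
    print(f"{name}: {col_type}" + (" [primary key]" if pk else ""))
```

Output like this feeds directly into the ER diagram of the source: unique identifiers, keys, and the data types of all columns.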
6. Data content analysis and anomaly detection
•Handling of some common anomalies includes:
–NULL value
–Date fields
–Numeric fields
–Unique keys
This step also includes the collection of business
rules for the ETL process. These are much more
technical than other business rules in data
warehouse projects. The ETL architect is expected to
translate the user requirements into usable ETL
definitions
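A minimal sketch of content analysis over an extracted row set, checking the four anomaly classes listed above. The field names and sample rows are illustrative assumptions:

```python
from datetime import datetime

# Illustrative extract: one clean row, one with a bad date and a
# non-numeric amount, one with a NULL date and a duplicate key
rows = [
    {"order_id": 1, "order_date": "2024-01-15", "amount": "19.99"},
    {"order_id": 2, "order_date": "0000-00-00", "amount": "n/a"},
    {"order_id": 2, "order_date": None, "amount": "5.00"},
]

anomalies = []
seen_ids = set()
for i, row in enumerate(rows):
    # NULL values
    if row["order_date"] is None:
        anomalies.append((i, "NULL order_date"))
    else:
        # Date fields that do not parse
        try:
            datetime.strptime(row["order_date"], "%Y-%m-%d")
        except ValueError:
            anomalies.append((i, "bad date"))
    # Numeric fields that do not parse
    try:
        float(row["amount"])
    except ValueError:
        anomalies.append((i, "non-numeric amount"))
    # Unique key violations
    if row["order_id"] in seen_ids:
        anomalies.append((i, "duplicate key"))
    seen_ids.add(row["order_id"])

print(anomalies)
```

The decisions about what to do with each anomaly (reject, default, quarantine) are exactly the ETL business rules the architect must collect.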
7. Heterogeneous data sources I
•Challenges in integrating different data
sources
•Alignment of data points and KPIs
•Conformed dimensions are a cohesive design
that unifies disparate data systems scattered
across the enterprise
•Data sources should be identified during data
profiling; the fact and dimension tables in the
data warehouse should also be identified
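The conformed-dimension idea above can be sketched as follows: two source systems describe the same customers under different natural keys, and a single dimension assigns one surrogate key per customer. The email-based matching rule and all names here are illustrative assumptions, not a general solution:

```python
# Two sources with different natural keys for the same customer
crm_rows = [{"crm_id": "C-9", "email": "a@x.com", "name": "Ann"}]
shop_rows = [{"shop_id": 42, "email": "a@x.com", "country": "DE"}]

conformed = {}  # email -> conformed dimension row
next_key = 1
for row in crm_rows + shop_rows:
    email = row["email"]
    if email not in conformed:
        # assign one surrogate key shared by all sources
        conformed[email] = {"customer_key": next_key}
        next_key += 1
    # merge attributes, dropping the source-specific natural keys
    conformed[email].update({k: v for k, v in row.items()
                             if k not in ("crm_id", "shop_id")})

print(conformed["a@x.com"])
```

Fact tables from both systems can now join to the same `customer_key`, which is what makes KPIs comparable across sources.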
8. Heterogeneous data sources II
•Understanding the source system is essential to
be able to integrate multiple systems together
•Matching algorithms for joining data from
multiple sources
•If there is a collision in the ETL process, survivor
rules must be defined to resolve it. These
should be noted after the system-of-record is
created
•Business rules must be identified
•Load the conformed dimensions
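The collision-resolution step above can be sketched with a simple precedence rule: when sources supply conflicting values for the same attribute, the source ranked closest to the system-of-record wins, and lower-ranked sources only fill in gaps. Source names and ranks are illustrative assumptions:

```python
SOURCE_RANK = {"erp": 1, "crm": 2}  # erp is the system-of-record

def survive(records):
    """Merge colliding records field by field; best-ranked source wins."""
    merged = {}
    # Visit worst-ranked sources first so better sources overwrite them
    for rec in sorted(records, key=lambda r: SOURCE_RANK[r["source"]],
                      reverse=True):
        merged.update({k: v for k, v in rec.items() if k != "source"})
    return merged

# Two sources disagree on phone; only crm knows the tier
collision = [
    {"source": "crm", "customer_id": 7, "phone": "555-0100", "tier": "gold"},
    {"source": "erp", "customer_id": 7, "phone": "555-0199"},
]
print(survive(collision))  # erp's phone survives, crm contributes tier
```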
9. Handling multiple data source platforms - challenges
•Most commonly used connection is via ODBC
(open database connectivity)
•Mainframe sources present a different set of
integration issues due to their specialized
hardware architecture
–Most of the legacy code on mainframes is in
COBOL
–EBCDIC character sets need to be converted to
ASCII as required
–Data transfer across nodes and platforms over
network
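The EBCDIC-to-ASCII conversion above can be sketched with Python's standard codecs, which include several EBCDIC code pages such as `cp500`; the exact code page depends on the mainframe's locale, and the sample record here is an assumption:

```python
# Simulate a mainframe record by encoding text as EBCDIC (cp500)
ebcdic_bytes = "HELLO".encode("cp500")
print(ebcdic_bytes)                 # raw bytes are not ASCII values

# Decode EBCDIC into a Python string, then re-encode as ASCII
text = ebcdic_bytes.decode("cp500")
print(text.encode("ascii"))        # -> b'HELLO'
```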
10. Tracking the data changes
•Detect the changes happening on the source
•Pull versus push approach for tracking the
changes
•Sniffing the intermediate logs
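The pull approach above can be sketched as periodically re-reading the source and comparing per-row checksums against the last snapshot; row shapes and keys here are illustrative assumptions:

```python
import hashlib

def row_hash(row):
    """Stable checksum of a row's contents (order-independent)."""
    return hashlib.sha256(repr(sorted(row.items())).encode()).hexdigest()

# Snapshot of hashes taken on the previous pull, keyed by id
previous = {1: row_hash({"id": 1, "qty": 5}),
            2: row_hash({"id": 2, "qty": 3})}

current_rows = [{"id": 1, "qty": 5},   # unchanged
                {"id": 2, "qty": 4},   # updated
                {"id": 3, "qty": 9}]   # inserted

changes = []
for row in current_rows:
    if row["id"] not in previous:
        changes.append(("insert", row["id"]))
    elif previous[row["id"]] != row_hash(row):
        changes.append(("update", row["id"]))

print(changes)  # -> [('update', 2), ('insert', 3)]
```

A push approach would instead have the source publish changes (triggers, message queues), and log sniffing reads the DBMS's own transaction logs; the pull sketch trades timeliness for simplicity.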