Data Warehouse and Business Intelligence - Recipe 3
1. Recipes of Data Warehouse and Business Intelligence
How to check the Staging Area Loading
2. The Micro ETL Foundation
• The Micro ETL Foundation is a set of ideas and solutions for Data Warehouse and Business Intelligence projects in an Oracle environment.
• It doesn't use expensive ETL tools, only your intelligence and your ability to think, configure, build and load data using the features and the programming language of your RDBMS.
• This recipe is another easy example, based on the slides of Recipes 1 and 2 of Data Warehouse and Business Intelligence.
• You can reproduce this example by copying the content of the following slides into your editor and running it with your SQL interface utility.
• The solution presented here is the check of the Staging Area loading.
3. The load of data file
• Configure and load the source data file according to the slides of «Recipes 2 of Data Warehouse and Business Intelligence».
• Copy the SQL statements into a file. Run the script and you will load a Staging Table with a «click».
• Now we will see how to verify the load process.
• The source data file is the following:
EMPLOYEE_ID  FIRST_NAME  LAST_NAME    EMAIL     PHONE_NUMBER   HIRE_DATE   JOB_ID    SALARY  COMMISSION_PCT  MANAGER_ID  DEPARTMENT_ID
117          Sigal       Tobias       STOBIAS   5.151.274.564  24/07/2005  PU_CLERK  2800                    114         30
118          Guy         Himuro       GHIMURO   5.151.274.565  15/11/2006  PU_CLERK  2600                    114         30
119          Karen       Colmenares   KCOLMENA  5.151.274.566  10/08/2007  PU_CLERK  2500                    114         30
120          Matthew     Weiss        MWEISS    6.501.231.234  18/07/2004  ST_MAN    8000                    100         50
121          Adam        Fripp        AFRIPP    6.501.232.234  10/04/2005  ST_MAN    8200                    100         50
122          Payam       Kaufling     PKAUFLIN  6.501.233.234  01/05/2003  ST_MAN    7900                    100         50
123          Shanta      Vollman      SVOLLMAN  6.501.234.234  10/10/2005  ST_MAN    6500                    100         50
124          Kevin       Mourgos      KMOURGOS  6.501.235.234  16/11/2007  ST_MAN    5800                    100         50
125          Julia       Nayer        JNAYER    6.501.241.214  16/07/2005  ST_CLERK  3200                    120         50
126          Irene       Mikkilineni  IMIKKILI  6.501.241.224  28/09/2006  ST_CLERK  2700                    120         50
4. The load process
• The objects involved in the process are shown in the next figure.
[Figure: the load process, with elements numbered 1 to 5. Objects shown: File System (Source Data File, Row File); Row External Table (RXT); Configuration External Table (CXT); Configuration External View (CXV); File Definition Table (CFT); Source External Table (FXT); Source External View (FXV); Staging Table (STT).]
5. What to check
• At the end of the load, we need to check that everything went well, that is, that the number of rows in the Staging table is correct. To be sure of this, we must show that the following five counts are all exactly the same:
1. The number of rows declared in the .row file
2. The number of rows in the source data file
3. The number of rows in the external table that refers to the data file
4. The number of rows of the view built on the external table
5. The number of rows of the staging table
• Now let's see what we need.
6. The detail check table
• Build a check table to contain the results of the checks.
• IO_COD is the same as in the configuration table created in «Recipes 2».
• SEQ_NUM is a global sequential number taken from an Oracle sequence.
• SOURCE_COD is the name of the data file.
• SORT_NUM is a sort number within the IO_COD.
• CHECK_DET is a description of the check.
• N1_VAL is the row count.
• STAMP_DTS is the SYSDATE at the time of the check.
DROP TABLE STA_CHK_LOT;
CREATE TABLE STA_CHK_LOT
(
IO_COD VARCHAR2(12) NOT NULL,
SEQ_NUM NUMBER NOT NULL,
SOURCE_COD VARCHAR2(80) NOT NULL, -- wide enough for SUBSTR(LOCATION,1,80)
SORT_NUM NUMBER NOT NULL,
CHECK_DET VARCHAR2(600) NOT NULL,
N1_VAL NUMBER NOT NULL,
STAMP_DTS DATE NOT NULL
);
DROP SEQUENCE STA_CHK_SEQ;
CREATE SEQUENCE STA_CHK_SEQ
START WITH 1 INCREMENT BY 1;
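The plain DROP statements above raise an error the first time the script runs, because the objects do not exist yet. A minimal sketch of a re-runnable alternative, assuming you prefer to ignore only the "does not exist" errors (ORA-00942 for tables, ORA-02289 for sequences) and let any other error surface:

```sql
-- Sketch: drop the detail check table only if it exists
BEGIN
  EXECUTE IMMEDIATE 'DROP TABLE STA_CHK_LOT';
EXCEPTION
  WHEN OTHERS THEN
    -- ORA-00942: table or view does not exist
    IF SQLCODE != -942 THEN
      RAISE;
    END IF;
END;
/
-- Sketch: drop the sequence only if it exists
BEGIN
  EXECUTE IMMEDIATE 'DROP SEQUENCE STA_CHK_SEQ';
EXCEPTION
  WHEN OTHERS THEN
    -- ORA-02289: sequence does not exist
    IF SQLCODE != -2289 THEN
      RAISE;
    END IF;
END;
/
```

These two blocks would replace the two DROP statements; the CREATE statements stay as they are.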
7. The summary check table
• Build a summary check table to hold, in a single row, the results stored in the previous table.
• IO_COD is the same as in the configuration table created in «Recipes 2».
• *_CNT are the row counts obtained from the 5 checks shown in slide 4.
• RET_COD will be the final result (OK or NOT OK).
• STAMP_DTS is the SYSDATE at the time of the check.
DROP TABLE STA_IO_LOT;
CREATE TABLE STA_IO_LOT
(
IO_COD VARCHAR2(12) NOT NULL,
SOURCE_COD VARCHAR2(80) NOT NULL,
DEC_CNT NUMBER,
FIL_CNT NUMBER,
FXT_CNT NUMBER,
FXV_CNT NUMBER,
STT_CNT NUMBER,
RET_COD VARCHAR2(30),
STAMP_DTS DATE
);
8. The count rows function
• At this point I need to write some PL/SQL code. You could also write it in Java or another programming language.
• This function counts the number of lines in the source data file.
• It has 2 parameters: the folder (an Oracle directory) and the file name.
• That is all. Now we can load the two check tables.
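CREATE OR REPLACE FUNCTION F_COUNT_FILE_ROWS(
P_DIR VARCHAR2
,P_FILE_NAME VARCHAR2
) RETURN NUMBER IS
V_F UTL_FILE.FILE_TYPE;
V_COUNT NUMBER := 0;
V_LINE VARCHAR2(32767);
BEGIN
-- Open the file in read mode; 32767 is the largest line size UTL_FILE accepts
V_F := UTL_FILE.FOPEN(P_DIR, P_FILE_NAME, 'R', 32767);
LOOP
-- GET_LINE raises NO_DATA_FOUND at end of file, which ends the loop
UTL_FILE.GET_LINE(V_F, V_LINE);
V_COUNT := V_COUNT+1;
END LOOP;
EXCEPTION
WHEN NO_DATA_FOUND THEN
-- End of file reached: close the file and return the number of lines read
UTL_FILE.FCLOSE(V_F);
RETURN V_COUNT;
WHEN OTHERS THEN
-- Do not leave the file handle open on unexpected errors
IF UTL_FILE.IS_OPEN(V_F) THEN UTL_FILE.FCLOSE(V_F); END IF;
RAISE;
END;
/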
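Before using the function in the check statements, it can be tried on its own. The following is only a sketch: 'STA_BCK' is the Oracle directory name used in the later slides, while 'employees1.csv' is a hypothetical file name chosen for illustration; replace it with the real name of your source data file.

```sql
-- Sketch: count the lines of one file in the STA_BCK directory
-- ('employees1.csv' is a hypothetical file name)
SELECT F_COUNT_FILE_ROWS('STA_BCK', 'employees1.csv') AS FILE_ROWS
FROM DUAL;
```

The function counts every line it reads, so any header or footer lines in the file are included in the result.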
9. The declared rows
• Insert this number with the following SQL statement.
• It uses the Oracle dictionary to find the file name.
• It uses the source external view to get the number.
INSERT INTO STA_CHK_LOT (
IO_COD,SEQ_NUM,SOURCE_COD,SORT_NUM,CHECK_DET,N1_VAL,STAMP_DTS)
VALUES ('employees1'
,STA_CHK_SEQ.NEXTVAL
,(SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
,1
,'DECLARED'
,(SELECT NVL(MAX(ROWS_NUM),0) FROM STA_EMPLOYEES1_FXV)
,SYSDATE
);
10. The file rows
• Insert this number with the following SQL statement.
• It uses the Oracle dictionary to find the file name.
• It uses the function to calculate the number.
INSERT INTO STA_CHK_LOT (
IO_COD,SEQ_NUM,SOURCE_COD,SORT_NUM,CHECK_DET,N1_VAL,STAMP_DTS)
VALUES ('employees1'
,STA_CHK_SEQ.NEXTVAL
,(SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
,2
,'FILE'
,NVL(F_COUNT_FILE_ROWS('STA_BCK', (SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')),0)
,SYSDATE
);
11. The external table rows
• Insert this number with the following SQL statement.
• It uses the Oracle dictionary to find the file name.
• It uses the external table to calculate the number.
INSERT INTO STA_CHK_LOT (
IO_COD,SEQ_NUM,SOURCE_COD,SORT_NUM,CHECK_DET,N1_VAL,STAMP_DTS)
VALUES ('employees1'
,STA_CHK_SEQ.NEXTVAL
,(SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
,3
,'EXTERNAL TABLE (STA_EMPLOYEES1_FXT)'
,(SELECT NVL(COUNT(*),0)
FROM STA_EMPLOYEES1_FXT)
,SYSDATE
);
12. The external view rows
• Insert this number with the following SQL statement.
• It uses the Oracle dictionary to find the file name.
• It uses the external view and the configuration table to calculate the number.
INSERT INTO STA_CHK_LOT (
IO_COD,SEQ_NUM,SOURCE_COD,SORT_NUM,CHECK_DET,N1_VAL,STAMP_DTS)
VALUES ('employees1'
,STA_CHK_SEQ.NEXTVAL
,(SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
,4
,'EXTERNAL VIEW (STA_EMPLOYEES1_FXV)'
,(SELECT NVL(COUNT(*),0) FROM STA_EMPLOYEES1_FXV)
+(SELECT HEAD_CNT+FOO_CNT FROM STA_IO_CFT WHERE IO_COD = 'employees1')
,SYSDATE
);
13. The staging table rows
• Insert this number with the following SQL statement.
• It uses the Oracle dictionary to find the file name.
• It uses the staging table and the configuration table to calculate the number.
INSERT INTO STA_CHK_LOT (
IO_COD,SEQ_NUM,SOURCE_COD,SORT_NUM,CHECK_DET,N1_VAL,STAMP_DTS)
VALUES ('employees1'
,STA_CHK_SEQ.NEXTVAL
,(SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS
WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
,5
,'STAGING TABLE (STA_EMPLOYEES1_STT)'
,(SELECT NVL(COUNT(*),0) FROM STA_EMPLOYEES1_STT)
+(SELECT HEAD_CNT+FOO_CNT FROM STA_IO_CFT WHERE IO_COD = 'employees1')
,SYSDATE
);
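The last two checks add HEAD_CNT+FOO_CNT to the counted rows so that the result can be compared with the number of lines in the file. The query below is a sketch that shows the parts of that sum side by side. It assumes, from the column names only, that HEAD_CNT and FOO_CNT hold the number of header and footer lines configured for the file in STA_IO_CFT.

```sql
-- Sketch: show how the comparable row count of the staging table is built
-- (assumption: HEAD_CNT = header lines, FOO_CNT = footer lines)
SELECT (SELECT COUNT(*) FROM STA_EMPLOYEES1_STT) AS DATA_ROWS
      ,HEAD_CNT
      ,FOO_CNT
      ,(SELECT COUNT(*) FROM STA_EMPLOYEES1_STT) + HEAD_CNT + FOO_CNT AS COMPARABLE_ROWS
FROM STA_IO_CFT
WHERE IO_COD = 'employees1';
```

COMPARABLE_ROWS is the same value that the statement above stores in N1_VAL for check number 5.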
14. The summary check
• Insert the summary check with the following SQL statement.
• It uses the detail table.
• It uses the PIVOT clause, available from Oracle 11g (but you can use something else).
INSERT INTO STA_IO_LOT (
IO_COD, SOURCE_COD, DEC_CNT, FIL_CNT, FXT_CNT,
FXV_CNT, STT_CNT,RET_COD, STAMP_DTS)
SELECT IO_COD, SOURCE_COD, DEC_CNT,FIL_CNT, FXT_CNT, FXV_CNT, STT_CNT
,(CASE WHEN (DEC_CNT=FIL_CNT
AND FIL_CNT=FXT_CNT
AND FXT_CNT=FXV_CNT
AND FXV_CNT=STT_CNT) THEN 'OK' ELSE 'NOT OK' END)
,SYSDATE
FROM (SELECT IO_COD,SOURCE_COD,SORT_NUM,N1_VAL
FROM STA_CHK_LOT
WHERE SOURCE_COD = (SELECT SUBSTR(LOCATION,1,80)
FROM USER_EXTERNAL_LOCATIONS WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT'))
PIVOT (
SUM(N1_VAL)
FOR SORT_NUM IN (
1 AS DEC_CNT,
2 AS FIL_CNT,
3 AS FXT_CNT,
4 AS FXV_CNT,
5 AS STT_CNT)
);
COMMIT;
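If the PIVOT clause is not available, one possible "something else" is conditional aggregation. The following sketch is written as a plain SELECT so that it can be run without inserting a second summary row; it is meant to return the same five counters and the same OK / NOT OK result as the PIVOT version, using only the tables and columns defined in the previous slides.

```sql
-- Sketch: same summary as the PIVOT query, built with CASE and GROUP BY
SELECT IO_COD, SOURCE_COD, DEC_CNT, FIL_CNT, FXT_CNT, FXV_CNT, STT_CNT
      ,(CASE WHEN (DEC_CNT=FIL_CNT
               AND FIL_CNT=FXT_CNT
               AND FXT_CNT=FXV_CNT
               AND FXV_CNT=STT_CNT) THEN 'OK' ELSE 'NOT OK' END) AS RET_COD
FROM (SELECT IO_COD
            ,SOURCE_COD
            ,SUM(CASE WHEN SORT_NUM = 1 THEN N1_VAL END) AS DEC_CNT
            ,SUM(CASE WHEN SORT_NUM = 2 THEN N1_VAL END) AS FIL_CNT
            ,SUM(CASE WHEN SORT_NUM = 3 THEN N1_VAL END) AS FXT_CNT
            ,SUM(CASE WHEN SORT_NUM = 4 THEN N1_VAL END) AS FXV_CNT
            ,SUM(CASE WHEN SORT_NUM = 5 THEN N1_VAL END) AS STT_CNT
      FROM STA_CHK_LOT
      WHERE SOURCE_COD = (SELECT SUBSTR(LOCATION,1,80)
                          FROM USER_EXTERNAL_LOCATIONS
                          WHERE TABLE_NAME = 'STA_EMPLOYEES1_FXT')
      GROUP BY IO_COD, SOURCE_COD);
```

If one of the five checks is missing, its counter is NULL, the comparison is not true, and the result is 'NOT OK'.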
15. Conclusion
We are at the end of this recipe. The two check tables now contain the final results of the checks.
With only two log tables, a function and a few SQL statements, we have checked the loading of a Staging Area table without ETL tools.
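To look at the content of the two check tables, queries such as the following could be used. This is only a sketch: the filter on 'employees1' and the column lists follow the tables and values used in the previous slides.

```sql
-- Sketch: the five detail checks, in the order they were inserted
SELECT SORT_NUM, CHECK_DET, N1_VAL, STAMP_DTS
FROM STA_CHK_LOT
WHERE IO_COD = 'employees1'
ORDER BY SEQ_NUM;

-- Sketch: the summary row with the five counters and the final result
SELECT IO_COD, SOURCE_COD, DEC_CNT, FIL_CNT, FXT_CNT, FXV_CNT, STT_CNT, RET_COD
FROM STA_IO_LOT
WHERE IO_COD = 'employees1';
```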
This is the philosophy of Micro ETL Foundation.
Email - massimo_cenci@yahoo.it
Blog (italian/english) - http://massimocenci.blogspot.it/