Recipe 5 of Data Warehouse and Business Intelligence - NULL values management in the ETL process
1. Recipes of Data Warehouse and Business Intelligence
The NULL values management in the ETL process
2. The NULL management
• In the Data Warehouse community, the presence or absence of NULL values has always been a subject of conflicting opinions.
• It is remarkable how seemingly insignificant details can affect the loading process and/or the result of the extracted information, whether it is queried manually or through Business Intelligence tools.
• Topics such as NULL management tend to be dismissed as technical details, stuff for programmers, easy to neglect among the many other complexities involved in the development of a Data Warehouse project.
• Unfortunately, in a Data Warehouse nothing, absolutely nothing, can be overlooked. Each of its components is linked to the others and always has consequences on the final result.
• This means being aware of the problems that may arise in the future and addressing them now, before it is too late. Do not forget that in a Data Warehouse, going back because of a wrong choice, or even worse an ignored one, can be very painful.
• The management of NULL, to put it in technical language, or the management of the absence of information, to put it in logical language, is exactly one of these topics.
3. The meaning of NULL
• In a relational database, and therefore in the majority of the databases underlying Data Warehouse & Business Intelligence solutions, a NULL value in a field of a table means the lack of information: it is not a value, but the absence of a value.
• This does not mean that it is a mistake, although it may be the result of a problem in the feeding system. Often it is simply not possible to associate a value.
• Consider a loan agreement. Among its various attributes there is the closing date of the contract. Obviously this field remains NULL, since it is information that will only exist in the future, at the time of closing. For the moment it will be NULL.
• Even in the domain of numerical values, the presence of NULL has a precise meaning, which is different from the value 0 (zero). Think of the list of commissions a customer pays to a bank. A value of 0 may mean that the customer, perhaps because of a special agreement, pays nothing for a given commission, but that commission is still part of the contract. A NULL value may mean that the commission does not apply at all, because the customer does not have that contract.
• So the presence of a NULL value can have many meanings.
4. The NULL problem
• Beyond the intrinsic meaning of NULL values, what are the consequences for the Data Warehouse? The problems occur at data extraction time. Let's see two examples.
• Example 1
Suppose you have a list of contracts, each with its own expiration date. For simplicity, we simulate three contracts on the fly using the SQL "WITH" clause to build a table with three rows.
The first row represents a contract that expired two days ago, the second a contract that we already know will expire in 5 days, the third a contract that has not expired (NULL). The request (or report) is to extract all contracts that do not expire in the next 10 days.
The SQL is conceptually very simple: just select all contracts whose expiration date is greater than today + 10. There should be exactly one. Unfortunately, the NULL produces an incorrect result: 0 rows.
SQL> with tab as (
  2  select 'C1' contr, sysdate-2 data_scad from dual
  3  union all
  4  select 'C2' contr, sysdate+5 data_scad from dual
  5  union all
  6  select 'C3' contr, null data_scad from dual)
  7  select * from tab
  8  where data_scad > sysdate+10;

no rows selected
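The same trap can be reproduced outside Oracle. Below is a minimal sketch using Python's sqlite3, which follows the same three-valued logic for comparisons; the table, the contract codes and the literal "today + 10" horizon are illustrative, and dates are stored as numeric YYYYMMDD as the recipe later recommends.

```python
import sqlite3

# Hypothetical table mirroring Example 1: three contracts, expiration
# stored as numeric YYYYMMDD; the third contract has no expiration (NULL).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tab (contr TEXT, data_scad INTEGER)")
con.executemany("INSERT INTO tab VALUES (?, ?)",
                [("C1", 20140201), ("C2", 20140208), ("C3", None)])

horizon = 20140213  # illustrative stand-in for sysdate+10

# Any comparison with NULL evaluates to UNKNOWN, so the NULL row is
# silently dropped: we get 0 rows, the same wrong result as in Oracle.
rows = con.execute("SELECT contr FROM tab WHERE data_scad > ?",
                   (horizon,)).fetchall()
print(rows)  # -> []

# Replacing NULL with the maximum day 99991231 gives the expected answer.
rows = con.execute(
    "SELECT contr FROM tab WHERE COALESCE(data_scad, 99991231) > ?",
    (horizon,)).fetchall()
print(rows)  # -> [('C3',)]
```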
5. The NULL problem
Example 2
• Suppose you have a table that contains, in a single row, the customer and the commission amounts of all the possible contracts subscribed. Among them, the third commission does not make sense for that customer, so it has a NULL value. The request is to get the total amount of fees paid by the customer.
• Even in this case the SQL solution is very simple: just sum all the commission fields. Unfortunately, as in the previous example, the presence of a NULL value produces an incorrect result, because it nullifies the whole sum.
• These two very simple examples show the pitfalls inherent in the presence of NULL values in the Data Warehouse. Of course you can force, within each SQL statement, default values that handle the NULL, but then it must always be done, at the risk of forgetting it somewhere. I suggest the following rule.
SQL> with tab as (select 'C1' cliente, 10 com1, 40 com2, null com3, 18 com4 from dual)
  2  select cliente, com1+com2+com3+com4 tot
  3  from tab;

CL        TOT
-- ----------
C1
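Example 2 can also be reproduced with sqlite3, since row-wise arithmetic propagates NULL in the same way; the customer and commission values are the ones from the slide.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# One customer, four commission columns; com3 is NULL as in Example 2.
con.execute("CREATE TABLE tab (cliente TEXT, com1 INT, com2 INT, "
            "com3 INT, com4 INT)")
con.execute("INSERT INTO tab VALUES ('C1', 10, 40, NULL, 18)")

# Row-wise addition propagates NULL: the whole total becomes NULL.
tot, = con.execute("SELECT com1+com2+com3+com4 FROM tab").fetchone()
print(tot)  # -> None

# With a default of 0 (the neutral element of addition) the result is correct.
tot, = con.execute(
    "SELECT com1+com2+COALESCE(com3,0)+com4 FROM tab").fetchone()
print(tot)  # -> 68
```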
Rule 1
Do not allow the lack of information into the Data Warehouse. Each field must have a default value that replaces the NULL value. This must be done immediately, in the Staging Area, which will be the basis for the subsequent loading. You must not have NULL values.
6. The default values
• As a consequence of the previous rule, we must decide which default values will replace the NULL values. To make this decision, it is necessary to introduce a new rule:
Rule 2
Simplify the data types used in the Data Warehouse. Use, if possible, only 2 types: text values (for Oracle, VARCHAR2) and numerical values (for Oracle, NUMBER). The "day" fields must all be expressed as the concatenation of year, month and day, i.e. in the numerical format YYYYMMDD.
• Obviously, if you have values of CLOB or BLOB type, use those types as well; we do not associate default values with them.
• The use of the DATE format may be allowed, but only for technical fields.
• For textual values, try to occupy as little space as possible: not 'Undefined', but something simple. Personally, I use '?'.
• For numerical values, the default can be zero. Although we lose the meaning of the absence of information, it does not produce wrong results (do not forget that in mathematics, 0 is the neutral element of addition and subtraction). If the numerical value represents a day, then the default should not be zero, but 99991231, which is the maximum day. Using these two defaults, the two examples we saw previously would produce a correct result.
• For technical fields of DATE type, it may be helpful to set the system date.
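The policy of Rule 2 and its day-field refinement can be sketched as a small lookup function; the `_YMD` column-name suffix and the Oracle type names are taken from the naming convention used later in the recipe, while the function itself is purely illustrative.

```python
# Global defaults per data type, as decided above. The text default keeps
# its quotes so it can be injected verbatim into dynamic SQL later on.
DEFAULTS = {"VARCHAR2": "'?'", "NUMBER": 0, "DATE": "SYSDATE"}

def default_for(column_name: str, data_type: str):
    """Return the default value that replaces NULL for a given column."""
    # Numeric "day" fields get the maximum day instead of zero.
    if data_type == "NUMBER" and column_name.endswith("_YMD"):
        return 99991231
    return DEFAULTS[data_type]

print(default_for("F2_NUM", "NUMBER"))    # -> 0
print(default_for("F3_YMD", "NUMBER"))    # -> 99991231
print(default_for("F1_COD", "VARCHAR2"))  # -> '?'
```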
7. The exceptions
• There are no rules without exceptions. The exceptions are those cases, quite limited indeed, in which the use of the default value should be avoided because of the business logic of the field. Let's take two examples:
• Sometimes a field that defines a day is not valued in the feeding system because it refers to a day that starts from the beginning of time. In these cases the default value should not be the maximum day, but the minimum possible day, for example 1-jan-1111 (11110101).
• In the customer data table, the full name of the company can be very long. It is often broken into multiple fields because of the limited length of the fields of the feeding system, which means that to get the full name we must concatenate multiple fields. In this case it would be wrong to use a default value such as '?', because the concatenation would produce a name full of '?'.
We can then state a new rule.
Rule 3
The choice of the global default values and of the exception values (and keeping the NULL value is one of the options) must be made on the basis of the business requirements. The analysis phase will determine this choice.
8. The recipe
• We will create a Staging Area table. This table will have, for each field, the definition of the default value, which will be set according to the general rule, taking the exceptions into account.
• The SQL statement that replaces the NULL values with the default values will act as a post-processing step. I call it the enrichment phase of the Staging Area.
• This implementation will use a configuration table that makes it easy to build dynamic SQL statements usable for all Staging Area tables. This provides maximum scalability to the solution.
• To do this, we need a naming convention. I have written several times about the importance of naming conventions inside a Data Warehouse project. In this implementation, we use the following conventions:
EDW = project code
COM_MEF = Common Area (COM), subarea Micro ETL Foundation (MEF)
CUST = data source code
STA_SS1 = Staging Area (STA), subarea Source System 1 (SS1)
9. Global Configuration of the default values
• Let's start by creating a configuration table for the entire Data Warehouse.
• In it we will set the default values for the data types used.
• The SQL statement below creates the table and initializes it with the default values we decided on:
– a question mark for the text values
– zero for numeric values
– 99991231 for "day" fields in numeric format
– the system date for the DATE type of the technical fields.
SQL> CREATE TABLE EDW_COM_MEF_CFT (
  2   DEF_V VARCHAR2(30)
  3  ,DEF_N NUMBER
  4  ,DEF_YMD NUMBER
  5  ,DEF_D VARCHAR2(30)
  6  );

Table created.

SQL> INSERT INTO EDW_COM_MEF_CFT
  2  VALUES (''''||'?'||'''',0,99991231,'SYSDATE');

1 row created.

Note that the text default is stored together with its quotes (as '?') and the date default as the literal string 'SYSDATE', so that both can be injected as-is into the dynamic SQL we will build later.
10. Data Source Configuration
• At this point we create the configuration table of the data source, with the following structure:
– the unique code of the data source
– the name of the table that configures the fields of the data source
– the name of the object with the data to be loaded into the staging table
– the name of the staging table.
• This configuration table is very important because it will allow us to generalize the loading process using dynamic SQL statements.
SQL> CREATE TABLE EDW_COM_MEF_IO_CFT (
  2   IO_COD VARCHAR2(10)
  3  ,CXT_COD VARCHAR2(30)
  4  ,FXV_COD VARCHAR2(30)
  5  ,STT_COD VARCHAR2(30)
  6  );

Table created.

SQL> INSERT INTO EDW_COM_MEF_IO_CFT
  2  VALUES ('CUST'
  3  ,'EDW_STA_SS1_CUST_CXT'
  4  ,'EDW_STA_SS1_CUST_FXV'
  5  ,'EDW_STA_SS1_CUST_STT'
  6  );

1 row created.
11. Creating and configuring the detail table of the data source
• After configuring the data source, you must configure its columns (which will be the same as those of the Staging table), their type and, what we need here, the default value, if you want to make an exception to the global default value for that data type.
• With this configuration, we keep the global default values for the fields KEY_ID, F1_COD, F2_NUM and F4_DAT.
• We force a different default value for the fields F3_YMD and F5_COD (for F5_COD, the string 'NULL' means that the NULL value will be kept, which Rule 3 allows as one of the options).
SQL> CREATE TABLE EDW_STA_SS1_CUST_CXT (
  2   COLUMN_COD VARCHAR2(30)
  3  ,DATA_TYPE VARCHAR2(30)
  4  ,DEF_TXT VARCHAR2(30)
  5  );

Table created.

SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('KEY_ID','NUMBER',NULL);
1 row created.
SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F1_COD','VARCHAR2',NULL);
1 row created.
SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F2_NUM','NUMBER',NULL);
1 row created.
SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F3_YMD','NUMBER',11110101);
1 row created.
SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F4_DAT','DATE',NULL);
1 row created.
SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F5_COD','VARCHAR2','NULL');
1 row created.
12. Simulation of source data
• We simulate a data source with two rows, one with all NULL values and one with real values.
• In a real case, the source data could be a regular table, an external table pointing to the physical source, or something else.
• We will use a view that simulates the two rows. This is done solely for convenience of exposition.
SQL> CREATE OR REPLACE VIEW EDW_STA_SS1_CUST_FXV AS
2 SELECT
3 CAST(1 AS NUMBER) KEY_ID
4 ,CAST(NULL AS VARCHAR2(30)) F1_COD
5 ,CAST(NULL AS NUMBER) F2_NUM
6 ,CAST(NULL AS NUMBER) F3_YMD
7 ,CAST(NULL AS DATE) F4_DAT
8 ,CAST(NULL AS VARCHAR2(30)) F5_COD
9 FROM DUAL
10 UNION ALL
11 SELECT 2 KEY_ID
12 ,'CODE1' F1_COD
13 ,250 F2_NUM
14 ,20140207 F3_YMD
15 ,sysdate-10 F4_DAT
16 ,'CODE2'
17 FROM DUAL;
View created.
13. Creating the Staging Area Table
• We create the Staging Area table that will be loaded from the data source shown in the previous slide.
SQL> CREATE TABLE EDW_STA_SS1_CUST_STT (
2 KEY_ID NUMBER
3 ,F1_COD VARCHAR2(30)
4 ,F2_NUM NUMBER
5 ,F3_YMD NUMBER
6 ,F4_DAT DATE
7 ,F5_COD VARCHAR2(30)
8 );
Table created.
14. Setting the default values for the Staging Area table
• Using the above settings, we can create a dynamic procedure which, given the code of the data source as input, sets the default values on the Staging Area table.
• It applies the global default whenever no exception is present in the configuration table.
• The names of the columns involved are extracted directly from Oracle's data dictionary (COLS, i.e. USER_TAB_COLUMNS).
• After creating the procedure, we can run it:
SQL> exec p_default ('CUST');
• We can verify the outcome of the procedure by looking at the table structure in the data dictionary: it will show the default values set in the column USER_TAB_COLUMNS.DATA_DEFAULT.
create or replace procedure p_default(p_io varchar2) as
v_sql varchar2(4000);
v_io edw_com_mef_io_cft%rowtype;
v_cft edw_com_mef_cft%rowtype;
v_def varchar2(60);
type t_rc is ref cursor;
v_cur t_rc;
v_column_name varchar2(30);
v_data_type varchar2(30);
v_def_txt varchar2(30);
begin
select * into v_cft from edw_com_mef_cft;
select * into v_io from edw_com_mef_io_cft where io_cod = p_io;
v_sql := 'select a.column_name,a.data_type,b.def_txt'||
' from cols a'||' left outer join '||v_io.cxt_cod||' b'||
' on (a.column_name = b.column_cod)'||
' where a.table_name = '||''''||v_io.stt_cod||'''';
open v_cur for v_sql;
loop
fetch v_cur into v_column_name,v_data_type,v_def_txt;
exit when v_cur%notfound;
if (v_data_type = 'NUMBER') then
if (v_column_name like '%_YMD') then v_def := nvl(v_def_txt,v_cft.def_ymd);
else v_def := nvl(v_def_txt,v_cft.def_n);
end if;
elsif (v_data_type = 'DATE') then v_def := nvl(v_def_txt,v_cft.def_d);
else v_def := nvl(v_def_txt,v_cft.def_v);
end if;
v_sql := 'ALTER TABLE '||v_io.stt_cod||
' MODIFY('||v_column_name||' DEFAULT '||v_def||')';
execute immediate v_sql;
end loop;
close v_cur;
end;
/
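Outside the database, the default-resolution logic of p_default can be sketched in Python; the configuration rows mirror the earlier slides, and the function only builds the ALTER statements rather than executing them, so everything here is illustrative.

```python
# Global defaults, as stored in EDW_COM_MEF_CFT (text default keeps its
# quotes so it can be pasted verbatim into the DDL).
GLOBAL = {"DEF_V": "'?'", "DEF_N": "0", "DEF_YMD": "99991231",
          "DEF_D": "SYSDATE"}

# (column, data_type, exception_default) rows, as in EDW_STA_SS1_CUST_CXT.
CXT = [("KEY_ID", "NUMBER", None), ("F1_COD", "VARCHAR2", None),
       ("F2_NUM", "NUMBER", None), ("F3_YMD", "NUMBER", "11110101"),
       ("F4_DAT", "DATE", None), ("F5_COD", "VARCHAR2", "NULL")]

def alter_statements(table, cxt):
    """Build one ALTER ... MODIFY(... DEFAULT ...) per column, taking the
    exception default when present, the global one otherwise (the nvl)."""
    stmts = []
    for col, typ, exc in cxt:
        if typ == "NUMBER":
            glob = GLOBAL["DEF_YMD"] if col.endswith("_YMD") else GLOBAL["DEF_N"]
        elif typ == "DATE":
            glob = GLOBAL["DEF_D"]
        else:
            glob = GLOBAL["DEF_V"]
        default = exc if exc is not None else glob  # nvl(def_txt, global)
        stmts.append(f"ALTER TABLE {table} MODIFY({col} DEFAULT {default})")
    return stmts

for s in alter_statements("EDW_STA_SS1_CUST_STT", CXT):
    print(s)
```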
15. Loading Staging Area table
• In order to load the data, we can use the following procedure, dynamic and usable for any Staging Area table.
• After creating the procedure, we can run it:
SQL> exec p_ins_stt ('CUST');
• We can verify the outcome of the procedure by reading the rows of the Staging table.
• I wish to emphasize that the load must not change the source data. Obviously the forcing of the default values could be performed at the time of loading the Staging table. The reason why it is convenient to do it as post-processing is the presence of consistency checks that we may want to run on the input data. In order to perform these checks, the data must not be modified: it must stay exactly as it arrived. Only after a positive outcome of the checks can we enrich the data with the default values.
create or replace procedure p_ins_stt(p_io varchar2) as
v_io edw_com_mef_io_cft%rowtype;
v_sql varchar2(32000);
v_list varchar2(4000);
begin
select * into v_io
from edw_com_mef_io_cft
where io_cod = p_io;
v_sql :=
'select listagg(f.column_name,'||''''||','||''''||') '||
'within group (order by f.column_id) '||
'from cols f '||
'inner join cols t on ( f.column_name = t.column_name '||
'and t.table_name = upper('||''''||v_io.stt_cod||''''||')) '||
'where f.table_name = upper('||''''||v_io.fxv_cod||''''||')';
execute immediate v_sql into v_list;
v_sql := 'insert into '||v_io.stt_cod||'('||v_list||')'||
' select distinct '||v_list||' from '||v_io.fxv_cod;
execute immediate v_sql;
commit;
end;
/
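The generalization trick of p_ins_stt is the column matching: the INSERT list is the intersection of the source-view columns and the staging-table columns, so one procedure serves every staging table. A sketch in Python, with the column lists passed in explicitly instead of being read from the data dictionary:

```python
def build_insert(stt, fxv, stt_cols, fxv_cols):
    """Build the dynamic INSERT..SELECT over the columns shared between
    the source object (fxv) and the staging table (stt)."""
    common = [c for c in fxv_cols if c in set(stt_cols)]  # keep source order
    col_list = ",".join(common)
    return (f"insert into {stt}({col_list}) "
            f"select distinct {col_list} from {fxv}")

sql = build_insert(
    "EDW_STA_SS1_CUST_STT", "EDW_STA_SS1_CUST_FXV",
    ["KEY_ID", "F1_COD", "F2_NUM", "F3_YMD", "F4_DAT", "F5_COD"],
    ["KEY_ID", "F1_COD", "F2_NUM", "F3_YMD", "F4_DAT", "F5_COD"])
print(sql)
```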
16. Creating the Function to extract the default values
• This function is useful for getting the default value from the Oracle data dictionary in a readable format, since DATA_DEFAULT is of type LONG.
• We will use this function in the next procedure.
create or replace function f_dd(
p_tab varchar2
, p_col varchar2
) return varchar2 as
v_out varchar2(4000);
begin
select data_default
into v_out
from cols
where table_name = p_tab
and column_name = p_col;
return nvl(v_out,'null');
end;
/
sho errors
17. Updating the Staging Area table
• With the help of the previous function, we can now create a procedure that changes all NULL values according to the default values.
• We can launch it with:
SQL> exec p_upd_stt ('CUST');
• Now verify the result.
create or replace procedure p_upd_stt(p_io varchar2) as
v_sql clob;
v_io edw_com_mef_io_cft%rowtype;
begin
select * into v_io
from edw_com_mef_io_cft
where io_cod = p_io;
for r in (select ','||column_name||' = '||
'nvl('||column_name||
','||f_dd(table_name,column_name)||')' stm
from cols
where table_name = v_io.stt_cod) loop
v_sql := v_sql ||r.stm;
end loop;
v_sql := 'UPDATE '||v_io.stt_cod||' SET '||substr(v_sql,2);
execute immediate v_sql;
commit;
end;
/
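The enrichment step that p_upd_stt performs with one dynamic UPDATE can be sketched end-to-end with sqlite3; the defaults are hard-coded here to mirror the configuration of the previous slides (in the real procedure they come from the data dictionary), and COALESCE plays the role of Oracle's NVL.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Staging table with the two simulated rows: one all-NULL, one with values.
con.execute("""CREATE TABLE stt (key_id INT, f1_cod TEXT, f2_num INT,
                                 f3_ymd INT, f4_dat TEXT, f5_cod TEXT)""")
con.execute("INSERT INTO stt VALUES (1, NULL, NULL, NULL, NULL, NULL)")
con.execute("INSERT INTO stt VALUES "
            "(2, 'CODE1', 250, 20140207, '2014-01-27', 'CODE2')")

# Per-column defaults, mirroring the recipe's configuration:
# global '?'/0 defaults, the 11110101 exception for f3_ymd, the system
# date for the technical date field; f5_cod keeps its NULL (exception).
defaults = {"f1_cod": "'?'", "f2_num": "0",
            "f3_ymd": "11110101", "f4_dat": "date('now')"}

# One UPDATE that replaces every NULL with its default, as p_upd_stt does.
sets = ",".join(f"{c} = COALESCE({c}, {d})" for c, d in defaults.items())
con.execute(f"UPDATE stt SET {sets}")

print(con.execute(
    "SELECT f1_cod, f2_num, f3_ymd FROM stt WHERE key_id = 1").fetchone())
# -> ('?', 0, 11110101)
```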
18. Flow of the NULL management in a Data Warehouse
[Diagram: the flow of the NULL management among the configuration tables (<prj>_COM_MEF_CFT, <prj>_COM_MEF_IO_CFT, <prj>_STA_<sio>_<io>_CXT), the source view <prj>_STA_<sio>_<io>_FXV, the Staging Area table <prj>_STA_<sio>_<io>_STT, the data dictionary (cols) and the procedures p_default, p_ins_stt and p_upd_stt.]

<prj> = Project code
<sio> = Source subsystem
<io> = Source code

1. Configure the data source
2. Configure the global default values
3. Configure the exception values for every field of the data source
4. Load (simulate) the data source
5. Set the default values for the Staging Area table
6. Load the Staging Area table
7. Update the Staging Area table with the default values