Recipe 5 of Data Warehouse and Business Intelligence - The NULL values management in the ETL process



Published in: Technology

  1. Recipes of Data Warehouse and Business Intelligence: The NULL values management in the ETL process
  2. The NULL management
     • In the Data Warehouse community, the presence or absence of NULL values has always been the subject of conflicting opinions.
     • It is interesting to see how seemingly insignificant details can affect the loading and/or the result of the information extracted, manually or with Business Intelligence tools.
     • Topics such as NULL management tend to look like technical details, stuff for programmers. We think we can neglect them because of the many other complexities involved in developing a Data Warehouse project.
     • Unfortunately, in a Data Warehouse there is nothing, absolutely nothing, that can be overlooked. Each of its components is linked to the others and always has consequences for the final result.
     • This means being aware of the problems that may arise in the future and addressing them now, before it is too late. Do not forget that in a Data Warehouse, going back because of a wrong choice, or even worse an ignored one, can be very painful.
     • The management of NULL, to put it in technical language, or the management of the absence of information, to put it in logical language, is one of these topics.
  3. The meaning of NULL
     • In a relational database, and therefore in most of the databases underlying Data Warehouse & Business Intelligence solutions, a NULL value in a field of a table means the lack of information: it is not a value, but the absence of a value.
     • This does not mean that it is a mistake, although it may be the result of a problem in the system that feeds the data. Often it is simply not possible to associate a value.
     • Consider a loan agreement. Among its various attributes is the closing day of the contract. Obviously this field remains NULL, because the information will only be known in the future, at the time of closing. For the moment it will be NULL.
     • Even in the domain of numeric values, NULL has a precise meaning, which is different from the value 0 (zero). Think of a list of amounts that a customer pays to a bank as commissions. A value of 0 means that the customer, perhaps because of a special agreement, pays nothing for a given commission, but that commission is part of the contract. A NULL value may mean that the commission does not apply because the customer does not have that contract.
     • So the presence of a NULL value can have many meanings.
  4. The NULL problem
     • Beyond the intrinsic meaning of NULL values, what are the consequences for the Data Warehouse? The problems occur at data extraction time. Let's see two examples.
     • Example 1. Suppose you have a list of contracts, each with its own expiration date. For simplicity, we use the SQL "WITH" clause to simulate on the fly a table with three rows. The first row is a contract that expired two days ago, the second is a contract that we already know will expire in 5 days, and the third is a contract with no expiration date (NULL).
     • The request (or report) is to extract all contracts that do not expire in the next 10 days. The SQL is conceptually very simple: select all contracts whose expiration date is greater than today + 10. There should be exactly one. Unfortunately, a comparison with NULL is never true, so the query produces an incorrect result: 0 rows.

         SQL> with tab as (
                select 'C1' contr, sysdate-2 data_scad from dual union all
                select 'C2' contr, sysdate+5 data_scad from dual union all
                select 'C3' contr, null data_scad from dual)
              select * from tab
              where data_scad > sysdate+10;

         no rows selected
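The behaviour in Example 1 can be reproduced outside Oracle. The following is a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in database; the table name `contracts`, the column `expiry_ymd` and the fixed "today" of 20140207 are hypothetical choices made only to keep the run repeatable.

```python
import sqlite3

# Hypothetical stand-in for the three simulated contracts. Days are stored as
# YYYYMMDD numbers and "today" is fixed at 20140207 so the run is repeatable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (contr TEXT, expiry_ymd INTEGER)")
conn.executemany(
    "INSERT INTO contracts VALUES (?, ?)",
    [("C1", 20140205), ("C2", 20140212), ("C3", None)],
)
threshold = 20140217  # today + 10 days

# NULL > threshold evaluates to unknown, so the contract without an
# expiration date is filtered out together with the other two.
naive = conn.execute(
    "SELECT contr FROM contracts WHERE expiry_ymd > ?", (threshold,)
).fetchall()

# Handling the NULL explicitly inside the query brings the contract back.
explicit = conn.execute(
    "SELECT contr FROM contracts WHERE expiry_ymd > ? OR expiry_ymd IS NULL",
    (threshold,),
).fetchall()

print(naive)     # []
print(explicit)  # [('C3',)]
```

The explicit `IS NULL` test works, but it has to be repeated in every query that touches the column.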
  5. The NULL problem
     • Example 2. Suppose you have a table that contains, in one row, the customer and the commission amounts for all possible contracts. The third commission does not apply to this customer, so it has a NULL value. The request is the total amount of commissions paid by the customer.
     • Here too the SQL is very simple: just add up all the commission fields. Unfortunately, as in the previous example, the presence of a NULL value produces an incorrect result, because it nullifies the sum.

         SQL> with tab as (
                select 'C1' cliente, 10 com1, 40 com2, null com3, 18 com4 from dual)
              select cliente, com1+com2+com3+com4 tot
              from tab;

         CL        TOT
         -- ----------
         C1

     • These two very simple examples show the pitfalls of NULL values in the Data Warehouse. Of course you can handle the NULL inside each SQL statement by forcing a default value, but then you always run the risk of forgetting to do it. I suggest the following rule.
     • Rule 1. Do not allow the lack of information into the Data Warehouse. Each field must have a default value that replaces the NULL value. This must be done immediately, in the Staging Area, which is the basis for the subsequent loading. You must not have NULL values.
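Example 2 can be checked the same way. This is a small sketch, again assuming Python's sqlite3 as a stand-in for Oracle; the table name `commissions` is hypothetical, and `COALESCE` plays the role of forcing a default inside the statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE commissions "
    "(cliente TEXT, com1 INTEGER, com2 INTEGER, com3 INTEGER, com4 INTEGER)"
)
conn.execute("INSERT INTO commissions VALUES ('C1', 10, 40, NULL, 18)")

# Any arithmetic expression containing a NULL operand evaluates to NULL.
naive_total = conn.execute(
    "SELECT com1 + com2 + com3 + com4 FROM commissions"
).fetchone()[0]

# Forcing a default inside the query repairs this one statement only.
patched_total = conn.execute(
    "SELECT com1 + com2 + COALESCE(com3, 0) + com4 FROM commissions"
).fetchone()[0]

print(naive_total)    # None
print(patched_total)  # 68
```

Every query that adds these columns needs the same patch, which is the risk of forgetting that Rule 1 is meant to remove.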
  6. The default values
     • As a consequence of the previous rule, we must decide which default values replace the NULL values. To make this decision, a new rule is needed.
     • Rule 2. Simplify the data types used in the Data Warehouse. Use, if possible, only two types: text values (VARCHAR2 in Oracle) and numeric values (NUMBER in Oracle). The "day" fields must all be expressed as the concatenation of year, month and day, i.e. in the numeric format YYYYMMDD.
     • Obviously, if you have values of CLOB or BLOB type, use these types as well; we do not associate default values with them. The DATE type may be allowed, but only for technical fields.
     • For text values, try to occupy as little space as possible: not 'Undefined', but something simple. Personally, I use '?'.
     • For numeric values, the default value can be zero. In doing so we lose the meaning of the absence of information, but we do not produce wrong results (do not forget that adding or subtracting 0 leaves a value unchanged).
     • If the numeric value represents a day, the default should not be zero but 99991231, which is the maximum day.
     • Using these two defaults, the two examples seen previously produce a correct result.
     • For technical fields of DATE type, it may be helpful to use the system date.
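To see why these two defaults make the earlier examples come out right, here is a minimal sketch, assuming Python's sqlite3 as a stand-in for Oracle. The staging table `stg`, its columns and the sample rows are hypothetical; only the defaults 0 and 99991231 come from the slide.

```python
import sqlite3

DEF_N = 0           # default for numeric values
DEF_YMD = 99991231  # default for days in YYYYMMDD format

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stg "
    "(contr TEXT, expiry_ymd INTEGER, com1 INTEGER, com2 INTEGER, com3 INTEGER)"
)
conn.executemany(
    "INSERT INTO stg VALUES (?, ?, ?, ?, ?)",
    [
        ("C1", 20140205, 10, 40, 18),
        ("C2", 20140212, 5, None, 7),
        ("C3", None, 10, 40, None),
    ],
)

# Replace the NULLs once, in the staging table, instead of in every query.
conn.execute(
    "UPDATE stg SET expiry_ymd = COALESCE(expiry_ymd, ?), "
    "com2 = COALESCE(com2, ?), com3 = COALESCE(com3, ?)",
    (DEF_YMD, DEF_N, DEF_N),
)

# The plain queries now behave as the report expects.
not_expiring = conn.execute(
    "SELECT contr FROM stg WHERE expiry_ymd > 20140217"
).fetchall()
totals = conn.execute(
    "SELECT contr, com1 + com2 + com3 FROM stg ORDER BY contr"
).fetchall()

print(not_expiring)  # [('C3',)]
print(totals)        # [('C1', 68), ('C2', 12), ('C3', 50)]
```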
  7. The exceptions
     • There are no rules without exceptions. The exceptions are those cases, quite limited in fact, in which the default value should be avoided because of the business logic of the field. Let's take two examples.
     • Sometimes a field that defines a day is not filled in by the feeding system, meaning that the day starts from the beginning of time. In these cases the default value should not be the maximum day but the minimum possible day, for example 1-jan-1111 (11110101).
     • In the customer data table, the full name of a company can be very long. It is often split across multiple fields because of the limited field length in the feeding system, so to get the full name we must concatenate those fields. In this case it would be wrong to use a default value such as '?', because the concatenation would produce a name full of '?'.
     • We can then state a new rule.
     • Rule 3. The choice of the global default values and of the exception values (keeping the NULL value is one of the options) must be based on the business requirements. The analysis phase will determine this choice.
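The company-name exception is easy to illustrate. The sketch below is plain Python with a hypothetical helper `full_name` and made-up name fragments; it mimics Oracle's `||` operator, which treats a NULL operand as an empty string.

```python
def full_name(fragments, default=""):
    """Concatenate name fragments, replacing missing ones with `default`."""
    return "".join(default if part is None else part for part in fragments)


# Hypothetical company name split over four source fields, two of them empty.
fragments = ["ACME INTERNATIONAL ", "HOLDING", None, None]

with_global_default = full_name(fragments, default="?")
with_exception = full_name(fragments)  # missing fragments are left out

print(with_global_default)  # ACME INTERNATIONAL HOLDING??
print(with_exception)       # ACME INTERNATIONAL HOLDING
```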
  8. The recipe
     • We will create a Staging Area table. Each of its fields will have a default value defined, set according to the general rule and taking the exceptions into account.
     • The SQL statement that replaces the NULL values with the default values acts as post-processing. I call it the enrichment phase of the Staging Area.
     • The implementation uses configuration tables that make it easy to build dynamic SQL statements usable for all Staging Area tables. This makes the solution scalable.
     • To do this, we need a naming convention. I have written several times about the importance of naming conventions in a Data Warehouse project. In this implementation, we have the following conventions:
       – EDW = project code
       – COM_MEF = Common Area (COM), subarea Micro ETL Foundation (MEF)
       – CUST = data source code
       – STA_SS1 = Staging Area (STA), subarea Source System 1 (SS1)
  9. Global configuration of the default values
     • Let's start by creating a configuration table for the entire Data Warehouse. In it we set the default values for the data types used.
     • The SQL statements below create the table and initialize it with the default values we decided on:
       – a question mark for text values
       – zero for numeric values
       – 99991231 for day fields in numeric format
       – the system date for technical fields of DATE type

         SQL> CREATE TABLE EDW_COM_MEF_CFT (
                DEF_V   VARCHAR2(30)
               ,DEF_N   NUMBER
               ,DEF_YMD NUMBER
               ,DEF_D   VARCHAR2(30)
              );

         Table created.

         SQL> INSERT INTO EDW_COM_MEF_CFT
              VALUES (''''||'?'||'''',0,99991231,'SYSDATE');

         1 row created.

     • If we look at the contents of the table, we get:

         DEF_V  DEF_N  DEF_YMD   DEF_D
         -----  -----  --------  -------
         '?'        0  99991231  SYSDATE
  10. Data source configuration
     • At this point we create the configuration table of the data sources, with the following structure:
       – the unique code of the data source (IO_COD)
       – the name of the table that configures the fields of the data source (CXT_COD)
       – the name of the object holding the data to be loaded into the staging table (FXV_COD)
       – the name of the staging table (STT_COD)
     • This configuration table is very important because it allows us to generalize the loading process using dynamic SQL statements.

         SQL> CREATE TABLE EDW_COM_MEF_IO_CFT (
                IO_COD  VARCHAR2(10)
               ,CXT_COD VARCHAR2(30)
               ,FXV_COD VARCHAR2(30)
               ,STT_COD VARCHAR2(30)
              );

         Table created.

         SQL> INSERT INTO EDW_COM_MEF_IO_CFT
              VALUES ('CUST'
                     ,'EDW_STA_SS1_CUST_CXT'
                     ,'EDW_STA_SS1_CUST_FXV'
                     ,'EDW_STA_SS1_CUST_STT'
                     );

         1 row created.
  11. Creating and configuring the detail table of the data source
     • After configuring the data source, you must configure its columns (which are the same as those of the Staging table), their type and, most importantly for our purpose, the default value whenever you want an exception to the global default for that data type.
     • With this configuration, we keep the global default values for the fields KEY_ID, F1_COD, F2_NUM and F4_DAT.
     • We force a different default value for the fields F3_YMD and F5_COD: the minimum day 11110101 for F3_YMD and the keyword NULL for F5_COD, which therefore keeps its NULL values.

         SQL> CREATE TABLE EDW_STA_SS1_CUST_CXT (
                COLUMN_COD VARCHAR2(30)
               ,DATA_TYPE  VARCHAR2(30)
               ,DEF_TXT    VARCHAR2(30)
              );

         Table created.

         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('KEY_ID','NUMBER',NULL);
         1 row created.
         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F1_COD','VARCHAR2',NULL);
         1 row created.
         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F2_NUM','NUMBER',NULL);
         1 row created.
         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F3_YMD','NUMBER',11110101);
         1 row created.
         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F4_DAT','DATE',NULL);
         1 row created.
         SQL> INSERT INTO EDW_STA_SS1_CUST_CXT VALUES ('F5_COD','VARCHAR2','NULL');
         1 row created.
  12. Simulation of source data
     • We simulate a data source with two rows, one with all NULL values (apart from the key) and one with real values.
     • In a real case, the source data could be a regular table, an external table over the physical source file, or something else.
     • We create a view that simulates the two rows with two SELECTs on DUAL combined by UNION ALL. This is done solely for convenience of exposition.

         SQL> CREATE OR REPLACE VIEW EDW_STA_SS1_CUST_FXV AS
              SELECT
                 CAST(1 AS NUMBER) KEY_ID
                ,CAST(NULL AS VARCHAR2(30)) F1_COD
                ,CAST(NULL AS NUMBER) F2_NUM
                ,CAST(NULL AS NUMBER) F3_YMD
                ,CAST(NULL AS DATE) F4_DAT
                ,CAST(NULL AS VARCHAR2(30)) F5_COD
              FROM DUAL
              UNION ALL
              SELECT 2 KEY_ID
                ,'CODE1' F1_COD
                ,250 F2_NUM
                ,20140207 F3_YMD
                ,sysdate-10 F4_DAT
                ,'CODE2' F5_COD
              FROM DUAL;

         View created.

     • The content of the view will be:

         KEY_ID F1_COD F2_NUM   F3_YMD F4_DAT             F5_COD
         ------ ------ ------ -------- ------------------ ------
              1
              2 CODE1     250 20140207 27/01/2014 9.34.35 CODE2
  13. Creating the Staging Area table
     • We create the Staging Area table that will be loaded from the data source shown in the previous slide.

         SQL> CREATE TABLE EDW_STA_SS1_CUST_STT (
                KEY_ID NUMBER
               ,F1_COD VARCHAR2(30)
               ,F2_NUM NUMBER
               ,F3_YMD NUMBER
               ,F4_DAT DATE
               ,F5_COD VARCHAR2(30)
              );

         Table created.
  14. Setting the default values for the Staging Area table
     • Using the above settings, we can create a dynamic procedure which, given the source code as input, sets the default values on the Staging Area table.
     • It applies the global default whenever no exception is present in the configuration table.
     • The names of the columns involved are extracted directly from Oracle's data dictionary (COLS, i.e. USER_TAB_COLUMNS).

         create or replace procedure p_default(p_io varchar2) as
           v_sql         varchar2(4000);
           v_io          edw_com_mef_io_cft%rowtype;
           v_cft         edw_com_mef_cft%rowtype;
           v_def         varchar2(60);
           type t_rc is ref cursor;
           v_cur         t_rc;
           v_column_name varchar2(30);
           v_data_type   varchar2(30);
           v_def_txt     varchar2(30);
         begin
           -- global defaults and configuration of the requested data source
           select * into v_cft from edw_com_mef_cft;
           select * into v_io from edw_com_mef_io_cft where io_cod = p_io;
           -- staging table columns, with the exception default when configured
           v_sql := 'select a.column_name,a.data_type,b.def_txt'||
                    ' from cols a'||
                    ' left outer join '||v_io.cxt_cod||' b'||
                    ' on (a.column_name = b.column_cod)'||
                    ' where a.table_name = '||''''||v_io.stt_cod||'''';
           open v_cur for v_sql;
           loop
             fetch v_cur into v_column_name,v_data_type,v_def_txt;
             exit when v_cur%notfound;
             -- the exception wins; otherwise use the global default for the type
             if (v_data_type = 'NUMBER') then
               if (v_column_name like '%_YMD') then
                 v_def := nvl(v_def_txt,v_cft.def_ymd);
               else
                 v_def := nvl(v_def_txt,v_cft.def_n);
               end if;
             elsif (v_data_type = 'DATE') then
               v_def := nvl(v_def_txt,v_cft.def_d);
             else
               v_def := nvl(v_def_txt,v_cft.def_v);
             end if;
             v_sql := 'ALTER TABLE '||v_io.stt_cod||
                      ' MODIFY('||v_column_name||' DEFAULT '||v_def||')';
             execute immediate v_sql;
           end loop;
           close v_cur;
         end;
         /

     • After creating the procedure, we can run it:

         SQL> exec p_default('CUST');

     • We can verify the outcome of the procedure by looking at the table structure in the data dictionary. The default values that were set appear in the field USER_TAB_COLUMNS.DATA_DEFAULT.
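The decision logic of p_default can be followed more easily outside dynamic SQL. The sketch below is plain Python that only builds the ALTER TABLE text and does not run it; `resolve_default` and `alter_statement` are hypothetical helper names, and the suffix test is a literal check, whereas the PL/SQL uses LIKE '%_YMD', where `_` matches any single character.

```python
# Global defaults as stored in EDW_COM_MEF_CFT (values are SQL text fragments).
GLOBAL_DEFAULTS = {"DEF_V": "'?'", "DEF_N": "0", "DEF_YMD": "99991231", "DEF_D": "SYSDATE"}


def resolve_default(column_name, data_type, exception=None, defaults=GLOBAL_DEFAULTS):
    """Return the default expression: the configured exception, else the global one."""
    if exception is not None:
        return exception
    if data_type == "NUMBER":
        # Day columns are recognised by their name suffix.
        return defaults["DEF_YMD"] if column_name.endswith("_YMD") else defaults["DEF_N"]
    if data_type == "DATE":
        return defaults["DEF_D"]
    return defaults["DEF_V"]


def alter_statement(table, column_name, default_expr):
    """Build the statement text that p_default would execute for one column."""
    return f"ALTER TABLE {table} MODIFY({column_name} DEFAULT {default_expr})"


# Column configuration as in EDW_STA_SS1_CUST_CXT: (name, type, exception).
columns = [
    ("KEY_ID", "NUMBER", None),
    ("F1_COD", "VARCHAR2", None),
    ("F2_NUM", "NUMBER", None),
    ("F3_YMD", "NUMBER", "11110101"),
    ("F4_DAT", "DATE", None),
    ("F5_COD", "VARCHAR2", "NULL"),
]

resolved = {name: resolve_default(name, dtype, exc) for name, dtype, exc in columns}
statements = [
    alter_statement("EDW_STA_SS1_CUST_STT", name, resolved[name]) for name, _, _ in columns
]

for statement in statements:
    print(statement)
```

Without the exception row, F3_YMD would fall back to 99991231 through the suffix rule; with it, the configured 11110101 takes precedence.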
  15. Loading the Staging Area table
     • To load the data, we can use the following procedure, which is dynamic and usable for any Staging Area table.

         create or replace procedure p_ins_stt(p_io varchar2) as
           v_io   edw_com_mef_io_cft%rowtype;
           v_sql  varchar2(32000);
           v_list varchar2(4000);
         begin
           select * into v_io from edw_com_mef_io_cft where io_cod = p_io;
           -- comma-separated list of the columns common to source and staging table
           v_sql := 'select listagg(f.column_name,'||''''||','||''''||') '||
                    'within group (order by f.column_id) '||
                    'from cols f '||
                    'inner join cols t on ( f.column_name = t.column_name '||
                    'and t.table_name = upper('||''''||v_io.stt_cod||''''||')) '||
                    'where f.table_name = upper('||''''||v_io.fxv_cod||''''||')';
           execute immediate v_sql into v_list;
           -- copy the source rows into the staging table, unchanged
           v_sql := 'insert into '||v_io.stt_cod||'('||v_list||')'||
                    ' select distinct '||v_list||' from '||v_io.fxv_cod;
           execute immediate v_sql;
           commit;
         end;
         /

     • After creating the procedure, we can run it:

         SQL> exec p_ins_stt('CUST');

     • We can verify the outcome of the procedure by querying the rows of the Staging table.
     • I wish to emphasize that the load must not change the source data. The default values could obviously be forced while loading the Staging table. The reason it is convenient to do it as post-processing is the presence of consistency checks that we want to run on the input data: to perform these checks, the data must be exactly as received. Only after the checks succeed can we enrich the data with the default values.
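The idea behind p_ins_stt, building the column list from the catalog and then generating the INSERT, can be sketched with Python's sqlite3 as a stand-in. Here `PRAGMA table_info` plays the role of Oracle's COLS view, and the tables `src` and `stt`, the extra column and the duplicated row are hypothetical.

```python
import sqlite3


def column_names(conn, table):
    """Column names of a table in definition order (index 1 of each PRAGMA row)."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (KEY_ID INTEGER, F1_COD TEXT, F2_NUM INTEGER, EXTRA TEXT)")
conn.execute("CREATE TABLE stt (KEY_ID INTEGER, F1_COD TEXT, F2_NUM INTEGER)")
conn.executemany(
    "INSERT INTO src VALUES (?, ?, ?, ?)",
    [(1, None, None, "x"), (2, "CODE1", 250, "y"), (2, "CODE1", 250, "y")],
)

# Keep only the columns present in both objects, in source column order.
staging_columns = set(column_names(conn, "stt"))
common = [name for name in column_names(conn, "src") if name in staging_columns]
col_list = ",".join(common)

# The load copies the data as it is: NULLs are not touched at this stage.
insert_sql = f"INSERT INTO stt ({col_list}) SELECT DISTINCT {col_list} FROM src"
conn.execute(insert_sql)
loaded = conn.execute("SELECT * FROM stt ORDER BY KEY_ID").fetchall()

print(insert_sql)
print(loaded)  # [(1, None, None), (2, 'CODE1', 250)]
```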
  16. Creating the function to extract the default values
     • This function returns the default value from the Oracle data dictionary in a readable format, since the DATA_DEFAULT field is of type LONG.
     • The next procedure will use this function.

         create or replace function f_dd(
            p_tab varchar2
          , p_col varchar2
         ) return varchar2 as
           v_out varchar2(4000);
         begin
           -- reading the LONG column into a VARCHAR2 variable makes it usable in SQL text
           select data_default into v_out
           from cols
           where table_name = p_tab
             and column_name = p_col;
           return nvl(v_out,'null');
         end;
         /
         show errors
  17. Updating the Staging Area table
     • With the help of the previous function, we can now create a procedure that replaces all NULL values with the corresponding default value.

         create or replace procedure p_upd_stt(p_io varchar2) as
           v_sql clob;
           v_io  edw_com_mef_io_cft%rowtype;
         begin
           select * into v_io from edw_com_mef_io_cft where io_cod = p_io;
           -- one ",col = nvl(col,default)" fragment per column of the staging table
           for r in (select ','||column_name||' = '||
                            'nvl('||column_name||
                            ','||f_dd(table_name,column_name)||')' stm
                     from cols
                     where table_name = v_io.stt_cod) loop
             v_sql := v_sql||r.stm;
           end loop;
           -- substr(...,2) drops the leading comma
           v_sql := 'UPDATE '||v_io.stt_cod||' SET '||substr(v_sql,2);
           execute immediate v_sql;
           commit;
         end;
         /

     • We can launch it with:

         SQL> exec p_upd_stt('CUST');

     • Now verify the result.
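The enrichment step can be tried end to end on a small table. This is a minimal sketch, assuming Python's sqlite3 as a stand-in for Oracle: `COALESCE` replaces NVL, CURRENT_TIMESTAMP stands in for SYSDATE, the table is simply called `stt`, and the `column_defaults` dictionary is a hypothetical substitute for what f_dd reads from the data dictionary.

```python
import sqlite3

# Hypothetical per-column defaults, as the data dictionary would return them
# once the defaults have been set on the staging table.
column_defaults = {
    "KEY_ID": "0",
    "F1_COD": "'?'",
    "F2_NUM": "0",
    "F3_YMD": "11110101",
    "F4_DAT": "CURRENT_TIMESTAMP",
    "F5_COD": "NULL",
}

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stt (KEY_ID INTEGER, F1_COD TEXT, F2_NUM INTEGER, "
    "F3_YMD INTEGER, F4_DAT TEXT, F5_COD TEXT)"
)
conn.executemany(
    "INSERT INTO stt VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, None, None, None, None, None),
        (2, "CODE1", 250, 20140207, "2014-01-27 09:34:35", "CODE2"),
    ],
)

# One "col = COALESCE(col, default)" assignment per column, joined into one UPDATE.
assignments = ", ".join(
    f"{col} = COALESCE({col}, {expr})" for col, expr in column_defaults.items()
)
update_sql = f"UPDATE stt SET {assignments}"
conn.execute(update_sql)

rows = conn.execute("SELECT * FROM stt ORDER BY KEY_ID").fetchall()
for row in rows:
    print(row)
```

The first row ends up with '?', 0, 11110101 and a timestamp, while F5_COD stays NULL because its configured exception is NULL; the second row is left untouched.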
  18. Flow of the NULL management in a Data Warehouse
     [Flow diagram linking the configuration tables, the data source, the Staging Area table and the data dictionary (cols) through the numbered steps below.]
     • Legend: <prj> = project code, <sio> = source subsystem, <io> = source code
     • Steps:
       1. Configure the data source (<prj>_COM_MEF_IO_CFT)
       2. Configure the global default values (<prj>_COM_MEF_CFT)
       3. Configure the exception values for every field of the data source (<prj>_STA_<sio>_<io>_CXT)
       4. Load (simulate) the data source (<prj>_STA_<sio>_<io>_FXV)
       5. Set the default values for the Staging Area table: p_default (<io>)
       6. Load the Staging Area table: p_ins_stt (<io>)
       7. Update the Staging Area table with the default values: p_upd_stt (<io>)
     • Final content of <prj>_STA_<sio>_<io>_STT:

         <key_id> <f1_cod> <f2>_num <f3_ymd> <f4_dat>            <f5_cod>
         -------- -------- -------- -------- ------------------- --------
                1 ?               0 11110101 23/02/2014 11.21.30
                2 CODE1         250 20140207 27/01/2014 9.34.35  CODE2