Poster Paper
Proc. of Int. Conf. on Advances in Computer Science and Application 2013

An Overview on Data Quality Issues at Data Staging ETL
Nitin Anand1 and Manoj Kumar2
1 Research Scholar, Deptt of Computer Science, AIACT&R, New Delhi
Email: proudtobeanindiannitin@gmail.com
2 Associate Professor, Deptt of Computer Science, AIACT&R, New Delhi
Email: manojgaur@yahoo.com
Abstract - A data warehouse (DW) is a collection of technologies
aimed at enabling the decision maker to make better and
faster decisions. Data warehouses differ from operational
databases in that they are subject oriented, integrated, time
variant, non volatile, summarized, larger, not normalized, and
perform OLAP. The generic data warehouse architecture
consists of three layers (data sources, data staging area (DSA),
and primary data warehouse). During the ETL process, data is
extracted from OLTP databases, transformed to match the data
warehouse schema, and loaded into the data warehouse database.

Index Terms - Data Mart, Data Quality (DQ), Data Staging,
Data Warehouse, ETL, OLAP, OLTP.

I. INTRODUCTION
Extraction-Transformation-Loading (ETL) tools are pieces of
software responsible for the extraction of data from several
sources, their cleansing, customization and insertion into a
data warehouse. The quality of the information depends on three
things: (1) the quality of the data itself, (2) the quality of
the application programs and (3) the quality of the database
schema. ETL and data staging are considered to be the most
crucial stage of the data warehousing process, where most of the
data cleansing and scrubbing is done. There can be a myriad of
reasons at this stage which contribute to data quality problems.
To build a DW we must run the ETL tool, which has three tasks:
(1) data is extracted from different data sources, (2) propagated
to the data staging area where it is transformed and cleansed,
and then (3) loaded to the data warehouse. ETL tools are a
category of specialized tools with the task of dealing with data
warehouse homogeneity, cleaning, transforming, and loading
problems [1]. The preparation of data before their actual loading
in the warehouse for further querying is necessary due to quality
problems, incompatible schemata, and unnecessary parts of source
data not relevant for the purposes of the warehouse. The category
of tools responsible for this task is generally called
Extraction-Transformation-Loading (ETL) tools. The functionality
of these tools can be coarsely summarized in the following
prominent tasks:
1. The identification of relevant information at the source
side,
2. The extraction of this information,
3. The customization and integration of the information coming
from multiple sources [2],
4. The cleaning of the resulting data set on the basis of
database and business rules, and
5. The propagation of the data to the data warehouse and/or
data marts.

II. LITERATURE REVIEW
1. Amit Rudra and Emilie Yeo (1999), “Key Issues in Achieving
Data Quality and Consistency in Data Warehousing among
Large Organizations in Australia”.
2. Munoz, Lilia, Mazon, Jose-Norberto, Trujillo, Juan (2010),
“Systematic review and comparison of modeling ETL processes
in data warehouse”.
3. Jaideep Srivastava, “Warehouse Creation - A Potential
Roadblock to Data Warehousing”.
© 2013 ACEEE
DOI: 03.LSCS.2013.3.47

III. PHASES OF ETL
An ETL system consists of three consecutive functional
steps: extraction, transformation, and loading:
A. Extraction
The ETL Extraction step is responsible for extracting data
from the source systems. Each data source has its distinct
set of characteristics that need to be managed in order to
effectively extract data for the ETL process. The process
needs to effectively integrate systems that have different
platforms, such as different database management systems,
different operating systems, and different communications
protocols.
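
As an illustration of the integration problem described above, the following is a minimal sketch, not the authors' implementation. It assumes two hypothetical sources (an SQLite table named `customers` and a CSV export) and pulls both into one common record format tagged with the source name. The source names `crm_db` and `legacy_csv`, the table layout, and the function names are all assumptions made for the example.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterator, List

Record = Dict[str, str]


def extract_from_sqlite(conn: sqlite3.Connection, table: str) -> Iterator[Record]:
    """Yield every row of `table` as a dict of column name -> string value.

    `table` is interpolated into the SQL text, so it must come from trusted
    configuration, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    for row in cursor:
        yield {col: str(val) for col, val in zip(columns, row)}


def extract_from_csv(text: str) -> Iterator[Record]:
    """Yield every line of a CSV document (with header row) as a dict."""
    for row in csv.DictReader(io.StringIO(text)):
        yield dict(row)


def extract_all() -> List[Record]:
    """Extract from both hypothetical sources into one list of records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("INSERT INTO customers VALUES (1, 'Asha')")
    csv_text = "id,name\n2,Ravi\n"

    sources = (
        ("crm_db", extract_from_sqlite(conn, "customers")),
        ("legacy_csv", extract_from_csv(csv_text)),
    )
    records: List[Record] = []
    for source_name, rows in sources:
        for record in rows:
            record["source"] = source_name  # remember where the row came from
            records.append(record)
    conn.close()
    return records
```

Each source gets its own small extractor, while everything downstream sees only the common record format; this is one simple way to isolate platform differences at the extraction step.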
B. Transformation
The second step in any ETL scenario is data
transformation. The transformation step cleans and conforms
the incoming data to obtain accurate data which is correct,
complete, consistent, and unambiguous. This process includes
data cleaning, transformation, and integration. It defines the
granularity of fact tables, the dimension tables, the DW schema
(star or snowflake), derived facts, and slowly changing fact
tables and dimension tables. All transformation rules and the
resulting schemas are described in the metadata repository.
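
The cleaning and conforming described above can be sketched as follows. This is a hedged example under assumed rules, not a rule set taken from the paper: the field names (`name`, `country`, `order_date`), the country lookup table, and the two accepted date formats are hypothetical. Rows that cannot be made complete and unambiguous are set aside rather than loaded.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Row = Dict[str, str]

# Hypothetical conforming rules: map free-text country names to one code.
COUNTRY_CODES = {"india": "IN", "in": "IN", "united states": "US", "usa": "US", "us": "US"}
# Hypothetical set of date layouts seen in the sources.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(raw: str) -> Optional[str]:
    """Return the date in ISO format, or None if no known layout matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def transform(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Split rows into (clean, rejected) after cleaning and conforming."""
    clean: List[Row] = []
    rejected: List[Row] = []
    for row in rows:
        name = (row.get("name") or "").strip()
        country = COUNTRY_CODES.get((row.get("country") or "").strip().lower())
        order_date = parse_date(row.get("order_date") or "")
        if not name or country is None or order_date is None:
            rejected.append(row)  # incomplete or ambiguous: keep for review
            continue
        clean.append({"name": name.title(), "country": country, "order_date": order_date})
    return clean, rejected
```

Keeping the rejected rows, instead of silently dropping them, leaves a trail that can be reported back to the source owners.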


C. Loading
Loading data to the target multidimensional structure is
the final ETL step. In this step, extracted and transformed
data is written into the dimensional structures actually
accessed by the end users and applications. The loading step
includes loading both dimension tables and fact tables [4].

Fig. 1 ETL workflow as a directed graph [19]

Fig. 2 Different perspectives for an ETL workflow [16]

IV. A RATIONALE FOR THE TAXONOMY
An ETL workflow can be seen as a directed graph, as
shown in Fig. 1. The nodes of this graph are activities
and recordsets [17]. The edges of the graph are provider
relationships that combine activities and recordsets [3].
Following the common practice, we envisage ETL activities to be
combined in a workflow. Therefore, we do not assume that the
output of a certain activity will necessarily be directed
towards a recordset, but rather that the recipient of this data
can be either another activity or a recordset [16].
In Fig. 2 [10], we follow a multi-perspective approach that
enables us to separate these parameters and study them in a
principled approach [16]. We are mainly interested in the design
and administration parts of the lifecycle of the overall ETL
process, and we depict them at the upper and lower part of
Fig. 2, respectively. At the top of Fig. 2, we are mainly
concerned with the static design artifacts for a workflow
system. We will follow a traditional approach and group the
design artifacts into logical and physical, with each category
comprising its own perspective. We depict the logical
perspective on the left-hand side of Fig. 2, and the physical
perspective on the right-hand side. At the logical perspective,
we classify the design artifacts that give an abstract
description of the workflow environment. First, the designer
is responsible for defining an execution plan for the scenario.
The definition of an execution plan can be seen from various
perspectives. The execution sequence involves the specification
of which activity runs first, second, and so on, which
activities run in parallel, or when a semaphore is defined so
that several activities are synchronized at a rendezvous point.
ETL activities normally run in batch, so the designer needs to
specify an execution schedule, i.e., the time points or events
that trigger the execution of the scenario as a whole. Finally,
due to system crashes, it is imperative that there exists a
recovery plan, specifying the sequence of steps to be taken in
the case of failure for a certain activity (e.g., retry to
execute the activity, or undo any intermediate results produced
so far). On the right-hand side of Fig. 2, we can also see the
physical perspective, involving the registration of the actual
entities that exist in the real world. We will reuse the
terminology of [5] for the physical perspective. The resource
layer comprises the definition of roles (human or software) that
are responsible for executing the activities of the workflow.
The operational layer, at the same time, comprises the software
modules that implement the design.
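
The graph view and the execution plan can be made concrete with a small sketch. It is an illustration under stated assumptions, not the framework of [16]: the node names are hypothetical, recordsets are modelled simply as nodes without an action, the execution sequence is derived from the provider edges with Python's standard `graphlib` module (Python 3.9 or later), and the recovery plan is reduced to "retry the failed activity a fixed number of times".

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Callable, Dict, List, Set


def run_workflow(
    providers: Dict[str, Set[str]],
    activities: Dict[str, Callable[[], None]],
    max_retries: int = 1,
) -> List[str]:
    """Run activities in an order that respects the provider edges.

    `providers` maps each node to the set of nodes that feed it.
    Nodes without an entry in `activities` are treated as recordsets.
    """
    executed: List[str] = []
    for node in TopologicalSorter(providers).static_order():
        action = activities.get(node)
        if action is None:
            continue  # recordsets hold data; there is nothing to execute
        for attempt in range(max_retries + 1):
            try:
                action()
                break
            except RuntimeError:
                if attempt == max_retries:
                    raise  # recovery plan exhausted: surface the failure
        executed.append(node)
    return executed


def demo():
    """Hypothetical workflow: recordset -> three activities -> recordset."""
    attempts = {"clean_orders": 0}

    def clean_orders() -> None:
        attempts["clean_orders"] += 1
        if attempts["clean_orders"] == 1:
            raise RuntimeError("simulated crash on first attempt")

    providers = {
        "extract_orders": {"src_orders"},    # recordset feeds an activity
        "clean_orders": {"extract_orders"},  # activity feeds an activity
        "load_orders": {"clean_orders"},
        "dw_fact_orders": {"load_orders"},   # activity feeds a recordset
    }
    activities = {
        "extract_orders": lambda: None,
        "clean_orders": clean_orders,
        "load_orders": lambda: None,
    }
    return run_workflow(providers, activities), attempts
```

In this example `clean_orders` receives its input directly from another activity rather than from a recordset, which is the case the text explicitly allows. Scheduling and parallel execution are left out of the sketch.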

V. CAUSES OF DATA QUALITY ISSUES AT DATA STAGING ETL PHASE
One consideration is whether data cleansing is most
appropriate at the source system, during the ETL process, at
the staging database, or within the data warehouse [6] [7]. A
data cleaning process is executed in the data staging area in
order to improve the accuracy of the data warehouse. The
data staging area is the place where all ‘grooming’ is done on
data after it is culled from the source systems. The staging and
ETL phase is considered to be the most crucial stage of data
warehousing, where the maximum responsibility for data quality
efforts resides. Table I depicts some reasons for this [18].

TABLE I: DATA QUALITY ISSUES AT ETL

Sl no   CAUSES OF DATA QUALITY ISSUES AT DATA STAGING AND ETL PHASE
1       Lack of currency in business rules [8].
2       Lack of capturing only the changes in source files [9].
3       Disabling data integrity constraints in data staging tables causes wrong data and relationships to be extracted and hence causes data quality problems [10].
4       The inability to restart the ETL process from checkpoints without losing data [11].
5       Lack of internal profiling, or of integration with third-party data profiling and cleansing tools [11].
6       Lack of automatically generated rules for ETL tools to build mappings that detect and fix data defects [11].
7       Inability to integrate cleansing tasks into visual workflows and diagrams [11].
8       Inability of profiling, cleansing and ETL tools to exchange data and metadata [11].
9       Lack of proper functioning of the extraction logic for each source system (historical and incremental loads) causes data quality problems.
10      Lack of generation of data flow and data lineage documentation by the ETL process causes data quality problems.
11      Lack of error reporting, validation, and metadata updates in the ETL process causes data quality problems.
12      Inappropriate handling of rerun strategies during ETL causes data quality problems.
13      Failure of the transformation logic to consider business rules causes data quality problems.
14      Wrong impact analysis of change requests on ETL causes data quality problems.
15      The type of staging area, relational or non-relational, affects the data quality.
16      The inability to schedule extracts by time, interval, or event causes data quality problems.
17      Hand-coded ETL tools used for data warehousing lack a single logical metadata store, which leads to poor data quality.
VI. ALGORITHM FOR SIMULATION

The activity diagram (shown in Figure 3) depicts the
method of secure data extraction. The main procedure for
the secure data extraction process [12,13] is as follows.
// Identifying the sources and creating the source list.
// This is done by the methods of the Source Identifier class.
1. Identify the list of clients attached to the server
2. Find the type of the databases by pinging that client
3. Set the properties for the source
4. If it is a new source, add it to the data source list
// Establishing the connection and extracting data.
// This is done by the methods of the Wrapper class.
5. Check the type of the data source
6. Using appropriate drivers, establish the connection
7. Map the data source and data staging area schemas
8. Extract the data
// Loading of extracted data into the data staging area.
// The Integrator class does this.
9. Establish a connection with the data staging area
10. Install the data into the data staging area
// Modification / updating of the Data Staging Area (DSA).
// The Integrator updates the DSA with the help of the Monitor.
11. Identify the changes in the data sources and inform
the Integrator
12. Update the DSA
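
The twelve steps above can be sketched in code as follows. The four class names come from the procedure; everything else is an assumption made for illustration: the method names, the `Source` record, the use of in-memory SQLite as the only supported driver, the in-memory dictionary standing in for the DSA, and change detection by comparing the latest extract with the previous one. Client discovery by pinging (steps 1 and 2) and the security controls themselves are not modelled.

```python
import sqlite3
from dataclasses import dataclass
from typing import Dict, List, Tuple

Row = Dict[str, object]


@dataclass
class Source:
    name: str
    db_type: str                  # step 3: properties of the source
    conn: sqlite3.Connection
    table: str


class SourceIdentifier:
    """Steps 1-4: maintain the list of known data sources."""

    def __init__(self) -> None:
        self.sources: Dict[str, Source] = {}

    def register(self, source: Source) -> bool:
        if source.name in self.sources:
            return False          # already known, nothing to add
        self.sources[source.name] = source
        return True


class Wrapper:
    """Steps 5-8: check the source type, map the schemas, extract."""

    def extract(self, source: Source, column_map: Dict[str, str]) -> List[Row]:
        if source.db_type != "sqlite":
            raise ValueError(f"no driver for {source.db_type}")
        columns = ", ".join(column_map)   # source columns, from trusted config
        rows = source.conn.execute(f"SELECT {columns} FROM {source.table}").fetchall()
        targets = list(column_map.values())
        return [dict(zip(targets, row)) for row in rows]


class Integrator:
    """Steps 9, 10 and 12: install or update rows in the staging area."""

    def __init__(self) -> None:
        self.staging: Dict[Tuple[str, object], Row] = {}

    def install(self, source_name: str, rows: List[Row], key: str) -> None:
        for row in rows:
            self.staging[(source_name, row[key])] = row


class Monitor:
    """Step 11: report whether a source changed since the last extract."""

    def __init__(self) -> None:
        self._last: Dict[str, List[Row]] = {}

    def changed(self, source_name: str, rows: List[Row]) -> bool:
        previous = self._last.get(source_name)
        self._last[source_name] = rows
        return previous != rows


def demo() -> Dict[Tuple[str, object], Row]:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cust (id INTEGER, nm TEXT)")
    conn.execute("INSERT INTO cust VALUES (1, 'Asha')")
    identifier, wrapper, integrator, monitor = SourceIdentifier(), Wrapper(), Integrator(), Monitor()
    source = Source("branch_a", "sqlite", conn, "cust")
    identifier.register(source)
    mapping = {"id": "customer_id", "nm": "customer_name"}
    for _ in range(2):            # initial load, then one refresh cycle
        rows = wrapper.extract(source, mapping)
        if monitor.changed(source.name, rows):
            integrator.install(source.name, rows, key="customer_id")
        conn.execute("UPDATE cust SET nm = 'Asha Rao' WHERE id = 1")
    conn.close()
    return integrator.staging
```

Comparing full extracts is the simplest possible change detection and would not scale to large sources; it is used here only to keep the sketch short.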

Figure 3: Activity diagram for data extraction

VII. CONCLUSION AND FUTURE WORK
In this paper we have attempted to collect all possible
causes of data quality problems that may exist at all the phases
of data warehousing. In a recent study [14], the authors report
that due to the diversity and heterogeneity of data sources,
ETL is unlikely to become an open commodity market. This
paper describes the simulation model of secure data extraction
in ETL processes. This architecture gives us the flexibility
of adding various types of information sources, which
ultimately helps in storing the data into the Data Staging Area.
Since quality plays an important role in developing software
products, we have presented functional requirements along
with non-functional requirements, i.e., security requirements.
This approach is better compared to existing systems. In
[15], the authors report on their data warehouse population
system. The architecture of the system is discussed in the
paper, with particular interest (a) in a “shared data area”,
which is an in-memory area for data transformations, with a
specialized area for rapid access to lookup tables, and (b) the
pipelining of the ETL processes. Our classification of causes
will help data warehouse practitioners, implementers
and researchers to take care of these issues before moving
ahead with each phase of data warehousing. Each item
of the classification shown in Table I will be converted into an
item of a research instrument such as a questionnaire, and
will be empirically tested by collecting views about these
items from data warehouse practitioners.

REFERENCES
[1] Shilakes, C., Tylman, J., 1998. Enterprise Information Portals.
Enterprise Software Team. <http://www.sagemaker.com/company/downloads/eip/indepth.pdf>.
[2] J. Adzic and V. Fiore, “Data Warehouse Population Platform,”
Proc. Fifth Int’l Workshop Design and Management of Data
Warehouses, 2003.
[3] A. Simitsis, P. Vassiliadis, and T. Sellis, “Optimizing ETL
Processes in Data Warehouses,” Proc. 21st IEEE Int’l Conf.
Data Eng., pp. 564-575, 2005.
[4] Panos Vassiliadis, Alkis Simitsis, Panos Georgantas, Manolis
Terrovitis and Spiros Skiadopoulos, “A Generic and
customizable framework for the design of ETL scenarios”, 2002.
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski,
A.P. Barros. Workflow Patterns, BETA Working Paper Series,
WP 47, Eindhoven University of Technology, Eindhoven,
2000, available at the Workflow Patterns website, at
tmithttp://www.tm.tue.nl/research/patterns/documentation.htm
[6] Wayne W. E. (2004), “Data Quality and the Bottom Line:
Achieving Business Success through a Commitment to High
Quality Data”, The Data warehouse Institute (TDWI) report,
available at www.dw-institute.com.
[7] Ralph Kimball, The Data Warehouse ETL Toolkit, Wiley
India (P) Ltd (2004).
[8] Amit Rudra and Emilie Yeo (1999), “Key Issues in Achieving
Data Quality and Consistency in Data Warehousing among
Large Organizations in Australia”, Proceedings of the 32nd
Hawaii International Conference on System Sciences, 1999.
[9] Arkady Maydanchik (2007), “Causes of Data Quality
Problems”, Data Quality Assessment, Technics Publications
LLC. Available at
http://media.techtarget.com/searchDataManagement/downloads/Data_Quality_Assessment_-_Chapter_1.pdf
[10] Won Kim et al (2002), “A Taxonomy of Dirty Data”, Kluwer
Academic Publishers, 2002.
[11] Wayne Eckerson & Colin White (2003), “Evaluating ETL and
Data Integratio [...] Vassiliadis, A generic and customizable
framework for the design of ETL scenarios, 2002.
[17] Panos Vassiliadis, A Taxonomy of ETL activities, 2009.
[18] Ranjit Singh, Dr Kanwaljeet, A descriptive classification of
causes of Data Quality problems in Data Warehouse, 2010.
[19] A. Simitsis. Mapping Conceptual to Logical Models for ETL
Processes. DOLAP 05, ACM, (2002) 67-76.

 
Full Stack Web Development Course for Beginners
Full Stack Web Development Course  for BeginnersFull Stack Web Development Course  for Beginners
Full Stack Web Development Course for Beginners
 
ROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint PresentationROOT CAUSE ANALYSIS PowerPoint Presentation
ROOT CAUSE ANALYSIS PowerPoint Presentation
 
OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...OS-operating systems- ch04 (Threads) ...
OS-operating systems- ch04 (Threads) ...
 
Employee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptxEmployee wellbeing at the workplace.pptx
Employee wellbeing at the workplace.pptx
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17Difference Between Search & Browse Methods in Odoo 17
Difference Between Search & Browse Methods in Odoo 17
 
Solving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptxSolving Puzzles Benefits Everyone (English).pptx
Solving Puzzles Benefits Everyone (English).pptx
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPTECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
ECONOMIC CONTEXT - LONG FORM TV DRAMA - PPT
 
Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17Field Attribute Index Feature in Odoo 17
Field Attribute Index Feature in Odoo 17
 
Judging the Relevance and worth of ideas part 2.pptx
Judging the Relevance  and worth of ideas part 2.pptxJudging the Relevance  and worth of ideas part 2.pptx
Judging the Relevance and worth of ideas part 2.pptx
 
Romantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptxRomantic Opera MUSIC FOR GRADE NINE pptx
Romantic Opera MUSIC FOR GRADE NINE pptx
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"Rapple "Scholarly Communications and the Sustainable Development Goals"
Rapple "Scholarly Communications and the Sustainable Development Goals"
 

An Overview on Data Quality Issues at Data Staging ETL

Poster Paper
Proc. of Int. Conf. on Advances in Computer Science and Application 2013
© 2013 ACEEE
DOI: 03.LSCS.2013.3.47

An Overview on Data Quality Issues at Data Staging ETL

Nitin Anand 1 and Manoj Kumar 2
1 Research Scholar, Deptt of Computer Science, AIACT&R, New Delhi
Email: proudtobeanindiannitin@gmail.com
2 Associate Professor, Deptt of Computer Science, AIACT&R, New Delhi
Email: manojgaur@yahoo.com

Abstract - A data warehouse (DW) is a collection of technologies aimed at enabling the decision maker to make better and faster decisions. Data warehouses differ from operational databases in that they are subject oriented, integrated, time variant, non volatile, summarized, larger, not normalized, and perform OLAP. The generic data warehouse architecture consists of three layers (data sources, DSA, and primary data warehouse). During the ETL process, data is extracted from OLTP databases, transformed to match the data warehouse schema, and loaded into the data warehouse database.

Index Terms - Data Mart, Data Quality (DQ), Data Staging, Data Warehouse, ETL, OLAP, OLTP.

I. INTRODUCTION

Extraction-Transformation-Loading (ETL) tools are pieces of software responsible for the extraction of data from several sources, their cleansing, customization and insertion into a data warehouse. The quality of the information depends on three things: (1) the quality of the data itself, (2) the quality of the application programs and (3) the quality of the database schema. ETL and data staging are considered to be a more crucial stage of the data warehousing process, where most of the data cleansing and scrubbing is done. A myriad of reasons at this stage can contribute to data quality problems.

To build a DW we must run the ETL tool, which has three tasks: (1) data is extracted from different data sources, (2) propagated to the data staging area where it is transformed and cleansed, and then (3) loaded to the data warehouse. ETL tools are a category of specialized tools with the task of dealing with data warehouse homogeneity, cleaning, transforming, and loading problems [1]. The preparation of data before their actual loading in the warehouse for further querying is necessary due to quality problems, incompatible schemata, and parts of the source data that are not relevant for the purposes of the warehouse. The functionality of these tools can be coarsely summarized in the following prominent tasks:
1. The identification of relevant information at the source side,
2. The extraction of this information,
3. The customization and integration of the information coming from multiple sources [2],
4. The cleaning of the resulting data set on the basis of database and business rules, and
5. The propagation of the data to the data warehouse and/or data marts.

II. LITERATURE REVIEW

1. Amit Rudra and Emilie Yeo (1999), "Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organizations in Australia".
2. Munoz, Lilia, Mazon, Jose-Norberto, Trujillo, Juan, 2010. Systematic review and comparison of modeling ETL processes in data warehouse.
3. Jaideep Srivastava, Warehouse Creation - A Potential Roadblock to Data Warehousing.

III. PHASES OF ETL

An ETL system consists of three consecutive functional steps: extraction, transformation, and loading.

A. Extraction
The ETL extraction step is responsible for extracting data from the source systems. Each data source has its distinct set of characteristics that need to be managed in order to effectively extract data for the ETL process.
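One way to manage the distinct characteristics of each source during extraction can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the `Source` descriptor, the two extractor helpers, and the sample table and file contents are all hypothetical, and an in-memory SQLite database and a CSV string stand in for real source systems.

```python
import csv
import io
import sqlite3
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

Row = Dict[str, str]


@dataclass
class Source:
    """Hypothetical descriptor for one data source and how to read it."""
    name: str
    kind: str                              # e.g. "sqlite" or "csv"
    extract: Callable[[], Iterator[Row]]   # source-specific extraction logic


def sqlite_extractor(conn: sqlite3.Connection, table: str) -> Callable[[], Iterator[Row]]:
    """Build an extractor that reads every row of a relational table."""
    def run() -> Iterator[Row]:
        cur = conn.execute(f"SELECT * FROM {table}")
        cols = [d[0] for d in cur.description]
        for rec in cur:
            yield {c: str(v) for c, v in zip(cols, rec)}
    return run


def csv_extractor(text: str) -> Callable[[], Iterator[Row]]:
    """Build an extractor that reads rows from CSV text with a header line."""
    def run() -> Iterator[Row]:
        yield from csv.DictReader(io.StringIO(text))
    return run


def extract_all(sources: List[Source]) -> List[Row]:
    """Pull rows from every source into one uniform list, tagged by origin."""
    staged: List[Row] = []
    for src in sources:
        for row in src.extract():
            staged.append({"_source": src.name, **row})
    return staged


# Assumed sample data: one relational source and one flat-file source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Asha')")

sources = [
    Source("crm_db", "sqlite", sqlite_extractor(conn, "customers")),
    Source("legacy_file", "csv", csv_extractor("id,name\n2,Ravi\n")),
]
staged_rows = extract_all(sources)
```

Keeping the source-specific logic behind one `extract` callable means a new kind of source can be added without touching the code that fills the staging area.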
The process needs to effectively integrate systems that have different platforms, such as different database management systems, different operating systems, and different communications protocols.

B. Transformation
The second step in any ETL scenario is data transformation. The transformation step applies cleaning and conforming to the incoming data in order to obtain accurate data which is correct, complete, consistent, and unambiguous. This process includes data cleaning, transformation, and integration. It defines the granularity of fact tables, the dimension tables, the DW schema (star or snowflake), derived facts, and slowly changing fact tables and dimension tables. All transformation rules and the resulting schemas are described in the metadata repository.

C. Loading
Loading data to the target multidimensional structure is the final ETL step. In this step, extracted and transformed data is written into the dimensional structures actually
accessed by the end users and applications. These loading steps include both loading dimension tables and loading fact tables [4].

IV. A RATIONALE FOR THE TAXONOMY

An ETL workflow can be seen as a directed graph, as shown in Figure 1. The nodes of this graph are activities and recordsets [17]. The edges of the graph are provider relationships that combine activities and recordsets [3]. Following the common practice, we envisage ETL activities to be combined in a workflow. Therefore, we do not assume that the output of a certain activity will necessarily be directed towards a recordset, but rather that the recipient of this data can be either another activity or a recordset [16].

Fig. 1 ETL workflow as a directed graph [19]

In Figure 2 [10], we follow a multi-perspective approach that enables us to separate these parameters and study them in a principled approach [16]. We are mainly interested in the design and administration parts of the lifecycle of the overall ETL process, and we depict them at the upper and lower part of Fig. 2, respectively. At the top of Fig. 2, we are mainly concerned with the static design artifacts for a workflow system. We will follow a traditional approach and group the design artifacts into logical and physical, with each category comprising its own perspective. We depict the logical perspective on the left-hand side of Fig. 2, and the physical perspective on the right-hand side.

At the logical perspective, we classify the design artifacts that give an abstract description of the workflow environment. First, the designer is responsible for defining an execution plan for the scenario. The definition of an execution plan can be seen from various perspectives. The execution sequence involves the specification of which activity runs first, second, and so on, which activities run in parallel, or when a semaphore is defined so that several activities are synchronized at a rendezvous point. ETL activities normally run in batch, so the designer needs to specify an execution schedule, i.e., the time points or events that trigger the execution of the scenario as a whole. Finally, due to system crashes, it is imperative that there exists a recovery plan, specifying the sequence of steps to be taken in the case of failure for a certain activity (e.g., retry to execute the activity, or undo any intermediate results produced so far).

On the right-hand side of Fig. 2, we can also see the physical perspective, involving the registration of the actual entities that exist in the real world. We will reuse the terminology of [5] for the physical perspective. The resource layer comprises the definition of roles (human or software) that are responsible for executing the activities of the workflow. The operational layer, at the same time, comprises the software modules that implement the design.

Fig. 2 Different perspectives for an ETL workflow [16]

V. CAUSES OF DATA QUALITY ISSUES AT DATA STAGING ETL PHASE

One consideration is whether data cleansing is most appropriate at the source system, during the ETL process, at the staging database, or within the data warehouse [6] [7]. A data cleaning process is executed in the data staging area in order to improve the accuracy of the data warehouse. The data staging area is the place where all 'grooming' is done on data after it is culled from the source systems. The staging and ETL phase is considered to be the most crucial stage of data warehousing, where the maximum responsibility for data quality efforts resides. Table 1 depicts some reasons for this [18].

TABLE I: DATA QUALITY ISSUES AT ETL

Sl no   CAUSES OF DATA QUALITY ISSUES AT DATA STAGING AND ETL PHASE
1       Business rules lacking currency cause problems [8].
2       Lack of capturing only changes in source files [9].
3       Disabling data integrity constraints in data staging tables causes wrong data and relationships to be extracted and hence causes data quality problems [10].
4       The inability to restart the ETL process from checkpoints without losing data [11].
5       Lack of internal profiling or of integration with third-party data profiling and cleansing tools [11].
6       Lack of automatic generation of rules for ETL tools to build mappings that detect and fix data defects [11].
7       Inability to integrate cleansing tasks into visual workflows and diagrams [11].
8       Inability to enable profiling, cleansing and ETL tools to exchange data and metadata [11].
9       Lack of proper functioning of the extraction logic for each source system (historical and incremental loads) causes data quality problems.
10      Lack of generation of data flow and data lineage documentation by the ETL process causes data quality problems.
11      Lack of error reporting, validation, and metadata updates in the ETL process causes data quality problems.
12      Inappropriate handling of rerun strategies during ETL causes data quality problems.
13      Lack of consideration of business rules by the transformation logic causes data quality problems.
14      Wrong impact analysis of change requests on ETL causes data quality problems.
15      The type of staging area, relational or non-relational, affects the data quality.
16      The inability to schedule extracts by time, interval, or event causes data quality problems.
17      Hand-coded ETL tools used for data warehousing fail to generate a single logical metadata store, which leads to poor data quality.

VI. ALGORITHM FOR SIMULATION

The activity diagram (shown in Figure 3) depicts the method of secure data extraction. The main procedure for the secure data extraction process [12,13] is as follows.

// Identifying the sources and creating the source list. This is done by the methods of the Source Identifier class.
1. Identify the list of clients attached to the server
2. Find the type of the databases by pinging that client
3. Set the properties for the source
4. If it is a new source, add it to the data source list
// Establishing the connection and extracting data. This is done by methods of the Wrapper class.
5. Check the type of the data source
6. Using appropriate drivers, establish the connection
7. Map the data source and data staging area schemas
8. Extract the data
// Loading of extracted data into the data staging area. The Integrator class does this.
9. Establish connection with the data staging area
10. Install the data into the data staging area
// Modification / updation of the Data Staging Area (DSA). The Integrator updates the DSA with the help of the Monitor.
11. Identify the changes in the data sources and inform the Integrator
12. Update DSA

VII. CONCLUSION AND FUTURE WORK

In this paper we have attempted to collect all possible causes of data quality problems that may exist at all the phases of data warehouse. In a recent study [14], the authors report that due to the diversity and heterogeneity of data sources, ETL is unlikely to become an open commodity market. This paper describes the simulation model of Secure Data Extraction in ETL processes. This architecture gives us the flexibility of adding various types of information sources, which ultimately helps in storing the data into the Data Staging Area. Since quality plays an important role in developing software products, I have presented functional requirements along with non-functional requirements, i.e., security requirements. This approach is better as compared to existing systems. In [15], the authors report on their data warehouse population system.
The architecture of the system is discussed in the paper, with particular interest (a) in a "shared data area", which is an in-memory area for data transformations, with a specialized area for rapid access to lookup tables, and (b) the pipelining of the ETL processes. Our classification of causes will help data warehouse practitioners, implementers and researchers in taking care of these issues before moving ahead with each phase of data warehousing. Each item of the classification shown in the table will be converted into an item of a research instrument such as a questionnaire and will be empirically tested by collecting views about these items from data warehouse practitioners.

Figure 3: Activity diagram for data extraction

REFERENCES

[1] Shilakes, C., Tylman, J., 1998. Enterprise Information Portals. Enterprise Software Team. <http://www.sagemaker.com/company/downloads/eip/indepth.pdf>.
[2] J. Adzic and V. Fiore, "Data Warehouse Population Platform," Proc. Fifth Int'l Workshop Design and Management of Data Warehouses, 2003.
[3] A. Simitsis, P. Vassiliadis, and T. Sellis, "Optimizing ETL Processes in Data Warehouses," Proc. 21st IEEE Int'l Conf. Data Eng., pp. 564-575, 2005.
[4] Panos Vassiliadis, Alkis Simitsis, Panos Georgantas, Manolis Terrovitis and Spiros Skiadopoulos, "A Generic and customizable framework for the design of ETL scenarios", 2002.
[5] W.M.P. van der Aalst, A.H.M. ter Hofstede, B. Kiepuszewski, A.P. Barros. Workflow Patterns, BETA Working Paper Series, WP 47, Eindhoven University of Technology, Eindhoven, 2000, available at the Workflow Patterns website, at http://tmitwww.tm.tue.nl/research/patterns/documentation.htm
[6] Wayne W. E. (2004), "Data Quality and the Bottom Line: Achieving Business Success through a Commitment to High Quality Data", The Data Warehouse Institute (TDWI) report, available at www.dw-institute.com.
[7] Ralph Kimball, The Data Warehouse ETL Toolkit, Wiley India (P) Ltd (2004).
[8] Amit Rudra and Emilie Yeo (1999), "Key Issues in Achieving Data Quality and Consistency in Data Warehousing among Large Organizations in Australia", Proceedings of the 32nd Hawaii International Conference on System Sciences, 1999.
[9] Arkady Maydanchik (2007), "Causes of Data Quality Problems", Data Quality Assessment, Technics Publications LLC. Available at http://media.techtarget.com/searchDataManagement/downloads/Data_Quality_Assessment_-_Chapter_1.pdf
[10] Won Kim et al (2002), "A Taxonomy of Dirty Data", Kluwer Academic Publishers, 2002.
[11] Wayne Eckerson & Colin White (2003), "Evaluating ETL and Data Integratio [...]
[...] Vassiliadis, A generic and customizable framework for the design of ETL scenarios, 2002.
[17] Panos Vassiliadis, A Taxonomy of ETL activities, 2009.
[18] Ranjit Singh, Dr Kanwaljeet, A descriptive classification of causes of Data Quality problems in Data Warehouse, 2010.
[19] A. Simitsis. Mapping Conceptual to Logical Models for ETL Processes. DOLAP 05, ACM, (2002) 67-76.
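The twelve-step extraction procedure of Section VI can be sketched roughly as below. This is a simplified illustration under stated assumptions rather than the simulation model reported in the paper: the class names follow the roles named in the text (Source Identifier, Wrapper, Integrator, Monitor), but their methods, the `CLIENTS` dictionary standing in for clients attached to a server, and the `staging` table layout are all hypothetical. Pinging and driver-based connections are simulated by dictionary lookups, and the security aspects are not modelled.

```python
import sqlite3
from typing import Dict, List, Set, Tuple

SourceRow = Tuple[int, str]
StagedRow = Tuple[str, int, str]

# Hypothetical stand-in for clients attached to the server: name -> (type, rows).
CLIENTS: Dict[str, Tuple[str, List[SourceRow]]] = {
    "branch_a": ("sqlite", [(1, "Asha"), (2, "Ravi")]),
    "branch_b": ("csv", [(3, "Meena")]),
}


class SourceIdentifier:
    """Steps 1-4: discover clients and maintain the data source list."""
    def __init__(self) -> None:
        self.sources: Dict[str, Dict[str, str]] = {}

    def identify(self, clients: Dict[str, Tuple[str, List[SourceRow]]]) -> None:
        for name, (db_type, _rows) in clients.items():   # steps 1-2 (ping simulated)
            if name not in self.sources:                 # step 4: only new sources
                self.sources[name] = {"type": db_type}   # step 3: set properties


class Wrapper:
    """Steps 5-8: check the source type, connect, map the schema, extract."""
    SUPPORTED = ("sqlite", "csv")

    def extract(self, name: str, props: Dict[str, str],
                clients: Dict[str, Tuple[str, List[SourceRow]]]) -> List[StagedRow]:
        if props["type"] not in self.SUPPORTED:          # step 5
            raise ValueError(f"no driver for source type {props['type']!r}")
        _type, rows = clients[name]                      # step 6 (connection simulated)
        return [(name, rid, val) for rid, val in rows]   # steps 7-8: staging schema


class Integrator:
    """Steps 9-10 and 12: load rows into the data staging area (DSA)."""
    def __init__(self) -> None:
        self.dsa = sqlite3.connect(":memory:")           # step 9
        self.dsa.execute(
            "CREATE TABLE staging (source TEXT, id INTEGER, name TEXT, "
            "PRIMARY KEY (source, id))"
        )

    def load(self, rows: List[StagedRow]) -> None:
        self.dsa.executemany("INSERT OR REPLACE INTO staging VALUES (?, ?, ?)", rows)
        self.dsa.commit()

    def count(self) -> int:
        return self.dsa.execute("SELECT COUNT(*) FROM staging").fetchone()[0]


class Monitor:
    """Step 11: report only rows not seen in earlier extractions."""
    def __init__(self) -> None:
        self.seen: Set[StagedRow] = set()

    def changes(self, rows: List[StagedRow]) -> List[StagedRow]:
        new = [r for r in rows if r not in self.seen]
        self.seen.update(rows)
        return new


identifier, wrapper = SourceIdentifier(), Wrapper()
integrator, monitor = Integrator(), Monitor()
identifier.identify(CLIENTS)


def refresh() -> int:
    """Run extraction for every known source and load only the changes."""
    loaded = 0
    for name, props in identifier.sources.items():
        delta = monitor.changes(wrapper.extract(name, props, CLIENTS))
        integrator.load(delta)                           # steps 10 and 12
        loaded += len(delta)
    return loaded


first_load = refresh()                        # initial load of all rows
CLIENTS["branch_b"][1].append((4, "Kiran"))   # a source changes afterwards
second_load = refresh()                       # only the new row is propagated
```

Loading only the rows reported by the monitor corresponds to capturing only changes in the sources, which Table I lists among the causes of data quality problems when it is missing.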