See discussions, stats, and author profiles for this
publication at: http://www.researchgate.net/publication/251116903
On the Development of a User-Defined Quality
Measurement Tool for XML Documents
Eric Pardede and Tejasvi Gaur
Abstract The capability of eXtensible Markup Language (XML) for data
representation has been widely accepted by research communities and industries.
Even though it can be used for efficient data transfer, many industries look for a
more promising language on which to rely when it comes to their important data.
An ability to provide good XML data quality is necessary to make this data format
more reliable and usable. To measure data quality, the current methods are largely
driven by structural and technical factors and often assess data quality impartially,
not accounting for contextual factors. It is well known that different data share
common quality features: completeness, validity, accuracy and timeliness.
Nevertheless, the measurement of quality features will be unique, based on the data
format. The measurement of quality for XML documents cannot be generalised
from quality measurement in other data formats. In this chapter, we describe the
development of a user-defined quality metric for XML documents. For
implementation, we develop a tool that enables users to control XML data quality.
We use a case study in health informatics as the proof of concept.
Keywords Data quality · XML document · User-defined quality tool
1 Introduction
XML (eXtensible Markup Language) primarily facilitates the sharing of
semistructured data across different information systems, particularly via the
internet, such as passing data from server to client, machine to machine and
application to application. XML is derived from SGML (Standard Generalized
Markup Language) with the aim of performing web functions similar to those of HTML.
Compared to HTML, it gives users more choice and freedom to develop their own
tags without worrying about the web browser’s compatibility. In broader terms, XML is simpler,
more flexible and more extendable.
E. Pardede (B)
Department of Computer Science and Computer Engineering, La Trobe University, Melbourne,
VIC 3086, Australia e-mail: e.pardede@latrobe.edu.au
W.W. Song et al. (eds.), Information Systems Development, 213
DOI 10.1007/978-1-4419-7355-9_18, © Springer Science+Business Media, LLC 2011
In the past decade, the use of XML as a data format has exceeded its use as a
markup language. Many domain-specific standards are now structured in the XML
format due to its extendable and self-describing nature. It is no surprise that a large
volume of XML documents is created and transmitted over the internet every day.
Some of the data contain important and sensitive content and therefore, the data
quality has to be ensured. Data quality describes the relationship between the data
and the actual phenomena they portray, and the degree of excellence of this
relationship. In simple terms, if the data is in the correct format for the purpose for
which it is required, then it has high data quality.
Much of the existing work [2, 4, 6] has investigated data quality dimensions in
various domains and data models. These studies agree on certain features: completeness,
accuracy, validity and timeliness. In this work, we discuss how these dimensions
remain applicable to measuring the quality of XML documents. Based on these
dimensions, we also implement a user-defined quality measurement tool. This
tool can be used to assist decision-making for business processes that use XML
documents as their data format.
1.1 Roadmap
Following the introduction, in Section 2, we briefly discuss the related work. We
describe our solution in Section 3 and its implementation in Section 4. A case study
is provided in Section 5 and we conclude the chapter in Section 6.
2 Related Work
We found only a limited amount of research on XML data quality. The majority of
the data quality literature discusses various aspects of data quality in the traditional
relational format. The communities have agreed on the most fundamental aspects of
data quality, and we argue that these aspects are also applicable to the XML data
format. The only difference is the way these aspects are measured, due to the different
structure of XML data compared to traditional relational data.
Previous work [2, 4, 6] has listed four data quality dimensions:
• Completeness (C) is the extent to which data content is present.
• Accuracy (A) is the extent to which data is free from errors.
• Validity (V) is the extent to which data items conform to their corresponding
value domains.
• Timeliness (T) is the extent to which data is recent and up to date.
Each dimension is used to measure the Quality (Q) of the data. Quality is the
consolidated effect of all the above characteristics.
Table 1 XML data quality dimensions
Dimension: Completeness
Problem: Incomplete data due to a missing important value.
Example:
<Medical>
<Record>
<Name></Name>
<.........>******</......>
</Record>
</Medical>
Dimension: Accuracy
Problem: Mismatched tags create errors.
Example:
<Medical>
<Record>
<Name>John</Names>
<.........>******</......>
</Record>
</Medical>
Dimension: Validity
Problem: The value does not describe the content accurately. For example, no unit measurement, etc.
Example:
<Medical>
<Record>
<Name>John</Name>
<.........>*******</......>
<Height>170</Height>
</Record>
</Medical>
Dimension: Timeliness
Problem: Updated value is not incorporated correctly into the data. For example, the value of the account has not been updated.
Example:
<Medical>
<Record>
<Name>John</Name>
<.........>*******</......>
<Height>170 cm</Height>
<Account>(-) $170</Account>
</Record>
</Medical>
A similar measure can also be applied to the XML data format (see Table 1). For
example, a completeness dimension has the same meaning wherever it is used, yet
can have different contexts and representation. How we measure complete data in a
relational format will be different to the way it is measured in a tree structure format.
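As an illustration of how two of the problems in Table 1 might be detected in a tree-structured document, the following is a minimal Python sketch. It is not the chapter's Java prototype: the function name find_problems and the rule that an empty leaf element counts as a completeness problem are assumptions made for this example. A mismatched tag is reported through the parser, since such a document is not well-formed.

```python
import xml.etree.ElementTree as ET


def find_problems(xml_text):
    """Return a list of (dimension, detail) pairs for problems found in xml_text.

    Hypothetical helper: it only covers the completeness and accuracy
    examples of Table 1.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        # Mismatched tags make the document not well-formed, so no tree
        # can be built; report this as an accuracy problem.
        return [("accuracy", str(err))]

    problems = []
    for elem in root.iter():
        # A leaf element without text is treated as a missing value.
        if len(elem) == 0 and not (elem.text or "").strip():
            problems.append(("completeness", elem.tag))
    return problems


incomplete = "<Medical><Record><Name></Name></Record></Medical>"
mismatched = "<Medical><Record><Name>John</Names></Record></Medical>"

print(find_problems(incomplete))  # [('completeness', 'Name')]
print(find_problems(mismatched))  # one ('accuracy', <parser message>) entry
```

In a relational table the same completeness check would be a test for NULL in a column, which shows how the dimension stays the same while its measurement depends on the data format.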
Table 2 summarises the existing work in the area of data quality. This work was
applied to different domains/applications, used different data formats and applied
different approaches to determine data quality. Each is unique, but all the solutions
are based on the same data quality dimensions. For each work, the table lists its
application, its approach and the data format used.
Table 2 Existing works
Applications | Approach | Database/data format
Decision support systems [9] | Visualisation | Relational
Case-based reasoning systems [3] | Goal-question metrics | Heterogeneous
e-Business [5] | Case study on online processing | Relational
Web services [7] | Query based | HTML, XML
Data warehouse [1] | Empirical database | Relational
Health care [8] | Model driven | Relational
Only a small amount of research measures the quality of web services, which
were naturally built using XML representation. However, this work cannot be used
for a quality measurement tool for XML databases and XML applications.
3 User-Defined Quality Approach
In this section, we propose a solution to measure XML data quality. The solution
includes a proposed metric and an algorithm to incorporate the metric into XML
documents.
A user-defined metric enables users to determine the quality features that a set of
XML documents has to follow. In this metric, every element is given a
weight, which can be varied according to the user’s needs. In addition to the
weight, a user should be able to provide additional properties for checking the XML data,
for example, an option to specify the preferred unit for an element.
The following metric formula uses all this information to measure the quality
factor of XML documents:
\[
\mathrm{Quality} = \frac{\displaystyle\sum_{\text{records}} \frac{\sum_{N(vt)} \mathrm{weight}(vt)}{\sum_{N(t)} \mathrm{weight}(t)}}{r} \times 100\%
\]
where r is the number of records in the document; N(vt) is the number of valid tags
in the record; N(t) is the number of tags in the record; weight(vt) is the user-defined
weight for a valid tag; and weight(t) is the user-defined weight for a tag.
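One reading of the metric, in which each record contributes the weighted share of its valid tags and the document quality is the mean over the r records, can be sketched in Python as follows. This is an interpretation of the definitions above and not the authors' code; the function name document_quality, the input layout and the default weight of 1.0 are assumptions for illustration.

```python
def document_quality(records, weights, default_weight=1.0):
    """Return the document quality as a percentage.

    records: list of records, each a list of (tag, is_valid) pairs.
    weights: dict mapping a tag name to its user-defined weight.
    Tags without a user-defined weight fall back to default_weight
    (an assumption; the chapter only says default values are used).
    """
    if not records:
        return 0.0
    total = 0.0
    for record in records:
        all_weight = sum(weights.get(tag, default_weight) for tag, _ in record)
        valid_weight = sum(
            weights.get(tag, default_weight) for tag, ok in record if ok
        )
        # Weighted share of valid tags in this record.
        total += valid_weight / all_weight if all_weight else 0.0
    # Mean over the r records, expressed as a percentage.
    return 100.0 * total / len(records)


weights = {"NAME": 2.0, "AGE": 1.0, "WEIGHT": 1.0}
records = [
    [("NAME", True), ("AGE", True), ("WEIGHT", False)],   # 3/4 of the weight is valid
    [("NAME", True), ("AGE", False), ("WEIGHT", False)],  # 2/4 of the weight is valid
]
print(document_quality(records, weights))  # 62.5
```

Because the weights appear in both the numerator and the denominator, raising the weight of one tag increases its influence on the score without pushing the result beyond 100%.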
We apply our quality metric in Algorithm 1. This algorithm takes an XML file
as an input. At first, it checks the document for all the starting and ending tags. Once
all are found, the system concatenates the XML document and stores it in an array
in the form of a text file.
The data quality checking procedure then starts, and the system checks the
document against the user-defined metric attributes and their respective units (Lines
1-12 to 1-19). For each valid metric attribute, it adds the respective weight to the
total weight of the document. After the complete document is checked, all the
values are entered into the final data quality metric and the document’s data quality
is calculated.
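The checking pass described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not Algorithm 1 itself: the prototype is written in Java and reads its metric attributes from a database, whereas here the METRIC dictionary, the function name check_document, the default weight of 1.0 and the rule "a value is valid if it is non-empty and ends with the required unit" are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical user-defined metric attributes: tag -> (required unit or None, weight).
METRIC = {"Name": (None, 2.0), "Height": ("cm", 1.0)}

SAMPLE = (
    "<Medical>"
    "<Record><Name>John</Name><Height>170</Height></Record>"
    "<Record><Name>John</Name><Height>170 cm</Height></Record>"
    "</Medical>"
)


def check_document(xml_text, metric):
    """Return (quality percentage, list of error messages) for xml_text."""
    root = ET.fromstring(xml_text)  # raises ParseError if tags do not match
    errors = []
    record_scores = []
    for record in root:  # each child of the root is treated as one record
        valid_weight = 0.0
        total_weight = 0.0
        for elem in record:
            unit, weight = metric.get(elem.tag, (None, 1.0))
            total_weight += weight
            value = (elem.text or "").strip()
            if not value:
                errors.append(f"{elem.tag}: empty value")
            elif unit and not value.endswith(unit):
                errors.append(f"{elem.tag}: expected unit '{unit}' in '{value}'")
            else:
                # Valid attribute: add its weight to the running total.
                valid_weight += weight
        if total_weight:
            record_scores.append(valid_weight / total_weight)
    if not record_scores:
        return 0.0, errors
    return 100.0 * sum(record_scores) / len(record_scores), errors


quality, errors = check_document(SAMPLE, METRIC)
print(f"{quality:.1f}%")  # 83.3%
print(errors)             # ["Height: expected unit 'cm' in '170'"]
```

The first record loses the weight of its Height element because the unit is missing, while the second record is fully valid, so the document score is the mean of 2/3 and 1.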
4 Data Quality Checking Tool Implementation
We design a tool for users to define their own quality criteria for their XML
documents. The development of the tool follows the diagram in Fig. 1.
Fig. 1 Design model of data quality checking tool (components shown: Webpage, XML File Types, JAVA Program, Database, Output, Error log)
The prototype is a Java-based program which connects to a MySQL
database and generates two outputs: (i) output.txt, which contains the breakdown of
the XML document in a well-structured form after the program reads the values
from the tags and (ii) error.log, which contains all the details of the XML document
that affect its quality. To populate the MySQL database, we use a web interface.
The interface is used to manage the XML documents’ properties and the
quality factors. The summary of the implementation setup is shown in Table 3.
Our web interface takes the user-defined values for each attribute/element in the
XML document (Fig. 2). The user has the freedom to delete and change the entered
metric attributes. The attribute measurement units and their weights can be left
blank, in which case, default values are used.
Table 3 Implementation setup
Languages used | XML, Java, PHP, HTML
Database used | MySQL database
Input file types | XML files
Output file types | Text files, command line outputs
Drivers used | MySQL-connector-Java-5.1.6-bin.jar
Server used | WAMP server (only for local development and testing purposes)
Fig. 2 XML data quality tool web interface
At the current stage of the implementation, the prototype can handle a reasonably
large database with textual content. For a typical XML data set, it can handle
up to 100 MB of data without significant performance problems. It should be noted
that this prototype has been developed using modest hardware resources (Core
2 Duo 2.0 GHz processor, with 2 GB RAM). For full industrial application with
larger data sets, more powerful hardware should be used.
5 Case Study
We apply the developed tool in a real case study using health informatics data. The
health informatics sector, like many others, is experiencing a large growth in
incoming data, due to the increased number of requirements for which the data is
used.
The increase in available data has also increased the need to maintain data quality.
Fig. 3 User-defined metric attribute check
In the case of health informatics or medical data, data quality is even more
important than efficiency and speed, as the nature of this data is critical and must be
precise. For example, the correct storage of records for patients’ blood types is
essential and can be a matter of life or death in an emergency situation.
Below is a sample XML document that contains information on a patient. Using
our quality tool, a user can identify the metric attributes that have to be checked, as
shown in Fig. 3. The user enters all the properties along with their weights. If no
unit is given, then default values are used.
<MEDICAL>
<RECORD>
<PATIENTNAME>Michael</PATIENTNAME>
<AGE>20yrs</AGE>
<MOBILE>0433384056</MOBILE>
<ADDRESS>24 the Fairway, Greensborough - 3334</ADDRESS>
<EMERGENCY>Steve-0433765673</EMERGENCY>
<WEIGHT>65KG</WEIGHT>
<HEIGHT>180cms</HEIGHT>
<BLOODGROUP>B+</BLOODGROUP>
<BLOODPRESSURE>120-65mmHg</BLOODPRESSURE>
<HEARTCONDITIONS>normal</HEARTCONDITIONS>
<REACTIONS>N/A</REACTIONS>
<ALLERGIES>DUST ALLERGIC</ALLERGIES>
</RECORD>
</MEDICAL>
In this case, the users want to employ the following properties for patient data:
(i) all the lines are complete and no blank lines are present; (ii) all the fields are
complete and no blank fields are present; (iii) all the given units are present with the
respective attributes; (iv) age has ‘yrs’ as the measurement unit; (v) mobile is
represented in ten digits; (vi) weight has ‘kgs’ as the measurement unit; (vii) blood
pressure has its measuring units; (viii) for attributes which do not have measurement
units provided, default values are present; and (ix) all the attributes have given
weights.
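Several of these properties can be expressed as patterns over element values. The sketch below, in Python, is an assumption-laden illustration and not the tool's actual rule format: the RULES table, the function name failed_fields and the exact regular expressions for properties (iv) to (vii) are hypothetical, and the sample is a shortened copy of the patient record above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical patterns for properties (iv)-(vii); how the tool matches
# units is not specified in the chapter.
RULES = {
    "AGE": re.compile(r"\d+\s*yrs"),
    "MOBILE": re.compile(r"\d{10}"),
    "WEIGHT": re.compile(r"\d+(\.\d+)?\s*kgs", re.IGNORECASE),
    "BLOODPRESSURE": re.compile(r"\d+-\d+\s*mmHg"),
}

SAMPLE = """<MEDICAL><RECORD>
<PATIENTNAME>Michael</PATIENTNAME>
<AGE>20yrs</AGE>
<MOBILE>0433384056</MOBILE>
<WEIGHT>65KG</WEIGHT>
<BLOODPRESSURE>120-65mmHg</BLOODPRESSURE>
</RECORD></MEDICAL>"""


def failed_fields(xml_text, rules):
    """Return the tags whose values are blank or do not match their rule."""
    failures = []
    for record in ET.fromstring(xml_text):
        for elem in record:
            value = (elem.text or "").strip()
            rule = rules.get(elem.tag)
            # Property (ii): no blank fields; otherwise apply the tag's pattern.
            if not value or (rule and not rule.fullmatch(value)):
                failures.append(elem.tag)
    return failures


print(failed_fields(SAMPLE, RULES))  # ['WEIGHT']
```

Under this strict reading, the sample value 65KG is flagged because the required unit is ‘kgs’; whether a tool should accept such near-matches is exactly the kind of decision a user-defined metric leaves to the user.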
Figure 4 shows an outcome of a sample XML document quality measurement. In
this case, the document only rates 76% against all the quality factors defined by a
user. Parts of the XML documents that do not fit the quality factors are logged for
future analysis.
Fig. 4 Measurement outcome sample
Based on the outcome, the users will be able to determine whether the XML
documents have met the quality criteria and therefore, can be used for further
processing or analysis. The users can also define a different set of quality factors
depending on the source or the further use of the XML documents.
In health informatics, different hospitals or clinics might have different facilities
and different practices for recording their data. If we want to integrate the data, such
as for a decision support system, the measurement tool can be used for screening, in
the data preparation stage.
While the tool enables user-defined quality metrics, it also introduces a subjectivity
problem. Questions such as who should decide on the important attributes
and their weights should be settled by organisational policy. This chapter aims to
provide a tool, which in most cases cannot be used in isolation, without a clear procedure on
who should use it and how it is applied to assist the business process.
6 Conclusion and Future Work
Due to the increasing volume of XML data used for various applications, database
users need a tool to manage the quality of the data. Data quality has been
widely researched over many years, and a set of properties such as completeness,
accuracy, validity and timeliness has been established as the accepted data quality
dimensions.
Unfortunately, judging data quality using these features can be a tedious task and
to the best of our knowledge, there is no tool for measuring data quality, especially
for XML data. The need for such a tool has become of the utmost importance since
XML data, by its nature, will be used for data sharing and integration and therefore,
quality has to be maintained and screened carefully.
In this chapter, we apply quality dimensions to the XML data format and
implement them in a quality measurement tool. The quality dimensions are not static
and we provide user-defined input features to define the quality attributes and their
weights. For proof of concept, we provide a case study using health informatics data
and test the quality measurement using our quality measurement tool.
For future work, we will incorporate more sophisticated business quality factors
into the tool. We will also perform further scalability evaluation of our data quality
tool, especially where it has to measure the quality of a large batch of XML
documents, such as an XML warehouse. In addition, a user-defined quality
formula can be included in our quality tool.
References
1. Ballou, D. P., and Tayi, G. K. (1999) Enhancing data quality in data warehouse
environments, Communications of the ACM 42(1): 73–78.
2. Batini, C., and Scannapieco, M. (2006) Data Quality: Concepts, Methodologies and
Techniques. Springer, Berlin.
3. Bierer, A. (2007) Methodological assistance for integrating data quality evaluations into
case-based reasoning systems, Proceedings of the 7th International Conference on Case-Based
Reasoning (ICCBR 2007), Belfast, Northern Ireland, UK, pp. 254–268.
4. Even, A., and Shankaranarayanan, G. (2007) Utility-driven configuration of data quality in
data repositories, IJIQ 1(1): 22–40.
5. Paulson, L. D. (2000) Data quality: A rising e-business concern, IEEE IT Professionals 2(4):
10–14.
6. Serrano, M. A., Calero, C., and Piattini, M. (2005) Metrics for data warehouse quality, In:
Khosrow-Pour, M. (Ed.) Encyclopedia of Information Science and Technology IV, Idea Group,
Hershey, PA, pp. 1938–1844.
7. Shankaranarayanan, G., and Cai, Y. (2005) A web services application for the data
quality management in the B2B networked environment, Proceedings of the 38th Hawaii
International Conference on System Sciences (HICSS 2005), Hawaii, USA, pp. 166.
8. Welzer, T., Brumen, V., Golob, I., and Druzovec, M. (2002) Medical diagnostic and data
quality, Proceedings of the IEEE Symposium on Computer-Based Medical Systems (CBMS
2002), Maribor, Slovenia, pp. 97–101.
9. Zhu, B., Shankar, G., and Cai, Y. (2007) Integrating data quality data into decision-making
process: An information visualization approach, Proceedings of the 12th International
Conference HCI International (HCII 2007) Part I, Beijing, China, pp. 366–369.

More Related Content

What's hot

Database Design
Database DesignDatabase Design
Database Designlearnt
 
Introduction of Database Design and Development
Introduction of Database Design and DevelopmentIntroduction of Database Design and Development
Introduction of Database Design and DevelopmentEr. Nawaraj Bhandari
 
Database application and design
Database application and designDatabase application and design
Database application and designsieedah
 
Three tier Architecture of ASP_Net
Three tier Architecture of ASP_NetThree tier Architecture of ASP_Net
Three tier Architecture of ASP_NetBiswadip Goswami
 
What is data model? And types.
What is data model? And types.What is data model? And types.
What is data model? And types.774477
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural LanguageIRJET Journal
 
Chapter12 designing databases
Chapter12 designing databasesChapter12 designing databases
Chapter12 designing databasesDhani Ahmad
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )SBGC
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalAmjad Ali
 
28 16094 31623-1-sm efficiency (edit ari)new2
28 16094 31623-1-sm efficiency (edit ari)new228 16094 31623-1-sm efficiency (edit ari)new2
28 16094 31623-1-sm efficiency (edit ari)new2IAESIJEECS
 
A survey of top k query processing techniques in relational database systems
A survey of top k query processing techniques in relational database systemsA survey of top k query processing techniques in relational database systems
A survey of top k query processing techniques in relational database systemsunyil96
 
03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septianIAESIJEECS
 
The three level of data modeling
The three level of data modelingThe three level of data modeling
The three level of data modelingsharmila_yusof
 
Formal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityFormal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityThomas Lee
 

What's hot (20)

Database Design
Database DesignDatabase Design
Database Design
 
Introduction of Database Design and Development
Introduction of Database Design and DevelopmentIntroduction of Database Design and Development
Introduction of Database Design and Development
 
Database application and design
Database application and designDatabase application and design
Database application and design
 
Three tier Architecture of ASP_Net
Three tier Architecture of ASP_NetThree tier Architecture of ASP_Net
Three tier Architecture of ASP_Net
 
What is data model? And types.
What is data model? And types.What is data model? And types.
What is data model? And types.
 
D.dsgn + dbms
D.dsgn + dbmsD.dsgn + dbms
D.dsgn + dbms
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
 
Dbms
DbmsDbms
Dbms
 
Chapter12 designing databases
Chapter12 designing databasesChapter12 designing databases
Chapter12 designing databases
 
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
2017 IEEE Projects 2017 For Cse ( Trichy, Chennai )
 
Development of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrievalDevelopment of a new indexing technique for XML document retrieval
Development of a new indexing technique for XML document retrieval
 
28 16094 31623-1-sm efficiency (edit ari)new2
28 16094 31623-1-sm efficiency (edit ari)new228 16094 31623-1-sm efficiency (edit ari)new2
28 16094 31623-1-sm efficiency (edit ari)new2
 
SQL query Demo
SQL query DemoSQL query Demo
SQL query Demo
 
Introduction to XML
Introduction to XMLIntroduction to XML
Introduction to XML
 
A survey of top k query processing techniques in relational database systems
A survey of top k query processing techniques in relational database systemsA survey of top k query processing techniques in relational database systems
A survey of top k query processing techniques in relational database systems
 
Data models
Data modelsData models
Data models
 
03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian03 fauzi indonesian 9456 11nov17 edit septian
03 fauzi indonesian 9456 11nov17 edit septian
 
The three level of data modeling
The three level of data modelingThe three level of data modeling
The three level of data modeling
 
Rdbms
RdbmsRdbms
Rdbms
 
Formal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data InteroperabilityFormal Models and Algorithms for XML Data Interoperability
Formal Models and Algorithms for XML Data Interoperability
 

Viewers also liked

Candidate-Pack copy
Candidate-Pack copyCandidate-Pack copy
Candidate-Pack copyconnor welch
 
Достопримечатильности Еревана
Достопримечатильности ЕреванаДостопримечатильности Еревана
Достопримечатильности ЕреванаLeon-Avetyan
 
الأماكن الجذابة في مالزيا
الأماكن الجذابة في مالزياالأماكن الجذابة في مالزيا
الأماكن الجذابة في مالزياAmalina Razali
 
ปฏิทินรายเดือน
ปฏิทินรายเดือนปฏิทินรายเดือน
ปฏิทินรายเดือนKatesuda Fon
 
medicinal properties of some herbal plants
medicinal properties of some herbal plantsmedicinal properties of some herbal plants
medicinal properties of some herbal plantsGnanabhaskar Danaboina
 
CO2 EOR Pilot Project
CO2 EOR Pilot ProjectCO2 EOR Pilot Project
CO2 EOR Pilot ProjectAhmed Nour
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัวKatesuda Fon
 

Viewers also liked (11)

Deepak Kumar CV
Deepak Kumar CVDeepak Kumar CV
Deepak Kumar CV
 
Candidate-Pack copy
Candidate-Pack copyCandidate-Pack copy
Candidate-Pack copy
 
Client Brochure
Client BrochureClient Brochure
Client Brochure
 
Достопримечатильности Еревана
Достопримечатильности ЕреванаДостопримечатильности Еревана
Достопримечатильности Еревана
 
antidepressants and anxiolytics
antidepressants and anxiolyticsantidepressants and anxiolytics
antidepressants and anxiolytics
 
الأماكن الجذابة في مالزيا
الأماكن الجذابة في مالزياالأماكن الجذابة في مالزيا
الأماكن الجذابة في مالزيا
 
ปฏิทินรายเดือน
ปฏิทินรายเดือนปฏิทินรายเดือน
ปฏิทินรายเดือน
 
medicinal properties of some herbal plants
medicinal properties of some herbal plantsmedicinal properties of some herbal plants
medicinal properties of some herbal plants
 
CO2 EOR Pilot Project
CO2 EOR Pilot ProjectCO2 EOR Pilot Project
CO2 EOR Pilot Project
 
ประวัติส่วนตัว
ประวัติส่วนตัวประวัติส่วนตัว
ประวัติส่วนตัว
 
Neutraceuticals;functional foods
Neutraceuticals;functional foods Neutraceuticals;functional foods
Neutraceuticals;functional foods
 

Similar to TejGaurThesis

ICT-DBA-level4
ICT-DBA-level4ICT-DBA-level4
ICT-DBA-level4Infotech27
 
Testing XML
Testing XML Testing XML
Testing XML mehramit
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET Journal
 
Text Document Classification System
Text Document Classification SystemText Document Classification System
Text Document Classification SystemIRJET Journal
 
A Survey on Heterogeneous Data Exchange using Xml
A Survey on Heterogeneous Data Exchange using XmlA Survey on Heterogeneous Data Exchange using Xml
A Survey on Heterogeneous Data Exchange using XmlIRJET Journal
 
Performance Analysis of Leading Application Lifecycle Management Systems for...
Performance Analysis of Leading Application Lifecycle  Management Systems for...Performance Analysis of Leading Application Lifecycle  Management Systems for...
Performance Analysis of Leading Application Lifecycle Management Systems for...Daniel van den Hoven
 
Parsing of xml file to make secure transaction in mobile commerce
Parsing of xml file to make secure transaction in mobile commerceParsing of xml file to make secure transaction in mobile commerce
Parsing of xml file to make secure transaction in mobile commerceijcsa
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionIRJET Journal
 
IT6801-Service Oriented Architecture
IT6801-Service Oriented ArchitectureIT6801-Service Oriented Architecture
IT6801-Service Oriented ArchitectureMadhu Amarnath
 
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Jerry SILVER
 
A novel approach towards developing a statistical dependent and rank
A novel approach towards developing a statistical dependent and rankA novel approach towards developing a statistical dependent and rank
A novel approach towards developing a statistical dependent and rankIAEME Publication
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptxWidsoulDevil
 
IRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence AreaIRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence AreaIRJET Journal
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt064ChetanWani
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure DataIRJET Journal
 

Similar to TejGaurThesis (20)

5010
50105010
5010
 
ICT-DBA-level4
ICT-DBA-level4ICT-DBA-level4
ICT-DBA-level4
 
Testing XML
Testing XML Testing XML
Testing XML
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 
Text Document Classification System
Text Document Classification SystemText Document Classification System
Text Document Classification System
 
Unit 3 WEB TECHNOLOGIES
Unit 3 WEB TECHNOLOGIES Unit 3 WEB TECHNOLOGIES
Unit 3 WEB TECHNOLOGIES
 
A Survey on Heterogeneous Data Exchange using Xml
A Survey on Heterogeneous Data Exchange using XmlA Survey on Heterogeneous Data Exchange using Xml
A Survey on Heterogeneous Data Exchange using Xml
 
Performance Analysis of Leading Application Lifecycle Management Systems for...
Performance Analysis of Leading Application Lifecycle  Management Systems for...Performance Analysis of Leading Application Lifecycle  Management Systems for...
Performance Analysis of Leading Application Lifecycle Management Systems for...
 
XML Unit 01
XML Unit 01XML Unit 01
XML Unit 01
 
Parsing of xml file to make secure transaction in mobile commerce
Parsing of xml file to make secure transaction in mobile commerceParsing of xml file to make secure transaction in mobile commerce
Parsing of xml file to make secure transaction in mobile commerce
 
Automated Essay Grading using Features Selection
Automated Essay Grading using Features SelectionAutomated Essay Grading using Features Selection
Automated Essay Grading using Features Selection
 
IT6801-Service Oriented Architecture
IT6801-Service Oriented ArchitectureIT6801-Service Oriented Architecture
IT6801-Service Oriented Architecture
 
UNIT-1 Web services
UNIT-1 Web servicesUNIT-1 Web services
UNIT-1 Web services
 
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
 
WorkExamples
WorkExamplesWorkExamples
WorkExamples
 
A novel approach towards developing a statistical dependent and rank
A novel approach towards developing a statistical dependent and rankA novel approach towards developing a statistical dependent and rank
A novel approach towards developing a statistical dependent and rank
 
Data Science Process.pptx
Data Science Process.pptxData Science Process.pptx
Data Science Process.pptx
 
IRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence AreaIRJET- Testing Improvement in Business Intelligence Area
IRJET- Testing Improvement in Business Intelligence Area
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt
 
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
 

TejGaurThesis

  • 1. See discussions, stats, and author profiles for this publication at: http://www.researchgate.net/publication/251116903 On the Development of a User-Defined Quality Measurement Tool for XML Documents Eric Pardede and Tejasvi Gaur Abstract The capability of eXtensible Markup Language (XML) for data representation has been widely accepted by research communities and industries. Even though it can be used for efficient data transfer, many industries look for a more promising language on which to rely when it comes to their important data. An ability to provide good XML data quality is necessary to make this data format more reliable and usable. To measure data quality, the current methods are largely driven by structural and technical factors and often assess data quality impartially, not accounting for contextual factors. It is well known that different data share common quality features: completeness, validity, accuracy and timeliness. Nevertheless, the measurement of quality features will be unique, based on the data format. The measurement of quality for XML documents cannot be generalised from quality measurement in other data formats. In this chapter, we describe the development of a user-defined quality metric for XML documents. For implementation, we develop a tool that enables users to control XML data quality. We use a case study in health informatics as the proof of concept. Keywords Data quality · XML document · User-defined quality tool 1 Introduction XML (eXtensible Markup Language) primarily facilitates the sharing of semistructured data across different information systems, particularly via the internet, such as passing data from server to client, machine to machine and application to application. XML is an extraction from SGML (Standard Generalized Markup Language) with the aim of performing similar web functions as HTML. Compared to HTML, it gives users more choice and freedom to develop their own tags without
  • 2. worrying about the web browser’s compatibility. In broader terms, XML is simpler, more flexible and more extendable.

    E. Pardede (✉), Department of Computer Science and Computer Engineering, La Trobe University, Melbourne, VIC 3086, Australia, e-mail: e.pardede@latrobe.edu.au
    W.W. Song et al. (eds.), Information Systems Development, 213, DOI 10.1007/978-1-4419-7355-9_18, © Springer Science+Business Media, LLC 2011

    In the past decade, the use of XML as a data format has exceeded its use as a markup language. Many domain-specific standards are now structured in the XML format due to its extendable and self-describing nature. It is no surprise that a large volume of XML documents is created and transmitted over the internet every day. Some of the data contain important and sensitive content and, therefore, the data quality has to be ensured.

    Data quality describes the relationship between the data and the portrayal of actual phenomena and the degree of excellence of this relationship. In simple terms, if the data is in the correct format for the purpose for which it is required, then it has high data quality. Much of the existing work [2, 4, 6] has investigated data quality dimensions in various domains and data models. These studies agree on certain features: completeness, accuracy, validity and timeliness.

    In this work, we discuss how these dimensions are still applicable to measure the quality of XML documents. Based on these dimensions, we also implement a user-defined quality measurement tool. This tool can be used to assist decision-making for business processes that use XML documents as their data format.

    1.1 Roadmap
    Following the introduction, in Section 2, we briefly discuss the related work. We describe our solution in Section 3 and its implementation in Section 4. A case study is provided in Section 5 and we conclude the chapter in Section 6.

    2 Related Work
    We found there was a limited amount of research on XML data quality. The majority of the data quality literature instead discusses various aspects of data quality in the traditional relational format. The communities have agreed on the most fundamental aspects of data quality, and we argue that these aspects are also applicable to the XML data format. The only difference is the way these aspects are measured, owing to the different structure of XML data compared to traditional relational data. Previous work [2, 4, 6] has listed four data quality dimensions:

    • Completeness (C) is the extent to which data content is present.
  • 3.
    • Accuracy (A) is the extent to which data is free from errors.
    • Validity (V) is the extent to which data items conform to their corresponding value domains.
    • Timeliness (T) is the extent to which data is recent and up to date.

    Each dimension is used to measure the Quality (Q) of the data. It is the consolidated effect of all the above characteristics.

    Table 1 XML data quality dimensions

    Dimension: Completeness
    Problem: Incomplete data due to a missing important value.
    Example:
        <Medical>
          <Record>
            <Name></Name>
            <.........>******</......>
          </Record>
        </Medical>

    Dimension: Accuracy
    Problem: Mismatched tags create errors.
    Example:
        <Medical>
          <Record>
            <Name>John</Names>
            <.........>******</......>
          </Record>
        </Medical>

    Dimension: Validity
    Problem: The value does not describe the content accurately. For example, no unit measurement, etc.
    Example:
        <Medical>
          <Record>
            <Name>John</Name>
            <.........>*******</......>
            <Height>170</Height>
          </Record>
        </Medical>

    Dimension: Timeliness
    Problem: Updated value is not incorporated correctly into the data. For example, the value of the account has not been updated.
    Example:
        <Medical>
          <Record>
            <Name>John</Name>
            <.........>*******</......>
            <Height>170 cm</Height>
            <Account>(-) $170</Account>
          </Record>
        </Medical>

    A similar measure can also be applied to the XML data format (see Table 1). For example, a completeness dimension has the same meaning wherever it is used, yet
  • 4. can have different contexts and representations. How we measure complete data in a relational format will be different from the way it is measured in a tree-structured format.

    Table 2 summarises the existing work in the area of data quality. This work was applied to different domains and applications, used different data formats and applied different approaches to determine data quality. Each is unique, but all the solutions are based on the same data quality dimensions. The table lists the applications, their respective approaches and the data formats used.

    Table 2 Existing works

    Applications                        Approach                           Database/data format
    Decision support systems [9]        Visualisation                      Relational
    Case-based reasoning systems [3]    Goal-question metrics              Heterogeneous
    e-Business [5]                      Case study on online processing    Relational
    Web services [7]                    Query based                        HTML, XML
    Data warehouse [1]                  Empirical database                 Relational
    Health care [8]                     Model driven                       Relational

    Only a small amount of research measures the quality of web services, which were naturally built using XML representation. However, this work cannot be used for a quality measurement tool for XML databases and XML applications.

    3 User-Defined Quality Approach
    In this section, we propose a solution to measure XML data quality. The solution includes a proposed metric and an algorithm to incorporate the metric into XML documents. A user-defined metric enables users to determine the quality features that a set of XML documents has to follow. In this metric, every element is given a weight, which can vary according to the user’s needs. In addition to the weight, a user should be able to provide additional properties to check the XML data, for example, an option to provide preferred units for an element. The following metric formula uses all this information to measure the quality factor of XML documents:

    \[
    \text{Quality} = \frac{\sum_{i=1}^{r} \left( \dfrac{\sum_{N(vt)} \text{weight}(vt)}{\sum_{N(t)} \text{weight}(t)} \right)_{i}}{r} \times 100\%
    \]
  • 5. where r is the number of records in the document; N(vt) is the number of valid tags in the record; N(t) is the number of tags in the record; weight(vt) is the user-defined weight for a valid tag; and weight(t) is the user-defined weight for a tag.

    We apply our quality metric in Algorithm 1. This algorithm takes an XML file as input. First, it checks the document for all the starting and ending tags. Once all are found, the system concatenates the XML document and stores it in an array in the form of a text file. The data quality checking procedure then starts, and the system checks the document against the user-defined metric attributes and their respective units (Line 1-12 to 1-19). For each valid metric attribute, it adds the respective weight to the total weight of the document. After the complete document is checked, all the values are entered into the final data quality metric and the document’s data quality is calculated.

    4 Data Quality Checking Tool Implementation
    We design a tool for users to define their own quality criteria for their XML documents. The development of the tool follows the diagram in Fig. 1.

    Fig. 1 Design model of data quality checking tool (diagram labels: Webpage, XML File Types, JAVA Program, Database, Output, Error log)
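The checking procedure described for Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' Java implementation. It assumes one reading of the metric: for each record, the summed weights of the valid tags are divided by the summed weights of all tags, and the result is averaged over the r records and expressed as a percentage. The function name `document_quality`, the `is_valid` callback, the default weight of 1.0 and the sample weights are all hypothetical.

```python
import xml.etree.ElementTree as ET


def document_quality(xml_text, weights, is_valid, default_weight=1.0):
    """Score a document from 0 to 100 under the assumed reading of the metric.

    Each child of the root is treated as one record and each child of a
    record as one tag (an assumption about the document layout).
    """
    # Parsing fails with ET.ParseError when start and end tags do not match.
    root = ET.fromstring(xml_text)
    ratios = []
    for record in root:
        valid_weight = total_weight = 0.0
        for tag in record:
            weight = weights.get(tag.tag, default_weight)
            total_weight += weight
            if is_valid(tag):
                valid_weight += weight
        # A record without tags contributes a ratio of zero.
        ratios.append(valid_weight / total_weight if total_weight else 0.0)
    return 100.0 * sum(ratios) / len(ratios) if ratios else 0.0


def non_empty(tag):
    """Hypothetical validity rule: the tag must carry a non-blank value."""
    return bool((tag.text or "").strip())


sample = (
    "<Medical>"
    "<Record><Name>John</Name><Height>170 cm</Height></Record>"
    "<Record><Name></Name><Height>180 cm</Height></Record>"
    "</Medical>"
)
score = document_quality(sample, {"Name": 2.0, "Height": 1.0}, non_empty)
print(f"quality = {score:.1f}%")
```

In this made-up sample, the second record has an empty Name, which carries two of the record's three weight units, so that record scores one third and the document averages to about 66.7%. A document with mismatched tags, such as the Accuracy example in Table 1, is rejected at the parsing step before any weights are counted.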
  • 6. The prototype program is a Java-based program which connects to a MySQL database and generates two outputs: (i) output.txt, which contains the breakdown of the XML document in a well-structured form after the program reads the values from the tags and (ii) error.log, which contains all the details of the XML document that affect its quality. To populate the MySQL database, we use a web interface. The interface is used to manage the XML documents’ properties and the quality factors. The summary of the implementation setup is shown in Table 3.

    Our web interface takes the user-defined values for each attribute/element in the XML document (Fig. 2). The user has the freedom to delete and change the entered metric attributes. The attribute measurement units and their weights can be left blank, in which case default values are used.

    Table 3 Implementation setup

    Languages used       XML, Java, PHP, HTML
    Database used        MySQL database
    Input file types     XML files
    Output file types    Text files, command line outputs
    Drivers used         MySQL-connector-Java-5.1.6-bin.jar
    Server used          WAMP server (only for local development and testing purposes)

    Fig. 2 XML data quality tool web interface

    At the current stage of the implementation, the prototype can handle a reasonably large database with textual content. For a typical XML data set, it can handle up to 100 MB of data without significant performance problems. It is necessary to
  • 7. realise that this prototype has been developed using modest hardware resources (Core 2 Duo 2.0 GHz processor, with 2 GB RAM). For full industrial application with a larger set of data, more powerful hardware should be used.

    5 Case Study
    We apply the developed tool in a real case study using health informatics data. The health informatics sector, like many others, is experiencing a large growth in incoming data, due to the increased number of requirements for which the data is used. The increase in available data has also increased the need to maintain data quality.

    Fig. 3 User-defined metric attribute check

    In the case of health informatics or medical data, data quality is even more important than efficiency and speed, as the nature of this data is critical and must be precise. For example, the correct storage of records for patients’ blood types is essential and can be a matter of life or death in an emergency situation. Below is a sample XML document that contains information on a patient. Using our quality tool, a user can identify the metric attributes that have to be checked, as shown in Fig. 3. The user enters all the properties along with their weights. If no unit is given, then default values are used.

    <MEDICAL>
      <RECORD>
        <PATIENTNAME>Michael</PATIENTNAME>
        <AGE>20yrs</AGE>
        <MOBILE>0433384056</MOBILE>
        <ADDRESS>24 the Fairway, Greensborough - 3334</ADDRESS>
  • 8.
        <EMERGENCY>Steve-0433765673</EMERGENCY>
        <WEIGHT>65KG</WEIGHT>
        <HEIGHT>180cms</HEIGHT>
        <BLOODGROUP>B+</BLOODGROUP>
        <BLOODPRESSURE>120-65mmHg</BLOODPRESSURE>
        <HEARTCONDITIONS>normal</HEARTCONDITIONS>
        <REACTIONS>N/A</REACTIONS>
        <ALLERGIES>DUST ALLERGIC</ALLERGIES>
      </RECORD>
    </MEDICAL>

    In this case, the users want to enforce the following properties for patient data:
    (i) all the lines are complete and no blank lines are present;
    (ii) all the fields are complete and no blank fields are present;
    (iii) all the given units are present with the respective attributes;
    (iv) age has ‘yrs’ as the measurement unit;
    (v) mobile is represented in ten digits;
    (vi) weight has ‘kgs’ as the measurement unit;
    (vii) blood pressure has its measuring units;
    (viii) for attributes which do not have measurement units provided, default values are present; and
    (ix) all the attributes have given weights.

    Figure 4 shows an outcome of a sample XML document quality measurement. In this case, the document only rates 76% against all the quality factors defined by a user. Parts of the XML document that do not fit the quality factors are logged for future analysis.

    Fig. 4 Measurement outcome sample

    Based on the outcome, the users will be able to determine whether the XML documents have met the quality criteria and can therefore be used for further processing or analysis. The users can also define a different set of quality factors depending on the source or the further use of the XML documents. In health informatics, different hospitals or clinics might have different facilities and different practices for recording their data. If we want to integrate the data, such as for a decision support system, the measurement tool can be used for screening in the data preparation stage.
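A few of the case-study properties, namely (ii), (iv), (v), (vi) and (vii), can be expressed as per-field patterns. The following Python sketch is illustrative only: the rule table, the regular expressions and the weights are hypothetical choices, not the values entered in the authors' web interface, and the sample is a subset of the patient record. It therefore does not reproduce the 76% reported in Fig. 4.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical rule table: tag -> (regex the value must fully match, weight).
RULES = {
    "AGE": (r"\d+\s*yrs", 1.0),                # (iv) age in 'yrs'
    "MOBILE": (r"\d{10}", 2.0),                # (v) ten digits
    "WEIGHT": (r"\d+(\.\d+)?\s*kgs", 1.0),     # (vi) weight in 'kgs'
    "BLOODPRESSURE": (r"\d+-\d+\s*mmHg", 3.0), # (vii) unit present
}
# (ii) fields without a specific rule only need a non-blank value.
DEFAULT_RULE = (r".*\S.*", 1.0)


def check_record(record):
    """Return (valid_weight, total_weight, errors) for one record element."""
    valid = total = 0.0
    errors = []
    for field in record:
        pattern, weight = RULES.get(field.tag, DEFAULT_RULE)
        total += weight
        text = (field.text or "").strip()
        if re.fullmatch(pattern, text):
            valid += weight
        else:
            # Collected messages play the role of the tool's error log.
            errors.append(f"{field.tag}: {text!r} does not match {pattern!r}")
    return valid, total, errors


SAMPLE = """<MEDICAL><RECORD>
<PATIENTNAME>Michael</PATIENTNAME><AGE>20yrs</AGE>
<MOBILE>0433384056</MOBILE><WEIGHT>65KG</WEIGHT>
<BLOODPRESSURE>120-65mmHg</BLOODPRESSURE>
</RECORD></MEDICAL>"""

record = ET.fromstring(SAMPLE).find("RECORD")
valid, total, errors = check_record(record)
print(f"quality = {100 * valid / total:.1f}%")
for line in errors:
    print(line)
```

With these made-up weights, only WEIGHT fails, because the record stores `65KG` rather than a value ending in `kgs`, so 7 of 8 weight units are valid. Changing a weight or a pattern changes the score, which is the subjectivity that a user-defined metric brings with it.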
  • 9. While the tool enables a user-defined quality metric, it also introduces a subjectivity problem. Questions such as who should decide on the important attributes and their weights should be settled by organisational policy. This chapter aims to provide a tool, which in most cases cannot be used in isolation, without a clear procedure on who should use it and how it is applied to assist the business process.

    6 Conclusion and Future Work
    Due to the increasing volume of XML data used for various applications, database users need a tool to manage the quality of the data. Defining data quality has been widely researched over many years, and a set of properties such as completeness, accuracy, validity and timeliness has been established as the confirmed data quality dimensions. Unfortunately, judging data quality using these features can be a tedious task and, to the best of our knowledge, there is no tool for measuring data quality, especially for XML data. The need for such a tool has become of the utmost importance since XML data, by its nature, will be used for data sharing and integration and, therefore, quality has to be maintained and screened carefully.

    In this chapter, we apply quality dimensions to the XML data format and implement them in a quality measurement tool. The quality dimensions are not static, and we provide user-defined input features to define the quality attributes and their weights. For proof of concept, we provide a case study using health informatics data and test the quality measurement using our quality measurement tool.

    For future work, we will incorporate more sophisticated business quality factors into the tool. We will also perform further scalability evaluation of our data quality tool, especially when it has to measure the quality of a large batch of XML documents, such as an XML warehouse. In addition, a user-defined quality formula can be included in our quality tool.
    References
    1. Ballou, D. P., and Tayi, G. K. (1999) Enhancing data quality in data warehouse environments, Communications of the ACM 42(1): 73–78.
    2. Batini, C., and Scannapieco, M. (2006) Data Quality: Concepts, Methodologies and Techniques. Springer, Berlin.
    3. Bierer, A. (2007) Methodological assistance for integrating data quality evaluations into case-based reasoning systems, Proceedings of the 7th International Conference on Case-Based Reasoning (ICCBR 2007), Belfast, Northern Ireland, UK, pp. 254–268.
    4. Even, A., and Shankaranarayanan, G. (2007) Utility-driven configuration of data quality in data repositories, IJIQ 1(1): 22–40.
    5. Paulson, L. D. (2000) Data quality: A rising e-business concern, IEEE IT Professionals 2(4): 10–14.
    6. Serrano, M. A., Calero, C., and Piattini, M. (2005) Metrics for data warehouse quality, In: Khosrow-Pour, M. (Ed.) Encyclopedia of Information Science and Technology IV, Idea Group, Hershey, PA, pp. 1938–1844.
  • 10.
    7. Shankaranarayanan, G., and Cai, Y. (2005) A web services application for the data quality management in the B2B networked environment, Proceedings of the 38th Hawaii International Conference on System Sciences (HICSS 2005), Hawaii, USA, pp. 166.
    8. Welzer, T., Brumen, V., Golob, I., and Druzovec, M. (2002) Medical diagnostic and data quality, Proceedings of the IEEE Symposium on Computer-Based Medical Systems (CBMS 2002), Maribor, Slovenia, pp. 97–101.
    9. Zhu, B., Shankar, G., and Cai, Y. (2007) Integrating data quality data into decision-making process: An information visualization approach, Proceedings of the 12th International Conference HCI International (HCII 2007) Part I, Beijing, China, pp. 366–369.