On the Development of a User-Defined Quality
Measurement Tool for XML Documents
Eric Pardede and Tejasvi Gaur
Abstract The capability of eXtensible Markup Language (XML) for data
representation has been widely accepted by research communities and industries.
Even though it can be used for efficient data transfer, many industries look for a
more promising language on which to rely when it comes to their important data.
An ability to provide good XML data quality is necessary to make this data format
more reliable and usable. To measure data quality, the current methods are largely
driven by structural and technical factors and often assess data quality impartially,
not accounting for contextual factors. It is well known that different data share
common quality features: completeness, validity, accuracy and timeliness.
Nevertheless, the measurement of quality features will be unique, based on the data
format. The measurement of quality for XML documents cannot be generalised
from quality measurement in other data formats. In this chapter, we describe the
development of a user-defined quality metric for XML documents. For
implementation, we develop a tool that enables users to control XML data quality.
We use a case study in health informatics as the proof of concept.
Keywords Data quality · XML document · User-defined quality tool
1 Introduction
XML (eXtensible Markup Language) primarily facilitates the sharing of
semistructured data across different information systems, particularly via the
internet, such as passing data from server to client, machine to machine and
application to application. XML is derived from SGML (Standard Generalized
Markup Language) with the aim of performing web functions similar to those of HTML.
Compared to HTML, it gives users more choice and freedom to develop their own
tags without worrying about web browser compatibility. In broader terms, XML is
simpler, more flexible and more extensible.
E. Pardede (B)
Department of Computer Science and Computer Engineering, La Trobe University, Melbourne,
VIC 3086, Australia e-mail: e.pardede@latrobe.edu.au
W.W. Song et al. (eds.), Information Systems Development, 213
DOI 10.1007/978-1-4419-7355-9_18, © Springer Science+Business Media, LLC 2011
In the past decade, the use of XML as a data format has exceeded its use as a
markup language. Many domain-specific standards are now structured in the XML
format due to its extensible and self-describing nature. It is no surprise that a large
volume of XML documents is created and transmitted over the internet every day.
Some of the data contain important and sensitive content and therefore, the data
quality has to be ensured. Data quality describes the relationship between the data
and the portrayal of actual phenomena and the degree of excellence of this
relationship. In simple terms, if the data is in the correct format for the purpose for
which it is required, then it has high data quality.
Much of the existing work [2, 4, 6] has investigated data quality dimensions in
various domains and data models. These studies agree on certain features:
completeness, accuracy, validity and timeliness. In this work, we discuss how these
dimensions remain applicable to measuring the quality of XML documents. Based on
these dimensions, we also implement a user-defined quality measurement tool. This
tool can be used to assist decision-making for business processes that use XML
documents as their data format.
1.1 Roadmap
Following the introduction, in Section 2, we briefly discuss the related work. We
describe our solution in Section 3 and its implementation in Section 4. A case study
is provided in Section 5 and we conclude the chapter in Section 6.
2 Related Work
We found only a limited amount of research on XML data quality. Instead, the
majority of the literature discusses various aspects of data quality in the traditional
relational format. The communities have agreed on the most fundamental aspects of
data quality, and we argue that these aspects are also applicable to the XML data
format. The only difference is the way these aspects are measured, because the
structure of XML data differs from that of traditional relational data.
Previous work [2, 4, 6] has listed four data quality dimensions:
• Completeness (C) is the extent to which data content is present.
• Accuracy (A) is the extent to which data is free from errors.
• Validity (V) is the extent to which data items conform to their corresponding
value domains.
• Timeliness (T) is the extent to which data is recent and up to date.
Each dimension is used to measure the Quality (Q) of the data, which is the
consolidated effect of all the above characteristics.
Table 1 XML data quality dimensions

Dimension: Completeness
Problem: Incomplete data due to a missing important value.
Example:
<Medical>
<Record>
<Name></Name>
<.........>******</......>
</Record>
</Medical>

Dimension: Accuracy
Problem: Mismatched tags create errors.
Example:
<Medical>
<Record>
<Name>John</Names>
<.........>******</......>
</Record>
</Medical>

Dimension: Validity
Problem: The value does not describe the content accurately. For example, no unit measurement, etc.
Example:
<Medical>
<Record>
<Name>John</Name>
<.........>*******</......>
<Height>170</Height>
</Record>
</Medical>

Dimension: Timeliness
Problem: Updated value is not incorporated correctly into the data. For example, the value of the account has not been updated.
Example:
<Medical>
<Record>
<Name>John</Name>
<.........>*******</......>
<Height>170 cm</Height>
<Account>(-) $170</Account>
</Record>
</Medical>
A similar measure can also be applied to the XML data format (see Table 1). For
example, a completeness dimension has the same meaning wherever it is used, yet
can have different contexts and representation. How we measure complete data in a
relational format will be different to the way it is measured in a tree structure format.
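To make the tree-structure point concrete, the following is a minimal sketch in Java of one possible completeness check: it walks the DOM tree and counts how many leaf elements carry a non-empty value. The class and method names (TreeCompleteness, parse, countLeaves) are hypothetical and do not come from the tool described in this chapter; only the standard javax.xml.parsers and org.w3c.dom APIs are used. Treating "leaf element with empty text" as "incomplete" is an assumption made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class TreeCompleteness {
    /** Parses an XML string into a DOM tree; throws if the document is not well formed. */
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    /** True if the element has no child elements, i.e. it is a leaf of the tree. */
    static boolean isLeaf(Element e) {
        NodeList children = e.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i).getNodeType() == Node.ELEMENT_NODE) {
                return false;
            }
        }
        return true;
    }

    /** Returns {filled leaves, total leaves} for the whole document. */
    static int[] countLeaves(Document doc) {
        NodeList all = doc.getElementsByTagName("*"); // every element in the tree
        int filled = 0;
        int total = 0;
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (!isLeaf(e)) {
                continue; // container elements such as <Record> are skipped
            }
            total++;
            if (!e.getTextContent().trim().isEmpty()) {
                filled++;
            }
        }
        return new int[] {filled, total};
    }

    public static void main(String[] args) throws Exception {
        // Illustrative input modelled on the Completeness row of Table 1.
        String xml = "<Medical><Record><Name></Name><Height>170 cm</Height></Record></Medical>";
        int[] counts = countLeaves(parse(xml));
        System.out.println(counts[0] + " of " + counts[1] + " leaf elements are filled");
    }
}
```

In a relational table the same question would be asked per column (for example, by counting NULL values), whereas here the unit of measurement is a node in the tree. A document with mismatched tags, as in the Accuracy row of Table 1, is rejected by the parser before any counting takes place.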
Table 2 summarises the existing work in the area of data quality. This work was
applied to different domains and applications, used different data formats and took
different approaches to determining data quality. Each is unique, but all the solutions
are based on the same data quality dimensions. For each work, the table lists its
application, its approach and the data format used.
Table 2 Existing works
Applications                       Approach                          Database/data format
Decision support systems [9]       Visualisation                     Relational
Case-based reasoning systems [3]   Goal-question metrics             Heterogeneous
e-Business [5]                     Case study on online processing   Relational
Web services [7]                   Query based                       HTML, XML
Data warehouse [1]                 Empirical database                Relational
Health care [8]                    Model driven                      Relational
Only a small amount of research measures the quality of web services, which
are naturally built using XML representation. However, this work cannot be used
as a quality measurement tool for XML databases and XML applications.
3 User-Defined Quality Approach
In this section, we propose a solution to measure XML data quality. The solution
includes a proposed metric and an algorithm to incorporate the metric into XML
documents.
A user-defined metric enables users to determine the quality features that a set of
XML documents has to follow. In this metric, every element is given a weight,
which can vary according to the user's needs. In addition to the weight, a user
should be able to provide additional properties against which the XML data is
checked, for example, the preferred unit for an element.
The following metric formula uses all this information to measure the quality
factor of XML documents:
\[
\mathrm{Quality} = \frac{\sum_{i=1}^{r} \left( \dfrac{\sum_{j=1}^{N(vt)} \mathrm{weight}(vt_j)}{\sum_{k=1}^{N(t)} \mathrm{weight}(t_k)} \right)}{r} \times 100\%
\]
where r is the number of records in the document; N(vt) is the number of valid tags
in the record; N(t) is the number of tags in the record; weight(vt) is the user-defined
weight for a valid tag; and weight(t) is the user-defined weight for a tag.
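As a worked illustration, the sketch below computes a quality percentage under one reading of the symbol definitions above: for each record, the summed weights of the valid tags are divided by the summed weights of all tags, and the resulting ratios are averaged over the r records. This reading is an assumption, and the class names (WeightedTag, QualityMetric) and the sample weights are hypothetical rather than taken from the tool.

```java
import java.util.Arrays;
import java.util.List;

/** One tag occurrence: its user-defined weight and whether it passed the checks. */
final class WeightedTag {
    final String name;
    final double weight;
    final boolean valid;

    WeightedTag(String name, double weight, boolean valid) {
        this.name = name;
        this.weight = weight;
        this.valid = valid;
    }
}

final class QualityMetric {
    /** Ratio of valid weight to total weight for one record (0 if the record has no weight). */
    static double recordScore(List<WeightedTag> record) {
        double validWeight = 0.0;
        double totalWeight = 0.0;
        for (WeightedTag tag : record) {
            totalWeight += tag.weight;
            if (tag.valid) {
                validWeight += tag.weight;
            }
        }
        return totalWeight == 0.0 ? 0.0 : validWeight / totalWeight;
    }

    /** Average of the record scores over all r records, expressed as a percentage. */
    static double quality(List<List<WeightedTag>> records) {
        if (records.isEmpty()) {
            return 0.0;
        }
        double sum = 0.0;
        for (List<WeightedTag> record : records) {
            sum += recordScore(record);
        }
        return sum / records.size() * 100.0;
    }

    public static void main(String[] args) {
        // Record 1: AGE fails its check, so 3 of 4 weight units are valid (0.75).
        List<WeightedTag> r1 = Arrays.asList(
                new WeightedTag("NAME", 2.0, true),
                new WeightedTag("AGE", 1.0, false),
                new WeightedTag("BLOODGROUP", 1.0, true));
        // Record 2: every tag is valid (1.0).
        List<WeightedTag> r2 = Arrays.asList(
                new WeightedTag("NAME", 2.0, true),
                new WeightedTag("AGE", 1.0, true),
                new WeightedTag("BLOODGROUP", 1.0, true));
        System.out.println(quality(Arrays.asList(r1, r2))); // prints 87.5
    }
}
```

Because the weights enter both the numerator and the denominator, only their relative sizes matter: doubling every weight leaves the score unchanged, while raising the weight of a single tag makes a failure on that tag more costly.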
We apply our quality metric in Algorithm 1. This algorithm takes an XML file
as input. First, it checks the document for all the starting and ending tags. Once
all are found, the system concatenates the XML document and stores it in an array
in the form of a text file.
The data quality checking procedure then starts, and the system checks the
document against the user-defined metric attributes and their respective units (Lines
1-12 to 1-19). For each valid metric attribute, it adds the attribute's weight to the
total weight of the document. After the complete document is checked, all the
values are entered into the final data quality metric and the document's data quality
is calculated.
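The checking procedure described above might be sketched in Java as follows. This is not Algorithm 1 itself: it uses a DOM parser instead of the tag-by-tag scan, treats a unit check as a simple suffix match, and assumes a default weight of 1.0 for tags without a rule. The names MetricAttribute, QualityChecker and DEFAULT_WEIGHT, as well as the sample data, are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical user-defined rule: expected unit suffix ("" means none) and weight. */
final class MetricAttribute {
    final String unit;
    final double weight;

    MetricAttribute(String unit, double weight) {
        this.unit = unit;
        this.weight = weight;
    }
}

final class QualityChecker {
    static final double DEFAULT_WEIGHT = 1.0; // assumed default for tags without a rule

    /** A value is valid if it is non-empty and ends with the expected unit, if any. */
    static boolean isValid(String value, MetricAttribute rule) {
        String v = value.trim();
        if (v.isEmpty()) {
            return false;
        }
        return rule == null || rule.unit.isEmpty() || v.endsWith(rule.unit);
    }

    /** Scores every recordTag element and returns the averaged percentage. */
    static double check(String xml, String recordTag, Map<String, MetricAttribute> rules)
            throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        NodeList records = doc.getElementsByTagName(recordTag);
        if (records.getLength() == 0) {
            return 0.0;
        }
        double sumOfRecordScores = 0.0;
        for (int i = 0; i < records.getLength(); i++) {
            double valid = 0.0;
            double total = 0.0;
            NodeList fields = records.item(i).getChildNodes();
            for (int j = 0; j < fields.getLength(); j++) {
                if (fields.item(j).getNodeType() != Node.ELEMENT_NODE) {
                    continue; // skip whitespace and other non-element nodes
                }
                Element field = (Element) fields.item(j);
                MetricAttribute rule = rules.get(field.getTagName());
                double w = rule == null ? DEFAULT_WEIGHT : rule.weight;
                total += w;
                if (isValid(field.getTextContent(), rule)) {
                    valid += w; // add the weight of each valid metric attribute
                }
            }
            sumOfRecordScores += total == 0.0 ? 0.0 : valid / total;
        }
        return sumOfRecordScores / records.getLength() * 100.0;
    }

    public static void main(String[] args) throws Exception {
        Map<String, MetricAttribute> rules = new LinkedHashMap<>();
        rules.put("Height", new MetricAttribute("cm", 2.0));
        // Illustrative data: the first record lacks the unit on <Height>.
        String xml = "<Medical>"
                + "<Record><Name>John</Name><Height>170</Height></Record>"
                + "<Record><Name>Mary</Name><Height>165 cm</Height></Record>"
                + "</Medical>";
        System.out.println(check(xml, "Record", rules));
    }
}
```

In this example the first record scores 1/3 (only Name, with weight 1, is valid out of a total weight of 3) and the second scores 1, so the document rates roughly 66.7%.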
4 Data Quality Checking Tool Implementation
We design a tool for users to define their own quality criteria for their XML
documents. The development of the tool follows the diagram in Fig. 1.
Fig. 1 Design model of data quality checking tool (components shown: Webpage, XML File Types, JAVA Program, Database, Output, Error log)
The prototype is a Java-based program that connects to a MySQL database and
generates two outputs: (i) output.txt, which contains the breakdown of the XML
document in a well-structured form after the program reads the values from the
tags, and (ii) error.log, which records every detail of the XML document that
affects its quality. To populate the MySQL database, we use a web interface, which
manages the XML documents' properties and the quality factors. The
implementation setup is summarised in Table 3.
Our web interface takes the user-defined values for each attribute/element in the
XML document (Fig. 2). The user is free to change or delete the metric attributes
entered. The attribute measurement units and their weights can be left blank, in
which case default values are used.
Table 3 Implementation setup
Languages used      XML, Java, PHP, HTML
Database used       MySQL database
Input file types    XML files
Output file types   Text files, command line outputs
Drivers used        MySQL-connector-Java-5.1.6-bin.jar
Server used         WAMP server (only for local development and testing purposes)
Fig. 2 XML data quality tool web interface
At the current stage of the implementation, the prototype can handle a reasonably
large database of textual content. For a typical XML data set, it can process up to
100 MB of data without significant performance problems. Note that this prototype
was developed on modest hardware (Core 2 Duo 2.0 GHz processor with 2 GB
RAM). For full industrial application with larger data sets, more powerful hardware
should be used.
5 Case Study
We apply the developed tool in a real case study using health informatics data. The
health informatics sector, like many others, is experiencing large growth in
incoming data, due to the increased number of requirements for which the data is
used. The increase in available data has also increased the need to maintain data
quality.
Fig. 3 User-defined metric attribute check
In the case of health informatics or medical data, data quality is even more
important than efficiency and speed, as the nature of this data is critical and must be
precise. For example, the correct storage of records for patients’ blood types is
essential and can be a matter of life or death in an emergency situation.
Below is a sample XML document that contains information on a patient. Using
our quality tool, a user can identify the metric attributes that have to be checked, as
shown in Fig. 3. The user enters all the properties along with their weights. If no
unit is given, then default values are used.
<MEDICAL>
<RECORD>
<PATIENTNAME>Michael</PATIENTNAME>
<AGE>20yrs</AGE>
<MOBILE>0433384056</MOBILE>
<ADDRESS>24 the Fairway, Greensborough - 3334</ADDRESS>
<EMERGENCY>Steve-0433765673</EMERGENCY>
<WEIGHT>65KG</WEIGHT>
<HEIGHT>180cms</HEIGHT>
<BLOODGROUP>B+</BLOODGROUP>
<BLOODPRESSURE>120-65mmHg</BLOODPRESSURE>
<HEARTCONDITIONS>normal</HEARTCONDITIONS>
<REACTIONS>N/A</REACTIONS>
<ALLERGIES>DUST ALLERGIC</ALLERGIES>
</RECORD>
</MEDICAL>
In this case, the users want to employ the following properties for patient data:
(i) all the lines are complete and no blank lines are present; (ii) all the fields are
complete and no blank fields are present; (iii) all the given units are present with the
respective attributes; (iv) age has 'yrs' as the measurement unit; (v) mobile is
represented in ten digits; (vi) weight has 'kgs' as the measurement unit; (vii) blood
pressure has its measuring units; (viii) for attributes which do not have measurement
units provided, default values are present; and (ix) all the attributes have given
weights.
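A few of these properties can be expressed as simple pattern checks. The sketch below encodes properties (ii), (iv), (v) and (vi) as regular expressions in Java and returns one message per violation, in the spirit of the error log. The class name PatientRules, the exact patterns and the strict, case-sensitive matching are assumptions for illustration; the chapter does not state how the tool compares units.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class PatientRules {
    /** Hypothetical encodings of properties (iv) to (vi) as regular expressions. */
    static Map<String, Pattern> rules() {
        Map<String, Pattern> r = new LinkedHashMap<>();
        r.put("AGE", Pattern.compile("\\d+\\s*yrs"));    // (iv) age measured in 'yrs'
        r.put("MOBILE", Pattern.compile("\\d{10}"));      // (v) exactly ten digits
        r.put("WEIGHT", Pattern.compile("\\d+\\s*kgs"));  // (vi) weight measured in 'kgs'
        return r;
    }

    /** Returns one message per field that is blank or does not match its pattern. */
    static List<String> violations(Map<String, String> record) {
        List<String> errors = new ArrayList<>();
        Map<String, Pattern> rules = rules();
        for (Map.Entry<String, String> field : record.entrySet()) {
            String value = field.getValue() == null ? "" : field.getValue().trim();
            Pattern pattern = rules.get(field.getKey());
            if (value.isEmpty()) {
                errors.add(field.getKey() + ": blank field"); // property (ii)
            } else if (pattern != null && !pattern.matcher(value).matches()) {
                errors.add(field.getKey() + ": unexpected value '" + value + "'");
            }
        }
        return errors;
    }

    public static void main(String[] args) {
        // Three fields copied from the sample patient record above.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("AGE", "20yrs");
        record.put("MOBILE", "0433384056");
        record.put("WEIGHT", "65KG");
        violations(record).forEach(System.out::println);
    }
}
```

Under this strict reading, the sample record passes the age and mobile checks, while WEIGHT is reported because '65KG' does not end in 'kgs'. A more lenient policy, such as case-insensitive matching or accepting 'kg', would be a user decision of the kind the metric is designed to capture.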
Figure 4 shows an outcome of a sample XML document quality measurement. In
this case, the document only rates 76% against all the quality factors defined by a
user. Parts of the XML documents that do not fit the quality factors are logged for
future analysis.
Fig. 4 Measurement outcome sample
Based on the outcome, the users will be able to determine whether the XML
documents have met the quality criteria and therefore, can be used for further
processing or analysis. The users can also define a different set of quality factors
depending on the source or the further use of the XML documents.
In health informatics, different hospitals or clinics might have different facilities
and different practices for recording their data. If we want to integrate the data, such
as for a decision support system, the measurement tool can be used for screening, in
the data preparation stage.
While the tool enables a user-defined quality metric, it also introduces a subjectivity
problem. Questions such as who should decide which attributes are important, and
what their weights should be, must be settled by organisational policy. This chapter
aims to provide a tool; in most cases, the tool cannot be used in isolation, without a
clear procedure on who should use it and how it is applied to assist the business
process.
6 Conclusion and Future Work
Due to the increasing volume of XML data used for various applications, database
users need a tool to manage the quality of the data. Defining data quality has been
widely researched over many years, and a set of properties such as completeness,
accuracy, validity and timeliness have been set as confirmed data quality
dimensions.
Unfortunately, judging data quality using these features can be a tedious task and
to the best of our knowledge, there is no tool for measuring data quality, especially
for XML data. The need for such a tool has become of the utmost importance since
XML data, by its nature, will be used for data sharing and integration and therefore,
quality has to be maintained and screened carefully.
In this chapter, we apply quality dimensions to the XML data format and
implement them in a quality measurement tool. The quality dimensions are not
static, and we provide user-defined input features to define the quality attributes and
their weights. As a proof of concept, we provide a case study using health
informatics data and test the quality measurement using our tool.
For future work, we will incorporate more sophisticated business quality factors
into the tool. We will also perform further scalability evaluation of our data quality
tool, especially where it has to measure the quality of a large batch of XML
documents, such as an XML warehouse. In addition, a user-defined quality
formula can be included in our quality tool.
References
1. Ballou, D. P., and Tayi, G. K. (1999) Enhancing data quality in data warehouse
environments, Communications of the ACM 42(1): 73–78.
2. Batini, C., and Scannapieco, M. (2006) Data Quality: Concepts, Methodologies and
Techniques. Springer, Berlin.
3. Bierer, A. (2007) Methodological assistance for integrating data quality evaluations into
case-based reasoning systems, Proceedings of the 7th International Conference on Case-Based
Reasoning (ICCBR 2007), Belfast, Northern Ireland, UK, pp. 254–268.
4. Even, A., and Shankaranarayanan, G. (2007) Utility-driven configuration of data quality in
data repositories, IJIQ 1(1): 22–40.
5. Paulson, L. D. (2000) Data quality: A rising e-business concern, IEEE IT Professional 2(4):
10–14.
6. Serrano, M. A., Calero, C., and Piattini, M. (2005) Metrics for data warehouse quality, In:
Khosrow-Pour, M. (Ed.) Encyclopedia of Information Science and Technology IV, Idea Group,
Hershey, PA, pp. 1938–1844.
7. Shankaranarayanan, G., and Cai, Y. (2005) A web services application for the data
quality management in the B2B networked environment, Proceedings of the 38th Hawaii
International Conference on System Sciences (HICSS 2005), Hawaii, USA, pp. 166.
8. Welzer, T., Brumen, V., Golob, I., and Druzovec, M. (2002) Medical diagnostic and data
quality, Proceedings of the IEEE Symposium on Computer-Based Medical Systems (CBMS
2002), Maribor, Slovenia, pp. 97–101.
9. Zhu, B., Shankar, G., and Cai, Y. (2007) Integrating data quality data into decision-making
process: An information visualization approach, Proceedings of the 12th International
Conference HCI International (HCII 2007) Part I, Beijing, China, pp. 366–369.