ISSN: 1694-2507 (Print)
ISSN: 1694-2108 (Online)
International Journal of Computer Science
and Business Informatics
(IJCSBI.ORG)
VOL 8, NO 1
DECEMBER 2013
Table of Contents VOL 8, NO 1 DECEMBER 2013
An Integrated Distributed Clustering Algorithm for Large Scale WSN
S. R. Boselin Prabhu, S. Sophia, S. Arthi and K. Vetriselvi

An Efficient Connection between Statistical Software and Database Management System
Sunghae Jun

Pragmatic Approach to Component Based Software Metrics Based on Static Methods
S. Sagayaraj and M. Poovizhi

SDI System with Scalable Filtering of XML Documents for Mobile Clients
Yi Yi Myint and Hninn Aye Thant

An Easy yet Effective Method for Detecting Spatial Domain LSB Steganography
Minati Mishra and Flt. Lt. Dr. M. C. Adhikary

Minimizing the Time of Detection of Large (Probably) Prime Numbers
Dragan Vidakovic, Dusko Parezanovic and Zoran Vucetic

Design of ATL Rules for Transforming UML 2 Sequence Diagrams into Petri Nets
Elkamel Merah, Nabil Messaoudi, Dalal Bardou and Allaoua Chaoui
An Integrated Distributed Clustering
Algorithm for Large Scale WSN
S. R. BOSELIN PRABHU
Assistant Professor, Department of Electronics and Communication Engineering
SVS College of Engineering, Coimbatore, India.
S. SOPHIA
Professor, Department of Electronics and Communication Engineering
Sri Krishna College of Engineering and Technology, Coimbatore, India.
S. ARTHI & K. VETRISELVI
UG Students, Department of Electronics and Communication Engineering
SVS College of Engineering, Coimbatore, India.
Abstract
Recent advances in wireless communications and electronics have enabled the
development of low-cost wireless sensor nodes. Clustering is an effective
topology control approach that can prolong the lifetime and improve the
scalability of wireless sensor networks. The common criteria for a clustering
methodology are to select cluster heads with higher residual energy and to
rotate them periodically. Sensors at heavy-traffic locations quickly deplete
their energy resources and die much earlier, leaving behind an energy hole and
a partitioned network. In this paper, a model of a distributed layer-based
clustering algorithm is proposed based on three concepts. First, the
aggregated data is forwarded from a cluster head to the base station through
the cluster head of the next higher layer with the shortest distance between
the cluster heads. Second, the cluster head is elected based on the clustering
factor, which combines the residual energy and the number of neighbors of a
particular node within a cluster. Third, each cluster has a crisis hindrance
node, which performs the function of the cluster head when the cluster head
fails to carry out its work under critical conditions. The key aim of the
proposed algorithm is to achieve energy efficiency and to prolong the network
lifetime. The proposed distributed clustering algorithm is compared with the
existing clustering algorithm LEACH.
Keywords: Wireless sensor network (WSN), distributed clustering
algorithm, cluster head, residual energy, energy efficiency, network lifetime.
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 2
1. INTRODUCTION
A wireless sensor network (WSN) is a collection of a large number of small,
low-power and low-cost electronic devices called sensor nodes. Each sensor
node consists of four major units, namely sensing, processing, power and
communication, which are responsible for sensing, processing and wireless
communication (Figure 1). The nodes gather relevant data from the environment
and then transfer the gathered data to the base station (BS). Since WSNs have
many advantages, such as self-organization, freedom from fixed infrastructure,
fault tolerance and locality, they have a wide variety of potential
applications, including border security and surveillance, environmental
monitoring and forecasting, wildlife protection, home automation, and disaster
management and control. Because sensor nodes are usually deployed in remote
locations, it is impossible to recharge their batteries. Therefore, using the
limited energy resource wisely to extend the lifetime of the network is a
demanding research issue for these sensor networks.
Figure 1: Various components of a wireless sensor node
Clustering [2-7] is an effective topology control approach that can prolong
the lifetime and increase the scalability of these sensor networks. The
popular criterion for a clustering technique (Figure 2) is to select a cluster
head (CH) with more residual energy and to rotate it periodically. The basic
idea of clustering algorithms is to use the data aggregation [8-11] mechanism
in the cluster head to reduce the amount of data transmission. Clustering
offers advantages such as network scalability, localized
route setup, efficient use of communication bandwidth [17] and prolonged
network lifetime [12-16]. Through the data aggregation process, unnecessary
communication between the sensor nodes, the cluster head and the base station
is avoided. In this paper, a well-defined model of a distributed layer-based
clustering algorithm is proposed based on three concepts: the aggregated data
is forwarded from the cluster head to the base station through the cluster
head of the next higher layer with the shortest distance between the cluster
heads; the cluster head is elected based on the clustering factor; and the
crisis hindrance node performs the function of the cluster head when the
cluster head fails to carry out its work. The prime aim of the proposed
algorithm is to attain energy efficiency and increased network lifetime.
Figure 2: Cluster formation in a wireless sensor network
The rest of this paper is structured as follows. Section 2 reviews existing
distributed clustering algorithms and discusses their advantages and
shortcomings. Section 3 evaluates the existing clustering algorithm LEACH
(Low Energy Adaptive Clustering Hierarchy) and briefly describes the basic
concept behind it. Section 4 presents the model of the proposed distributed
layer-based clustering algorithm and the concepts behind it. Finally, the
last section concludes the paper.
2. A REVIEW OF EXISTING CLUSTERING ALGORITHMS
Bandyopadhyay and Coyle proposed EEHC [18], a randomized clustering algorithm
that organizes the sensor nodes into a hierarchy of clusters with the
objective of minimizing the total energy spent in the system to communicate
the information gathered by the sensors to the information processing center.
It has a variable cluster count; the stationary cluster head aggregates and
relays the data to the BS, and the algorithm is valid for extensive
large-scale networks. Its particular drawback is that some nodes remain
un-clustered throughout the clustering process.
Barker, Ephremides and Flynn proposed LCA [19], which was chiefly developed to
avoid communication collisions among the nodes by using TDMA time slots. It
uses a single-hop scheme, thereby attaining a high degree of connectivity when
the CH is selected randomly. A restructured version of LCA, LCA2, was
implemented to reduce the number of cluster heads produced compared with the
original LCA algorithm. The key drawback of this algorithm is that single-hop
clustering leads to the creation of a larger number of clusters.
Nagpal and Coore proposed CLUBS [20], which forms overlapping clusters with a
maximum cluster diameter of two hops. The clusters are created by local
broadcasting, and convergence depends on the local density of the wireless
sensor nodes. The algorithm can be implemented in an asynchronous environment
without losing efficiency. The main difficulty is the overlapping of clusters:
when clusters have their CHs within one-hop range of each other, both clusters
collapse and the CH election process is restarted.
Demirbas, Arora and Mittal proposed FLOC [21], which exploits the double-band
nature of the wireless radio model for communication. Nodes can communicate
reliably with nodes in the inner band and unreliably with nodes in the outer
band. The chief disadvantage of the algorithm is that communication between
nodes in the outer band is unreliable, so messages have a high probability of
being lost during communication.
Ye, Li, Chen and Wu proposed EECS [22], which is based on the assumption that
all CHs can communicate directly with the BS. The clusters have variable
sizes: those closer to the BS are larger and those farther from the BS are
smaller. It is energy efficient in intra-cluster communication and shows an
excellent improvement in network lifetime.
EEUC aims at uniform energy consumption within the sensor network. It forms
unequal clusters, with the assumption that each cluster can have a variable
size. The probabilistic selection of CHs is the focal shortcoming of this
algorithm, and a few nodes may be left without being part of any cluster.
Yu, Li and Levy proposed DECA, which selects the CH based on residual energy,
connectivity and a node identifier. It is highly energy efficient, as it uses
fewer messages for CH selection. The main trouble with this algorithm is the
high risk of wrong CH selection, which leads to the discarding of every packet
sent by the wireless sensor node.
Ding, Holliday and Celik proposed DWEHC, which elects CHs on the basis of a
weight that combines a node's residual energy and its distance to the
neighboring nodes. It produces well-balanced clusters, independent of network
topology. The node possessing the largest weight in a cluster is designated as
CH. The algorithm constructs multilevel clusters, and the nodes in every
cluster reach the CH by relaying through other intermediate nodes. The
foremost problem is the high energy utilization of the several iterations
needed until the nodes settle into the most energy-efficient topology.
HEED is a well-distributed clustering algorithm in which CH selection takes
into account both the residual energy of the nodes and the intra-cluster
communication cost, leading to a prolonged network lifetime. It can have a
variable cluster count and supports heterogeneous sensors. The problems with
HEED are that its application is limited to static networks and that, even
though it prevents random selection of the CH, it employs complex methods and
multiple clustering messages per node for CH selection.
3. AN EVALUATION OF LEACH ALGORITHM
LEACH [1] is one of the most popular clustering mechanisms for WSNs and is
considered the representative energy-efficient protocol. In this protocol,
sensor nodes are grouped together to form clusters. In each cluster, one
sensor node is chosen arbitrarily to act as the cluster head (CH), which
collects data from its member nodes, aggregates them and then forwards the
result to the base station. The operation is divided into many rounds, and
each round consists of two phases: the set-up phase and the steady phase.
During the set-up phase, initial clusters are formed and cluster heads are
selected. Each wireless sensor node generates a random number between 0 and 1.
If the number is less than the threshold, the node selects itself as the
cluster head for the present round. The threshold for cluster head selection
in LEACH for a particular round is given in equation (1). After selecting
itself as a CH, the sensor node broadcasts an advertisement message that
contains its own ID. The non-cluster-head nodes then decide which cluster to
join based on the strength of the received advertisement signal. After the
decision is made, every non-cluster-head node transmits a join-request message
to the chosen cluster head to indicate that it will be a member of that
cluster. After receiving all the join-request messages, the cluster head
creates and broadcasts a time division multiple access (TDMA) schedule so that
data can be exchanged with the non-cluster-head sensor nodes without
collision.
T(n) =
\begin{cases}
\dfrac{p}{1 - p\left(r \bmod \dfrac{1}{p}\right)} & \text{if } n \in G \\
0 & \text{otherwise}
\end{cases}
\qquad (1)
where p is the preferred percentage of cluster heads, r is the current round
number and G is the set of nodes which have not been chosen as cluster
head for the last 1/p rounds.
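For illustration, the threshold test of equation (1) can be expressed in a few
lines of R. This is a minimal sketch, not taken from [1]; the bookkeeping
argument rounds_since_ch is a hypothetical name used only to track whether the
node still belongs to G.

leach_is_cluster_head <- function(p, r, rounds_since_ch) {
  # A node is in G only if it has not served as CH in the last 1/p rounds
  if (rounds_since_ch < 1 / p) return(FALSE)
  threshold <- p / (1 - p * (r %% (1 / p)))   # equation (1)
  runif(1) < threshold                        # uniform draw in [0,1]; below threshold -> elect itself
}

# Example: 5% desired cluster heads, round 7, node not recently a CH
leach_is_cluster_head(p = 0.05, r = 7, rounds_since_ch = 20)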
The steady phase commences after the clusters are formed and the TDMA
schedules are broadcast. Each sensor node transmits its data to the cluster
head once per round during its allotted transmission slot in the TDMA
schedule; at other times it turns off its radio in order to reduce energy
consumption. The cluster heads, however, must stay awake all the time so that
they can receive all data from the nodes within their own clusters. On
receiving the data from the cluster, the cluster head carries out data
aggregation and forwards the result to the base station directly. This is the
entire mechanism of the steady-state phase. After a certain predefined time,
the network steps into the next round. LEACH is the basic cluster-based
protocol, and it can prolong the network lifetime in comparison with other
multi-hop and static routing schemes. However, some hidden problems remain to
be considered.
LEACH does not take the residual energy into account when electing cluster
heads and constructing the clusters. As a result, nodes with less energy may
be elected as cluster heads and then die much earlier. Moreover, since a node
selects itself as a cluster head only according to the value of the calculated
probability, it is hard to guarantee the number of cluster heads and their
distribution. Also, because the cluster heads are selected randomly in LEACH,
the weaker nodes drain easily. To overcome
these shortcomings of LEACH, a model of a distributed layer-based clustering
algorithm is proposed in which clusters are arranged into hierarchical layers.
Instead of the cluster heads sending the aggregated data directly to the base
station, each sends it to the nearest cluster head in the next layer. These
cluster heads send their own data, along with the data received from
lower-level cluster heads, to the nearest cluster heads of the next layer.
This cumulative process is repeated until the data from all layers reach the
base station. The proposed model focuses on reduced energy utilization and
improved network lifetime of the sensor network.
4. THE PROPOSED CLUSTERING ALGORITHM
The proposed clustering algorithm is well distributed, where the sensor
nodes are deployed randomly to sense the target environment. The nodes are
divided into clusters, each having a CH. The nodes send their data during
their allotted TDMA time slots to their respective CH, which fuses the data
through data aggregation to avoid redundant information.
The aggregated data is forwarded to the BS. Compared to the existing
algorithms, the proposed algorithm has three distinguishing features. First,
the aggregated data is forwarded from the cluster head to the base station
through cluster head of the next higher layer with shortest distance between
the cluster heads. Second, cluster head is elected based on the clustering
factor, which is the combination of residual energy and the number of
neighbors of a particular node within a cluster. Third, each cluster has a
crisis hindrance node, which performs the function of the cluster head when
the cluster head fails to carry out its work under certain conditions.
Figure 3: Aggregated data forwarding in the proposed algorithm
A. Aggregated Data Forwarding
In a network of N nodes, each node is assigned a unique Node Identity (NID).
The NID serves only to identify a node and has no relationship with location
or clustering. Within a cluster, the CH is placed at the center and the member
nodes are organized into several layers around it. All clusters are arranged
into hierarchical layers, and a layer number is assigned to each cluster: the
cluster farthest from the base station is designated as the lowest layer and
the cluster nearest to the base station as the highest layer. A characteristic
feature of the proposed algorithm is that the lowest-layer cluster head
forwards only its own aggregated data to the next-layer cluster head, whereas
the highest layer forwards all the aggregated data from the preceding cluster
heads to the base station (Figure 3). Thus a lower workload is assigned to the
lower layers and a greater workload to the higher layers. The workload
assigned to a particular cluster head is directly proportional to its energy
utilization. To balance the energy utilization among the cluster heads,
variable transmission power is employed, where the transmission power
decreases as the layer number increases. In LEACH, each cluster head forwards
the aggregated data directly to the base station, which consumes much energy.
The proposed algorithm forwards data from the cluster heads to the base
station in a multi-hop fashion, resulting in reduced energy utilization.
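As an illustration of this forwarding rule, the following R sketch picks the
next hop of each cluster head as the nearest cluster head in the next higher
layer, with the highest layer sending directly to the base station. The data
frame ch, its columns and the coordinates are hypothetical names chosen for
the example, not taken from the paper.

# Hypothetical layout: three cluster heads with coordinates and layer numbers
ch <- data.frame(id = c("CH1", "CH2", "CH3"),
                 x = c(10, 40, 80),
                 y = c(10, 35, 70),
                 layer = c(1, 2, 3))          # layer 3 is nearest to the base station

next_hop <- function(ch, i) {
  higher <- ch[ch$layer == ch$layer[i] + 1, ]
  if (nrow(higher) == 0) return("BS")         # highest layer forwards directly to the BS
  d <- sqrt((higher$x - ch$x[i])^2 + (higher$y - ch$y[i])^2)
  as.character(higher$id[which.min(d)])       # nearest CH in the next higher layer
}

sapply(seq_len(nrow(ch)), function(i) next_hop(ch, i))   # "CH2" "CH3" "BS"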
Figure 4: Mechanism of cluster head selection in the proposed algorithm
B. Cluster Head Selection
The cluster head is elected based on the clustering factor (Figure 4), which
is the combination of the residual energy and the number of neighbors of a
particular node within a cluster. Residual energy is defined as the energy
remaining within a particular node after some number of rounds; it is one of
the main parameters for CH selection in the proposed algorithm. A neighboring
node is a node that lies within one-hop distance of a particular node. LEACH
selects the cluster head randomly, without considering residual energy or the
number of neighbors, whereas the proposed algorithm includes both parameters
in order to elect the cluster head properly and thereby reduce the node death
rate. A further characteristic of the proposed algorithm compared with LEACH
is that the base station is not involved in the clustering process, directly
or indirectly. The node with the highest clustering factor is selected as the
cluster head for the current round. This is particularly significant in mobile
environments: when the sensor nodes move, the number
of neighbors varies and should be taken into account, which the LEACH
clustering mechanism does not consider.
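A minimal R sketch of this election rule is given below. The paper does not
specify how residual energy and neighbor count are combined into the
clustering factor, so a normalized weighted sum with hypothetical weights is
assumed here purely for illustration.

elect_cluster_head <- function(residual_energy, neighbor_count,
                               w_energy = 0.5, w_neighbors = 0.5) {
  # Clustering factor: assumed normalized weighted sum (not specified in the paper)
  cf <- w_energy * residual_energy / max(residual_energy) +
        w_neighbors * neighbor_count / max(neighbor_count)
  which.max(cf)   # the node with the highest clustering factor becomes CH
}

# Four nodes in one cluster: residual energies and one-hop neighbor counts
elect_cluster_head(residual_energy = c(0.8, 0.6, 0.9, 0.4),
                   neighbor_count  = c(3, 5, 4, 2))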
C. Alternate Crisis Hindrance Node
In a cluster with a large number of nodes, a cluster-head crisis does not
greatly affect the overall performance of the wireless sensor system, but in a
network with fewer nodes it does. Care should therefore be taken during the
cluster head selection process by providing an alternate recovery mechanism.
In addition to the regular cluster head, one additional cluster node is
assigned the task of secondary cluster head; this node is called the crisis
hindrance node. Generally the cluster collapses when the cluster head fails;
in such situations, the crisis hindrance node acts as the cluster head and
recovers the cluster. A characteristic feature of the proposed algorithm is
that the crisis hindrance node performs only the recovery function and does
not take part in the sensing process. In LEACH, the distribution and loading
of CHs across the nodes in the network is not uniform even though the cluster
heads are switched periodically. Hence there is a high probability of a
cluster collapsing, which the proposed algorithm avoids with the help of the
crisis hindrance node.
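The recovery role can be sketched in R as follows, assuming (as an
illustration only; the paper does not state the selection rule) that the node
with the second-highest clustering factor is kept aside as the crisis
hindrance node and is promoted when the elected cluster head fails.

cf <- c(0.74, 0.71, 0.90, 0.35)                  # clustering factors of the cluster members
ord <- order(cf, decreasing = TRUE)
roles <- list(ch = ord[1], hindrance = ord[2])   # elected CH and crisis hindrance node

active_head <- function(roles, ch_alive) {
  if (ch_alive) roles$ch else roles$hindrance    # failover when the CH fails
}

active_head(roles, ch_alive = FALSE)             # the hindrance node (node 1) takes over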
5. CONCLUSION AND FUTURE WORK
This paper gives a brief introduction to the clustering process in wireless
sensor networks. A study of the well-evaluated distributed clustering
algorithm Low Energy Adaptive Clustering Hierarchy (LEACH) is described. To
overcome the drawbacks of the existing LEACH algorithm, a model of a
distributed layer-based clustering algorithm is proposed for clustering the
wireless sensor nodes. In the proposed distributed clustering algorithm, the
aggregated data is forwarded from the cluster head to the base station through
the cluster head of the next higher layer with the shortest distance between
the cluster heads. The selection of the cluster head is based on the
clustering factor, which combines the residual energy and the number of
neighbors of a particular node within a cluster. In addition, each cluster has
a crisis hindrance node. In future work, the algorithm will be simulated using
a network simulator, and the simulation results will be compared with two or
three existing distributed clustering algorithms.
6. ACKNOWLEDGMENTS
Our sincere gratitude goes to the management of SVS Educational Institutions
and to the research supervisor Dr. S. Sophia, who served as a guiding light
for this research work.
REFERENCES
[1] W.B.Heinzelman, A.P.Chandrakasan, H.Balakrishnan, (2002), “An application specific
protocol architecture for wireless microsensor networks”, IEEE Transactions on Wireless
Communication Volume 1, Number 4, Pages 660-670.
[2] O.Younis, S.Fahmy, (2004), “HEED: A hybrid energy-efficient distributed clustering
approach for adhoc sensor networks”, IEEE Transactions on Mobile Computing, Volume 3,
Number 4, Pages 366-379.
[3] S.Zairi, B.Zouari, E.Niel, E.Dumitrescu, (2012), “Nodes self-scheduling approach for
maximizing wireless sensor network lifetime based on remaining energy” IET Wireless
Sensor Systems, Volume 2, Number 1, Pages 52-62.
[4] I.Akyildiz, W.Su, Y.Sankarasubramaniam, E.Cayirci, (2002), “A Survey on sensor
networks”, IEEE Communications Magazine, Pages 102-114.
[5] G.J.Pottie, W.J.Kaiser, (2000), “Embedding the internet: wireless integrated network
sensors”, Communications of the ACM, Volume 43, Number 5, Pages 51-58.
[6] J.H.Chang, L.Tassiulas, (2004), “Maximum lifetime routing in wireless sensor
networks”, IEEE/ACM Transactions on Networking, Volume 12, Number 4, Pages 609-
619.
[7] S.R.Boselin Prabhu, S.Sophia, (2011), “A survey of adaptive distributed clustering
algorithms for wireless sensor networks”, International Journal of Computer Science and
Engineering Survey, Volume 2, Number 4, Pages 165-176.
[8] S.R.Boselin Prabhu, S.Sophia, (2012), “A Research on decentralized clustering
algorithms for dense wireless sensor networks”, International Journal of Computer
Applications , Volume 57, Number 20, Pages 0975-0987.
[9] S.R.Boselin Prabhu, S.Sophia, (2013), “Mobility assisted dynamic routing for mobile
wireless sensor networks”, International Journal of Advanced Information Technology ,
Volume 3, Number 1, Pages 09-19.
[10] S.R.Boselin Prabhu, S.Sophia, (2013), “A review of energy efficient clustering
algorithm for connecting wireless sensor network fields”, International Journal of
Engineering Research & Technology, Volume 1, Number 4, Pages 477–481.
[11] S.R.Boselin Prabhu, S.Sophia, (2013), “Capacity based clustering model for dense
wireless sensor networks”, International Journal of Computer Science and Business
Informatics, Volume 5, Number 1.
[12] J.Deng, Y.S.Han, W.B.Heinzelman, P.K.Varshney, (2005), “Balanced-energy sleep
scheduling scheme for high density cluster-based sensor networks”, Elsevier Computer
Communications Journal, Special Issue on ASWN04, Pages 1631-1642.
[13] C.Y.Wen, W.A.Sethares, (2005), “Automatic decentralized clustering for wireless
sensor networks”, EURASIP Journal of Wireless Communication Networks, Volume 5,
Number 5, Pages 686-697.
[14] S.D.Murugananthan, D.C.F.Ma, R.I.Bhasin, A.O.Fapojuwo, (2005) “A centralized
energy-efficient routing protocol for wireless sensor networks”, IEEE Transactions on
Communication Magazine, Volume 43, Number 3, Pages S8-13.
[15] F.Bajaber, I.Awan, (2009), “Centralized dynamic clustering for wireless sensor
networks”, Proceedings of the International Conference on Advanced Information
Networking and Applications.
[16] Pedro A. Forero, Alfonso Cano, Georgios B.Giannakis, (2011), “Distributed clustering
using wireless sensor networks”, IEEE Journal of Selected Topics in Signal Processing,
Volume 5, Pages 707-724.
[17] Lianshan Yan, Wei Pan, Bin Luo, Xiaoyin Li, Jiangtao Liu, (2011), “Modified energy-
efficient protocol for wireless sensor networks in the presence of distributed optical fiber
sensor link”, IEEE Sensors Journal, Volume 11, Number 9, Pages 1815-1819.
[18] S.Bandyopadhay, E.Coyle, (2003), “An energy-efficient hierarchical clustering
algorithm for wireless sensor networks”, Proceedings of the 22nd
Annual Joint Conference
of the IEEE Computer and Communications Societies (INFOCOM 2003), San Francisco,
California.
[19] D.J.Barker, A.Ephremides, J.A.Flynn, (1984), “The design and simulation of a mobile
radio network with distributed control”, IEEE Journal on Selected Areas in
Communications, Pages 226-237.
[20] R.Nagpal, D.Coore, (2002), “An algorithm for group formation in an amorphous
computer”, Proceedings of IEEE Military Communications Conference (MILCOM 2002),
Anaheim, CA.
[21] M.Demirbas, A.Arora, V.Mittal, (2004), “FLOC: A fast local clustering service for
wireless sensor networks”, Proceedings of Workshop on Dependability Issues in Wireless
Ad Hoc Networks and Sensor Networks (DIWANS’04), Italy.
[22] M.Ye, C.F.Li, G.H.Chen, J.Wu, (2005), “EECS: An energy efficient clustering scheme
in wireless sensor networks”, Proceedings of the Second IEEE International Performance
Computing and Communications Conference (IPCCC), Pages 535-540.
An Efficient Connection between
Statistical Software and Database
Management System
Sunghae Jun
Department of Statistics, Cheongju University
Chungbuk 360-764 Korea
ABSTRACT
In the big data era, we need to manipulate and analyze big data. As a first
step of big data manipulation, we can consider a traditional database
management system. To discover novel knowledge from a big data environment, we
should analyze the big data. Many statistical methods have been applied to big
data analysis, and most statistical analysis work depends on statistical
software such as SAS, SPSS, or the R project. In addition, a considerable
portion of big data is stored in diverse database systems. But the data types
of general statistical software are different from those of database systems
such as Oracle or MySQL. So, many approaches for connecting statistical
software to a database management system (DBMS) have been introduced. In this
paper, we study an efficient connection between statistical software and a
DBMS. To show its performance, we carry out a case study using a real
application.
Keywords
Statistical software, Database management system, Big data analysis, Database connection,
MySQL, R project.
1. INTRODUCTION
Every day, huge amounts of data are created in diverse fields and stored in
computer systems. These big data are extremely large and complex [1], so it is
very difficult to manage and analyze them. Yet big data analysis is an
important issue in many fields such as marketing, finance, technology and
medicine. Big data analysis is based on statistics and machine learning
algorithms. In addition, data analysis depends on statistical software, and
the data are stored in database systems. So, for big data analysis, we should
manage statistical software and database systems effectively. In this paper,
we consider the R project as the statistical software. R is an environment for
statistical computing, including statistical analysis and graphical display of
data [2], and it provides most of the statistical and machine learning methods
needed for big data analysis. We use MySQL as the database system connected
from R. MySQL is a database management system (DBMS) product that is the most
popular open source database in the world and, like R, it is free software
[3]. So, in our research, we use R and MySQL for an efficient connection
between statistical software and a DBMS. There was earlier work on DB access
through R [4]. It covered
the DB access problems of R and presented the ODBC (open database
connectivity) drivers for connecting R to DBMSs such as MySQL, PostgreSQL, and
Oracle. The authors also introduced the installation and technological
environment for DB access. However, they did not illustrate detailed
approaches for real applications; that is, their work was a conceptual
suggestion for access from R to MySQL. So, in this paper, we perform a more
specific study of the connection between the statistical software R and the
DBMS MySQL. In our case study, we show a detailed and efficient connection of
R to MySQL using a specific data set from the University of California, Irvine
(UCI) machine learning repository [5]. We cover the research background in the
next section. In Section 3, our proposed methodology is presented, and a case
study of the connection is given in Section 4. Lastly, we conclude our study
and offer future work on statistical database systems.
2. RESEARCH BACKGROUND
2.1 Statistical Software
To analyze data, we can consider diverse approaches using statistical
software, and these days there are many statistical software products. SAS
(Statistical Analysis System) is the most popular software for statistical
analysis [6], but it is expensive, so few companies other than large ones use
it. SPSS (Statistical Package for the Social Sciences) is another
representative product [7], but it is also expensive. Minitab [8] and
S-Plus [9] are widely used statistics packages, and none of them is free.
Recently, R has been used in many works for statistical data analysis, and it
is free. In addition, R provides most of the statistical functions included in
SAS or SPSS. R is an open source program, so we can modify R functions for our
own statistical computing, which is a very useful advantage of R. Therefore,
we consider R for the connection to the database system in this research.
2.2 Database Management System
A database is a collection of data, and a database management system (DBMS)
is software for managing a database using the structured query language (SQL)
[10],[11]. Oracle is one of the most popular DBMS products [12], but it is
expensive. MySQL is another DBMS and is the most widely used open source
database in the world [3]; most functions of MySQL are similar to those of
Oracle [3]. So, in this paper, we use MySQL as the DBMS connected to the
statistical software R. To use the MySQL DBMS efficiently, we use the RODBC
package provided through the R CRAN in our research [13].
3. STATISTICAL DATABASE SYSTEM
The main goal of our study is to solve the cost problem of constructing a
statistical database system, because we normally have to buy an additional
product to connect statistical software to a DBMS. For example, for the
connection
between SAS and a DBMS, we need the 'SAS/Access' product as supplementary
software, which is in general expensive. So we tried to make the connection
between statistical software and DBMS without cost; the 'efficiency' in this
paper therefore refers to cost. There are many approaches to connect
statistical software to a DBMS, but most of them require additional products
and only a few are free. So we looked for an approach that connects
statistical software and DBMS without cost. In this paper, we study an
efficient connection between a DBMS and statistical software. We select MySQL
as the DBMS and the R project as the statistical software, because they are
free and provide good functionality. In addition, R and MySQL have strong
performance in statistical computing and database management, respectively,
for constructing a statistical database system [14],[15],[16],[17]. In
general, big data are transformed into a structured data type for statistical
analysis as follows:
Figure 1. From big data to statistical analysis
First, big data are stored in the DB by creating tables. Second, big data are
changed into structured data by preprocessing based on text mining. The data
from both the DB and text mining are then analyzed statistically. We find that
the text mining process is hard work for data preprocessing [18], so table
creation is the more effective approach for big data analysis. To construct a
MySQL DB, we use the console or a graphical user interface (GUI) environment
as follows:
Figure 2. User interface of MySQL
In this paper, we use SQL code in the MySQL console. We also use RODBC as an
ODBC database interface between R and MySQL [13]. In the R system, a package
is a set of additional R functions. R packages are not installed in the basic
R system; if we need to use a package, we have to add it to the R system. We
can search for all packages on the R CRAN and install them from the CRAN [19].
The RODBC package provides efficient functions for ODBC database access, so
our research is based on the RODBC package to connect R to MySQL. To install
RODBC in the R system, we select an R CRAN mirror site. After installing
RODBC, we load the package in the R system as follows;
>library(RODBC)
The R system uses the 'library' function to load a package. With this R code,
we can use all the functions provided by the RODBC package, such as
odbcConnect, sqlFetch, and sqlQuery; they are used in our research for DB
access and connection. To connect to the MySQL DB, we use the 'odbcConnect'
function of the RODBC package as follows;
>db_con = odbcConnect("stat_MySQL")   # User, Password and Database are specified in the DSN configuration
The DSN is 'stat_MySQL', and the 'db_con' object in the R system holds the
result of the connection. In this connection process, we specify the user
name, the password, and the database. If R and MySQL are connected to each
other, we can list the tables of the MySQL DB using the 'sqlTables' function
as follows;
>sqlTables(db_con)
TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE
REMARKS
The result of this function is information about the connected DB and its tables.
3.1 Structure of DB Connection Software
In general, to connect a DBMS to application software, we should use an ODBC
connector [20]. R, as statistical software, also needs an ODBC driver to
access the MySQL DBMS. In this paper, we consider the RODBC package for an
efficient connection between R and MySQL. Figure 3 shows the ODBC connection
between DBMS and statistical software, together with their specific products.
Figure 3. Connection between DBMS and statistical software
Oracle and MySQL are representative DBMS products, and SAS and the R system
are popular software for statistical analysis. A general ODBC program is used
for connecting application software to a DBMS, so there are many ODBC drivers
for diverse DBMS and application products. Our work is focused on the
connection between R and MySQL, and we select RODBC as the ODBC driver. RODBC
is one of many R packages for DB access. RMySQL is another R package for
connecting R and MySQL [21]; this package is also an R interface for accessing
the MySQL DBMS. In addition to RODBC and RMySQL, there are other packages for
connecting R to MySQL. In this paper, we use RODBC for MySQL access. It plays
the role of an ODBC driver, like the SAS connection to a DBMS shown below.
Figure 4. Connection between MySQL/Oracle and SAS
SAS uses ODBC drivers for diverse DBMSs such as MySQL and Oracle, and the
drivers use a data source name (DSN). In this research, we also use a DSN for
the RODBC package. Next, we show the connection between R and MySQL in more
detail.
3.2 Efficient Connection between R and MySQL
The RODBC package of the R system is an efficient ODBC connector. It includes
diverse functions for accessing a DBMS, as follows;
•odbcConnect: opens a connection to an ODBC data source
•sqlFetch: reads a table from the DB into an R data frame
•sqlQuery: submits an SQL query to the DB and returns the result
•sqlSave: writes an R data frame to a table in the DB
We can also use further functions for accessing and manipulating the MySQL DB
with the RODBC package. The process of connecting R and MySQL is as follows;
Figure 5. Connecting process between R and MySQL
Using the RODBC package, the R system gets the necessary data from the MySQL
DB, and we analyze the retrieved data. The R system also accesses MySQL
through the sqlQuery function of RODBC and can create a table for storing the
analysis results produced by the R system. Our process of connection between R
and MySQL is shown below;
Figure 6. Connecting process between R and MySQL
A table of the MySQL DB is transformed into an object in R by the RODBC
connector, so we are able to analyze the object data from the DB table. We can
also perform online analytical processing (OLAP) for data summarization and
visualization. Next, we carry out a case study to verify our work.
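A minimal R sketch of this round trip is shown below, assuming a DSN named
"stat_MySQL" has been configured; the table name mytable, the credentials and
the toy analysis are hypothetical and serve only to show the fetch-analyze-save
pattern with real RODBC functions.

library(RODBC)
con <- odbcConnect("stat_MySQL", uid = "user", pwd = "password")

dat <- sqlFetch(con, "mytable")                       # DB table -> R data frame
result <- data.frame(variable = names(dat),           # a toy analysis in R:
                     missing  = colSums(is.na(dat)))  # missing values per column
sqlSave(con, result, tablename = "analysis_result",   # write the result back to the DB
        rownames = FALSE)
odbcClose(con)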
4. CASE STUDY
To illustrate a case study on a real problem, we used the 'RODBC' package from
the R project [13]. This is software for an ODBC database connection between R
and a DBMS such as MySQL. We also carried out an experiment using an example
data set from the UCI machine learning repository [5].
4.1 UCI Machine Learning Repository
For our case study, we used the "Abalone" data set from the UCI machine
learning repository [5]. This data set consists of 8 variables (columns) and
4,177 observations (rows). The main goal of the data is to predict the age of
abalone from physical measurements. The following table shows the variables
and their values [5].
Table 1. Variables of the Abalone data set
Variable         Data type    Description
Sex              Nominal      M (male), F (female), I (infant)
Length           Continuous   Longest shell measurement
Diameter         Continuous   Perpendicular to length
Height           Continuous   With meat in shell
Whole_weight     Continuous   Whole abalone
Shucked_weight   Continuous   Weight of meat
Viscera_weight   Continuous   Gut weight (after bleeding)
Shell_weight     Continuous   After being dried
Rings            Discrete     +1.5 gives the age in years
The last variable (Rings) is the target variable, and the others are all input
variables. We constructed a MySQL DB using this data set. The original data
file from the UCI machine learning repository is separated by commas, but
MySQL needs a tab-separated file for DB loading. So we transformed the data
type using Excel as follows (an equivalent conversion in R is sketched after
Figure 7).
Figure 7. Data transformation for MySQL loading
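As an alternative to the Excel step, the same conversion can be scripted in R.
The sketch below is only an illustration; the file paths are hypothetical.

# Read the comma-separated UCI file and write it out tab-separated for LOAD DATA INFILE
abalone <- read.csv("abalone.data", header = FALSE)
write.table(abalone, "d:/data/abalone.txt", sep = "\t",
            quote = FALSE, row.names = FALSE, col.names = FALSE)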
To load the text data file into MySQL, we need a table in which to store the
data, so we create the table in the next step.
4.2 DB Creation
We used SQL to create a table for loading the Abalone data set into the MySQL
DBMS as follows;
• CREATE DATABASE case_study;
• USE case_study;
• CREATE TABLE abalone( Sex CHAR(3), Length FLOAT(10), Diameter
FLOAT(10), Height FLOAT(10), Whole_weight FLOAT(10),
Shucked_weight FLOAT(10), Viscera_weight FLOAT(10), Shell_weight
FLOAT(10), Rings INT(5));
• LOAD DATA INFILE 'd:/data/abalone.txt' INTO TABLE abalone;
• SELECT * FROM abalone;
Using the above SQL code, we constructed a table for the Abalone data in the
MySQL DB (case_study). Next, we connected the abalone table in the case_study
DB to the R system.
4.3 Connecting R to MySQL
We used the RODBC package for connecting R to MySQL as follows;
>library(RODBC)
>abalone_con=odbcConnect("abalone_ODBC")
>sqlTables(abalone_con)
TABLE_SCHEM TABLE_NAME TABLE_TYPE
case_study abalone TABLE
>vars=sqlQuery(abalone_con, "SELECT sex, diameter, rings FROM
abalone")
Sex Diameter Rings
1 M 0.365 15
2 M 0.265 7
3 F 0.420 9
4 M 0.365 10
5 I 0.255 7
…
Using the above R code, we saved three variables of the abalone data set into
the 'vars' R object. From the result of the SQL query executed by the sqlQuery
function, we found that the abalone table had been created correctly. This
function enables the use of SQL within the R system, so we can analyze the
abalone data using the analytical functions of the R system. Next, the results
of the data analysis are shown.
4.4 Data Analysis
First, we performed data summarization of the three variables using the
'summary' function of the R system as follows;
>summary(vars)
sex diameter rings
F:1307 Min. :0.0550 Min. : 1.000
I:1342 1st Qu.:0.3500 1st Qu.: 8.000
M:1528 Median :0.4250 Median : 9.000
Mean :0.4079 Mean : 9.934
3rd Qu.:0.4800 3rd Qu.:11.000
Max. :0.6500 Max. :29.000
This function provides a frequency table or descriptive statistics according
to the data type (continuous or nominal). For example, diameter is a
continuous variable, so we get its minimum, 25th percentile, median, mean,
75th percentile, and maximum values. Next, we carried out data visualization
as follows;
>boxplot(vars$diameter)
Figure 8. Boxplot: data visualization of MySQL table
This shows a boxplot of the diameter variable of the abalone table. Using the
graphical functions supported by the R system, we can also obtain diverse
visualization results such as histograms, plots, and so on. Lastly, we
constructed a regression model using the 'lm' function as follows;
>regression_result=lm(rings~diameter, data=vars)
>summary(regression_result)
Residuals:
Min 1Q Median 3Q Max
-5.19 -1.69 -0.72 0.91 16.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3186 0.1727 13.42 <2e-16 ***
diameter 18.6699 0.4115 45.37 <2e-16 ***
R-squared: 0.3302, Adj. R-squared: 0.3301
Regression is a popular model in statistical analysis. The dependent and
independent variables are 'rings' and 'diameter', respectively, so we obtained
the following regression equation: Rings = 2.3186 + 18.6699 × Diameter. Thus,
our case study illustrated the connection between R and MySQL in a real
application.
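As outlined in Section 3.2, the analysis result can also be written back to
the MySQL database; a short sketch is given below. The table name
regression_coef is our own choice, while regression_result and abalone_con are
the objects created above.

# Store the regression coefficients in a new table of the case_study database
coef_table <- data.frame(term     = names(coef(regression_result)),
                         estimate = coef(regression_result))
sqlSave(abalone_con, coef_table, tablename = "regression_coef", rownames = FALSE)
odbcClose(abalone_con)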
5. CONCLUSION
In this paper, we studied an efficient connection between a DBMS and
statistical software. We used the R system and MySQL as the statistical
software and the DBMS, respectively, and the RODBC package was used for the DB
connection. After connecting R and MySQL, we analyzed the data of a MySQL
table, and this approach can be expanded to big data analysis. In our
case study, we illustrated how our approach can be applied in a real
application, using the Abalone data set from the UCI machine learning
repository. Our result contributes to work related to big data analysis. In
addition, we can directly analyze the data stored in a DBMS with statistical
methods. In future work, we will expand the scope of the connection between
DBMS and statistical software to more products.
6. DISCUSSION
The biggest problem of a statistical database system is the cost of connecting
the statistical software and the DBMS. For example, we should buy the
'SAS/Access' product additionally and install it on the SAS base system to
connect SAS to a DBMS. Generally this supplementary product is expensive, so
most users have had difficulty using a statistical database system. In this
paper, we selected the R system as the statistical software instead of SAS,
and we used RODBC as the ODBC connector instead of SAS/Access, because R and
RODBC are both free. Yet their performance is similar to that of SAS, and for
newer analytical functions such as statistical learning theory and machine
learning algorithms they even surpass SAS.
REFERENCES
[1] Sathi, A. Big Data Analytics. An Article from IBM Corporation, 2012.
[2] Heiberger, R. M., and Neuwirth, E.R through Excel – A Spreadsheet Interface for
Statistics, Data Analysis, and Graphics. Springer, 2009.
[3] MySQL, The World’s most popular open source database. http://www.mysql.com,
accessed on October 2013.
[4] Sim, S., Kang, H., and Lee, Y. Access to Database through the R-Language. The
Korean Communications in Statistics, 15, 1 (2008), 51-64.
[5] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, accessed on October
2013.
[6] SAS, http://www.sas.com,accessed on October 2013.
[7] SPSS, http://www-01.ibm.com/software/analytics/spss/, accessed on October 2013.
[8] Minitab, http://www.minitab.com, accessed on October 2013.
[9] S-Plus, http://solutionmetrics.com.au/products/splus/, accessed on October 2013.
[10]Wikipedia, the free encyclopedia. http://en.wikipedia.org, accessed on October 2013.
[11] Date, C. J. An Introduction to Database Systems. 7th edition, Addison-Wesley, 2000.
[12]Oracle, http://www.oracle.com, accessed on October 2013.
[13]Ripley, B.Package RODBC. CRAN R-Project, 2013.
[14]R-bloggers, On R versus SAS. http://www.r-bloggers.com/on-r-versus-sas/, accessed on
December, 2013.
[15] LinkedIn, Advanced Business Analytics, Data Mining and Predictive Modeling.
http://www.linkedin.com/groups/SAS-versus-R-35222.S.65098787, accessed on
December, 2013.
[16]Clever Logic, MySQL vs. Oracle Security, http://cleverlogic.net/articles/mysql-vs-
oracle, accessed on December, 2013.
[17]Find The Best, Oracle vs MySQL, http://database-management-
systems.findthebest.com/saved_compare/Oracle-vs-MySQL, accessed on December,
2013.
[18]Han, J., and Kamber, M. Data Mining Concepts and Techniques. Morgan Kaufmann,
2001.
[19]R system, The R Project for Statistical Computing. http://www.r-project.org, accessed
on October 2013.
[20]Spector, P. Data Manipulation with R, Springer, 2008.
[21]James, D. A., and DebRoy, S.Package RMySQL. CRAN R-Project, 2013.
Pragmatic Approach to Component Based
Software Metrics Based on Static Methods
S. Sagayaraj
Department of Computer Science
Sacred Heart College, Tirupattur
M. Poovizhi
Department of Computer Science
Sacred Heart College, Tirupattur
ABSTRACT
Component-Based Software Engineering (CBSE) is an emerging technique for the
reuse of software. This paper presents component-based software metrics by
investigating improved measurement techniques. Two types of metrics are used:
static metrics and dynamic metrics. This research work presents the measured
metric values for the complexity metrics and the criticality metrics. The
static metrics are applied to an E-healthcare application that is developed
from reusable software components, and the value of each metric is analyzed
for the application. The measured metric values provide evidence for the
reusability and good maintainability of the component-based software system.
Keywords
Component Based Software Engineering, Component Based Software Metrics, Component
Based Software System.
1. INTRODUCTION
The demand for new software applications is currently increasing at an
exponential rate, while the number of qualified and experienced professionals
required for creating new software applications is not increasing
commensurately [1]. In software reuse, applications are built from existing
components, primarily by assembling and replacing interoperable parts.
Software professionals have therefore recognized reuse as a powerful means of
potentially overcoming this software crisis, and it promises significant
improvements in software productivity and quality [2].
There are two approaches to code reuse: develop the reusable code from
scratch, or identify and extract the reusable code from already developed code
[3]. Even for organizations with experience in developing software, there is
an extra cost in developing reusable components from scratch to build and
strengthen their reusable software reservoir. The cost of developing the
software from scratch can be saved by identifying and extracting the reusable
components from already developed and existing software systems or legacy
systems [4]. However, the problem of how to recognize reusable components in
existing systems has remained relatively unexplored. In
both cases, whether the organization is developing software from scratch or
reusing code from already developed projects, there is a need to evaluate the
quality of the potentially reusable piece of software. Metrics are essential
to demonstrate the quality of the components [5].
Software metrics are an essential part of the state of the practice in
software engineering. Goodman describes software metrics as "the continuous
application of measurement-based techniques to the software development
process and its products to supply meaningful and timely management
information, together with the use of those techniques to improve that process
and its products" [6]. Software metrics can serve one of four functions: to
understand, evaluate, control, or predict.
Various attributes that determine the quality of software include
maintainability, defect density, fault proneness, normalized rework,
understandability, reusability, etc. [5]. To achieve both the quality and the
productivity objectives, it is always recommended to go for software reuse,
which not only saves the time taken to develop the product from scratch but
also delivers almost error-free code, as the code has already been tested many
times during its software development [7].
During the last decade, the software reuse and software engineering
communities have come to a better understanding of component-based software
engineering. The development of a reuse process and repository produces a base
of knowledge that improves in quality after every reuse, minimizing the amount
of development work necessary for future projects and ultimately reducing the
risk of new projects that are based on repository knowledge [8].
Component-based software development (CBSD) centers on building large software
systems by integrating previously existing software components. By enhancing
the flexibility and maintainability of systems, this approach can potentially
be used to reduce software development costs, assemble systems rapidly, and
reduce the spiraling maintenance burden associated with the support and
upgrade of large systems [9].
The paper is organized as follows. Related work on component-based software
metrics is provided in Section 2. The component-based static and dynamic
metrics are listed in Section 3. The details of the implementation are
presented in Section 4, and the analysis of the complexity and criticality
metrics is described in Section 5. Finally, the last section concludes the
paper and offers further research directions in this area.
2. RELATED WORKS
Many works have been carried out in the area of component-based software
metrics. Some of them are listed below.
Nael Salman (2006) focused mainly on the complexity that results from factors
related to system structure and connectivity [10]. A new set of properties
that a component-oriented complexity metric must possess was also defined, and
the metrics were evaluated using these properties. A case study was conducted
to determine the power of the complexity metrics in predicting integration and
maintenance effort; its results revealed that component-oriented complexity
metrics can be of great value in predicting both integration and maintenance
efforts.
Arun Sharma, Rajesh Kumar, and P. S. Grover (2007) surveyed a few existing
component-based reusability metrics [11]. These metrics give a broad view of a
component's understandability, adaptability, and portability. The work also
expresses the analysis in terms of quality factors related to reusability, in
an approach that helps significantly in assessing existing components for
reusability.
V. Lakshmi Narasimhan, P. T. Parthasarathy, and M. Das (2009) analyzed,
evaluated and benchmarked a series of metrics proposed by various researchers,
using several large-scale openly available software systems [12]. A systematic
analysis of the values of the various metrics was carried out and several key
inferences were drawn from them. A number of useful conclusions were drawn
from the metric evaluations, including inferences on the complexity,
reusability, testability, modularity and stability of the underlying
components.
Misook Choi, Injoo J. Kim, Jiman Hong and Jungyeop Kim (2009) suggested
component-based metrics applying the strength of dependency between classes;
to increase the quality of components, they proposed metrics that apply the
strength of dependency between classes so as to measure it precisely [13]. In
addition, they proved the theoretical soundness of the proposed metrics using
the axioms of Briand et al. and showed the accuracy and practicality of the
proposed metrics through a comparison with conventional metrics in the
component development phase.
Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani and Marzanah A.
Jaba (2011) considered the dependency between components to be one of the most
important issues affecting the structural design of a Component-
Based Software System (CBSS) [14]. Two sets of metrics, namely Component
Information Flow Metrics and Component Coupling Metrics, are proposed based on
the concept of component information flow from the CBSS designer's point of
view.
Jianguo Chen, Hui Wang, Yongxia Zhou and Stefan D. Bruda (2011) presented such
efforts by investigating improved measurement tools and techniques, i.e.,
effective software metrics [15]. Coupling, cohesion and interface metrics are
newly proposed and evaluated.
The previous research describes the work done with a variety of
component-based software metrics. This paper deals with the static and dynamic
metrics of component-based software. The work is extended by developing an
E-healthcare application, and the results are obtained for the static metrics.
3. COMPONENT BASED SOFTWARE METRICS
The traditional software metrics focus on non-CBSS and are inappropriate
to CBSS mainly because the component size is normally not known in
advance. Inaccessibility of the source code for some components prevents
comprehensive testing. So, the component based metrics are defined to
evaluate the component based application.
There are two types of metrics considered in this paper for measuring the
values.
 Static Metric
Static metrics cover the complexity and the criticality within an integrated
component. They are collected from a static analysis of the component
assembly; the complexity and criticality metrics are intended to be used
early, during the design stage. The list of static metrics [16] is provided in
Table 1.
 Dynamic metric
Dynamic metrics are gathered during execution of the complete application
and are meant to be used at the implementation stage. The dynamic metrics
are listed in Table 2 [15].
Table 1. Static Metrics
Sl.no Metric Name Formulae
1 Component Packing Density Metric CPD = #constituents / #components
2 Component Interaction Density Metric CID = #I / #Imax
3 Component Incoming Interaction Density CIID = #I in / #Imax in
4 Component Outgoing Interaction Density COID = #I out / #Imax out
5 Component Average Interaction Density CAID = (sum of CID of each component) / #components
6 Bridge Criticality Metric CRIT bridge = #bridge_component
7 Inheritance Criticality Metric CRIT inheritance = #root_component
8 Link Criticality Metric CRIT link = #link_component
9 Size Criticality Metric CRIT size = #size_component
10 #Criticality Metric CRIT all = CRIT bridge + CRIT inheritance + CRIT link + CRIT size
Table 2. Dynamic Metrics
Sl.no Metric Name Formulae
1 Number of Cycles (NC) NC = #cycles
2 Average Number of Active Components
3 Active Component Density (ACD)
4 Average Active Component Density
5 Peak Number of Active Components ACΔt = max {AC1, ..., ACn}
4. IMPLEMENTATION
The E-Healthcare application is developed to measure the static metrics.
The application is designed with a number of components; the metrics are
applied to the application and the values are measured. There are five
modules in the E-Healthcare application.
4.1 Admin
The Admin module stores the user, doctor and admin details. The admin is
responsible for managing every record in the database.
4.2 Appointments and payments
This module is used to add or drop doctor details and helps users get
appointments. The admin is the person responsible for adding new doctor
details; existing doctors can also be deleted by the admin.
4.3 Diagnosis and health
The Diagnosis and Health module is used to retrieve users' diagnosis
details. The information of users who take treatment through the
application is stored in the database.
4.4 First aid and E-certificate
This module is used to get blood bank details for a required blood group.
First aid medicine details for a particular disease are provided to the users,
and the user can get the treatment type, which helps in emergencies.
4.5 Symptoms and alerts
The Symptoms and Alerts module is used to check the BP level of the user.
The patient information is retrieved from the database, and the symptoms
and causes of a disease help users to prevent it.
The pictorial representation of the modules in the application is shown in
Figure 1.
Figure 1. Modules in E-Healthcare Application (Admin, Appointments and Payments,
Diagnosis and Health, First Aid and E-Certificates, Symptoms and Alerts)
Components are created to develop the whole application. The components
(admin, appointments and payments, diagnosis and health, firstaid and
e-certificate, symptoms and alerts, DBHelper, EhealthBL) are required to
complete the component-based application called E-Healthcare. The static
metrics are applied to these components, and each component's value is
measured according to the metric formula. The analysis of the metrics is
carried out manually on the application; the metric values are calculated
with the help of the database tables, web page forms and components.
5. ANALYSIS
The analysis is made to show that the CBSS has good reusability,
maintainability and independence.
The analyses of the Component Packing Density metric, the Component
Interaction Density metrics (incoming, outgoing, average) and the
criticality metrics are as follows:
5.1 Component Packing Density Metric
CPD measures the number of operations that each component contains.
The CPD is defined as the ratio of #constituents (LOC, objects/classes,
operations, classes and/or modules) to #components, where
#constituent = one of the following: LOC, objects/classes, operations,
classes and/or modules
#component = number of components
For this metric, the number of operations of each component is listed in Table 3.
Table 3. Component packing Density
S.No Component Name No. of operations
1 Admin 3
2 Appointments and payments 4
3 Diagnosis and health 4
4 Firstaid and e-certificate 6
5 Symptoms and alerts 5
6 DBHelper 1
7 EhealthBL 19
CPD = (3 + 4 + 4 + 6 + 5 + 1 + 19) / 7 = 42 / 7 = 6
Hence, the CPD metric gives the average number of operations that each
component contains.
5.2 Component Interaction Density Metric
The CID is defined as the ratio of actual interactions to potential ones. A
higher interaction density causes a higher complexity in the interaction [17].
The CID metric is applied to the E-Healthcare application. The measured
value of the actual interactions in each component of E-Healthcare is
illustrated in Table 4.
#I = no. of actual interactions
#Imax = no. of maximum available interactions.
Table 4. Actual interactions
S.No Name of the page No. of actual interactions
1 Registration.aspx 4
2 Postquestion.aspx 2
3 Search.aspx 5 i/p, 5 o/p
4 Doctormanagement.aspx 6
5 Diagnosis.aspx 1
6 Searchmedicine.aspx 2 i/p, 3 o/p
7 Medicine.aspx 5
8 Bloodbank.aspx 4
9 Firstaidsuggestion.aspx 2
10 Medicalcertificate.aspx 3
11 Treatmenttype.aspx 1 i/p, 2 o/p
12 Symptoms.aspx 1 i/p, 3 o/p
Total 51
The actual interaction value between the components is 51, and the
maximum number of available interactions with other components is 87.
CID = 51 / 87 = 0.586
This metric brings out the number of incoming and outgoing interactions
available in each component and helps to identify which components have
the greatest connectivity with the other components.
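A minimal Python sketch of this ratio, using the totals measured for the E-Healthcare application (illustrative only, not the authors' tooling):

# Illustrative sketch: Component Interaction Density as the ratio of actual
# interactions to the maximum available interactions.
def interaction_density(actual: int, maximum: int) -> float:
    return actual / maximum

# Totals measured for the E-Healthcare application (Table 4).
cid = interaction_density(actual=51, maximum=87)
print(f"CID = {cid:.3f}")  # CID = 0.586

The same ratio form underlies the CIID and COID metrics in the next two subsections, computed from the incoming and outgoing interaction counts respectively.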
5.3 Component Incoming Interaction Density
CIID is defined as the ratio of the number of incoming interactions to the
maximum number of incoming interactions. A higher interaction density
causes a higher complexity in the interaction. The number of actual
incoming interactions in each component is shown in Table 5, where
#I in = no. of incoming interactions
#Imax in = maximum no. of available incoming interactions.
Table 5. Incoming Interactions
S.No Name of the page No. of incoming interactions
1 Registration.aspx 4
2 Postquestion.aspx 1
3 Search.aspx 5
4 Doctormanagement.aspx 4
5 Diagnosis.aspx 1
6 Searchmedicine.aspx 2
7 Medicine.aspx 5
8 Bloodbank.aspx 4
9 Firstaidsuggestion.aspx 1
10 Medicalcertificate.aspx 2
11 Treatmenttype.aspx 4
12 Symptoms.aspx 4
Total 37
The number of incoming interactions is 37, while the maximum number of
available incoming interactions is 51; out of the 51 possible interactions,
only 37 actually link to other components.
CIID = 37 / 51 = 0.725
The CIID value of 0.725 clearly states that the incoming interaction density
with the other components is very high.
5.4 Component Outgoing interaction Density
COID is defined as the ratio of the number of outgoing interactions to the
maximum number of outgoing interactions. A higher interaction density
causes a higher complexity in the interaction. The number of outgoing
interactions in each component is shown in Table 6, where
#I out = no. of outgoing interactions
#Imax out = maximum no. of available outgoing interactions.
Table 6. Outgoing Interactions
S.No Name of the page No. of outgoing interactions
1 Registration.aspx 2
2 Postquestion.aspx 1
3 Search.aspx 5
4 Doctormanagement.aspx 1
5 Diagnosis.aspx 3
6 Searchmedicine.aspx 3
7 Medicine.aspx 1
8 Bloodbank.aspx 3
9 Firstaidsuggestion.aspx 1
10 Medicalcertificate.aspx 1
11 Treatmenttype.aspx 4
12 Symptoms.aspx 3
Total 28
The number of outgoing interactions is 28, while the maximum number of
available outgoing interactions is 46; only 28 outgoing interactions are
actually connected to other components.
COID = 28 / 46 = 0.608
The calculated value of 0.608 shows that there are substantial outgoing
interactions between the components.
5.5 Component Average Interaction Density
CAID is the sum of the CID of each component divided by the number of
components:
CAID = (sum of the interaction density of the n components) / #components
where #components is the number of components in the system.
Admin: The actual interfaces (incoming and outgoing) of the admin
component are listed below. The sum of the interaction density values for
the admin component is shown in Table 7.
Table 7. Sum of CID for admin component
S.No Name of the page Sum of CID for admin component
1 Registration.aspx 4 out of 13 (only 4 of its 13 interfaces interact with other components)
2 Login.aspx 2 out of 2
3 Postquestion.aspx 1 out of 1
The summation of CID for the admin component is 7/16: seven actual
interactions out of sixteen interfaces. This component has good reliability.
Appointments and payments: The sum of the interaction density of the
appointments and payments component is shown in Table 8. The sum
considers both the incoming and outgoing interfaces of the component.
Table 8. Sum of CID for appointments and payments component
S.No Name of the page Sum of CID for appointments and payments component
1 Search.aspx : 2 out of 2
2 To get appointment : 4 out of 4
3 Doctormanagement.aspx : 4 out of 6
The summation of CID for the appointments and payments component is
10/12: 10 of its 12 interfaces link to other components.
Diagnosis and health: The sum of the interaction density of the diagnosis
and health component is shown in Table 9.
Table 9. Sum of CID for diagnosis and health component
S.No Name of the page Sum of CID for diagnosis and health component
1 Diagnosis.aspx : 1 out of 2
2 Searchmedicine.aspx : 2 out of 2
3 Medicine.aspx : 4 out of 5
The summation of CID for the diagnosis and health component is 7/9: 7 of
its 9 interfaces interact with other components.
Firstaid and e-certificate: Table 10 shows the sum of CID values for the
firstaid and e-certificates component.
Table 10. Sum of CID for firstaid and e-certificates
S.No Name of the page Sum of CID for firstaid and e-certificate
1 Bloodbank.aspx : 1 out of 1; : 3 out of 3
2 Firstaidsuggestion.aspx : 1 out of 1
3 Medicalcertificate.aspx : 2 out of 4
4 Treatmenttype.aspx : 1 out of 1; : 3 out of 7
The summation of CID for the firstaid and e-certificates component is
11/17: out of 17 interfaces, only 11 interactions are connected with the rest
of the components.
Symptoms and alerts: Table 11 shows the sum of CID values for the
symptoms and alerts component.
Table 11. Sum of CID for symptoms and alerts
S.No Name of the page Sum of CID for symptoms and alerts
1 Searchpatient.aspx : 1 out of 1; : 3 out of 3
The summation of CID for the symptoms and alerts component is 4/4: this
component is completely connected with the other components.
The Component Average Interaction Density metric takes the ratio between
the sum of the CID of each component and the number of existing
components:
CAID = (7/16 + 10/12 + 7/9 + 11/17 + 4/4) / 7 = 0.528
The measured value for this metric indicates good interaction density, and
hence good reliability, across the components.
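The following Python sketch (illustrative only) recomputes CAID from the per-component CID sums tabulated above, assuming, as in the text, a system of seven components:

from fractions import Fraction

# Illustrative sketch: CAID as the sum of each component's CID divided by
# the number of components in the system.
cid_per_component = [
    Fraction(7, 16),   # Admin
    Fraction(10, 12),  # Appointments and payments
    Fraction(7, 9),    # Diagnosis and health
    Fraction(11, 17),  # Firstaid and e-certificates
    Fraction(4, 4),    # Symptoms and alerts
]

total_components = 7  # DBHelper and EhealthBL contribute no listed interfaces
caid = float(sum(cid_per_component)) / total_components
print(f"CAID = {caid:.3f}")  # CAID = 0.528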
5.6 Bridge Criticality Metric
The bridge criticality metric is used to identify bridge components. A
component that acts as a bridge between other components is a bridge
component:
CRIT bridge = #bridge_component
Out of the 7 components, EhealthBL acts as a bridge between the other
components and between the code-behind and the database; it contains all
the queries to store and retrieve information.
So, the bridge_component value is 1. This value explicitly tells that one
component operates as a bridge component for all the other components.
5.7 Inheritance Criticality Metric
Inheritance is deriving a new component from an existing component; the
existing component is called the root component:
CRIT inheritance = #root_component
An interface is inherited from the existing component. The root components
are:
 Symptoms and alerts (patient info is inherited by the diagnosis component)
 EhealthBL (the query is inherited from the base query)
So, the root component value is 2, which shows that object-oriented
programming concepts are utilized between the components.
5.8 Link Criticality Metric
The link criticality metric is used to identify link components. A component
that provides a link to other components is called a link component:
CRIT link = #link_component
The link component value is 1 (DBHelper). This value shows that the
component acts as a link between the code-behind pages and the database.
5.9 Size Criticality Metric
The size criticality metric is used to identify components that exceed the
critical size level; such a component is called a size component:
CRIT size = #size_component
The size critical level is 60 lines per component. No component exceeds
this level, so the size component value is 0.
5.10 # Criticality Metric
The sum of the bridge criticality, inheritance criticality, link criticality and
size criticality is known as the criticality metric:
CRIT all = CRIT bridge + CRIT inheritance + CRIT link + CRIT size
= 1 + 2 + 1 + 0 = 4
The compound value of 4 shows that considerable criticality arises in the
system.
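A minimal sketch of this summation, using the counts measured above (illustrative Python, not the authors' tooling):

# Illustrative sketch: the compound criticality metric as the sum of the four
# criticality counts measured for the E-Healthcare components.
crit_bridge = 1       # EhealthBL bridges the other components and the database
crit_inheritance = 2  # Symptoms and alerts, EhealthBL act as root components
crit_link = 1         # DBHelper links the code-behind pages to the database
crit_size = 0         # no component exceeds the 60-line critical level

crit_all = crit_bridge + crit_inheritance + crit_link + crit_size
print(f"CRIT_all = {crit_all}")  # CRIT_all = 4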
Threshold Value
The threshold value is fixed at 0.5 and is used to compare against the
computed value of each metric. The comparison with this threshold value
checks whether the metric value is increasing or decreasing with respect to
reusability and maintainability. Table 12 shows the result of the comparison
with the threshold value.
Table 12. Comparison with threshold value.
Metric Name Comparison with Threshold Value
Component Packing Density Metric Increasing
Component Interaction Density Metric Increasing
Component Incoming Interaction Density Increasing
Component Outgoing Interaction Density Increasing
Component Average Interaction Density Increasing
Bridge Criticality Metrics Increasing
Inheritance Criticality Metrics Increasing
Link Criticality Metrics Increasing
Size Criticality Metrics Decreasing
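For illustration, a short Python sketch (not part of the original study) reproducing the comparison in Table 12, assuming the metric values computed in the preceding subsections:

# Illustrative sketch: comparing each computed metric value against the
# fixed threshold of 0.5, as summarized in Table 12.
THRESHOLD = 0.5

metric_values = {
    "Component Packing Density": 6.0,
    "Component Interaction Density": 0.586,
    "Component Incoming Interaction Density": 0.725,
    "Component Outgoing Interaction Density": 0.608,
    "Component Average Interaction Density": 0.528,
    "Bridge Criticality": 1,
    "Inheritance Criticality": 2,
    "Link Criticality": 1,
    "Size Criticality": 0,
}

for name, value in metric_values.items():
    trend = "Increasing" if value > THRESHOLD else "Decreasing"
    print(f"{name}: {trend}")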
6. CONCLUSIONS
Building software systems with reusable components brings many
advantages to organizations. Reusability has several direct and indirect
factors such as cost, effort, and time. This paper discussed various aspects
of reusability for component-based systems and gave an insight into various
reusability metrics for component-based systems. The quality of the
components is measured by applying the metrics to an E-Healthcare
application in the electronic commerce domain. The component-based
metrics help improve the quality of the design components and develop
component-based systems with good maintainability, reusability, and
independence.
Most of the metrics can be enhanced in the future, and such enhancements
help to add features later. The demand for new software applications is
currently increasing at an exponential rate, so future enhancements will
help to fulfil those requirements. The dynamic metric analysis can be
applied to the component-based software application and validated, and
based on the applications, enhanced metrics can be proposed for
component-based software systems.
REFERENCES
[1] Dr. Nedhal A. Al Saiyd, Dr. Intisar A. Al Said, Ahmed H. Al Takrori, Semantic-Based
Retrieving Model of Reuse Software Component, IJCSNS International Journal of
Computer Science and Network Security, VOL.10 No.7, July 2010.
[2] Joaquina Martín-Albo, Manuel F. Bertoa, Coral Calero, Antonio Vallecillo, Alejandra
Cechich and Mario Piattini, CQM: A Software Component Metric Classification
Model, IEEE Transactions.
[3] Anas Bassam AL-Badareen, Mohd Hasan Selamat, Marzanah A. Jabar, Jamilah Din,
Sherzod Turaev, Reusable Software Component Life Cycle, International Journal of
Computers, Issue 2, Volume 5, 2011.
[4] Chintakindi Srinivas, Dr.C.V.Guru rao, Software Reusable Components With
Repository System, International Journal of Computer Science & Informatics, Volume-
1, Issue-1,2011
[5] Parvinder S.Sandhu, Harpreet Kaur, and Amanpreet Singh, Modeling of Reusability of
Object oriented Software System, World Academy of Science, Engineering and
Technology 56 2009.
[6] Sarbjeet Singh, Manjit Thapa, Sukhvinder Singh and Gurpreet Singh, International
Journal of Computer Applications (0975 – 8887), Volume 8, No. 12, October 2010.
[7] Linda L. Westfall, Seven steps to designing a software metrics, Principles of software
measurement services.
[8] K.S. Jasmine and R.Vasantha, DRE – A Quality metric for Component Based Software
Products, World Academy of Science, Engineering and Technology 34 2007.
[9] Iqbaldeep Kaur, Parvinder S. Sandhu, Hardeep Singh, and Vandana Saini, Analytical
Study of Component Based Software Engineering, World Academy of Science,
Engineering and Technology 50 2009.
[10]Nael Salman, Complexity metrics as predicators of maintainability and integrability of
software components, Journal of arts and science, May 2006.
[11]Arun Sharma, Rajesh Kumar, and P. S. Grover, A critical survey of reusability aspects
for component-Based systems, World academy of science, Engineering and
Technology 33 2007.
[12]V. Lakshmi Narasimhan, P. T. Parthasarathy, and M. Das, Evaluation of a suite of
metrics for CBSE, Issues in informing science and information technology, Vol 6,
2009.
[13]Misook Choi, Injoo J. Kim, Jiman Hong, Jungyeop Kim, Component-Based Metrics
Applying the Strength of Dependency between Classes, ACM Journal, March 2009.
[14]Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani, Marzanah A.Jabar,
Component-based Software System Dependency Metrics based on Component
Information Flow Measurements, ICSEA 2011.
[15]Jianguo Chen, Hui Wang, Yongxia Zhou, Stefan D.Bruda, Complexity Metrics for
Component-based Software Systems, International Journal of Digital Content
Technology and its Applications. Vol.5, No.3, March 2011.
[16]V. Lakshmi Narasimhan, and Bayu Hendradjaya, Theoretical Considerations for Software
Component Metrics, World Academy of Science, Engineering and Technology 10
2005.
[17]E. S. Cho, M.S. Kim, S.D. Kim, Component Metrics to Measure Component Quality,
the 8th Asia-Pacific Software Engineering Conference (APSEC), Macau, 2001, pp.
419-426.
SDI System with Scalable Filtering of
XML Documents for Mobile Clients
Yi Yi Myint
Department of Information and Communication Technology
University of Technology (Yatanarpon Cyber City)
Pyin Oo Lwin, Mandalay Division, Myanmar
Hninn Aye Thant
Department of Information and Communication Technology
University of Technology (Yatanarpon Cyber City)
Pyin Oo Lwin, Mandalay Division, Myanmar
ABSTRACT
As the number of users grows and the amount of available information becomes ever larger,
information dissemination applications are gaining popularity for distributing data to end
users. A Selective Dissemination of Information (SDI) system delivers the right
information to the right users based upon their profiles. Typically, profiles are represented
using the Extensible Markup Language (XML), and XML query languages support
query-indexing techniques in SDI systems. As a consequence of these advances, mobile
information retrieval is crucial for sharing the vast information available from diverse data
sources. However, the inherent limitations of mobile devices require the information
delivered to mobile clients to be highly personalized and consistent with their profiles. In
this paper, we address the issue of scalable filtering of XML documents for mobile clients.
We describe an efficient indexing mechanism that enhances the XFilter algorithm with a
modified Finite State Machine (FSM) approach and can quickly locate and evaluate
relevant profiles. Finally, our experimental results show that the proposed indexing method
outperforms the original XFilter algorithm in terms of filtering time.
Keywords
XML, FSM, scalable filtering, SDI.
1. INTRODUCTION
Nowadays the SDI system is becoming an increasingly important research
area and industrial topic. There is a clear trend towards creating new
applications for small and light computing devices such as cell phones and
PDAs. Amongst these new applications, mobile information dissemination
applications (e.g. personalized electronic newspaper delivery, e-commerce
site monitoring, headline news, and alerting services for digital libraries)
deserve special attention.
Recently, there have been a number of efforts to build efficient large-scale
XML filtering systems. In an XML filtering system [4], constantly arriving
streams of XML documents are passed through a filtering engine that
matches documents to queries and routes the matched documents
accordingly. XML filtering techniques comprise a key component of
modern SDI applications.
XML [3] is becoming a standard for information exchange and a textual
representation of data designed to describe content, especially on the
internet. The basic mechanism used to describe user profiles in XML
format is the XPath query language, a language for addressing parts of an
XML document. However, this technique often has a restricted capability to
express user interests and is unable to properly capture the semantics of the
user requirements. Expressing deeply personalized profiles therefore
requires a querying power similar to what SQL provides on relational
databases. Moreover, as user profiles are complex in a mobile environment,
a language more powerful than XPath is needed. In this case, the choice is
XML-QL. XML-QL [7] has more expressive power than XPath and is
considered the most powerful among all XML query languages. XML-QL's
querying power and its elaborate CONSTRUCT statement allow the format
of the query results to be specified.
The rest of the paper is organized as follows: Section 2 briefly summarizes
the related works. Section 3 describes the proposed system architecture and
its components. The operation of the system that is how the query index is
created, the operation of the finite state machine and the generation of the
customized results are explained in Section 4. Section 5 gives the
performance evaluation of the system. Finally Section 6 concludes the
paper.
2. RELATED WORKS
We now introduce some existing XML filtering methods. XFilter [1] was
one of the early works. The XFilter system is designed and implemented for
pushing XML documents to users according to their profiles expressed in
XML Path Language (XPath). XFilter employs a separate FSM per path
query and a novel indexing mechanism to allow all of the FSMs to be
executed simultaneously during the processing of a document. A major
drawback of XFilter is its lack of expressiveness.
In addition, XFilter does not execute the XPath queries to generate partial
results; as a result, the whole document is pushed to the user when it
matches a user's profile. This prevents XFilter from being used in mobile
environments, because the limited capability of mobile devices is not
enough to handle the entire document. XFilter also does not exploit
commonalities between queries, i.e. it produces one FSM per query.
This observation motivated us to develop mechanisms that employ only a
single FSM for queries which share a common element structure.
YFilter [2] overcomes the disadvantage of XFilter by using
Nondeterministic Finite Automata (NFA) to emphasize prefix sharing. The
resulting shared processing provided tremendous improvements to the
performance of structure matching but complicated the handling of value-
based predicates. However, the ancestor/descendant relationship introduces
more matching states, which may result in the number of active states
increasing exponentially. Post processing is required for YFilter.
FoXtrot [5] is an efficient XML filtering system which integrates the
strengths of automata and distributed hash tables to create a fully distributed
system. FoXtrot also describes different methods for evaluating value-based
predicates. The performance evaluation demonstrates that it can index
millions of queries and attain an excellent filtering throughput. However,
FoXtrot requires extensions of its query language to reach full XPath or
sufficient expressiveness for user profiles.
The NiagaraCQ system [6] uses XML-QL to express user profiles. It
provides scalability through query grouping and caching techniques.
However, its query grouping is derived from execution plans, which differs
from our proposed method, and the execution times of its queries do not
make such planning a viable candidate for mobile environments.
Accordingly, our system addresses the above problems and reduces the
filtering time as much as possible.
3. PROPOSED SYSTEM ARCHITECTURE
We first present a high-level overview of our XML filtering system and
then describe the XML-QL language that we use to specify the user profiles
in this work. The overall architecture of the system is depicted in Figure 1.
Figure 1. Overall architecture of the system
User profiles describe the information preferences of individual users. These
profiles may be created by the users themselves, e.g., by choosing items in a
Graphical User Interface (GUI) via their mobile phones. The user profiles
are automatically converted into an XML-QL format that can be efficiently
stored in the profile database and evaluated by the filtering system. These
profiles are effectively "standing queries" that are applied to all incoming
documents. The filtered engine first creates query indices for the user
profiles and then parses the incoming XML documents to obtain the query
results. The results are stored in a special content list, so that the whole
document need not be sent; extracting parts of an XML document saves
bandwidth in a mobile environment. After that, the filtered engine sends the
filtered XML documents to the related mobile clients.
3.1 Defining User Profiles with XML-QL
XML-QL has an SQL-like WHERE ... CONSTRUCT construct that can
express queries to extract pieces of data from XML documents. It can also
specify transformations that, for example, map XML data between
Document Type Definitions (DTDs) and integrate XML data from different
sources. Profiles defined through a GUI are transformed into XML
documents which contain XML-QL queries, as shown in Figure 2.
<Profile>
<XML-QL>
WHERE<course>
<major>
<name>ICT</name>
<program>First Year</program>
<syllabus>$n</syllabus>
</major></course> IN “course.xml”
CONSTRUCT<result><syllabus>$n</syllabus></result>
</XML-QL>
<PushTo> <address>…</address> </PushTo>
</Profile>
Figure 2. Profile syntax represented in XML containing XML-QL query
3.2 Filtered Engine
The basic components of the filtered engine are: 1) an event-based XML
parser, implemented using the SAX API, for the XML documents; 2) a
profile parser that contains an XML-QL parser for user profiles and creates
the Query Index; 3) a Query Execution Engine that uses the Query Index,
which is associated with Finite State Machines, to query the XML
documents; and 4) a Delivery Component that pushes the results to the
related mobile clients (see Figure 3).
Figure 3. Filtered engine (components: the Profile Parser with its XML-QL Parser, the
Query Index, the Query Execution Engine, the XML Parser, and the Delivery component)
4. OPERATION OF THE SYSTEM
The system operates as follows: the subscriber informs the filtered engine
when a new profile is created or updated; the profiles are stored in an XML
file that contains the XML-QL queries and the addresses to which results
are transmitted (see Figure 2). Profiles are parsed by the profile parser
component, and the XML-QL queries in each profile are parsed by an
XML-QL parser. While parsing the queries, the XML-QL parser generates
an FSM representation for each query if the query does not match any
existing query group; otherwise, the FSM of the corresponding query group
is used for the input query. The FSM representation contains a state node
for each element name in the queries, and these nodes are stored in the
Query Index.
When a new document arrives, the system alerts the filtered engine to parse
the related XML document. The event-based XML parser sends the events
it encounters to the query execution engine. The handlers in the query
execution engine move the FSMs to their next states after the current states
pass level checking or character data matching. Meanwhile, the data in the
document that match the variables are kept in content lists, so that all the
partial data necessary for producing the results are formatted and pushed to
the related mobile clients when the FSM reaches its final state.
4.1 Creating Query Index
Consider an example XML document and its DTD given in Figure 4.
<!-- DTD for Course -->
<!ELEMENT root (course*)>
<!ELEMENT course (degree, major*)>
<!ELEMENT degree (#PCDATA)>
<!ELEMENT major (name, program, semester, syllabus*)>
<!ELEMENT name (#PCDATA)>
<!ELEMENT program (#PCDATA)>
<!ELEMENT semester (#PCDATA)>
<!ELEMENT syllabus (sub-code, sub-title, instructor)>
<!ELEMENT sub-code (#PCDATA)>
<!ELEMENT sub-title (#PCDATA)>
<!ELEMENT instructor (#PCDATA)>
<root> <course>
<degree>Bachelor</degree>
<major><name>ICT</name>
<program>First Year</program>
<semester>First Semester</semester>
<syllabus>
<sub-code>EM-101</sub-code>
<sub-title>English</sub-title>
<instructor>Dr. Thiri</instructor>
</syllabus>
</major>
</course>…</root>
Figure 4. An example XML document and its DTD (course.xml)
The example queries and their FSM representations are shown in Figure 5.
Note that there is a node in the FSM representation corresponding to each
element in the query, and the FSM representation’s tree structure follows
from XML-QL query structure.
Query 1: Retrieve all syllabuses of first year program for ICT major.
WHERE <major> <name>ICT</><program>First Year</><syllabus>$n</>
</> IN “course.xml”
CONSTRUCT<result><syllabus>$n</></>
FSM for Query 1 (query nodes Q1.1, Q1.2, Q1.3, Q1.4)
Query 2: Find the instructor name of the subject code EM-101.
WHERE <syllabus> <sub-code>EM-101</><instructor>$s</>
</> IN “course.xml”
CONSTRUCT<result><syllabus>$s</></>
FSM for Query 2 (query nodes Q2.1, Q2.2, Q2.3)
Query 3: Retrieve all the instructors for first year program in ICT major.
WHERE<major> <name>ICT</><program>First Year</><syllabus> <instructor>$s</></>
</> IN “course.xml”
CONSTRUCT<result><syllabus>$s</></>
FSM for Query 3 (query nodes Q3.1, Q3.2, Q3.3, Q3.4, Q3.5)
Figure 5. Example queries and their FSM representations
We also substitute constants in a query with parameters to create
syntactically equivalent queries, which leads to the use of the same FSM for
them. The state changes of an FSM are handled through the two lists
associated with each node in the Query Index (see Figure 6). The current
nodes of each query are placed on the Candidate List (CL) of their related
element name. In addition, all of the nodes representing the future states are
stored in the Wait Lists (WL) of their related element name. A state
transition in the FSM is represented by copying a query node from the WL
to the CL. Notice that the node copied to the CL also remains in the WL, so
that it can be reused by the FSM in future executions of the query, as the
same element name may reappear at another level in the XML document.
When the query index is initialized, the first node of each query tree is
placed on the CL of the index entry of its relevant element name. The
remaining elements in the query tree are placed in the relevant WLs. Query
nodes in the CL indicate that the state of the query might change when the
XML parser processes the relevant elements of these nodes. When the
XML parser encounters a start element tag, the immediate child elements of
this node in the Query Index are copied from the WL to the CL if a node in
the CL of the element satisfies level checking or character data matching.
The purpose of the level checking is to make sure that this element name
may possibly reappear in the document.
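To make this bookkeeping concrete, the following Python sketch (an illustration of the idea described above, not the system's actual implementation) models a query index in which each element name carries a CL and a WL, query nodes are represented by hypothetical (query_id, position) pairs, and a transition copies a node from the WL to the CL while leaving it in the WL for reuse.

from collections import defaultdict

# Illustrative sketch of the Query Index: each element name maps to a
# Candidate List (CL) of current states and a Wait List (WL) of future states.
class QueryIndex:
    def __init__(self):
        self.cl = defaultdict(list)
        self.wl = defaultdict(list)

    def add_query(self, query_id, elements):
        """elements is the element-name path of the query, e.g.
        ["major", "name", "program", "syllabus"] for Query 1."""
        self.cl[elements[0]].append((query_id, 0))   # first node is a candidate
        for pos, name in enumerate(elements[1:], start=1):
            self.wl[name].append((query_id, pos))    # remaining nodes wait

    def transition(self, element, node):
        """A state transition copies a node from the WL to the CL; the node
        remains in the WL so it can be reused if the element reappears."""
        if node in self.wl[element]:
            self.cl[element].append(node)

index = QueryIndex()
index.add_query("Q1", ["major", "name", "program", "syllabus"])
index.add_query("Q2", ["syllabus", "sub-code", "instructor"])
print(index.cl["major"], index.wl["syllabus"])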
Figure 6. Initial states of the query index for the example queries (each element name,
i.e. instructor, major, name, program, syllabus and sub-code, has a CL and a WL holding
the corresponding query nodes Q1.1-Q1.4, Q2.1-Q2.3 and Q3.1-Q3.5)
4.2 Operation of the Finite State Machine
When a new XML document activates the SAX parser, it starts generating
events. The following event handlers process these events:
Table 1. Sample SAX API
An XML Document SAX API Events
<?xml version=”1.0”>
<course>
<major>
<name>
ICT
</name>
</major>
</course>
start document
start element: course
start element: major
start element: name
characters: ICT
end element: name
end element: major
end element: course
end document
The Start Element Handler checks whether the query element matches the
element in the document. For this purpose it performs a level check and an
attribute check. If these are satisfied, it either enables data comparison or
starts variable content generation. As the next step, the nodes in the WL
that are the immediate successors of this node are moved to the CL.
The End Element Handler evaluates the state of a node by considering the
states of its successor nodes. Moreover, it generates the output when the
root node is reached. It also deletes from the CL the nodes that were
inserted by the start element handler of this node; this provides
"backtracking" in the FSM.
The Element Data Handler is implemented for data comparison in the
query. If the expression is true, the state of the node is set to true, and this
value is used by the End Element Handler of the current element node.
The End Document Handler signals the end of result generation and passes
the results to the Delivery Component.
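As an illustration of this handler structure, the sketch below uses Python's xml.sax API; the original system's handlers additionally perform level and attribute checks and drive the query FSMs, so the class name and reduced logic here are illustrative assumptions only.

import xml.sax

# Minimal sketch of the event handlers described above, using Python's SAX API.
# It only records the element path and character data; the real filtered engine
# would perform level checking, data comparison and FSM transitions here.
class ProfileFilterHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.path = []          # current element path, used for level checking
        self.matches = []       # character data captured for result generation

    def startElement(self, name, attrs):
        self.path.append(name)  # Start Element Handler: level/attribute checks go here

    def characters(self, content):
        if content.strip():     # Element Data Handler: data comparison goes here
            self.matches.append(("/".join(self.path), content.strip()))

    def endElement(self, name):
        self.path.pop()         # End Element Handler: backtracking in the FSM

    def endDocument(self):
        # End Document Handler: pass collected results to the Delivery Component
        for path, text in self.matches:
            print(path, "->", text)

xml.sax.parseString(b"<course><major><name>ICT</name></major></course>",
                    ProfileFilterHandler())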
4.3 Generating Customized Results
Results are generated when the end element of the root node of the query is
encountered. The content lists of the variable nodes are then traversed to
obtain content groups, which are further processed to produce the results.
This process is repeated until the end of the document is reached. The
results are formatted as defined in the CONSTRUCT clause, and finally the
query results are sent to the related mobile clients.
5. PERFORMANCE EVALUATION
In this section, we conducted three sets of experiments to demonstrate the
performance of the architecture for different document sizes and query
workloads. The graph in Figure 7 contains the results for different query
groups, that is, queries that have the same FSM representation but different
constants, for the document course.xml (1 MB). When the number of
queries on the same XML document is very large, the probability of having
queries with the same FSM representation increases considerably.
Figure 7. Comparing the performance by varying the number of queries
The above experiment indicates that our proposed architecture is highly
scalable, that a very important factor in the performance is the number of
query groups, and that generating a single FSM per query group rather than
per query is well justified.
Figure 8. Comparing the performance by varying depth
The depth of XML documents and queries in the user profiles varies
according to application characteristics. Figure 8 shows the execution time
for evaluating the performance of the system as the maximum depth is
varied. Here, we fixed the number of profiles at 25000 and varied the
maximum depth of the XML document and queries from 1 to 10.
Figure 9. Execution time of queries for different numbers of query groups and
document sizes
Figure 9 shows the execution times of the queries as the number of query
groups and the size of the documents are varied. The results indicate that
performance is more sensitive to document size when the number of query
groups increases; this result also confirms the importance of query
grouping.
As a final conclusion, we can say that the FSM approach proposed in this
paper for executing XML-QL queries on XML documents is a very
promising approach for mobile environments.
6. CONCLUSIONS
Mobile communication is booming and access to the Internet from mobile
devices has become possible. Given this new technology, researchers and
developers are in the process of figuring out what users really want to do
anytime, from anywhere, and determining how to make this possible. In
addition, a high degree of personalization is a very important requirement
for developing SDI services in a mobile environment, as the limited
capability of mobile devices is not enough to handle entire documents. This
paper attempts to develop an efficient and scalable SDI system that serves
mobile clients based upon their profiles. We anticipate that one of the
common uses of mobile devices will be to deliver personalized information
from XML sources. We believe that querying power is necessary for
expressing highly personalized user profiles, and for the system to be used
by millions of mobile users it has to be scalable. Since the critical issue is
the number of profiles compared to the number of documents, indexing
queries rather than documents makes sense. We expect that the performance
of the system will still be acceptable in mobile environments for millions of
queries, since the results of the experiments show that the system is highly
scalable.
7. ACKNOWLEDGMENTS
The authors wish to acknowledge Dr. Soe Khaing for her useful comments
on earlier drafts of the paper. Our heart-felt thanks to our family, friends and
colleagues who have helped us for the completion of this work.
REFERENCES
[1] M. Altinel and M. Franklin, “Efficient filtering of XML documents for selective
dissemination of information,” Proc of the Int’l Conf on VLDB, pp. 53-64, Sept 2000.
[2] Y. Diao, M. Altinel, M. Franklin, H. Zhang and P.M. Fischer, “Path sharing and
predicate evaluation for high-performance XML filtering,” ACM Trans. Database
Syst., 28(4), Dec 2003, pp. 467–516.
[3] Extensible Markup Language, http://www.w3.org/XML/.
[4] I. Miliaraki, Distributed Filtering and Dissemination of XML Data in Peer-to-Peer
Systems, PhD Thesis, Department of Informatics and Telecommunications, National
and Kapodistrian University of Athens, July 2011.
[5] I. Miliaraki and M. Koubarakis, “FoXtrot: distributed structural and value XML
filtering”, ACM Transactions on the Web, Vol. 6, No. 3, Article 12, Publication date:
September 2012.
[6] J. Chen, D. DeWitt, F. Tian and Y. Wang, “NiagaraCQ: a scalable continuous query
system for internet databases”, ACM SIGMOD, Texas, USA, June 2000, pp.379-390.
[7] XML-QL: A Query Language for XML, http://www.w3.org/TR/1998/NOTE-xml-ql-
19980819.
Cluster Head Selection for in Wireless Sensor NetworksCluster Head Selection for in Wireless Sensor Networks
Cluster Head Selection for in Wireless Sensor Networks
 

More from ijcsbi

Vol 17 No 2 - July-December 2017
Vol 17 No 2 - July-December 2017Vol 17 No 2 - July-December 2017
Vol 17 No 2 - July-December 2017ijcsbi
 
Vol 17 No 1 - January June 2017
Vol 17 No 1 - January June 2017Vol 17 No 1 - January June 2017
Vol 17 No 1 - January June 2017ijcsbi
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016ijcsbi
 
Vol 16 No 1 - January-June 2016
Vol 16 No 1 - January-June 2016Vol 16 No 1 - January-June 2016
Vol 16 No 1 - January-June 2016ijcsbi
 
Vol 15 No 6 - November 2015
Vol 15 No 6 - November 2015Vol 15 No 6 - November 2015
Vol 15 No 6 - November 2015ijcsbi
 
Vol 15 No 5 - September 2015
Vol 15 No 5 - September 2015Vol 15 No 5 - September 2015
Vol 15 No 5 - September 2015ijcsbi
 
Vol 15 No 4 - July 2015
Vol 15 No 4 - July 2015Vol 15 No 4 - July 2015
Vol 15 No 4 - July 2015ijcsbi
 
Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015ijcsbi
 
Vol 15 No 2 - March 2015
Vol 15 No 2 - March 2015Vol 15 No 2 - March 2015
Vol 15 No 2 - March 2015ijcsbi
 
Vol 15 No 1 - January 2015
Vol 15 No 1 - January 2015Vol 15 No 1 - January 2015
Vol 15 No 1 - January 2015ijcsbi
 
Vol 14 No 3 - November 2014
Vol 14 No 3 - November 2014Vol 14 No 3 - November 2014
Vol 14 No 3 - November 2014ijcsbi
 
Vol 14 No 2 - September 2014
Vol 14 No 2 - September 2014Vol 14 No 2 - September 2014
Vol 14 No 2 - September 2014ijcsbi
 
Vol 14 No 1 - July 2014
Vol 14 No 1 - July 2014Vol 14 No 1 - July 2014
Vol 14 No 1 - July 2014ijcsbi
 
Vol 13 No 1 - May 2014
Vol 13 No 1 - May 2014Vol 13 No 1 - May 2014
Vol 13 No 1 - May 2014ijcsbi
 
Vol 12 No 1 - April 2014
Vol 12 No 1 - April 2014Vol 12 No 1 - April 2014
Vol 12 No 1 - April 2014ijcsbi
 
Vol 11 No 1 - March 2014
Vol 11 No 1 - March 2014Vol 11 No 1 - March 2014
Vol 11 No 1 - March 2014ijcsbi
 
Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014ijcsbi
 
Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014ijcsbi
 
Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013ijcsbi
 
Vol 6 No 1 - October 2013
Vol 6 No 1 - October 2013Vol 6 No 1 - October 2013
Vol 6 No 1 - October 2013ijcsbi
 

More from ijcsbi (20)

Vol 17 No 2 - July-December 2017
Vol 17 No 2 - July-December 2017Vol 17 No 2 - July-December 2017
Vol 17 No 2 - July-December 2017
 
Vol 17 No 1 - January June 2017
Vol 17 No 1 - January June 2017Vol 17 No 1 - January June 2017
Vol 17 No 1 - January June 2017
 
Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016Vol 16 No 2 - July-December 2016
Vol 16 No 2 - July-December 2016
 
Vol 16 No 1 - January-June 2016
Vol 16 No 1 - January-June 2016Vol 16 No 1 - January-June 2016
Vol 16 No 1 - January-June 2016
 
Vol 15 No 6 - November 2015
Vol 15 No 6 - November 2015Vol 15 No 6 - November 2015
Vol 15 No 6 - November 2015
 
Vol 15 No 5 - September 2015
Vol 15 No 5 - September 2015Vol 15 No 5 - September 2015
Vol 15 No 5 - September 2015
 
Vol 15 No 4 - July 2015
Vol 15 No 4 - July 2015Vol 15 No 4 - July 2015
Vol 15 No 4 - July 2015
 
Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015Vol 15 No 3 - May 2015
Vol 15 No 3 - May 2015
 
Vol 15 No 2 - March 2015
Vol 15 No 2 - March 2015Vol 15 No 2 - March 2015
Vol 15 No 2 - March 2015
 
Vol 15 No 1 - January 2015
Vol 15 No 1 - January 2015Vol 15 No 1 - January 2015
Vol 15 No 1 - January 2015
 
Vol 14 No 3 - November 2014
Vol 14 No 3 - November 2014Vol 14 No 3 - November 2014
Vol 14 No 3 - November 2014
 
Vol 14 No 2 - September 2014
Vol 14 No 2 - September 2014Vol 14 No 2 - September 2014
Vol 14 No 2 - September 2014
 
Vol 14 No 1 - July 2014
Vol 14 No 1 - July 2014Vol 14 No 1 - July 2014
Vol 14 No 1 - July 2014
 
Vol 13 No 1 - May 2014
Vol 13 No 1 - May 2014Vol 13 No 1 - May 2014
Vol 13 No 1 - May 2014
 
Vol 12 No 1 - April 2014
Vol 12 No 1 - April 2014Vol 12 No 1 - April 2014
Vol 12 No 1 - April 2014
 
Vol 11 No 1 - March 2014
Vol 11 No 1 - March 2014Vol 11 No 1 - March 2014
Vol 11 No 1 - March 2014
 
Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014Vol 10 No 1 - February 2014
Vol 10 No 1 - February 2014
 
Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014Vol 9 No 1 - January 2014
Vol 9 No 1 - January 2014
 
Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013Vol 7 No 1 - November 2013
Vol 7 No 1 - November 2013
 
Vol 6 No 1 - October 2013
Vol 6 No 1 - October 2013Vol 6 No 1 - October 2013
Vol 6 No 1 - October 2013
 

Recently uploaded

FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsSandeep D Chaudhary
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfDr Vijay Vishwakarma
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningMarc Dusseiller Dusjagr
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...EADTU
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxCeline George
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17Celine George
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesSHIVANANDaRV
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptNishitharanjan Rout
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 

Recently uploaded (20)

FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
Our Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdfOur Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdf
 
OSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & SystemsOSCM Unit 2_Operations Processes & Systems
OSCM Unit 2_Operations Processes & Systems
 
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdfUnit 3 Emotional Intelligence and Spiritual Intelligence.pdf
Unit 3 Emotional Intelligence and Spiritual Intelligence.pdf
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
dusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learningdusjagr & nano talk on open tools for agriculture research and learning
dusjagr & nano talk on open tools for agriculture research and learning
 
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
Transparency, Recognition and the role of eSealing - Ildiko Mazar and Koen No...
 
How to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptxHow to setup Pycharm environment for Odoo 17.pptx
How to setup Pycharm environment for Odoo 17.pptx
 
How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17How to Add a Tool Tip to a Field in Odoo 17
How to Add a Tool Tip to a Field in Odoo 17
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
Economic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food AdditivesEconomic Importance Of Fungi In Food Additives
Economic Importance Of Fungi In Food Additives
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...OS-operating systems- ch05 (CPU Scheduling) ...
OS-operating systems- ch05 (CPU Scheduling) ...
 

Vol 8 No 1 - December 2013

  • 4. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 2 1. INTRODUCTION Wireless sensor network (WSN) is a collection of huge number of small, low-power and low-cost electronic devices called sensor nodes. Each sensor node consists of four major blocks: sensing, processing, power and communication unit and they are responsible for sensing, processing and wireless communications (figure 1). These nodes bring together the relevant data from the environment and then transfer the gathered data to base station (BS). Since WSNs has many advantages like self organization, infrastructure-free, fault-tolerance and locality, they have a wide variety of potential applications like border security and surveillance, environmental monitoring and forecasting, wildlife animal protection and home automation, disaster management and control. Considering that sensor nodes are usually deployed in remote locations, it is impossible to recharge their batteries. Therefore, ways to utilize the limited energy resource wisely to extend the lifetime of sensor networks is a very demanding research issue for these sensor networks. Figure 1: Various components of a wireless sensor node Clustering [2-7] is an effectual topology control approach, which can prolong the lifetime and increase scalability for these sensor networks. The popular criterion for clustering technique (figure 2) is to select a cluster head (CH) with more residual energy and to spin them periodically. The basic idea of clustering algorithms is to use the data aggregation [8-11] mechanism in the cluster head to lessen the amount of data transmission. Clustering goes behind some advantages like network scalability, localizing
  • 5. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 3 route setup, uses communication bandwidth [17] efficiently and takes advantage of network lifetime [12-16]. By the data aggregation process, unnecessary communication between sensor nodes, cluster head and the base station is evaded. In this paper, a well-defined model of distributed layer-based clustering algorithm is proposed based of three concepts: the aggregated data is forwarded from the cluster head to the base station through cluster head of the next higher layer with shortest distance between the cluster heads, cluster head is elected based on the clustering factor and the crisis hindrance node does the function of cluster head when the cluster head fails to carry out its work. The prime aim of the proposed algorithm is to attain energy efficiency and increased network lifetime. Figure 2: Cluster formation in a wireless sensor network The rest of this paper is structured as follows. A literature review of existing distributed clustering algorithms, talking about their projected advantages and shortcomings is profoundly conversed in Section 2. An evaluation of the existing clustering algorithm LEACH (Low Energy Adaptive Clustering Hierarchy) and the basic concept behind this algorithm is briefed in Section 3. Section 4 sketches a precise model of the proposed distributed layer- based clustering algorithm, enumerating the precious hiding concepts behind it. Finally, the last section gives the conclusion creatively.
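As a small illustration of the data aggregation step described above (not the exact scheme of the proposed algorithm), the following R sketch shows a cluster head collapsing its members' readings into a single value before one uplink transmission; the readings and the use of a plain average are assumptions for illustration only.
  # Illustrative sketch of in-cluster data aggregation: the cluster head forwards
  # one aggregated value to the base station instead of one packet per member.
  member_readings <- c(24.1, 23.8, 24.5, 24.0)   # hypothetical sensed values
  aggregate_at_ch <- function(readings) mean(readings)
  aggregate_at_ch(member_readings)               # single value sent upstream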
  • 6. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 4 2. A REVIEW OF EXISTING CLUSTERING ALGORITHMS Bandyopadhyay and Coyle anticipated EEHC [18], which is a randomized clustering algorithm which categorizes the sensor nodes into hierarchy of clusters with an objective of minimizing the total energy spent in the system to communicate the information gathered by the sensors to the information processing center. It has variable cluster count, the immobile cluster head aggregates and relays the data to the BS. It is valid for extensive large scale networks. The peculiar negative aspect of this algorithm is that, some nodes remain un-clustered throughout the clustering process. Barker, Ephremides and Flynn proposed LCA [19], which is chiefly developed to avoid the communication collisions among the nodes by using a TDMA time-slot. It makes utilization of single-hop scheme thereby attaining high degree of connectivity when CH is selected randomly. The restructured version of LCA, the LCA2 was implemented to lessen the number of nodes compared to the original LCA algorithm. The key drawback of this algorithm is that, the single-hop clustering leads to the creation of more number of clusters. Nagpal and Coore proposed CLUBS [20], which is executed with an idea to form overlapping clusters with maximum cluster diameter of two hops. The clusters are created by local broadcasting and its convergence depends on the local density of the wireless sensor nodes. This algorithm can be implemented in asynchronous environment without dropping efficiency. The main difficulty is the overlapping of clusters, clusters having their CHs within one hop range of each other, thereby both the clusters will collapse and CH election process will get restarted. Demirbas, Arora and Mittal brought out FLOC [21], which shows double- band nature of wireless radio-model for communication. The nodes can commune reliably with the nodes in the inner-band and unreliably with the nodes that are in the outer-band. The chief disadvantage of the algorithm is, the communication between the nodes in the outer band is unreliable and the messages have maximum probability of getting lost during communication. Ye, Li, Chen and Wu proposed EECS [22], which is based on a supposition that all CHs can communicate directly with the BS. The clusters have variable size, those closer to the CH are larger in size and those farther from CH are smaller in size. It is really energy efficient in intra-cluster communication and shows an excellent improvement in network lifetime.
  • 7. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 5 EEUC is anticipated for uniform energy consumption within the sensor network. It forms dissimilar clusters, with a guessing that each cluster can have variable sizes. Probabilistic selection of CH is the focal shortcoming of this algorithm. Few nodes will be gone without being part of any cluster. Yu, Li and Levy proposed DECA, which selects CH based on residual energy, connectivity and a node identifier. It is greatly energy efficient, as it uses lesser messages for CH selection. The main trouble with this algorithm is that high risk of wrong CH selection which leads to the discarding of every packets sent by the wireless sensor node. Ding, Holliday and Celik proposed DWEHC, which elects CH on the basis of weight, a combination of nodes’ residual energy and its distance to the neighboring nodes. It produces well balanced clusters, independent of network topology. A node possessing largest weight in a cluster is designated as CH. The algorithm constructs multilevel clusters and the nodes in every cluster reach CH by relaying through other intermediate nodes. The foremost problem occurs due to much energy utilization by several iterations until the nodes settle in most energy efficient topology. HEED is a well distributed clustering algorithm in which CH selection is done by taking into account the residual energy of the nodes and intra- cluster communication cost leading to prolonged network lifetime. It is clear that it can have variable cluster count and supports heterogeneous sensors. The problems with HEED are its application narrowed only to static networks, the employment of complex methods and multiple clustering messages per node for CH selection even though it prevents random selection of CH. 3. AN EVALUATION OF LEACH ALGORITHM LEACH [1] is one of the most well-liked clustering mechanisms for WSNs and it is considered as the representative energy efficient protocol. In this protocol, sensor nodes are unified together to form a cluster. In each cluster, one sensor node is chosen arbitrarily to act as a cluster head (CH), which collects data from its member nodes, aggregates them and then forwards to the base station. It disperses the operation unit into many rounds and each round consists of two phases: the set-up phase and the steady phase. During the set-up phase, initial clusters are fashioned and cluster heads are selected. All the wireless sensor nodes produce a random number between 0 and 1. If the number is lesser than the threshold, then the node selects itself as the cluster head for the present round. The threshold for cluster head selection in LEACH for a particular round is given in equation 1. Gone selecting
  • 8. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 6 itself as a CH, the sensor node broadcasts an advertisement message containing its own ID. The non-cluster head nodes then decide which cluster to join based on the strength of the received advertisement signal. After the decision is made, every non-cluster head node transmits a join-request message to the chosen cluster head to indicate that it will be a member of that cluster. After receiving all the join-request messages, the cluster head creates and broadcasts a time division multiple access (TDMA) schedule so that data can be exchanged with the non-cluster-head sensor nodes without collision.
T(n) = p / (1 - p * (r mod (1/p))) if n ∈ G, and T(n) = 0 otherwise (1)
where p is the desired percentage of cluster heads, r is the current round number and G is the set of nodes which have not been chosen as cluster head during the last 1/p rounds. The steady phase commences after the clusters are formed and the TDMA schedules are broadcast. All sensor nodes transmit their data to the cluster head once per round during their allotted transmission slot based on the TDMA schedule, and at all other times they turn off their radios in order to trim down energy consumption. The cluster heads, however, must stay awake all the time so that they can receive all data from the nodes within their own clusters. On receiving the data from the cluster, the cluster head carries out the data aggregation mechanism and forwards the result to the base station directly. This is the entire mechanism of the steady-state phase. After a certain predefined time, the network steps into the next round. LEACH is the basic clustering protocol built on the cluster approach, and it can prolong the network lifetime in comparison with other multi-hop and static routing schemes. However, there are still some hidden problems that should be considered. LEACH does not take the residual energy into account when electing cluster heads and constructing the clusters. As a result, nodes with lesser energy may be elected as cluster heads and then die much earlier. Moreover, since a node selects itself as a cluster head only according to the value of the calculated probability, it is hard to guarantee the number of cluster heads and their distribution. Also, since the cluster heads in LEACH are selected randomly, the weaker nodes drain easily. To rise above
  • 9. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 7 these shortcomings in LEACH, a model of distributed layer-based clustering algorithm is proposed, where clusters are arranged in to hierarchical layers. Instead of cluster heads directly sending the aggregated data to the base station, sends them to their next layer nearer cluster heads. These cluster heads send their data along with that received from lower level cluster heads to the next layer nearer cluster heads. The cumulative process gets repeated and finally the data from all the layers reach the base station. The proposed model is dedicated with some expensive designs, focusing on reduced energy utilization and improved network lifetime of the sensor network. 4. THE PROPOSED CLUSTERING ALGORITHM The proposed clustering algorithm is well distributed, where the sensor nodes are deployed randomly to sense the target environment. The nodes are divided into clusters with each cluster having a CH. The nodes throw the information during their TDMA timeslot to their respective CH which fuses the data to avoid redundant information by the process of data aggregation. The aggregated data is forwarded to the BS. Compared to the existing algorithms, the proposed algorithm has three distinguishing features. First, the aggregated data is forwarded from the cluster head to the base station through cluster head of the next higher layer with shortest distance between the cluster heads. Second, cluster head is elected based on the clustering factor, which is the combination of residual energy and the number of neighbors of a particular node within a cluster. Third, each cluster has a crisis hindrance node, that does the function of cluster head when the cluster head fails to carry out its work in some conditions. Figure 3: Aggregated data forwarding in the proposed algorithm
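To make the forwarding rule of Figure 3 concrete, the R sketch below picks, for a given cluster head, the nearest cluster head in the next higher layer as its next hop, with the highest layer forwarding directly to the base station. The coordinates, layer numbers and Euclidean-distance metric are assumptions for illustration; the paper only states that the next-higher-layer cluster head with the shortest distance is chosen.
  # Sketch of the layer-based forwarding rule: a layer-k cluster head relays its
  # aggregated data to the nearest cluster head in layer k + 1; the highest layer
  # sends directly to the base station. All numbers here are hypothetical.
  cluster_heads <- data.frame(
    id    = 1:4,
    layer = c(1, 2, 2, 3),
    x     = c(10, 25, 40, 60),
    y     = c(10, 30, 15, 50)
  )
  next_hop <- function(ch_id, chs) {
    me <- chs[chs$id == ch_id, ]
    if (me$layer == max(chs$layer)) return("base station")
    upper <- chs[chs$layer == me$layer + 1, ]
    d <- sqrt((upper$x - me$x)^2 + (upper$y - me$y)^2)  # distance to next-layer CHs
    upper$id[which.min(d)]                              # nearest next-layer CH
  }
  next_hop(1, cluster_heads)   # cluster head 1 relays through its nearest layer-2 CH
In a full protocol the same rule would be applied hop by hop until the aggregated data reaches the highest layer and then the base station.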
  • 10. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 8 A. Aggregated Data Forwarding In a network of N nodes, each node is assigned with an exclusive Node Identity (NID). The NID just serves as a recognition of the nodes and has no relationship with location or clustering. The CH will be placed at the center and the nodes will be organized in to several layers around the CH. Every clusters are arranged into hierarchical layers and layer numbers are assigned to each clusters. The cluster that is far away from the base station is designated as the lowest layer and the cluster nearer to the base station is designated as the highest layer. The main characteristic feature of the proposed algorithm is that the lowest layer cluster head forwards only its own aggregated data to the next layer cluster head but the highest layer forwards all the aggregated data from the preceding cluster heads to the base station (figure 3). Thus lower workload is assigned to the lower layers but the higher layers are assigned with greater workload. The workload assigned to a particular cluster head is directly proportional to the energy utilization of the cluster head. In order to balance the energy utilization among the cluster head, the concept of variable transmission power is employed, where the transmission power reduces with increase in layer numbers. In LEACH, each cluster head forwards the aggregated data to the base station directly which uses much energy. The proposed algorithm uses a multi-hop fashion of data forwarding from cluster head to the base station resulting in reduced energy utilization.
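The second and third features, detailed in the next two subsections, can also be sketched briefly. The paper describes the clustering factor only as a combination of residual energy and neighbor count, so the weighted sum below (with weight w) is an assumption, as is taking the runner-up node as the crisis hindrance node.
  # Sketch of clustering-factor-based selection (weighted sum assumed; the exact
  # combination is not fixed in the text). The highest-factor node becomes the
  # cluster head; here the runner-up serves as the crisis hindrance node.
  nodes <- data.frame(
    id              = 1:5,
    residual_energy = c(0.9, 0.6, 0.8, 0.4, 0.7),  # normalized, hypothetical
    n_neighbors     = c(3, 5, 4, 2, 6)
  )
  clustering_factor <- function(energy, neighbors, w = 0.5) {
    w * energy + (1 - w) * neighbors / max(neighbors)  # both terms scaled to [0, 1]
  }
  nodes$factor <- clustering_factor(nodes$residual_energy, nodes$n_neighbors)
  ranked <- nodes[order(-nodes$factor), ]
  cluster_head          <- ranked$id[1]
  crisis_hindrance_node <- ranked$id[2]   # promoted if the cluster head fails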
  • 11. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 9 Figure 4: Mechanism of cluster head selection in the proposed algorithm B. Cluster Head Selection The cluster head is elected based on the clustering factor (figure 4), which is the combination of residual energy and the number of neighbors of a particular node within a cluster. Residual energy is defined as the energy remaining within a particular node after some number of rounds. This is generally believed as one of the main parameter for CH selection in the proposed algorithm. A neighboring node is a node that remains closer to a particular node within one hop distance. LEACH selects cluster head only based on residual energy, but in the proposed algorithm an additional parameter is included basically to elect the cluster head properly, thereby to reduce the node death rate. The main characteristic feature of the proposed algorithm compared to LEACH is that, the base station does not involve in clustering process directly or indirectly. A node with highest clustering factor is selected as cluster head for the current round. This is generally significant in mobile environment, when the sensor nodes move, the number
  • 12. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 10 of neighbors vary which should be taken into account but it is barely not concentrated in the LEACH clustering mechanism. C. Alternate Crisis Hindrance Node In a cluster with large number of nodes, cluster crisis does not affect the overall performance of the wireless sensor system. But in the case of network with less number of nodes, cluster crisis greatly affects the wireless sensor system. Care should be done when cluster head selection process by applying alternate recovery mechanisms. In addition to the regular cluster head, additional cluster node is assigned the task of secondary cluster head, and the particular node is called as crisis hindrance node. Generally the cluster collapses when the cluster head fails. In such situations, crisis hindrance node act as cluster head and recovers the cluster. The main characteristic feature of the proposed algorithm is that, the crisis hindrance node solely performs the function of recovery mechanism and does not involve in sensing process. In case of LEACH, the distribution and the loading of CHs to all nodes in the networks is not uniform by switching the cluster heads periodically. Hence, there is a maximum probability of a cluster to be collapsed easily, but it can be avoided in the proposed algorithm with the help of crisis hindrance node. 6. CONCLUSION AND FUTURE WORK This paper gives a brief introduction on clustering process in wireless sensor networks. A study on the well evaluated distributed clustering algorithm Low Energy Adaptive Clustering Hierarchy (LEACH) is described artistically. To overcome the drawbacks of the existing LEACH algorithm, a model of distributed layer-based clustering algorithm is proposed for clustering the wireless sensor nodes. The proposed distributed clustering algorithm is based on the aggregated data being forwarded from the cluster head to the base station through cluster head of the next higher layer with shortest distance between the cluster heads. The selection of cluster head is based on the clustering factor, which is the combination of residual energy and the number of neighbors of a particular node within a cluster. Also each cluster has a crisis hindrance node. In future, the algorithm will be simulated using the network simulator and the simulated results will be compared with two or three existing distributed clustering algorithms. 7. ACKNOWLEDGMENTS Our sincere gratitude to the management of SVS Educational Institutions and my Research Supervisor Dr. S. Sophia who served as a guiding light to come out with this amazing research work.
  • 13. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 11 REFERENCES [1] W.B.Heinzelman, A.P.Chandrakasan, H.Balakrishnan, (2002), “An application specific protocol architecture for wireless microsensor networks”, IEEE Transactions on Wireless Communication Volume 1, Number 4, Pages 660-670. [2] O.Younis, S.Fahmy, (2004), “HEED: A hybrid energy-efficient distributed clustering approach for adhoc sensor networks”, IEEE Transactions on Mobile Computing, Volume 3, Number 4, Pages 366-379. [3] S.Zairi, B.Zouari, E.Niel, E.Dumitrescu, (2012), “Nodes self-scheduling approach for maximizing wireless sensor network lifetime based on remaining energy” IET Wireless Sensor Systems, Volume 2, Number 1, Pages 52-62. [4] I.Akyildiz, W.Su, Y.Sankarasubramaniam, E.Cayirci, (2002), “A Survey on sensor networks”, IEEE Communications Magazine, Pages 102-114. [5] G.J.Pottie, W.J.Kaiser, (2000), “Embedding the internet: wireless integrated network sensors”, Communications of the ACM, Volume 43, Number 5, Pages 51-58. [6] J.H.Chang, L.Tassiulas, (2004), “Maximum lifetime routing in wireless sensor networks”, IEEE/ACM Transactions on Networking, Volume 12, Number 4, Pages 609- 619. [7] S.R.Boselin Prabhu, S.Sophia, (2011), “A survey of adaptive distributed clustering algorithms for wireless sensor networks”, International Journal of Computer Science and Engineering Survey, Volume 2, Number 4, Pages 165-176. [8] S.R.Boselin Prabhu, S.Sophia, (2012), “A Research on decentralized clustering algorithms for dense wireless sensor networks”, International Journal of Computer Applications , Volume 57, Number 20, Pages 0975-0987. [9] S.R.Boselin Prabhu, S.Sophia, (2013), “Mobility assisted dynamic routing for mobile wireless sensor networks”, International Journal of Advanced Information Technology , Volume 3, Number 1, Pages 09-19. [10] S.R.Boselin Prabhu, S.Sophia, (2013), “A review of energy efficient clustering algorithm for connecting wireless sensor network fields”, International Journal of Engineering Research & Technology, Volume 1, Number 4, Pages 477–481. [11] S.R.Boselin Prabhu, S.Sophia, (2013), “Capacity based clustering model for dense wireless sensor networks”, International Journal of Computer Science and Business Informatics, Volume 5, Number 1. [12] J.Deng, Y.S.Han, W.B.Heinzelman, P.K.Varshney, (2005), “Balanced-energy sleep scheduling scheme for high density cluster-based sensor networks”, Elsevier Computer Communications Journal, Special Issue on ASWN04, Pages 1631-1642. [13] C.Y.Wen, W.A.Sethares, (2005), “Automatic decentralized clustering for wireless sensor networks”, EURASIP Journal of Wireless Communication Networks, Volume 5, Number 5, Pages 686-697. [14] S.D.Murugananthan, D.C.F.Ma, R.I.Bhasin, A.O.Fapojuwo, (2005) “A centralized energy-efficient routing protocol for wireless sensor networks”, IEEE Transactions on Communication Magazine, Volume 43, Number 3, Pages S8-13. [15] F.Bajaber, I.Awan, (2009), “Centralized dynamic clustering for wireless sensor networks”, Proceedings of the International Conference on Advanced Information Networking and Applications.
  • 14. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 12 [16] Pedro A. Forero, Alfonso Cano, Georgios B.Giannakis, (2011), “Distributed clustering using wireless sensor networks”, IEEE Journal of Selected Topics in Signal Processing, Volume 5, Pages 707-724. [17] Lianshan Yan, Wei Pan, Bin Luo, Xiaoyin Li, Jiangtao Liu, (2011), “Modified energy- efficient protocol for wireless sensor networks in the presence of distributed optical fiber sensor link, IEEE Sensors Journal, Volume 11, Number 9, Pages 1815-1819. [18] S.Bandyopadhay, E.Coyle, (2003), “An energy-efficient hierarchical clustering algorithm for wireless sensor networks”, Proceedings of the 22nd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM 2003), San Francisco, California. [19] D.J.Barker, A.Ephremides, J.A.Flynn, (1984), “The design and simulation of a mobile radio network with distributed control”, IEEE Journal on Selected Areas in Communications, Pages 226-237. [20] R.Nagpal, D.Coore, (2002), “An algorithm for group formation in an amorphous computer”, Proceedings of IEEE Military Communications Conference (MILCOM 2002), Anaheim, CA. [21] M.Demirbas, A.Arora, V.Mittal, (2004), “FLOC: A fast local clustering service for wireless sensor networks”, Proceedings of Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks (DIWANS’04), Italy. [22] M.Ye, C.F.Li, G.H.Chen, J.Wu, (2005), “EECS: An energy efficient clustering scheme in wireless sensor networks”, Proceedings of the Second IEEE International Performance Computing and Communications Conference (IPCCC), Pages 535-540.
  • 15. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 1 An Efficient Connection between Statistical Software and Database Management System Sunghae Jun Department of Statistics, Cheongju University Chungbuk 360-764 Korea ABSTRACT In big data era, we need to manipulate and analyze the big data. For the first step of big data manipulation, we can consider traditional database management system. To discover novel knowledge from the big data environment, we should analyze the big data. Many statistical methods have been applied to big data analysis, and most works of statistical analysis are dependent on diverse statistical software such as SAS, SPSS, or R project. In addition, a considerable portion of big data is stored in diverse database systems. But, the data types of general statistical software are different from the database systems such as Oracle, or MySQL. So, many approaches to connect statistical software to database management system (DBMS) were introduced. In this paper, we study on an efficient connection between the statistical software and DBMS. To show our performance, we carry out a case study using real application. Keywords Statistical software, Database management system, Big data analysis, Database connection, MySQL, R project. 1. INTRODUCTION Every day, huge data are created from diverse fields, and stored in computer systems. These big data are extremely large and complex [1]. So, it is very difficult to manage and analyze them. But, big data analysis is important issue in many fields such as marketing, finance, technology, or medicine. Big data analysis is based on statistics and machine learning algorithms. In addition, data analysis is depended on statistical software, and the data are stored in database systems. So, for big data analysis, we should manage statistical software and database system effectively. In this paper, we consider R project system as statistical software. R is an environment for statistical computing including statistical analysis and graphical display of data [2]. This program provides most of statistical and machine learning methods for big data analysis. We use MySQL for connecting database system from R project. The MySQL is a database management system (DBMS) product that is the most popular open source database in the world, in addition, this is a free software like R system [3]. So, in our research, we use R and MySQL for an efficient connection between statistical software and DBMS. There was a work about DB access through R [4]. This covered
  • 16. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 2 the DB access problems of R, and showed the ODBC (open database connectivity) drivers for connecting R and DBMS such as MySQL, PostgreSQL, and Oracle. Also, the authors of this paper introduced the installation and technological environment for the DB access. But, they did not illustrate detailed approaches for real applications. That is, their work was about a conceptual suggestion for the access of R to MySQL. So, in this paper, we perform more specific study for connection between statistical software, R to DBMS, MySQL. In our case study, we will show detailed and efficient connection of R to MySQL using specific data set from the University of California, Irvine (UCI) machine learning repository [5]. We will cover our research background in next section. In section 3, our proposed methodology will be shown. We also introduce an efficient connection between statistical database and DBMS in section 4. Lastly we conclude our study and offer our future works for statistical database system. 2. RESEARCH BACKGROUND 2.1 Statistical Software To analyze data, we can consider diverse approaches using statistical software. These days, there are so many products for statistical software. SAS (statistical analysis system) is the most popular software for statistical analysis [6]. But, this is expensive, so there are not many companies using SAS except large size companies. SPSS (statistical analysis in social science) is another representative software [7], but this is also expensive. Minitab [8] and S-Plus [9] are well used statistics packages and these are all not free. Recently, R has been used in many works for statistical data analysis, and this is free. In addition, R also provides most of statistical functions included in SAS, or SPSS. R is open source program, so we can modify R functions for our statistical computing. This is very useful advantage of R. Therefore, we consider R for connection to database system in this research. 2.2 Database Management System Database is a collection of data, and database management system (DBMS) is a software for managing database using structured query language (SQL) [10],[11]. Oracle is one of popular DBMS products [12], but it is expensive. MySQL is another DBMS, which is widely used open source software in the world [3]. Also, most functions of MySQL are similar to Oracle [3]. So, in this paper, we use MySQL for DBMS connecting to statistical software, R. Using MySQL DBMS efficiently, we use RODBC package supported by R CRAN in our research [13]. 3. STATISTICAL DATABASE SYSTEM The main goal of our study is to solve the cost problem for constructing statistical database system, because we should buy additional product to connect statistical software to DBMS. For example, for the connection
  • 17. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 3 between SAS and DBMS, we need „SAS/Access‟ product as supplementary software. In general, this is expensive. So, we tried to make the connection between statistical software and DBMS without cost. The „efficient‟ of our paper was about „cost‟. There are many approaches to connect statistical software and DBMS. To use most of them, we should buy additional products. But, there are few free approaches. So, we find an approach to connect statistical software and DBMS without cost. In this paper, we study an efficient connection between DBMS and statistical software. We select the MySQL as a DBMS for our research, and use R project as statistical software because not only they are free but also they have good functions. In addition, the R and MySQL have strong performance in statistical computing and DBMS respectively for constructing statistical database system [14],[15],[16],[17]. In general, big data are transformed to structured data type for statistical analysis as follow; Figure 1. From big data to statistical analysis First, big data are stored in DB by creating table. Second, big data are changed to structured data by preprocessing based on text mining. All data by DB and text mining are analyzed by statistical analysis. We find that text mining process is hard work for data preprocessing [18]. So, we know that table creation is more effective approach for big data analysis. To construct MySQL DB, we use console or graphic user interface (GUI) environments as follow; Figure 2. User interface of MySQL
  • 18. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 4 In this paper, we use SQL codes in the MySQL console. Also, we use RODBC as an ODBC database interface between R and MySQL [13]. In general R system, package is a set of additional R functions. R packages are not installed in basic R system. If we need to use a package, we have to add the package to the R system. Also we can search all packages from the R CRAN, and install them from the CRAN [19]. The RODBC package provides efficient functions for ODBC database access. So, our research is based on RODBC package to connect R to MySQL. To install RODBC in R system, we should select R CRAN mirror site. After RODBC installation, we load this package on R system as follow; >library (RODBC) The R system uses „library‟ function for loading a package. By this R code, we can use all functions provided by RODBC package such as odbcConnect, sqlFetch, and sqlQuery. They are used in our research for DB accessing and connecting. To connect MySQL DB, we use „odbcConnect‟ function of RODBC package as follow; >db_con =odbcConnect("stat_MySQL") User = , Password = , Database = The DSN is „stat_MySQL‟ and the „db_con‟ object of R system includes the connecting result. Also, in this connecting process, we decide user name, password, and determined database. If R and MySQL are connected each other, we can show the tables of MySQL DB using „sqlTables‟ function as follow; >sqlTables(con) TABLE_CAT TABLE_SCHEM TABLE_NAME TABLE_TYPE REMARKS The result of this function is the information of connected DB and its tables. 3.1 Structure of DB Connection Software In general, for connecting DBMS to application software, we should use ODBC connector [20]. R as a statistical software is also needed to ODBC driver to access MySQL DBMS. In this paper, we consider RODBC package for efficient connection between R and MySQL. Figure 3 shows the ODBC connection between DBMS and statistical software, and their specific products.
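A consolidated version of the connection steps above is sketched below; the DSN name stat_MySQL follows the text, while the user name and password are placeholders, and a working ODBC data source pointing at the MySQL server is assumed.
  # Minimal RODBC connection sketch (DSN, user and password are placeholders;
  # an ODBC data source for the MySQL server must already be configured).
  # install.packages("RODBC")            # one-time installation from CRAN
  library(RODBC)
  db_con <- odbcConnect("stat_MySQL", uid = "user", pwd = "password")
  sqlTables(db_con)                      # list the tables of the connected DB
  odbcClose(db_con)                      # release the connection when finished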
  • 19. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 5 Figure 3. Connection between DBMS and statistical software Oracle and MySQL are representative DBMS products, and SAS and R system are popular software for statistical analysis. General ODBC program is used for connecting application software to DBMS. So, there are so many ODBC drivers for diverse DBMS and application products. Our work is focused on the connection R and MySQL, and we select RODBC as an ODBC driver. The RODBC is a package of many R packages for DB accessing. RMySQL is another R package for R and MySQL [21]. This package is also R interface to access the MySQL DBMS. In addition to RODBC and RMySQL, there are some packages for connecting R to MySQL. In this paper, we use RODBC for MySQL accessing. This is an ODBC driver like SAS connection to DBMS as follow. Figure 4. Connection between MySQL/Oracle and SAS SAS uses some ODBC drivers for diverse DBMS such as MySQL and Oracle. Also, the drivers use their data source name (DSN). In this research, we also use DSN for RODBC package. Next, we show more detailed connection between R and MySQL. 3.2 Efficient Connection between R and MySQL The RODBC package of R system is an efficient ODBC connector. This includes diverse functions to access DBMS as follow; •odbcConnect: function for open connections to ODBC •sqlFetch: function for fetching tables from DB
  • 20. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 6 •sqlQuery: function for SQL query •sqlSave: function for writing data frame to table in DB Also, we can use more functions for accessing and manipulating MySQL DB by RODBC packages. The process of connection between R and MySQL is as follow; Figure 5. Connecting process between R and MySQL Using RODBC package, R system get necessary data from MySQL DB, and we analyze the connected data. Also, R system accesses to MySQL by sqlQuery function of RODBC, and create a table for storing analysis result using R system. Our process of connection between R and MySQL is shown as follow; Figure 6. Connecting process between R and MySQL A table of MySQL DB is transformed to an object in R by RODBC connector. So, we are able to analyze the object data from the DB table. We also perform online transaction processing (OLAP) for data summarization and visualization. Next, we carry out a case study for verifying our work.
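Under the same assumptions, the sketch below exercises the listed functions end to end: fetching a table, running a query, and writing a result table back to the database; the table and column names are hypothetical.
  # Round-trip sketch with the RODBC functions listed above (table and column
  # names are hypothetical; the DSN "stat_MySQL" is assumed to exist).
  library(RODBC)
  con <- odbcConnect("stat_MySQL", uid = "user", pwd = "password")
  readings <- sqlFetch(con, "sensor_readings")      # whole table as a data frame
  averages <- sqlQuery(con, "SELECT node_id, AVG(value) AS avg_value
                             FROM sensor_readings GROUP BY node_id")
  sqlSave(con, averages, tablename = "node_averages", rownames = FALSE)
  odbcClose(con)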
  • 21. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 7 4. CASE STUDY To illustrate a case study on a real problem, we used the 'RODBC' package from R-project [13]. This is the software for ODBC database connection between R and a DBMS such as MySQL. Also, we made the experiment using an example data set from the UCI machine learning repository [5]. 4.1 UCI Machine Learning Repository For our case study, we used the "Abalone" data set from the UCI machine learning repository [5]. This data set consists of 9 variables (columns) and 4,177 observations (rows). The main goal of the data is to predict the age of abalone from the physical measurements. The next table shows the variables and their values [5].
Table 1. Variables of the Abalone data set
Variable - Data type - Description
Sex - Nominal - M (male), F (female), I (infant)
Length - Continuous - longest shell measurement
Diameter - Continuous - perpendicular to length
Height - Continuous - with meat in shell
Whole_weight - Continuous - whole abalone
Shucked_weight - Continuous - weight of meat
Viscera_weight - Continuous - gut weight (after bleeding)
Shell_weight - Continuous - after being dried
Rings - Discrete - +1.5 gives the age in years
The last variable (Rings) is the target variable, and the others are all input variables. We constructed the MySQL DB using this data set. The original data file from the UCI machine learning repository is comma-separated, but MySQL needs a tab-separated file for DB loading. So, we transformed the data format using Excel as follows.
  • 22. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 8 Figure 7. Data transformation for MySQL loading To load text data file on MySQL, we should make a table to save these data. So, we create the table in next step. 4.2 DB Creation We used SQL to create table for loading Abalone data set on MySQL DBMS as follow; • CREATE DATABASE case_study; • USE case_study; • CREATE TABLE abalone( Sex CHAR(3), Length FLOAT(10), Diameter FLOAT(10), Height FLOAT(10), Whole_weight FLOAT(10), Shucked_weight FLOAT(10), Viscera_weight FLOAT(10), Shell_weight FLOAT(10), Rings INT(5)); • LOAD DATA INFILE 'd:/data/abalone.txt' INTO TABLE abalone; • SELECT * FROM abalone; Using above SQL codes, we constructed a table of Abalone data in MySQL DB(case_study). Next, we connected the table of abalone in case_study DB to R system. 4.3 Connecting R to MySQL We used RODBC package for connecting R to MySQL as follow;
  • 23. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 9 >library(RODBC) >abalone_con=odbcConnect("abalone_ODBC") >sqlTables(abalone_con) TABLE_SCHEM TABLE_NAME TABLE_TYPE case_study abalone TABLE >vars=sqlQuery(abalone_con, "SELECT sex, diameter, rings FROM abalone") Sex Diameter Rings 1 M 0.365 15 2 M 0.265 7 3 F 0.420 9 4 M 0.365 10 5 I 0.255 7 … Using above R codes, we saved three variables of abalone data set to „vars‟ R object. We found the abalone table was created well from the SQL query result by sqlQuery function. This function enabled the usage of SQL in R system. So, we analyzed abalone data using analytical functions of R system. Next, the result of data analysis is shown. 4.4 Data Analysis First, we performed data summarization of three variables using „summary‟ function of R system as follow; >summary(vars) sex diameter rings F:1307 Min. :0.0550 Min. : 1.000 I:1342 1st Qu.:0.3500 1st Qu.: 8.000 M:1528 Median :0.4250 Median : 9.000 Mean :0.4079 Mean : 9.934 3rd Qu.:0.4800 3rd Qu.:11.000 Max. :0.6500Max. :29.000 This function provided frequency or descriptive statistic according to data type (continuous or nominal). For example diameter is continuous variable, so we got minimum, 25 percentile, median, mean, 75 percentile, and maximum values. Next we carried out data visualization as follow; >boxplot(vars$diameter)
Figure 8. Boxplot: data visualization of MySQL table
This is the boxplot of the diameter variable of the abalone table. Using the graphical functions supported by the R system, we can also obtain diverse visualizations such as histograms, scatter plots, and so on. Lastly, we constructed a regression model using the 'lm' function as follows;
>regression_result=lm(rings~diameter, data=vars)
>summary(regression_result)
Residuals:
Min 1Q Median 3Q Max
-5.19 -1.69 -0.72 0.91 16.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.3186 0.1727 13.42 <2e-16 ***
diameter 18.6699 0.4115 45.37 <2e-16 ***
R-squared: 0.3302, Adj. R-squared: 0.3301
Regression is a popular model in statistical analysis. The dependent and independent variables are 'rings' and 'diameter' respectively, so we obtained the following regression equation: Rings = 2.3186 + 18.6699 * diameter. This completes the illustration of the connection between R and MySQL in our case study.
5. CONCLUSION
In this paper, we studied the efficient connection between a DBMS and statistical software. We used the R system and MySQL as the statistical software and DBMS respectively, and the RODBC package for the database connection. After connecting R and MySQL, we analyzed the data of a MySQL table, and this approach can be expanded to big data analysis. In our
case study, we illustrated how our approach can be applied in a real application. We selected the Abalone data set from the UCI machine learning repository for the case study. Our result contributes to work related to big data analysis; in addition, the data held in a DBMS can be analyzed directly by statistical methods. In future work, we will expand the scope of the connection between DBMS and statistical software to more products.
6. DISCUSSION
The biggest problem of a statistical database system is the cost of connecting the statistical software to the DBMS. For example, the 'SAS/Access' product must be purchased separately and installed on top of the SAS base system to connect SAS to a DBMS. This supplementary product is generally expensive, so many users have had difficulty using a statistical database system. In this paper, we selected the R system as the statistical software instead of SAS, and we used RODBC as the ODBC connector instead of SAS/Access, because R and RODBC are both free. Their performance is nevertheless similar to that of SAS, and for newer analytical functions such as statistical learning theory and machine learning algorithms they surpass SAS.
REFERENCES
[1] Sathi, A. Big Data Analytics. An Article from IBM Corporation, 2012.
[2] Heiberger, R. M., and Neuwirth, E. R through Excel – A Spreadsheet Interface for Statistics, Data Analysis, and Graphics. Springer, 2009.
[3] MySQL, The World's most popular open source database. http://www.mysql.com, accessed on October 2013.
[4] Sim, S., Kang, H., and Lee, Y. Access to Database through the R-Language. The Korean Communications in Statistics, 15, 1 (2008), 51-64.
[5] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, accessed on October 2013.
[6] SAS, http://www.sas.com, accessed on October 2013.
[7] SPSS, http://www-01.ibm.com/software/analytics/spss/, accessed on October 2013.
[8] Minitab, http://www.minitab.com, accessed on October 2013.
[9] S-Plus, http://solutionmetrics.com.au/products/splus/, accessed on October 2013.
[10] Wikipedia, the free encyclopedia. http://en.wikipedia.org, accessed on October 2013.
[11] Date, C. J. An Introduction to Database Systems. 7th edition, Addison-Wesley, 2000.
[12] Oracle, http://www.oracle.com, accessed on October 2013.
[13] Ripley, B. Package RODBC. CRAN R-Project, 2013.
[14] R-bloggers, On R versus SAS. http://www.r-bloggers.com/on-r-versus-sas/, accessed on December 2013.
[15] LinkedIn, Advanced Business Analytics, Data Mining and Predictive Modeling. http://www.linkedin.com/groups/SAS-versus-R-35222.S.65098787, accessed on December 2013.
[16] Clever Logic, MySQL vs. Oracle Security, http://cleverlogic.net/articles/mysql-vs-oracle, accessed on December 2013.
  • 26. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 12 [17]Find The Best, Oracle vs MySQL, http://database-management- systems.findthebest.com/saved_compare/Oracle-vs-MySQL, accessed on December, 2013. [18]Han, J., and Kamber, M. Data Mining Concepts and Techniques. Morgan Kaufmann, 2001. [19]R system, The R Project for Statistical Computing. http://www.r-project.org, accessed on October 2013. [20]Spector, P. Data Manipulation with R, Springer, 2008. [21]James, D. A., and DebRoy, S.Package RMySQL. CRAN R-Project, 2013.
  • 27. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 1 Pragmatic Approach to Component Based Software Metrics Based on Static Methods S. Sagayaraj Department of Computer Science Sacred Heart College, Tirupattur M. Poovizhi Department of Computer Science Sacred Heart College, Tirupattur ABSTRACT Component-Based Software Engineering (CBSE) is an emerging technique for reuse of software. This paper presents the component based software metrics by investigating the improved measurement techniques. Two types of metrics are used: static metrics and dynamic metrics. This research work presents the measured metric value for Complexity metrics and Criticality metric. The static metrics applied to the E-healthcare application which is developed with the reusable software components. The value of each metric is analyzed with the application. The metric measured value is the evidence for the reusability, good maintainability of component based software system. Keywords Component Based Software Engineering, Component Based Software Metrics, Component Based Software System. 1. INTRODUCTION The demand for new software applications is currently increasing at the exponential rate. The number of qualified and experienced professionals required for creating new software/applications is not increasing commensurably [1]. Software Reuse applications are built from existing components, primarily by assembling and replacing interoperable parts. So, software professionals have recognized reuse as a powerful means of potentially overcoming the above said software crisis and it promises significant improvements in software productivity and quality [2]. There are two approaches for reuse of code: develop the reusable code from scratch or identify and extract the reusable code from already developed code [3]. The organizations have experience in developing software, there exists extra cost to develop the reusable components from scratch to build and strengthen their reusable software reservoir. The cost of developing the software from scratch can be saved by identifying and extracting the reusable components from already developed and existing software systems or legacy systems [4]. But the problem of how to recognize reusable components from existing systems has remained relatively unexplored. In
  • 28. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 2 both the cases, whether the organization is developing software from scratch or reusing code from already developed projects, there is a need of evaluating the quality of the potentially reusable piece of software. Metrics is very essential to prove the quality of the components [5]. Software metrics are an essential part of the state-of-the-practice in software engineering. Goodman describes software metrics as: "The continuous application of measurement-based techniques to the software development process and its products to supply meaningful and timely management information, together with the use of those techniques to improve that process and its products"[6].Software metrics can do one of four functions such as understand, evaluate, control, predict. Various attributes, which determine the quality of the software, include maintainability, defect density, fault proneness, normalized rework, understandability, reusability etc [5]. To achieve both the quality and productivity objectives it is always recommended to go for the software reuse that not only saves the time taken to develop the product from scratch but also delivers the almost error free code, as the code is already tested many times during its software development [7]. During the last decade, the software reuse and software engineering communities have come to better understanding on component-based software engineering. The development of a reuse process and repository produces a base of knowledge that improves in excellence after every reuse, minimizing the amount of development work necessary for future projects, and ultimately reducing the risk of new projects that are based on repository knowledge [8]. CBSD centers on building large software systems by integrating previously existing software components. By enhancing the flexibility and maintainability of systems, this approach can potentially be used to reduce software development costs, assemble systems rapidly, and reduce the spiraling maintenance burden associated with the support and upgrade of large systems [9]. The paper is organized as follows: The related work on component based software metric is provided in Section 2. The list of Component based static and dynamic metrics in section 3. The detail of implementation is presented in Section 4. The analysis of complexity metrics and criticality metrics is described in section 5. Finally, the last section concludes the paper and offers further research in this area.
2. RELATED WORKS
Many works have been carried out in the area of component based software metrics. Some of them are summarized below.
In 2006, Nael Salman focused mainly on the complexity that results from factors related to system structure and connectivity [10]. A new set of properties that a component-oriented complexity metric must possess is also defined, and the metrics have been evaluated against these properties. A case study was conducted to assess the power of the complexity metrics in predicting integration and maintenance efforts. The results of the study revealed that component-oriented complexity metrics can be of great value in predicting both integration and maintenance efforts.
In 2007, Arun Sharma, Rajesh Kumar, and P. S. Grover surveyed a few existing component-based reusability metrics [11]. These metrics gave a broader view of a component's understandability, adaptability, and portability. The work also expresses the analysis, in terms of quality factors related to reusability, within an approach that helps significantly in assessing existing components for reusability.
In 2009, V. Lakshmi Narasimhan, P. T. Parthasarathy, and M. Das analyzed, evaluated and benchmarked a series of metrics proposed by various researchers using several large-scale openly available software systems [12]. A systematic analysis of the metric values was carried out and several key inferences were drawn from them, including inferences on the complexity, reusability, testability, modularity and stability of the underlying components.
In 2009, Misook Choi, Injoo J. Kim, Jiman Hong and Jungyeop Kim suggested component-based metrics applying the strength of dependency between classes; to increase the quality of components, they proposed these metrics so that measurement can be performed more precisely [13]. In addition, they proved the theoretical soundness of the proposed metrics by the axioms of Briand et al. and demonstrated their accuracy and practicality through a comparison with conventional metrics in the component development phase.
Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani and Marzanah A. Jabar observed that dependency between components is one of the most important issues affecting the structural design of a Component-
  • 30. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 4 Based Software System (CBSS) in 2011 [14]. Two sets of metrics that is, Component Information Flow Metrics and Component Coupling Metrics are proposed based on the concept of Component Information Flow from CBSS designer’s point of view. Jianguo Chen, Hui Wang, Yongxia Zhou, Stefan D. Bruda presented some such efforts by investigating the improved measurement tools and techniques, i.e., through the effective software metrics in 2011 [15]. Coupling, Cohesion and interface metrics are proposed newly and evaluated those metrics. The previous research explained the work done with varieties of Component Based Software Metrics. This paper deals about the static and dynamic metrics of component based software. This work is extended by developing the E-Healthcare application and the results are carried out for the static metrics. 3. COMPONENT BASED SOFTWARE METRICS The traditional software metrics focus on non-CBSS and are inappropriate to CBSS mainly because the component size is normally not known in advance. Inaccessibility of the source code for some components prevents comprehensive testing. So, the component based metrics are defined to evaluate the component based application. There are two types of metrics considered in this paper for measuring the values.  Static Metric Static metrics cover the complexity and the criticality within an integrated component. Static metrics are collected from static analysis of component assembly. The complexity and criticality metrics are intended to be used early during the design stage. The list of static metrics [16] is provided in Table 1.  Dynamic metric Dynamic metrics are gathered during execution of complete application. Dynamic metrics are meant to be used at implementation stage. The dynamic metrics are listed in Table 2 [15].
Table 1. Static Metrics
Sl.no | Metric Name | Formula
1 | Component Packing Density Metric | CPD = #constituents / #components
2 | Component Interaction Density Metric | CID = #I / #Imax
3 | Component Incoming Interaction Density | CIID = #Iin / #Imax_in
4 | Component Outgoing Interaction Density | COID = #Iout / #Imax_out
5 | Component Average Interaction Density | CAID = (sum of CID over all components) / #components
6 | Bridge Criticality Metric | CRIT bridge = #bridge_component
7 | Inheritance Criticality Metric | CRIT inheritance = #root_component
8 | Link Criticality Metric | CRIT link = #link_component
9 | Size Criticality Metric | CRIT size = #size_component
10 | #Criticality Metric | CRIT all = CRIT bridge + CRIT inheritance + CRIT link + CRIT size
Table 2. Dynamic Metrics
Sl.no | Metric Name | Formula
1 | Number of Cycles (NC) | NC = #cycles
2 | Average Number of Active Components |
3 | Active Component Density (ACD) |
4 | Average Active Component Density |
5 | Peak Number of Active Components | ACΔt = max { AC1, ..., ACn }
4. IMPLEMENTATION
The E-Healthcare application is developed to measure the static metrics. The application is designed as a number of components; the metrics are applied to the application and their values are measured. There are five modules in the E-healthcare application.
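Before walking through the application modules, the density formulas in Table 1 can be made concrete with a small sketch. This is only an illustration (the E-Healthcare application itself is not written in R); the helper names are ours, and the counts are the ones reported in the analysis of Section 5.

# density metrics of Table 1 expressed as simple ratios (illustrative helpers, not the paper's code)
cpd <- function(constituents, components) sum(constituents) / components   # Component Packing Density
density <- function(actual, maximum) actual / maximum                      # CID, CIID and COID share this form

ops <- c(3, 4, 4, 6, 5, 1, 19)   # operations per component, as listed later in Table 3
cpd(ops, 7)                      # 42 / 7 = 6 operations per component on average (Section 5.1)
density(51, 87)                  # Component Interaction Density, about 0.586 (Section 5.2)
density(37, 51)                  # Component Incoming Interaction Density, about 0.725 (Section 5.3)
density(28, 46)                  # Component Outgoing Interaction Density, about 0.609 (Section 5.4)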
4.1 Admin
The Admin module stores user, doctor and admin details. The admin is responsible for managing every record in the database.
4.2 Appointments and payments
This module is used to add and drop doctor details and to help users get appointments. The admin is the person responsible for adding new doctor details; an existing doctor can also be deleted by the admin.
4.3 Diagnosis and health
The Diagnosis and Health module is used to retrieve a user's diagnosis details. The information of users who take treatment through the application is stored in the database.
4.4 First aid and E-certificate
This module is used to get blood bank details for a required blood group. First aid medicine details for a particular disease are provided to the users, and the user can look up the treatment type, which helps in an emergency.
4.5 Symptoms and alerts
The Symptoms and alerts module is used to check the blood pressure (BP) level of the user. Patient information is retrieved from the database, and the symptoms and causes of a disease help users to prevent it. The pictorial representation of the modules of the application is shown in Figure 1.
Figure 1. Modules in E-healthcare Application (Admin, Appointments and payments, Diagnosis and health, First aid and E-certificates, Symptoms and Alerts)
Components are created to develop the whole application. The components (admin, appointments and payments, diagnosis and health, firstaid and e-certificate, symptoms and alerts, DBHelper, EhealthBL) are required to complete the component based application called E-Healthcare. The static metrics are applied to these components, and each component's value is measured according to the metric formula. The metric analysis is carried
out manually with the application. The metric values are calculated with the help of the database tables, web page forms and components.
5. ANALYSIS
The analysis is made to show that the CBSS has good reusability, maintainability and independence. The analyses of the Component Packing Density Metric, the Component Interaction Density Metrics (incoming, outgoing, average) and the Criticality Metrics are as follows:
5.1 Component Packing Density Metric
CPD is used to measure the number of operations that each component contains. The CPD is defined as the ratio of #constituents (LOC, objects/classes, operations, classes and/or modules) to #components.
#Constituent = one of the following: LOC, objects/classes, operations, classes and/or modules
#Component = number of components
For this metric, the number of operations of each component is listed in Table 3.
Table 3. Component Packing Density
S.No | Component Name | No. of operations
1 | Admin | 3
2 | Appointments and payments | 4
3 | Diagnosis and health | 4
4 | Firstaid and e-certificate | 6
5 | Symptoms and alerts | 5
6 | DBHelper | 1
7 | EhealthBL | 19
CPD = (3 + 4 + 4 + 6 + 5 + 1 + 19) / 7 = 42 / 7 = 6
Hence, the CPD metric tells us the average number of operations that each component contains.
5.2 Component Interaction Density Metric
The CID is defined as the ratio of actual interactions over potential ones. A higher interaction density causes a higher complexity in the interaction [17]. The CID metric is applied to the E-Healthcare application. The measured value of the actual interactions in each component of E-Healthcare is illustrated in Table 4.
  • 34. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 8 #I = no. of actual interactions #Imax = no. of maximum available interactions. Table 4. Actual interactions S.No Name of the page No. of actual interactions 1 Registration.aspx 4 2 Postquestion.aspx 2 3 Search.aspx 5 i/p, 5 o/p 4 Doctormanagement.aspx 6 5 Diagnosis.aspx 1 6 Searchmedicine.aspx 2 i/p, 3 o/p 7 Medicine.aspx 5 8 Bloodbank.aspx 4 9 Firstaidsuggestion.aspx 2 10 Medicalcertificate.aspx 3 11 Treatmenttype.aspx 1 i/p, 2 o/p 12 Symptoms.aspx 1 i/p, 3 o/p Total 51 The actual interaction value between other components is 51. The maximum no. of available interaction with other component is 87 =51/87 = 0.586 This metric brings out the number of incoming and outgoing interactions available in each component. This metric helps to know which component has greater connectivity with other component. 5.3 Component Incoming Interaction Density CIID is defined as a ratio of number of incoming interactions and maximum number of incoming interactions. A higher interaction density causes a higher complexity in the interaction. The no. of actual incoming interactions in each component is shown in the Table 5. #I in = no. of incoming interactions #Imax in = maximum no. of available incoming interactions.
Table 5. Incoming Interactions
S.No | Name of the page | No. of incoming interactions
1 | Registration.aspx | 4
2 | Postquestion.aspx | 1
3 | Search.aspx | 5
4 | Doctormanagement.aspx | 4
5 | Diagnosis.aspx | 1
6 | Searchmedicine.aspx | 2
7 | Medicine.aspx | 5
8 | Bloodbank.aspx | 4
9 | Firstaidsuggestion.aspx | 1
10 | Medicalcertificate.aspx | 2
11 | Treatmenttype.aspx | 4
12 | Symptoms.aspx | 4
Total | | 37
The number of incoming interactions is 37, and the maximum number of available incoming interactions is 51; out of those 51 interactions, only 37 actually link to other components.
CIID = 37/51 = 0.725
The CIID value of 0.725 clearly states that the density of incoming interactions with other components is high.
5.4 Component Outgoing Interaction Density
COID is defined as the ratio of the number of outgoing interactions to the maximum number of outgoing interactions. A higher interaction density causes a higher complexity in the interaction. The number of outgoing interactions in each component is shown in Table 6.
#I out = no. of outgoing interactions
#Imax out = maximum no. of available outgoing interactions
Table 6. Outgoing Interactions
S.No | Name of the page | No. of outgoing interactions
1 | Registration.aspx | 2
2 | Postquestion.aspx | 1
3 | Search.aspx | 5
4 | Doctormanagement.aspx | 1
5 | Diagnosis.aspx | 3
6 | Searchmedicine.aspx | 3
7 | Medicine.aspx | 1
8 | Bloodbank.aspx | 3
9 | Firstaidsuggestion.aspx | 1
10 | Medicalcertificate.aspx | 1
11 | Treatmenttype.aspx | 4
12 | Symptoms.aspx | 3
Total | | 28
The number of outgoing interactions is 28, and the maximum number of available outgoing interactions is 46; only 28 outgoing interactions are actually connected with other components.
COID = 28/46 = 0.608
The calculated value of 0.608 shows that there is a substantial density of outgoing interactions among the components.
5.5 Component Average Interaction Density
CAID represents the sum of the CID of each component divided by the number of components.
#components = number of components in the system
CAID = (sum of the interaction densities of the n components) / (no. of existing components)
Admin: The actual interfaces (incoming and outgoing) of the admin component are listed. The sum of the interaction density values for the admin component is shown in Table 7.
Table 7. Sum of CID for admin component
S.No | Name of the page | Sum of CID for admin component
1 | Registration.aspx | 4 out of 13 (only 4 of its 13 interfaces interact with other components)
2 | Login.aspx | 2 out of 2
3 | Postquestion.aspx | 1 out of 1
The summation of CID for the Admin component is 7/16: seven actual interactions out of sixteen possible ones. This component has good reliability.
Appointments and payments: The sum of the interaction density of the appointments and payments component is shown in Table 8. The sum considers both the incoming and outgoing interfaces of the component.
Table 8. Sum of CID for appointments and payments component
S.No | Name of the page | Sum of CID for appointments and payments component
1 | Search.aspx | 2 out of 2
2 | To get appointment | 4 out of 4
3 | Doctormanagement.aspx | 4 out of 6
The summation of CID for the Appointments and payments component is 10/12: 10 of its 12 interfaces have links with other components.
Diagnosis and health: The sum of the interaction density of the diagnosis and health component is shown in Table 9.
Table 9. Sum of CID for diagnosis and health component
S.No | Name of the page | Sum of CID for diagnosis and health component
1 | Diagnosis.aspx | 1 out of 2
2 | Searchmedicine.aspx | 2 out of 2
3 | Medicine.aspx | 4 out of 5
The summation of CID for the Diagnosis and health component is 7/9: 7 of its 9 interfaces take part in interactions with other components.
Firstaid and e-certificate: Table 10 shows the sum of CID for the component called firstaid and e-certificates.
Table 10. Sum of CID for firstaid and e-certificates
S.No | Name of the page | Sum of CID for firstaid and e-certificate
1 | Bloodbank.aspx | 1 out of 1, 3 out of 3
2 | Firstaidsuggestion.aspx | 1 out of 1
3 | Medicalcertificate.aspx | 2 out of 4
4 | Treatmenttype.aspx | 1 out of 1, 3 out of 7
The summation of CID for the Firstaid and E-certificates component is 11/17: out of 17 interfaces, only 11 interactions are connected with the rest of the components.
Symptoms and alerts: Table 11 shows the sum of CID for the component called symptoms and alerts.
Table 11. Sum of CID for symptoms and alerts
S.No | Name of the page | Sum of CID for symptoms and alerts
1 | Searchpatient.aspx | 1 out of 1, 3 out of 3
The summation of CID for the Symptoms and alerts component is 4/4; this component is completely connected with the other components.
The Component Average Interaction Density metric takes the ratio between the sum of the CID of each component and the number of existing components.
CAID = (7/16 + 10/12 + 7/9 + 11/17 + 4/4) / 7 = 0.5279
The measured value of this metric indicates good reliability of the components.
5.6 Bridge Criticality Metric
The bridge criticality metric is used to identify the bridge component; the component which acts as a bridge between other components is the bridge component.
CRIT bridge = #bridge_component
Out of the 7 components, EhealthBL acts as a bridge component between the other components and from the code-behind to the database. It contains all the queries to store and retrieve the information.
So, the bridge_component value is 1. This value explicitly tells us that one component operates as a bridge component for all the other components.
5.7 Inheritance Criticality Metric
Inheritance is deriving a new component from an existing component; the existing component is called the root component.
CRIT inheritance = #root_component
The interface is inherited from the existing/derived component. Root components:
• Symptoms and alerts (patient info inherited by the diagnosis component)
• EhealthBL (the query is inherited from the base query)
So, the root component value is 2, which shows that object-oriented programming concepts are utilized between the components.
5.8 Link Criticality Metric
The link criticality metric is used to identify the link component; the component which provides a link to other components is called the link component.
CRIT link = #link_component
The link component value is 1 (DBHelper). This value shows that the component acts as a link between the code-behind pages and the database.
5.9 Size Criticality Metric
The size criticality metric is used to identify the components which exceed the critical size level; such a component is called a size component.
CRIT size = #size_component
The size component value is 0. The critical size level is 60 lines per component, and no component exceeds this level.
5.10 # Criticality Metric
The sum of the bridge criticality, inheritance criticality, link criticality and size criticality is known as the Criticality Metric.
CRIT all = CRIT bridge + CRIT inheritance + CRIT link + CRIT size
CRIT all = 1 + 2 + 1 + 0 = 4
The combined value of 4 shows that a considerable degree of criticality is present.
Threshold Value
The threshold value is fixed at 0.5 and is used to compare the computed value of each metric. The comparison with this threshold value checks
whether the metric value increases or decreases with respect to reusability and good maintainability. Table 12 shows the result of the comparison with the threshold value.
Table 12. Comparison with threshold value
Metric Name | Comparison with Threshold Value
Component Packing Density Metric | Increasing
Component Interaction Density Metric | Increasing
Component Incoming Interaction Density | Increasing
Component Outgoing Interaction Density | Increasing
Component Average Interaction Density | Increasing
Bridge Criticality Metric | Increasing
Inheritance Criticality Metric | Increasing
Link Criticality Metric | Increasing
Size Criticality Metric | Decreasing
6. CONCLUSIONS
Building software systems with reusable components brings many advantages to organizations. Reusability may have several direct or indirect factors such as cost, effort, and time. This paper discussed various aspects of reusability for component-based systems and gave an insight into various reusability metrics for such systems. The qualities of the components are measured by applying the metrics to an e-healthcare application in an electronic commerce domain. The component-based metrics help to improve the quality of the designed components and to develop a component based system with good maintainability, reusability, and independence. Most of the metrics allow for future enhancements, which will help to add features later. The demand for new software applications is currently increasing at an exponential rate, and future enhancements will help to fulfill those requirements. The dynamic metric analysis can also be applied to the component based software application and validated, and based on further applications, enhanced metrics can be proposed for component based software systems.
REFERENCES
[1] Dr. Nedhal A. Al Saiyd, Dr. Intisar A. Al Said, Ahmed H. Al Takrori, Semantic-Based Retrieving Model of Reuse Software Component, IJCSNS International Journal of Computer Science and Network Security, Vol. 10, No. 7, July 2010.
[2] Joaquina Martín-Albo, Manuel F. Bertoa, Coral Calero, Antonio Vallecillo, Alejandra Cechich and Mario Piattini, CQM: A Software Component Metric Classification Model.
[3] Anas Bassam AL-Badareen, Mohd Hasan Selamat, Marzanah A. Jabar, Jamilah Din, Sherzod Turaev, Reusable Software Component Life Cycle, International Journal of Computers, Issue 2, Volume 5, 2011.
[4] Chintakindi Srinivas, Dr. C. V. Guru Rao, Software Reusable Components With Repository System, International Journal of Computer Science & Informatics, Volume 1, Issue 1, 2011.
[5] Parvinder S. Sandhu, Harpreet Kaur, and Amanpreet Singh, Modeling of Reusability of Object Oriented Software System, World Academy of Science, Engineering and Technology 56, 2009.
[6] Sarbjeet Singh, Manjit Thapa, Sukhvinder Singh and Gurpreet Singh, International Journal of Computer Applications (0975 – 8887), Volume 8, No. 12, October 2010.
[7] Linda L. Westfall, Seven steps to designing a software metrics, Principles of software measurement services.
[8] K. S. Jasmine and R. Vasantha, DRE – A Quality Metric for Component Based Software Products, World Academy of Science, Engineering and Technology 34, 2007.
[9] Iqbaldeep Kaur, Parvinder S. Sandhu, Hardeep Singh, and Vandana Saini, Analytical Study of Component Based Software Engineering, World Academy of Science, Engineering and Technology 50, 2009.
[10] Nael Salman, Complexity Metrics as Predictors of Maintainability and Integrability of Software Components, Journal of Arts and Science, May 2006.
[11] Arun Sharma, Rajesh Kumar, and P. S. Grover, A Critical Survey of Reusability Aspects for Component-Based Systems, World Academy of Science, Engineering and Technology 33, 2007.
[12] V. Lakshmi Narasimhan, P. T. Parthasarathy, and M. Das, Evaluation of a Suite of Metrics for CBSE, Issues in Informing Science and Information Technology, Vol. 6, 2009.
[13] Misook Choi, Injoo J. Kim, Jiman Hong, Jungyeop Kim, Component-Based Metrics Applying the Strength of Dependency between Classes, ACM Journal, March 2009.
[14] Majdi Abdellatief, Abu Bakar Md Sultan, Abdul Azim Abd Ghani, Marzanah A. Jabar, Component-based Software System Dependency Metrics based on Component Information Flow Measurements, ICSEA 2011.
[15] Jianguo Chen, Hui Wang, Yongxia Zhou, Stefan D. Bruda, Complexity Metrics for Component-based Software Systems, International Journal of Digital Content Technology and its Applications, Vol. 5, No. 3, March 2011.
[16] V. Lakshmi Narasimhan and Bayu Hendradjaya, Theoretical Considerations for Software Component Metrics, World Academy of Science, Engineering and Technology 10, 2005.
[17] E. S. Cho, M. S. Kim, S. D. Kim, Component Metrics to Measure Component Quality, the 8th Asia-Pacific Software Engineering Conference (APSEC), Macau, 2001, pp. 419-426.
  • 42. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 1 SDI System with Scalable Filtering of XML Documents for Mobile Clients Yi Yi Myint Department of Information and Communication Technology University of Technology (Yatanarpon Cyber City) Pyin Oo Lwin, Mandalay Division, Myanmar Hninn Aye Thant Department of Information and Communication Technology University of Technology (Yatanarpon Cyber City) Pyin Oo Lwin, Mandalay Division, Myanmar ABSTRACT As the number of user grows and the amount of information available becomes even bigger, the information dissemination applications are gaining popularity in distributing data to the end users. Selective Dissemination of Information (SDI) system distributes the right information to the right users based upon their profiles. Typically, the exploitation of Extensible Markup Language (XML) representation entails the profile representation, and the utilization of the XML query languages assist the employment of queries indexing techniques in SDI systems. As a consequence of these advances, mobile information retrieval is crucial to share the vast information from diverse data sources. However, the inherent limitations of mobile devices require information to be delivered to mobile clients to be highly personalized consistent with their profiles. In this paper, we address the issue of scalable filtering of XML documents for mobile clients. We describe an efficient indexing mechanism by enhancing XFilter algorithm based on a modified Finite State Machine (FSM) approach that can quickly locate and evaluate relevant profiles. Finally, our experimental results show that the proposed indexing method outperforms the previous XFilter algorithm in time aspect. Keywords XML, FSM, scalable filtering, SDI. 1. INTRODUCTION Nowadays the SDI System becomes increasingly an important research area and industrial topic. Obviously, there is a trend to create new applications for small and light computing devices such as cell phones and PDAs. Amongst the new applications, mobile information dissemination applications (e.g. electronic personalized newspapers delivery, ecommerce site monitoring, headline news, alerting services for digital libraries, etc.) deserve special attention. Recently, there have been a number of efforts to build efficient large-scale XML filtering systems. In an XML filtering system [4], constantly arriving streams of XML documents are passed through a filtering engine that matches documents to queries and routes the matched documents
  • 43. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 2 accordingly. XML filtering techniques comprise a key component of modern SDI applications. XML [3] is becoming a standard for information exchange and a textual representation of data that is designed for the description of the content, especially on the internet. The basic mechanism used to describe user profiles in XML format is through the XPath query language. XPath is a query language for addressing parts of an XML document. However, this technique often suffers from restricted capability to express user interests, being unable to rightly capture the semantics of the user requirements. Therefore, expressing deeply personalized profiles require a querying power just like SQL provides on relational databases. Moreover, as the user profiles are complex in mobile environment, a more powerful language than XPath is needed. In this case, the choice is XML-QL. XML-QL [7] has more expressive power compared to XPath and it is also measured the most powerful among all XML query languages. XML-QL’s querying power and its elaborate CONSTRUCT statement allows the format of the query results to be specified. The rest of the paper is organized as follows: Section 2 briefly summarizes the related works. Section 3 describes the proposed system architecture and its components. The operation of the system that is how the query index is created, the operation of the finite state machine and the generation of the customized results are explained in Section 4. Section 5 gives the performance evaluation of the system. Finally Section 6 concludes the paper. 2. RELATED WORKS We now introduce some existing XML filtering methods. XFilter [1] was one of the early works. The XFilter system is designed and implemented for pushing XML documents to users according to their profiles expressed in XML Path Language (XPath). XFilter employs a separate FSM per path query and a novel indexing mechanism to allow all of the FSMs to be executed simultaneously during the processing of a document. A major drawback of XFilter is its lack of expressiveness. In addition, XFilter does not execute the XPath queries to generate partial results. As a result, the whole document is pushed to the user when a document matches a user’s profile. This feature prevents XFilter to be used in mobile environments because the limited capability of the mobile devices is not enough to handle the entire document. Also XFilter does not utilize the commonalities between the queries, i.e. it produces a FSM per query. This observation motivated us to develop mechanisms that employ only a single FSM for the queries which have common element structure.
YFilter [2] overcomes the disadvantage of XFilter by using a Nondeterministic Finite Automaton (NFA) to exploit prefix sharing. The resulting shared processing provided tremendous improvements to the performance of structure matching but complicated the handling of value-based predicates. Moreover, the ancestor/descendant relationship introduces more matching states, which may cause the number of active states to increase exponentially, and post-processing is required for YFilter.
FoXtrot [5] is an efficient XML filtering system which integrates the strengths of automata and distributed hash tables to create a fully distributed system. FoXtrot also describes different methods for evaluating value-based predicates. Its performance evaluation demonstrates that it can index millions of queries and attain an excellent filtering throughput. However, FoXtrot requires extensions of the query language to reach full XPath or the expressive power needed for user profiles.
The NiagaraCQ system [6] uses XML-QL to express user profiles. It provides scalability through query groups and caching techniques. However, its query grouping ability is derived from execution plans, which is different from our proposed method, and the execution times of its queries do not make such planning a feasible candidate for mobile environments.
Accordingly, our system aims to solve the above problems and reduce the filtering time as much as possible.
3. PROPOSED SYSTEM ARCHITECTURE
We first present a high-level overview of our XML filtering system. We then describe the XML-QL language that we use to specify the user profiles in this work. The overall architecture of the system is depicted in Figure 1.
Figure 1. Overall architecture of the system
User profiles describe the information preferences of individual users. These profiles may be created by the users themselves, e.g., by choosing items in a Graphical User Interface (GUI) via their mobile phones. The user profiles
  • 45. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 4 are automatically converted into a XML-QL format that can be efficiently stored in the profile database and evaluated by the filtering system. These profiles are effectively “standing queries”, which are applied to all incoming documents. Filtered engine first creates query indices for user profiles and then parses the incoming XML documents to obtain the query results. The results are stored in a special content list, so that the whole document need not be sent. Extracting parts of an XML document can save bandwidth in a mobile environment. After that, filtered engine sends the filtered XML documents to the related mobile clients. 3.1 Defining User Profiles with XML-QL XML-QL has a SELECT WHERE construct, like SQL, that can express queries, to extract pieces of data from XML documents. It can also specify transformations that, for example, can map XML data between Document Type Definitions (DTDs) and integrate XML data from different sources. Profiles defined through a GUI are transformed into XML documents which contain XML-QL queries as shown in Figure 2. <Profile> <XML-QL> WHERE<course> <major> <name>ICT</name> <program>First Year</program> <syllabus>$n</syllabus> </major></course> IN “course.xml” CONSTRUCT<result><syllabus>$n</syllabus></result> </XML-QL> <PushTo> <address>…</address> </PushTo> </Profile> Figure 2. Profile syntax represented in XML containing XML-QL query 3.2 Filtered Engine The basic components of the filtered engine are 1) An event-based XML parser which is implemented using SAX API for XML documents; 2) A profile parser that has an XML-QL parser for user profiles and creates the Query Index; 3) A Query Execution Engine which contains the Query Index which is associated with Finite State Machines to query the XML documents; 4) Delivery Component which pushes the results to the related mobile clients (see Figure 3).
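As a minimal illustration of the event-based parsing performed by component 1), the sketch below prints SAX-style events for the course.xml document used later in Section 4. It assumes R with the XML package installed; the actual filtered engine is not implemented in R, so this only mimics the stream of events the query execution engine consumes. Figure 3 below shows how the four components fit together.

# Illustrative only: SAX-style event parsing of course.xml with R's XML package
library(XML)
handlers <- list(
  startElement = function(name, attrs, ...) cat("start element:", name, "\n"),
  text         = function(content, ...) if (nzchar(trimws(content))) cat("characters:", content, "\n"),
  endElement   = function(name, ...) cat("end element:", name, "\n")
)
invisible(xmlEventParse("course.xml", handlers = handlers))   # emits the kind of event stream shown in Table 1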
Figure 3. Filtered engine
4. OPERATION OF THE SYSTEM
The system operates as follows: the subscriber informs the filtered engine when a new profile is created or updated; the profiles are stored in an XML file that contains the XML-QL queries and the addresses to which the results are transmitted (see Figure 2). Profiles are parsed by the profile parser component, and the XML-QL queries in the profile are parsed by an XML-QL parser. While parsing the queries, the XML-QL parser generates an FSM representation for each query if the query does not match any existing query group. Otherwise, the FSM of the corresponding query group is used for the input query. The FSM representation contains state nodes for each element name in the queries, which are stored in the Query Index. When a new document arrives, the system alerts the filtered engine to parse the related XML document. The event-based XML parser sends the events it encounters to the query execution engine. The handlers in the query execution engine move the FSMs to their next states after the current states have passed level checking or character data matching. Meanwhile, the data in the document which match the variables are kept in content lists, so that all the partial data necessary for producing the results are formatted and pushed to the related mobile clients when the FSM reaches its final state.
4.1 Creating Query Index
Consider an example XML document and its DTD given in Figure 4.
<!-- DTD for Course -->
<!ELEMENT root (course*)>
  • 47. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 6 <!ELEMENT course (degree, major*)> <!ELEMENT degree (#PCDATA)> <!ELEMENT major(name, program, semester, syllabus*)> <!ELEMENT name (#PCDATA)> <!ELEMENT program (#PCDATA)> <!ELEMENT semester (#PCDATA)> <!ELEMENT syllabus (sub-code, sub-title, instructor)> <!ELEMENT sub-code (#PCDATA)> <!ELEMENT sub-title (#PCDATA)> <!ELEMENT instructor (#PCDATA)> <root> <course> <degree>Bachelor</degree> <major><name>ICT</name> <program>First Year</program> <semester>First Semester</semester> <syllabus> <sub-code>EM-101</sub-code> <sub-title>English</sub-title> <instructor>Dr. Thiri</instructor> </syllabus> </major> </course>…</root> Figure 4. An example XML document and its DTD (course.xml) The example queries and their FSM representations are shown in Figure 5. Note that there is a node in the FSM representation corresponding to each element in the query, and the FSM representation’s tree structure follows from XML-QL query structure. Query 1: Retrieve all syllabuses of first year program for ICT major. WHERE <major> <name>ICT</><program>First Year</><syllabus>$n</> </> IN “course.xml” CONSTRUCT<result><syllabus>$n</></> Q1.1 Q1.2 Q1.3 Q1.4 Q1.1 Q1.2 Q1.3 Q1.4 FSM for Query 1 Query 2: Find the instructor name of the subject code EM-101.
WHERE <syllabus> <sub-code>EM-101</><instructor>$s</> </> IN "course.xml"
CONSTRUCT<result><syllabus>$s</></>
FSM for Query 2 (states Q2.1, Q2.2, Q2.3)
Query 3: Retrieve all the instructors for the first year program in the ICT major.
WHERE<major> <name>ICT</><program>First Year</><syllabus> <instructor>$s</></> </> IN "course.xml"
CONSTRUCT<result><syllabus>$s</></>
FSM for Query 3 (states Q3.1, Q3.2, Q3.3, Q3.4, Q3.5)
Figure 5. Example queries and their FSM representations
We also substitute the constants in a query with parameters to create syntactically equivalent queries, which leads to the use of the same FSM for them. The state changes of an FSM are handled through the two lists associated with each node in the Query Index (see Figure 6). The current nodes of each query are placed on the Candidate List (CL) of their related element name. In addition, all of the nodes representing the future states are stored in the Wait Lists (WL) of their related element name. A state transition in the FSM is represented by copying a query node from the WL to the CL. Notice that the node copied to the CL also remains in the WL so that it can be reused by the FSM in future executions of the query, as the same element name may reappear at another level in the XML document. When the query index is initialized, the first node of each query tree is placed on the CL of the index entry of its relevant element name. The remaining elements in the query tree are placed in the relevant WLs. Query nodes in the CL designate that the state of the query might change when the XML parser processes the relevant elements of these nodes. When the XML parser catches a start element tag, the immediate child elements of this node in the Query Index are copied from the WL to the CL if a node in the CL of the element satisfies level checking or character data matching. The purpose of the level checking is to make sure that this element name may reappear in the document.
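The paper does not give code for the query index, but purely as an illustration of the CL/WL bookkeeping just described, the following is a minimal sketch in R (the data layout and helper name are ours). The initial entries mirror the initial states of Figure 6 below, derived from the three example queries.

# Illustrative only: query index keyed by element name, each entry holding a Candidate List (CL)
# of current query states and a Wait List (WL) of future states
index <- list(
  major      = list(CL = c("Q1.1", "Q3.1"), WL = character(0)),
  name       = list(CL = character(0),      WL = c("Q1.2", "Q3.2")),
  program    = list(CL = character(0),      WL = c("Q1.3", "Q3.3")),
  syllabus   = list(CL = "Q2.1",            WL = c("Q1.4", "Q3.4")),
  "sub-code" = list(CL = character(0),      WL = "Q2.2"),
  instructor = list(CL = character(0),      WL = c("Q2.3", "Q3.5"))
)

# a state transition copies a query node from an element's WL to its CL; the node stays
# on the WL so the same element name can be matched again at another level of the document
promote <- function(index, elem, node) {
  if (node %in% index[[elem]]$WL)
    index[[elem]]$CL <- union(index[[elem]]$CL, node)
  index
}

# e.g. when a <major> start tag satisfies the checks for Q1.1, its child nodes become candidates
index <- promote(index, "name", "Q1.2")
index <- promote(index, "program", "Q1.3")
index <- promote(index, "syllabus", "Q1.4")

In the real system each CL/WL entry would carry a full query-node record (expected level, attribute predicates, variable bindings) rather than a plain label.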
Figure 6. Initial states of the query index for the example queries
4.2 Operation of the Finite State Machine
When a new XML document activates the SAX parser, it starts generating events. The following event handlers handle these events:
Table 1. Sample SAX API
An XML Document: <?xml version="1.0"> <course> <major> <name> ICT </name> </major> </course>
SAX API Events: start document, start element: course, start element: major, start element: name, characters: ICT, end element: name, end element: major, end element: course, end document
Start Element Handler checks whether the query element matches the element in the document. For this purpose it performs a level check and an attribute check. If these are satisfied, it either enables data comparison or starts variable content generation. As the next step, the nodes in the WL that are the immediate successors of this node are moved to the CL.
  • 50. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 9 End Element Handler evaluates the state of a node by considering the states of its successor nodes. Moreover, it generates the output when the root node is reached. It also deletes the nodes from CL which are inserted in the start element handler of the node. This provides “backtracking” in the FSM. Element Data Handler is implemented for data comparison in the query. If the expression is true, the state of the node is set to true and this value is used by the End Element Handler of the current element node. End Document Handler signals the end of result generation and passes the results to the Delivery Component. 4.3 Generating Customized Results Results are generated when the end element of the root node of the query is encountered. Therefore, content lists of the variable nodes are traversed to obtain content groups. These content groups are further processed to produce results. This process is repeated until the end of the document is reached. The results require to be formatted as defined in the CONSTRUCT clause. After all, the queries results are sent to the related mobile clients. 5. PERFORMANCE EVALUATION In this section, we conducted three sets of experiments to demonstrate the performance of the architecture for different document sizes and query workloads. The graph shown in Figure 7 contains the results for different query groups, that is, the queries have the same FSM representation but different constants, for the document course.xml (1MB). When the number of queries on the same XML document is very large, the probability of having queries with the same FSM representation increases considerably. Figure 7. Comparing the performance by varying the number of queries The above experiment indicates that our proposed architecture is highly scalable, and a very important factor on the performance is the number of
  • 51. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 10 query groups and that generating a single FSM per query group rather than per query is well justified. Figure 8. Comparing the performance by varying depth The depth of XML documents and queries in the user profiles varies according to application characteristics. Figure 8 shows the execution time for evaluating the performance of the system as the maximum depth is varied. Here, we fixed the number of profiles at 25000 and varied the maximum depth of the XML document and queries from 1 to 10. Figure 9. Execution time of queries for different number of query groups and document sizes Figure 9 shows the results for the execution times of queries which are varied the number of query groups and the size of different documents. The results indicate that performance is more sensitive to document size when the number of query groups increases. Therefore, this result also confirms the importance of the query grouping. As final conclusion we can say that FSM approach proposed in this paper for executing XML-QL queries on XML documents is a very promising approach to be used in the mobile environments.
  • 52. International Journal of Computer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 8, No. 1. DECEMBER 2013 11 6. CONCLUSIONS Mobile communication is blooming and access to Internet from mobile devices has become possible. Given this new technology, researchers and developers are in the process of figuring out what users really want to do anytime from anywhere and determining how to make this possible. In addition, highly personalization is a very important requirement for developing SDI services in mobile environment as the limited capability of mobile devices is not enough to handle the entire documents. This paper attempts to develop an efficient and scalable SDI system for mobile clients based upon their profiles. We anticipate that one of the common uses of mobile devices will be to deliver the personalized information from XML sources. We believe that a querying power is necessary for expressing highly personalized user profiles and for the system to be used for millions of mobile users, it has to be scalable. Since the critical issue is the number of profiles compared to the number of documents, indexing queries rather than documents makes sense. We expect that the performance of the system will still be acceptable for mobile environments for millions of queries since the results of the experiments show that the system is highly scalable. 7. ACKNOWLEDGMENTS The authors wish to acknowledge Dr. Soe Khaing for her useful comments on earlier drafts of the paper. Our heart-felt thanks to our family, friends and colleagues who have helped us for the completion of this work. REFERENCES [1] M. Altinel and M. Franklin, “Efficient filtering of XML documents for selective dissemination of information,” Proc of the Int’l Conf on VLDB, pp. 53-64, Sept 2000. [2] Y. Diao, M. Altinel, M. Franklin, H. Zhang and P.M. Fischer, “Path sharing and predicate evaluation for high-performance XML filtering,” ACM Trans. Database Syst., 28(4), Dec 2003, pp. 467–516. [3] Extensible Markup Language, http://www.w3.org/XML/. [4] I. Miliaraki, Distributed Filtering and Dissemination of XML Data in Peer-to-Peer Systems, PhD Thesis, Department of Informatics and Telecommunications, National and Kapodistrian University of Athens, July 2011. [5] I. Miliaraki and M. Koubarakis, “FoXtrot: distributed structural and value XML filtering”, ACM Transactions on the Web, Vol. 6, No. 3, Article 12, Publication date: September 2012. [6] J. Chen, D. DeWitt, F. Tian and Y. Wang, “NiagaraCQ: a scalable continuous query system for internet databases”, ACM SIGMOD, Texas, USA, June 2000, pp.379-390. [7] XML-QL: A Query Language for XML, http://www.w3.org/TR/1998/NOTE-xml-ql- 19980819.