ISSN: 1694-2507 (Print)
ISSN: 1694-2108 (Online)
International Journal of Computer Science
and Business Informatics
(IJCSBI.ORG)
VOL 14, NO 1
JULY 2014
Table of Contents VOL 14, NO 1 JULY 2014
Symmetric Image Encryption Algorithm Using 3D Rossler System........................................................1
Vishnu G. Kamat and Madhu Sharma
Node Monitoring with Fellowship Model against Black Hole Attacks in MANET.................................... 14
Rutuja Shah, M.Tech (I.T.-Networking), Lakshmi Rani, M.Tech (I.T.-Networking) and S. Sumathy, AP [SG]
Load Balancing using Peers in an E-Learning Environment ...................................................................... 22
Maria Dominic and Sagayaraj Francis
E-Transparency and Information Sharing in the Public Sector ................................................................ 30
Edison Lubua (PhD)
A Survey of Frequent Subgraphs and Subtree Mining Methods ............................................................. 39
Hamed Dinari and Hassan Naderi
A Model for Implementation of IT Service Management in Zimbabwean State Universities ................ 58
Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou
Present a Way to Find Frequent Tree Patterns using Inverted Index ..................................................... 66
Saeid Tajedi and Hasan Naderi
An Approach for Customer Satisfaction: Evaluation and Validation ....................................................... 79
Amina El Kebbaj and A. Namir
Spam Detection in Twitter – A Review...................................................................................................... 92
C. Divya Gowri and Professor V. Mohanraj
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 1
Symmetric Image Encryption
Algorithm Using 3D Rossler System
Vishnu G. Kamat
M Tech student in Information Security and Management
Department of IT, DIT University
Dehradun, India
Madhu Sharma
Assistant Professor
Department of Computer Science, DIT University
Dehradun, India
ABSTRACT
Recently, a great deal of research has been done in the field of image encryption using chaotic maps. In this paper, we propose a new symmetric block cipher algorithm using the 3D Rossler system. The algorithm builds on the approaches of Mohamed Amin et al. [Commun. Nonlinear Sci. Numer. Simulat. (2010)] and Vinod Patidar et al. [Commun. Nonlinear Sci. Numer. Simulat. (2009)]. The merits of these algorithms, namely the encryption structure and the diffusion scheme respectively, are combined with an approach that splits the key across the three dimensions for the encryption of color (RGB) images. The experimental results suggest an overall better performance of the algorithm.
Keywords
Image Encryption, Rossler System, Block Cipher, Security Analysis.
1. INTRODUCTION
Image encryption is relatively different from text encryption. An image is made up of pixels, and these are highly correlated, so different approaches are followed for the encryption of images [1-12]. One such approach is known as chaotic cryptography. In this approach, encryption uses chaotic maps, which generate good pseudo-random numbers. Cryptographic properties of these maps, such as sensitive dependence on initial parameters and ergodic, random-like behavior, make them ideal for designing secure cryptographic algorithms. Many scholars have proposed various chaos-based encryption schemes in recent years [4-12].
A scheme proposed by Mohamed Amin et al. [11] uses the Tent map as the chaotic map and is implemented for gray-scale images. They proposed a new approach of treating the plaintext as blocks of bits rather than blocks of pixels. Another scheme, proposed by Vinod Patidar et al. [12], uses chaotic standard and logistic maps and introduces a way of spreading the bits using diffusion to avoid redundancy. In this paper, we propose an algorithm which utilizes the merits of the mentioned schemes. The
algorithm uses the Rossler system for the chaotic key generation. We
demonstrate a way to split the 3 dimensions of the key for the 3 image
channels i.e. Red, Green and Blue. The algorithm in [11] is used as a base
structure and the diffusion concept from [12] is used to spread the effect of
adding the key. The symmetric Feistel structure, diffusion method and key
splitting of the encryption scheme provide better results.
The rest of the paper is organized as follows: Section 2 provides a brief
overview of the Rossler system. Section 3 provides the algorithmic details.
The results of the security analysis are shown in section 4. Lastly, Section 5
concludes the paper.
2. BRIEF OVERVIEW OF 3D ROSSLER SYSTEM
The Rossler system is a system of non-linear differential equations which has chaotic properties [13]; Otto Rossler defined the equations in 1976. In the iterated form used here, the equations are as given below
Xn+1 = -Yn-Zn
Yn+1 = Xn + αYn (1)
Zn+1 = β + Zn (Xn-γ)
where, α, β and γ are real parameters. Rossler system's behavior is
dependent on the values of the parameters α, β and γ. For different values of
these parameters the system displays considerable changes. It may be
chaotic, converge toward a fixed point, follow a periodic orbit or escape
towards infinity. The Rossler system displays chaotic behavior for the
values of α=0.432, β=2 and γ=4.
The chaotic behavior refers to the fact that, keeping the parameters constant, even a slight change in the initial value brings a significant change in the subsequent values. For example, taking X0 = -1, the value Z0 = 0.3 generates Z1 = 0.5, while changing Z0 to 0.6 generates Z1 = -1. The same chaotic rule applies to changes in the other two dimensions
(X and Y). This chaotic behavior is known as deterministic chaos, i.e. the
knowledge of initial values and parameter values can help in recreating the
same chaotic pattern. Hence the initial conditions have to be shared between
the entities using the system for encryption/decryption process.
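The sensitivity example above can be reproduced numerically with a single iteration of equation (1). Taking X0 = -1 is our assumption, chosen because it is the value for which Z0 = 0.3 gives Z1 = 0.5 and Z0 = 0.6 gives Z1 = -1 under the Z-equation; the function name is ours, a minimal sketch only:

```python
# One iteration of the iterated Rossler system (1) with the
# chaotic parameter values alpha = 0.432, beta = 2, gamma = 4.
ALPHA, BETA, GAMMA = 0.432, 2.0, 4.0

def rossler_step(x, y, z):
    """Return (X_{n+1}, Y_{n+1}, Z_{n+1}) from (X_n, Y_n, Z_n)."""
    return (-y - z, x + ALPHA * y, BETA + z * (x - GAMMA))

# Reproduce the text's example (X0 = -1 is an assumption that matches it):
# Z0 = 0.3 gives Z1 = 0.5, while Z0 = 0.6 gives Z1 = -1.
_, _, z1_a = rossler_step(-1.0, 0.0, 0.3)
_, _, z1_b = rossler_step(-1.0, 0.0, 0.6)
print(z1_a, z1_b)  # 0.5 -1.0
```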
3. PROPOSED ALGORITHM
In this section we provide details of our algorithm. The algorithm is
designed to work with color images (RGB). In this scheme the plaintext
(image) is taken as blocks of bits. The block size is 8w, where ‘w’ is the
word size which is 32 bits. Each block of data is divided and stored into 8
w-bit registers and operations are performed on them. The key length
depends on the number of rounds 'r', i.e. the key length is 4r + 8 bytes. The number of rounds can vary from 1 to 255. We have taken 'r' to be 12 for our experimentation.
The flowchart shown in Fig. 1 displays the various steps performed on the
image during the encryption process. The steps are explained in the
following subsections.
Figure 1. Flowchart of the Encryption Scheme
3.1 Padding
The processing of the image is done on blocks of data: 256 bits, i.e. 32 bytes, of data are encrypted/decrypted at a time using eight 32-bit registers. The image size should be a multiple of 256 bits to ensure that there is always a full block for encryption. Hence padding is added to make the input block 32 bytes long when the image size in bytes is not an integral multiple of 32. A padding of all zeros (1-31 bytes) is appended to the end of each row to make the number of bytes in each row a multiple of 32.
For example if the image is of dimensions 252 x 252 pixels, a 4 byte
padding of zeros is appended at the end of each row. The last byte of the
image then stores the number of bytes used as padding as a pixel value i.e. 4
in this case. This pixel value is used to remove the padding after decryption.
After retrieving the number of bytes padded ‘n’, all rows are checked to
determine if zeros exist in all the last ‘n’ bytes and in ‘n-1’ bytes of the last
row. The padding is then removed to generate the original image.
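The padding step above can be sketched minimally, assuming one channel is handled as a list of byte rows (the helper name is ours):

```python
def pad_rows(rows, block=32):
    """Zero-pad each row (a bytes object) of one channel to a multiple
    of `block` bytes. Returns the padded rows and the per-row pad
    length, which the scheme stores as a pixel value in the last byte
    of the image so it can be stripped after decryption."""
    pad = (-len(rows[0])) % block
    return [row + bytes(pad) for row in rows], pad

# A 252-byte channel row (252-pixel-wide image) receives 4 pad bytes.
rows, pad = pad_rows([bytes(252), bytes(252)])
print(len(rows[0]), pad)  # 256 4
```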
3.2 Key Generation
The key is generated by the 3D chaotic Rossler system as shown in (1). The
number of key bytes ‘t’ depends on the number of rounds ‘r’ i.e. t=4r+8. We
use the three equations separately. The random sequence generated by each
equation of the map is used as a key separately during the encryption
process of the red, green and blue channel of the image respectively. The
key generation concept is as shown below. The steps repeat ‘t’ number of
times to generate necessary key bytes.
a. Iterate Rossler system of equations (1) ‘r’ times where ‘r’ is the
number of rounds.
b. Use the decimal part of the X, Y, Z values to generate the key byte.
Xn = abs (Xn - integer part); // decimal part of x
Yn = abs (Yn - integer part); // decimal part of y
Zn = abs (Zn - integer part); // decimal part of z
c. The key byte for each channel (R, G, B) is taken from the X, Y, Z
values respectively by mapping each to a value between 0 and 255.
d. For the next set of key bytes, the number of iterations is changed to
a value obtained by performing an exclusive-or on the current set of
key bytes.
Iterations for next key byte = XOR (Xn, Yn, Zn);
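The steps above can be sketched in code. Several details are our assumptions where the paper is not explicit: the byte mapping int(frac * 256), forcing at least one iteration when the exclusive-or result is zero, and applying step (b)'s fractional-part reduction after every iteration (the paper states it at extraction time) to keep values within floating-point range:

```python
def frac(v):
    """abs(decimal part) of v, as in step (b)."""
    return abs(v - int(v))

def generate_key(x, y, z, r, alpha=0.432, beta=2.0, gamma=4.0):
    """Sketch of the key schedule: t = 4r + 8 bytes per channel.

    One byte per channel (R, G, B) is drawn from the X, Y, Z dimensions
    respectively on each extraction (step c); the next iteration count is
    the XOR of the current key bytes (step d)."""
    t = 4 * r + 8
    key_r, key_g, key_b = [], [], []
    iters = r                                   # step (a): iterate r times first
    for _ in range(t):
        for _ in range(max(1, iters)):          # assumption: at least 1 iteration
            x, y, z = -y - z, x + alpha * y, beta + z * (x - gamma)
            x, y, z = frac(x), frac(y), frac(z)  # step (b), applied every step
        kr, kg, kb = (int(v * 256) % 256 for v in (x, y, z))  # step (c)
        key_r.append(kr); key_g.append(kg); key_b.append(kb)
        iters = kr ^ kg ^ kb                    # step (d)
    return key_r, key_g, key_b

kr, kg, kb = generate_key(0.1, 0.2, 0.3, r=12)
```

The same initial values and parameters regenerate the same key stream, which is what allows the receiver to decrypt.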
3.3 Vertical and Horizontal Diffusion
The diffusion process explained in [12] is used in the algorithm. The
horizontal diffusion in our algorithm is used in a slightly different way i.e. it
is performed separately on each channel after the encryption of the channel
rather than using it on the entire image. The diffusion ensures spread of the
key additions for the channel. The horizontal diffusion moves in the forward
direction from the first pixel of a channel to the last. The second pixel is the
exclusive or of first and second pixel of a channel, the third pixel is the
exclusive-or of the new second pixel and the third pixel, and so on. Thus the first pixel of the channel remains unchanged.
The Vertical Diffusion is performed before and after the entire encryption
and horizontal diffusion is performed on the 3 channels of the image. In
Vertical Diffusion the channels are treated collectively. The processing
occurs from the last pixel of the image to the first pixel. It starts by
performing XOR of the green and blue values of the last pixel of the image
with the red value of the second last pixel to form the new red value of the
second last pixel. The green value of the second last pixel is formed by
performing XOR operation on the red and blue values of the last pixel. The
blue value of the second last pixel is formed by XOR operation on the red
and green values of the last pixel. This continues in the backward direction.
Thus the last pixel remains unchanged.
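The two diffusion passes can be sketched as follows. For vertical diffusion the text spells out only the new red value explicitly; we assume the green and blue channels mirror the red rule (XOR of the other two channels of the already-processed following pixel with the channel's own old value), which keeps the pass invertible. Function names are ours:

```python
def horizontal_diffusion(channel):
    """Forward running XOR over one channel; the first value is unchanged."""
    out = list(channel)
    for i in range(1, len(out)):
        out[i] ^= out[i - 1]       # each new value uses the previous NEW value
    return out

def undo_horizontal_diffusion(channel):
    """Inverse pass, run backwards so original neighbors are recovered."""
    out = list(channel)
    for i in range(len(out) - 1, 0, -1):
        out[i] ^= out[i - 1]
    return out

def vertical_diffusion(pixels):
    """Backward pass over (r, g, b) pixels; the last pixel is unchanged.

    Assumption: green and blue follow the red rule given in the text,
    i.e. each channel of pixel i is XORed with the other two channels
    of pixel i+1."""
    out = [list(p) for p in pixels]
    for i in range(len(out) - 2, -1, -1):
        r1, g1, b1 = out[i + 1]
        out[i][0] ^= g1 ^ b1       # new red of pixel i
        out[i][1] ^= r1 ^ b1       # new green of pixel i
        out[i][2] ^= r1 ^ g1       # new blue of pixel i
    return [tuple(p) for p in out]
```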
3.4 Encryption/Decryption Scheme
The encryption is performed on 256 bits (32 bytes) of data at a time using eight 32-bit registers. The algorithm is shown in Fig. 2. In the initial step, four bytes of the key are added to alternate registers using 2's complement addition. Then for 'r' rounds arithmetic operations are
performed on the image data. It uses a function ‘f’, the output of which is
used as the number of rotations to be performed on another block of data.
After the swapping operation of the last round, the last four key bytes are
added. The entire encryption structure is displayed in Fig. 3. For decryption
the algorithm follows reverse of the encryption process.
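The structure described above (key bytes added to alternate registers before and after 'r' rounds, a function 'f' whose output sets rotation amounts, and a register swap each round) can be illustrated with the skeleton below. This is a structural sketch only: the real round function 'f' and the exact swap pattern are those shown in Fig. 2 and Fig. 3, and the placeholder 'f' here is not the paper's. The key-byte layout consumes 4r + 8 bytes, matching the stated key length:

```python
MASK = 0xFFFFFFFF                      # keep values to 32 bits

def rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    return rotl(x, -n)

def f(x):
    # Placeholder mixing function; the paper's actual f is in Fig. 2.
    return rotl((x * (2 * x + 1)) & MASK, 5)

def encrypt_block(regs, key, r):
    """regs: eight 32-bit words; key: 4r + 8 bytes (Section 3)."""
    s = list(regs)
    for i in range(4):                                  # initial 4 key bytes
        s[2 * i] = (s[2 * i] + key[i]) & MASK
    for rnd in range(r):
        rk = int.from_bytes(key[4 + 4 * rnd: 8 + 4 * rnd], "little")
        s[0] = rotl(s[0] ^ rk, f(s[1]))                 # f drives the rotation
        s = s[1:] + s[:1]                               # swap/rotate registers
    for i in range(4):                                  # final 4 key bytes
        s[2 * i] = (s[2 * i] + key[4 * r + 4 + i]) & MASK
    return s

def decrypt_block(regs, key, r):
    """Exact reverse of encrypt_block."""
    s = list(regs)
    for i in range(4):
        s[2 * i] = (s[2 * i] - key[4 * r + 4 + i]) & MASK
    for rnd in range(r - 1, -1, -1):
        s = s[-1:] + s[:-1]                             # undo the swap
        rk = int.from_bytes(key[4 + 4 * rnd: 8 + 4 * rnd], "little")
        s[0] = rotr(s[0], f(s[1])) ^ rk
    for i in range(4):
        s[2 * i] = (s[2 * i] - key[i]) & MASK
    return s
```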
Figure 2. Encryption Algorithm for each Channel (R, G, B)
Figure 3. The Image Encryption Structure
4. EXPERIMENTATION RESULTS
We performed security analysis on six 256 x 256 color (RGB) images as shown in Fig. 4. The statistical and differential analysis tests performed display very favorable results and demonstrate the strength and security of the algorithm. Results showing how the vulnerability in [11] is overcome are given in [14].
Figure 4. Plain images (clockwise from top left): Lena, Bridge, Lake, Plane, Peppers and Mandrill
4.1 Statistical Analysis
Statistical analysis is performed to determine the correlation between the
plain image and the cipher image. For an encryption system to be strong the
cipher image should not be correlated to the plain image and the cipher
image pixels should not have correlation among them. In this section we
provide the histogram and correlation analysis.
4.1.1 Histogram Analysis
When the encrypted image and the plain image do not show a high degree of correlation, we can consider the encryption to be secure from information leakage. Histograms plot the number of pixels at each intensity level, i.e. pixels having values 0-255, and thus display how the pixel values are distributed.
Fig. 5 depicts the histograms for the red, green and blue channels of the plain image 'lena' on the left side (top to bottom) and the histograms of the 'lena' image after encryption for the three channels respectively on the right side. They show that the encryption does not leave any concentration of a single pixel value.
Figure 5. Left Side: Histogram of 'lena' plain image for red, green and blue channels (top to bottom). Right Side: Histogram of encrypted 'lena' image for red, green and blue channels (top to bottom).
4.1.2 Correlation of Adjacent Pixels
In a plain image the adjacent pixels show a high degree of correlation in
horizontal, vertical and diagonal directions. The encrypted image should
have a very small degree of correlation among its adjacent pixels. We select
1000 random pairs of pixels from an image and the following formula gives
the correlation coefficient.
corr_xy = C(x, y) / ( √D(x) √D(y) )        (2)

where,

C(x, y) = (1/N) Σ_{i=1..N} (x_i − E(x)) (y_i − E(y))        (3)

D(x) = (1/N) Σ_{i=1..N} (x_i − E(x))²        (4)

E(x) = (1/N) Σ_{i=1..N} x_i        (5)

Here x_i and y_i form the i-th pair of adjacent pixels and N is the total number of pairs.
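Equations (2)-(5) translate directly into code; a small sketch, assuming the pixel pairs are given as plain Python lists (function names are ours):

```python
from math import sqrt

def expectation(v):                  # E(x), eq. (5)
    return sum(v) / len(v)

def deviation(v):                    # D(x), eq. (4)
    ex = expectation(v)
    return sum((vi - ex) ** 2 for vi in v) / len(v)

def covariance(x, y):                # C(x, y), eq. (3)
    ex, ey = expectation(x), expectation(y)
    return sum((xi - ex) * (yi - ey) for xi, yi in zip(x, y)) / len(x)

def corr(x, y):                      # corr_xy, eq. (2)
    return covariance(x, y) / sqrt(deviation(x) * deviation(y))

# Perfectly correlated pairs give a coefficient of 1.
print(corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```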
Table 1 shows the correlation coefficient values of the six plain images (Fig.
4) between horizontal, vertical and diagonal adjacent pixels. It can be noted
that the adjacent pixels are highly correlated.
Table 1. Correlation Values of Plain Images
Channels Plain Images Horizontal Vertical Diagonal
RED
Lena 0.9558 0.9781 0.9336
Bridge 0.8680 0.9070 0.8287
Lake 0.9234 0.9201 0.8886
Mandrill 0.8474 0.8032 0.7944
Peppers 0.9371 0.9392 0.9077
Plane 0.9205 0.9092 0.8546
GREEN
Lena 0.9401 0.9695 0.9180
Bridge 0.9055 0.9131 0.8700
Lake 0.9354 0.9272 0.8943
Mandrill 0.7285 0.6674 0.6487
Peppers 0.9657 0.9673 0.9451
Plane 0.8938 0.9174 0.8419
BLUE
Lena 0.9189 0.9495 0.8948
Bridge 0.9354 0.9411 0.9138
Lake 0.9377 0.9401 0.9099
Mandrill 0.8030 0.7914 0.7625
Peppers 0.9259 0.9330 0.8928
Plane 0.9179 0.8912 0.8563
Table 2 shows the correlation coefficient values for the red, green and blue channels of the cipher images formed by encrypting the plain images with the proposed encryption algorithm. The cipher images bear very little resemblance to the original images, and the adjacent pixels in the horizontal, vertical and diagonal directions are correlated to a very small degree.
Table 2. Correlation Values of Cipher Images
Channels Images Horizontal Vertical Diagonal
RED
Lena -0.0014 -0.0012 0.0004
Bridge -0.0040 -0.0066 -0.0010
Lake -0.0052 -0.0011 0.0018
Mandrill 0.0034 0.0001 0.0033
Peppers -0.0014 -0.0034 -0.0016
Plane -0.0024 -0.0043 0.0088
GREEN
Lena 0.0004 0.0067 -0.0026
Bridge -0.0053 -0.0017 0.0008
Lake 0.0044 -0.0025 0.0068
Mandrill -0.0031 -0.0041 0.0029
Peppers 0.0008 0.0027 0.0029
Plane 0.0026 -0.0003 0.0014
BLUE
Lena -0.0049 0.0014 -0.0005
Bridge 0.0023 0.0001 0.0037
Lake -0.0010 -0.0044 0.0002
Mandrill 0.0023 0.0001 -0.0014
Peppers -0.0016 -0.0006 0.0013
Plane 0.0040 -0.0007 0.0041
4.1.3 Correlation between plain and cipher image
The previous section examined correlation between adjacent pixels within the plain image or the cipher image. It is also necessary that there be no relevant correlation between the plain image and the corresponding cipher image. Rather than using pixel pairs from a single image, we use the pixels of the plain and cipher images at the same grid position.
The 2D correlation coefficients of the images are calculated by pairing the
three channels of the plain image with the three channels of the cipher
image. These form nine different pairs i.e. correlation between; red channel
of plain image and red channel of cipher image, red channel of plain image
and green channel of cipher image, red channel of plain image and blue
channel of cipher image; and so on for the green and blue channels of the
plain image. These are represented as CRR, CRG, CRB, CGR, CGG, CGB, CBR,
CBG, CBB; where for any Cij, i represents a channel (R,G,B) of plain image
and j represents a channel (R,G,B) of cipher image. The coefficient values
given in Table 3 depict that there is little or practically no correlation
between the plain image and its corresponding cipher image. The cipher
image thus displays characteristics of a random image.
Table 3. Correlation Values between Plain Image and Cipher Image
Images CRR CRG CRB CGR CGG CGB CBR CBG CBB
Lena -0.0033 0.0016 0.0047 -0.0026 -0.0008 0.0006 -0.0029 0.0003 -0.0021
Bridge -0.0029 0.0005 0.0003 -0.0020 -0.0006 0.0011 0.0008 0.0007 0.0010
Lake -0.0012 0.0002 0.0005 -0.0041 -0.0007 0.0033 -0.0050 -0.0021 0.0039
Mandrill -0.0019 -0.0004 -0.0024 -0.0035 0.0011 -0.0036 -0.0034 0.0005 -0.0036
Peppers -0.0030 -0.0059 -0.0022 -0.0033 -0.0024 -0.0012 -0.0042 -0.0007 0.0005
Plane 0.0072 0.0014 -0.0003 0.0068 0.0025 0.0015 0.0057 0.0033 0.0033
4.2 Differential Analysis
Differential analysis measures the amount of change that the encryption performs on the image. The encryption of two very similar images should not produce a similar distribution of pixels in the cipher images. In other words, the cipher images of two plain images differing in just a single pixel should not bear any pixel resemblance to each other. An adversary should not be able to extract any meaningful relationship between plaintext and ciphertext by comparing the two cipher texts of similar plaintexts.
NPCR (net pixel change rate) and UACI (unified average changing
intensity) are used as measures of differential analysis. NPCR indicates the
percentage of pixel change in the cipher image when a single pixel of plain
image is changed. UACI measures the average intensity of the change
between plain and cipher image.
Let us consider two cipher images X1 and X2, obtained from plain images P1 and P2 that differ in a single pixel. The pixel values at the grid position of the i-th row and j-th column of the cipher images are denoted X1(i, j) and X2(i, j). A bipolar array B is defined as follows:

B(i, j) = { 0, if X1(i, j) = X2(i, j)
          { 1, if X1(i, j) ≠ X2(i, j)        (6)
Values for NPCR and UACI are calculated as given in equations (7) and (8),
where W and H denote width and height of the cipher images, T denotes the
largest supported pixel value in the cipher images (255 in our case) and
abs() computes the absolute value. The NPCR and UACI values given in
Table 4 show that the encryption algorithm is secure against differential
attacks.
NPCR = ( Σ_{i,j} B(i, j) / (W x H) ) x 100%        (7)

UACI = ( 1 / (W x H) ) Σ_{i,j} ( abs(X1(i, j) − X2(i, j)) / T ) x 100%        (8)
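Equations (7) and (8) can be computed directly; a small sketch for cipher images given as 2D lists of pixel values (the function name is ours):

```python
def npcr_uaci(x1, x2, t=255):
    """NPCR and UACI, eqs. (7) and (8), for two equal-sized cipher
    images; T = 255 for 8-bit pixel data."""
    h, w = len(x1), len(x1[0])
    # B(i, j) from eq. (6): count positions where the ciphers differ.
    diff = sum(1 for i in range(h) for j in range(w) if x1[i][j] != x2[i][j])
    npcr = diff / (w * h) * 100
    uaci = sum(abs(x1[i][j] - x2[i][j]) / t
               for i in range(h) for j in range(w)) / (w * h) * 100
    return npcr, uaci

# One changed pixel out of four gives NPCR = 25%.
a = [[10, 20], [30, 40]]
b = [[10, 20], [30, 41]]
npcr, uaci = npcr_uaci(a, b)
print(npcr)  # 25.0
```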
Table 4. NPCR and UACI Values Obtained for Encryption of Six Plain Images and the Same Images with One Pixel Changed
Plain Images NPCR UACI
Lena 99.6333 33.4706
Bridge 99.5722 33.4403
Lake 99.5900 33.5313
Mandrill 99.6089 33.4595
Peppers 99.6185 33.4657
Plane 99.6206 33.4539
5. CONCLUSION
In this paper we proposed a new image encryption algorithm. The merits of recent research, judged on their results, were combined with a symmetric approach to encryption to provide a secure algorithm. The diffusion mechanism along with the Feistel structure makes the algorithm stronger. The 3D Rossler system of equations is used for random key generation, and splitting the three dimensions of the key across the three channels makes cryptanalysis to obtain the key more difficult. The experiments performed show that the algorithm generates favorable results.
REFERENCES
[1] Chang, C.-C., Hwang, M.-S. and Chen, T.-S., 2001. A New Encryption Algorithm for Image Cryptosystems. Journal of Systems and Software, Vol. 58, No. 2, pp. 83-91.
[2] Yano, K. and Tanaka, K., 2002. Image Encryption Scheme Based on a Truncated
Baker Transformation. IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, Vol. E85-A, No. 9, pp. 2025-2035.
[3] Gao, T. and Chen, Z., 2008. Image Encryption Based on a New Total Shuffling Algorithm. Chaos, Solitons and Fractals, Vol. 38, No. 1, pp. 213-220.
[4] Chen, G., Mao, Y. and Chui, C.K., 2004. A Symmetric Image Encryption Based on 3D
Chaotic Cat Maps. Chaos, Solitons and Fractals, Vol. 21, pp. 749-761.
[5] Mao, Y., Chen, G. and Lian, S., 2004. A Novel Fast Image Encryption Scheme Based
on 3D Chaotic Baker Maps. International Journal of Bifurcation and Chaos, Vol. 14,
No. 10, pp. 3613-3624.
[6] Guan, Z.-H., Huang, F. and Guan, W., 2005. Chaos Based Image Encryption
Algorithm. Physics Letters A, Vol. 346, pp. 153-157.
[7] Zhang, L., Liao, X. and Wang, X., 2005. An Image Encryption Approach Based on
Chaotic Maps. Chaos, Solitons and Fractals, Vol. 24, pp. 759-765.
[8] Gao, H., Zhang, Y., Liag, S. and Li, D., 2006. A New Chaotic Algorithm for Image
Encryption. Chaos, Solitons and Fractals, Vol. 29, pp. 393-399.
[9] Pareek, N.K., Patidar, V. and Sud, K.K., 2006. Image Encryption Using Chaotic
Logistic Map. Image and Vision Computing, Vol. 24, pp. 926-934.
[10] Wong, K.-W., Kwok, B.S.-H. and Law, W.-S., 2008. A Fast Image Encryption Scheme Based on Chaotic Standard Map. Physics Letters A, Vol. 372, pp. 2645-2652.
[11] Amin, M., Faragallah, O.S. and Abd El-Latif, A.A., 2010. A Chaotic Block Cipher Algorithm for Image Cryptosystems. Communications in Nonlinear Science and Numerical Simulation, Vol. 15, pp. 3484-3497.
[12] Patidar, V., Pareek, N.K. and Sud, K.K., 2009. A New Substitution-Diffusion Based Image Cipher Using Chaotic Standard and Logistic Maps. Communications in Nonlinear Science and Numerical Simulation, Vol. 14, pp. 3056-3075.
[13] Rossler, O.E., 1976. An Equation for Continuous Chaos. Physics Letters A, Vol. 57, No. 5, pp. 397-398.
[14] Kamat, V.G. and Sharma, M., 2014. Enhanced Chaotic Block Cipher Algorithm for Image Cryptosystems. International Journal of Computer Science Engineering, Vol. 3, No. 2, pp. 117-124.
This paper may be cited as:
Kamat V. G. and Sharma M., 2014. Symmetric Image Encryption
Algorithm Using 3D Rossler System. International Journal of Computer
Science and Business Informatics, Vol. 14, No. 1, pp. 1-13.
Node Monitoring with Fellowship
Model against Black Hole Attacks in
MANET
Rutuja Shah, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University
Lakshmi Rani, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University
S. Sumathy, AP [SG]
School of Information Technology & Engineering, VIT University
Abstract
Security issues have increased considerably in mobile ad-hoc networks. Due to the absence of any centralized controller, the detection of problems and recovery from them is difficult. Packet drop attacks are among the attacks that degrade network performance. In this paper, we propose an effective node monitoring mechanism with a fellowship model against packet drop attacks, setting up an observance zone where suspected nodes are observed for their performance and behavior. Threshold limits are set to monitor the equivalence ratio of the number of packets received at a node to the number transmitted by the node inside mobile ad hoc networks. This fellowship model enforces a binding on the nodes to deliver essential services in order to receive services from neighboring nodes, thus improving the overall network performance.
Keywords: Black-hole attack, equivalence ratio, fair-chance scheme, observance zone, fellowship
model.
1. INTRODUCTION
Mobile ad-hoc networks are infrastructure-less, self-organized and self-configured networks of mobile devices connected by radio signals. There is no centralized controller for networking activities such as monitoring, modification and updating of the nodes inside the network, as shown in figure 1. Each node is free to move in any direction and hence can change its links to other nodes frequently. There have been serious security threats in MANETs in recent years. These usually lead to performance degradation, reduced throughput, congestion, delayed response time, buffer overflow, etc. Among them is a well-known attack on packets called the black-hole attack, which is a form of DoS (Denial of Service) attack. In this attack, a router relays packets to different nodes, but due to the presence of malicious nodes
these packets are susceptible to packet drop attacks. This hinders secure and reliable communication inside the network.
Figure 1. MANET Scenario
Section 2 addresses the seriousness of packet drop attacks and related work done so
far in this area. Section 3 elaborates our proposal and defending scheme for packet
drop attacks. Section 4 provides concluding remarks.
2. LITERATURE SURVEY
Packet drop loss in ad-hoc networks has gained importance because of self-serving nodes which fail to provide the basic facility of forwarding packets to neighboring nodes. This seriously hampers the functioning of the network. Generally there are two types of such nodes: selfish and malicious. Selfish nodes act to enhance their own performance, while malicious nodes continually act to degrade the functioning of the network. WATCHERS [1], from UC Davis, was presented to detect and remove routers that maliciously drop or misroute packets. It was based on the "principle of packet flow conservation", but it could not differentiate well between malicious and genuine nodes. Although robust against byzantine faults, it is not very effective in today's internet world at reducing packet loss.
The basic mechanism of packet drop loss is that nodes do not forward packets to other nodes, whether selfishly or maliciously. Packet drop loss can occur due to a black hole attack. Sometimes routers behave maliciously and drop packets selectively; such attacks are known as "grey hole attacks". In the case of routers, the attacks can be traced quickly, while in the case of nodes it is a cumbersome task. Many researchers have worked in this field and have tried to find
solutions to this attack [2-6]. Energy level is one of the parameters on which researchers have based their results. This idea works on the basis of the ratio of the fraction of energy committed by a node to the overall energy contributed towards the network. A node is retained inside the network on the basis of its energy level, and the energy level is decided by the activeness of the node in the network through mathematical computations. These computations [7] are too complicated to grasp and sometimes the results are catastrophic. It can be said that the computations are accurate, but they are very much prone to ambiguity in the case of ad-hoc networks.
A few techniques involve the use of routing table information, which is modified after detecting the MAC address of a malicious node that uses jamming-style DoS attacks, to cease its activities [8]. Another approach to reducing attacks used a trust-management strategy based on historical evidence [9]: a direct trust value (DTV) was used amongst neighboring nodes to monitor the behavior of nodes against black hole attacks, depending on their past. However, there is a high possibility that trust values may get compromised by malicious nodes, and the third party used for setting the trust values is also vulnerable to attacks. Recent methods include the introduction of a new protocol called RAEED (Robust formally Analyzed protocol for wirEless sEnsor networks Deployment) [10], which reduces this attack but not by a considerable percentage. To overcome the issues faced in implementing these strategies, there is a need for an effective mechanism to curb these attacks and make the network more secure.
3. PROPOSED APPROACH
In this paper, we put forth a mechanism to reduce these packet-drop attacks by implementing a "node monitoring with fellowship" technique. We introduce an obligation on the nodes inside a particular network to render services to the network. If services are not rendered, the node is expelled from the network. However, we keep a "fair-chance" scheme for all nodes, which helps determine whether a node is genuine or malicious.
3.1 Fellowship of Network
The prime parameter we use to address the packet drop attack issue is the "equivalence ratio": the count of incoming packets at a node, excluding those destined for that node, should equal the count of outgoing packets, excluding those originated at that node. If the counts are the same, there is uniform distribution and forwarding of packets among the nodes inside the network. However, if the counts differ, that particular node is kept under an "observance zone" in order to monitor its suspicious behavior. We suggest periodical reporting by all nodes of their equivalence ratio to neighboring nodes inside the network.
This helps decide whether to keep a particular node in the observance zone, which can be done with polling techniques amongst the nodes. Inside the observance
zone, the suspected node is given "fair-chance" treatment. That is, during the observance-zone period, the suspected node is required to submit a "status-message" to neighboring nodes to prove the genuineness of its performance inside the network. Genuine nodes will promptly provide their status-messages to neighboring nodes, because they are willing to stay inside the network and render services under obligation to the network. Malicious nodes, however, may or may not reply with status-messages, since their aim is to degrade network performance. But only a fair chance is given for such status-messages: a standard threshold level is set up unanimously amongst the neighboring nodes inside the network, and status-messages are entertained only up to that threshold. So even if malicious nodes produce fake status-messages in order to remain inside the network, the threshold limits ensure they cannot degrade network performance much. When the threshold is crossed, the neighboring nodes are intimated about the node under the observance zone, and a unanimous decision is taken to expel the suspected node from the network.
Under this scheme, a suspected node may be expelled from the network under two circumstances: it is either a genuine node which is underperforming, or a malicious node. In both cases, the suspected node needs to be expelled because it is degrading the network's performance. The "fair-chance" scheme ensures that genuine nodes are given a fair chance to justify themselves and to recover quickly, proving their genuineness to render services to the network under obligation.
3.2 Scenario Assumptions
Let the nodes inside the MANET be connected to each other through wireless links,
and let packets be transmitted and received between them. Let the nodes be named
alphabetically A, B, C, ... Z. Let node X be a malicious node that drops packets
(mounting a black hole attack) and hence has a poor equivalence ratio, while node Y
is a genuine node that also has a poor equivalence ratio due to network congestion or
other network issues. All nodes inside the network follow the principle of "node
monitoring with fellowship".
The data structures used are the following networking parameters:
1) equi_ratio: the equivalence ratio of a node.
2) observance_zone: the list of suspected nodes inside the observance zone.
3) threshold_value: the threshold value decided by the nodes inside the MANET.
4) status_message: the status messages exchanged amongst neighboring nodes.
Steps involved:
Step 1: Every node calculates its own equivalence ratio (equi_ratio) and periodically
shares it with its neighboring nodes (assumed to be at one-hop distance).
Step 2: All nodes unanimously agree upon a standard threshold level (in this case,
threshold_value = 3) through an exchange of messages using agreement protocols.
Step 3: Every node monitors its neighbors' equi_ratio; if any node has a notably poor
equi_ratio, that node is placed on the "observance zone" list through a mutual
exchange of messages among the nodes inside the network. Such a node may be
malicious, or genuine but underperforming.
Step 4: Once a suspected node is on the "observance zone" list, it must report a
status_message to its neighboring nodes to justify its performance and behavior.
Step 5: A malicious node (node X) may either fake its status_message to appear
genuine and stay inside the network, or simply avoid sending one, since it wishes to
continue its malicious activities. A genuine node (node Y) will send its
status_message to prove its genuineness and will try to improve its performance by
recovering from the network issues it faces while sending packets. In both cases, the
fair-chance scheme limits how often a node may justify itself through
status_messages: suspected nodes are allowed to send a status_message only up to
threshold_value times (here, 3 times). In short, both malicious nodes and
underperforming genuine nodes are kept under surveillance to observe their behavior.
Step 6: Nodes that cross the threshold_value limit are immediately expelled from the
network through an exchange of protocol messages between the neighboring nodes.
In this way, packet-drop attacks can be considerably reduced. Figure 2 explains the
workflow mechanism.
Figure 2. Flowchart of proposed mechanism
[Flowchart: the threshold_value is set unanimously and equi_ratio is exchanged with
neighboring nodes periodically. If a node's equi_ratio is acceptable, normal network
activities continue; if it is unacceptable, the suspected node is placed under
observance_zone and status_messages are exchanged. While the status_message
count is less than or equal to threshold_value, monitoring continues; once it rises
above threshold_value, the suspected node is expelled from the network.]
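The steps above can be sketched in code. The following is a minimal single-node Python sketch under the paper's assumptions; the Node class, its packet counters and the monitor() helper are hypothetical simplifications for illustration, not the authors' implementation:

```python
THRESHOLD_VALUE = 3  # agreed unanimously by the nodes (Step 2)

class Node:
    def __init__(self, name):
        self.name = name
        self.packets_in = 0       # incoming packets not destined for this node
        self.packets_out = 0      # outgoing packets not originated at this node
        self.status_messages = 0  # fair-chance justifications used so far

    def equi_ratio_ok(self):
        # equivalence ratio: forwarded-in count should equal forwarded-out count
        return self.packets_in == self.packets_out

def monitor(node, observance_zone):
    """One monitoring round as seen by the neighboring nodes (Steps 3-6)."""
    if not node.equi_ratio_ok():
        observance_zone.add(node.name)        # Step 3: suspect the node
    if node.name in observance_zone:
        node.status_messages += 1             # Step 4: demand a status_message
        if node.status_messages > THRESHOLD_VALUE:
            return "expel"                    # Step 6: fair chance exhausted
    return "keep"
```

For a black-hole node X that forwarded only 2 of the 10 packets it received, four monitoring rounds end in expulsion; a node with an equal in/out count is never placed in the zone.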
3.3 Advantages:
1. The fair-chance scheme ensures that innocent nodes can prove their genuineness.
2. No complex mathematical computation of energy levels is required at each node.
3. Periodical reporting ensures removal of both underperforming and malicious
nodes from the network.
4. Network performance in the MANET is upgraded.
3.4 Disadvantages
However, the scheme incurs the overhead of exchanging a larger number of messages
among the neighboring nodes. Optimizing the number of messages exchanged during
communication can be addressed in future research.
4. CONCLUSION
In this paper, we have proposed a novel scheme to reduce packet drop attacks and
enhance network performance. We anticipate that our "node-monitoring with
fellowship" model may increase the number of messages exchanged amongst
neighboring nodes during the agreement protocols; at the same time, it is robust
against attacks and thus increases the availability of nodes in mobile ad-hoc
networks. Minimizing packet drop loss yields better utilization of the channel and
resources and guaranteed QoS, which results in productive priority management and
considerably better controlled traffic through periodic surveillance of the nodes.
Future research will aim to reduce the exchange of messages amongst the nodes,
minimize the overhead and achieve optimization inside mobile ad-hoc networks.
This paper may be cited as:
Shah, R., Rani, L. and Sumathy, S. 2014. Node Monitoring with Fellowship Model
against Black Hole Attacks in MANET. International Journal of Computer Science
and Business Informatics, Vol. 14, No. 1, pp. 14-21.
Load Balancing using Peers in an
E-Learning Environment
Maria Dominic
Department of Computer Science,
Sacred Heart College, India
Sagayaraj Francis
Department of Computer Science and Engineering,
Pondicherry Engineering College, India
ABSTRACT
When an e-Learning system is installed on a server, numerous learners make use of it
and download various learning objects from it. Most of the time the request is for the
same learning object, so the server performs the same repetitive task of locating the
file and sending it to the requesting client, wasting the server's CPU time on a task it
has already performed. This paper provides a novel structure and an algorithm that
store, in a dynamic hash table, the details of the clients who have already downloaded
each learning object; when a new request comes in, the table is looked up and the
learning object is sent from such a client to the requestor, saving the server's CPU
time by harnessing the computing power of the clients.
Keywords
Learning Objects, e-Learning, Load Distribution, Load Balancing, Data Structure,
Peer-to-Peer Distribution.
1. INTRODUCTION
1.1 e-Learning
Education is defined as the conscious attempt to promote learning in others,
so that they acquire knowledge, skills and character [1]. To achieve this
mission, different pedagogies were used; later, the advent of new
information and communication technology tools and the popularity gained
by the internet enhanced the teaching and learning process and gave way to
the birth of e-learning [2]. This enabled learners to learn across time and
geographical barriers and allowed them to have individualized learning
paths [3]. E-Learning, or electronic learning, is perceived as a combination
of the internet, electronic form and networks to disseminate knowledge.
The key factors of e-learning are reuse, sharing of resources and
interoperability [4]. At present there are various organizations
providing e-learning tools with multiple functionalities; one such tool is
MOODLE (Modular Object Oriented Dynamic Learning Environment) [5],
which is used on our campus. This variety in turn created difficulty in
sharing learning objects between heterogeneous sites, and standards such as
SCORM and SCORM LOM [6], IMS and IMS DRI [7], AICC [8] and the
like were proposed by different organizations. In Berners-Lee's famous
architecture for the Semantic Web, ontologies are used for sharing and
interoperability, which can be used to build better e-learning systems [9].
To define components for e-learning systems, the methodology used is the
principle of composability in Service Oriented Architecture [10], since it
enables us to define the inter-relations between the different e-learning
components. The most popular model used nowadays in the teaching and
learning process is the Felder-Silverman learning style model [11]. The
e-Learning components are based on key topics, topic types, associations
and occurrences. A VLE (Virtual Learning Environment) is the software
which handles all the activities of learning. Learning objects are the
learning materials which promote visual, verbal, logical and musical
intelligence [12] through presentations, tutorials, problem solving and
projects. Multimedia, gaming and simulation promote kinaesthetic
intelligence. Interpersonal, intrapersonal and naturalistic intelligence are
promoted by means of chat, SMS, e-mail, forums, video and audio
conferencing, surveys, voting and search. Finally, assessment is used to test
the knowledge acquired by the learner, and the repository is the place which
holds all the learning materials.
This algorithm is useful when learners access the learning objects stored in
the repository. It reduces the server's load by directing a client to respond to
the requestor with a file it has already downloaded from the server.
1.2 Load Balancing
The emergence of large, fast networks with thousands of connected
computers created the challenge of sharing resources effectively among the
computers in the network. Load balancing is a critical issue in peer-to-peer
(P2P) networks [14]. Existing load balancing algorithms for heterogeneous
P2P networks are organized in a hierarchical fashion. As P2P systems
gained popularity, it became mandatory to manage huge volumes of data
while keeping the response time acceptable to users. Simultaneous requests
for data from multiple clients may cause some peers to become bottlenecks,
creating severe load imbalance and degrading the response time to the user.
To reduce these bottlenecks and the overhead on the server, there was a
need to harness the computing power of the peers [15]. Much work has
been done on harnessing
the computing power of the computers in the network for high performance
computing and scientific applications; faster access to data and reduced
computing time are still to be explored. In a P2P network the data is
de-clustered across the peers. When a popular piece of data is required
from across the peers, a bottleneck occurs, degrading the system response.
To handle this, a new strategy using a new structure and an algorithm is
proposed in this paper.
2. PROPOSED DATA STRUCTURE AND THE ALGORITHM
The objective of this architecture is to harness the computational power of
the clients in the network. The architecture is described with respect to the
clients available in the e-learning network. The network comprises Master
of Computer Applications students accessing learning materials for their
course. The degree programme runs for three years, so the clients are
categorized into three clusters, namely I MCA, II MCA and III MCA; we
call these class clusters. Every class cluster contains many clusters inside it,
which we call file clusters, one cluster per type of file, since learning
objects can be presentations, video, audio, pictures, animations, etc. [13].
An address table, named the file address table, holds the address of each
file cluster in the class cluster. When a request for a file is received, the
corresponding cluster is identified by reading the address from the address
table. The algorithm below represents the working logic of the concept.
The data structure is represented in Figure 1. Every file cluster holds a
Dynamic Hash Table (DHT), a linked list and a binary tree. The dynamic
hash table holds the addresses of the linked lists, which hold the names of
files already downloaded from the server. The hashing function used to
identify an index in the DHT is as follows:
1. Represent every character in the filename by its position in the
alphabet followed by its position in the filename.
E.g. file name abc.ppt = 112233; the value for a is 11, since its position
in the alphabet is 1 and its position in the file name is 1.
2. Sum all the digits calculated in step 1.
E.g. 112233 → 1+1+2+2+3+3 = 12.
3. Divide the sum by the length of the file name, so 12/3 = 4, which
becomes the index for the file in the DHT. These three steps are
formulated mathematically in equation (1).
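The three steps above can be sketched as a small Python function. This is an interpretation of the paper's scheme, not the authors' code; note that, following the worked example, the length used in step 3 is that of the base name without its extension:

```python
def hashed(filename):
    """DHT index per the paper's 3-step hashing: concatenate each
    character's alphabet position with its position in the name,
    sum the resulting digits, then divide by the name length."""
    base = filename.split(".")[0]                # "abc" for "abc.ppt", as in the example
    digits = ""
    for pos, ch in enumerate(base, start=1):
        alpha = ord(ch.lower()) - ord("a") + 1   # a=1, b=2, ...
        digits += str(alpha) + str(pos)          # e.g. first 'a' -> "11"
    digit_sum = sum(int(d) for d in digits)      # "112233" -> 12
    return digit_sum // len(base)                # 12 // 3 -> index 4
```

hashed("abc.ppt") gives 4, matching the worked example; hashed("bac.ppt") also gives 4, which is exactly the kind of collision that the linked list in Figure 1 resolves.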
[Figure: the address table holds the addresses of the PPT, audio and video file
clusters. Each file cluster (e.g. the cluster of presentational files) contains a
Dynamic Hashing Table whose slots (0, 1, 2, 3, ...) point to linked lists of
downloaded file names (e.g. abc.ppt → bac.ppt → NULL); each list node also points
to a binary tree whose nodes hold a client's IP and CPU usage time.]
Figure 1. Proposed Data Structure
Every index of the DHT holds the starting address of a linked list, each node of
which stores the name of a file that has already been downloaded. The linked list
structure is used to avoid index collisions between filenames that generate the
same index in the DHT: a collision is handled by creating a new node in the
linked list for the new file name. As shown in Figure 1, every node in the linked
list holds three values, namely the file name, the address of a binary tree, and the
address of the next node in the list. The nodes of the binary tree hold the IPs of
active clients and their current CPU processing status. The binary tree is used to
identify the client with the least-used CPU, which will transfer the file to the
requestor; this harnesses the computing power of the least-used CPU. The binary
tree structure is used to reduce the search time for the least-used client. If the file
has not been downloaded by any client, i.e. when the last node of the linked list is
reached, the file is transferred from the server.
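As a rough illustration (not the authors' implementation), the structures just described can be sketched in Python; the class and field names are hypothetical:

```python
class TreeNode:
    """Binary tree node: an active client's IP and its CPU usage time."""
    def __init__(self, ip, cpu_usage):
        self.ip = ip
        self.cpu_usage = cpu_usage
        self.left = None
        self.right = None

class ListNode:
    """Linked list node: file name, its client tree, and the next node."""
    def __init__(self, filename):
        self.filename = filename
        self.tree = None      # binary tree of clients holding this file
        self.next = None

class FileCluster:
    """A file cluster: DHT slots, each pointing at a linked list head."""
    def __init__(self, slots=10):
        self.dht = [None] * slots

    def insert(self, index, filename):
        # index collisions are resolved by prepending a new list node
        node = ListNode(filename)
        node.next = self.dht[index]
        self.dht[index] = node
        return node
```

With this layout, abc.ppt and bac.ppt (which hash to the same index) simply become two nodes in the same slot's linked list, as in Figure 1.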
Algorithm 1 SEND ( ) {
Request is directed to the file cluster
Address of the file cluster is taken from the address table
Index location of the file = HASHED (File Name)
If the index is out of bounds
{
The file has not been downloaded by any client
It is sent from the server to the client
}
Else
{
While (not end of linked list AND the node is not found)
{
If (Node.data == File Name)
{
Node found = true
Least-usage-time CPU IP = LEASTUSEDCPU ( )
}
}
If (Node found == true)
Send the requested file from that IP to the requestor
Else
{
The file has not been downloaded by any client
It is sent from the server to the client
}
}
}
End of SEND ( )
Algorithm 2 int LEASTUSEDCPU ( ) {
Leastusedcpu = IP of the first node
While (not end of binary tree)
{
Compare the CPU usage time of the current node
with that of the least-used node found so far
If the CPU usage time of the current node is lesser
Leastusedcpu = IP of the current node
}
return (Leastusedcpu)
}
End of LEASTUSEDCPU ( )
Algorithm 3 int HASHED (String Filename) {
Len = StringLength(Filename)
While (not end of string)
{
IndexString += (position of the character in the alphabet list
followed by its position in the Filename)
}
IndexInt = SumOfDigits(IndexString)
return (IndexInt / Len)
}
End of HASHED ( )
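Putting the three algorithms together, here is a minimal runnable Python sketch under simplifying assumptions: a plain dict stands in for the DHT, a list of (filename, clients) pairs for the linked list, and a flat list of (cpu_usage, ip) tuples for the binary tree, searched with min():

```python
def send(dht, index, filename):
    """Resolve a request: return the IP of the least-used client holding
    the file, or 'SERVER' if no client has downloaded it yet."""
    bucket = dht.get(index, [])          # the linked list for this DHT slot
    for name, clients in bucket:         # walk the list (collision handling)
        if name == filename and clients:
            # LEASTUSEDCPU: pick the client with the smallest CPU usage time
            return min(clients)[1]
    return "SERVER"                      # fall back to the server

# Hypothetical state: abc.ppt and bac.ppt both hash to index 4
dht = {4: [("abc.ppt", [(70, "10.0.0.2"), (35, "10.0.0.7")]),
           ("bac.ppt", [(55, "10.0.0.9")])]}
```

Here send(dht, 4, "abc.ppt") returns "10.0.0.7" (the least-loaded holder), while a request for a file no client has yet falls back to "SERVER".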
3. MATHEMATICAL FORMULATION
The mathematical formulation of the problem dealt with above is as follows:

index = ( Σ_{i=0}^{l} ( a(i+1) + j ) ) / l                  (1)

Σ_{K=1}^{n} index( Z_K(f3) ) = X(f2)   → go to (4), (5)     (2)

Σ_{K=1}^{n} index( Z_K(f3) ) ≠ X(f2)   → go to (6)          (3)

Min Σ_{m=1}^{n} Y = C(m, m+1)                               (4)

f2 = X⁻¹ Y(f1)                                              (5)

f2 = X⁻¹ S(f1)                                              (6)
where,
index is the index in the Dynamic Hash Table
l is the length of the file name
i is the character position in the file name
j = { 1,2,3,4,5….26 }
k is the number of nodes in the linked list
Z is the node in the linked list
f3 is the file name in the node in the linked list
f2 is the targeted file
m is the nodes in the Binary Tree
Y is the node in the Binary Tree with the minimum C
C is the CPU usage time of the specified IP
S is the Server
4. CONCLUSION
The main advantage of this architecture is that server time is saved by
harnessing the computational power of the clients who have already
downloaded a file, letting them send it across to the requestor. Another
advantage is faster file search, thanks to the dynamic hashing table and
binary tree structures. The algorithm is currently being implemented using
PHP, and its results will appear in further publications. Initial results
indicate a substantial reduction in the server's CPU processing time when
this algorithm is executed on the server.
REFERENCES
[1] Lavanya Rajendran, Ramachandran Veilumuthu., 2011. A Cost Effective Cloud
Service for E-Learning Video on Demand, European Journal of Scientific Research,
pp.569-579.
[2] Maria Dominic, Sagayaraj Francis, Philomenraj., 2013. A Study on Users on Moodle
through Sarasin Model, International Journal of Computer Engineering and
Technology, Volume 4, Issue 1, pp. 71-79.
[3] Maria Dominic, Sagayaraj Francis., 2013. Assessment of Popular E-Learning Systems
via Felder-Silverman Model and a Comprehensive E-Learning System,
International Journal of Modern Education and Computer Science, Hong Kong,
Volume 5, Issue 11, pp. 1-10.
[4] Zhang Guoli, Liu Wanjun, 2010. The Applied Research of Cloud Computing. Platform
Architecture in the E-Learning Area, IEEE.
[5] www.moodle.org
[6] SCORM(Sharable Courseware Object Reference Model), http://www.adlnet.org
[7] IMS Global Learning Consortium, Inc., “Instructional Management System (IMS)”,
http://www.imsglobal.org.
[8] http://www.aicc.org
[9] Uschold, Gruninger., 1996. Ontologies, Principles , Methods and Applications,
Knowledge Engineering Review, Volume 11, Issue 2.
[10]Papazoglou, Heuvel., 2007. Service Oriented Architectures: Approaches,
Technologies, and research issues, The VLDB Journal, , Volume 16, Issue 3, pp. 389-
415.
[11] Graf, Viola, Kinshuk., 2006. Representative Characteristics of Felder-Silverman
Learning Styles: an Empirical Model, IADIS, pp. 235-242.
[12]Lorna Uden, Ernesto Damiani., 2007. The Future of E-Learning: E-Learning
ecosystem, Proceeding of IEEE Conference on Digital ecosystems and Techniques,
Australia, pp. 113-117.
[13] Maria Dominic, Sagayaraj Francis., 2012. Mapping E-Learning System to Cloud
Computing, International Journal of Engineering Research and Technology, India,
Volume 1, Issue 6.
[14] Chyouhwa Chen, Kun-Cheng Tsai., 2008. The Server Reassignment Problem for
Load Balancing in Structured P2P Systems, IEEE Transactions on Parallel and
Distributed Systems, Volume 19, Issue 2.
[15]A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica., 2006. Load
Balancing in Structured P2P Systems. Proc. Second Int’l Workshop Peer-to-Peer
Systems (IPTPS ’03).
This paper may be cited as:
Dominic, M. and Francis, S. 2014. Load Balancing using Peers in an E-
Learning Environment. International Journal of Computer Science and
Business Informatics, Vol. 14, No. 1., pp. 22 -29.
E-Transparency and Information
Sharing in the Public Sector
Edison Lubua (PhD)
Mzumbe University, P.O. Box 20266,
Dar Es Salaam, 255, Tanzania
ABSTRACT
This paper determines the degree of information sharing in government institutions through e-
transparent tools. First the basis for the study is set through the background, problem statement
and objectives. The discussion then proceeds by focusing on ICT tools for information sharing.
An information sharing model is proposed and the extent of information sharing in the public
sector of Tanzania through online media is discussed; furthermore, the correlation that exists
between the extent of information sharing and factors such as accessibility, understandability,
usability and reliability is established. The paper concludes by providing recommendations on
information sharing and how it can be enhanced through e-transparency systems for public
service delivery in an open society.
Keywords
E-transparency, E-Governance, Information Sharing, Public Sector, ICT.
1. BACKGROUND OF THE STUDY
Generally, information services are an important pillar for any democratic
government. Citizens rely on information for making decisions which impact
upon their social, political and economic lives. In this regard, there are laws
which govern the right to access and disseminate information, locally and
internationally (Hakielimu, LHRC, REPOA, 2005). Locally, government
authority reflects international agreements through different pieces of legislation,
including the National Constitution (United Republic of Tanzania, 1995). The
constitution of Tanzania entitles every citizen to the right of access to information
and empowers citizens with the right to disseminate information.
In his study, Onyach-Olaa (2003) commended government authorities which
make an effort to enhance information sharing with citizens. A government has
to improve interaction with those it governs while treating information sharing
as a core function. Furthermore, information sharing and transparency in
government operations must become the culture of any democratic republic,
including Tanzania (Mkapa, 2003). Transparency in government operations
improves the confidence of citizens in their government, while reminding
government leaders that their decisions and their impact are visible to citizens
(Navarra, 2006). Traditionally, information services have been provided and
received through physical means; mostly, people use oral/listening and
writing/reading methods to issue and receive information. In many cases, the
traditional method of information sharing is characterised by delays, high cost,
low transparency and bureaucracy (Im & Jung, 2001); as a result, this method
allows accountability to be subverted (Lubua, 2014).
Arguably, the communication developments brought by the use of Information
and Communication Technology (ICT) tools provide a better platform for
information sharing. Instant communication is enabled through tools such as
email, online telephony, video conferencing, chat rooms and social websites. As
a result of these tools, challenges related to delays, high communication costs
and bureaucratic procedures are addressed.
Apart from the platform provided by online media in enhancing communication,
it is equally important to understand that the efficiency of information sharing is
directly related to the size of the network connecting individuals, groups of
people and organisations (Hatala & Lutta, 2009). The denser the network, the
more information is received; an organisation enjoys these benefits if it forms
strategic alliances with partners that allow a free flow of information to both
ends. This is the reason why the e-governance agency was instituted in
Tanzania.
The appropriate use of e-transparency tools is perhaps the best strategy for an
organisation to enhance information sharing with its stakeholders. The
organisation has to emphasize good qualities of information sharing such as
timely response, accessibility of systems, reliability of data, online security,
completeness of online procedures and openness in service processes. Basically,
this paper discusses the need for online information sharing in the public sector
and the extent to which government institutions apply online media for
information sharing and service provision. The study is based on the opinions of
clients who consume such services.
2. PROBLEM STATEMENT
Business competition compels organisations to invest in information systems to
improve the efficiency of their operations (Barua, Ravindran, & Whinston, 2007).
This investment is made possible through the knowledge of employees, suppliers,
customers, and other key stakeholders. In this regard the organization that shares
its information with stakeholders more efficiently earns a competitive advantage
(Drake, Steckler, & Koch, 2004).
Information sharing is an important resource which should be embraced in order
to enhance the performance of an organisation (Hatala & Lutta, 2009).
Depending on the type of organisation, the extent of information sharing is partly
influenced by organisational policies and practices. The management team,
employees and partners have to work together to foster organisational
information sharing, which guarantees the future existence of the organisation
(Drake, Steckler, & Koch, 2004).
The government of Tanzania acknowledges the importance of ICTs in promoting
information sharing in society. It uses methods such as conferences, workshops
and public portals to show its intention of maximizing information sharing. With
the growth in the number of ICT users, the degree of information sharing is
expected to increase. Therefore, this study intends to establish the extent to
which the use of ICTs has enhanced information sharing. Further, the study will
establish the correlation between the extent of information sharing and factors
which negatively influence the perception of users.
3. OBJECTIVES
This study is designed to cover the following objectives:
i. To determine the extent of information sharing through e-transparency in
the Tanzanian public sector.
ii. To establish the extent to which information usefulness,
understandability, reliability and accessibility influence information
sharing through e-transparency systems.
4. METHODOLOGY
This study was conducted through a mixed research method. First, a number of
works in the literature were reviewed to establish the study's relevance. Then the
Tanzania Revenue Authority's Customs Online System was identified as a case
study, followed by survey procedures. Data were collected from twenty (20)
clearing and forwarding companies that operate under the Customs regulations of
the Tanzania Revenue Authority; a total of 40 responses were received and
analysed. The study collected data from original sources to enhance validity and
relevance. The analytical models used include Spearman's correlation and
regression analysis.
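For reference, Spearman's rank correlation used in the analysis can be computed with a short Python sketch. The survey scores shown are made up for illustration, and the closed-form formula below assumes no tied ranks:

```python
def ranks(values):
    """Rank each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores: perceived accessibility vs. extent of sharing
accessibility = [1, 2, 3, 4, 5]
sharing       = [2, 1, 4, 3, 5]
```

For these made-up scores, spearman_rho(accessibility, sharing) evaluates to 0.8, a strong positive rank correlation.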
5. ICT TOOLS AND GOVERNMENT INFORMATION SHARING
Transparency is one of the pillars of good governance: it promotes openness in
conditions and activities, and it ensures that stakeholders have the information
necessary to make decisions for the progress of their business and their lives.
Information thus forms the cornerstone of transparency, especially in civic
organisations.
In the management of civic institutions, information dissemination provides
guidance and education to stakeholders on the different matters that influence
their lives, including political, socio-economic and cultural issues. The
availability of information is clearly influenced by the media used in the
capturing, storage and dissemination process. Since electronic media are
effective in raising the level of transparency in society, the government should
take advantage of these tools to build its relationship with citizens through
information sharing, and hence engage them in supporting planned public
development goals (Abu-Dhabi-Government, 2011; Lubua & Maharaj, 2012).
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 33
In the Republic of Tanzania, the use of ICT tools for communication and
information sharing increases on a daily basis; internet users increased by
450% between 2001 and 2010. Additionally, about 50% of the population of
Tanzania is reported to use either the internet or a mobile phone (Kasumuni,
2012). Given this increase, understanding the extent to which information from
government institutions is shared enables the government to know how
effectively these media are utilised to promote national development.
6. AN INFORMATION SHARING MODEL
This paper summarises information sharing using the model presented in Figure 1.
The abundance and availability of information mean that users need skills to
determine what it is that they want. The user of information therefore plays
the key role in effecting information sharing: the user must be able to use
relevant tools to search for information and to determine the relevance of
accessed data to his/her operations. The ability to use such tools is attained
through learning. Beyond knowing how to use search tools, the user must be
aware of the problem that they need to solve.
Figure 1: Information Sharing Model
Source: Research Data (2012)
The choice of information is dictated by the gap which has to be covered; when
this gap is expressed, it becomes a need. In responding to the need, the user
of information consults a source, which is either electronic or physical. It is
possible that the source may not have the type of information requested, or
that the information may not be satisfying. Regardless of the level of
satisfaction, the user of information takes action towards covering the gap.
Where the public seeks information from government institutions,
dissatisfaction may influence
members of the public to take action, even against the government; on the other
hand, satisfaction encourages more support for the government (Lubua, 2014).
The satisfied user applies the information to solve the problem identified in
the gap. A good example is a farmer searching for a good market for his/her
harvest; s/he will eventually use the information to choose a better market.
Similarly, the recent Arab uprisings represent a possible negative response by
users of information in cases of low satisfaction (Van Niekerk, Pillay &
Maharaj, 2011). The government should therefore respond adequately to inquiries
from citizens to reduce the possibility of a negative response. It must ensure
the adequate availability of information that addresses citizens’ daily
challenges.
7. INFORMATION SHARING USING E-TRANSPARENCY TOOLS IN
PUBLIC INSTITUTIONS
The introduction of ICT tools brings more opportunities for information sharing
in organisations by allowing users to receive and send information more easily
(Kilama, 2013). Stakeholders are also able to discuss issues of different
interests through tools such as social networks, chat rooms, e-mail systems and
video/teleconferencing, and organisations are able to solicit stakeholders’
opinions before making decisions (Im & Jung, 2001; Lubua & Maharaj, 2014).
Alongside the progress made in information sharing, there is a need to know the
extent to which government institutions apply online media for information
sharing. This study is based on the opinions of clients who consume online
information from a government institution.
Based on responses from clients of the Tanzania Revenue Authority, it was found
that 70% of respondents agree that the authority sufficiently shares its
information through online media. These respondents are clients of Custom
services who benefit from the Custom Online System (CULAS). The following
factors influenced the successful deployment of this system:
a.) Good ICT infrastructure
The ICT infrastructure of the Tanzania Revenue Authority is well established;
it is characterised by a good interface, reliable data backup systems, power
backups and a reliable internet connection. In addition, the revenue authority
is among the organisations benefiting from the high-capacity internet
connectivity of the National ICT Backbone (NICTBB). Nevertheless, the study
observed that not all respondents had access to the infrastructure of the
revenue authority; some lacked computers to access such systems. A computer
room for clients would be an important extension of the services offered by the
revenue authority in its Custom section, and would equally facilitate users who
are not based in Dar es Salaam but visit for Custom services.
b.) Technical Skills and Competency
The infrastructure of an information system requires competent staff to
maintain and operate its functions (Badillo-Amador, García-Sánchez, & Vila,
2005; Cohen, 2012). In many cases, the revenue authority uses its own staff to
run its operations; where advanced knowledge is required, the institution
partners with non-governmental organisations for technical services. To a large
extent, the revenue authority uses training to equip its employees.
Nevertheless, the study noted cases where training was not as effective as
expected. An analysis of the degree of association between training and the
skills possessed by staff, using the Pearson correlation model, observed an
insignificant association (r = 0.101, p = 0.316), except where a follow-up
programme was instituted (r = 0.292, p = 0.003). It is therefore necessary to
incorporate follow-up programmes after training for enhanced competency.
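As an illustration of the statistic used here, Pearson's r can be computed directly from paired observations. The sketch below uses invented data for illustration only (not the study's data):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient: r = cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical follow-up effort after training vs. an assessed skill
# score for ten staff members (illustrative numbers only).
follow_up_hours = [0, 1, 2, 2, 3, 4, 5, 6, 7, 8]
skill_score = [52, 50, 55, 58, 60, 63, 62, 68, 70, 74]
print(round(pearson_r(follow_up_hours, skill_score), 3))
```

A value close to 1 would indicate a strong positive association; the study's reported r = 0.292 is a much weaker association, and its significance is judged by the accompanying p-value.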
c.) Institutional Will
Installing a good ICT infrastructure has to be complemented by the willingness
of staff to use the new system exclusively for service provision. The
management of the Tanzania Revenue Authority Custom Department has dedicated
its online system as the only method for issuing services to clients. So far,
the experience of operational staff is reported to be outstanding; however, the
lack of important equipment such as computers for some employees, and
occasional system breakdowns, affect its use.
d.) Customer Satisfaction
Changes have to be managed carefully in order to avoid frustrating clients.
Together with implementing new changes for service provision, the Tanzania
Revenue Authority Custom Department established a help desk that attends to
clients’ queries about different applications of the new system. Additionally,
documentation is provided that describes the steps to be taken in using the
system. This study found that 95% of respondents recommend or strongly
recommend the use of the Tanzania Revenue Authority Custom Online System for
securing services from the institution; these results show that users’
satisfaction with the online system is high.
8. INFORMATION USEFULNESS, UNDERSTANDABILITY,
RELIABILITY AND ACCESSIBILITY AND THE EXTENT OF
INFORMATION SHARING
As shown in the previous section, respondents from the Tanzania Revenue
Authority have confidence in the extent to which the government institution
shares information with stakeholders through online media. While this extent is
influenced by a number of factors, this study is interested in the following:
information accessibility, information usefulness, information reliability and
information understandability. This part of the study identifies how information sharing is
influenced by these factors; a linear regression model was used to establish
the relationship between these variables, as shown in Table 1 below.
Table 1: Model Summary
Regression Model | R | R Square
1 | 0.724a | 0.524
a. Predictors: (Constant), Government online information is reliable,
Government online information is useful, the use of the Internet has enhanced
access to information, and Government online information is easily understood
According to data reported by clients of the Tanzania Revenue Authority, the
value of the coefficient of relatedness (R) is 0.724; this value suggests the
presence of correlation between the variables. At the Tanzania Revenue
Authority, information usefulness, understandability, reliability and
accessibility are important attributes of the information provided to users,
because the online system is the only means for users to access Custom
services. The appreciation of these variables influences the extent of
information sharing among stakeholders. Below is a brief explanation of how
these variables are enhanced at the Tanzania Revenue Authority.
a.) Information Accessibility
The Tanzania Revenue Authority’s Custom Online System provides users¹ with
credentials which provide access to the system. Within the system, users are able
to trace every stage of their application. Moreover, to ensure that the system
is constantly accessible to clients, the link to the online system is published
on the website and supported by servers which run constantly with the support
of information and power backups. Although accessibility is better than in
other public institutions, users reported cases where they failed to launch
their service applications due to extended system downtime.
b.) Data Reliability
The online system of the Tanzania Revenue Authority ensures reliability by
dedicating a few officials who are experts in Custom services to manage
clients’ queries and applications in the system. Furthermore, employees of the
revenue authority verify the information sent by clients before they effect a
transaction, to ensure the reliability of the information involved. This
ensures that only information which is both relevant and correct is provided to
consumers through the online media. Moreover, to ensure that the information
from users of the online system is reliable, the system guides users through
the different stages involved in an application for services. The system also dictates the format
¹ Who are clearing and forwarding experts.
of the information to be entered, to ensure consistency; further, it grants
users the opportunity to proofread their data entry before the information is
finally submitted.
c.) Information Usefulness
The Custom Online System is dedicated to the Customs Department only, and is
tailored to meet the needs of clearing and forwarding agents by simplifying
their tax-paying processes. The authority receives feedback from clients on
different aspects of the system, including its usefulness for its intended use.
Although many respondents agree that the information they receive is useful,
the study noted that a number of users were not comfortable with the use of the
English language for communication. Swahili is Tanzania’s national language;
its adequate use would improve users’ ability to understand information in
context, and hence improve usefulness.
d.) Information Understandability
The issue of understanding information provided through online systems is
critical; the diverse nature of the Tanzania Revenue Authority’s users implies
differences in analytical and language skills. While Tanzania uses Kiswahili as
its national language, English is used for academic and business operations.
Due to differences in education and analytical skills, some clients of the
Tanzania Revenue Authority need language assistance before they can understand
the content of information. Recognising this challenge, the revenue authority
maintains a dedicated helpdesk to clarify issues which users find difficult to
understand.
9. CONCLUSION
The purpose of the study was to establish the degree to which the Tanzanian
public sector uses ICTs to enhance transparency. The assessment was guided by
the fact that Tanzania advocates good governance, of which information sharing
is an important component. The study also recognises that ICTs play an
important role in the business sector by ensuring that clients access services
efficiently and with maximum transparency; the same experience could be adopted
by the government to raise citizens’ satisfaction with government services. The
study observed that many people are aware of the importance of ICTs in ensuring
transparency in government operations. However, there were several cases where
performance did not meet users’ expectations. Factors such as low system
reliability and the ineffectiveness of officials operating the system were
among those which affected the use of ICTs for enhanced transparent services.
While training was identified as important in equipping users with the required
technical skills, it was occasionally observed to be the opposite: training
requires follow-up to ensure that it meets expected goals. Equally, information
accessibility, reliability, usefulness and understandability have a great
impact on users’ experience of online media.
REFERENCES
[1] Badillo-Amador, L., García-Sánchez, A., & Vila, L. E. (2005). Mismatches In The Spanish
Labor Market: Education Vs. Competence Match. International Advances in Economic
Research, Vol 11, 93-109.
[2] Barua, A., Ravindran, S., & Whinston, A. (2007). Enabling Information Sharing Within
Organizations. Information Technology and Management, Vol. 3, 31-45.
[3] Cohen, J. (2012). Benefits Of On Job Training. Retrieved February 7, 2013, from
http://jobs.lovetoknow.com
[4] Drake, D., Steckler, N., & Koch, M. (2004). Information Sharing In And Across Government
Agencies: The Role And Influence Of Scientist, Politician, And Bureaucratic Subcultures.
Social Science Computer Review, 22(1), 67–84.
[5] HAKIELIMU, LHRC, REPOA. (2005). Access To Information In Tanzania: Is Still A
Challenge. Retrieved September 11, 2012, from
http://www.tanzaniagateway.org/docs/Tanzania_Information_Access_Challenge.pdf
[6] Hatala, J.-P., & Lutta, J. (2009). Managing Information Sharing Within an Organisation
Settings: A social Network Perspective. Retrieved September 13, 2012, from
http://www.performancexpress.org/wp-content/uploads/2011/11/Managing-Information-
Sharing.pdf
[7] Im, B., & Jung, J. (2001). Using ICT For Strengthening Government Transparency. Retrieved
May 10, 2011, from http://www.oecd.org/dataoecd/53/55/2537402.pdf
[8] Kilama, J. (2013). Impacts Of Social Networks In Citizen Involvements To Politics . Dar es
Salaam: Mzumbe University.
[9] Mkapa, B. (2003). Improving Public Communication Of The Government Policies And
Enhancing Media Relations. Bagamoyo.
[10]Navarra, D. D. (2006). Governance Architecture Of Global ICT Programme: The Case Of
Jordan. London: London School of Economics and Political Science.
[11]United Republic of Tanzania. (1995). The Constitution of United Republic of Tanzania. Dar
Es Salaam, Tanzania: Government Printer.
[12]Van Niekerk, B., Pillay, K., & Maharaj, M. (2011). Analyzing the Role of ICTs in the
Tunisian and Egyptian Unrest. International Journal of Communication, 5, 1406–1416.
This paper may be cited as:
Lubua, E. 2014. E-Transparency and Information Sharing in the Public Sector.
International Journal of Computer Science and Business Informatics, Vol. 14,
No. 1, pp. 30 -38.
A Survey of Frequent Subgraphs
and Subtree Mining Methods
Hamed Dinari and Hassan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
A graph is a basic data structure which can be used to model complex structures and the
relationships between them, such as XML documents, social networks, communication
networks, chemical informatics, biological networks, and the structure of web pages.
Frequent subgraph pattern mining is one of the most important fields in graph mining. In
light of its many applications, there has been extensive research in this area, in domains
such as analysis and processing of XML documents, document clustering and classification,
image and video indexing, graph indexing for graph querying, routing in computer networks,
web link analysis, drug design, and carcinogenesis. Several frequent pattern mining
algorithms have been proposed in recent years, and new ones are introduced regularly.
Because these algorithms use various methods on different datasets, pattern mining types,
and graph and tree representations, it is not easy to compare them in terms of features
and performance. This paper presents a brief report of an intensive investigation of
current frequent subgraph and subtree mining algorithms. The algorithms are also
categorised based on different features.
Keywords
Graph Mining, Subgraph, Frequent Pattern, Graph indexing.
1. INTRODUCTION
Today we are faced with ever-increasing volumes of data, much of which
naturally has a graph or tree structure. The process of extracting new and
useful knowledge from graph data is known as graph mining [1] [2]. Frequent
subgraph pattern mining [3] is an important part of graph mining. It is defined
as the process of extracting, from a database, patterns whose frequency is
greater than or equal to a user-defined threshold. Due to its wide utilisation
in various fields, including social network analysis [4] [5] [6], XML document
clustering and classification [7] [8], network intrusion detection [9] [10],
VLSI reverse engineering [11], behavioural modelling [12], the semantic web
[13], graph indexing [14] [15] [16] [17] [18], web log analysis [19], link
analysis [20], drug design [21] [22] [23], and classification of chemical
compounds [24] [25] [26], this field has been the subject of several works.
The present paper is an attempt to survey subtree and subgraph mining
algorithms; a comparison and classification of these algorithms according to
their different features is also made. The next section discusses the
literature review, followed by section three, which deals with the basic ideas
and concepts of graphs and trees. Frequent subgraph mining algorithms are
discussed in section four from different viewpoints, such as criteria for
representing graphs (adjacency matrix and adjacency list), generation of
subgraphs, number of replications, pattern growth-based and apriori-based
classifications, classification based on search method, classification based on
transactional and single-graph inputs, classification based on type of output,
and logic-based mining. Section five focuses on frequent subtree mining
algorithms from different angles, such as tree representation methods, type of
algorithm input, tree-based mining, and mining based on constraints on outputs.
2. RELATED WORKS
H. J. Patel, R. Prajapati, et al. [27] classified graph mining algorithms into
two types, apriori-based and pattern growth-based. K. Lakshmi and T. Meyyappan
[28] studied apriori-based and pattern growth-based algorithms, taking into
account aspects such as input/output type, how a graph is represented, how
candidates are generated, and how many times a candidate is repeated in the
graph dataset. In [29], D. Kavitha, B. V. Manikyala, et al. suggested a third
type of graph mining algorithm, inductive logic programming. A complete survey
of graph mining concepts and a very useful set of examples to ease
understanding of the concepts come next.
3. BASIC CONCEPTS
3.1 Graph
A graph G(V, E) is composed of a set of vertices (V) connected to each other by
a set of edges (E).
3.2 Tree
A tree T is a connected graph that has no cycle; in other words, there is one
and only one path between any two vertices.
3.3 Subgraph
A graph G′(V′, E′) is a subgraph of G(V, E) if its vertices and edges are
subsets of V and E respectively:
 V′ ⊆ V
 E′ ⊆ E
One may say that a subgraph of a graph is a pattern of that graph. Concerning
trees, two types of patterns can be defined:
3.3.1 Induced pattern
The definition is exactly the same as the definition of a subtree of a tree
(Figure 1.a, Figure 1.c): the vertices and edges of Figure 1.a can also be seen
in Figure 1.c.
3.3.2 Embedded pattern
Almost the same as an induced pattern, except that there may be one or more
intermediate vertices between the parent and child nodes of the pattern. For
example, vertex A in Figure 1.c is an ancestor of vertex D, and Figure 1.b
shows an embedded pattern of Figure 1.c.
Figure.1. An example of the Induced and embedded subtree pattern
3.3.3 Isomorphism
Two graphs are isomorphic if there is a one-to-one correspondence between their
vertices and edges that preserves adjacency.
3.3.4 Frequent Subgraph
Suppose a graph G and a set of graphs D = {g1, g2, g3, …, gn} are given; the
support of G is:

Support(G) = |{gi ∈ D : G is a subgraph of gi}| / |D|

A graph G in a dataset D is called frequent if its support is not less than a
predefined threshold.
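Under this definition, support is computed by counting the database graphs that contain the pattern. The sketch below uses a naive brute-force subgraph test on toy undirected graphs; the graph encoding and all names are illustrative assumptions, not a practical miner:

```python
from itertools import permutations

def contains(graph, pattern):
    """Naive non-induced subgraph test: brute force over injective node
    mappings. Exponential, but fine for the toy graphs used here."""
    (gn, ge), (pn, pe) = graph, pattern
    gn, pn = list(gn), list(pn)
    norm = {frozenset(e) for e in ge}          # undirected edge set
    for image in permutations(gn, len(pn)):
        m = dict(zip(pn, image))
        if all(frozenset((m[u], m[v])) in norm for u, v in pe):
            return True
    return False

def support(pattern, database):
    """Fraction of graphs in the transactional database containing the pattern."""
    return sum(1 for g in database if contains(g, pattern)) / len(database)

# Transactional database D = {g1, g2, g3}: a triangle, a 4-path and a square.
g1 = ({0, 1, 2}, {(0, 1), (1, 2), (0, 2)})
g2 = ({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3)})
g3 = ({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3), (3, 0)})
p3 = (["a", "b", "c"], {("a", "b"), ("b", "c")})  # path on three vertices

print(support(p3, [g1, g2, g3]))  # 1.0: every graph contains a 3-path
```

With a threshold of, say, 0.5, the 3-path pattern would be frequent in this database.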
4. AN OVERVIEW OF FREQUENT SUBGRAPH MINING
ALGORITHM ACCORDING TO DIFFERENT CRITERIA
This section discusses different criteria for classifying frequent graph mining
algorithms, including: graph representation, input type, constraint-based
mining, inductive logic programming, search strategy, and completeness of
output.
4.1 Graph Representation
4.1.1 Adjacency Matrix
A graph can be represented as an adjacency matrix; in this case the rows and
columns represent the vertices of the graph and the entries represent its edges
(i.e. when there is an edge between two vertices, the entry at the junction of
the corresponding row and column is filled with “1”, and otherwise with “0”).
Furthermore, the nodes are represented on the main diagonal of the matrix
(Figure 2). To represent the graph as a string, a combination of nodes and
edges in a particular order can be used; since every permutation of the nodes
may generate a different string, the maximum or minimum canonical adjacency
matrix (CAM) must be taken into account. An advantage of this is that two
isomorphic graphs will have the same maximum/minimum CAM.
Figure.2. Left side a graph and right side corresponding adjacency matrix
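The idea of a canonical code can be sketched by brute force for tiny graphs: take the minimum adjacency-matrix string over all vertex orderings. This is an illustrative sketch of the principle, not a practical CAM implementation (real miners use far cheaper canonicalisation):

```python
from itertools import permutations

def canonical_code(nodes, edges):
    """Minimum adjacency-matrix string over all vertex orderings.
    Two isomorphic graphs yield the same code; O(n!) so toy-sized only."""
    nodes = list(nodes)
    n = len(nodes)
    best = None
    for order in permutations(nodes):
        idx = {v: i for i, v in enumerate(order)}
        m = [["0"] * n for _ in range(n)]
        for u, v in edges:
            m[idx[u]][idx[v]] = m[idx[v]][idx[u]] = "1"  # undirected
        code = "".join("".join(row) for row in m)
        if best is None or code < best:
            best = code
    return best

# Two different drawings of the same 3-path get identical canonical codes.
a = canonical_code([0, 1, 2], {(0, 1), (1, 2)})
b = canonical_code(["x", "y", "z"], {("y", "x"), ("y", "z")})
print(a == b)  # True
```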
4.1.2 Adjacency List
Another way to represent a graph is an adjacency list. When the graph is
sparse, many zeros appear in the adjacency matrix, which is a great waste of
memory; to avoid this, an adjacency list is an answer, as it assigns memory
dynamically.
4.2 Subgraph Generation
Two subgraphs can be merged to generate a candidate subgraph, the result being
a new subgraph. However, given that many duplicate subgraphs might be generated
in the merging process, the way candidate subgraphs are generated is critical.
Among the available methods are extension and rightmost expansion. In the
latter case, subgraphs are expanded in one direction only and no duplicate
candidates are generated.
4.3 Frequency Counting
To check whether a generated candidate is frequent, the frequency of each must
be determined and compared with the support threshold. Data structures used to
count the frequency of each candidate include the embedding list and the TSP
tree.
5. A SURVEY OF FREQUENT SUBGRAPH MINING
ALGORITHMS
5.1 Classification Based on Algorithmic Approach
5.1.1 Apriori-Based (Breadth First Search)
This category of algorithms uses a generate-and-test method with breadth-first
search to find subgraphs in the database. Before the candidates of size k+1 are
generated, all frequent subgraphs of size k must be found; each candidate of
size k+1 is then obtained by joining two frequent subgraphs of size k. However,
this method considers all candidate subgraphs that can be generated, and their
maintenance and processing require plenty of time and memory, which hurts
performance [30] [2].
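The level-wise generate-and-test structure can be sketched on frequent itemsets, used here as a simple stand-in for subgraphs (for sets, the join and containment tests are trivial, whereas for graphs they require canonical forms and isomorphism checks):

```python
def apriori(transactions, min_count):
    """Level-wise (BFS) skeleton: size-(k+1) candidates are generated by
    joining frequent size-k patterns, then support-tested against the data."""
    def count(pattern):
        return sum(1 for t in transactions if pattern <= t)

    items = {frozenset([x]) for t in transactions for x in t}
    frequent = {p for p in items if count(p) >= min_count}
    result = set(frequent)
    k = 1
    while frequent:
        # join step: combine frequent size-k patterns into size-(k+1) candidates
        candidates = {p | q for p in frequent for q in frequent
                      if len(p | q) == k + 1}
        # test step: keep only candidates meeting the support threshold
        frequent = {c for c in candidates if count(c) >= min_count}
        result |= frequent
        k += 1
    return result

db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(sorted("".join(sorted(p)) for p in apriori(db, 2)))
# ['a', 'ab', 'b', 'c', 'd']
```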
5.1.2 Pattern Growth-Based
In pattern growth-based methods a candidate subgraph of size k+1 is obtained by
extending a frequent pattern of size k. Since extending a frequent subgraph of
size k may generate several candidates of size k+1, the way a frequent subgraph
is expanded is critical to reducing the generation of duplicate subgraphs.
Table 1 lists apriori-based and pattern growth-based algorithms [2].
Table 1. Frequent Subgraph Mining Algorithms
Apriori-based: FARMER [31], FSG [3], HSIGRAM, GREW [32], FFSM [4], ISG,
SPIN [33], Dynamic GREW [34], AGM [35], MUSE [36], SUBDUE [37], AcGM [38],
DPMine, gFSG [39], MARGIN [40]
Pattern growth-based: gSpan [41], CloseGraph [42], Gaston [43], TSP [44],
MoFa [45], RP-FP [46], RP-GD [46], JPMiner [47], MSPAN, VSIGRAM [48], FPF [49],
Gapprox [50], HybridGMiner, FCPMiner [51], RING [52], SCMiner [53],
GraphSig [54], FP-GraphMiner [55], gPrune [56], CLOSECUT [57], FSMA [58]
5.2 Classification Based on Search Strategy
There are two search strategies for finding frequent subgraphs: breadth-first
search (BFS) and depth-first search (DFS).
5.3 Classification Based on Nature of the Input
Depending on the input type, the algorithms can be divided into two categories:
5.3.1 Single Graph Database
The database consists of a single large graph.
5.3.2 Transactional Graph Database
The database consists of a large number of small graphs. Figure 3 (left side:
g1, g2, g3) demonstrates a transactional graph database, together with two
frequent subgraphs and their frequencies (right side).
Figure.3. A database consisting of three graphs g1, g2, g3, and two subgraphs
with the frequency of each
5.4 Classification Based on Nature of the Output
5.4.1 Completeness of the Output
While some algorithms find all frequent patterns, others mine only part of
them. Completeness is closely related to performance: when the total size of
the dataset is very high, it is better to use algorithms that execute faster,
so that a reduction of performance is avoided, even though not all frequent
patterns are mined. Table 2 lists algorithms by completeness of output [29].
Table 2. Completeness of Output
Complete output: FARMER, gSpan, FFSM, Gaston, FSG, HSIGRAM
Incomplete output: SUBDUE, GREW, CloseGraph, ISG
5.4.2 Constraint-Based
As the size of a database increases, the number of frequent patterns also
increases. This makes maintenance and analysis more difficult, as more memory
space is needed. Reducing the number of frequent patterns without losing
information is achievable by mining and maintaining more comprehensive
patterns. Given that every subset of a frequent pattern also satisfies the
frequency condition, more comprehensive patterns can be obtained using the
following notions:
5.4.2.1 Maximal Pattern
Subgraph g1 is a maximal pattern if it is frequent and there is no frequent
super-pattern g2 such that g2 ⊃ g1.
5.4.2.2 Closed Pattern
Subgraph g1 is closed if it is frequent and there is no frequent super-pattern
g2 with g2 ⊃ g1 and support(g2) = support(g1). Table 3 lists maximal and closed
subgraph mining algorithms.
Table 3. Frequent Subgraph Mining (Constraint-based)
Maximal: SPIN, MARGIN, ISG, GREW
Closed: CloseGraph, CLOSECUT, TSP, RP-FP, RP-GD
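The maximal and closed definitions can be checked mechanically once all frequent patterns and their supports are known. The sketch below treats patterns as frozensets of edges; the data and names are illustrative only:

```python
def maximal_and_closed(supports):
    """Given {pattern (frozenset): support} for all frequent patterns,
    return (maximal, closed) sets per the definitions above."""
    pats = list(supports)
    # maximal: no frequent super-pattern at all
    maximal = {p for p in pats if not any(p < q for q in pats)}
    # closed: no frequent super-pattern with the same support
    closed = {p for p in pats
              if not any(p < q and supports[q] == supports[p] for q in pats)}
    return maximal, closed

# Toy frequent patterns; edges abbreviated as single letters.
supports = {
    frozenset("a"): 5,
    frozenset("ab"): 5,   # same support as {a}, so {a} is not closed
    frozenset("abc"): 3,
    frozenset("d"): 4,
}
mx, cl = maximal_and_closed(supports)
print(sorted("".join(sorted(p)) for p in mx))  # ['abc', 'd']
print(sorted("".join(sorted(p)) for p in cl))  # ['ab', 'abc', 'd']
```

Note that every maximal pattern is also closed, but not vice versa, which matches the output above.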
5.5 Logic-Based Mining
Also known as inductive logic programming, which also an area of machine
learning, mainly in biology. This method uses inductive logic to display
structured data. ILP core uses the logic to display for search and the basic
assumptions of that structured way (e.g. WARMR, FOIL, and C-PROGOL),
which is derived from background knowledge [29]. Table 4 lists the Pattern
Growth and Table 5 indicates apriori-based algorithms categorized from
different aspects [59] [27] [60] [61] [62] [28] [30] [63].
Table 4. Frequent Subgraph Mining Algorithms (Pattern Growth-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
gSpan | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
CloseGraph | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
Gaston | Set of graphs | Hash Table | Extension | DFS
TSP | Set of graphs | Adjacency Matrix | Extension | TSP Tree
MoFa | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
RP-FP | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
RP-GD | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
JPMiner | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
MSPAN | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
FP-GraphMiner | Set of graphs | BitCode | Extension | DFS
gPrune | Set of graphs | Adjacency Matrix | Iteration | M-DFSC
FSMA | Set of graphs | Incidence Matrix | Extension | Normalized Matrix
RING | Set of graphs | Invariant Vector | Extension | R-tree, DFS
GraphSig | Set of graphs | Feature Vector | Merge and Extension | DFS
Table 5. Frequent Subgraph Mining Algorithms (Apriori-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
SUBDUE | Single large graph | Adjacency Matrix | Level-wise Search | MDFS
FARMER | Set of graphs | Trie structure | Level-wise Search, ILP | Trie data structure
FSG | Set of graphs | Adjacency List | One Edge Extension | TID list
HSIGRAM | Single large graph | Adjacency Matrix | Iterative Merging | Maximal independent set
GREW | Single large graph | Sparse graph | Iterative Merging | Maximal independent set
FFSM | Set of graphs | Adjacency Matrix | Merging and Extension | Suboptimal CAM tree
ISG | Set of graphs | Edge Triple | Edge Triple Extension | TID list
SPIN | Set of graphs | Adjacency Matrix | Join Operation | Canonical Spanning Tree
Dynamic GREW | Set of graphs | Sparse graph | Iterative Merging | Suffix trees
AGM | Set of graphs | Adjacency Matrix | Vertex Extension | Canonical Labeling
MUSE | Set of graphs | Search Tree | Disjunctive Normal Form | DFS coding
MARGIN | Set of graphs | Lattice | Join | CAM
AcGM | Set of graphs | Adjacency Matrix | Join | CAM
gFSG | Set of graphs | Adjacency Matrix | Iterative Merging | Hashtree
Here several algorithms related to graph/tree mining are discussed in more
detail.
 Gp-Growth Algorithm
The algorithm consists of three main steps:
1. Candidate generation by join operation.
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 47
2. Using a new method of tree representation and a lookup table that allows
quick access to node information in the candidate generation phase without
having to re-read the trees in the database.
3. Using rightmost expansion for candidate generation, which guarantees that
no duplicate candidates are generated.
This algorithm uses a lookup table, implemented as a hash table, to store
information about the input trees. Each entry has a key part, the pair (T, pos),
where T identifies an input tree and pos is the node's number in a preorder
traversal, and a value part, the pair (l, s), where l is the node's label and s is
its scope. A new candidate is generated using the scope of each node: a node
is attached to another along the rightmost path, and only within the scope of
the node it is attached to; repeating this process yields the remaining frequent
patterns [64].
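A rough sketch of such a lookup table follows; the nested-tuple tree encoding and function names are hypothetical, not taken from [64]. A node's scope is taken here as the preorder number of the last node in its subtree.

```python
# Sketch of a GP-Growth-style lookup table:
# (tree_id, preorder_pos) -> (label, scope).

def build_lookup(trees):
    """trees: dict of tree_id -> (label, [children]) nested tuples."""
    table = {}

    def visit(tree_id, node, counter):
        label, children = node
        pos = counter[0]
        counter[0] += 1
        for child in children:
            visit(tree_id, child, counter)
        scope = counter[0] - 1  # last preorder position inside this subtree
        table[(tree_id, pos)] = (label, scope)

    for tid, root in trees.items():
        visit(tid, root, [0])
    return table

# toy database with one tree:  a(b, c(d))
lut = build_lookup({0: ("a", [("b", []), ("c", [("d", [])])])})
# root 'a' sits at preorder position 0 and has scope 3 (its subtree ends at node 3)
```

Storing (label, scope) per (tree, position) key is what lets candidate generation check rightmost-path extensions without rescanning the database trees.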
 FP-GraphMiner Algorithm
This algorithm uses the FP-growth method to find frequent subgraphs. Its input
is a set of graphs (a transactional database). First, a BitCode is defined for
each edge: for each graph in the database, the corresponding bit is '1' if the
edge occurs in that graph and '0' otherwise. A frequency table is then sorted
in ascending order of the BitCode assigned to each edge; afterward, an FP-tree
is constructed and frequent subgraphs are obtained through depth traversal [55].
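The BitCode step can be sketched as follows (a simplified illustration, not the implementation from [55]; graphs are reduced to bare edge sets):

```python
# Sketch of FP-GraphMiner's BitCode step: each edge gets one bit per
# input graph, '1' if the edge occurs in that graph and '0' otherwise.

def edge_bitcodes(graphs):
    """graphs: list of edge sets; returns a mapping edge -> bit string."""
    all_edges = set().union(*graphs)
    return {edge: "".join("1" if edge in g else "0" for g in graphs)
            for edge in all_edges}

g1 = {("a", "b"), ("b", "c")}
g2 = {("a", "b")}
codes = edge_bitcodes([g1, g2])
# ('a','b') occurs in both graphs -> '11'; ('b','c') only in the first -> '10'
```

The frequency table is then ordered by these bit strings before the FP-tree is built.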
6. FREQUENT SUBTREE MINING ALGORITHMS CLASSIFICATION
6.1 Tree Representation
A tree can be encoded as a sequence of nodes and edges. Some of the most
important ways of encoding trees are introduced below:
6.1.1 DLS (Depth Label Sequence)
Let T be a labeled ordered tree with node set V. For each node vi in V, the
depth-label pair (d(vi), l(vi)) is appended to a string s during a DFS
traversal of T. The depth-label sequence of T is then
{(d(v1), l(v1)), ..., (d(vk), l(vk))}. For instance, the DLS for the tree in
Figure 4 is:
{(0,a),(1,b),(2,e),(3,a),(1,c),(2,f),(3,b),(3,d),(2,a),(1,d),(2,f),(3,c)}
6.1.2 DFS-LS (Depth First Sequence - Label Sequence)
Given a labeled ordered tree T, labels are appended to a string s during a
DFS traversal of T, and on each backtrack a marker such as '-1', '$' or '/'
is appended to the string. The DFS-LS code for the tree T in Figure 4 is:
{abea$$$cfb$d$$a$$dfc$$$}
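The two encodings above can be sketched for a small labeled ordered tree as follows (the node representation is hypothetical; '$' is used as the backtrack marker, and this sketch also emits it when backtracking out of the root, a detail on which encodings vary):

```python
# Sketch of the DLS and DFS-LS encodings; node = (label, [children]).

def dls(node, depth=0):
    """Depth-label sequence: (depth, label) pairs in DFS order."""
    seq = [(depth, node[0])]
    for child in node[1]:
        seq.extend(dls(child, depth + 1))
    return seq

def dfs_ls(node):
    """Label sequence with '$' appended on each backtrack."""
    return node[0] + "".join(dfs_ls(c) for c in node[1]) + "$"

t = ("a", [("b", []), ("c", [("d", [])])])
dls(t)     # [(0, 'a'), (1, 'b'), (1, 'c'), (2, 'd')]
dfs_ls(t)  # 'ab$cd$$$'
```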
6.1.3 BFCS (Breadth First Canonical String)
Let T be an unordered tree. Several encoded strings can be generated using
the BFS method by changing the order of the children of each node; the BFCS
of T is the lexicographically smallest of these encoded strings. The BFCS of
the tree T in Figure 4 is:
{a$bcd$e$fa$f$a$bd$$c#}
6.1.4 CPS (Consolidated Prufer Sequence)
Let T be a labeled tree. The CPS encoding consists of two parts: NPS, an
extended Prufer sequence obtained from a traversal over unique vertex
numbers, and LS (Label Sequence), the sequence of labels obtained in a
postfix traversal as the leaves are removed. Together, NPS and LS give a
unique encoding of a labeled tree. For the tree in Figure 4, the NPS and LS
obtained are {ebaffccafda-} and {aebbdfaccfda}, respectively. To obtain the
NPS, a leaf is removed from the tree at each step and the label of its parent
is taken as output; this is repeated until only the root remains, and '-' is
appended to mark the end of the string. For the LS, the labels of the same
postfix traversal of the tree are taken as the sequence. Table 6 groups
algorithms by the tree encoding they use [65].
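The NPS construction can be sketched as repeated leaf removal. The vertex numbering and removal order below (ascending vertex number) are assumptions for illustration, since [65] defines its own numbering scheme:

```python
# Sketch of an NPS-style sequence: repeatedly remove a leaf (here, the
# lowest-numbered one) and record its parent's label; '-' marks the root.

def nps_labels(parent, labels, root):
    children = {}
    for v, p in parent.items():
        children.setdefault(p, set()).add(v)
    alive = set(parent) | {root}
    out = []
    while alive != {root}:
        leaf = min(v for v in alive if v != root and not children.get(v))
        out.append(labels[parent[leaf]])
        children[parent[leaf]].discard(leaf)
        alive.discard(leaf)
    return "".join(out) + "-"

# tiny tree: node 1 'a' (root) with children 2 'b' and 3 'c'; 3 has child 4 'd'
labels = {1: "a", 2: "b", 3: "c", 4: "d"}
parent = {2: 1, 3: 1, 4: 3}
nps_labels(parent, labels, 1)  # 'aca-'
```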
Figure.4. A Tree Example
Table 6. Frequent Subtree Mining Algorithms (Tree Representation)

Algorithm                 Tree Representation
uFreqt                    DLS
SLEUTH                    DFS-LS
Unot                      DLS
PathJoin                  FST-Forest
RootedTreeMiner [66]      BFCS
FREQT                     DLS
TreeMiner                 DFS-LS
Chopper                   DLS
XSPanner                  DLS
AMIOT                     DFS string
IMB3-Miner                DFS-LS
TRIPS                     CPS
FreeTreeMiner             BFCS
CMTreeMiner               DFS-LS
HybridTreeMiner [67]      BFCS
GP-Growth                 DFS-LS
6.2 Input Types
6.2.1 Rooted Ordered Trees
A rooted ordered tree is a tree in which a single node is designated as the
"root" and an order is defined among the children of each node, so that each
child is greater than or equal to the siblings placed at its left-hand side
and less than or equal to the ones placed at its right-hand side. If we relax
this definition so that no order among siblings is required, we obtain a
rooted unordered tree. Table 7 lists rooted ordered tree mining algorithms.
Table 7. Rooted Ordered Tree Mining Algorithms

Induced: FREQT [68], AMIOT [69], IMB3-Miner [70], TRIPS [65], TIDES [65]
Embedded: TreeMiner [71], Chopper [72], XSPanner [72], IMB3-Miner [70]
6.2.2 Rooted Unordered Trees
In this type of tree, one node is designated as the root, but there is no
particular order among the descendants of each node. Table 8 lists rooted
unordered tree mining algorithms.
Table 8. Rooted Unordered Tree Mining Algorithms

Induced: uFreqT [73], Unot [74], PathJoin [65], Rooted TreeMiner [75]
Embedded: TreeFinder [76], Cousin Pair [77], SLEUTH [78]
6.3 Tree-Based Data Mining
Frequent subtree mining algorithms can be categorized into two major
categories, apriori-based and pattern growth-based. Table 9 lists the apriori
and pattern growth algorithms for trees [79] [76] [80].
Table 9. Frequent Subtree Mining Algorithms

Apriori-based: TreeFinder, AMIOT, FreeTreeMiner, TreeMine [81], SLEUTH,
CMTreeMiner [82], Pattern Matcher [71], W3Miner [83], FTMiner [84],
CFFTree [85], IMB3-Miner, uFreqt, Unot, FREQT, TRIPS, TIDES, PathJoin
Pattern growth-based: XSPanner, Chopper, PrefixTreeESpan [86],
PCITMiner [87], F3TM [88], GP-Growth [64]

7. CONCLUSIONS AND FUTURE WORKS
Frequent subgraph mining algorithms were first examined from different
viewpoints: ways of representing a graph (e.g. adjacency matrix and
adjacency list), generation of subgraphs, frequency counting, classification
into apriori-based and pattern growth-based algorithms, search-based
classification, input-based classification (single graph vs. transactional
database), and output-based classification. Furthermore, mining based on
logic was discussed. Afterward, frequent subtree mining algorithms were
examined from different viewpoints: tree representation methods, input
types, tree-based mining, and mining based on constraints on the outputs.
Given the results, it is concluded that because pattern growth-based
algorithms do not generate candidate patterns, they involve less computation
and need less memory. Moreover, these algorithms are specifically designed
for trees and graphs and cannot be used for other purposes. On the other
hand, as they work on a variety of datasets, it is not easy to establish
trade-offs between them. In future studies, the same frequent patterns can
be used for similarity search, indexing, and classifying graphs and
documents. Parallel methods and technologies such as Hadoop may also be
needed when working with very large data volumes.
8. ACKNOWLEDGMENTS
The authors are thankful to Mohammad Reza Abbasifard for his support of
this investigation.
REFERENCES
[1] A.Rajaraman, J.D.Ullman, 2012. Mining of Massive Datasets, 2nd ed.
[2] J.Han, M.Kamber, 2006, Data Mining Concepts and Techniques. USA: Diane
Cerra.
[3] Kuramochi, Michihiro, and G.Karypis., 2004. An efficient algorithm for
discovering frequent subgraphs, in IEEE Transactions on Knowledge and
Data Engineering, pp. 1038-1051.
[4] J.Huan, W.Wang, J. Prins, 2003. Efficient Mining of Frequent Subgraphs in the
presence of isomorphism, in Third IEEE International Conference on Data
Mining (ICDM).
[5] (2013, Dec.) Trust Network Datasets - TrustLet. [Online].
http://www.trustlet.org
[6] L.YAN, J.WANG, 2011. Extracting regular behaviors from social media
networks, in Third International Conference on Multimedia Information
Networking and Security.
[7] Ivancsy,I. Renata, I.Vajk., 2009. Clustering XML documents using frequent
subtrees, Advances in Focused Retrieval, Vol. 3, pp. 436-445.
[8] J.Yuan, X.Li, L.Ma, 2008. An Improved XML Document Clustering Using
Path Features, in Fifth International Conference on Fuzzy Systems and
knowledge Discovery, Vol. 2.
[9] Lee, Wenke, and Salvatore J. Stolfo, 2000. A framework for constructing
features and models for intrusion detection systems, in ACM transactions on
Information and system security (TiSSEC), pp. 227-261.
[10] Ko, C, Logic induction of valid behavior specifications for intrusion detection
, 2000. in In IEEE Symposium on Security and Privacy (S&P), pp. 142–155.
[11] Yoshida, K. and Motoda, 1995. CLIP: Concept learning from inference
patterns, in Artificial Intelligence, pp. 63–92.
[12] Wasserman, S., Faust, K., and Iacobucci. D, 1994. Social network analysis :
Methods and applications. Cambridge university Press.
[13] Berendt, B., Hotho, A., and Stumme, G., 2002. semantic web mining, in In
Conference International Semantic Web (ISWC), pp. 264–278.
[14] S.C.Manekar, M.Narnaware, May 2013. Indexing Frequent Subgraphs in
Large graph Database using Parallelization, International Journal of Science
and Research (IJSR), Vol. 2 , No. 5.
[15] Peng, Tao, et al., 2010. A Graph Indexing Approach for Content-Based
Recommendation System, in IEEE Second International Conference on
Multimedia and Information Technology (MMIT), pp. 93-97.
[16] S.Sakr, E.Pardede, 2011. Graph Data Management: Techniques and
Applications, in Published in the United States of America by Information
Science Reference.
[17] Y.Xiaogang, T.Ye, P.Tao, C.Canfeng, M.Jian, 2010. Semantic-Based Graph
Index for Mobile Photo Search, in Second International Workshop on
Education Technology and Computer Science, pp. 193-197.
[18] Yildirim, Hilmi, and Mohammed Javeed Zaki., 2010. Graph indexing for
reachability queries, in 26th International Conference on Data Engineering
Workshops (ICDEW)IEEE, pp. 321-324.
[19] R.Ivancsy and I.Vajk, 2006. Frequent Pattern Mining in Web Log Data, in
Acta Polytechnica Hungarica, pp. 77-90.
[20] G.XU, Y.zhang, L.li, 2010. Web mining and Social Networking. melbourn:
Springer.
[21] S.Ranu, A.K. Singh, 2010. Indexing and mining topological patterns for drug,
in ACM, Data mining and knowlodge discovery, Berlin, Germany.
[22] (2013, Dec.) Drug Information Portal. [Online]. http://druginfo.nlm.nih.gov
[23] (2013, Dec.) DrugBank. [Online]. http://www.drugbank.ca
[24] Dehaspe,Toivonen, and King, R.D., 1998. Finding frequent substructures in
chemical compounds, in In Proc. of the 4th ACM International Conference on
Knowledge Discovery and Data Mining, pp.30-36.
[25] Kramer, S., De Raedt, L., and Helma, C., 2001. Molecular feature mining in
HIV data, in In Proc. of the 7th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-01), pp. 136–143.
[26] Gonzalez, J., Holder, L.B. and Cook, 2001. Application of graph-based
concept learning to the predictive toxicology domain, in In Proc. of the
Predictive Toxicology Challenge Workshop.
[27] H.J.Patel, R.Prajapati, M.Panchal, M.Patel, Jan. 2013. A Survey of Graph
Pattern Mining Algorithm and Techniques, International Journal of
Application or Innovation in Engineering & Management (IJAIEM), Vol. 2,
No. 1.
[28] K.Lakshmi, T. Meyyappan, 2012. FREQUENT SUBGRAPH MINING
ALGORITHMS - A SURVEY AND FRAMEWORK FOR
CLASSIFICATION, computer science and information technology, pp. 189–
202.
[29] D.Kavitha, B.V.Manikyala Rao and V. Kishore Babu, 2011. A Survey on
Assorted Approaches to Graph Data Mining, in International Journal of
Computer Applications, pp. 43-46.
[30] C.C.Aggarwal,Wang, Haixun, 2010. Managing and Mining Graph Data.
Springer,.
[31] B.Wackersreuther, Bianca, et al. , 2010. Frequent subgraph discovery in
dynamic networks, in ACM, Proceedings of the Eighth Workshop on Mining
and Learning with Graphs, Washington DC USA, pp. 155-162.
[32] Kuramochi, Michihiro, and G.Karypis, 2004. Grew-a scalable frequent
subgraph discovery algorithm, in Fourth IEEE International Conference on
Data Mining (ICDM), pp. 439-442.
[33] Huan, Jun, SPIN: mining maximal frequent subgraphs from graph databases,
2004. in Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining.
[34] Borgwardt, Karsten M., H-P. Kriegel, and P.Wackersreuther, 2006. Pattern
mining in frequent dynamic subgraphs, in Sixth International Conference on
Data Mining (ICDM), pp. 818-822.
[35] Inokuchi, Akihiro, T.Washio, and H.Motoda, 2000. An apriori-based
algorithm for mining frequent substructures from graph data, in Principles of
Data Mining and Knowledge Discovery, pp. 13-23, Springer Berlin
Heidelberg.
[36] Zou, Zhaonian, et al, 2009. Frequent subgraph pattern mining on uncertain
graph data, in Proceedings of the 18th ACM conference on Information and
knowledge management, pp. 583-592.
[37] Ketkar, N.S, Lawrence B.Holder, and D.J.Cook, 2005. Subdue: compression-
based frequent pattern discovery in graph data, in ACM, Proceedings of the
1st international workshop on open source data mining: frequent pattern
mining implementations, pp. 71-76.
[38] A. Inokuchi, T. Washio, and H. Motoda, 2003. Complete mining of frequent
patterns from graphs: Mining graph data, in Machine Learning, pp. 321-354.
[39] Kuramochi, Michihiro, and G.Karypis, 2007. Discovering frequent geometric
subgraphs, in Information Systems, pp. 1101-1120.
[40] Thomas, Lini T, Satyanarayana R. Valluri, and K.Karlapalem, 2006. Margin:
Maximal frequent subgraph mining, in IEEE Sixth International Conference
on Data Mining (ICDM), pp. 1097-1101.
[41] Yan, Xifeng, and J.Han, 2002. gspan: Graph-based substructure pattern
mining, in Proceedings International Conference on Data Mining.IEEE, pp.
721-724.
[42] Yan, Xifeng, and Jiawei Han, 2003. CloseGraph: mining closed frequent
graph patterns, in Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, pp. 286-295.
[43] Nijssen, Siegfried, and J.N. Kok., 2005. The gaston tool for frequent subgraph
mining, in Electronic Notes in Theoretical Computer Science, pp. 77-87.
[44] Hsieh, Hsun-Ping, and Cheng-Te Li, 2010. Mining temporal subgraph patterns
in heterogeneous information networks, in IEEE Second International
Conference on Social Computing (SocialCom), pp. 282-287.
[45] Wörlein, Marc, et al, 2005. A quantitative comparison of the subgraph miners
MoFa, gSpan, FFSM, and Gaston, in Knowledge Discovery in Databases:
PKDD , Springer Berlin Heidelberg, pp. 392-403.
[46] S.J.Suryawanshi,S.M.Kamalapur, Mar 2013. Algorithms for Frequent
Subgraph Mining, International Journal of Advanced Research in Computer
and Communication Engineering, Vol. 2, No. 3.
[47] Liu, Yong, Jianzhong Li, and Hong Gao, 2009. JPMiner: mining frequent
jump patterns from graph databases, in IEEE, Sixth International Conference
on Fuzzy Systems and Knowledge Discovery, pp. 114-118.
[48] Reinhardt, Steve, and G.Karypis, 2007. A multi-level parallel implementation
of a program for finding frequent patterns in a large sparse graph, in IEEE
International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-
8.
[49] Schreiber, Falk, and H.Schwobbermeyer., 2005. Frequency concepts and
pattern detection for the analysis of motifs in networks, in Transactions on
computational systems biology III, pp. 89-104, Springer Berlin Heidelberg.
[50] Chent, Chen, et al., 2007. gapprox: Mining frequent approximate patterns
from a massive network, in Seventh IEEE International Conference on Data
Mining (ICDM), pp. 445-450.
[51] Ke, Yiping, J.Cheng, and Jeffrey Xu Yu, 2009. Efficient discovery of frequent
correlated subgraph pairs, in Ninth IEEE International Conference on Data
Mining (ICDM), pp. 239-248.
[52] Zhang, Shijie, J.Yang, and Shirong Li, 2009. Ring: An integrated method for
frequent representative subgraph mining, in Ninth IEEE International
Conference on Data Mining (ICDM), pp. 1082-1087.
[53] Fromont, Elisa, Céline Robardet, and A.Prado, 2009. Constraint-based
subspace clustering, in International conference on data mining, pp. 26-37.
[54] Ranu, Sayan, and Ambuj K. Singh., 2009. Graphsig: A scalable approach to
mining significant subgraphs in large graph databases, in IEEE 25th
International Conference on Data Engineering (ICDE), pp. 844-855.
[55] R. Vijayalakshmi,R. Nadarajan, J.F.Roddick,M. Thilaga, 2011. FP-
GraphMiner, A Fast Frequent Pattern Mining Algorithm for Network Graphs,
Journal of Graph Algorithms and Applications, Vol. 15, pp. 753-776.
[56] Zhu, Feida, et al., 2007. gPrune: a constraint pushing framework for graph
pattern mining, in Advances in Knowledge Discovery and Data Mining, , pp.
388-400, Springer Berlin Heidelberg.
[57] Yan, Xifeng, X. Zhou, and Jiawei Han, 2005. Mining closed relational graphs
with connectivity constraints, in Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining, pp. 324-
333.
[58] Wu, Jia, and Ling Chen, 2008. A fast frequent subgraph mining algorithm, in
The 9th International Conference for Young Computer Scientists (ICYCS), pp.
82-87.
[59] Krishna, Varun, N. N. R. R. Suri, G. Athithan, 2011. A comparative survey of
algorithms for frequent subgraph discovery, Current Science(Bangalore), pp.
1980-1988.
[60] K.Lakshmi, T. Meyyappan, Apr. 2012. A COMPARATIVE STUDY OF
FREQUENT SUBGRAPH MINING ALGORITHMS, International Journal
of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 2.
[61] C.Jiang, F.Coenen, M.Zito, 2004. A Survey of Frequent Subgraph Mining
Algorithms, The Knowledge Engineering Review, pp. 1-31.
[62] M.Gholami, A.Salajegheh, Sep. 2012. A Survey on Algorithms of Mining
Frequent Subgraphs, International Journal of Engineering Inventions, Vol. 1,
No. 5, pp. 60-63.
[63] V.Singh, D.Garg, Jul. 2011. Survey of Finding Frequent Patterns in Graph
Mining: Algorithms and Techniques, International Journal of Soft Computing
and Engineering (IJSCE), Vol. 1, No. 3.
[64] Hussein, M.MA, T. H.Soliman, O.H. Karam, 2007. GP-Growth: A New
Algorithm for Mining Frequent Embedded Subtrees. 12th IEEE Symposium on
Computers and Communications.
[65] Tatikonda, Shirish, S.Parthasarathy,T.Kurc., 2006. TRIPS and TIDES: new
algorithms for tree mining, in Proceedings of the 15th ACM international
conference on Information and knowledge management.
[66] Tung, Jiun-Hung, 2006. MINT: Mining Frequent Rooted Induced Unordered
Tree without Candidate Generation.
[67] Chi, Yun, Y.Yang, and Richard R. Muntz., 2004. HybridTreeMiner: An
efficient algorithm for mining frequent rooted trees and free trees using
canonical forms, in Proceedings 16th International Conference on Scientific
and Statistical Database Management.
[68] T.Asai, H.Arimura, T.Uno, S.Nakano and K.Satoh, 2008. Efficient tree
mining using reverse search.
[69] S.Hido, and H. Kawano., 2005. AMIOT: Induced Ordered Tree Mining in
Tree-structured Databases, in Proceedings of the Fifth IEEE International
Conference on Data Mining (ICDM’05).
[70] H.Tan, T.S. Dillon, F.Hadzic, E.Chang, and L.Feng, 2006. IMB3-Miner:
Mining Induced/Embedded Subtrees by Constraining the Level of Embedding,
in Advances in Knowledge Discovery and Data Mining, Springer Berlin
Heidelberg, pp. 450–461.
[71] M.J.Zaki, 2002. Efficiently mining frequent trees in a forest, in In Proceedings
of the 8th International Conference on Knowledge Discovery and Data
Mining (ACM SIGKDD), pp. 71-80.
[72] C.Wang, M.Hong, J.Pei, H.Zhou, W.Wang, 2004. Efficient pattern-growth
methods for frequent tree pattern mining, in Advances in Knowledge
Discovery and Data Mining, Springer Berlin Heidelberg, pp. 441-451.
[73] S.Nijssen and J.N.Kok, 2003. Efficient Discovery of Frequent Unordered
Trees, in Proc. First Intl Workshop on Mining Graphs Trees and Sequences,
pp. 55-64.
[74] T. Asai, H. Arimura, T.Uno and S. Nakano., 2003. Discovering Frequent
Substructures in Large Unordered Trees, in procceding sixth conference on
Discovery Science, pp. 47-61.
[75] Y.Chi, Y.Yang, and R. Muntz., May 2004. Canonical Forms for Labeled
Trees and Their Applications in Frequent Subtree Mining, Knowledge and
Information Systems, No. 8.2, pp. 203-234.
[76] Chi, Yun, et al.,2005. Frequent subtree mining-an overview, in Fundamenta
Informaticae, pp. 161-198.
[77] Shasha, Dennis, J.Tsong-Li Wang and Sen Zhang.,2004. Unordered tree
mining with applications to phylogeny, in IEEE Proceedings 20th
International Conference on Data Engineering, pp. 708-719.
[78] M.J.Zaki., 2005. Efficiently Mining Frequent Embedded Unordered Trees, in
IOS Press, pp. 1-20.
[79] Jimenez, Aida, F.Berzal, J.Cubero., 2008. Mining induced and embedded
subtrees in ordered, unordered, and partially-ordered trees, in IEEE
Transactions on Knowledge and Data Engineering, Springer Berlin
Heidelberg, pp. 111-120.
[80] Jimenez, Aida,F. Berzal Juan-Carlos Cubero.,2006. Mining Different Kinds of
Trees: A Tree Mining Overview, in Data Mining.
[81] B.Bringmann.,2004. Matching in Frequent Tree Discovery, in Fourth IEEE
International Conference on Data Mining.
[82] Chi, Yun, et al. Mining.,2004. Cmtreeminer: Mining both closed and maximal
frequent subtrees, in Advances in Knowledge Discovery and Data , Springer
Berlin Heidelberg, pp. 63-73.
[83] AliMohammadzadeh, Rahman, et al., Aug 2006. Complete Discovery of
Weighted Frequent Subtrees in Tree-Structured Datasets, International
Journal of Computer Science and Network Security (IJCSNS ), Vol. 6, No. 8,
pp. 188-196.
[84] J.HU, X.Y.LI., Mar 2009. Association Rules Mining Including Weak-Support
Modes Using Novel Measures, WSEAS Transactions on Computers, Vol. 8,
No. 3, pp. 559-568.
[85] Zhao, Peixiang, and J.X.Yu.,2007. Mining closed frequent free trees in graph
databases, in Advances in Databases: Concepts, Systems and Applications,
Springer Berlin Heidelberg, pp. 91-102.
[86] Zou, Lei, et al.,2006. PrefixTreeESpan: A pattern growth algorithm for mining
embedded subtrees, in Web Information Systems (WISE), Springer Berlin
Heidelberg, pp. 499-505.
[87] Kutty, Sangeetha, R.Nayak, Y.Li., 2007. PCITMiner: prefix-based closed
induced tree miner for finding closed induced frequent subtrees, in
Proceedings of the sixth Australasian conference on Data mining and
analytics, Vol. 70, Australian Computer Society.
[88] Zhao, Peixiang, and J.X.Yu., 2008. Fast frequent free tree mining in graph
databases, in Springer World Wide Web, Hong Kong, pp. 71-92.
This paper may be cited as:
Dinari, H. and Naderi, H. 2014. A Survey of Frequent Subgraphs and
Subtree Mining Methods. International Journal of Computer Science and
Business Informatics. Vol. 14, No. 1, pp. 39-57.
A Model for Implementation of IT
Service Management in
Zimbabwean State Universities
Munyaradzi Zhou, Caroline Ruvinga,
Samuel Musungwini and Tinashe Gwendolyn Zhou
Department of Computer Science and Information Systems
Gweru, Zimbabwe
ABSTRACT
Several IT service management (ITSM) frameworks have been deployed and are being
adopted by companies and institutes without redefining the framework into a model
which suits their IT department's operating environment and requirements. An IT
service management model is proposed for Zimbabwean universities; it takes a
holistic approach through the integration of Operational Level Agreements (OLAs),
Service Level Agreements (SLAs) and IT Service Catalogues (ITSCs). The OLA is
considered the domain for describing IT service management, and its attainment is
driven by organizational management and IT section personnel in alignment with
the mission, vision and values of the organization. Explicitly defining OLAs will
aid management in identifying key services and processes in both qualitative and
quantitative form (SLAs). Once SLAs are defined, ITSCs can be formulated; these
are both customer and IT service provider centric and act as the nucleus of the
model. Redefining IT service management from this perspective will derive value
from IT service management frameworks and improve customer satisfaction.
Keywords: SLAs, OLAs, ITSCs, ITSM.
1. INTRODUCTION
IT service management is a modern concept adopted by the IT community to
improve IT service delivery and productivity, attain customer satisfaction and
control costs. IT service management integrates IT service provisioning between
service providers and end users to arrive at end-to-end service through the
implementation of measures such as Service Level Agreements (SLAs), Operational
Level Agreements (OLAs) and IT Service Catalogues (ITSCs) (Almeroth & Hasan,
2002). Service management frameworks such as Control Objectives for Information
and related Technology (COBIT) and the IT Infrastructure Library (ITIL) have
been developed in the IT industry, but they have not been tailored to a specific
IT section's operating environment and constraints. IT service is the nucleus of
the accomplishment of business processes at a university; it supports academic
research, learning and teaching. Universities offer IT services to staff,
researchers, students, visitors and partners on platforms such as electronic
learning (e-learning), library services, staff directories and email, and
learning resources which are crucial to learning, teaching and collaboration as
the community becomes global. The IT department must offer better services to
these stakeholders in a resource-constrained environment (staff and financial
resources) (University of Birmingham, 2014).
2. RELATED WORKS
An ITS service consists of three key elements, namely Service Level Agreements
(SLAs), Operational Level Agreements (OLAs) and Service Catalogue pages.
Operational Level Agreements (OLAs) are agreements between ITS teams, such as
the hardware, software and networking teams, on how they will collaborate to
ensure the appropriate service level is met for a particular service under the
supervision of a coordinator; an OLA defines the expectations and commitments
needed to deliver Service Level Agreements (SLAs) (University of California,
2012). Service Level Agreements (SLAs) are agreements between the Information
Technology Services (ITS) team or teams and their clients which define the
level of service the client should receive. An IT service catalogue is a
mapping database of an institute's available technological resources, products
and IT services, both on offer and about to be rolled out (Griffiths, Lawes, &
Sansbury, 2012; Moeller, 2013). The ITS service catalogue divides the services
offered at an institute into components, together with the policies, guidelines
and responsibilities of the parties involved, SLAs and delivery conditions
(Bon et al., 2007).
The service catalogue should be readily accessible to authorised users, allow
them to create service requests on behalf of themselves and others, and contain
facilities to approve service requests. IT service catalogues should be tested
by both IT and key users so that the product complies with the prescribed
technical functionality and usability metrics. The IT catalogue should be
developed in such a way that it facilitates effective communication between IT
management and the stakeholders involved, and acts as an effective tool for
good governance (Griffiths et al., 2012; Moeller, 2013).
Basically, an IT service catalogue is divided into a business service catalogue
and a technical service catalogue. A business service catalogue is client
centric and must meet users' requirements; thus the user community should be
engaged in requirement gathering and design. Alternatively, a technical service
catalogue is service provider centric and focuses on describing specific
services in IT terms, including service constructs and their interrelationships.
IT managerial and technical staff work processes are explicitly defined, and
access to the technical service catalogue is mainly restricted to
organizational staff (Troy, Rodrigo, & Bill, 2007).
An SLA should consist of the following elements: placement of services into
categories (sections of the catalogue); listing of each category as a service
catalogue section; establishing integrated/packaged/bundled service products;
identification of modular service products; definition of each service product;
establishing the service owner and supplier; defining procurement procedures
(how, and at what cost); specifying service level metrics (availability,
reliability, response); defining the limits of the service; and defining
customers' responsibilities. It thus provides a basis for managing the
relationship between the service provider and the customer, describing the
agreement between them for the service to be delivered, including how the
service is to be measured (Hiles, 2000).
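As an illustration only, these elements might be captured in a structured record such as the following; every field name and value here is invented for the example, not drawn from the paper or from any ITSM standard.

```python
# Hypothetical SLA record covering the elements listed above.
sla = {
    "service": "Student Email",                   # a defined service product
    "category": "Communication Services",         # catalogue section
    "owner": "ITS Software Team",                 # service owner
    "supplier": "External Mail Provider",         # supplier
    "procurement": {"how": "service desk request", "cost": "none to students"},
    "metrics": {                                  # service level metrics
        "availability": "99.5% per month",
        "reliability": "at most 2 unplanned outages per month",
        "response": "first response within 4 business hours",
    },
    "limits": "excludes personally owned devices",
    "customer_responsibilities": [
        "report faults through the service desk",
        "keep account credentials secure",
    ],
}
```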
A service must provide a bridge from the developers' and engineers' point of
view to the end user's perspective and identify the internal processes
necessary to offer and maintain the service. Service change management and
continuous process improvement are important in addressing stakeholders' needs
(University of California, 2012). A service lifecycle basically comprises:
service strategy, which focuses on defining, maintaining and implementing a
strategy; service design, which focuses on the methodology and architectural
design needed to offer the service; service transition, which focuses on
testing and integrating the services offered for quality and control
compliance; and service operation, which focuses on the smooth running of
daily IT services, together with continual improvement, which aligns the
lifecycle stages and offers room for best practices and improved value
delivery (Office of Government Commerce, 2010).
A Service Level Agreement (SLA) is a blueprint which governs service provision
parameters between the service provider and the client (University of
California, 2012). Mainly, an SLA consists of: the services being provided by
the IT service provider and how they will be delivered (these must meet the
user requirements and standards agreed upon by the parties involved and be
attainable, so communication is key in all processes); definitions of key
performance parameters; the assignment of IT service provider personnel and
users to measure specific performance using specific metrics (continuously
monitoring, managing and measuring service level commitments); and
identification of the rewards or penalties levied depending on whether service
delivery is effective or the services are not being rendered (SLA metrics
should have performance buffers to allow recovery from breaches) (Dube &
Gulati, 2005; Lahti & Peterson, 2007).
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 61
4. METHODOLOGY
The research questions in this study examine ITS personnel's service
delivery in relation to SLAs, OLAs and ITSCs. The research approach is the
way the researcher approaches the study: either gathering data and then
formulating a theory, or developing a theory and hypotheses and then
testing or validating them. An inductive approach was adopted, since it
allowed the researchers to develop a theory during analysis of the
collected data (Saunders, Lewis, & Thornhill, 2009). The researchers used
questionnaires since they facilitated saturation; the questionnaires were
distributed in proportion to the personnel in each ITS department team: 20
in the Hardware section, 7 in the Software section and 7 in the Networking
section. The response rates were 80%, 71.43% and 85.71% respectively. The
data was coded manually.
5. RESULTS
The hardware section team is not aware of any agreements with the software
team and the networking department that ensure the appropriate service
level is met for particular services within the ITS department. If OLAs
were in place, personnel felt that the ITS department director and/or
other senior officers should facilitate and maintain them, since such
agreements increase efficiency and allow work processes to be aligned with
organizational objectives.
The hardware section team is also not aware of any agreements with the
software and networking teams that define the level of service students
and staff members should receive; personnel felt such agreements should be
led by the chief technician. Personnel act on intuition when called upon
for work and tasks, or confine themselves to those in their job
descriptions. All respondents agreed that the adoption of SLAs would
improve service delivery to clients and would help in setting boundaries
on personnel's duties and in executing them with confidence; furthermore,
it results in process standardization and improved accuracy in the
execution of tasks. 10% of the respondents strongly agree, 60% agree, 15%
are neutral and 15% disagree that the use of SLAs will improve and
differentiate services by defining performance and its measures, which
will help in building actionable performance tracking and controls.
There is currently no policy on the IT services on offer and ready to be
delivered; respondents felt these should be monitored by the supervisors
responsible for the specific services being offered. In hardware
maintenance, personnel from other departments are called upon to carry out
all related activities on an ad-hoc basis. ITSCs offer a platform to
evaluate whether the services being offered meet the required standard.
Top management, such as directors and supervisors, are key stakeholders in
the implementation of IT service management.
The networking section team does not have any agreements with the software
and hardware teams to ensure the appropriate service level is met for
particular services in the ITS department. The service level that students
and staff should receive, such as the uptime and download speed available
on both the wireless and wired networks, is not defined. Staff portal
services and the students' electronic learning (E-Learning) accounts,
monitored by the software team, depend on network availability and server
capacity, which are the responsibility of the networking and hardware
sections respectively, even though there are no OLAs among the departments
concerned. Staff and students are consulted only informally on their
requirements for the services offered by the ITS department. Students and
staff members should be given a platform to request additional 'add-on'
functionalities for their E-Learning and staff portal accounts.
IT service management model
A university-wide IT service management model was developed. It consists
of the Operational Level Agreements, viewed as the cornerstone of IT
service management implementation; the Service Level Agreements, the
sub-domain linking OLAs and ITSCs; and the IT service catalogues, referred
to as the nucleus of IT service management. Leadership support from
personnel such as IT directors, project managers and chief IT technicians
is important, since they will initiate the setting of specific benchmarks
for performance measurement and facilitate an effective feedback mechanism
and communication. Top management will help in organizing seminars or
workshops, in the form of refresher courses or awareness campaigns, about
the execution of work processes.
Explicitly defining OLAs will aid management in identifying key services
and processes in both qualitative and quantitative form, while monitoring
them and taking corrective measures where necessary (SLAs). Once SLAs are
defined, ITSCs can be formulated: a measure that is both customer and IT
service provider centric and acts as the nucleus of the model. Services
should be end-user centric rather than reflect the provider's point of
view; for example, the website should be easy to navigate, and there must
be a distinction between administrative issues and the other information
displayed on the homepage. Support services, including how to access the
website using mobile phones and which mobile browsers are supported or
compatible, should be made available to clients. Additionally, key future
plans should be communicated, such as a general upgrade of the site
(including the time it is expected to be down during maintenance),
upgrading to a mobile site, modification of functionalities on the
webpage, and the phasing out of specific services. Figure 1 shows the
developed model.
Figure 1: IT service management implementation model
The model in Figure 1 comprises four linked components:
- Operational Level Agreements (IT service provider centric): definition
of the services required to deliver services; explicit definition of the
responsibilities of the IT service provider and the recipient.
- OLA driving forces: leadership support; setting specific performance
benchmarks; rewards and recognition, or penalties, in response to adopting
OLAs; education and awareness campaigns for ITS department section
personnel; ensuring an effective feedback mechanism and communication.
- Service Level Agreements: identify key services and processes to achieve
the required goal; define services in qualitative and quantitative form;
monitor the key services and processes while corrective measures are taken
where necessary.
- Service Catalogue (customer centric): details of service and product
offerings; reports on website availability (response time, uptime
percentage, etc.); support services (e.g. installation of preliminary
software, supported mobile browsers/compatible phone types); key policies;
terms and conditions; Service Level Agreements (SLAs); key future plans
(upgrading to mobile, modification of functionality, phasing out of a
service, etc.).
6. CONCLUSIONS
An enabling, collaborative approach to quality improvement should be
explored by the ITS teams, involving their clients (staff and students) so
that their needs are satisfied. In achieving ITSM, goals must be
benchmarked and reviewed by a monitoring and evaluation committee steered
by the project manager. The committee must ensure the availability of
human and financial resources, for example by lobbying for top management
support and for the training of employees. In addition, the committee
should facilitate a cyclical communication system with stakeholders and
top management so as to ensure their support and commitment, including
during the review process. The institution's goals, vision and mission
should be aligned with the ITSM strategy adopted. A service catalogue,
which acts as a blueprint for clients in understanding and making informed
decisions about the services they use or intend to use, must always be
made available to clients; it also acts as a benchmark for quality
assurance on the services the ITS department offers.
OLAs between the IT service provider and the procurement or other
departments, to obtain hardware or other resources within agreed times,
and between a service desk and a support group, to provide incident
resolution within agreed times, should be defined to ensure the
appropriate service level is met (Rudd, 2010). The adoption of OLAs will
result in better service delivery and better management of duties and
responsibilities. Universities must integrate the various IT teams within
departments across their campuses while explicitly defining the
implementation of SLAs, OLAs and ITSCs, and must also emphasise
performance reporting, facilitated by team leaders from all IT sections.
Additionally, institutions must identify the facilitating and clogging
conditions for successful ITSM, which can be done by conducting seminars
and/or workshops on relevant IT aspects. Conducting post-training
evaluation of deliberations on ITSM will help continuous improvement in
service delivery. Relating COBIT and ITIL to the IT service management
constructs (OLAs, SLAs and ITSCs) presents an interesting area for further
research.
REFERENCES
[1] Almeroth, K.C. and Hasan, M., 2002. Management of Multimedia on the
Internet: 5th IFIP/IEEE International Conference on Management of
Multimedia Networks and Services, MMNS 2002, Santa Barbara, CA, USA,
October 6-9, 2002, Proceedings. CA: Springer, p.356.
[2] Bon, J. van et al., 2007. IT Service Management: An Introduction. Van Haren
Publishing, p.514.
[3] Dube, D.P. and Gulati, V.P., 2005. Information System Audit and Assurance. Tata
McGraw-Hill Education, p.671.
[4] Griffiths, R., Lawes, A. and Sansbury, J., 2012. IT Service Management: A Guide for
ITIL Foundation Exam Candidates. BCS, The Chartered Institute for IT, p.200.
[5] Hiles, A., 2000. Service Level Agreements: Winning a Competitive Edge for Support &
Supply Services. Rothstein Associates Inc, p.287.
[6] Lahti, C.B. and Peterson, R., 2007. Sarbanes-Oxley IT Compliance Using Open Source
Tools. Syngress, p.466.
[7] Moeller, R.R., 2013. Executive’s Guide to IT Governance: Improving Systems
Processes with Service Management, COBIT, and ITIL. John Wiley & Sons, p.416.
[8] Office of Government Commerce, 2010. Introduction to the ITIL service lifecycle. The
Stationery Office, p.247.
[9] Rudd, C., 2010. ITIL V3 Planning to Implement Service Management. The Stationery
Office, p.320.
[10] Saunders, M., Lewis, P. and Thornhill, A., 2009. Research Methods for
Business Students. 5th ed. Essex, England: Pearson Education Limited.
[11] Troy, D.M., Rodrigo, F. and Bill, F., 2007. Defining IT Success
Through the Service Catalog: A Practical Guide about the Positioning,
Design and Deployment of an Actionable Catalog of IT Services. 1st ed. US:
Van Haren Publishing.
[12]University of Birmingham, 2014. IT Services - University of Birmingham. [Online]
Available at: <http://www.birmingham.ac.uk/university/professional/it/index.aspx>
[Accessed 18 Mar. 2014].
[13] University of California, 2012. ITS Service Management: Key Elements. [online]
Available at: <http://its.ucsc.edu/itsm/servicemgmt.html> [Accessed 18 Mar. 2014].
This paper may be cited as:
Zhou, M., Ruvinga, C., Musungwini, S. and Zhou, T. G., 2014. A Model for
Implementation of IT Service Management in Zimbabwean State Universities.
International Journal of Computer Science and Business Informatics, Vol.
14, No. 1, pp. 58-65.
Present a Way to Find Frequent
Tree Patterns using Inverted Index
Saeid Tajedi
Department of Computer Engineering
Lorestan Science and Research Branch, Islamic Azad University
Lorestan, Iran
Hasan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
Among all patterns occurring in a tree database, mining frequent trees is
of great importance. A frequent tree is one that occurs frequently in the
tree database. Frequent subtrees are not only important in themselves but
are also applicable to other tasks, such as tree clustering,
classification, bioinformatics, etc. In this paper, after reviewing
different methods of searching for frequent subtrees, a new method based
on an inverted index is proposed to mine frequent tree patterns. The
procedure has two phases: passive and active. In the passive phase, we
find the subtrees in the dataset, convert them to strings and store them
in the inverted index. In the active phase, we easily derive the desired
frequent subtrees from the inverted index. The proposed approach tries to
take advantage of times when the CPU is idle, so that CPU utilization is
at its highest in the evaluation results. In the active phase, frequent
subtree mining is performed using the inverted index rather than directly
on the dataset; as a result, the desired frequent subtrees are found in
the fastest possible time. Another feature of the proposed method is that,
unlike previous methods, adding a tree to the dataset does not require
repeating the previous steps; in other words, the method performs well on
dynamic trees. In addition, the proposed method is capable of interacting
with the user.
Keywords: Tree mining, inverted index, frequent pattern mining, tree patterns.
1. INTRODUCTION
Data mining or knowledge discovery deals with finding interesting patterns
or information that is hidden in large datasets. Recently, researchers have
started proposing techniques for analyzing structured and semi-structured
datasets. Such datasets can often be represented as graphs or trees. This has
led to the development of numerous graph mining and tree mining
algorithms in the literature. In this article we present an efficient algorithm
for mining trees.
Data mining has evolved from association rule mining and sequence mining
to tree mining and graph mining. Association rule mining and sequence
mining are one-dimensional structure mining, while tree mining and graph
mining are two-dimensional or higher structure mining. Applications of
tree mining arise in Web usage mining, mining semi-structured data,
bioinformatics, etc.
The basic and fundamental ideas of tree mining were first seriously
discussed in the early '90s and were developed over that decade; the
origin of these ideas lies in their applications, especially on the web.
First, some essential and basic concepts are described; then the proposed
method is presented; finally, the results are evaluated.
2. Related Works
2.1 Pre-Order Tree Traversal
There are several ways to traverse ordered trees; pre-order traversal is
one of the most important and most widely used. It proceeds like the
depth-first search algorithm: in a tree T, we start from the root, then
visit the left child and finally the right child; this is done recursively
on all nodes of the tree.
2.2 Post-Order Tree Traversal
This is also among the most important and widely used methods of
ordered-tree traversal. In this method, for a tree T we start from the
left child, then the right child and finally the root, and the operation
is performed recursively on all nodes of the tree.
Using either traversal, we can assign a number to each node that
represents the time at which the node is visited. If we use the post-order
traversal, that number is called the PON (post-order number).
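The two traversals and the PON numbering can be sketched as follows; the tree, node labels and helper names are illustrative, and unique labels are assumed for this toy example only.

```python
# Sketch of pre-order and post-order traversal of an ordered tree,
# with post-order numbers (PON) assigned to each node.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def preorder(node, out):
    out.append(node.label)       # visit the root first
    for child in node.children:  # then the children, left to right
        preorder(child, out)
    return out

def postorder(node, out):
    for child in node.children:  # visit the children, left to right
        postorder(child, out)
    out.append(node.label)       # the root comes last
    return out

def assign_pon(root):
    """Map each label to its post-order number (1-based); assumes
    unique labels in this toy example."""
    return {label: i + 1 for i, label in enumerate(postorder(root, []))}

# Example tree: A with children B and C; C has children D and E.
t = Node("A", [Node("B"), Node("C", [Node("D"), Node("E")])])
print(preorder(t, []))   # ['A', 'B', 'C', 'D', 'E']
print(postorder(t, []))  # ['B', 'D', 'E', 'C', 'A']
print(assign_pon(t))     # {'B': 1, 'D': 2, 'E': 3, 'C': 4, 'A': 5}
```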
2.3 LMP and RMP
LMP is the acronym for Left-Most Path and denotes the path from the root
to the leftmost leaf; RMP is the acronym for Right-Most Path and denotes
the path from the root to the rightmost leaf.
2.4 Prüfer Sequence [23]
This algorithm was introduced in 1918 and is used to convert a tree to a
string. It works as follows: in a tree T, at every step the leaf with the
smallest label is removed and the label of its parent is added to the
Prüfer sequence. This process is repeated n-2 times, until 2 nodes remain.
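The classic n-2-step encoding described above can be sketched as follows; the edge list and node numbering are illustrative.

```python
from collections import defaultdict

def prufer_sequence(edges, n):
    """Classic Prüfer encoding of a labeled tree on nodes 1..n.
    At every step the leaf with the smallest label is removed and the
    label of its only neighbour (its parent, in a rooted view) is
    appended; this is repeated n-2 times, until two nodes remain."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        leaf = min(node for node in adj if len(adj[node]) == 1)
        neighbour = next(iter(adj[leaf]))
        seq.append(neighbour)
        adj[neighbour].discard(leaf)
        del adj[leaf]
    return seq

# Star 1,2,3 -> 4 plus edge 4-5: every removed leaf reports node 4.
print(prufer_sequence([(1, 4), (2, 4), (3, 4), (4, 5)], 5))  # [4, 4, 4]
```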
2.5 Label Sequence
The next concept is the label sequence, which is produced according to the
post-order traversal: the label of each node, as it is visited in
post-order, is appended to the sequence.
2.6 Support
Simply put, the support of a pattern S indicates how often S occurs across
the trees of the database:
Support(S) = |{T ∈ D : S occurs in T}| / |D|   (1)
where S is a tree pattern and D is a database of trees. This concept is
used to determine the number of occurrences of each subtree in a set of
trees.
2.7 Inverted Index [24]
An inverted index is a structure used to index frequent string elements in
a set of documents; it consists of two main parts: the dictionary and the
posting lists. Frequent string elements are stored uniquely in the
dictionary, together with the number of occurrences of each element across
all documents. Information about a frequent element, such as the names of
the documents containing it and the number of occurrences in each
document, is recorded in its posting list.
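A minimal sketch of this structure; the class and method names are our own, not from the paper.

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: the dictionary keeps each element's total
    occurrence count across all documents; the posting list records, per
    element, the documents containing it and the count inside each one."""
    def __init__(self):
        self.total = defaultdict(int)                          # element -> total count
        self.postings = defaultdict(lambda: defaultdict(int))  # element -> {doc: count}

    def add(self, doc, elements):
        for element in elements:
            self.total[element] += 1
            self.postings[element][doc] += 1

    def lookup(self, element):
        """Return (total occurrences, {document: occurrences})."""
        return self.total[element], dict(self.postings[element])

idx = InvertedIndex()
idx.add("D1", ["tree", "mining", "tree"])
idx.add("D2", ["tree", "index"])
print(idx.lookup("tree"))  # (3, {'D1': 2, 'D2': 1})
```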
3. An overview of research history
In recent years, much research on frequent subtree mining has been done.
Yongqiao Xiao et al. in 2003 used the Path Join algorithm and a compact
data structure called FST-Forest to find frequent subtrees [25]: first,
frequent rooted paths are found in all directions, and then these paths
are merged to reach the frequent subtrees. Shirish Tatikonda et al.
published an article in 2006 based on pattern growth [26]: all trees in
the tree database are converted to strings, using one of two methods,
Prüfer sequences or the DFS algorithm; then all strings containing a
subtree (pattern) S are scanned, seeking a new edge that can be added to
S. Concurrently with the generation of candidate subtrees, the threshold
values are evaluated to decide whether they are frequent. In 2009,
Federico Del Razo Lopez et al. presented an idea for relaxing the tight
constraints of non-fuzzy tree mining [27]. Their paper uses the principle
of partial inclusion: to say that a pattern S occurs in a tree T, it is
not necessary for all the pattern's nodes to exist in the tree. The
proposed algorithm uses the Apriori property for pruning undesirable
patterns.
4. The proposed approach
The procedure has two phases: passive and active. In the passive phase, we
first find all the subtrees of all the trees and store them in the
inverted index. In the active phase, we simply use the index to extract
frequent tree patterns.
4.1 Passive Phase
This phase has two stages. In the first stage, we find all subtrees of
every tree in the dataset and convert them to strings associated with that
tree; in the second stage, the strings produced in the first stage are
stored in the inverted index.
4.1.1 First stage of Passive phase
The first important point is that a node label can be repeated many times
within a tree, yet every node in every tree needs a unique identifier; to
solve this problem, we use the Prüfer sequence method. Each tree is
traversed in post-order and the Prüfer sequence algorithm, in effect,
works on the PONs; as a result, each node of a tree is marked with a
unique number.
The next issue is that the Prüfer sequence must cover all the nodes;
therefore the algorithm runs for n steps rather than n-2, and for the last
node (the root) the number 0 is recorded in place of a parent label.
Figure 1 shows an example of this method, where NPS denotes the Prüfer
sequence obtained using the post-order numbering.
Next, every subtree should be represented uniquely; to this end, we obtain
the CPS of each tree, which merges the Prüfer sequence and the label
sequence; in other words, CPS(T) = (NPS, LS)(T). A CPS uniquely represents
a rooted, labeled tree. As you can see in Figure 1, the tree T1 is
represented uniquely by these two complementary strings.
Figure 1. An example of the Prüfer Sequence and Label Sequence for T1 tree
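As a sketch of this encoding, the CPS of a small tree can be computed as below. We read the paper's examples (A0C2, A0C3B3, ...) as pairing each node's label with its parent's PON (0 for the root), listed root-first in reverse post-order; this pair format is our interpretation of the figures, not a definitive specification.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def cps(root):
    """CPS sketch: assign post-order numbers (PON), then emit each node
    as its label followed by its parent's PON (0 for the root), in
    reverse post-order, as the paper's examples suggest."""
    order = []  # (node, parent) pairs collected in post-order
    def walk(node, parent):
        for child in node.children:
            walk(child, node)
        order.append((node, parent))
    walk(root, None)
    pon = {id(node): i + 1 for i, (node, _) in enumerate(order)}
    return "".join(
        f"{node.label}{pon[id(parent)] if parent else 0}"
        for node, parent in reversed(order)
    )

print(cps(Node("A", [Node("C")])))             # A0C2
print(cps(Node("A", [Node("B"), Node("C")])))  # A0C3B3
```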
Next, we must ensure that all subtrees of each tree are generated and that
each subtree is created only once; for this purpose, we use the LMP to
extend subtrees. That is, if we represent the tree T using its Prüfer
sequence and n is a subtree, then a node v to be added to n must lie on
the LMP of T; and since the PON underlies the Prüfer sequence, v must
simply come immediately after the last node of n, attached to it, in the
Prüfer sequence of T. This guarantees that each subtree is generated only
once and, when done for all nodes, all subtrees of each tree are produced.
We now introduce the algorithm. The proposed algorithm for generating
subtrees and converting them into strings is shown in Figure 2.
Insert CPS(T) into array A
For i = n downto 1 do
{
    subtree = A[i]
    Insert CPS(A[i]) into TreeString_i
    Sub(subtree, i, A, stack1, stack2)
}

Sub(subtree, index, A[], stack1, stack2)
{
    c = 0
    t = 0
    For j = 1 to index - 1 do
        If index in A[j] then
        {
            stack3 = stack1        // work on copies so the originals survive
            stack4 = stack2
            subtree2 = subtree
            While stack3 not empty
            {
                t++
                Pop x from stack3
                Pop y from stack4
                subtree2 = subtree2 + x
                If t > 0 then
                {
                    Insert CPS(subtree2) into TreeString_i
                    Sub(subtree2, y, A[], stack3, stack4)
                }
            }
            If c > 0 then
            {
                Push tempTree onto stack1
                Push tempIndex onto stack2
            }
            tempTree = A[j]
            tempIndex = j
            c++
            subtree = subtree + A[j]
            Insert CPS(subtree) into TreeString_i
            Sub(subtree, j, A[], stack1, stack2)
            While stack1 not empty
            {
                c--
                Pop x from stack1
                Pop y from stack2
                Insert CPS(subtree + x) into TreeString_i
                Sub(subtree + x, y, A[], stack1, stack2)
            }
        }
}
Figure 2. The algorithm for generating subtrees and converting them to strings
We now trace how the algorithm works with an example. We begin with the
first tree, and CPS(T) is stored in the array A; for T1 the completed
array is shown in Figure 3.
Figure 3. Production of the array using CPS (T)
In this step we identify all existing subtrees and store them in a string.
To do this, we start from the root node of T1, i.e. the last element of
the array, A0; the subtrees branching from this node are stored in the
string in turn. First, A0 itself is stored in the string according to the
algorithm; next, we run the Sub function. Since the index of the previous
node is 9, to find the subtrees with two nodes we scan the array from its
first element up to the element just before the previous node, i.e. index
8; whenever an element's value contains the index of the previous node
(9), it is added to the previous subtree (A) and the CPS of the resulting
subtree is inserted into the string of this tree. Here A0C2 and A0E2 are
stored in the string, and the same steps are repeated recursively for the
newly generated subtrees. Since both produced subtrees branch from one
node, the added node with the smaller index (taken from stack1) and its
index (taken from stack2) are extracted and added to the subtree with the
larger index, and its CPS is stored in the string; in this step A0E3C3 is
therefore also added, and the same is repeated for all subtrees produced
with a larger index in the next step. The process continues recursively in
this way until all subtrees branching from the first node of the array
have been stored in the string. The same procedure is then applied to the
next elements of the array until the string of the subtrees of the tree is
complete, and then to the subsequent trees, until for each tree a string
of all its subtrees has been created.
4.1.2 Second stage of Passive phase
In the second stage of this phase, we use the inverted index: the strings
created in the previous stage are inserted into it. The CPS and the number
of occurrences of each subtree across all the trees are stored in the
dictionary, and the names of the trees containing the subtree are stored
in the corresponding posting list.
Figure 4. Part of the Inverted Index made for the collection of trees T1, T2
As can be seen, the subtrees are stored in the dictionary and the parent
trees of the corresponding subtrees are stored in the posting lists.
4.2 Active Phase
In this phase, we simply use the inverted index built in the previous
phase to extract frequent tree patterns. Various types of queries about
frequent subtree mining can be answered quickly using the index. Below, we
examine several different query types.
4.2.1 Finding the occurrences of a desired pattern in the tree set
First, we obtain the CPS of the desired pattern, then search for it in the
dictionary of the inverted index, and easily extract the number of
occurrences and the names of the trees containing the desired pattern from
its posting list. For example, to find the number of occurrences of the
pattern S in the collection of trees T1, T2 in Figure 5, we search for
CPS(S), i.e. A0C3B3, in the inverted index; T1 and T2 will be the result.
Figure 5. Part of the Inverted Index made for the collection of trees T1, T2
4.2.2 Finding frequent subtrees with respect to the support
If we want to find the subtrees whose support is greater than a threshold,
we must find the subtrees whose number of occurrences, relative to the
total number of trees, exceeds the support. We can therefore search the
inverted index and easily find the subtrees whose posting-list length,
relative to the total number of trees, is at least equal to the support.
4.2.3 Finding frequent subtrees with respect to the support and a minimum
number of nodes
In this case, in addition to the support, the number of nodes is also a
criterion. We therefore search the inverted index and report only the
subtrees satisfying two conditions: first, the length of the subtree in
the dictionary is at least the minimum number of nodes; second, the length
of the corresponding posting list, relative to the total number of trees,
is at least equal to the support.
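The two queries above reduce to a scan of the index. A sketch, assuming the index maps each CPS string to the list of tree names in its posting list; the data, and the single-letter-label node count, are illustrative simplifications.

```python
def frequent_subtrees(index, total_trees, support, min_nodes=1):
    """Return the CPS strings whose posting-list length, relative to
    the total number of trees, is at least `support`, and whose subtree
    has at least `min_nodes` nodes. Each CPS pair is one label followed
    by a parent PON, so with single-letter labels the node count equals
    the number of alphabetic characters (an illustrative shortcut)."""
    result = []
    for cps_string, posting in index.items():
        nodes = sum(ch.isalpha() for ch in cps_string)
        if nodes >= min_nodes and len(posting) / total_trees >= support:
            result.append(cps_string)
    return result

# Toy index over two trees T1 and T2.
index = {
    "A0C3B3": ["T1", "T2"],
    "A0C2":   ["T1"],
    "B0":     ["T1", "T2"],
}
print(frequent_subtrees(index, 2, support=1.0))               # ['A0C3B3', 'B0']
print(frequent_subtrees(index, 2, support=1.0, min_nodes=2))  # ['A0C3B3']
```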
5. Evaluation
In this section, the proposed method is evaluated from various aspects. We
present an experimental evaluation of the proposed approach on synthetic
datasets. In the following discussion, dataset sizes are expressed in
terms of the number of trees, and in the graphs 'Algorithm' denotes the
proposed method. The names and details of the synthetic datasets are shown
in Table 1.
Table 1. Name and details of synthetic datasets
Name Description
DS1 -T 10 -V 100
DS2 -T 10 -V 50
As shown in Table 1, the synthetic datasets DS1 and DS2 were generated
using the PAFI [28] toolkit developed by Kuramochi and Karypis (PafiGen).
Since PafiGen can create only graphs, we extracted spanning trees from
these graphs and used them in our analysis. We also used minsup to analyze
the various factors: if the number of occurrences of a subtree is less
than the minsup value, the subtree is not indexed in the inverted index.
Minsup ranges from 1 to infinity, and its default value in the proposed
algorithm is 1. In addition, we use maxnode in the evaluations: maxnode
specifies the maximum number of nodes in each subtree in the inverted
index, so when the number of nodes of a subtree reaches the maxnode value,
the proposed algorithm halts the generation of its subtrees.
Maxnode ranges from 1 to infinity, and its default value is infinity.
5.1 Evaluating the performance of the proposed method
At the beginning, we evaluated the proposed algorithm on the two synthetic
datasets DS1 and DS2. The performance of the proposed algorithm for
frequent tree mining on the synthetic datasets is shown in Diagram 1; in
this experiment, minsup is equal to one and maxnode is infinity. Given
that subtrees are indexed in the passive phase, at times when the system
is idle, the mining time in the inverted index rises with a gentle slope
as the number of trees increases. This clearly shows that the introduced
algorithm is scalable.
Diagram 1: The performance of the algorithm on synthetic datasets
5.2 Evaluating the effect of minsup on the number of indexed patterns
We examine the effect of minsup on the number of indexed patterns in
Diagram 2. This experiment was done on the synthetic datasets DS1 and DS2
generated by PAFI, with size 50K; maxnode has its default value, i.e.
infinity. As can be seen in the diagram, the number of indexed patterns
increases exponentially as minsup decreases.
Diagram 2: Effect of minsup on the number of indexed patterns
5.3 Evaluating the effect of maxnode on memory usage
We examine the effect of the maximum number of nodes in the indexed
subtrees on memory usage in the passive phase. This experiment was done on
the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K;
minsup has its default value, i.e. 1. As can be seen, the memory usage of
the algorithm increases with the number of indexed nodes in each subtree.
Diagram 3: Effect of maxnode on memory usage
5.4 Evaluation of CPU utilization compared with TreeMiner
In Diagram 4, the proposed algorithm is compared with TreeMiner, which was
introduced by Zaki and is one of the best tree mining algorithms [29].
This experiment was done on the synthetic dataset DS1 generated by PAFI,
with size 50K. Since in the passive phase the proposed algorithm searches
for subtrees and adds them to the inverted index, CPU utilization is close
to 100 percent in most situations, as can be seen in the diagram, while
the average CPU utilization of the TreeMiner algorithm is approximately
90%.
Diagram 4: Comparison of CPU utilization between TreeMiner and the proposed algorithm
6. Conclusions and Recommendations
In this paper, a new method for frequent pattern mining based on the inverted index was introduced to overcome many of the disadvantages of previous methods. One problem with existing approaches is that they mainly act statically on the set of trees: if a new tree is added, all mining operations must be repeated from scratch. The proposed approach overcomes this problem with the inverted index. All trees are indexed in the passive phase, and if a new tree is added to the treeset at any stage, only that tree is indexed; there is no need to repeat the previous operations. The algorithm therefore performs well on dynamic collections of trees. Another advantage of this method over others is its scalability: as shown in Section 5.1, the algorithm's performance does not degrade as the treeset grows. As shown in Section 5.4, one of its most striking features is efficient use of the CPU. The method also supports user interaction.
As shown in Section 5.2, the number of indexed patterns increases exponentially as minsup decreases, while patterns with few occurrences are generally of no interest. Consequently, indexing in the passive phase can be sped up by choosing an appropriate value of minsup. As shown in Section 5.3, memory usage increases with the maximum number of nodes in the indexed subtrees, while subtrees with a very large number of nodes are usually of no interest. Consequently, memory usage can be managed by choosing an appropriate value of maxnode.
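The incremental indexing idea described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the class name and the canonical string encodings used as pattern keys are assumptions made for the example.

```python
from collections import defaultdict

class TreePatternIndex:
    """Toy inverted index over tree patterns.

    Each subtree pattern is keyed by a canonical string encoding;
    its posting list records which trees of the treeset contain it.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # pattern key -> set of tree ids

    def index_tree(self, tree_id, patterns):
        # Passive phase for ONE tree: add its enumerated subtree
        # patterns to the index. A tree added later touches only its
        # own patterns -- the rest of the treeset is never re-mined.
        for p in patterns:
            self.postings[p].add(tree_id)

    def frequent_patterns(self, minsup):
        # A pattern is frequent when its posting list covers at least
        # `minsup` trees.
        return {p: ids for p, ids in self.postings.items()
                if len(ids) >= minsup}

# Usage with hand-made canonical encodings for three small trees:
idx = TreePatternIndex()
idx.index_tree(1, ["A", "A/B", "A/C"])
idx.index_tree(2, ["A", "A/B"])
idx.index_tree(3, ["A", "A/C"])
print(sorted(idx.frequent_patterns(minsup=2)))  # ['A', 'A/B', 'A/C']
```

In the paper's setting, minsup and maxnode bound which subtrees are enumerated and indexed in the first place, which is what controls passive-phase time and memory; the sketch only shows the posting-list mechanics.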
REFERENCES
[1] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[2] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[3] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[4] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[5] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[6] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[7] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[8] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," Advances in Knowledge Discovery and Data
Mining, p. 15, 2014.
[9] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[10] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams", Advances in Knowledge Discovery and Data
Mining, 2014.
[11] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p.
12, 2013.
[12] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[13] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[14] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[15] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[16] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[17] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[18] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[19] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," International Journal of Advanced Computer
Research, p. 15, 2014.
[20] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[21] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams," International Journal of Advanced Computer
Research, 2014.
[22] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd symposium on Principles of database systems,
p. 12, 2013.
[23] H. Prüfer. Prüfer sequence. Available:
http://en.wikipedia.org/wiki/Pr%C3%BCfer_sequence
[24] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information
Retrieval. Cambridge, England: Cambridge University Press, 2008.
[25] Y. Xiao, J.-F. Yao, Z. Li, and M. H. Dunham, "Efficient data mining for maximal
frequent subtrees," Proceedings of 3rd IEEE International Conference on Data
Mining, p. 8, 2003.
[26] S. Tatikonda, S. Parthasarathy, and T. Kurc, "TRIPS and TIDES: New Algorithms for
Tree Mining," Proceedings of 15th ACM International Conference on Information
and Knowledge Management (CIKM), p. 12, 2006.
[27] F. D. R. Lopez, A.Laurent, P.Poncelet, and M.Teisseire, "FTMnodes: Fuzzy tree
mining based on partial inclusion," Advanced Data Mining and Applications, pp.
2224–2240, 2009.
[28] Kuramochi and Karypis. Available: http://glaros.dtc.umn.edu/gkhome/pafi/overview/
[29] M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest," Proceedings of the 8th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(SIGKDD), Edmonton, Canada, p. 10, 2002.
This paper may be cited as:
Tajedi, S. and Naderi, H., 2014. Present a Way to Find Frequent Tree Patterns
using Inverted Index. International Journal of Computer Science and Business
Informatics, Vol. 14, No. 1, pp. 66-78.
An Approach for Customer Satisfaction:
Evaluation and Validation
Amina El Kebbaj and A. Namir
Laboratory of Modeling and Information Technology, Department of
Mathematics and Computer Science,
Faculty of Sciences Ben M'sik, Hassan2-Mohammedia University
Casablanca - 7955, Morocco
ABSTRACT
The main objective of this work is to develop a practical approach to improving customer satisfaction, which is generally regarded as the pillar of customer loyalty to the company. Today, customer satisfaction is a major challenge: listening to the customer and anticipating and properly managing his claims are cornerstones and fundamental values for the enterprise. In terms of the quality of the product, of skills and, above all, of the service provided to the customer, it is essential for organizations to differentiate themselves, especially in an increasingly competitive world, in order to ensure a higher level of customer satisfaction. Ignoring customer satisfaction can have harmful consequences for both economic performance and the organization's image. It is therefore crucial to develop new methods and approaches to the problem of customer dissatisfaction by improving the quality of the services provided to the customer. This work describes a simple and practical approach to modeling customer satisfaction in organizations in order to reduce the level of dissatisfaction; the approach respects the constraints of the organization and eliminates any action that can lead to loss of customers or degradation of the organization's image. Finally, the approach presented in this document is tested and evaluated.
Keywords: Approach, Evaluation, Quality, Satisfaction, Test of homogeneity, Validation.
1. INTRODUCTION
"Does the company have the most meaningful information at the right time to make the best possible business decisions?" is the question most companies want to answer. "The purpose of a company is to create and keep a customer" (Levitt, 1960): this declaration clearly identifies the important phases of the customer-management life cycle, namely acquiring customers and ensuring their loyalty. Companies are moving towards "customer-oriented" management and focus on the life cycle of their customers. According to Moisand (2002), the customer life cycle is defined as the time interval during which a customer's status changes from "new customer" to "lost/former customer".
In the context of a globalized and very competitive market, where departments have moved from a more classic, cost-centered level of management to a value-centered approach, the mission of decision-makers has evolved from proposing services and strategic partnerships to value creation. Achieving this goal requires having all the data needed to illuminate the past and clarify the present in order to predict the future, avoiding the gray areas caused by a lack of information. Business intelligence includes all the IT solutions (methods, facilities and tools) used to pilot the company and help make decisions.
This approach can be modeled by the three systems below:
1. Decision system: thinks, decides and controls;
2. Effective system: transforms and produces;
3. Information system: links the decision system with the effective system. Its main purposes are:
 Generating information
 Memorizing information
 Broadcasting information
 Processing information.
Figure 1. The information system
The information system is the subsystem of the organization responsible for collecting, storing, processing and broadcasting information to the effective system and the decision system. In the effective system, the information is a current view of business data (invoices, purchase orders, etc.); in the decision system, the information is more synthetic because it should support decision making (e.g., the list of the 3 least-sold products in January 2014). The information system thus links these two subsystems and must bring to all organizational actors of the company the information they need to act and
decide. The IS is therefore a representation of reality; it makes it possible to coordinate the activities of the company.
This work is situated in that spirit: it contributes to maximizing the company's customer satisfaction, that is, it proposes an approach that eliminates any form of customer loss inside an organization, then evaluates and validates that approach. Finally, it tests the homogeneity of the problem in order to measure customer satisfaction and conduct corrective actions based on two dimensions of quality:
 The "made" quality Qr: do the product, process or service conform to what was defined and expected? It comprises the different evaluations used to judge the achievement of process targets, to measure the effects and to check whether the desired results were achieved.
 The "perceived" quality Qp: what level of satisfaction is generated in the customer? It is defined by the excellence of the product (Zeithaml, 1988).
The ultimate goal is to have Qr = Qp.
Figure 2. Company's qualities
The introduction has defined the conceptual framework of the work and presented the issue addressed and the contributions in the domain of company governance. The remainder is composed of 3 sections: in the 2nd paragraph, we expose the approach, which is then statistically evaluated on concrete examples; in the 3rd paragraph, we test the homogeneity of the problem. The conclusion outlines this study and our contribution, and presents the various extensions and possible future works.
2. PROPOSED APPROACH
The Standish Group (Valery, 2001) conducted an international study evaluating the success and failure of IT projects. The data accumulated over the past ten years are based on a sample of 50,000 projects. This study identified three levels of evaluation of a project:
 Project success: characterized by a system delivered on time, at a cost within budget, and fully compliant with the specifications;
 Project failure: characterized by the cessation of the project;
 Partial success or partial failure: characterized by the late delivery of a system that is only partially responsive, especially in terms of business scope and specifications, at a cost of up to 200% of the original budget.
Only 29% of projects were successful, 53% were partial successes or partial failures, and 18% failed. The proportion of projects abandoned, over budget or late reaches 71%.
2.1 Statement
This study shows that customer satisfaction is not always reached; making perceived quality tend towards the desired quality presents a real challenge. Within the company, quality is increasingly focused on customer satisfaction. To win contracts, business leaders rely more on quality than on price advantages. Staff involvement, together with listening to the customer, is a key element for the success of a quality approach. The latter is the implementation of all the resources available to an establishment to provide a service that meets the needs and expectations of customers. From the customer's perspective, a warm welcome and quality service are "normal"; it is the lack of quality that penalizes him.
To attract the customer, we must establish standards within the company by identifying the market need. There are international standards, the ISO standards, that ensure safe, reliable, high-quality products and services. For companies, they are strategic tools for lowering costs, increasing productivity, and reducing waste and errors. Obtaining a certification is also the preferred way for companies to make the quality of their organization known to their customers and their suppliers.
2.2 Steps of the approach
Below are the 7 best practices for customer satisfaction:
a) To develop the team's skills: provide additional training on IT tools to raise the team's skills.
b) To make customer satisfaction a challenge for the whole company: the company can use the dissatisfaction of its customers to improve its products and services. Bill Gates, Microsoft CEO, said that "unhappy customers are the best sources of information", because customers who express dissatisfaction enable companies to identify and resolve service defects faster.
Dissatisfied customers are very expensive for companies: the cost of recruiting a new customer is usually five times higher than the cost of retaining an acquired customer. It is far better to work to keep one's customers than to recruit new ones to replace those who leave. Thus, according to Jacques-Antoine Granjon, founder of Vente-privee.com, the treatment of customer dissatisfaction should be considered not only as a cost but as an investment.
c) To motivate teams: to clearly mark the importance of customer satisfaction, some companies have introduced a variable component of pay for some employees, calculated on the basis of indicators related to customer satisfaction.
d) To facilitate customer contacts: there are 5 types of communication channels:
 Telephone: availability (24/7), time savings;
 Face to face: immediate response, human contact;
 E-mail: traceability (written proof);
 Website: simplicity;
 Postal mail.
e) To anticipate dissatisfaction: whatever the quality of claims processing, it may be better to get ahead of the claim and make a gesture to customers who have had a bad product experience, or where this risk exists, without waiting for claims to occur.
f) To measure customer satisfaction (evaluate to improve): today it is essential to regularly assess how well the final goal of customer satisfaction is achieved, for example by sending all customers who have experienced dissatisfaction, after the close of the case, a satisfaction survey designed by the customer service that measures the accessibility of the service, the reception, and the understanding and treatment of the dissatisfaction.
g) To reach out to customers on the Internet: the benefit may also be provided on the Internet by another customer or a social network (Twitter, Facebook, etc.). Make social media a true extension of customer service, with employees able to participate in discussions and respond directly to customer requests on these media.
3. EVALUATION AND VALIDATION OF THE APPROACH
Consider the case of a service company that manages the work of large potential clients such as "France Gas". The latter signed a contract with the host company specifying the clauses that must be respected, among them the rate of customer satisfaction, which should reach 92%; this percentage was established by agreement between the two parties, and if it is not met, a penalty is applied for customer dissatisfaction. A development team of the host company handles the realization of applications for "France Gas". This team should produce 22 applications monthly, and the dissatisfaction rate should not exceed 8% (2 applications per month). Client dissatisfaction is due to the following causes:
 The application does not answer the need or generates unexpected errors after delivery
 Timeout
To avoid these situations, companies have an interest in implementing a continuous improvement process whose ultimate goal is the elimination of all forms of waste, such as customer dissatisfaction. The problem to be solved is, for a period Pn, to maximize the number of satisfied customers. To evaluate the approach, we test it on a sample for evaluation and validation.
We start by stating our statistical hypotheses (H0 and H1):
 The first, the null hypothesis, noted H0: "Qr = Qp", where
Qr is the desired proportion of customer satisfaction and
Qp is the real percentage of satisfaction.
 The second, the alternative hypothesis, H1: "Qp < Qr".
3.1 Before the approach
3.1.1 Example 1: April 2013
The team was able to process only 10 simple applications. The customer sent feedback presenting his degree of satisfaction. There are 3 kinds of response: S (Satisfied), NS (Not Satisfied), N (Neutral).
Table 1. Customer's feedback of April 2013
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 PipRep 2.0 FR | NS | timeout
2 Contextor 2.8 FR | NS | timeout
3 Contextor 2.2.3 | S | ------
4 Hermes Horizon | S | ------
5 Agent SSR 2011 | NS | Application does not work correctly
6 Plugin SSR 2011 | NS | Application does not work correctly
7 Agent Altiris 2011 | S | ------
8 GECO 1.17.3 FR | NS | timeout
9 Nexthink collector | S | ------
10 Cosmocom 4 FR 1.0 | S | ------
Once the feedback is received, we calculate the monthly satisfaction percentages, as shown in the following table:
Table 2. Satisfaction rates of April 2013
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 5 | 50%
NS (unsatisfied) | 4 | 40%
N (neutral) | 1 | 10%
The table above can be modeled by the following figure:
Figure 3. Customer satisfaction of April 2013
PS(t0) = P(Xt0 = S) = 0.5
PNS(t0) = P(Xt0 = NS) = 0.4
PN(t0) = P(Xt0 = N) = 0.1
As Qr = 92%, and the hypotheses are H0: "Qr = Qp" and H1: "Qp < Qr", we use here a one-tailed (left) test. If
(Qp − Qr) / sqrt(Qr(1 − Qr)/n) > −tα
then we accept the hypothesis H0 and reject H1 with error risk α = 5%.
tα is calculated using the table of the normal distribution:
P(−tα ≤ T ≤ tα) = 1 − α = 0.95 => tα = 1.645 using the normal distribution table, and tα = 1.833 using the Student distribution table.
We have Qr = 92% and, from the example, f = 50%:
(f − Qr) / sqrt(Qr(1 − Qr)/n) = (0.5 − 0.92) / sqrt(0.92(1 − 0.92)/10) = −0.42/0.0857 = −4.9 < −1.645
So we accept the hypothesis H1: "Qp < Qr" and reject H0: "Qr = Qp" with error risk α = 5%. The observed difference is significant.
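The test in Example 1 can be reproduced with a few lines of Python. The helper below is an illustration, not part of the paper; the threshold 1.645 is the one-tailed normal value for α = 5% quoted above.

```python
from math import sqrt

def one_tailed_proportion_test(f, q_r, n, t_alpha=1.645):
    """Left-tailed test of H0: Qp = Qr against H1: Qp < Qr.

    f: observed satisfaction rate, q_r: target rate, n: sample size.
    Returns the test statistic and True when H0 is rejected at the
    level implied by t_alpha (1.645 for alpha = 5%).
    """
    z = (f - q_r) / sqrt(q_r * (1 - q_r) / n)
    return z, z < -t_alpha  # reject H0 when z falls below -t_alpha

# April 2013: f = 0.5, Qr = 0.92, n = 10
z, reject = one_tailed_proportion_test(0.5, 0.92, 10)
print(round(z, 1), reject)  # -4.9 True
```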
3.2 After the approach
3.2.1 Example 2: December 2013
The team treated 22 applications, as shown in the following table:
Table 3. Customer's feedback of December 2013
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 MSC_CASP69 | NS | timeout
2 MSC_MDX | NS | timeout
3 Woodmac | S | ------
4 Whoswho | S | ------
5 Adobe Air Installer | S | ------
6 WinZip | S | ------
7 MSC_SetupDemdet | S | ------
8 Jabber | S | ------
9 TrendMicro_Office | S | ------
10 ORG+ | S | ------
11 QlikView | S | ------
12 Q4-Engica | N | ------
13 TMS | N | ------
14 MSCLink_Core | S | ------
15 MIPS | S | ------
16 Rsclientprint | NS | Application does not work correctly
17 TextPad | S | ------
18 MSC_DMX | S | ------
19 MSC_MSCOMCT2 | NS | timeout
20 Add-in Excel | S | ------
21 Pre-req Excel | S | ------
22 Ios | S | ------
We calculate the monthly satisfaction percentages, as shown in the following table:
Table 4. Satisfaction rates of December 2013
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 16 | 72.72%
NS (unsatisfied) | 4 | 18.18%
N (neutral) | 2 | 9.09%
The table above can be modeled by the following figure:
Figure 4. Customer satisfaction of December 2013
PS(t0) = P(Xt0 = S) = 0.727
PNS(t0) = P(Xt0 = NS) = 0.181
PN(t0) = P(Xt0 = N) = 0.091
We have Qr = 92% and, from the example, f = 72%:
(f − P0) / sqrt(P0(1 − P0)/n) = (0.72 − 0.92) / sqrt(0.92(1 − 0.92)/22) = −0.2/0.182 = −1.09 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between f and P0 is due to sampling fluctuations.
3.2.2 Example 3: January 2014
The team treated 21 applications, as shown in the following table:
Table 5. Customer's feedback of January 2014
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 Windows6.1-KB2574819 | S | ------
2 MigrationAssistantTool | NS | The installation must be silent
3 See Electrical Viewer 4 | S | ------
4 Adobe_Flash_Player | S | ------
5 MSC_DEPOT | S | ------
6 Colibri 2.0 | S | ------
7 Navision | S | ------
8 OFFICE 2013 | S | ------
9 Windows6.1-KB2592687 | S | ------
10 CheckPoint VPN | S | ------
11 Interlink_MSCLink | S | ------
12 CrystalReportsRuntime | N | ------
13 InterlinkComponentOne | S | ------
14 MSXML | S | ------
15 VisualC++Redistributable | S | ------
16 ReportViewer_2010 | NS | Application does not work correctly
17 .Net_Framework | S | ------
18 MSCLink_Core | S | ------
19 MSCLink_Configuration | NS | timeout
20 LDOC | S | ------
21 MigrationAssistantTool | S | ------
We calculate the monthly satisfaction percentages, as shown in the following table:
Table 6. Satisfaction rates of January 2014
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 17 | 80.95%
NS (unsatisfied) | 3 | 14.28%
N (neutral) | 1 | 4.76%
The table above can be modeled by the following figure:
Figure 5. Customer satisfaction of January 2014
PS(t0) = P(Xt0 = S) = 0.81
PNS(t0) = P(Xt0 = NS) = 0.14
PN(t0) = P(Xt0 = N) = 0.05
We have Qr = 92% and, from the example, f = 80%:
(f − P0) / sqrt(P0(1 − P0)/n) = (0.8 − 0.92) / sqrt(0.92(1 − 0.92)/21) = −0.12/0.187 = −0.64 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between f and P0 is due to sampling fluctuations.
4. TEST OF HOMOGENEITY
We are faced with two samples for which it is usually not known whether they come from the same source population. We seek to test whether these samples share the same characteristic ℓ. Two values ℓ1 and ℓ2 are observed; the difference between them may be due either to sampling fluctuations or to a difference between the characteristics of the two original populations. That is to say, from the examination of two samples of sizes n1 and n2, extracted respectively from populations P1(M1; α1) and P2(M2; α2), these tests are used to decide between:
H0 = "ℓ1 = ℓ2" (we conclude homogeneity)
H1 = "ℓ1 ≠ ℓ2" (we conclude heterogeneity).
In our case we test the homogeneity of 2 proportions:
f1 = proportion of units having the character X in sample 1;
f2 = proportion of units having the character X in sample 2;
p1 = proportion of units having the character X in population 1;
p2 = proportion of units having the character X in population 2.
H0 = "P1 = P2 = P" and H1 = "P1 ≠ P2".
P is replaced by the estimator f = (n1 f1 + n2 f2)/(n1 + n2) = (22 × 0.72 + 21 × 0.81)/(22 + 21) = 0.764
x = (0.81 − 0.72) / sqrt(0.764 × 0.24 × (1/22 + 1/21)) = 0.02 > −1.645
So we conclude the homogeneity of the proposed solution. The population is homogeneous, and the observed difference is not significant; it is due to sampling fluctuations.
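The pooled estimate above can be checked with a short Python sketch. It uses the standard pooled two-proportion statistic, with f(1 − f) under the square root; it is an illustration only, and intermediate rounding may differ from the values printed in the text.

```python
from math import sqrt

def pooled_two_proportion_stat(f1, n1, f2, n2):
    """Pooled estimator f and test statistic for H0: p1 = p2."""
    f = (n1 * f1 + n2 * f2) / (n1 + n2)          # pooled proportion
    se = sqrt(f * (1 - f) * (1 / n1 + 1 / n2))   # pooled standard error
    return f, (f2 - f1) / se

# December 2013 (f1 = 0.72, n1 = 22) vs January 2014 (f2 = 0.81, n2 = 21)
f, x = pooled_two_proportion_stat(0.72, 22, 0.81, 21)
print(round(f, 3))  # 0.764
```

The statistic x stays well inside the acceptance region, consistent with the conclusion of homogeneity.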
5. CONCLUSIONS
The work presented develops a practical and pragmatic approach to maximize customer satisfaction in an organization over a given period. An approach has been proposed, and its evaluation and validation are described above. This work opens the way, in our view, to diverse research perspectives situated on two planes: deepening the realized research and extending the research domain. In terms of deepening the proposed work, it would be interesting first to use Markov chains to model the proposed approach statistically, and to propose or develop practical tools for implementing it. As for extending the research domain, it would be interesting to connect this approach to the governance of information systems and to drive the decision-making system, which consists of investigating options and comparing them in order to choose an action that helps in making decisions.
REFERENCES
[1] BUFFA, Elwood. Operations Management, 3rd Ed., NY, John Wiley & Sons, 1972.
[2] FRITZSIMMONS, James A. and Mona J. FRITZSIMMONS. Service Management: Operations, Strategy and Information Technology, 3rd Ed., NY, Irwin/McGraw-Hill, 2001.
[3] Z. Adhiri, S. Arezki, A. Namir. What is Application LifeCycle Management?, International Journal of Research and Reviews in Applicable Mathematics and Computer Science, ISSN: 2249-8931, December 2011.
[4] http://hal.archives-ouvertes.fr/docs/00/71/95/35/PDF/2010CLF10335.pdf
[5] STEVENSON, William J. Introduction to Management Science, 2nd Ed., Burr Ridge, IL, Richard D. Irwin, 1992.
[6] HILLIER, Frederick S., Mark S. HILLIER and Gerald J. LIEBERMAN. Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, New York, Irwin/McGraw-Hill, 2000.
[7] A. EL KEBBAJ and A. NAMIR. Modeling Customer's Satisfaction. Day of Science Engineers, Faculty of Science Ben M'Sik, Casablanca, July 29, 2013.
[8] http://www.projectsmart.co.uk/docs/chaos-report.pdf
[9] http://info.informatique.entreprise.over-blog.com/article-approche-du-systeme-d-information-dans-l-entreprise-69885381.html
[10] http://www.hamadiche.com/Cours/Stat/Cours5.pdf
[11] S. ARIZKI. ITGovA: Proposition of a New Approach to the Governance of Information Systems. PhD in Computer Science, defended at the Faculty of Sciences Ben M'Sik, Casablanca, 24/02/2013.
This paper may be cited as:
El Kebbaj, A. and Namir, A., 2014. An Approach for Customer Satisfaction: Evaluation and Validation. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 79-91.
Spam Detection in Twitter - A Review
C. Divya Gowri and Professor V. Mohanraj
Sona College of Technology, Salem
ABSTRACT
Social networking sites have become popular in recent years; among these sites, Twitter is one of the fastest growing. It plays the dual role of an Online Social Network (OSN) and a micro-blogging service. Spammers invade Twitter trending topics (popular topics discussed by Twitter users) to pollute the useful content. Social spamming is more successful than email spamming because it exploits the social relationships between users. Spam detection is important because Twitter is widely used for commercial advertisement, and spammers invade users' private information and damage their reputation. Spammers can be detected using content-based and user-based attributes. Traditional classifiers are required for spam detection. This paper focuses on the study of detecting spam in Twitter.
Keywords: Social Network Security, Spam Detection, Classification, Content-based Detection.
1. INTRODUCTION
Web-based social networking services connect people to share interests and activities across political, economic, and geographic borders. Online social networking sites like Twitter, Facebook, and MySpace have become popular in recent years. They allow users to meet new people, stay in touch with friends, and discuss everything including jokes, politics, news, etc. Using social networking sites, marketers can directly reach customers; this benefits not only the marketers but also the users, as they get more information about the organization and the product. Twitter [1] is one of these social networking sites. Twitter provides a micro-blogging service (the exchange of small elements of content such as short sentences, individual images, or video links) where users can post their messages, called tweets. A tweet is limited to 140 characters; only HTTP links and text are allowed. A Twitter user is identified by a user name and optionally by a real name. When user 'A' starts following other users, their tweets appear on A's page. User A can be followed back if the other user desires. Trending topics in Twitter are identified with hashtags ('#'). When a user likes a tweet, he/she can 'retweet' that message. Tweets are visible publicly by default, but senders can deliver a message only to their
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 93
followers. The „@‟ sign followed by username is a reply to other user. The
most common type of spamming in Twitter is through Tweets. Sometimes it
is via posting suspicious links.
Spam [14] can arrive in the form of direct tweets to your Twitter inbox.
Unfortunately, spammers use Twitter as a tool to post malicious links and
send spam messages to legitimate users. They also spread viruses or simply
damage the system's reputation. Twitter is widely used for commercial
advertisement, and spammers invade users' private information and damage
their reputation. Attackers advertise on Twitter, offering huge discounts
and free products; when users try to purchase these products, they are asked
to provide account information, which the attackers harvest and misuse.
Spam detection in any social networking site is therefore important.
2. RELATED WORKS
McCord et al. [1] propose user-based and content-based features to
facilitate spam detection.
User Based Features
The user-based features considered are the number of friends, the number of
followers, user behavior (e.g. the time periods and frequencies at which a
user tweets), and the reputation of the user (based on followers and
friends). The reputation of user j is given by

R(j) = n_i(j) / (n_i(j) + n_o(j))          (2.1)

where n_i(j) is the number of followers of user j and n_o(j) is the number
of friends user j has. According to the Twitter spam and abuse policy, 'if
the user has a small number of followers compared to the number of people
the user is following, then it may be considered a spam account'. Spammers
tend to be most active during the early morning hours, when regular users
tweet much less.
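The reputation measure in equation (2.1) and the follower/following heuristic can be sketched in a few lines of Python (an illustrative sketch only; the function names and the 0.5 cut-off are our own assumptions, not taken from [1]):

```python
def reputation(followers: int, friends: int) -> float:
    """R(j) = n_i(j) / (n_i(j) + n_o(j)), per equation (2.1).

    followers = n_i(j) (incoming links), friends = n_o(j) (outgoing links).
    """
    total = followers + friends
    return followers / total if total else 0.0

def looks_like_spam_account(followers: int, friends: int,
                            threshold: float = 0.5) -> bool:
    # Heuristic from the Twitter spam policy quoted above: far fewer
    # followers than followings suggests a spam account.  The threshold
    # value is an illustrative assumption.
    return reputation(followers, friends) < threshold
```

An account following 990 users but followed by only 10 has reputation 0.01 and would be flagged, while a balanced account (reputation 0.5) would not.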
Content Based Features
The content-based features [11] considered in this approach are the number
of Uniform Resource Locators (URLs), replies/mentions, keywords/word
weight, retweets, and hash tags. A retweet is a reposting of someone's
post; it is like a normal post carrying the original author's name, and it
shares the entire tweet with all of one's followers. Tweets containing '#'
mark the popular topics being discussed by users.
Secondly, they compare four traditional classifiers, namely Random Forest,
Support Vector Machine (SVM), Naïve Bayesian, and K-nearest neighbor, for
detecting spammers. Among these, Random Forest is found to be the most
effective, although it was evaluated on an imbalanced data set (a data set
with more regular users than spammers). Alex Hai Wang [2] considers the
follower-friend relationship and models a directed social graph, using
content-based and graph-based features to facilitate spam detection.
Graph Based Features
A social graph is modeled as a directed graph G = (V, A), where V is the
set of nodes representing user accounts and A is the set of arcs connecting
them. An arc a = (i, j) indicates that user i follows user j. Followers are
the incoming links (in-links) of a node, i.e. people following you, whom
you need not follow back. Friends are the outgoing links (out-links), i.e.
people you are following. A mutual friend is a follower and a friend at the
same time. When there is no connection between two users, they are
considered strangers.
Fig 2.1 A Simple Twitter Graph
In the figure above, user A follows user B, while users B and C follow each
other; thus B and C are mutual friends, and A and C are strangers. The
graph-based features considered are the number of followers, the number of
friends, and the reputation of a user.
The classifier used in this paper to detect spam is the Naïve Bayesian
classifier [10]. It is based on Bayes' theorem:

P(Y|X) = P(X|Y) P(Y) / P(X)          (2.2)

A Twitter account is represented as a feature vector X, and each account is
assigned one of two classes Y, spam or non-spam, under the assumption that
the features are conditionally independent. This classifier is easy to
implement and requires only a small training data set, but the
conditional-independence assumption may cost accuracy, since the classifier
cannot model dependencies between features.
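A minimal sketch of such a Naïve Bayesian text classifier, assuming bag-of-words features and add-one (Laplace) smoothing (the training data, function names, and smoothing choice below are illustrative assumptions, not details from [2]):

```python
from collections import Counter
import math

def train(tweets, labels):
    """Count word frequencies per class; the counts estimate P(x_i | Y)."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(tweets, labels):
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Pick argmax_Y of log P(Y) + sum_i log P(x_i | Y): equation (2.2)
    up to the constant P(X), with Laplace smoothing for unseen words."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Note that the product of per-word likelihoods is exactly where the conditional-independence assumption enters: correlations between words are ignored.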
Twitter Account Features
Zi Chu et al. [13] review classification features for detecting spammers,
including tweet-level and account-level features. The tweet-level features
include the spam content proportion, i.e. the tweet text is checked against
a spam word list and the final landing URL is checked. The account-level
features include the account profile, i.e. the short self-description text
and homepage URL, which are checked for spam words.
Fabricio Benevenuto et al. [3] consider the problem of detecting spammers.
In their study approximately 96% of legitimate users and 70% of spammers
were correctly classified. As in [1], user-based and content-based
attributes are considered. To measure detection accuracy, a confusion
matrix is introduced.
Fig 2.2 An Example of a Confusion Matrix
In the matrix, 'a' is the number of spam accounts correctly classified, 'b'
is the number of spam accounts wrongly classified as non-spam, 'c' is the
number of non-spam accounts wrongly classified as spam, and 'd' is the
number of non-spam accounts correctly classified. For effective evaluation
of the classification, metrics such as precision, recall, and F-measure
(Micro-F1, Macro-F1) are considered.
Evaluation Metrics
Precision: the ratio of the number of spam accounts classified correctly to
the total number of accounts predicted as spam:

Precision, p = a / (a + c)          (2.3)

Recall: the ratio of the number of spam accounts classified correctly to
the total number of actual spam accounts:

Recall, r = a / (a + b)          (2.4)

F-measure: the harmonic mean of precision and recall:

F-measure = 2pr / (p + r)          (2.5)
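The three metrics follow directly from the confusion-matrix cells of Fig 2.2; a small Python sketch (the function name and sample counts are our own, chosen to echo the roughly 70% spammer recall reported in [3]):

```python
def metrics(a, b, c, d):
    """a: spam correctly classified, b: spam misclassified as non-spam,
    c: non-spam misclassified as spam, d: non-spam correctly classified,
    following the layout of the confusion matrix in Fig 2.2."""
    precision = a / (a + c)                                    # (2.3)
    recall = a / (a + b)                                       # (2.4)
    f_measure = 2 * precision * recall / (precision + recall)  # (2.5)
    return precision, recall, f_measure
```

For example, 100 real spammers of whom 70 are caught (a=70, b=30) with 4 false alarms (c=4, d=96) gives recall 0.70 and precision about 0.95.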
The classifier used to detect spam is the SVM, a state-of-the-art
classification method; this approach uses a non-linear SVM with the Radial
Basis Function (RBF) kernel, which allows the SVM to fit complex decision
boundaries. The biggest limitations of the support vector approach are the
choice of kernel and the high algorithmic complexity. This approach focuses
on detecting spam rather than spammers, so that it can be used for
filtering spam: once a spammer is detected it is easy to suspend that
account and block the IP address, but spammers simply continue their work
from new accounts.
Puneeta Sharma and Sampat Biswas [4] propose two key components: (1)
identifying the timestamp gap between two successive tweets and (2)
identifying tweet content similarity. They found two common techniques used
by spammers: (1) posting duplicate content with small modifications to the
tweet, and (2) posting spam within short intervals. Their spam
identification approach includes BOT activity detection and a tweet
similarity index. Twitter data can be filtered in various ways, e.g. by
user id or by keyword. Many spammers post spam messages using a BOT (a
computer program), which reduces the time between consecutive tweets. To
calculate the timestamp gap between tweets, they first cluster tweets by
user id and sort them by increasing timestamp.
Fig 2.3 BOT activity detection (flowchart: cluster tweets by user id,
calculate the time difference between consecutive tweets; a gap of less
than 10 seconds marks the tweets as spam, otherwise non-spam)
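The flow of Fig 2.3 can be sketched directly in Python (a minimal sketch; the data layout, function name, and the choice to flag whole users rather than individual tweets are our own assumptions, while the 10-second cut-off is the one shown in the figure):

```python
from itertools import groupby

def detect_bot_activity(tweets, gap_seconds=10):
    """tweets: list of (user_id, unix_timestamp) pairs.

    Flags a user as a suspected BOT when any two consecutive tweets are
    posted within gap_seconds of each other (10 s, per Fig 2.3)."""
    suspected = set()
    tweets = sorted(tweets)  # cluster by user id, then by timestamp
    for user, group in groupby(tweets, key=lambda t: t[0]):
        times = [ts for _, ts in group]
        if any(b - a < gap_seconds for a, b in zip(times, times[1:])):
            suspected.add(user)
    return suspected
```

Sorting the (user, timestamp) pairs performs both flowchart steps at once: tweets are grouped per user and ordered in time, so consecutive differences are the timestamp gaps.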
Spammers can be classified as (1) desperate spammers and (2) sophisticated
spammers. Desperate spammers use automated programs to post multiple tweets
with small time differences between posts. Sophisticated spammers introduce
a time gap between tweets. Spammers mostly post duplicate tweets in
trending topics, e.g. by jumbling the words of a tweet, using a fixed set
of words, including numbers in the topic, or appending commercial
advertisements to the topic. The tweet similarity index approach detects
this behavior of spammers and filters spam.
They first cluster tweets by user id and then process each user's set of
tweets independently. They create buckets of similar tweets by calculating
the Jaccard and Levenshtein similarity coefficients. As a result, the most
similar tweets end up in the same bucket, yielding clusters of similar
text. Once all tweets are collected, they check the size of each bucket; if
it is greater than one, the bucket is considered spam.
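A greedy sketch of the bucketing step for one user's tweets, using word-set Jaccard similarity (the 0.7 similarity threshold, the greedy first-fit strategy, and the function name are illustrative assumptions, not the exact procedure of [4]):

```python
def bucket_similar_tweets(tweets, threshold=0.7):
    """Each tweet joins the first bucket whose representative tweet is
    Jaccard-similar above `threshold`; otherwise it opens a new bucket.
    Returns only the buckets of size > 1, treated as spam clusters."""
    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    buckets = []
    for t in tweets:
        for bucket in buckets:
            if jaccard(t, bucket[0]) >= threshold:
                bucket.append(t)
                break
        else:
            buckets.append([t])
    # Buckets holding more than one tweet contain near-duplicates.
    return [b for b in buckets if len(b) > 1]
```

Near-duplicate tweets (e.g. the same advertisement with one word appended) land in the same bucket, reproducing the "bucket size > 1 implies spam" rule of the paper.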
Fig 2.4 Tweet Similarity Index (flowchart: cluster tweets by user id,
calculate Jaccard and Levenshtein distances to create buckets of similar
tweets; a bucket of size greater than one is flagged as spam, otherwise
non-spam)
Levenshtein Distance
The Levenshtein distance is a string metric for measuring the difference
between two sequences of text. Informally, the Levenshtein distance between
two words is the minimum number of single-character edits (insertions,
deletions, and substitutions) required to change one word into the other.
The term edit distance is often used to refer to the Levenshtein distance.
The distance is zero if the strings are equal. For example, the Levenshtein
distance between "sitter" and "sitting" is 3:
sitter → sittir (substitution of "i" for "e")
sittir → sittin (substitution of "n" for "r")
sittin → sitting (insertion of "g" at the end)
The Levenshtein distance is used to find duplicate tweets: if two tweets
are duplicates, their distance is zero.
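The distance is computed with the classic two-row dynamic-programming algorithm; a compact Python version (the function name is our own):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution
        prev = curr
    return prev[-1]
```

Running it on the example above confirms that "sitter" and "sitting" are three edits apart, while identical tweets score zero.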
Jaccard Index
The Jaccard index, also called the Jaccard similarity coefficient, is used
for comparing the similarity and diversity of sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|          (2.6)

The Jaccard distance measures dissimilarity between sample sets and is
obtained by subtracting the Jaccard coefficient from 1:

d_J(A, B) = 1 − J(A, B)          (2.7)
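Equations (2.6) and (2.7) translate directly into set operations (the convention of returning 1.0 for two empty sets is our own choice to avoid division by zero):

```python
def jaccard_index(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, per equation (2.6)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """d_J(A, B) = 1 - J(A, B), per equation (2.7)."""
    return 1.0 - jaccard_index(a, b)
```

For tweet comparison the sets would typically be the sets of words in each tweet.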
Dolvara Gunatilaka [9] discusses two privacy issues. The first is the
user's identity, or user anonymity; the second is leakage of the user's
profile or personal information.
User Anonymity
In many social networking sites users use their real names for their
accounts. Two attacks can expose a user's anonymity: (1) the
de-anonymization attack and (2) the neighborhood attack [15]. In the first,
the user's anonymity is revealed through history stealing and group
membership information; in the second, the attacker identifies the
neighbors of the victim's node. Attackers are attracted by the personal
information in a user's profile, such as name, date of birth, contact
information, relationship status, current work, and educational background.
Information can leak because of poor privacy settings: many profiles are
made public, so anyone can view them. Information can also leak through
third-party applications. Social networking sites provide an Application
Programming Interface (API) for third-party developers to create
applications; once users access these applications, the third party can
access their information automatically.
Social Worms
The paper also discusses several social worms, among which the Twitter worm
is one of the most popular. 'Twitter worm' is a term for worms that spread
through Twitter. There are many versions; two worms discussed in the paper
are:
Profile Spy worm: This worm spreads by posting a link that downloads a fake
third-party application called "Profile Spy". When users try to download
the application, they must fill in personal information, which allows the
attacker to obtain it. Once an account is infected, it continuously tweets
malicious messages to its followers.
Google worm: This worm uses a shortened Google URL that tricks users into
clicking the link. The fake link redirects users to a fake anti-virus
website, which displays a warning that the computer is infected and lets
the user download the fake antivirus, which is actually malicious code.
Sender Receiver Relationship
Jonghyuk Song et al. [7] propose a spam filtering technique based on the
sender-receiver relationship. The paper addresses two problems in detecting
spam: first, account features can be fabricated by spammers; second,
account features cannot be collected until a number of malicious messages
have been reported from an account. Their spam filter therefore does not
consider account features; instead it uses relational features, namely the
connectivity and the distance between the sender and receiver, which are
difficult for spammers to manipulate. Since Twitter limits tweets to 140
characters, spammers cannot put much information into them, so they resort
to posting URLs that lead to spam. Messages are classified as spam based on
the sender; content filtering is not effective in Twitter because tweets
contain only a small amount of text.
Restrictions in Twitter
Some of the restrictions in Twitter [9] are:
a. Following a large number of users in a short time.
b. Unfollowing and following someone repeatedly.
c. A small number of followers compared to the number of accounts followed.
d. Duplicate tweets or updates.
e. Updates consisting only of links.
The distance between two users is calculated as follows [5][6]: when two
users are directly connected by an edge, the distance is one, meaning the
two users are friends; when the distance is greater than one, they have
common friends but are not friends themselves. Connectivity represents the
strength of the relationship and is measured by counting the number of
paths between the two users; the connectivity between a spammer and a
legitimate user is therefore weak. The problem with this system is that it
identifies messages as normal if they come from infected friends, and
attackers may send spam messages from legitimate accounts by stealing
passwords.
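The two relational features can be sketched on a small adjacency-list graph (a minimal sketch under our own assumptions: the graph representation, function names, and the two-hop limit on path counting are illustrative, not details from [7]):

```python
from collections import deque

def shortest_distance(graph, src, dst):
    """BFS hop count in a friendship graph given as an adjacency dict.
    Distance 1 means the users are friends; distance 2 means they share
    common friends but are not friends themselves."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the users are not connected at all

def count_paths(graph, src, dst, max_len=2):
    """Connectivity sketch: number of paths of up to max_len hops.
    More paths between two users means a stronger relationship."""
    if src == dst:
        return 1
    if max_len == 0:
        return 0
    return sum(count_paths(graph, nxt, dst, max_len - 1)
               for nxt in graph.get(src, ()) if nxt != src)
```

In a graph where user "a" reaches "d" through both "b" and "c", the distance is 2 and the connectivity (path count) is 2, whereas a spammer would typically have few or no paths to a target user.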
D. Karthika Renuka and T. Hamsapriya [8] address unsolicited email, also
called spam, one of the fastest growing problems associated with the
Internet. Among the many proposed techniques, Bayesian filtering is
considered an effective one; it works on the probability of words occurring
in spam and legitimate mails. Many spam detection systems use keywords to
detect spam mails, but misspellings arise, so the keyword blacklist must be
constantly updated, which is difficult. For this purpose a word stemming or
hashing technique is proposed, which improves the efficiency of the
content-based filter; such filters are useless if they do not understand
the meaning of the words. Two techniques are employed to find spam content:
Content-based spam filter [10]: This filter works on the words and phrases
of the email content, associating each word with a numeric value; if this
value crosses a certain threshold the mail is considered spam. It can
detect only valid words with correct spellings, and it uses Bayes' theorem
to detect spam content.
Word stemming or word hashing technique [12]: This filter extracts the stem
of a modified word so that the efficiency of detecting spam content is
improved. A rule-based word stemming algorithm is used for spam detection.
Stemming converts a word into a related base form, for example converting
plurals into singulars or removing suffixes.
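A toy rule-based suffix stripper in the spirit of the stemming step described above (the suffix list and minimum-stem-length rule are our own illustrative choices; this is not the full Porter algorithm or the exact rules of [8]):

```python
def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of at least
    three characters, so variants map to one blacklist entry."""
    word = word.lower()
    for suffix in ("ingly", "edly", "ing", "ies", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

With such a rule, "products", "clicked", and "click" all reduce toward a common stem, so a spam word list needs only one entry per word family.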
3. CONCLUSIONS
Spammers are a major problem in any online social networking site. Once a
spammer is detected it is easy to suspend the account or block the IP
address, but spammers then spread spam from other accounts or IP addresses.
It is therefore recommended to check tweets for spam content on the server:
if any content matches the spam words in the data set, the tweet is
prevented from being displayed, and the accuracy of classifying the spam
content is evaluated. Many traditional classifiers exist for separating
spammers from legitimate users, but many of them wrongly classify
non-spammers as spammers. Hence it is more effective to check for spam
content in the tweets themselves.
REFERENCES
[1] M. McCord, M. Chuah, "Spam Detection on Twitter Using Traditional Classifiers". Lecture Notes in Computer Science, Volume 6906, pp. 175-186, September 2011.
[2] A. H. Wang, "Don't Follow Me: Spam Detection in Twitter". Proceedings of the 5th International Conference on Security and Cryptography (SECRYPT), July 2010.
[3] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida, "Detecting Spammers on Twitter". CEAS 2010: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 2010.
[4] Puneeta Sharma and Sampat Biswas, "Identifying Spam in Twitter Trending Topics". American Association for Artificial Intelligence, 2011.
[5] "Levenshtein distance", http://en.wikipedia.org/wiki/Levenshtein_distance.
[6] "Jaccard index", http://en.wikipedia.org/wiki/Jaccard_index.
[7] Jonghyuk Song, Sangho Lee and Jong Kim, "Spam Filtering in Twitter Using Sender-Receiver Relationship". Recent Advances in Intrusion Detection, Lecture Notes in Computer Science, Volume 6961, pp. 301-317, 2011.
[8] D. Karthika Renuka, T. Hamsapriya, "Email Classification for Spam Detection Using Word Stemming". International Journal of Computer Applications, 1(5), pp. 45-47, February 2010.
[9] "Reporting spam on Twitter", http://support.twitter.com/articles/64986-reporting-spam-on-twitter.
[10] S. L. Ting, W. H. Ip, Albert H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?". International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
[11] R. Malarvizhi, K. Saraswathi, "Content-Based Spam Filtering and Detection Algorithms - An Efficient Analysis & Comparison". International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, September 2013.
[12] N. S. Kumar, D. P. Rana, R. G. Mehta, "Detecting E-mail Spam Using Spam Word Associations". International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[13] Zi Chu, Indra Widjaja, Haining Wang, "Detecting Social Spam Campaigns on Twitter". Lecture Notes in Computer Science, Volume 7341, pp. 455-472, 2012.
[14] Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang, "@spam: The Underground on 140 Characters or Less". Proceedings of the 17th ACM Conference on Computer and Communications Security, ACM, New York, NY, USA, 2010.
[15] Bin Zhou and Jian Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks". Data Engineering, IEEE 24th International Conference, April 2008.
This paper may be cited as:
Gowri, C. D. and Mohanraj, V., 2014. Spam Detection in Twitter - A
Review. International Journal of Computer Science and Business
Informatics, Vol. 14, No. 1, pp. 92-102.

Vol 14 No 1 - July 2014

  • 1.
    ISSN: 1694-2507 (Print) ISSN:1694-2108 (Online) International Journal of Computer Science and Business Informatics (IJCSBI.ORG) VOL 14, NO 1 JULY 2014
  • 2.
    Table of ContentsVOL 14, NO 1 JULY 2014 Symmetric Image Encryption Algorithm Using 3D Rossler System........................................................1 Vishnu G. Kamat and Madhu Sharma Node Monitoring with Fellowship Model against Black Hole Attacks in MANET.................................... 14 Rutuja Shah, M.Tech (I.T.-Networking), Lakshmi Rani, M.Tech (I.T.-Networking) and S. Sumathy, AP [SG] Load Balancing using Peers in an E-Learning Environment ...................................................................... 22 Maria Dominic and Sagayaraj Francis E-Transparency and Information Sharing in the Public Sector ................................................................ 30 Edison Lubua (PhD) A Survey of Frequent Subgraphs and Subtree Mining Methods ............................................................. 39 Hamed Dinari and Hassan Naderi A Model for Implementation of IT Service Management in Zimbabwean State Universities ................ 58 Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou Present a Way to Find Frequent Tree Patterns using Inverted Index ..................................................... 66 Saeid Tajedi and Hasan Naderi An Approach for Customer Satisfaction: Evaluation and Validation ....................................................... 79 Amina El Kebbaj and A. Namir Spam Detection in Twitter – A Review...................................................................................................... 92 C. Divya Gowri and Professor V. Mohanraj IJCSBI.ORG
  • 3.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 1 Symmetric Image Encryption Algorithm Using 3D Rossler System Vishnu G. Kamat M Tech student in Information Security and Management Department of IT, DIT University Dehradun, India Madhu Sharma Assistant Professor Department of Computer Science, DIT University Dehradun, India ABSTRACT Recently a lot of research has been done in the field of image encryption using chaotic maps. In this paper, we propose a new symmetric block cipher algorithm using the 3D Rossler system. The algorithm utilizes the approach used by Mohamed Amin et al. [Commun. Nonlinear Sci. Numer. Simulat, (2010)] and Vinod Patidar et al. [Commun Nonlinear SciNumerSimulat, (2009)]. The merits of these algorithms such as the encryption structure and the diffusion scheme respectively are combined with an approach to split the key for the three dimensions to use for encryption of color (RGB) images. The experimentation results suggest an overall better performance of the algorithm. Keywords Image Encryption, Rossler System, Block Cipher, Security Analysis. 1. INTRODUCTION Image encryption is relatively different from text encryption. Image is made up of pixels and they are highly correlated; so different approaches are followed for encryption of images [1-12]. One of the approaches is known as chaotic cryptography. In this approach, for encryption we use chaotic maps, which generate good pseudo-random numbers. Cryptographic properties of these maps such as, sensitive dependence on initial parameters, ergodic and random like behavior, make them ideal for use in designing secure cryptographic algorithms. Many scholars have proposed various chaos-based encryption schemes in recent years [4-12]. A scheme proposed by Mohamed Amin et al. [11] uses Tent map as the chaotic map and the scheme is implemented for gray scale images. 
They proposed a new approach of using the plaintext as blocks of bits rather than block of pixels. Another scheme proposed by Vinod Patidaret al.[12] uses chaotic standard and logistic maps and they introduce a way of spreading the bits using diffusion to avoid redundancy. In this paper, we propose an algorithm which utilizes the merits of the mentioned schemes. The
  • 4.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 2 algorithm uses the Rossler system for the chaotic key generation. We demonstrate a way to split the 3 dimensions of the key for the 3 image channels i.e. Red, Green and Blue. The algorithm in [11] is used as a base structure and the diffusion concept from [12] is used to spread the effect of adding the key. The symmetric Feistel structure, diffusion method and key splitting of the encryption scheme provide better results. The rest of the paper is organized as follows: Section 2 provides a brief overview of the Rossler system. Section 3 provides the algorithmic details. The results of the security analysis are shown in section 4. Lastly, Section 5 concludes the paper. 2. BRIEF OVERVIEW OF 3D ROSSLER SYSTEM Rossler system is a system of non-linear differential equations which has chaotic properties [13]. Otto Rossler defined these equations in 1976. The equations are as given below Xn+1 = -Yn-Zn Yn+1 = Xn + αYn (1) Zn+1 = β + Zn (Xn-γ) where, α, β and γ are real parameters. Rossler system's behavior is dependent on the values of the parameters α, β and γ. For different values of these parameters the system displays considerable changes. It may be chaotic, converge toward a fixed point, follow a periodic orbit or escape towards infinity. The Rossler system displays chaotic behavior for the values of α=0.432, β=2 and γ=4. The chaotic behavior refers to the fact that keeping the parameters constant, even a slight change in the initial value would bring a significant change in the subsequent values. For example the value of Z0 = 0.3 generates the value of Z1 = 0.5. After changing the value of Z0to0.6it generates the value of Z1 = -1. The same chaotic rule applies for the changes of other two dimensions (X and Y). This chaotic behavior is known as deterministic chaos, i.e. 
the knowledge of initial values and parameter values can help in recreating the same chaotic pattern. Hence the initial conditions have to be shared between the entities using the system for encryption/decryption process. 3. PROPOSED ALGORITHM In this section we provide details of our algorithm. The algorithm is designed to work with color images (RGB). In this scheme the plaintext (image) is taken as blocks of bits. The block size is 8w, where ‘w’ is the word size which is 32 bits. Each block of data is divided and stored into 8 w-bit registers and operations are performed on them. The key length
  • 5.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 3 depends on the number of rounds ‘r’ i.e. Key length is 4r+8. The number of rounds can vary from 1-255. We have taken ‘r’ to be 12 for our experimentation. The flowchart shown in Fig. 1 displays the various steps performed on the image during the encryption process. The steps are explained in the following subsections. Figure 1. Flowchart of the Encryption Scheme 3.1 Padding The processing of the image is done on block of data. 256 bits ie.32 bytes of data are encrypted/decrypted at a time using eight 32-bit registers. The image size should be a multiple of 256 bits to ensure that there is always a full block size for encryption. Hence padding is added so as to make the input block of size 32 bytes when the image size in bytes is not an integral multiple of 32. A padding of all zeros (1-31 bytes) is appended to the end of each row to make the bytes in each row a multiple of 32.
  • 6.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 4 For example if the image is of dimensions 252 x 252 pixels, a 4 byte padding of zeros is appended at the end of each row. The last byte of the image then stores the number of bytes used as padding as a pixel value i.e. 4 in this case. This pixel value is used to remove the padding after decryption. After retrieving the number of bytes padded ‘n’, all rows are checked to determine if zeros exist in all the last ‘n’ bytes and in ‘n-1’ bytes of the last row. The padding is then removed to generate the original image. 3.2 Key Generation The key is generated by the 3D chaotic Rossler system as shown in (1). The number of key bytes ‘t’ depends on the number of rounds ‘r’ i.e. t=4r+8. We use the three equations separately. The random sequence generated by each equation of the map is used as a key separately during the encryption process of the red, green and blue channel of the image respectively. The key generation concept is as shown below. The steps repeat ‘t’ number of times to generate necessary key bytes. a. Iterate Rossler system of equations (1) ‘r’ times where ‘r’ is the number of rounds. b. Use the decimal part of the X, Y, Z values to generate the key byte. Xn = abs (Xn - integer part); // decimal part of x Yn = abs (Yn - integer part); // decimal part of y Zn = abs (Zn - integer part); // decimal part of z c. Key byte for each dimension (R,G,B) is taken as X, Y, Z values respectively by mapping it to a value between 0-255. d. For the next set of key bytes the number of iterations is changed to a value obtained by performing exclusive-or on the current set of key bytes. Iterations for next key byte = XOR (Xn, Yn, Zn); 3.3 Vertical and Horizontal Diffusion The diffusion process explained in [12] is used in the algorithm. The horizontal diffusion in our algorithm is used in a slightly different way i.e. 
it is performed separately on each channel after the encryption of the channel rather than using it on the entire image. The diffusion ensures spread of the key additions for the channel. The horizontal diffusion moves in the forward direction from the first pixel of a channel to the last. The second pixel is the exclusive or of first and second pixel of a channel, the third pixel is the
  • 7.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 5 exclusive or of the new second pixel and the third pixel, and so on. Thus the first pixel of the channel remains unchanged. The Vertical Diffusion is performed before and after the entire encryption and horizontal diffusion is performed on the 3 channels of the image. In Vertical Diffusion the channels are treated collectively. The processing occurs from the last pixel of the image to the first pixel. It starts by performing XOR of the green and blue values of the last pixel of the image with the red value of the second last pixel to form the new red value of the second last pixel. The green value of the second last pixel is formed by performing XOR operation on the red and blue values of the last pixel. The blue value of the second last pixel is formed by XOR operation on the red and green values of the last pixel. This continues in the backward direction. Thus the last pixel remains unchanged. 3.4 Encryption/Decryption Scheme The encryption is performed on 256 bits (32 bytes) of data at a time using eight 32-bit registers. The algorithm is shown in Fig. 2. In the initial step four bytes of the key are added to alternate registers. 2’s compliment addition is performed. Then for ‘r’ rounds arithmetic operations are performed on the image data. It uses a function ‘f’, the output of which is used as the number of rotations to be performed on another block of data. After the swapping operation of the last round, the last four key bytes are added. The entire encryption structure is displayed in Fig. 3. For decryption the algorithm follows reverse of the encryption process. Figure 2.Encryption Algorithm for each Channel (R,G,B)
Figure 3. The Image Encryption Structure
4. EXPERIMENTATION RESULTS
We performed security analysis on six 256 x 256 color (RGB) images, shown in Fig. 4. The statistical and differential analysis tests performed display very favorable results, demonstrating the strength and security of the algorithm. Results given in [14] demonstrate how the vulnerability in [11] is overcome.

Figure 4. Plain images (clockwise from top left): Lena, Bridge, Lake, Plane, Peppers and Mandrill

4.1 Statistical Analysis
Statistical analysis is performed to determine the correlation between the plain image and the cipher image. For an encryption system to be strong, the cipher image should not be correlated with the plain image, and the cipher-image pixels should not be correlated among themselves. In this section we provide the histogram and correlation analysis.

4.1.1 Histogram Analysis
When the encrypted image and the plain image do not show a high degree of correlation, the encryption can be considered secure from information leakage. Histograms plot the number of pixels at each intensity level (0-255), displaying how the pixel values are distributed. Fig. 5 depicts the histograms of the red, green and blue channels of the plain image 'lena' on the left (top to bottom) and the corresponding histograms of the encrypted 'lena' image for the three channels on the right. They show that the encryption leaves no concentration at any single pixel value.
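The per-channel histogram counts described here take only a few lines to compute; this is a sketch, with `channel_histograms` being our own name.

```python
import numpy as np

def channel_histograms(img):
    """Count pixels at each intensity level 0-255, per RGB channel."""
    return {name: np.bincount(img[..., i].ravel(), minlength=256)
            for i, name in enumerate("RGB")}
```

A flat histogram for the cipher image, with all 256 bins roughly equally filled, is the visual signature of a secure encryption in this test.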
Figure 5. Left: histograms of the 'lena' plain image for the red, green and blue channels (top to bottom). Right: histograms of the encrypted 'lena' image for the red, green and blue channels (top to bottom).
4.1.2 Correlation of Adjacent Pixels
In a plain image the adjacent pixels show a high degree of correlation in the horizontal, vertical and diagonal directions. The encrypted image should have a very small degree of correlation among its adjacent pixels. We select 1000 random pairs of adjacent pixels from an image and compute the correlation coefficient as

$\mathrm{corr}_{xy} = \frac{C(x,y)}{\sqrt{D(x)}\,\sqrt{D(y)}}$   (2)

where

$C(x,y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - E(x))(y_i - E(y))$   (3)

$D(x) = \frac{1}{N}\sum_{i=1}^{N}(x_i - E(x))^2$   (4)

$E(x) = \frac{1}{N}\sum_{i=1}^{N} x_i$   (5)

Here x_i and y_i form the i-th pair of adjacent pixels and N is the total number of pairs. Table 1 shows the correlation coefficient values of the six plain images (Fig. 4) between horizontally, vertically and diagonally adjacent pixels. It can be noted that the adjacent pixels are highly correlated.

Table 1. Correlation Values of Plain Images

Channel  Image     Horizontal  Vertical  Diagonal
RED      Lena      0.9558      0.9781    0.9336
         Bridge    0.8680      0.9070    0.8287
         Lake      0.9234      0.9201    0.8886
         Mandrill  0.8474      0.8032    0.7944
         Peppers   0.9371      0.9392    0.9077
         Plane     0.9205      0.9092    0.8546
GREEN    Lena      0.9401      0.9695    0.9180
         Bridge    0.9055      0.9131    0.8700
         Lake      0.9354      0.9272    0.8943
         Mandrill  0.7285      0.6674    0.6487
         Peppers   0.9657      0.9673    0.9451
         Plane     0.8938      0.9174    0.8419
BLUE     Lena      0.9189      0.9495    0.8948
         Bridge    0.9354      0.9411    0.9138
         Lake      0.9377      0.9401    0.9099
         Mandrill  0.8030      0.7914    0.7625
         Peppers   0.9259      0.9330    0.8928
         Plane     0.9179      0.8912    0.8563
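The adjacent-pixel test of equations (2)-(5) can be sketched as below. This is a minimal sketch: the function and parameter names are ours, and uniform random sampling of the pair positions is an assumption (the paper only says 1000 random pairs are selected).

```python
import numpy as np

def adjacent_corr(channel, direction="horizontal", n_pairs=1000, seed=0):
    """Correlation coefficient of n_pairs randomly chosen adjacent pixel
    pairs in one channel, following equations (2)-(5)."""
    rng = np.random.default_rng(seed)
    h, w = channel.shape
    dr, dc = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}[direction]
    rows = rng.integers(0, h - dr, n_pairs)   # keep the neighbour in bounds
    cols = rng.integers(0, w - dc, n_pairs)
    x = channel[rows, cols].astype(float)
    y = channel[rows + dr, cols + dc].astype(float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))     # C(x, y)
    return cov / np.sqrt(x.var() * y.var())            # corr_xy
```

On a natural image this returns values near 1 (as in Table 1); on a well-encrypted image it should return values near 0 (as in Table 2).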
Table 2 shows the correlation coefficient values for the red, green and blue channels of the cipher images formed by encrypting the plain images with the proposed encryption algorithm. The cipher images bear very little resemblance to the original images, and their adjacent pixels in the horizontal, vertical and diagonal directions are correlated only to a very small degree.

Table 2. Correlation Values of Cipher Images

Channel  Image     Horizontal  Vertical  Diagonal
RED      Lena      -0.0014     -0.0012   0.0004
         Bridge    -0.0040     -0.0066   -0.0010
         Lake      -0.0052     -0.0011   0.0018
         Mandrill  0.0034      0.0001    0.0033
         Peppers   -0.0014     -0.0034   -0.0016
         Plane     -0.0024     -0.0043   0.0088
GREEN    Lena      0.0004      0.0067    -0.0026
         Bridge    -0.0053     -0.0017   0.0008
         Lake      0.0044      -0.0025   0.0068
         Mandrill  -0.0031     -0.0041   0.0029
         Peppers   0.0008      0.0027    0.0029
         Plane     0.0026      -0.0003   0.0014
BLUE     Lena      -0.0049     0.0014    -0.0005
         Bridge    0.0023      0.0001    0.0037
         Lake      -0.0010     -0.0044   0.0002
         Mandrill  0.0023      0.0001    -0.0014
         Peppers   -0.0016     -0.0006   0.0013
         Plane     0.0040      -0.0007   0.0041

4.1.3 Correlation between Plain and Cipher Image
The previous section considered correlation between adjacent pixels within a plain or cipher image. It is also necessary that there be no relevant correlation between the plain image and its corresponding cipher image. Rather than using pixel pairs within a single image, we pair the pixels of the plain and cipher images at the same grid position. The 2D correlation coefficients are calculated by pairing the three channels of the plain image with the three channels of the cipher image. These form nine different pairs, i.e. the correlation between: the red channel of the plain image and the red channel of the cipher image, the red channel of the plain image and the green channel of the cipher image, the red channel of the plain image and the blue
channel of the cipher image; and so on for the green and blue channels of the plain image. These are denoted CRR, CRG, CRB, CGR, CGG, CGB, CBR, CBG and CBB, where for any Cij, i is a channel (R, G, B) of the plain image and j is a channel (R, G, B) of the cipher image. The coefficient values given in Table 3 show that there is little or practically no correlation between the plain image and its corresponding cipher image; the cipher image thus displays the characteristics of a random image.

Table 3. Correlation Values between Plain Image and Cipher Image

Image     CRR      CRG      CRB      CGR      CGG      CGB      CBR      CBG      CBB
Lena      -0.0033  0.0016   0.0047   -0.0026  -0.0008  0.0006   -0.0029  0.0003   -0.0021
Bridge    -0.0029  0.0005   0.0003   -0.0020  -0.0006  0.0011   0.0008   0.0007   0.0010
Lake      -0.0012  0.0002   0.0005   -0.0041  -0.0007  0.0033   -0.0050  -0.0021  0.0039
Mandrill  -0.0019  -0.0004  -0.0024  -0.0035  0.0011   -0.0036  -0.0034  0.0005   -0.0036
Peppers   -0.0030  -0.0059  -0.0022  -0.0033  -0.0024  -0.0012  -0.0042  -0.0007  0.0005
Plane     0.0072   0.0014   -0.0003  0.0068   0.0025   0.0015   0.0057   0.0033   0.0033

4.2 Differential Analysis
Differential analysis measures the amount of change the encryption introduces into the image. The cipher images of two very similar plain images should not have a similar distribution of pixels; in other words, the cipher images of two plain images differing in just a single pixel should bear no pixel resemblance to each other. An adversary should not be able to extract any meaningful relationship between plaintext and ciphertext by comparing the two ciphertexts of similar plaintexts. NPCR (net pixel change rate) and UACI (unified average changing intensity) are used as measures of differential analysis. NPCR indicates the percentage of pixels that change in the cipher image when a single pixel of the plain image is changed.
UACI measures the average intensity of the change between the two cipher images. Consider two cipher images X1 and X2, obtained from plain images P1 and P2 that differ in a single pixel. The pixel values at the grid position of the i-th row and j-th column of the cipher images are denoted X1(i,j) and X2(i,j). A bipolar array B is defined as

$B(i,j) = \begin{cases} 0, & \text{if } X_1(i,j) = X_2(i,j) \\ 1, & \text{if } X_1(i,j) \neq X_2(i,j) \end{cases}$   (6)
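The bipolar array above, together with the NPCR and UACI measures built from it in equations (7) and (8), can be sketched as follows; the function name is ours, and this is a minimal numpy sketch rather than the authors' code.

```python
import numpy as np

def npcr_uaci(x1, x2, t=255):
    """NPCR and UACI (in percent) between two same-shape cipher images."""
    x1 = x1.astype(np.int16)
    x2 = x2.astype(np.int16)
    b = (x1 != x2)                           # bipolar array B(i, j)
    npcr = b.mean() * 100.0                  # fraction of differing pixels
    uaci = (np.abs(x1 - x2) / t).mean() * 100.0
    return npcr, uaci
```

For 8-bit images, NPCR near 99.6% and UACI near 33.46% (the expected values for two independent random images) indicate good resistance to differential attacks, which matches Table 4.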
Values for NPCR and UACI are calculated as given in equations (7) and (8), where W and H denote the width and height of the cipher images, T denotes the largest supported pixel value in the cipher images (255 in our case) and abs() computes the absolute value. The NPCR and UACI values given in Table 4 show that the encryption algorithm is secure against differential attacks.

$\mathrm{NPCR} = \frac{\sum_{i,j} B(i,j)}{W \times H} \times 100\%$   (7)

$\mathrm{UACI} = \frac{1}{W \times H} \sum_{i,j} \frac{\mathrm{abs}(X_1(i,j) - X_2(i,j))}{T} \times 100\%$   (8)

Table 4. NPCR and UACI Values Obtained for Encryption of the 6 Plain Images and the Same Images with 1 Pixel Changed

Image     NPCR     UACI
Lena      99.6333  33.4706
Bridge    99.5722  33.4403
Lake      99.5900  33.5313
Mandrill  99.6089  33.4595
Peppers   99.6185  33.4657
Plane     99.6206  33.4539

5. CONCLUSION
In this paper we proposed a new image encryption algorithm. The merits of recent research, judged by their results, were combined with a symmetric encryption approach to produce a secure algorithm. The diffusion mechanism, together with the Feistel structure, strengthens the algorithm. The 3D Rossler system of equations is used for random key generation, and splitting the three dimensions of the key across the three channels makes cryptanalysis to recover the key more difficult. The experiments performed show that the algorithm generates favorable results.

REFERENCES
[1] Chang, C.-C., Hwang, M.-S. and Chen, T.-S., 2001. A New Encryption Algorithm for Image Cryptosystems. Journal of Systems and Software, Vol. 58, No. 2, pp. 83-91.
[2] Yano, K. and Tanaka, K., 2002. Image Encryption Scheme Based on a Truncated Baker Transformation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol. E85-A, No. 9, pp. 2025-2035.
[3] Gao, T. and Chen, Z., 2008. Image Encryption Based on a New Total Shuffling Algorithm. Chaos, Solitons and Fractals, Vol. 38, No. 1, pp. 213-220.
[4] Chen, G., Mao, Y. and Chui, C.K., 2004. A Symmetric Image Encryption Based on 3D Chaotic Cat Maps. Chaos, Solitons and Fractals, Vol. 21, pp. 749-761.
[5] Mao, Y., Chen, G. and Lian, S., 2004. A Novel Fast Image Encryption Scheme Based on 3D Chaotic Baker Maps. International Journal of Bifurcation and Chaos, Vol. 14, No. 10, pp. 3613-3624.
[6] Guan, Z.-H., Huang, F. and Guan, W., 2005. Chaos-Based Image Encryption Algorithm. Physics Letters A, Vol. 346, pp. 153-157.
[7] Zhang, L., Liao, X. and Wang, X., 2005. An Image Encryption Approach Based on Chaotic Maps. Chaos, Solitons and Fractals, Vol. 24, pp. 759-765.
[8] Gao, H., Zhang, Y., Liang, S. and Li, D., 2006. A New Chaotic Algorithm for Image Encryption. Chaos, Solitons and Fractals, Vol. 29, pp. 393-399.
[9] Pareek, N.K., Patidar, V. and Sud, K.K., 2006. Image Encryption Using Chaotic Logistic Map. Image and Vision Computing, Vol. 24, pp. 926-934.
[10] Wong, K.-W., Kwok, B.S.-H. and Law, W.-S., 2008. A Fast Image Encryption Scheme Based on Chaotic Standard Map. Physics Letters A, Vol. 372, pp. 2645-2652.
[11] Amin, M., Faragallah, O.S. and Abd El-Latif, A.A., 2010. A Chaotic Block Cipher Algorithm for Image Cryptosystems. Communications in Nonlinear Science and Numerical Simulation, Vol. 15, pp. 3484-3497.
[12] Patidar, V., Pareek, N.K. and Sud, K.K., 2009. A New Substitution-Diffusion Based Image Cipher Using Chaotic Standard and Logistic Maps. Communications in Nonlinear Science and Numerical Simulation, Vol. 14, pp. 3056-3075.
[13] Rossler, O.E., 1976. An Equation for Continuous Chaos. Physics Letters A, Vol. 57, No. 5, pp. 397-398.
[14] Kamat, V.G. and Sharma, M., 2014. Enhanced Chaotic Block Cipher Algorithm for Image Cryptosystems. International Journal of Computer Science Engineering, Vol. 3, No. 2, pp. 117-124.

This paper may be cited as: Kamat V. G. and Sharma M., 2014.
Symmetric Image Encryption Algorithm Using 3D Rossler System. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 1-13.
Node Monitoring with Fellowship Model against Black Hole Attacks in MANET

Rutuja Shah, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University

Lakshmi Rani, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University

S. Sumathy, AP [SG]
School of Information Technology & Engineering, VIT University

Abstract
Security issues have increased considerably in mobile ad-hoc networks. In the absence of any centralized controller, detecting problems and recovering from them is difficult. Packet drop attacks are among the attacks that degrade network performance. In this paper, we propose an effective node monitoring mechanism with a fellowship model against packet drop attacks, by setting up an observance zone in which suspected nodes are observed for their performance and behavior. Threshold limits are set to monitor the equivalence ratio of the number of packets received at a node to the number transmitted by it inside the mobile ad-hoc network. The fellowship model obliges nodes to deliver essential services in order to receive services from neighboring nodes, thus improving overall network performance.

Keywords: Black-hole attack, equivalence ratio, fair-chance scheme, observance zone, fellowship model.

1. INTRODUCTION
Mobile ad-hoc networks are infrastructure-less, self-organized networks of mobile devices connected by radio signals. There is no centralized controller for networking activities such as monitoring, modification and updating of the nodes inside the network, as shown in Figure 1. Each node is free to move in any direction and hence may change its links to other nodes frequently. There have been serious security threats in MANETs in recent years.
These usually lead to performance degradation, lower throughput, congestion, delayed response times, buffer overflow, etc. Among them is a well-known attack on packets called the black-hole attack, which is a form of DoS (denial of service) attack. In it, a router relays packets to different nodes, but due to the presence of malicious nodes
these packets are susceptible to packet drop attacks, which hinder secure and reliable communication inside the network.

Figure 1. MANET Scenario

Section 2 addresses the seriousness of packet drop attacks and related work done so far in this area. Section 3 elaborates our proposed defence scheme for packet drop attacks. Section 4 provides concluding remarks.

2. LITERATURE SURVEY
Packet drop loss in ad-hoc networks gained importance because of self-serving nodes that fail to provide the basic service of forwarding packets to neighboring nodes, which seriously hampers the functioning of the network. Generally there are two types of misbehaving nodes: selfish and malicious. Selfish nodes act only to enhance their own performance, while malicious nodes continually act to degrade the functioning of the network. WATCHERS [1], from UC Davis, was presented to detect and remove routers that maliciously drop or misroute packets. WATCHERS was based on the "principle of packet flow conservation", but it could not differentiate well between malicious and genuine nodes; although robust against Byzantine faults, it is not very effective at reducing packet loss in today's Internet. The basic mechanism of packet drop loss is that nodes, selfishly or maliciously, do not forward packets to other nodes. Packet drop loss can occur due to a black hole attack. Sometimes routers behave maliciously, i.e. they do not forward packets; such attacks are known as grey hole attacks. In the case of routers the attacks can be traced quickly, while in the case of nodes it is a cumbersome task. Many researchers have worked in this field and have tried to find
solutions to this attack [2-6]. Energy level is one of the parameters on which researchers have based their results. The idea works on the ratio of the fraction of energy committed by a node to the overall energy contributed to the network. A node is retained inside the network on the basis of its energy level, which is decided by the activeness of the node in the network through mathematical computations. These computations [7] are too complicated to grasp, and sometimes the results are catastrophic; the computations may be accurate, but they are very prone to ambiguity in ad-hoc networks. Some techniques use routing-table information, which is modified after detecting the MAC address of a malicious node that uses a jamming-style DoS attack, in order to cease its activities [8]. Another approach to reducing attacks uses a historical-evidence trust management strategy [9]: a direct trust value (DTV) is used among neighboring nodes to monitor node behavior, based on their past, against black hole attacks. However, there is a high possibility that trust values get compromised by malicious nodes, and the third party used for setting trust values is also vulnerable to attacks. Recent methods include the introduction of a new protocol called RAEED (Robust formally Analyzed protocol for wirEless sEnsor networks Deployment) [10], which reduces this attack, but not by a considerable percentage. To overcome the issues faced in implementing these strategies, an effective mechanism is needed to curb these attacks and make the network more secure.

3. PROPOSED APPROACH
In this paper, we put forth a mechanism to reduce packet-drop attacks by implementing a "node monitoring with fellowship" technique.
We introduce an obligation on the nodes inside a particular network to render services to the network. If services are not rendered, the node is expelled from the network. However, we provide a "fair-chance" scheme for all nodes, which helps determine whether a node is genuine or malicious.

3.1 Fellowship of the Network
The prime parameter we use to address packet drop attacks is the "equivalence ratio": the count of incoming packets at a node, excluding those destined for that node, should equal the count of outgoing packets, excluding those originating at that node. If the counts are equal, packets are being uniformly distributed and forwarded among the nodes inside the network. If they are not, the node concerned is placed under an "observance zone" so that its suspicious behavior can be monitored. We suggest that all nodes periodically report their equivalence ratio to their neighboring nodes; this helps decide, through polling amongst the nodes, whether to keep a particular node in the observance zone. Inside the observance
zone, the suspected node is given "fair-chance" treatment. That is, during the observance-zone period, the suspected node is required to submit a "status message" to its neighboring nodes to prove the genuineness of its performance inside the network. Genuine nodes will promptly provide their status messages, because they are willing to stay inside the network and render services under their obligation to it. Malicious nodes, however, may or may not reply with status messages, since their aim is to degrade network performance. Only a fair chance is given for such status messages: a standard threshold level is set up unanimously among the neighboring nodes, and status messages are entertained only up to that threshold. So even if malicious nodes fake their own status messages in order to stay inside the network, the threshold limits ensure they cannot degrade network performance much. When the threshold is crossed, the neighboring nodes are informed about the node under the observance zone, and a unanimous decision is taken to expel that suspected node from the network. Under this scheme a suspected node may be expelled in two circumstances: it is either a genuine node that is underperforming, or a malicious node. In both cases the suspected node needs to be expelled, because it is degrading the performance of the network. The "fair-chance" scheme ensures that genuine nodes get a fair chance to justify themselves and to repair themselves quickly, proving their genuineness and willingness to render services to the network under obligation.

3.2 Scenario Assumptions
Let the nodes inside the MANET be connected to each other through wireless links.
Let packets be transmitted and received among the nodes. Let the nodes be named alphabetically A, B, C, ... through Z. Let node X be a malicious node that drops packets (mounting a black hole attack) and hence has a poor equivalence ratio, while node Y is a genuine node that has a poor equivalence ratio due to network congestion or some other network issue. All nodes inside the network follow the principle of "node monitoring with fellowship". The data structures used are the following networking parameters:
1) equi_ratio: the equivalence ratio of a node.
2) observance_zone: the list of suspected nodes inside the observance zone.
3) threshold_value: the threshold value decided by the nodes inside the MANET.
4) status_message: the status messages exchanged among neighboring nodes.
Steps involved:
Step 1: All nodes calculate their own equivalence ratio (equi_ratio) and share it with their neighboring nodes (say, those at one-hop distance) periodically.
Step 2: All nodes unanimously agree upon a standard threshold level (in this case, threshold_value = 3) through exchange of messages using agreement protocols.
Step 3: All nodes monitor their neighbors' equi_ratio; if any node has a notably poor equi_ratio, that node is placed on the "observance zone" list through mutual exchange of messages among the nodes in the network. Such nodes may be malicious, or genuine nodes with poor performance.
Step 4: Once a suspected node is on the observance-zone list, it must report a status_message to the neighboring nodes to justify its performance and behavior.
Step 5: A malicious node (node X) may either fake its status_message to feign genuineness and stay inside the network, or simply avoid sending a status_message, since it wishes to continue its malicious activities. A genuine node (node Y) will send its status_message to prove its genuineness and will try to improve its performance by repairing the network issues it faces while sending packets. In both cases, the fair-chance scheme limits how often a node may justify itself through status_message: a suspected node may send a status_message only up to threshold_value times (here, 3). In short, both malicious nodes and underperforming genuine nodes are kept under surveillance to observe their behavior.
Step 6: Nodes that cross the threshold_value limit are immediately expelled from the network through the exchange of messages between the neighboring nodes under the agreement protocols. In this way, packet-drop attacks can be considerably reduced. Figure 2 explains the workflow mechanism.
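The steps above can be sketched as a single monitoring round. This is a hedged sketch under stated assumptions: only equi_ratio, observance_zone, threshold_value and status_message come from the paper; the Node fields, the cut-off for an "acceptable" equi_ratio and the exact update rules are our own illustrative choices.

```python
from dataclasses import dataclass

THRESHOLD_VALUE = 3    # step 2: unanimously agreed threshold_value
MIN_EQUI_RATIO = 0.9   # assumed cut-off for an "acceptable" equi_ratio

@dataclass
class Node:
    name: str
    forwarded: int = 0        # relay packets sent onward (not originated here)
    received: int = 0         # relay packets received (not destined here)
    status_messages: int = 0  # fair-chance justifications used so far

    @property
    def equi_ratio(self):
        # step 1: ratio of forwarded to received relay traffic
        return self.forwarded / self.received if self.received else 1.0

def monitor(nodes):
    """One monitoring round over a node's neighbours (steps 3-6)."""
    observance_zone = [n for n in nodes if n.equi_ratio < MIN_EQUI_RATIO]
    expelled = []
    for node in observance_zone:      # step 4: demand a status_message
        node.status_messages += 1
        if node.status_messages > THRESHOLD_VALUE:
            expelled.append(node)     # step 6: expel once the threshold is crossed
    return observance_zone, expelled
```

In a real MANET each node would run this logic locally and the observance-zone and expulsion decisions would be reached by the agreement protocols the paper mentions, not by a single centralized call.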
Figure 2. Flowchart of the proposed mechanism: set the threshold_value unanimously and exchange equi_ratio with neighboring nodes periodically; check whether the equi_ratio is acceptable; if unacceptable, place the suspected node under observance_zone and exchange status_message; while the count stays at or below threshold_value, continue normal network activities; once above threshold_value, the suspected node is expelled from the network.
3.3 Advantages
1. The fair-chance scheme protects innocent nodes by letting them prove their genuineness.
2. No complex mathematical computation of energy levels at each node.
3. Periodic reporting ensures removal of both underperforming and malicious nodes from the network.
4. Improved network performance in the MANET.

3.4 Disadvantages
There is an overhead of exchanging a larger number of messages among the neighboring nodes. Optimizing the number of messages exchanged during communication can be addressed in future research.

4. CONCLUSION
In this paper, we have proposed a novel scheme to reduce packet drop attacks and enhance network performance. We anticipate that our "node monitoring with fellowship" model may increase the number of messages exchanged among neighboring nodes during the agreement protocols, but at the same time it is robust against attacks and thus increases the availability of nodes in mobile ad-hoc networks. Minimizing packet drop loss yields better utilization of the channel and other resources, guaranteed QoS, productive priority management, and considerably better-controlled traffic through periodic surveillance of nodes. Future research will aim to reduce the message exchange among the nodes, minimize the overhead and achieve optimization inside mobile ad-hoc networks.

5. REFERENCES
[1] K. A. Bradley, S. Cheung, N. Puketza, B. Mukherjee and R. A. Olsson, Detecting Disruptive Routers: A Distributed Network Monitoring Approach, in the 1998 IEEE Symposium on Security and Privacy, May 1998.
[2] Y. C. Hu, A. Perrig and D. B. Johnson, Ariadne: A Secure On-demand Routing Protocol for Ad Hoc Networks, presented at the International Conference on Mobile Computing and Networking, Atlanta, Georgia, USA, pp. 12-23, 2002.
[3] P. Papadimitratos and Z. J.
Haas, Secure Routing for Mobile Ad hoc Networks, presented at the SCS Communication Networks and Distributed Systems Modeling and Simulation Conference, San Antonio, TX, January 2002.
[4] K. Sanzgiri, B. Dahill, B. N. Levine, C. Shields and E. M. Belding-Royer, A Secure Routing Protocol for Ad Hoc Networks, presented at the 10th IEEE International Conference on Network Protocols (ICNP'02), Paris, pp. 78-89, 2002.
[5] V. Balakrishnan and V. Varadharajan, Designing Secure Wireless Mobile Ad hoc Networks, presented at the Proceedings of the 19th IEEE International Conference on Advanced Information Networking and Applications (AINA 2005), Taiwan, pp. 5-8, March 2005.
[6] V. Balakrishnan and V. Varadharajan, Packet Drop Attack: A Serious Threat to Operational Mobile Ad hoc Networks, presented at the Proceedings of the International Conference on Networks and Communication Systems (NCS 2005), Krabi, pp. 89-95, April 2005.
[7] V. Balakrishnan and V. Varadharajan, Short Paper: Fellowship in Mobile Ad hoc Networks, presented at the Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks (SECURECOMM'05), IEEE.
[8] M. Raza and S. I. Hyder, A Forced Routing Information Modification Model for Preventing Black Hole Attacks in Wireless Ad Hoc Networks, presented at the 9th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, pp. 418-422, January 2012.
[9] Bo Yang, R. Yamamoto and Y. Tanaka, Historical Evidence Based Trust Management Strategy against Black Hole Attacks in MANET, published in the 14th International Conference on Advanced Communication Technology (ICACT), 2012, pp. 394-399.
[10] K. Saghar, D. Kendall and A. Bouridane, Application of Formal Modeling to Detect Black Hole Attacks in Wireless Sensor Network Routing Protocols, presented at the 11th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, pp. 191-194, January 2014.

This paper may be cited as: Shah, R., Rani, L. and Sumathy, S. 2014. Node Monitoring with Fellowship Model against Black Hole Attacks in MANET. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 14-21.
Load Balancing using Peers in an E-Learning Environment

Maria Dominic
Department of Computer Science, Sacred Heart College, India

Sagayaraj Francis
Department of Computer Science and Engineering, Pondicherry Engineering College, India

ABSTRACT
When an e-learning system is installed on a server, numerous learners make use of it and download various learning objects from the server. Most of the time the same learning object is requested and downloaded, so the server repeatedly performs the same task of locating the file and sending it to the requesting client. This wastes the server's precious CPU time on a task it has already performed. This paper provides a novel structure and an algorithm that store, in a dynamic hash table, the details of the clients that have already downloaded each learning object; when a new request comes in, the table is consulted and the learning object is sent from such a client to the requestor, saving the server's CPU time by harnessing the computing power of the clients.

Keywords
Learning Objects, e-Learning, Load Distribution, Load Balancing, Data Structure, Peer-Peer Distribution.

1. INTRODUCTION
1.1 e-Learning
Education is defined as the conscious attempt to promote learning in others, to acquire knowledge, skills and character [1]. Different pedagogies were used to achieve this mission; later, with the advent of new information and communication technology tools and the popularity gained by the Internet, they were used to enhance the teaching-learning process, giving birth to e-learning [2]. This enabled learners to learn across time and geographical barriers and allowed them to follow individualized learning paths [3].
E-learning, or electronic learning, is commonly perceived as a combination of the Internet, electronic content and networks to disseminate knowledge. The key factors of e-learning are reuse, resource sharing and interoperability [4]. At present there are various organizations
providing e-learning tools of multiple functionalities, one of which is MOODLE (Modular Object-Oriented Dynamic Learning Environment) [5], used on our campus. This in turn created difficulty in sharing learning objects between heterogeneous sites, and standards such as SCORM & SCORM LOM [6], IMS & IMS DRI [7], AICC [8] and the like were proposed by different organizations. In Berners-Lee's famous architecture for the Semantic Web, ontologies are used for sharing and interoperability, and they can be used to build better e-learning systems [9]. To define components for e-learning systems, the methodology used is the principle of composability in Service-Oriented Architecture [10], since it enables us to define the inter-relations between the different e-learning components. The most popular model used nowadays in the teaching-learning process is the Felder-Silverman learning style model [11]. The e-learning components are based on key topics, topic types, associations and occurrences. A VLE (Virtual Learning Environment) is the software that handles all learning activities. Learning objects are the learning materials that consciously attempt to promote visual, verbal, logical and musical intelligence [12] through presentations, tutorials, problem solving and projects. Multimedia, gaming and simulation promote kinaesthetic intelligence, while interpersonal, intrapersonal and naturalistic intelligence are promoted by means of chat, SMS, e-mail, forums, video and audio conferencing, surveys, voting and search. Finally, assessment is used to test the knowledge acquired by the learner, and the repository is the place that holds all the learning materials. This algorithm is useful when learners access the learning objects stored in the repository.
It reduces the load on the server by directing a client that has already downloaded a file from the server to respond to the requestor with that file.

1.2 Load Balancing
The emergence of large, fast networks with thousands of connected computers posed the challenge of sharing resources effectively among the computers in the network. Load balancing is a critical issue in peer-to-peer (P2P) networks [14]. The existing load balancing algorithms for heterogeneous P2P networks are organized in a hierarchical fashion. As P2P systems gained popularity, it became mandatory to manage huge volumes of data while keeping response times acceptable to users. Requests for the same data from multiple clients at the same instant may cause some of the peers to become bottlenecks, creating severe load imbalance and degrading the response time experienced by users. To reduce these bottlenecks and the overhead on the server, there was a need to harness the computing power of the peers [15]. Much work has been done on harnessing
the computing power of the computers in the network for high-performance computing and scientific applications; faster access to data and reduced computing time are still to be explored. In a P2P network the data is de-clustered across the peers in the network. When a popular piece of data is requested from across the peers, a bottleneck occurs and system response degrades. To handle this, a new strategy using a new data structure and algorithm is proposed in this paper.

2. PROPOSED DATA STRUCTURE AND THE ALGORITHM
The objective of this architecture is to harness the computational power of the clients in the network. The architecture is described with respect to the clients in our e-learning network, which comprises Master of Computer Applications students accessing learning materials for their course. The degree programme lasts three years, so the clients are categorized into three clusters, namely I MCA, II MCA and III MCA, which we call class clusters. Every class cluster contains many clusters inside it, which we call file clusters: one cluster for each type of file, since learning objects can be made up of presentations, video, audio, pictures, animation, etc. [13]. An address table, named the file address table, holds the address of each file cluster in the class cluster. When a request for a file is received, the corresponding cluster is identified by reading its address from the address table. The following algorithm represents the working logic of the concept, and the data structure is represented in Figure 1. Every file cluster holds a Dynamic Hash Table (DHT), a linked list and a binary tree. The dynamic hash table holds the address of the linked list, which holds the file names that have already been downloaded from the server.
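As an illustrative sketch only (the authors' implementation is in PHP, and the class and method names below are our own), the file-cluster structures just described can be modelled as follows. The DHT is modelled as a dictionary whose chains resolve collisions, each chain entry stands in for a linked-list node, and the binary tree of active clients is approximated by a minimum search over (IP, CPU usage time) records; the hash follows the scheme defined in the next subsection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClientNode:
    """Stands in for a binary-tree node: an active client's IP and CPU usage time."""
    ip: str
    cpu_usage_time: float  # lower means less loaded

@dataclass
class FileEntry:
    """Stands in for a linked-list node: a downloaded file and the clients holding it."""
    filename: str
    clients: List[ClientNode] = field(default_factory=list)

class FileCluster:
    """One file cluster: a DHT index mapped to a chain of FileEntry nodes."""

    def __init__(self) -> None:
        self.dht: Dict[int, List[FileEntry]] = {}  # chaining resolves index collisions

    @staticmethod
    def _hashed(filename: str) -> int:
        # Paper's hash: per character of the stem, concatenate the alphabet
        # position and the file-name position, sum the digits, divide by length.
        stem = filename.split(".")[0].lower()
        digits = "".join(f"{ord(c) - 96}{i}" for i, c in enumerate(stem, 1))
        return sum(int(d) for d in digits) // len(stem)

    def record_download(self, filename: str, ip: str, cpu_usage_time: float) -> None:
        """Register that a client has downloaded the file from the server."""
        chain = self.dht.setdefault(self._hashed(filename), [])
        for entry in chain:
            if entry.filename == filename:
                break
        else:  # collision or first download: new node appended to the chain
            entry = FileEntry(filename)
            chain.append(entry)
        entry.clients.append(ClientNode(ip, cpu_usage_time))

    def least_used_client(self, filename: str) -> Optional[str]:
        """IP of the least-loaded client holding the file, or None (serve from server)."""
        for entry in self.dht.get(self._hashed(filename), []):
            if entry.filename == filename:
                return min(entry.clients, key=lambda c: c.cpu_usage_time).ip
        return None
```

For example, if abc.ppt has been recorded for clients 10.0.0.1 (CPU usage time 40.0) and 10.0.0.2 (CPU usage time 12.5), `least_used_client("abc.ppt")` returns "10.0.0.2"; a file no client holds yields None, signalling the fallback to the server.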
The hashing function used to identify an index in the DHT is as follows:
1. Represent every character in the file name by its position in the alphabet followed by its position in the file name. E.g. the file name abc.ppt gives 112233: the value for a is 11, since its position in the alphabet is 1 and its position in the file name is 1.
2. Sum all the digits produced by step 1. E.g. 112233 gives 1+1+2+2+3+3 = 12.
3. Divide the sum by the length of the file name: 12/3 = 4, which becomes the index for the file in the DHT.
The above three steps are formulated mathematically in equation (1).
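The three steps can be sketched in Python as follows. This is a hedged reading of the scheme: we assume the extension is excluded, since the worked example maps abc.ppt to 112233 rather than hashing the full name, and that the division in step 3 is integer division.

```python
def hashed(filename: str) -> int:
    """Compute the DHT index for a file name, following the paper's three steps."""
    # Assumption: only the stem before the extension is hashed (abc.ppt -> "abc").
    stem = filename.split(".")[0].lower()
    concatenated = ""
    for position, ch in enumerate(stem, start=1):
        alphabet_pos = ord(ch) - ord("a") + 1        # step 1: a=1, b=2, ...
        concatenated += f"{alphabet_pos}{position}"  # 'a' at position 1 -> "11"
    digit_sum = sum(int(d) for d in concatenated)    # step 2: 1+1+2+2+3+3 = 12
    return digit_sum // len(stem)                    # step 3: 12 // 3 = 4
```

Here `hashed("abc.ppt")` returns 4; note that `hashed("bac.ppt")` also returns 4, illustrating the collision between abc.ppt and bac.ppt that Figure 1 resolves with the linked list.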
Figure 1. Proposed Data Structure. The address table holds the addresses of the file clusters (e.g. the clusters of presentational, audio and video files); each file cluster contains a dynamic hashing table whose entries point to a linked list of downloaded file names (e.g. abc.ppt, bac.ppt, NULL), and each list node points to a binary tree of client records (IP, CPU usage time).
Every index of the DHT holds the starting address of a linked list in which every node stores a file name that has already been downloaded. The linked list is used to avoid index collisions between file names generating the same index in the DHT: the collision is resolved by creating a new node in the linked list for the new file name. As shown in Figure 1, every node in the linked list holds three values, namely the file name, the address of a binary tree and the address of the next node in the list. The nodes of the binary tree hold the active clients' IPs and their current CPU processing status. The binary tree is used to identify the client whose CPU is least used, which then transfers the file to the requestor; this harnesses the computing power of the least-used CPU, and the tree structure reduces the search time for the least-used client. If the file has not been downloaded by any client, i.e. when the last node of the linked list is reached, then the file is transferred from the server.

Algorithm 1
SEND ( )
{
    Direct the request to the file cluster
    Take the address of the file cluster from the address table
    Index = HASHED (File Name)
    If (the index is out of bounds)
    {
        // the file has not been downloaded by any client
        Send it from the server to the client
    }
    Else
    {
        While (not end of linked list AND node not found)
        {
            If (Node.data == File Name)
            {
                Node found = true
                IP = LEASTUSEDCPU (Node's binary tree)
            }
        }
        If (Node found == true)
            Send the requested file from IP to the requestor
        Else
        {
            // the file has not been downloaded by any client
            Send it from the server to the client
        }
    }
}
End of SEND ( )

Algorithm 2
int LEASTUSEDCPU ( )
{
    LeastUsedCPU = IP of the first node
    While (not end of binary tree)
    {
        Compare the CPU usage time of the current node with that of LeastUsedCPU
        If (the CPU usage time of the current node is lower)
            LeastUsedCPU = IP of the current node
    }
    return (LeastUsedCPU)
}
End of LEASTUSEDCPU ( )

Algorithm 3
int HASHED (String FileName)
{
    Len = StringLength (FileName)
    While (not end of string)
    {
        DigitString += Concatenate (position of the character in the alphabet, position of the character in the file name)
    }
    IndexInt = SumOfDigits (DigitString)
    return (IndexInt / Len)
}
End of HASHED ( )

3. MATHEMATICAL FORMULATION
The problem dealt with above is formulated mathematically as follows:

index = ( Σ_{i=1}^{l} σ(j_i, i) ) / l                          (1)

Z_k(f3) = X(f2) for some k ∈ {1, …, n}  →  apply (4), (5)      (2)
Z_k(f3) ≠ X(f2) for all k ∈ {1, …, n}  →  apply (6)            (3)

Y : C(Y) = min_{m=1}^{n} C(m)                                   (4)

f2 = X⁻¹ Y (f1)                                                 (5)

f2 = X⁻¹ S (f1)                                                 (6)

where:
index is the index in the Dynamic Hash Table;
l is the length of the file name;
i is the character position in the file name;
j_i ∈ {1, 2, 3, …, 26} is the position of the i-th character in the alphabet;
σ(j_i, i) is the sum of the decimal digits of j_i and i;
n is the number of nodes in the linked list, indexed by k;
Z_k is the k-th node in the linked list;
f3 is the file name in a node of the linked list;
f2 is the targeted file;
m ranges over the nodes in the binary tree;
Y is the node in the binary tree with the minimum C;
C is the CPU usage time of the specified IP;
S is the server.

4. CONCLUSION
The main advantage of this architecture is that server time is saved by harnessing the computational power of the clients who have already downloaded a file to send it on to the requestor. Another advantage of the architecture is the file search, which is accelerated by the dynamic hashing table and binary tree structures. The algorithm is currently being implemented in PHP, and its results will be published in future work. Initial results indicate a substantial reduction in the server's CPU processing time when this algorithm is executed on the server.
REFERENCES
[1] Lavanya Rajendran, Ramachandran Veilumuthu, 2011. A Cost Effective Cloud Service for E-Learning Video on Demand. European Journal of Scientific Research, pp. 569-579.
[2] Maria Dominic, Sagayaraj Francis, Philomenraj, 2013. A Study on Users on Moodle through Sarasin Model. International Journal of Computer Engineering and Technology, Volume 4, Issue 1, pp. 71-79.
[3] Maria Dominic, Sagayaraj Francis, 2013. Assessment of Popular E-Learning Systems via Felder-Silverman Model and a Comprehensive E-Learning System. International Journal of Modern Education and Computer Science, Hong Kong, Volume 5, Issue 11, pp. 1-10.
[4] Zhang Guoli, Liu Wanjun, 2010. The Applied Research of Cloud Computing Platform Architecture in the E-Learning Area. IEEE.
[5] www.moodle.org
[6] SCORM (Sharable Courseware Object Reference Model), http://www.adlnet.org
[7] IMS Global Learning Consortium, Inc., "Instructional Management System (IMS)", http://www.imsglobal.org
[8] http://www.aicc.org
[9] Uschold, Gruninger, 1996. Ontologies: Principles, Methods and Applications. Knowledge Engineering Review, Volume 11, Issue 2.
[10] Papazoglou, Heuvel, 2007. Service Oriented Architectures: Approaches, Technologies, and Research Issues. The VLDB Journal, Volume 16, Issue 3, pp. 389-415.
[11] Graf, Viola, Kinshuk, 2006. Representative Characteristics of Felder-Silverman Learning Styles: An Empirical Model. IADIS, pp. 235-242.
[12] Lorna Uden, Ernesto Damiani, 2007. The Future of E-Learning: E-Learning Ecosystem. Proceedings of the IEEE Conference on Digital Ecosystems and Technologies, Australia, pp. 113-117.
[13] Maria Dominic, Sagayaraj Francis, 2012. Mapping E-Learning System to Cloud Computing. International Journal of Engineering Research and Technology, India, Volume 1, Issue 6.
[14] Chyouhwa Chen, Kun-Cheng Tsai, 2008.
The Server Reassignment Problem for Load Balancing in Structured P2P Systems. IEEE Transactions on Parallel and Distributed Systems, Volume 19, Issue 2.
[15] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, 2006. Load Balancing in Structured P2P Systems. Proc. Second Int'l Workshop on Peer-to-Peer Systems (IPTPS '03).

This paper may be cited as:
Dominic, M. and Francis, S. 2014. Load Balancing using Peers in an E-Learning Environment. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 22-29.
E-Transparency and Information Sharing in the Public Sector
Edison Lubua (PhD)
Mzumbe University, P.O. Box 20266, Dar Es Salaam, 255, Tanzania

ABSTRACT
This paper determines the degree of information sharing in government institutions through e-transparency tools. First the basis for the study is set through the background, problem statement and objectives. The discussion then focuses on ICT tools for information sharing. An information sharing model is proposed and the extent of information sharing through online media in the public sector of Tanzania is discussed; furthermore, the correlation between the extent of information sharing and factors such as accessibility, understandability, usability and reliability is established. The paper concludes by providing recommendations on information sharing and on how it can be enhanced through e-transparency systems for public service delivery in an open society.

Keywords
E-transparency, E-Governance, Information Sharing, Public Sector, ICT.

1. BACKGROUND OF THE STUDY
Generally, information services are an important pillar of any democratic government. Citizens rely on information to make decisions which affect their social, political and economic lives. In this regard, there are laws which govern the right to access and disseminate information, locally and internationally (Hakielimu, LHRC, REPOA, 2005). Locally, government authority reflects international agreements through different pieces of legislation, including the National Constitution (United Republic of Tanzania, 1995). The Constitution of Tanzania entitles every citizen to the right of access to information and empowers citizens with the right to disseminate information. In his study, Onyach-Olaa (2003) commended government authorities which make an effort to enhance information sharing with citizens.
The government has to improve its interaction with those it governs while treating information sharing as a core function. Furthermore, information sharing and transparency in government operations must become the culture of any democratic republic, including Tanzania (Mkapa, 2003). Transparency in government operations improves the confidence of citizens in their government, while reminding government leaders that their decisions and the associated impact are visible to citizens (Navarra, 2006). Traditionally, information services have been provided and received through physical means; mostly, people use oral/listening and writing/reading methods to issue and receive information. In many cases, the traditional method of information sharing is characterised by delays, high cost,
low transparency and bureaucracy (Im & Jung, 2001); as a result, this method allows accountability to be subverted (Lubua, 2014). Arguably, the communication developments brought by the use of Information and Communication Technology (ICT) tools provide a better platform for information sharing. Instant communication is enabled through tools such as e-mail, online telephony, video conferencing, chat rooms and social websites. As a result of these tools, challenges relating to delays, high communication costs and bureaucratic procedures are addressed. Apart from the platform provided by online media in enhancing communication, it is equally important to understand that the efficiency of information sharing is directly related to the size of the network connecting individuals, groups of people and organisations (Hatala & Lutta, 2009). The greater the intensity of networks, the more information is received; an organisation enjoys these benefits if it forms strategic alliances with partners which allow a free flow of information to both ends. This is the reason why the e-governance agency was instituted in Tanzania. The appropriate use of e-transparency tools is perhaps the best strategy for an organisation to enhance information sharing with its stakeholders. The organisation has to emphasize good qualities of information sharing such as timely response, accessibility of systems, reliability of data, online security, completeness of online procedures and openness in service processes. Basically, this paper discusses different issues, including the need for online information sharing in the public sector and the extent to which government institutions apply online media for information sharing and service provision. The study is based on opinions from clients who are consumers of such services.

2.
PROBLEM STATEMENT
Business competition compels organisations to invest in information systems to improve the efficiency of their operations (Barua, Ravindran, & Whinston, 2007). This investment is made possible through the knowledge of employees, suppliers, customers, and other key stakeholders. In this regard, the organisation that shares its information with stakeholders more efficiently earns a competitive advantage (Drake, Steckler, & Koch, 2004). Information sharing is an important resource which should be embraced in order to enhance the performance of an organisation (Hatala & Lutta, 2009). Depending on the type of organisation, the extent of information sharing is partly influenced by organisational policies and practices. The management team, employees and partners have to work together to foster organisational information sharing, which guarantees the future existence of the organisation (Drake, Steckler, & Koch, 2004).
The government of Tanzania acknowledges the importance of ICTs in promoting information sharing in society. It uses methods such as conferences, workshops and public portals to demonstrate its intention of maximizing information sharing. With the growth in the number of ICT users, the degree of information sharing is expected to increase. Therefore, this study intends to establish the extent to which the use of ICTs has enhanced information sharing. Further, the study will also establish the correlation between the extent of information sharing and factors which negatively influence the perception of users.

3. OBJECTIVES
This study is designed to cover the following objectives:
i. To determine the extent of information sharing through e-transparency in the Tanzanian public sector.
ii. To establish the extent to which information usefulness, understandability, reliability and accessibility influence information sharing through e-transparency systems.

4. METHODOLOGY
This study was conducted using a mixed research method. First, the study reviewed the relevant literature to establish its relevance. Then, the Tanzania Revenue Authority's Custom Online System was identified as the case for study, followed by survey procedures. Data were collected from twenty (20) clearing and forwarding companies that operate under the Custom regulations of the Tanzania Revenue Authority; the study received and analysed a total of 40 responses. The study collected data from original sources to enhance validity and relevance. The analytical models used include Spearman's Correlation Model and Regression Analysis.

5.
ICT TOOLS AND GOVERNMENT INFORMATION SHARING
Transparency is one of the pillars of good governance; it promotes openness about conditions and activities, and eventually ensures that stakeholders have the information necessary to make the decisions needed for the progress of their businesses and lives. Information thus forms the cornerstone of transparency, especially in civic organisations. In the management of civic institutions, information dissemination provides guidance and education to stakeholders in the different matters that influence their lives, including political, socio-economic and cultural issues. The availability of information is clearly influenced by the media used in the capturing, storage and dissemination process. Since electronic media are effective in raising the level of transparency in society, the government should take advantage of these tools to build its relationship with citizens through information sharing, and hence engage them in supporting planned public development goals (Abu-Dhabi-Government, 2011; Lubua & Maharaj, 2012).
In the Republic of Tanzania, the use of ICT tools for communication and information sharing increases on a daily basis; the number of internet users increased by 450% between 2001 and 2010. Additionally, about 50% of the population of Tanzania is reported to use either the internet or a mobile phone (Kasumuni, 2012). Given this increase, understanding the extent to which information from government institutions is shared enables the government to know how effectively the media are being utilized to promote national development.

6. AN INFORMATION SHARING MODEL
This paper summarises information sharing using the model presented in Figure 1. The abundance and availability of information mean that the user needs skill to determine what it is that they want. In this case, the user of information has the key role to play in effecting information sharing. The user must be able to use the relevant tools to search for information and be able to determine the relevance of the accessed data to his/her operations. The ability to use such tools is attained through learning. Beyond knowing how to use the tools for searching for information, the user must be aware of the problem that they need to solve.

Figure 1: Information Sharing Model. Source: Research Data (2012)

The choice of information is dictated by the gap which has to be covered. When this gap is expressed, it becomes a need. In responding to the need, the user of information consults a source, which is either electronic or physical. It is possible that the source may not have the type of information requested, or that the information may not be satisfying. Regardless of the level of satisfaction, the user of information takes action towards covering the gap.
(Figure 1 elements: Information User → Information Needs → Information Source → No Information in the Source → Satisfaction/Dissatisfaction → Action.)
In case the public seeks information from government institutions, dissatisfaction may influence
members of the public to take action, even against the government; on the other hand, satisfaction builds more support for the government (Lubua, 2014). The satisfied user of the information applies it to solve the problem identified in the gap. A good example is a farmer who was searching for a good market for his/her harvest; s/he will eventually use the information to choose a better market. Conversely, the recent Arab uprising represents a possible negative response by users of information in a situation of low satisfaction (Maharaj & Vannikerk, 2011). The government should therefore respond adequately to inquiries from citizens to reduce the possibility of a negative response. It must ensure the adequate availability of information that addresses citizens' daily challenges.

7. INFORMATION SHARING USING E-TRANSPARENCY TOOLS IN PUBLIC INSTITUTIONS
The introduction of ICT tools brings more opportunities for information sharing in the organisation by allowing users to receive and send information more easily (Kilama, 2013). Stakeholders are also able to discuss issues of different interests through tools such as social networks, chat rooms, e-mail systems and video/teleconferencing, and in some cases the organisation is able to solicit stakeholders' opinions before making decisions (Im & Jung, 2001; Lubua & Maharaj, 2014). Together with the progress made in information sharing, there is a need to know the extent to which government institutions apply online media for information sharing. This study is based on opinions from clients who are consumers of online information from a government institution.
Based on the responses from clients of the Tanzania Revenue Authority, it was found that 70% of respondents agree that the authority sufficiently shares its information through online media. These respondents are clients of Custom services who benefit from the Custom Online System (CULAS). The following factors influenced the successful deployment of this system:

a.) Good ICT infrastructure
The ICT infrastructure of the Tanzania Revenue Authority is well established; it is characterised by a good interface, reliable data backup systems, power backups and a reliable internet connection. In addition, the revenue authority is among the organizations benefiting from the high-capacity internet connectivity of the National ICT Backbone (NICTBB). Nevertheless, the study observed that not all respondents had access to the infrastructure of the revenue authority; some lacked computers to access the systems. A computer room for clients would be an important extension of the services offered by the revenue authority in its custom section; it would equally facilitate users who are not based in Dar Es Salaam but visit for custom services.
b.) Technical Skills and Competency
The infrastructure of an information system requires competent staff to maintain and operate its functions (Badillo-Amador, García-Sánchez, & Vila, 2005; Cohen, 2012). In many cases, the revenue authority uses its own staff to run its operations; where advanced knowledge is required, the institution partners with non-governmental organisations to obtain technical services. To a large extent, the revenue authority uses training to equip its employees. Nevertheless, the study noted cases where training was not as effective as expected. In fact, the analysis of the degree of association between training and the skills possessed by staff, using the Pearson Correlation Model, observed an insignificant association (r = 0.101, p = 0.316), except where a follow-up programme was instituted (r = 0.292, p = 0.003). It is therefore necessary to incorporate follow-up programmes after training for enhanced competency.

c.) Institutional Will
Installing a good ICT infrastructure has to be complemented by the willingness of members of staff to use the new system exclusively for service provision. The management of the Tanzania Revenue Authority Custom Department dedicates its online system as the only method for the issuance of services to clients. So far the experience of operational staff is reported to be outstanding. However, the lack of important equipment such as computers for some employees, and occasional system breakdowns, affect its use.

d.) Customer Satisfaction
Changes have to be managed carefully in order to avoid frustrating clients. Together with implementing new changes to service provision, the Tanzania Revenue Authority Custom Department established a help desk that attends to queries from clients about different applications of the new system.
Additionally, documentation is provided that addresses the steps to be taken in using the system. The study discovered that 95% of respondents recommend or strongly recommend the use of the Tanzania Revenue Authority Custom Online System for securing services from the institution; these results show that users' satisfaction with the online system is high.

8. INFORMATION USEFULNESS, UNDERSTANDABILITY, RELIABILITY AND ACCESSIBILITY AND THE EXTENT OF INFORMATION SHARING
As shown in the previous section, respondents from the Tanzania Revenue Authority have confidence in the extent to which the government institution shares information with stakeholders through online media. While this extent is influenced by a number of factors, this study is interested in the following: information accessibility, information usefulness, information reliability and information understandability. This part of the study identifies how information sharing is
influenced by these factors, and a linear regression model is used to demonstrate the relationship. Linear regression analysis was used to establish the relationship between these variables, as shown in Table 1 below.

Table 1: Model Summary
Regression Model    R        R Square
1                   0.724a   0.524
a. Predictors: (Constant), Government online information is reliable; Government online information is useful; The use of the internet has enhanced access to information; Government online information is easily understood

According to the data reported by clients of the Tanzania Revenue Authority, the value of the coefficient of relatedness (R) is 0.724; this value suggests the presence of correlation between the variables. At the Tanzania Revenue Authority, information usefulness, understandability, reliability and accessibility are important attributes of the information provided to users, because the online system is the only means for users to access custom services. The appreciation of these variables influences the extent of information sharing among stakeholders. Below is a brief explanation of how these variables are supported at the Tanzania Revenue Authority.

a.) Information Accessibility
The Tanzania Revenue Authority's Custom Online System provides users with credentials which give access to the system. Within the system, users are able to trace every stage of their application. Moreover, to ensure that the system is constantly accessible to clients, the link to the online system is published on the website and supported by servers which run constantly with the support of information and power backups. Although accessibility is better than in other public institutions, users report cases where they failed to lodge their service applications due to extended system downtime.

b.)
Data Reliability
The online system of the Tanzania Revenue Authority ensures reliability by dedicating a few officials who are experts in custom services to manage the queries and applications submitted by clients to the system. Furthermore, employees of the revenue authority verify the information sent by clients before effecting a transaction, to ensure the reliability of the information involved. This ensures that only information which is both relevant and correct is provided to consumers through the online media. Moreover, to ensure that the information from users of the online system (who are clearing and forwarding experts) is reliable, the system provides guidance to users on the different stages involved in an application for services. The system also dictates the format
of the information to be entered, to ensure consistency; further, it grants the user the opportunity to proofread their data entry before the information is finally submitted.

c.) Information Usefulness
The Custom Online System is dedicated to the Customs department only, and is tailored to meet the needs of clearing and forwarding agents by simplifying their tax-paying processes. The authority receives feedback from clients on different aspects of the system, including its usefulness for its intended use. Although many respondents agree that the information they receive is useful, the study noted that a number of users were not comfortable with the use of the English language for communication. Swahili is Tanzania's national language; its adequate use would improve the ability of users to understand the information in context, and hence improve usefulness.

d.) Information Understandability
The issue of understanding the information provided through online systems is critical; the fact that the users of the Tanzania Revenue Authority are of a diverse nature suggests differences in analytical and language skills. While Tanzania uses Kiswahili as the national language, English is used for academic and business operations. Due to differences in education and analytical skills, some clients of the Tanzania Revenue Authority need to consult on language before they understand the content of the information. Recognising this challenge, the revenue authority has a dedicated helpdesk to clarify issues which users find difficult to understand.

9. CONCLUSION
The purpose of the study was to establish the degree to which the Tanzanian public sector uses ICTs to enhance transparency. The assessment was guided by the fact that Tanzania advocates good governance, of which information sharing is an important component.
The study also recognises that ICTs play an important role in the business sector in ensuring that clients access services efficiently and with maximum transparency. The same business experience could be adopted by the government to raise citizens' satisfaction with government services. The study observed that many people are aware of the importance of ICTs in ensuring transparency in government operations. However, there were several cases where the performance of government operations did not meet users' expectations. Factors such as low system reliability and the ineffectiveness of officials operating the system were among those which affected the use of ICTs for enhanced transparent services. While training was identified as important in equipping users with the required technical skills, it was occasionally observed to fall short; training requires follow-up to ensure that it meets its goals. Equally, information accessibility, reliability, usefulness and understandability have a great impact on users' experience of online media.
This paper may be cited as: Lubua, E. 2014. E-Transparency and Information Sharing in the Public Sector. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 30-38.
A Survey of Frequent Subgraphs and Subtree Mining Methods
Hamed Dinari and Hassan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
A graph is a basic data structure which can be used to model complex structures and the relationships between them, such as XML documents, social networks, communication networks, chemical informatics, biological networks, and the structure of web pages. Frequent subgraph pattern mining is one of the most important fields in graph mining. In light of its many applications, there has been extensive research in this area, covering the analysis and processing of XML documents, document clustering and classification, image and video indexing, graph indexing for graph querying, routing in computer networks, web link analysis, drug design, and carcinogenesis. Several frequent pattern mining algorithms have been proposed in recent years, and new ones are introduced regularly. Because these algorithms use various methods on different datasets, pattern mining types, and graph and tree representations, it is not easy to compare them in terms of features and performance. This paper presents a brief report of an intensive investigation of current frequent subgraph and subtree mining algorithms. The algorithms are also categorized based on different features.
Keywords
Graph mining, subgraph, frequent pattern, graph indexing.
1. INTRODUCTION
Today we are faced with ever-increasing volumes of data, much of which naturally has a graph or tree structure. The process of extracting new and useful knowledge from graph data is known as graph mining [1] [2]. Frequent subgraph pattern mining [3] is an important part of graph mining.
It is defined as the process of extracting from a database those patterns whose frequency is greater than or equal to a user-defined threshold. Due to its wide utilization in various fields, including social network analysis [4] [5] [6], XML document clustering and classification [7] [8], network intrusion detection [9] [10], VLSI reverse engineering [11], behavioral modeling [12], the semantic web [13], graph indexing [14] [15] [16] [17] [18], web log analysis [19], link analysis [20], drug design [21] [22] [23], and classification of chemical compounds [24] [25] [26], this field has been the subject of several works.
The present paper is an attempt to survey subtree and subgraph mining algorithms. A comparison and classification of these algorithms, according to their different features, is also made. The next section discusses the literature review, followed by section three, which deals with the basic ideas and concepts of graphs and trees. Frequent subgraph mining algorithms are discussed in section four from different viewpoints, such as criteria for representing graphs (adjacency matrix and adjacency list), generation of subgraphs, frequency counting, pattern-growth-based and apriori-based classification, classification based on search method, classification based on transactional and single-graph inputs, classification based on type of output, and mining based on logic. The fifth section focuses on frequent subtree mining algorithms from different angles, such as tree representation methods, type of algorithm input, tree-based mining, and mining based on constraints on outputs.
2. RELATED WORKS
H.J. Patel, R. Prajapati, et al. [27] classified graph mining algorithms into two types: apriori-based and pattern-growth-based. K. Lakshmi and T. Meyyappan [28] studied apriori-based and pattern-growth-based algorithms, taking into account aspects such as input/output type, how a graph is represented, how candidates are generated, and how many times a candidate is repeated in the graph dataset. In [29], D. Kavitha, B.V. Manikyala Rao, et al. suggested a third type of graph mining algorithm, known as inductive logic programming. A complete survey of graph mining concepts, with a set of examples to ease understanding of the concepts, comes next.
3. BASIC CONCEPTS
3.1 Graph
A graph G(V, E) is composed of a set of vertices (V) connected to each other by a set of edges (E).
3.2 Tree
A tree T is a connected graph that has no cycle.
In other words, there is one and only one path between any two vertices.
3.3 Subgraph
A graph G'(V', E') is a subgraph of G(V, E) if its vertices and edges are subsets of V and E respectively:
• V' ⊆ V
• E' ⊆ E
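The subgraph relation above can be illustrated with a minimal sketch (not from the paper; the graph literals are illustrative). Graphs are stored as a pair of a vertex set and an edge set, and the check is literal set containment; note that frequent subgraph miners must instead test subgraph isomorphism, which allows vertices to be matched by label and structure and is far more expensive.

```python
# Minimal sketch of the subgraph definition: G'(V', E') is a subgraph of
# G(V, E) when V' ⊆ V and E' ⊆ E. Graphs are (vertex_set, edge_set) pairs.

def is_subgraph(g_sub, g):
    """Check V' ⊆ V and E' ⊆ E for undirected graphs."""
    (v_sub, e_sub), (v, e) = g_sub, g
    norm = lambda edges: {frozenset(edge) for edge in edges}  # (u,w) == (w,u)
    return v_sub <= v and norm(e_sub) <= norm(e)

# Illustrative graphs: a 4-cycle G and a 2-edge path H contained in it.
G = ({"a", "b", "c", "d"}, {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")})
H = ({"a", "b", "c"}, {("a", "b"), ("c", "b")})

print(is_subgraph(H, G))  # True: every vertex and edge of H occurs in G
```

This check only verifies containment of identically named vertices; the isomorphism test of Section 3.3.3, where only labels and structure must match, is the computationally hard part of frequent subgraph mining.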
One may say that a subgraph of a graph is a pattern of that graph. Concerning trees, two types of patterns can be defined:
3.3.1 Induced pattern
The definition is exactly the same as the definition of a subtree of a tree (Figure 1.a, Figure 1.c): the vertices and edges of Figure 1.a can be seen in Figure 1.c as well.
3.3.2 Embedded pattern
Almost the same as an induced pattern, except that there may be one or more supplementary vertices between a parent and a child node of the pattern. For example, vertex A in Figure 1.c is an ancestor of vertex D, and Figure 1.b shows an embedded pattern of Figure 1.c.
Figure.1. An example of the induced and embedded subtree pattern
3.3.3 Isomorphism
Two graphs are isomorphic if there is a one-to-one correspondence between their vertices and between their edges.
3.3.4 Frequent Subgraph
Suppose a graph G and a set of graphs D = {g1, g2, g3, …, gn} are given. The support of G is

support(G) = |{gi ∈ D : G is a subgraph of gi}| / |D|

A graph G in a dataset D is called frequent if its support is not less than a predefined threshold.
4. AN OVERVIEW OF FREQUENT SUBGRAPH MINING ALGORITHMS ACCORDING TO DIFFERENT CRITERIA
This section discusses different criteria for classification of frequent graph mining algorithms, including: graph representation, input type, constraint-based mining, inductive logic programming, search strategy, and completeness/incompleteness of outputs.
4.1 Graph Representation
4.1.1 Adjacency Matrix
A graph can be represented as an adjacency matrix, in which the rows and columns represent the vertices of the graph and the entries represent its edges (i.e. when there is an edge between two vertices, the entry at the junction of the corresponding row and column is filled with “1” and otherwise with
”0”). Furthermore, the vertices appear on the main diagonal of the matrix (Figure 2). To represent the graph as a string, a combination of nodes and edges in a particular order can be used; since every permutation of the nodes may generate a different string, a maximum or minimum canonical adjacency matrix (CAM) must be adopted. An advantage of this is that two isomorphic graphs will have the same maximum/minimum CAM.
Figure.2. Left side a graph and right side corresponding adjacency matrix
4.1.2 Adjacency List
Another way to represent a graph is the adjacency list. When the graph is sparse, many zero entries are generated in the adjacency matrix, which is a great waste of memory; the adjacency list avoids this by assigning memory dynamically.
4.2 Subgraph Generation
Two subgraphs can be joined to generate a candidate subgraph, the result being a new subgraph. However, given that many duplicate subgraphs might be generated in the joining process, the way candidate subgraphs are generated is critical. Among the available methods are extension and rightmost extension. In the latter case, subgraphs are expanded in one direction only, and no duplicate candidate is generated.
4.3 Frequency Counting
To decide whether a generated candidate is frequent, the frequency of each candidate must be determined and compared with the support threshold. Data structures used to count the frequency of each candidate include the embedding list and the TSP tree.
5. A SURVEY OF FREQUENT SUBGRAPH MINING ALGORITHMS
5.1 Classification Based on Algorithmic Approach
5.1.1 Apriori-Based (Breadth First Search)
This category of algorithms uses a generate-and-test method and level-wise search to find subgraphs in the graphs that constitute the database.
Therefore, before subgraphs of length k+1 ((k+1)-candidates) are generated, all frequent subgraphs of length k must be found. Each candidate of length k+1 is then obtained by connecting two frequent subgraphs with
length of k. However, in this method every possible candidate subgraph is generated, and maintaining and processing them needs plenty of time and memory, which hurts performance [30] [2].
5.1.2 Pattern Growth-Based
In pattern-growth-based methods, a candidate subgraph of length k+1 is obtained by extending a frequent pattern of length k. Since extending a frequent subgraph of length k may generate several candidates of length k+1, the way a frequent subgraph is expanded is critical in reducing the generation of duplicate subgraphs. Table 1 lists apriori and pattern growth algorithms [2].
Table 1. Frequent Subgraph Mining Algorithms
Apriori: FARMER [31], FSG [3], HSIGRAM, GREW [32], FFSM [4], ISG, SPIN [33], Dynamic GREW [34], AGM [35], MUSE [36], SUBDUE [37], AcGM [38], DPMine, gFSG [39], MARGIN [40]
Pattern Growth: gSpan [41], CloseGraph [42], Gaston [43], TSP [44], MoFa [45], RP-FP [46], RP-GD [46], JPMiner [47], MSPAN, VSIGRAM [48], FPF [49], Gapprox [50], HybridGMiner, FCPMiner [51], RING [52], SCMiner [53], GraphSig [54], FP-GraphMiner [55], gPrune [56], CLOSECUT [57], FSMA [58]
5.2 Classification Based on Search Strategy
There are two search strategies for finding frequent subgraphs: breadth-first search (BFS) and depth-first search (DFS).
5.3 Classification Based on Nature of the Input
Depending on the input type of the algorithms, two categories can be distinguished:
5.3.1 Single Graph Database
The database consists of a single large graph.
5.3.2 Transactional Graph Database
The database consists of a large number of small graphs. Figure 3 shows a transactional graph database (left side: g1, g2, and g3) and the frequency of two frequent subgraphs (right side).
Figure.3. A database consisting of three graphs g1, g2, g3 and two subgraphs and the frequency of each
5.4 Classification Based on Nature of the Output
5.4.1 Completeness of the Output
While some algorithms find all frequent patterns, others mine only part of the frequent patterns. Completeness is closely related to performance: when the total size of the dataset is very large, it may be better to use algorithms that are faster in execution, so that performance does not degrade, even though not all frequent patterns are mined. Table 2 lists algorithms by completeness of output [29].
Table 2. Completeness of Output
Complete output: FARMER, gSpan, FFSM, Gaston, FSG, HSIGRAM
Incomplete output: SUBDUE, GREW, CloseGraph, ISG
5.4.2 Constraint-Based
As the size of the database grows, the number of frequent patterns increases. This makes storage and analysis more difficult, as more memory space is needed. Reducing the number of frequent patterns without losing information is achievable by mining and maintaining more comprehensive patterns. Since every subpattern of a frequent pattern is itself frequent, the
whole set of subpatterns need not be stored; to obtain more compact results, the following notions are used:
5.4.2.1 Maximal Pattern
A subgraph g1 is a maximal pattern if it is frequent and there is no frequent super-pattern g2 such that g2 ⊃ g1.
5.4.2.2 Closed Pattern
A subgraph g1 is closed if it is frequent and there is no frequent super-pattern g2 with g2 ⊃ g1 and support(g2) = support(g1). Table 3 lists maximal and closed subgraph mining algorithms.
Table 3. Frequent Subgraph Mining (Constrained)
Maximal: SPIN, MARGIN, ISG, GREW
Closed: CloseGraph, CLOSECUT, TSP, RP-FP, RP-GD
5.5 Logic-Based Mining
Also known as inductive logic programming (ILP), an area of machine learning applied mainly in biology, this approach uses inductive logic to represent structured data. The core of ILP uses logic-based representations for search, with basic assumptions about the structure derived from background knowledge (e.g. WARMR, FOIL, and C-PROGOL) [29]. Table 4 lists the pattern-growth-based and Table 5 the apriori-based algorithms, categorized from different aspects [59] [27] [60] [61] [62] [28] [30] [63].
Table 4. Frequent Subgraph Mining Algorithms (Pattern Growth-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
gSpan | Set of graphs | Adjacency matrix | Rightmost extension | DFS
CloseGraph | Set of graphs | Adjacency matrix | Rightmost extension | DFS
Gaston | Set of graphs | Hash table | Extension | DFS
TSP | Set of graphs | Adjacency matrix | Extension | TSP tree
MoFa | Set of graphs | Adjacency matrix | Rightmost extension | DFS
RP-FP | Set of graphs | Adjacency matrix | Rightmost extension | DFS
RP-GD | Set of graphs | Adjacency matrix | Rightmost extension | DFS
JPMiner | Set of graphs | Adjacency matrix | Rightmost extension | DFS
MSPAN | Set of graphs | Adjacency matrix | Rightmost extension | DFS
FP-GraphMiner | Set of graphs | BitCode | Extension | DFS
gPrune | Set of graphs | Adjacency matrix | Iteration extension | M-DFSC
FSMA | Set of graphs | Incidence matrix | Extension | Normalized matrix
RING | Set of graphs | Invariant vector | Extension | R-tree, DFS
GraphSig | Set of graphs | Feature vector | Merge and extension | DFS
Table 5. Frequent Subgraph Mining Algorithms (Apriori-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
SUBDUE | Single large graph | Adjacency matrix | Level-wise search | MDFS
FARMER | Set of graphs | Trie structure | Level-wise search, ILP | Trie data structure
FSG | Set of graphs | Adjacency list | One-edge extension | TID list
HSIGRAM | Single large graph | Adjacency matrix | Iterative merging | Maximal independent set
GREW | Single large graph | Sparse graph | Iterative merging | Maximal independent set
FFSM | Set of graphs | Adjacency matrix | Merging and extension | Suboptimal canonical adjacency matrix tree
ISG | Set of graphs | Edge triple | Edge triple extension | TID list
SPIN | Set of graphs | Adjacency matrix | Join operation | Canonical spanning tree
Dynamic GREW | Set of graphs | Sparse graph | Iterative merging | Suffix trees
AGM | Set of graphs | Adjacency matrix | Vertex extension | Canonical labeling
MUSE | Set of graphs | Search tree | Disjunctive normal form | DFS coding
MARGIN | Set of graphs | Lattice | Join | CAM
AcGM | Set of graphs | Adjacency matrix | Join | CAM
gFSG | Set of graphs | Adjacency matrix | Iterative merging | Hash tree
Here several algorithms related to graph/tree mining are discussed in more detail.
• Gp-Growth Algorithm
The algorithm consists of three main steps:
1. Candidate generation by join operation.
2. Use of a new method for tree representation, with a lookup table that allows quick access to node information in the candidate generation phase without having to re-read the trees of the database.
3. Use of rightmost expansion for candidate generation, which guarantees that no duplicate candidates are generated.
This algorithm uses a lookup table, implemented as a hash table, to store information about the input trees. The key part is represented as the pair (T, pos), where T is the identifier of the input tree and pos is the node's number in a preorder traversal; the value part is represented as (l, s), where l is the label and s is the scope of the node. A new candidate is generated using the scope of each node: a node is added along the rightmost path, within the scope of the node it is attached to, and by repeating this process further frequent patterns are found [64].
• Fp-Graph Miner Algorithm
This algorithm uses the FP-growth method to find frequent subgraphs; its input is a set of graphs (a transactional database). First a BitCode is defined for each edge over the set of graphs: when the edge occurs in a graph, the corresponding bit is ‘1’ and otherwise ‘0’. A frequency table is then sorted in ascending order based on the BitCode of each edge; afterward, an FP-tree is constructed and frequent subgraphs are obtained through depth-first traversal [55].
6. FREQUENT SUBTREE MINING ALGORITHMS CLASSIFICATION
6.1 Tree Representation
A tree can be encoded as a sequence of nodes and edges. Some of the most important ways of encoding trees are introduced below:
6.1.1 DLS (Depth Label Sequence)
Let T be a labeled ordered tree; a depth-label pair (d(v), l(v)), giving the depth and label of each node, is defined for every node in V.
During a DFS traversal of T, the pairs (d(vi), l(vi)) are appended to a string s, giving the depth-label sequence of T as {(d(v1), l(v1)), …, (d(vk), l(vk))}. For instance, the DLS for the tree in Figure 4 is:
{(0,a),(1,b),(2,e),(3,a),(1,c),(2,f),(3,b),(3,d),(2,a),(1,d),(2,f),(3,c)}
6.1.2 DFS-LS (Depth First Sequence - Label Sequence)
Given a labeled ordered tree T, labels are appended to a string s during a DFS traversal of T; on each backtrack, ‘-1’ or ‘$’ or ‘/’ is added
to the string. The DFS-LS code for the tree T in Figure 4 is {abea$$$cfb$d$$a$$dfc$$$}.
6.1.3 BFCS (Breadth First Canonical String)
Let T be an unordered tree. Several encoded strings can be generated using the BFS method by changing the order of the children of a node; the BFCS of T is the lexicographically smallest of these encoded strings. The BFCS of the tree in Figure 4 is {a$bcd$e$fa$f$a$bd$$c#}.
6.1.4 CPS (Consolidated Prufer Sequence)
Let T be a labeled tree. The CPS encoding consists of two parts: NPS, an extended Prufer sequence obtained using vertex numbers as unique labels; and LS (Label Sequence), the sequence of labels obtained in postfix order as the leaves are removed. Together, NPS and LS give a unique encoding of the labeled tree. For the tree in Figure 4 they are, respectively: {ebaffccafda-} and {aebbdfaccfda}. To obtain NPS, a leaf is removed from the tree at each step and the label of its parent is output; this is repeated until only the root remains, and ‘-’ is appended to mark the end of the string. LS is the corresponding postfix traversal of the tree. Table 6 lists the tree representations used by different algorithms [65].
Figure.4. A Tree Example
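The DLS and DFS-LS encodings of Sections 6.1.1 and 6.1.2 can be sketched as follows; the tree literal below reconstructs the Figure 4 example from the sequences given in the text, and ‘$’ is used as the backtrack symbol (a sketch, not the authors' implementation):

```python
# Sketch of the DLS and DFS-LS encodings for the Figure 4 tree.
# A node is a (label, children) pair; '$' marks each DFS backtrack.

def dls(node, depth=0, out=None):
    """Depth-label sequence: (depth, label) pairs in preorder."""
    out = [] if out is None else out
    label, children = node
    out.append((depth, label))
    for child in children:
        dls(child, depth + 1, out)
    return out

def dfs_ls(node):
    """DFS label sequence; a '$' is appended after each subtree."""
    label, children = node
    return label + "".join(dfs_ls(child) + "$" for child in children)

# The Figure 4 tree, reconstructed from the DLS listed in the text.
tree = ("a", [("b", [("e", [("a", [])])]),
              ("c", [("f", [("b", []), ("d", [])]), ("a", [])]),
              ("d", [("f", [("c", [])])])])

print(dfs_ls(tree))  # abea$$$cfb$d$$a$$dfc$$$ — the DFS-LS string in the text
```

Note that the root itself gets no trailing ‘$’, which matches the string given in Section 6.1.2, and `dls(tree)` reproduces the depth-label pairs listed in Section 6.1.1.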
Table 6. Frequent Subtree Mining Algorithms (Tree Representation)
Algorithm | Tree Representation
uFreqt | DLS
SLEUTH | DFS-LS
Unot | DLS
Path Join | FST-Forest
RootedTreeMiner [66] | BFCS
FREQT | DLS
TreeMiner | DFS-LS
Chopper | DLS
XSPanner | DLS
AMIOT | DFS string
IMB3Miner | DFS-LS
TRIPS | CPS
FreeTreeMiner | BFCS
CMTreeMiner | DFS-LS
HybridTreeMiner [67] | BFCS
GP-Growth | DFS-LS
6.2 Input Types
6.2.1 Rooted Ordered Trees
A rooted ordered tree is a tree in which a single node is designated as the root and there is an order among the children of each node, so that each child is greater than or equal to the siblings placed at its left-hand side and less than or equal to those placed at its right-hand side. If we relax this definition so that no order among siblings is required, we have a rooted unordered tree. Table 7 lists rooted ordered tree mining algorithms.
Table 7. Rooted Ordered Tree Mining Algorithms
Induced: FREQT [68], AMIOT [69], IMB3Miner [70], TRIPS [65], TIDES [65]
Embedded: TreeMiner [71], Chopper [72], XSPanner [72], IMB3-Miner
6.2.2 Rooted Unordered Trees
In this type of tree, a node is designated as the root, but there is no particular order among the children of each node. Table 8 lists rooted unordered tree mining algorithms.
Table 8. Rooted Unordered Tree Mining Algorithms
Induced: uFreqt [73], Unot [74], PathJoin [65], Rooted TreeMiner [75]
Embedded: TreeFinder [76], Cousin Pair [77], SLEUTH [78]
6.3 Tree-Based Data Mining
Frequent subtree mining algorithms can be categorized into two major categories, apriori-based and pattern-growth-based. Table 9 lists the apriori and pattern growth algorithms for trees [79] [76] [80].
Table 9. Frequent Subtree Mining Algorithms
Apriori: TreeFinder, AMIOT, FreeTreeMiner, TreeMiner [81], SLEUTH, CMTreeMiner [82], Pattern Matcher [71], W3Miner [83], FTMiner [84], CFFTree [85], IMB3-Miner, uFreqt, Unot
Pattern Growth: FREQT, TRIPS, TIDES, Path Join, XSPanner, Chopper, PrefixTreeISpan [86], PCITMiner [87], F3TM [88], GP-Growth [64]
7. CONCLUSIONS AND FUTURE WORKS
Frequent subgraph mining algorithms were first examined from different viewpoints, such as ways of representing a graph (e.g. adjacency matrix and adjacency list), generation of subgraphs, frequency counting, pattern-growth-based and apriori-based classification, search-based classification, input-based classification (single, transactional), and output-based classification. Mining based on logic was also discussed. Afterward, frequent subtree mining algorithms were examined from different viewpoints, such as tree representation methods, type of input, tree-based mining, and mining based on constraints on outputs. Given the results, it is concluded that, because pattern-growth methods avoid generating every possible candidate pattern, they require less computation and a smaller memory size. Moreover, these algorithms are specifically designed for trees and graphs and cannot be used for other purposes. On the other hand, as they work on a variety of datasets, it is not easy to find trade-offs between them. The same frequent patterns can be used for similarity search, indexing, and classifying graphs and documents in future studies. Parallel methods and technologies such as Hadoop may also be needed when working with excessive data volumes.
8. ACKNOWLEDGMENTS
The authors are thankful to Mohammad Reza Abbasifard for his support of the investigations.
REFERENCES
[1] A. Rajaraman, J. D. Ullman, 2012. Mining of Massive Datasets, 2nd ed.
[2] J. Han, M. Kamber, 2006. Data Mining: Concepts and Techniques. USA: Diane Cerra.
[3] Kuramochi, Michihiro, and G. Karypis, 2004. An efficient algorithm for discovering frequent subgraphs, IEEE Transactions on Knowledge and Data Engineering, pp. 1038-1051.
[4] J. Huan, W. Wang, J. Prins, 2003. Efficient mining of frequent subgraphs in the presence of isomorphism, in Third IEEE International Conference on Data Mining (ICDM).
[5] (2013, Dec.) Trust Network Datasets - TrustLet. [Online]. http://www.trustlet.org
[6] L. Yan, J. Wang, 2011. Extracting regular behaviors from social media networks, in Third International Conference on Multimedia Information Networking and Security.
[7] Ivancsy, Renata, I. Vajk, 2009. Clustering XML documents using frequent subtrees, Advances in Focused Retrieval, Vol. 3, pp. 436-445.
[8] J. Yuan, X. Li, L. Ma, 2008. An improved XML document clustering using path features, in Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 2.
[9] Lee, Wenke, and Salvatore J. Stolfo, 2000.
A framework for constructing features and models for intrusion detection systems, ACM Transactions on Information and System Security (TISSEC), pp. 227-261.
[10] Ko, C., 2000. Logic induction of valid behavior specifications for intrusion detection, in IEEE Symposium on Security and Privacy (S&P), pp. 142-155.
[11] Yoshida, K. and Motoda, 1995. CLIP: Concept learning from inference patterns, Artificial Intelligence, pp. 63-92.
[12] Wasserman, S., Faust, K., and Iacobucci, D., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press.
[13] Berendt, B., Hotho, A., and Stumme, G., 2002. Semantic web mining, in International Semantic Web Conference (ISWC), pp. 264-278.
[14] S. C. Manekar, M. Narnaware, May 2013. Indexing frequent subgraphs in large graph databases using parallelization, International Journal of Science and Research (IJSR), Vol. 2, No. 5.
[15] Peng, Tao, et al., 2010. A graph indexing approach for content-based recommendation systems, in IEEE Second International Conference on Multimedia and Information Technology (MMIT), pp. 93-97.
[16] S. Sakr, E. Pardede, 2011. Graph Data Management: Techniques and Applications. Information Science Reference.
[17] Y. Xiaogang, T. Ye, P. Tao, C. Canfeng, M. Jian, 2010. Semantic-based graph index for mobile photo search, in Second International Workshop on Education Technology and Computer Science, pp. 193-197.
[18] Yildirim, Hilmi, and Mohammed Javeed Zaki, 2010. Graph indexing for reachability queries, in 26th International Conference on Data Engineering Workshops (ICDEW), IEEE, pp. 321-324.
[19] R. Ivancsy and I. Vajk, 2006. Frequent pattern mining in web log data, Acta Polytechnica Hungarica, pp. 77-90.
[20] G. Xu, Y. Zhang, L. Li, 2010. Web Mining and Social Networking. Melbourne: Springer.
[21] S. Ranu, A. K. Singh, 2010. Indexing and mining topological patterns for drug, in ACM, Data Mining and Knowledge Discovery, Berlin, Germany.
[22] (2013, Dec.) Drug Information Portal. [Online]. http://druginfo.nlm.nih.gov
[23] (2013, Dec.) DrugBank. [Online]. http://www.drugbank.ca
[24] Dehaspe, Toivonen, and King, R. D., 1998.
Finding frequent substructures in chemical compounds, in Proc. of the 4th ACM International Conference on Knowledge Discovery and Data Mining, pp. 30-36.
[25] Kramer, S., De Raedt, L., and Helma, C., 2001. Molecular feature mining in HIV data, in Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136-143.
[26] Gonzalez, J., Holder, L. B. and Cook, 2001. Application of graph-based concept learning to the predictive toxicology domain, in Proc. of the
Predictive Toxicology Challenge Workshop.
[27] H. J. Patel, R. Prajapati, M. Panchal, M. Patel, Jan. 2013. A survey of graph pattern mining algorithms and techniques, International Journal of Application or Innovation in Engineering & Management (IJAIEM), Vol. 2, No. 1.
[28] K. Lakshmi, T. Meyyappan, 2012. Frequent subgraph mining algorithms - a survey and framework for classification, Computer Science and Information Technology, pp. 189-202.
[29] D. Kavitha, B. V. Manikyala Rao and V. Kishore Babu, 2011. A survey on assorted approaches to graph data mining, International Journal of Computer Applications, pp. 43-46.
[30] C. C. Aggarwal, Wang, Haixun, 2010. Managing and Mining Graph Data. Springer.
[31] B. Wackersreuther, et al., 2010. Frequent subgraph discovery in dynamic networks, in ACM, Proceedings of the Eighth Workshop on Mining and Learning with Graphs, Washington DC, USA, pp. 155-162.
[32] Kuramochi, Michihiro, and G. Karypis, 2004. GREW - a scalable frequent subgraph discovery algorithm, in Fourth IEEE International Conference on Data Mining (ICDM), pp. 439-442.
[33] Huan, Jun, 2004. SPIN: mining maximal frequent subgraphs from graph databases, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[34] Borgwardt, Karsten M., H-P. Kriegel, and P. Wackersreuther, 2006. Pattern mining in frequent dynamic subgraphs, in Sixth International Conference on Data Mining (ICDM), pp. 818-822.
[35] Inokuchi, Akihiro, T. Washio, and H. Motoda, 2000. An apriori-based algorithm for mining frequent substructures from graph data, in Principles of Data Mining and Knowledge Discovery, pp. 13-23, Springer Berlin Heidelberg.
[36] Zou, Zhaonian, et al., 2009. Frequent subgraph pattern mining on uncertain graph data, in Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp.
583-592. [37] Ketkar, N.S, Lawrence B.Holder, and D.J.Cook, 2005. Subdue: compression- based frequent pattern discovery in graph data, in ACM, Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, pp. 71-76. [38] A. Inokuchi, T. Washio, and H. Motoda, 2003. Complete mining of frequent patterns from graphs: Mining graph data, in Machine Learning, pp. 321-354.
[39] M. Kuramochi and G. Karypis, 2007. Discovering frequent geometric subgraphs. Information Systems, pp. 1101-1120.
[40] L. T. Thomas, S. R. Valluri and K. Karlapalem, 2006. MARGIN: Maximal frequent subgraph mining. In IEEE Sixth International Conference on Data Mining (ICDM), pp. 1097-1101.
[41] X. Yan and J. Han, 2002. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining, pp. 721-724.
[42] X. Yan and J. Han, 2003. CloseGraph: mining closed frequent graph patterns. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 286-295.
[43] S. Nijssen and J. N. Kok, 2005. The Gaston tool for frequent subgraph mining. Electronic Notes in Theoretical Computer Science, pp. 77-87.
[44] H.-P. Hsieh and C.-T. Li, 2010. Mining temporal subgraph patterns in heterogeneous information networks. In IEEE Second International Conference on Social Computing (SocialCom), pp. 282-287.
[45] M. Wörlein et al., 2005. A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In Knowledge Discovery in Databases: PKDD, Springer Berlin Heidelberg, pp. 392-403.
[46] S. J. Suryawanshi and S. M. Kamalapur, Mar. 2013. Algorithms for Frequent Subgraph Mining. International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, No. 3.
[47] Y. Liu, J. Li and H. Gao, 2009. JPMiner: mining frequent jump patterns from graph databases. In IEEE Sixth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 114-118.
[48] S. Reinhardt and G. Karypis, 2007. A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-8.
[49] F. Schreiber and H. Schwöbbermeyer, 2005. Frequency concepts and pattern detection for the analysis of motifs in networks. In Transactions on Computational Systems Biology III, Springer Berlin Heidelberg, pp. 89-104.
[50] C. Chen et al., 2007. gApprox: Mining frequent approximate patterns from a massive network. In Seventh IEEE International Conference on Data Mining (ICDM), pp. 445-450.
[51] Y. Ke, J. Cheng and J. X. Yu, 2009. Efficient discovery of frequent correlated subgraph pairs. In Ninth IEEE International Conference on Data Mining (ICDM), pp. 239-248.
[52] S. Zhang, J. Yang and S. Li, 2009. RING: An integrated method for frequent representative subgraph mining. In Ninth IEEE International Conference on Data Mining (ICDM), pp. 1082-1087.
[53] E. Fromont, C. Robardet and A. Prado, 2009. Constraint-based subspace clustering. In International Conference on Data Mining, pp. 26-37.
[54] S. Ranu and A. K. Singh, 2009. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In IEEE 25th International Conference on Data Engineering (ICDE), pp. 844-855.
[55] R. Vijayalakshmi, R. Nadarajan, J. F. Roddick and M. Thilaga, 2011. FP-GraphMiner: A Fast Frequent Pattern Mining Algorithm for Network Graphs. Journal of Graph Algorithms and Applications, Vol. 15, pp. 753-776.
[56] F. Zhu et al., 2007. gPrune: a constraint pushing framework for graph pattern mining. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 388-400.
[57] X. Yan, X. Zhou and J. Han, 2005. Mining closed relational graphs with connectivity constraints. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 324-333.
[58] J. Wu and L. Chen, 2008. A fast frequent subgraph mining algorithm. In The 9th International Conference for Young Computer Scientists (ICYCS), pp. 82-87.
[59] V. Krishna, N. N. R. R. Suri and G. Athithan, 2011. A comparative survey of algorithms for frequent subgraph discovery. Current Science (Bangalore), pp. 1980-1988.
[60] K. Lakshmi and T. Meyyappan, Apr. 2012. A Comparative Study of Frequent Subgraph Mining Algorithms. International Journal of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 2.
[61] C. Jiang, F. Coenen and M. Zito, 2004. A Survey of Frequent Subgraph Mining Algorithms. The Knowledge Engineering Review, pp. 1-31.
[62] M. Gholami and A. Salajegheh, Sep. 2012. A Survey on Algorithms of Mining Frequent Subgraphs. International Journal of Engineering Inventions, Vol. 1, No. 5, pp. 60-63.
[63] V. Singh and D. Garg, Jul. 2011. Survey of Finding Frequent Patterns in Graph Mining: Algorithms and Techniques. International Journal of Soft Computing and Engineering (IJSCE), Vol. 1, No. 3.
[64] M. M. A. Hussein, T. H. Soliman and O. H. Karam, 2007. GP-Growth: A New Algorithm for Mining Frequent Embedded Subtrees. In 12th IEEE Symposium on Computers and Communications.
[65] S. Tatikonda, S. Parthasarathy and T. Kurc, 2006. TRIPS and TIDES: new algorithms for tree mining. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management.
[66] J.-H. Tung, 2006. MINT: Mining Frequent Rooted Induced Unordered Trees without Candidate Generation.
[67] Y. Chi, Y. Yang and R. R. Muntz, 2004. HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management.
[68] T. Asai, H. Arimura, T. Uno, S. Nakano and K. Satoh, 2008. Efficient tree mining using reverse search.
[69] S. Hido and H. Kawano, 2005. AMIOT: Induced Ordered Tree Mining in Tree-structured Databases. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05).
[70] H. Tan, T. S. Dillon, F. Hadzic, E. Chang and L. Feng, 2006. IMB3-Miner: Mining Induced/Embedded Subtrees by Constraining the Level of Embedding. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 450-461.
[71] M. J. Zaki, 2002. Efficiently mining frequent trees in a forest. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), pp. 71-80.
[72] C. Wang, M. Hong, J. Pei, H. Zhou and W. Wang, 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 441-451.
[73] S. Nijssen and J. N. Kok, 2003. Efficient Discovery of Frequent Unordered Trees. In Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pp. 55-64.
[74] T. Asai, H. Arimura, T. Uno and S. Nakano, 2003. Discovering Frequent Substructures in Large Unordered Trees. In Proceedings of the Sixth Conference on Discovery Science, pp. 47-61.
[75] Y. Chi, Y. Yang and R. Muntz, May 2004. Canonical Forms for Labeled Trees and Their Applications in Frequent Subtree Mining. Knowledge and Information Systems, No. 8.2, pp. 203-234.
[76] Y. Chi et al., 2005. Frequent subtree mining - an overview. Fundamenta Informaticae, pp. 161-198.
[77] D. Shasha, J. Tsong-Li Wang and S. Zhang, 2004. Unordered tree mining with applications to phylogeny. In IEEE Proceedings of the 20th International Conference on Data Engineering, pp. 708-719.
[78] M. J. Zaki, 2005. Efficiently Mining Frequent Embedded Unordered Trees. IOS Press, pp. 1-20.
[79] A. Jimenez, F. Berzal and J. C. Cubero, 2008. Mining induced and embedded subtrees in ordered, unordered, and partially-ordered trees. In IEEE Transactions on Knowledge and Data Engineering, pp. 111-120.
[80] A. Jimenez, F. Berzal and J. C. Cubero, 2006. Mining Different Kinds of Trees: A Tree Mining Overview. In Data Mining.
[81] B. Bringmann, 2004. Matching in Frequent Tree Discovery. In Fourth IEEE International Conference on Data Mining.
[82] Y. Chi et al., 2004. CMTreeMiner: Mining both closed and maximal frequent subtrees. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 63-73.
[83] R. AliMohammadzadeh et al., Aug. 2006. Complete Discovery of Weighted Frequent Subtrees in Tree-Structured Datasets. International Journal of Computer Science and Network Security (IJCSNS), Vol. 6, No. 8, pp. 188-196.
[84] J. Hu and X. Y. Li, Mar. 2009. Association Rules Mining Including Weak-Support Modes Using Novel Measures. WSEAS Transactions on Computers, Vol. 8, No. 3, pp. 559-568.
[85] P. Zhao and J. X. Yu, 2007. Mining closed frequent free trees in graph databases. In Advances in Databases: Concepts, Systems and Applications, Springer Berlin Heidelberg, pp. 91-102.
[86] L. Zou et al., 2006. PrefixTreeESpan: A pattern growth algorithm for mining embedded subtrees. In Web Information Systems (WISE), Springer Berlin Heidelberg, pp. 499-505.
[87] S. Kutty, R. Nayak and Y. Li, 2007. PCITMiner: prefix-based closed induced tree miner for finding closed induced frequent subtrees. In Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, Vol. 70, Australian Computer Society.
[88] P. Zhao and J. X. Yu, 2008. Fast frequent free tree mining in graph databases. World Wide Web, Springer, pp. 71-92.

This paper may be cited as: Dinari, H. and Naderi, H., 2014. A Survey of Frequent Subgraphs and Subtree Mining Methods. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 39-57.
A Model for Implementation of IT Service Management in Zimbabwean State Universities
Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou
Department of Computer Science and Information Systems, Gweru, Zimbabwe

ABSTRACT
Several IT service management (ITSM) frameworks have been deployed and adopted by companies and institutes without redefining the framework into a model that suits their IT department's operating environment and requirements. An IT service management model is proposed for Zimbabwean universities; it takes a holistic approach by integrating Operational Level Agreements (OLAs), Service Level Agreements (SLAs) and IT Service Catalogues (ITSCs). The OLA is treated as the domain for describing IT service management, and its attainment is driven by organizational management and IT section personnel in alignment with the mission, vision and values of the organization. Explicitly defining OLAs will aid management in identifying key services and processes in both qualitative and quantitative form (SLAs). Once SLAs are defined, ITSCs can be formulated; these are both customer and IT-service-provider centric and act as the nucleus of the model. Redefining IT service management from this perspective will result in deriving value from ITSM frameworks and in customer satisfaction.

Keywords: SLAs, OLAs, ITSCs, ITSM.

1. INTRODUCTION
IT service management is a modern concept adopted by the IT community for improved IT service delivery and productivity, aiming to attain customer satisfaction and control costs. IT service management integrates IT service provisioning between service providers and end users to arrive at an end-to-end service through measures such as Service Level Agreements (SLAs), Operational Level Agreements (OLAs) and IT Service Catalogues (ITSCs) (Almeroth & Hasan, 2002). Service management frameworks such as Control Objectives for Information and related Technology (COBIT) and the IT Infrastructure Library (ITIL) have been developed in the IT industry, but they have not been tailored to a specific IT section given its operating environment and
constraints. IT service is the nucleus of the business processes at a university: it supports academic research, learning and teaching. Universities offer IT services to staff, researchers, students, visitors and partners on platforms such as electronic learning (e-learning), library services, the staff directory and email, and learning resources, all of which are crucial to learning, teaching and collaboration as the community becomes global. The IT department must offer better services to these stakeholders in a resource-constrained environment (staff and financial resources) (University of Birmingham, 2014).

2. RELATED WORKS
An ITS service consists of three key elements: Service Level Agreements (SLAs), Operational Level Agreements (OLAs) and service catalogue pages. Operational Level Agreements (OLAs) are agreements between ITS teams, such as the hardware, software and networking teams, on how they will collaborate to ensure the appropriate service level is met for a particular service under the supervision of a coordinator; an OLA defines the expectations and commitments needed to deliver Service Level Agreements (SLAs) (University of California, 2012). Service Level Agreements (SLAs) are agreements between the Information Technology Services (ITS) team or teams and their clients which define the level of service the client should receive. An IT service catalogue is a database mapping an institute's available technological resources, products and IT services, both on offer and about to be rolled out (Griffiths, Lawes, & Sansbury, 2012; Moeller, 2013). The ITS service catalogue divides the services offered at an institute into components, together with the policies, guidelines and responsibilities of the parties involved, SLAs and delivery conditions (Bon et al., 2007). The service catalogue should be readily accessible to authorised users, enable them to create service requests on behalf of themselves and others, and contain facilities for approving service requests. IT service catalogues should be tested by both IT and key users so that the product complies with the prescribed technical functionality and usability metrics. The catalogue should be developed so that it facilitates effective communication between IT management and the stakeholders involved and acts as an effective tool for good governance (Griffiths et al., 2012; Moeller, 2013). Basically, an IT service catalogue is divided into a business service catalogue and a technical service catalogue. A business service catalogue is client centric and must meet users' requirements, so the user community should be engaged in requirements gathering and design. A technical service catalogue, by contrast, is service-provider centric and focuses on describing specific services in IT terms, including service constructs and their
interrelationships. The work processes of IT managerial and technical staff are explicitly defined, and access to the technical service catalogue is mainly restricted within the organization (Troy, Rodrigo, & Bill, 2007). An SLA should consist of the following elements: placement of services into categories (sections of the catalogue); listing of each category as a service catalogue section; establishing integrated/packaged/bundled service products; identification of modular service products; definition of each service product; establishing the service owner and supplier; defining procurement procedures (how, and at what cost); specifying service level metrics (availability, reliability, response); defining the limits of the service; and defining the customer's responsibilities. It thus provides a basis for managing the relationship between the service provider and the customer, describing the agreement between them for the service to be delivered, including how the service is to be measured (Hiles, 2000). A service must bridge the developers' and engineers' point of view and the end-user's perspective, and it identifies the internal processes necessary to offer and maintain the services. Service change management and continuous process improvement are important in addressing stakeholders' needs (University of California, 2012).
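The SLA elements listed above can be thought of as fields of a single record per service. The following is a minimal sketch of such a record; all field names and example values are illustrative assumptions, not taken from the paper or from any ITIL specification.

```python
from dataclasses import dataclass, field

# Hypothetical record capturing the SLA elements named in the text:
# service definition, catalogue category, owner, service level metrics,
# limits of service, and customer responsibilities.
@dataclass
class ServiceLevelAgreement:
    service: str                       # definition of the service product
    category: str                      # catalogue section the service falls under
    owner: str                         # established service owner/supplier
    availability_pct: float            # service level metric: availability
    max_response_hours: float          # service level metric: response
    service_limits: list = field(default_factory=list)
    customer_responsibilities: list = field(default_factory=list)

sla = ServiceLevelAgreement(
    service="E-Learning platform",
    category="Learning services",
    owner="Software section",
    availability_pct=99.5,
    max_response_hours=4.0,
    service_limits=["business-hours support only"],
    customer_responsibilities=["report incidents via the service desk"],
)
print(sla.service, sla.availability_pct)
```

A structured record like this makes the measurement clause of the SLA concrete: each metric field names what is measured, and the lists bound the relationship between provider and customer.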
A service lifecycle basically focuses first on defining a service strategy and then maintaining and implementing it; second on service design, which covers the methodology and architectural design for offering the service; third on service transition, which covers testing and integration of the services offered for quality and control compliance; and finally on service operation, which covers the smooth running of daily IT services, with continual improvement aligning the lifecycle stages and so offering room for best practices and improved value delivery (Office of Government Commerce, 2010). A Service Level Agreement (SLA) is a blueprint which governs service provision parameters between the service provider and the client (University of California, 2012). An SLA mainly consists of: the services being provided by the IT service provider and how they will be delivered (they must meet user requirements and the standards agreed upon by the parties involved, and be attainable, so communication is key throughout); definitions of key performance parameters; assignment of IT service provider personnel and users to measure performance using specific metrics (continuously monitoring, managing and measuring service level commitments); and identification of rewards or penalties levied depending on whether service delivery is effective or the services are failing (SLA metrics should have performance buffers to allow recovery from breaches) (Dube & Gulati, 2005; Lahti & Peterson, 2007).
4. METHODOLOGY
The research questions in this study examine ITS personnel's service delivery in relation to SLAs, OLAs and ITSCs. The research approach is the way the researcher approaches the research: either data is gathered and a theory formulated, or a theory and hypotheses are developed and then tested or validated. An inductive approach was adopted since it allowed the researchers to develop a theory during analysis of the collected data (Saunders, Lewis, & Thornhill, 2009). The researchers used questionnaires since they facilitated saturation; the questionnaires were distributed in proportion to the personnel in each ITS department team: 20 in the hardware section, 7 in the software section and 7 in the networking section. The response rates were 80%, 71.43% and 85.71% respectively. The data was coded manually.

5. RESULTS
The hardware section team is not aware of any agreements with the software team and the networking team which ensure that the appropriate service level is met for particular services within the ITS department. If OLA agreements are put in place, personnel felt that the ITS department director and/or other senior officers should facilitate and maintain them, since they increase efficiency and allow work processes to be aligned with organizational objectives. The hardware section team is also not aware of any agreements with the software and networking teams which define the level of service students and staff members should receive; these, they felt, should be led by the chief technician. Personnel act on intuition when called upon to perform work and tasks, or restrict themselves to those in their job descriptions.
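The response rates reported in the methodology follow directly from the questionnaire counts. The returned counts (16, 5 and 6) are inferred here from the stated percentages; the paper gives only the distributed counts and the rates.

```python
# Check the reported response rates against the questionnaire counts.
# Returned counts are inferred from the stated percentages, not given
# explicitly in the paper.
distributed = {"hardware": 20, "software": 7, "networking": 7}
returned = {"hardware": 16, "software": 5, "networking": 6}

rates = {team: round(100 * returned[team] / distributed[team], 2)
         for team in distributed}
print(rates)  # {'hardware': 80.0, 'software': 71.43, 'networking': 85.71}
```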
All respondents agreed that the adoption of SLAs will improve service delivery to clients and help set boundaries on personnel's duties and how they execute them with confidence. Furthermore, it results in process standardization and improved accuracy in the execution of tasks. 10% of respondents strongly agree, 60% agree, 15% are neutral and 15% disagree that the use of SLAs will improve and differentiate services by defining performance and its measures, which will help in building actionable performance tracking and controls. There is no policy on the IT services currently on offer and ready to be delivered; respondents felt these should be monitored by the supervisors responsible for the specific services offered. In hardware maintenance, personnel from other departments are called upon to carry out all related activities on an ad-hoc basis. ITSCs offer a platform to evaluate whether the services being offered meet the required standard. Top
management, such as directors and supervisors, are key stakeholders in the implementation of IT service management. The networking section team does not have any agreements with the software and hardware teams to ensure the appropriate service level is met for particular services in the ITS department. The service level which students and staff should receive is not defined, such as the uptime and download speed available on both the wireless and wired networks. The staff portal services and the students' electronic learning (e-learning) accounts monitored by the software team depend on network availability and server capacity, which are the responsibilities of the networking and hardware sections respectively, even though there are no OLAs among the departments concerned. Staff and students are only informally consulted on their requirements for the services offered by the ITS department. Students and staff members should be given a platform to request additional 'add-on' functionality for their e-learning and staff portal accounts.

IT service management model
A university-wide IT service management model was developed which consists of the Operational Level Agreements, viewed as the cornerstone of IT service management implementation; the Service Level Agreements, the sub-domain linking OLAs and ITSCs; and finally the IT service catalogues, referred to as the nucleus of IT service management. Leadership support from personnel such as IT directors, project managers and chief IT technicians is important, since they will initiate the setting of specific benchmarks for performance measurement and facilitate effective feedback mechanisms and communication. Top management will help in organizing seminars or workshops in the form of refresher courses or awareness campaigns about the execution of work processes.
Explicitly defining OLAs will aid management in identifying key services and processes in both qualitative and quantitative form, while monitoring them and taking corrective measures where necessary (SLAs). Once SLAs are defined, ITSCs can be formulated; these are both customer and IT-service-provider centric and act as the nucleus of the model. The services offered should be end-user centric rather than framed from the provider's point of view: for example, the website should be easy to navigate, and there must be a distinction between administrative issues and the other information displayed on the homepage. Support services, including how to access the website using mobile phones and which mobile browsers are supported or compatible, should be made available to clients. Additionally, key future plans such as a general upgrade of the site (the time it is expected to be down during maintenance should be communicated),
upgrading to a mobile site, modification of functionality on the webpage, and the phasing out of specific services should be communicated. Figure 1 shows the developed model.

Figure 1: IT service management implementation model. The model comprises:
OPERATIONAL LEVEL AGREEMENTS (IT service provider centric): definition of the services required to deliver services; explicitly defined responsibilities of the IT service provider and recipient.
SERVICE LEVEL AGREEMENTS: identify key services and processes to achieve the required goal; define services in qualitative and quantitative form; monitor the key services and processes while corrective measures are taken where necessary.
SERVICE CATALOGUE (customer centric): details of the services and products on offer; reports on website availability (response time, uptime percentage, etc.); support services (e.g. installation of preliminary software, mobile browser support/types of compatible mobile phones); key policies; terms and conditions; Service Level Agreements (SLAs); key future plans (upgrading to mobile, modification of functionality, phasing out of a service, etc.).
OLA DRIVING FORCES: leadership support; setting specific performance benchmarks; rewards and recognition, or penalties, in response to adopting OLAs; education and awareness campaigns for ITS department section personnel; ensuring an effective feedback mechanism and communication.
6. CONCLUSIONS
An enabling, collaborative approach to quality improvement should be explored by the ITS teams while involving their clients (staff and students) so that their needs are satisfied. In achieving ITSM, goals must be benchmarked and reviewed by a monitoring and evaluation committee steered by the project manager. The committee must ensure the availability of human and financial resources, for example by lobbying for top management support and for the training of employees. In addition, the committee should facilitate a cyclical communication system with stakeholders and top management so as to ensure their support and commitment even during the review process. The institution's goals, vision and mission should be aligned with the ITSM strategy adopted. A service catalogue, which acts as a blueprint for clients in understanding and making informed decisions about the services they use or intend to use, must always be made available to clients; it also acts as a benchmark for quality assurance on the services the ITS department offers. OLAs between the IT service provider and the procurement or other departments, to obtain hardware or other resources in agreed times, and between a service desk and a support group, to provide incident resolution in agreed times, should be defined to ensure the appropriate service level is met (Rudd, 2010). Adoption of OLAs will result in better service delivery and better management of duties and responsibilities. Universities must integrate the various IT teams within departments across their campuses while explicitly defining the implementation of SLAs, OLAs and ITSCs, and must also emphasise performance reporting facilitated by team leaders from all IT sections.
Additionally, institutes must identify the facilitating and hindering conditions for successful ITSM; this can be supported by conducting seminars and/or workshops on relevant IT aspects. Conducting post-training evaluation of ITSM deliberations will help in the continuous improvement of service delivery. Relating COBIT and ITIL to the IT service management constructs (OLAs, SLAs and ITSCs) presents an interesting area for further research.

REFERENCES
[1] Almeroth, K. C. and Hasan, M., 2002. Management of Multimedia on the Internet: 5th IFIP/IEEE International Conference on Management of Multimedia Networks and Services, MMNS 2002, Santa Barbara, CA, USA, October 6-9, 2002. CA: Springer, p. 356.
[2] Bon, J. van et al., 2007. IT Service Management: An Introduction. Van Haren Publishing, p. 514.
[3] Dube, D. P. and Gulati, V. P., 2005. Information System Audit and Assurance. Tata McGraw-Hill Education, p. 671.
[4] Griffiths, R., Lawes, A. and Sansbury, J., 2012. IT Service Management: A Guide for ITIL Foundation Exam Candidates. BCS, The Chartered Institute for IT, p. 200.
[5] Hiles, A., 2000. Service Level Agreements: Winning a Competitive Edge for Support & Supply Services. Rothstein Associates Inc, p. 287.
[6] Lahti, C. B. and Peterson, R., 2007. Sarbanes-Oxley IT Compliance Using Open Source Tools. Syngress, p. 466.
[7] Moeller, R. R., 2013. Executive's Guide to IT Governance: Improving Systems Processes with Service Management, COBIT, and ITIL. John Wiley & Sons, p. 416.
[8] Office of Government Commerce, 2010. Introduction to the ITIL Service Lifecycle. The Stationery Office, p. 247.
[9] Rudd, C., 2010. ITIL V3 Planning to Implement Service Management. The Stationery Office, p. 320.
[10] Saunders, M., Lewis, P. and Thornhill, A., 2009. Research Methods for Business Students. 5th ed. Pearson Education Limited, Essex, England.
[11] Troy, D. M., Rodrigo, F. and Bill, F., 2007. Defining IT Success Through the Service Catalog: A Practical Guide about the Positioning, Design and Deployment of an Actionable Catalog of IT Services. 1st ed. US: Van Haren Publishing.
[12] University of Birmingham, 2014. IT Services - University of Birmingham. [Online] Available at: <http://www.birmingham.ac.uk/university/professional/it/index.aspx> [Accessed 18 Mar. 2014].
[13] University of California, 2012. ITS Service Management: Key Elements. [Online] Available at: <http://its.ucsc.edu/itsm/servicemgmt.html> [Accessed 18 Mar. 2014].

This paper may be cited as: Zhou, M., Ruvinga, C., Musungwini, S. and Zhou, T. G., 2014. A Model for Implementation of IT Service Management in Zimbabwean State Universities. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 58-65.
Present a Way to Find Frequent Tree Patterns using Inverted Index
Saeid Tajedi
Department of Computer Engineering, Lorestan Science and Research Branch, Islamic Azad University, Lorestan, Iran
Hasan Naderi
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

ABSTRACT
Among all the patterns occurring in a tree database, mining frequent trees is of great importance. A frequent tree is one that occurs frequently in the tree database. Frequent subtrees are not only important in themselves but are also applicable to other tasks, such as tree clustering, classification, bioinformatics, etc. In this paper, after reviewing different methods of searching for frequent subtrees, a new method based on an inverted index is proposed for exploring frequent tree patterns. The procedure runs in two phases: passive and active. In the passive phase, we find subtrees in the dataset, convert them to strings and store them in the inverted index. In the active phase, we easily derive the desired frequent subtrees from the inverted index. The proposed approach tries to take advantage of times when the CPU is idle, so that CPU utilization is at its highest in the evaluation results. Because frequent subtree mining in the active phase is performed against the inverted index rather than directly against the dataset, the desired frequent subtrees are found in the fastest possible time. Another feature of the proposed method is that, unlike previous methods, adding a tree to the dataset does not require repeating the previous steps; in other words, the method performs well on dynamic trees. In addition, the proposed method is capable of interacting with the user.

Keywords: Tree Mining, Inverted Index, Frequent Pattern Mining, Tree Patterns.

1. INTRODUCTION
Data mining, or knowledge discovery, deals with finding interesting patterns or information hidden in large datasets. Recently, researchers have started proposing techniques for analyzing structured and semi-structured datasets. Such datasets can often be represented as graphs or trees. This has led to the development of numerous graph mining and tree mining algorithms in the literature. In this article we present an efficient algorithm for mining trees.
Data mining has evolved from association rule mining and sequence mining to tree mining and graph mining. Association rule mining and sequence mining are one-dimensional structure mining, while tree mining and graph mining are two-dimensional or higher structure mining. Applications of tree mining arise in Web usage mining, mining semi-structured data, bioinformatics, and elsewhere. The basic and fundamental ideas of tree mining were first seriously discussed in roughly the early '90s and were completed during that decade; the origin of these ideas lies in their applications, especially on the Web. First, some essential and basic concepts are described; then the proposed method is presented, and finally the results are evaluated.

2. Related Works
2.1 Pre-Order Tree Traversal
There are several ways to traverse an ordered tree; pre-order traversal is one of the most important and most widely used. It proceeds like the depth-first search algorithm: starting from the root of a tree T, we visit the root, then the left child, and finally the right child; this is done recursively on all nodes of the tree.

2.2 Post-Order Tree Traversal
This is also among the most important and widely used traversals of ordered trees. Here we first visit the left child of a tree T, then the right child, and finally the root, again recursively on all nodes of the tree. Using either traversal, we can assign a number to each node that represents the time at which the node is visited. When post-order traversal is used, this number is called the PON (post-order number).
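As a quick illustration of the two traversals, here is a minimal recursive sketch for ordered trees given as nested (label, children) tuples; the representation and function names are our own, for illustration only:

```python
def preorder(node, out=None):
    """Pre-order: visit the root, then the children left to right (DFS order)."""
    if out is None:
        out = []
    label, children = node
    out.append(label)
    for child in children:
        preorder(child, out)
    return out

def postorder(node, out=None):
    """Post-order: visit the children left to right, then the root.
    The 1-based position of a label in the result is the node's PON."""
    if out is None:
        out = []
    label, children = node
    for child in children:
        postorder(child, out)
    out.append(label)
    return out

# Example tree: root A with children B (which has child D) and C.
t = ("A", [("B", [("D", [])]), ("C", [])])
# preorder(t)  -> ["A", "B", "D", "C"]
# postorder(t) -> ["D", "B", "C", "A"]
```

The 1-based position of a label in the post-order result is exactly the PON described above; in the example, D gets PON 1 and the root A gets PON 4.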
2.3 RMP and LMP
LMP is the acronym for Left-Most Path, the path from the root to the leftmost leaf; RMP is the acronym for Right-Most Path, the path from the root to the rightmost leaf.

2.4 Prüfer Sequence [23]
This algorithm was introduced in 1918 and is used to convert a tree into a string. It works as follows: given a tree T, at every step the leaf with the smallest label is removed and the label of its parent is appended to the Prüfer sequence. This process is repeated n-2 times, until 2 nodes remain.
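A minimal sketch of the classical Prüfer encoding for an unrooted labeled tree given as an edge list (the function name and tree representation are illustrative; the paper later extends the idea to n steps using post-order numbers):

```python
from collections import defaultdict

def prufer_sequence(n, edges):
    """Classical Pruefer encoding: repeatedly remove the leaf with the
    smallest label and record its parent's label, until 2 nodes remain."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        # The leaf (degree-1 node) with the smallest label is removed next.
        leaf = min(node for node in adj if len(adj[node]) == 1)
        parent = next(iter(adj[leaf]))
        seq.append(parent)
        adj[parent].remove(leaf)
        del adj[leaf]
    return seq

# Star with center 4: leaves 1, 2, 3 are removed in label order.
print(prufer_sequence(4, [(1, 4), (2, 4), (3, 4)]))  # -> [4, 4]
```

A tree on n nodes always produces a sequence of length n-2; for the path 1-2-3-4 the result is [2, 3].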
2.5 Label Sequence
The next concept is the label sequence. This sequence is produced according to the post-order traversal: as each node is visited in post-order, its label is appended to the sequence.

2.6 Support
Informally, the support of a pattern S indicates how often S occurs across a database of trees:

    support(S, D) = |{T in D : S occurs in T}| / |D|    (1)

where S is a tree pattern and D is a database of trees. This concept is used to determine in how many of the trees each subtree occurs.

2.7 Inverted Index [24]
An inverted index is a structure used to index frequent string elements in a set of documents. It consists of two main parts: a dictionary and posting lists. Each frequent string element is stored uniquely in the dictionary, together with its total number of occurrences across all documents. Information about each element, such as the names of the documents containing it and the number of occurrences in each document, is stored in its posting list.

3. An overview of research history
In recent years, much research on frequent subtree mining has been done. Yongqiao Xiao et al. in 2003 used the Path Join algorithm and a compact data structure, called FST-Forest, to find frequent subtrees [25]. In this approach, frequent root paths are first found in all directions, and frequent subtrees are then obtained by joining these paths. Shirish Tatikonda et al. published an article in 2006 based on pattern growth [26]: all trees in the tree database are converted to strings, using two different methods (Prüfer sequences and the DFS algorithm); then, scanning all strings containing a subtree (pattern) S, a new edge that can be added to S is sought. Concurrently with the generation of candidate subtrees, their counts are checked against the threshold to decide whether they are frequent.
In 2009, Federico Del Razo Lopez et al. presented an idea for relaxing the tight constraints of non-fuzzy tree mining [27]. The paper uses the principle of partial inclusion: to say that a pattern S occurs in a tree T, it is not necessary for all of the pattern's nodes to exist in the tree. The proposed algorithm uses the Apriori property to prune undesirable patterns.
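Before turning to the proposed approach, the dictionary / posting-list structure of Section 2.7 can be sketched as follows (a simplified in-memory version; names are illustrative, and real systems add compression and other refinements):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document name -> list of terms.
    Returns {term: {doc_name: occurrence_count}}: the outer keys form
    the dictionary, and each inner mapping plays the role of the
    posting list (which documents contain the term, and how often)."""
    index = defaultdict(dict)
    for name, terms in docs.items():
        for term in terms:
            index[term][name] = index[term].get(name, 0) + 1
    return index

docs = {"d1": ["a", "b", "a"], "d2": ["b", "c"]}
idx = build_inverted_index(docs)
# idx["a"] == {"d1": 2}; idx["b"] == {"d1": 1, "d2": 1}
```

The total occurrence count kept in the dictionary of Section 2.7 is simply the sum of the per-document counts in the posting list.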
4. The proposed approach
The procedure runs in two phases: passive and active. In the passive phase, we first find all subtrees of all trees and store them in an inverted index. In the active phase, we simply use the index to extract frequent tree patterns.

4.1 Passive Phase
This phase has two stages. In the first stage, we find all subtrees of every tree in the dataset and convert each to a string associated with that tree; in the second stage, the strings produced in the first stage are stored in the inverted index.

4.1.1 First stage of the passive phase
The first important point is that a node label can be repeated many times within a tree, yet every node must be identified uniquely; to solve this, we use the Prüfer sequence method. Each tree is traversed in post-order, and the Prüfer sequence algorithm works on the resulting PONs; as a result, each node of a tree is marked with a unique number. The next issue is that the Prüfer sequence must cover all the nodes; therefore, the algorithm runs for n steps rather than n-2, and the number 0 is used in place of the parent label of the last node (the root). Figure 1 shows an example of this method, where NPS denotes the Prüfer sequence obtained using post-order numbering. Next, every subtree should be represented uniquely; to this end, we compute the CPS for each tree. The CPS merges the Prüfer sequence and the label sequence: CPS(T) = (NPS, LS)(T). A CPS uniquely represents a rooted, labeled tree. As Figure 1 shows, the tree T1 can be represented uniquely by these two complementary strings.

Figure 1. An example of the Prüfer sequence and label sequence for tree T1
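Under one plausible reading of Figure 1, the NPS and LS of a rooted ordered tree (again given as nested (label, children) tuples) can be computed as below. The exact layout of the CPS strings in the paper's figures may differ, so treat this as an assumption-laden sketch:

```python
def cps(root):
    """Compute NPS (each node's parent's post-order number, 0 for the
    root) and LS (node labels), both listed in post-order, for a rooted
    ordered tree given as nested (label, [children]) tuples."""
    pon = {}        # object id of node -> post-order number
    order = []      # (label, parent object id) in post-order
    counter = [0]

    def visit(node, parent_id):
        label, children = node
        node_id = id(node)
        for child in children:
            visit(child, node_id)
        counter[0] += 1          # assign PON when leaving the node
        pon[node_id] = counter[0]
        order.append((label, parent_id))

    visit(root, None)
    nps = [pon.get(p, 0) for _, p in order]   # root's "parent" is 0
    ls = [label for label, _ in order]
    return nps, ls

# Root A with ordered children C and B (three nodes).
t = ("A", [("C", []), ("B", [])])
nps, ls = cps(t)
# post-order is C, B, A -> ls == ["C", "B", "A"], nps == [3, 3, 0]
```

Pairing each label with its parent's PON yields per-node fragments such as C3, B3 and A0, which is consistent with CPS strings like A0C3B3 used later in the paper.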
The next requirement is to ensure that, for each tree, all subtrees are generated and each subtree is generated only once. For this purpose, we use the LMP to extend subtrees. If the tree T is represented by its Prüfer sequence and n is a subtree, then a node v to be added to n must lie on the LMP of T; and since the PON underlies the Prüfer sequence, v must come immediately after the last node of n in the Prüfer sequence of T. This guarantees that each subtree is generated only once, and when it is done for all nodes, every subtree of each tree is produced. The proposed algorithm for generating the subtrees and converting them to strings can be seen in Figure 2.

Insert CPS(T) into array A
for i = n downto 1 do {
    subtree = A[i]
    insert CPS(A[i]) into TreeString_i
    Sub(subtree, i, A, stack1, stack2)
}

Sub(subtree, index, A[], stack1, stack2) {
    c = 0
    t = 0
    for j = 1 to index - 1 do
        if index in A[j] then {
            stack3 = stack1
            stack4 = stack2
            subtree2 = subtree
            while stack3 not empty {
                t++
                pop x from stack3
                pop y from stack4
                subtree2 = subtree2 + x
                if t > 0 then {
                    insert CPS(subtree2) into TreeString_i
                    Sub(subtree2, y, A[], stack3, stack4)
                }
            }
            if c > 0 then {
                push tempTree onto stack1
                push tempIndex onto stack2
            }
            tempTree = A[j]
            tempIndex = j
            c++
            subtree = subtree + A[j]
            insert CPS(subtree) into TreeString_i
            Sub(subtree, j, A[], stack1, stack2)
        }
    while stack1 not empty {
        c--
        pop x from stack1
        pop y from stack2
        insert CPS(subtree + x) into TreeString_i
        Sub(subtree + x, y, A[], stack1, stack2)
    }
}

Figure 2. The algorithm for generating the subtrees and converting them to strings

In the following we examine how the algorithm works with an example. We begin with the first tree and store CPS(T) in the array A; for T1 the array is filled as shown in Figure 3.

Figure 3. Production of the array using CPS(T)

In this step we identify all existing subtrees and store them in a string. To do this, we start from the root node of T1, that is, the last element of the array, A0, and the subtrees branching from this node are stored in the string in turn. First, A0 is stored in the string according to the algorithm; next we run the Sub function. Given that the index of the previous node is 9, to find the subtrees with two nodes we scan from the first element of the array up to the element preceding the previous node, i.e. index 8. Whenever an element's value contains the index of the previous node (9), it is appended to the previous subtree, and the CPS of the newly found subtree is inserted into the tree's string; here A0C2 and A0E2 are stored in the string, and the same steps are repeated recursively for the newly generated subtrees. Since both produced subtrees branch from one node, the subtree with the smaller index and its index are popped from stack1 and stack2 respectively, appended to the subtree with the larger index, and the resulting CPS is stored in the string; in this step A0E3C3 is therefore also added, and this is repeated for all produced subtrees with larger indices in the next step.
Similarly, the work continues recursively until all subtrees branching from the first node of the array are stored in the string. The same procedure is then applied to the next elements of the array until the string of subtrees of the tree is complete, and we then proceed to the next trees until, for each tree, a string covering all of its subtrees has been created.
4.1.2 Second stage of the passive phase
In the second stage of this phase we use the inverted index: the strings created in the previous stage are inserted into it. The CPS of each subtree and its number of occurrences across all trees are stored in the dictionary, and the names of the trees containing the subtree are stored in the corresponding posting list.

Figure 4. Part of the inverted index built for the tree collection T1, T2

As can be seen, the subtrees are stored in the dictionary and the parent trees of the corresponding subtrees are stored in the posting lists.

4.2 Active Phase
In this phase we simply use the inverted index built in the previous phase to extract frequent tree patterns. Various kinds of queries about frequent subtrees can be answered quickly using it. We now examine several types of queries.

4.2.1 Find the occurrences of a desired pattern in the tree set
First we obtain the CPS of the desired pattern, then search for it in the dictionary of the inverted index, and easily extract the number of occurrences and the names of the trees containing the pattern from its posting list. For example, to find the occurrences of the pattern S in the tree collection T1, T2 of Figure 5, we search for CPS(S), i.e. A0C3B3, in the inverted index; the result is T1 and T2.
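The second stage of the passive phase and the active-phase queries can be sketched as follows, assuming each tree has already been reduced to the CPS strings of its subtrees (the class and method names are illustrative, not the paper's):

```python
from collections import defaultdict

class SubtreeIndex:
    """Dictionary maps a subtree's CPS string to its posting list:
    the names of the trees that contain that subtree."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.n_trees = 0

    def add_tree(self, tree_name, cps_strings):
        # Indexing a new tree touches only its own subtree strings,
        # so a dynamic tree set needs no re-mining of old trees.
        self.n_trees += 1
        for s in cps_strings:
            self.postings[s].add(tree_name)

    def occurrences(self, cps_string):
        # Query of Section 4.2.1: which trees contain the pattern?
        return sorted(self.postings.get(cps_string, set()))

    def frequent(self, min_support):
        # Query of Section 4.2.2: keep subtrees whose posting-list
        # length relative to the total number of trees reaches the
        # support threshold.  Section 4.2.3 would additionally filter
        # on the node count decoded from each CPS string.
        return sorted(s for s, trees in self.postings.items()
                      if len(trees) / self.n_trees >= min_support)

idx = SubtreeIndex()
idx.add_tree("T1", ["A0", "A0C3B3"])
idx.add_tree("T2", ["A0", "A0C3B3", "D0"])
# idx.occurrences("A0C3B3") -> ["T1", "T2"]
```

With support 1.0 only A0 and A0C3B3 survive; lowering the threshold to 0.5 also admits D0, mirroring the posting-list-length test described above.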
Figure 5. Part of the inverted index built for the tree collection T1, T2

4.2.2 Find frequent subtrees given a support threshold
If we want to find the subtrees whose support is greater than a threshold, we must find the subtrees whose number of occurrences relative to the total number of trees exceeds the support. So we can search the inverted index and easily find the subtrees whose posting-list length relative to the total number of trees is at least equal to the support.

4.2.3 Find frequent subtrees given a support threshold and a minimum number of nodes
In this case, the number of nodes is a criterion in addition to the support, so we search the inverted index and return only the subtrees satisfying two conditions: first, the subtree's length in the dictionary is greater than the minimum number of nodes; second, the length of the corresponding posting list relative to the total number of trees is at least equal to the support.

5. Evaluation
In this section, the proposed method is evaluated from various aspects. We present an experimental evaluation of the proposed approach on synthetic datasets. In the following discussion, dataset sizes are expressed as numbers of trees. In the graphs, the label "Algorithm" denotes the proposed method. The names and details of the synthetic datasets are shown in Table 1.

Table 1. Names and details of the synthetic datasets

Name | Description
DS1  | -T 10 -V 100
DS2  | -T 10 -V 50

As shown in Table 1, the synthetic datasets DS1 and DS2 were generated with the PAFI [28] toolkit developed by Kuramochi and Karypis (PafiGen). Since PafiGen can create only graphs, we extracted spanning trees from these graphs for use in our analysis. We also used minsup to analyze the various factors: if a subtree's number of occurrences is less than the minsup value, the subtree is not indexed in the inverted index.
The minsup value ranges from 1 to infinity; its default value in the proposed algorithm is 1. We also use maxnode in the evaluations: maxnode specifies the maximum number of nodes per subtree in the inverted index. When the number of nodes in a subtree reaches the maxnode value, the proposed algorithm stops producing its subtrees.
The maxnode value ranges from 1 to infinity, and its default value is infinity.

5.1 Evaluating the performance of the proposed method
We first evaluated our proposed algorithm on the two synthetic datasets DS1 and DS2. The performance of the proposed algorithm for frequent tree mining on the synthetic datasets is shown in Diagram 1. In this experiment, minsup equals one and maxnode equals infinity. Since the subtrees are indexed in the passive phase at times when the system is idle, the mining time on the inverted index rises with a gentle slope as the number of trees increases, which clearly shows that the introduced algorithm is scalable.

Diagram 1: The performance of the algorithm on the synthetic datasets (mining time vs. number of trees, 10K-50K, for DS1 and DS2)

5.2 Evaluating the effect of minsup on the number of indexed patterns
We examine the effect of minsup on the number of indexed patterns in Diagram 2. This experiment was done on the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K. In this experiment, maxnode has its default value, i.e. infinity. As the diagram shows, the number of indexed patterns increases exponentially as minsup decreases.

Diagram 2: Effect of minsup on the number of indexed patterns (indexed patterns, log scale, vs. minsup, for DS1 and DS2)
5.3 Evaluating the effect of maxnode on memory usage
We examine the effect of the maximum number of nodes in the indexed subtrees on memory usage in the passive phase. This experiment was done on the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K. In this experiment, minsup has its default value, i.e. 1. As can be seen, the memory usage of the algorithm increases as the number of indexed nodes per subtree increases.

Diagram 3: Effect of maxnode on memory usage (virtual memory in MB vs. maximum number of nodes per subtree, for DS1 and DS2)

5.4 Evaluation of CPU utilization compared with TreeMiner
Diagram 4 compares the proposed algorithm with TreeMiner, which was introduced by Zaki and is one of the best algorithms for tree mining [29]. This experiment was done on the synthetic dataset DS1 generated by PAFI, with size 50K. Since in the passive phase the proposed algorithm searches for subtrees and adds them to the inverted index, CPU utilization is close to 100 percent in most situations, as the diagram shows, while the average CPU utilization of the TreeMiner algorithm is approximately 90%.

Diagram 4: Comparison of CPU utilization between TreeMiner and the proposed algorithm (CPU utilization in % vs. number of trees, 10K-50K)
6. Conclusions and Recommendations
In this paper, a new method for frequent pattern mining based on the inverted index was introduced to overcome many of the disadvantages of previous methods. One problem with existing approaches is that they mainly operate statically on the set of trees: if a new tree is added, all mining operations must be redone from scratch. This problem is overcome by the inverted index in the proposed approach: all trees are indexed in the passive phase, and if a new tree is added to the tree set at any stage, only that tree is indexed and there is no need to repeat the previous operations. This gives the algorithm high performance on a collection of dynamic trees. Another advantage of this method over other methods is its scalability: as shown in Section 5.1, the performance of the algorithm does not degrade as the tree set grows. As shown in Section 5.4, one of the most striking features of this algorithm is its efficient use of the CPU. The method also supports user interaction. As shown in Section 5.2, the number of indexed patterns increases exponentially as minsup decreases, while patterns with low occurrence counts generally do not matter to us; as a result, we can speed up indexing in the passive phase by choosing an appropriate minsup value. As shown in Section 5.3, memory usage increases with the maximum number of nodes in the indexed subtrees, while subtrees with very large numbers of nodes usually do not matter to us; as a result, we can manage memory usage by choosing an appropriate maxnode value.

REFERENCES
[1] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets based on WIT-trees," International Journal of Advanced Computer Research, p. 9, 2013.
[2] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of Semi Structured data: a Survey," International Journal of Advanced Computer Research, p. 5, 2013.
[3] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[4] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[5] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and Data Mining, p. 13, 2013.
[6] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[7] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets from Sparse Data," Web-Age Information Management, p. 7, 2013.
[8] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent pattern mining over data streams," Advances in Knowledge Discovery and Data Mining, p. 15, 2014.
[9] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering Patterns of Reposting Behavior in Microblog," Advanced Data Mining and Applications, p. 13, 2013.
[10] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering weight conditions over data streams," Advances in Knowledge Discovery and Data Mining, 2014.
[11] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p. 12, 2013.
[12] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets based on WIT-trees," International Journal of Advanced Computer Research, p. 9, 2013.
[13] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of Semi Structured data: a Survey," International Journal of Advanced Computer Research, p. 5, 2013.
[14] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[15] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[16] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and Data Mining, p. 13, 2013.
[17] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[18] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets from Sparse Data," Web-Age Information Management, p. 7, 2013.
[19] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent pattern mining over data streams," International Journal of Advanced Computer Research, p. 15, 2014.
[20] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering Patterns of Reposting Behavior in Microblog," Advanced Data Mining and Applications, p. 13, 2013.
[21] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering weight conditions over data streams," International Journal of Advanced Computer Research, 2014.
[22] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p. 12, 2013.
[23] H. Prüfer. Prüfer sequence. Available: http://en.wikipedia.org/wiki/Pr%C3%BCfer_sequence
[24] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge, England: Cambridge University Press, 2008.
[25] Y. Xiao, J.-F. Yao, Z. Li, and M. H. Dunham, "Efficient data mining for maximal frequent subtrees," Proceedings of 3rd IEEE International Conference on Data Mining, p. 8, 2003.
[26] S. Tatikonda, S. Parthasarathy, and T. Kurc, "TRIPS and TIDES: New Algorithms for Tree Mining," Proceedings of 15th ACM International Conference on Information and Knowledge Management (CIKM), p. 12, 2006.
[27] F. D. R. Lopez, A. Laurent, P. Poncelet, and M. Teisseire, "FTMnodes: Fuzzy tree mining based on partial inclusion," Advanced Data Mining and Applications, pp. 2224-2240, 2009.
[28] M. Kuramochi and G. Karypis, PAFI. Available: http://glaros.dtc.umn.edu/gkhome/pafi/overview/
[29] M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest," Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, p. 10, 2002.

This paper may be cited as: Tajedi, S. and Naderi, H., 2014. Present a Way to Find Frequent Tree Patterns using Inverted Index. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 66-78.
An Approach for Customer Satisfaction: Evaluation and Validation

Amina El Kebbaj and A. Namir
Laboratory of Modeling and Information Technology, Department of Mathematics and Computer Science, Faculty of Sciences Ben M'sik, Hassan II-Mohammedia University, Casablanca - 7955, Morocco

ABSTRACT
The main objective of this work is to develop a practical approach to improving customer satisfaction, which is generally regarded as the pillar of customer loyalty to the company. Today, customer satisfaction is a major challenge. In fact, listening to the customer and anticipating and properly managing his claims are cornerstones and fundamental values for the enterprise. In terms of the quality of the product, the skills, and above all the service provided to the customer, it is essential for organizations to differentiate themselves, especially in an ever more competitive world, in order to ensure a higher level of customer satisfaction. Ignoring or failing to take into account customer satisfaction can have harmful consequences for both economic performance and the organization's image. It is therefore crucial to develop new methods and approaches to the problem of customer dissatisfaction by improving the quality of the services provided to the customer. This work describes a simple and practical approach to modeling customer satisfaction in organizations in order to reduce the level of dissatisfaction; the approach respects the constraints of the organization and eliminates any action that can lead to loss of customers or degradation of the organization's image. Finally, the approach presented in this document is tested and evaluated.

Keywords: Approach, evaluation, quality, satisfaction, test of homogeneity, validation.

1. INTRODUCTION
"Does the company have the most meaningful information at the right time to make the best possible business decisions?" is the question most companies want to answer. "The purpose of a company is to create and keep a customer" (Levitt, 1960): this statement clearly identifies the important phases of the customer-management life cycle, namely acquiring customers and ensuring their loyalty. Companies are moving towards "customer-oriented" management and focus on the life cycle of their customers. According to Moisand (2002), the life cycle of the customer is defined as the time interval during which a customer's status changes from "new customer" to "lost/former customer".
In the context of a globalized and very competitive market, where departments have moved from classical, cost-centered management to a value-centered approach, the mission of decision-makers has evolved from proposing services and strategic partnerships to value creation. To achieve this goal it is necessary to have all the data needed to shed light on the past and clarify the present, in order to predict the future while avoiding gray areas (caused by a lack of information). Business intelligence includes all the IT solutions (methods, facilities and tools) used to pilot the company and help make decisions. This approach can be modeled by the three systems below:
1. Decision system: think, decide and control;
2. Effective system: transform and produce;
3. Information system: links the decision system with the effective system. Its main purposes are:
 Generating information
 Memorizing information
 Broadcasting information
 Processing information.

Figure 1. The information system

The information system is the subsystem of the organization responsible for collecting, storing, processing and broadcasting information in the effective system and the decision system. In the effective system, the information is a current view of business data (invoices, purchase orders, etc.); in the decision system, the information is more synthetic because it should support decision making (for example, the list of the 3 products least sold in January 2014). So the information system links these two subsystems and must bring to all organizational actors of the company the information they need to act and
decide. The IS is thus a representation of reality; it serves to coordinate the activities of the company. This work is situated in this spirit: it consists in contributing to maximizing the company's customer satisfaction, that is, proposing an approach that eliminates any form of customer loss inside an organization, then evaluating and validating the approach, and finally testing the homogeneity of the problem in order to measure customer satisfaction and conduct corrective actions based on two dimensions of quality:
 The "made" quality Q_r: do the product, process or service conform to what was defined as expected? It comprises the various evaluations used to judge the achievement of the target processes, to measure the effects and to check whether the desired results were achieved.
 The "perceived" quality Q_p: what level of satisfaction is generated for the customer? It is defined by the excellence of the product (Zeithaml, 1988).
The ultimate goal is to have Q_r = Q_p.

Figure 2. The company's qualities

The introduction has defined the conceptual framework of the work. It presented the issue addressed and the contributions in the domain of company governance. The remainder is composed of 3 sections: in Section 2, we present the approach, which is then statistically evaluated on concrete examples; in Section 3, we test the homogeneity of the problem; the conclusion summarizes this study and our contribution, and outlines various extensions and possible future work.
2. PROPOSED APPROACH
The Standish Group (Valery, 2001) conducted an international study evaluating the success and failure of IT projects. The data accumulated over the past ten years are based on a sample of 50,000 projects. This study identified three levels of evaluation of a project:
 The success of a project: characterized by a system delivered on time, at a cost within budget and fully compliant with the specifications;
 The failure of a project: characterized by the cessation of the project;
 Finally, the partial success or partial failure of a project: characterized by the late delivery of a system that is only partially responsive, especially in terms of business scope and specifications, at a cost of up to 200% of the original budget.
Only 29% of projects were successful, 53% were partial successes or partial failures, and 18% failed. The proportion of projects abandoned, over budget or late reaches 71%.

2.1 Statement
This study shows that customer satisfaction is not always achieved; making perceived quality tend towards the desired quality presents a real challenge. Within the company, quality is increasingly focused on customer satisfaction. To win contracts, business leaders rely more on quality than on price advantages. Staff involvement, together with listening to the customer, is a key element in the success of a quality approach. The latter is the implementation of all the resources available to an establishment to provide a service that meets the needs and expectations of customers. From the customer's perspective, a warm welcome and quality service are "normal"; it is the lack of quality that penalizes him. To attract the customer, we must establish standards within the company by identifying the market's needs.
There are international standards that ensure safe, reliable and high-quality products and services: the ISO standards. For companies, they are strategic tools for lowering costs, increasing productivity, and reducing waste and errors; obtaining a certification is the preferred way of demonstrating the quality of their organization to their customers and their suppliers.
2.2 Steps of the approach
Below are the 7 best practices for customer satisfaction:
a) To develop the team's skills: provide additional training on IT tools to raise the team's skills.
b) To make customer satisfaction a challenge for the whole company: the company can use the dissatisfaction of its customers to improve its products and services. Bill Gates of Microsoft said that "the unhappy customers are the best sources of information", because customers who express dissatisfaction enable companies to identify and resolve service defects faster. Dissatisfied customers are very expensive for companies: the cost of recruiting a new customer is usually five times higher than the cost of retaining an acquired one, so it is far better to work to keep existing customers than to recruit new ones to replace those who leave. According to Jacques-Antoine Granjon, founder of Vente-privee.com, the treatment of customer dissatisfaction should therefore not be considered only as a cost but as an investment.
c) To motivate teams: to mark clearly the importance of customer satisfaction, some companies have introduced a variable part in the pay of some employees, calculated on the basis of indicators related to customer satisfaction.
d) To facilitate customer contacts: there are 5 types of communication channels:
 Telephone: availability (24/7), time saving;
 Face to face: immediate response, human contact;
 E-mail: traceability (written proof);
 Website: simplicity;
 Postal mail.
e) To anticipate dissatisfaction: whatever the quality of claims processing, it may be better to anticipate a claim and make a gesture to customers who had a bad product experience, or where this risk exists, without waiting for them to complain.
f) To measure customer satisfaction (evaluate to improve): today it is essential to regularly assess the level of achievement of the final goal, customer satisfaction.
For example, after the close of each case, a satisfaction survey designed by the customer service can be sent to all customers who experienced dissatisfaction, measuring the accessibility of the service, the reception, and the understanding and treatment of the dissatisfaction.
g) To reach out to customers on the Internet: the benefit may also be provided on the Internet by another customer or a social network (Twitter, Facebook, ...). Make social media a true extension of customer service, with employees able to participate in discussions and respond directly to customer requests on these media.
3. EVALUATION AND VALIDATION OF THE APPROACH
Consider the case of a service company that manages the work of a large potential customer, "France Gas". The latter signed a contract with the host company specifying the clauses that must be respected; among them is the customer satisfaction rate, which should reach 92%. This percentage is established by agreement between the two parties and, if it is not met, a penalty is applied for customer dissatisfaction. A development team of the host company supports the realization of applications for "France Gas". This team should produce 22 applications monthly, and the dissatisfaction rate should not exceed 8% (about 2 applications per month). Customer dissatisfaction is due to the following causes:
 The application does not answer the need, or generates unexpected errors after delivery;
 Timeout (late delivery).
To avoid these situations, companies have an interest in implementing a continuous improvement process whose ultimate goal is the elimination of all forms of waste, such as customer dissatisfaction. The problem to be solved is, for a period Pn, to maximize the number of satisfied customers. To evaluate the approach, we test it on a sample. We start by stating our statistical hypotheses (H0 and H1):
 The first, the null hypothesis H0: "Qr = Qp", where Qr is the desired proportion of customer satisfaction and Qp is the real percentage of satisfaction.
 The second, the alternative hypothesis H1: "Qp < Qr".
3.1 Before the approach
3.1.1 Example 1: April 2013
The team was able to process only 10 applications that month. Each customer sent feedback presenting his degree of satisfaction. There are 3 kinds of response: S (Satisfied), NS (Not Satisfied), N (Neutral).
Table 1. Customer's feedback of April 2013
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. PipRep 2.0 FR | NS | timeout
2. Contextor 2.8 FR | NS | timeout
3. Contextor 2.2.3 | S |
4. Hermes Horizon | S |
5. Agent SSR 2011 | NS | application does not work correctly
6. Plugin SSR 2011 | NS | application does not work correctly
7. Agent Altiris 2011 | S |
8. GECO 1.17.3 FR | NS | timeout
9. Nexthink Collector | S |
10. Cosmocom 4 FR 1.0 | S |
Once the feedback is received, we compute the monthly satisfaction rates, as shown in the following table:
Table 2. Satisfaction rates of April 2013
S (satisfied): 5 (50%)
NS (unsatisfied): 4 (40%)
N (neutral): 1 (10%)
This table can be represented by the following figure:
Figure 3. Customer satisfaction of April 2013 (pie chart: S 50%, NS 40%, N 10%)
PS(t0) = P(Xt0 = S) = 0.5
PNS(t0) = P(Xt0 = NS) = 0.4
PN(t0) = P(Xt0 = N) = 0.1
With Qr = 92% and the hypotheses H0: "Qr = Qp" and H1: "Qp < Qr", we use a one-tailed (left) test. If
(f − Qr) / √(Qr(1 − Qr)/n) > −tα
then we accept the hypothesis H0 and reject H1 with error risk α = 5%. The critical value tα is read from the table of the normal distribution: P(−tα ≤ T ≤ tα) = 1 − α = 0.95 gives tα = 1.645, while the Student distribution table gives tα = 1.833 for n = 10. With Qr = 92% and, from the example, f = 50%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.5 − 0.92) / √(0.92(1 − 0.92)/10) = −0.42 / 0.0857 = −4.9 < −1.645
So we accept the hypothesis H1: "Qp < Qr" and reject H0: "Qr = Qp" with error risk α = 5%: the observed difference is significant.
3.2 After the approach
3.2.1 Example 2: December 2013
The team treated 22 applications, as shown in the following table:
Table 3. Customer's feedback of December 2013
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. MSC_CASP69 | NS | timeout
2. MSC_MDX | NS | timeout
3. Woodmac | S |
4. Whoswho | S |
5. Adobe Air Installer | S |
6. WinZip | S |
7. MSC_SetupDemdet | S |
8. Jabber | S |
9. TrendMicro_Office | S |
10. ORG+ | S |
11. QlikView | S |
12. Q4-Engica | N |
13. TMS | N |
14. MSCLink_Core | S |
15. MIPS | S |
16. Rsclientprint | NS | application does not work correctly
17. TextPad | S |
18. MSC_DMX | S |
19. MSC_MSCOMCT2 | NS | timeout
20. Add-in Excel | S |
21. Pre-req Excel | S |
22. Ios | S |
We compute the monthly satisfaction rates, as shown in the following table:
Table 4. Satisfaction rates of December 2013
S (satisfied): 16 (72.72%)
NS (unsatisfied): 4 (18.18%)
N (neutral): 2 (9.09%)
This table can be represented by the following figure:
Figure 4. Customer satisfaction of December 2013 (pie chart: S 73%, NS 18%, N 9%)
PS(t0) = P(Xt0 = S) = 0.727
PNS(t0) = P(Xt0 = NS) = 0.181
PN(t0) = P(Xt0 = N) = 0.091
With Qr = 92% and, from the example, f = 72%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.72 − 0.92) / √(0.92(1 − 0.92)/22) = −1.09 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between Qp and Qr is due to sampling fluctuations.
3.2.2 Example 3: January 2014
The team treated 21 applications, as shown in the following table:
Table 5. Customer's feedback of January 2014
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. Windows6.1-KB2574819 | S |
2. MigrationAssistantTool | NS | the installation must be silent
3. See Electrical Viewer 4 | S |
4. Adobe_Flash_Player | S |
5. MSC_DEPOT | S |
6. Colibri 2.0 | S |
7. Navision | S |
8. OFFICE 2013 | S |
9. Windows6.1-KB2592687 | S |
10. CheckPoint VPN | S |
11. Interlink_MSCLink | S |
12. CrystalReportsRuntime | N |
13. InterlinkComponentOne | S |
14. MSXML | S |
15. VisualC++Redistributable | S |
16. ReportViewer_2010 | NS | application does not work correctly
17. .Net_Framework | S |
18. MSCLink_Core | S |
19. MSCLink_Configuration | NS | timeout
20. LDOC | S |
21. MigrationAssistantTool | S |
We compute the monthly satisfaction rates, as shown in the following table:
Table 6. Satisfaction rates of January 2014
S (satisfied): 17 (80.95%)
NS (unsatisfied): 3 (14.28%)
N (neutral): 1 (4.76%)
This table can be represented by the following figure:
Figure 5. Customer satisfaction of January 2014 (pie chart: S 81%, NS 14%, N 5%)
PS(t0) = P(Xt0 = S) = 0.81
PNS(t0) = P(Xt0 = NS) = 0.14
PN(t0) = P(Xt0 = N) = 0.05
With Qr = 92% and, from the example, f = 80%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.8 − 0.92) / √(0.92(1 − 0.92)/21) = −0.64 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The observed difference is due to sampling fluctuations.
4. TEST OF HOMOGENEITY
We are faced with two samples for which it is most often not known whether they come from the same source population, and we seek to test whether they share the same characteristic ℓ. Two values ℓ1 and ℓ2 are observed; the difference between them may be due either to sampling fluctuations or to a difference between the characteristics of the two original populations. That is to say, from the examination of two samples of sizes n1 and n2, extracted respectively from populations P1(M1; α1) and P2(M2; α2), these tests are used to decide between:
H0: "ℓ1 = ℓ2" (we conclude homogeneity);
H1: "ℓ1 ≠ ℓ2" (we conclude heterogeneity).
In our case we test the homogeneity of 2 proportions:
f1 = proportion of units having the character X in sample 1;
f2 = proportion of units having the character X in sample 2;
p1 = proportion of units having the character X in population 1;
p2 = proportion of units having the character X in population 2.
We test H0: "p1 = p2 = p" against H1: "p1 ≠ p2". The common proportion p is replaced by the pooled estimator:
f = (n1 f1 + n2 f2) / (n1 + n2) = (22 × 0.72 + 21 × 0.81) / (22 + 21) = 0.764
x = (0.81 − 0.72) / √(0.764 × (1 − 0.764) × (1/22 + 1/21)) ≈ 0.69
which does not exceed the 5% critical value. So we conclude the homogeneity of the proposed solution: the population is homogeneous, and the observed difference is not significant, being due to sampling fluctuations.
5. CONCLUSIONS
The work done develops a practical and pragmatic approach to maximize customer satisfaction in an organization over a given period. An approach has been proposed, and its evaluation and validation are described above. This work opens the way towards diverse research perspectives situated on two planes: a deepening of the realized research, and an extension of the research domain. In terms of deepening the proposed work, it would be interesting first to use Markov chains to model the proposed approach statistically, and to propose or develop practical tools for its implementation. As for extending the research domain, it would be interesting to connect this approach to the governance of information systems and to a decision-making system that investigates the options and compares them in order to choose an action supporting the decision.
REFERENCES
[1] Buffa, E. Operations Management, 3rd Ed., New York, John Wiley & Sons, 1972.
[2] Fitzsimmons, J. A. and Fitzsimmons, M. J. Service Management: Operations, Strategy and Information Technology, 3rd Ed., New York, Irwin/McGraw-Hill, 2001.
[3] Adhiri, Z., Arezki, S. and Namir, A. What is Application LifeCycle Management?, International Journal of Research and Reviews in Applicable Mathematics and Computer Science, ISSN: 2249-8931, December 2011.
[4] http://hal.archives-ouvertes.fr/docs/00/71/95/35/PDF/2010CLF10335.pdf
[5] Stevenson, W. J. Introduction to Management Science, 2nd Ed., Burr Ridge, IL, Richard D. Irwin, 1992.
[6] Hillier, F. S., Hillier, M. S. and Lieberman, G. J. Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, New York, Irwin/McGraw-Hill, 2000.
[7] El Kebbaj, A. and Namir, A. Modeling customer's satisfaction. Day of Science Engineers, Faculty of Science Ben M'Sik, Casablanca, July 29, 2013.
[8] http://www.projectsmart.co.uk/docs/chaos-report.pdf
[9] http://info.informatique.entreprise.over-blog.com/article-approche-du-systeme-d-information-dans-l-entreprise-69885381.html
[10] http://www.hamadiche.com/Cours/Stat/Cours5.pdf
[11] Arezki, S. ITGovA: proposal of a new approach to the governance of information systems. PhD in Computer Science, defended at the Faculty of Sciences of Ben M'Sik, Casablanca, 24/02/2013.
This paper may be cited as: El Kebbaj, A. and Namir, A., 2014. An Approach for Customer Satisfaction: Evaluation and Validation. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 79-91.
Spam Detection in Twitter - A Review
C. Divya Gowri and Professor V. Mohanraj
Sona College of Technology, Salem
ABSTRACT
Social networking sites have become popular in recent years; among them, Twitter is one of the fastest growing. It plays the dual role of Online Social Network (OSN) and micro-blogging service. Spammers invade Twitter trending topics (popular topics discussed by Twitter users) to pollute the useful content. Social spamming is more successful than email spamming because it exploits the social relationships between users. Spam detection is important because Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Spammers can be detected using content-based and user-based attributes, and traditional classifiers are required for spam detection. This paper focuses on the study of spam detection in Twitter.
Keywords: Social Network Security, Spam Detection, Classification, Content-based Detection.
1. INTRODUCTION
Web-based social networking services connect people who share interests and activities across political, economic, and geographic borders. Online social networking sites like Twitter, Facebook, and MySpace have become popular in recent years. They allow users to meet new people, stay in touch with friends, and discuss everything including jokes, politics, news, etc. Using social networking sites, marketers can reach customers directly; this benefits not only the marketers but also the users, who get more information about the organization and the product. Twitter [1] is one of these social networking sites. Twitter provides a micro-blogging service (the exchange of small elements of content such as short sentences, individual images, or video links) where users can post their messages, called tweets.
A tweet is limited to 140 characters; only HTTP links and text are allowed. A Twitter user is identified by a user name and, optionally, a real name. When a user 'A' starts following other users, their tweets appear on A's page, and A can be followed back if the other user desires. Trending topics in Twitter can be identified with hash tags ('#'). When a user likes a tweet, he/she can 'retweet' that message. Tweets are visible publicly by default, but senders can deliver messages only to their
followers. The '@' sign followed by a username marks a reply to another user. The most common type of spamming in Twitter is through tweets, sometimes via posting suspicious links. Spam [14] can also arrive in the form of direct tweets to your Twitter inbox. Unfortunately, spammers use Twitter as a tool to post malicious links and send spam messages to legitimate users. They also spread viruses or simply compromise the system's reputation. Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Attackers advertise on Twitter, offering products with huge discounts or free items; when users try to purchase these products, they are asked to provide account information, which the attackers retrieve and misuse. Therefore, spam detection in any social networking site is important.
2. RELATED WORKS
McCord et al. [1] proposed user-based and content-based features to facilitate spam detection.
User Based Features
The user-based features considered are the number of friends, the number of followers, user behaviors (e.g. the time periods and frequencies at which a user tweets) and the reputation of the user (based on followers and friends). The reputation of a user j is given by the equation
R(j) = ni(j) / (ni(j) + no(j))   (2.1)
where ni(j) represents the number of followers of user j and no(j) represents the number of friends user j has. According to the Twitter Spam and Abuse Policy, 'if a user has a small number of followers compared to the number of people the user is following, then it may be considered a spam account'. Spammers tend to be most active during the early morning hours, while regular users tweet much less then.
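The reputation feature of equation (2.1) is straightforward to compute; a minimal sketch (the function name and the follower/friend counts are illustrative):

```python
def reputation(followers, friends):
    """Reputation R(j) = ni(j) / (ni(j) + no(j)) from equation (2.1):
    followers divided by followers plus friends."""
    total = followers + friends
    return followers / total if total else 0.0

# A likely spam account: follows many users, has few followers
print(reputation(10, 990))   # 0.01
# A balanced, regular-looking account
print(reputation(500, 500))  # 0.5
```

Accounts that follow far more users than follow them back score close to zero, which is exactly the pattern the Twitter Spam and Abuse Policy flags.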
Content Based Features
The content-based features [11] considered in this approach are the number of Uniform Resource Locators (URLs), replies/mentions, keywords/word weight, retweets, and hash tags. A retweet is a reposting of someone's post; it is like a normal post carrying the original author's name, and it helps share the entire tweet with all of one's followers. Tweets containing '#' mark the popular topics being discussed by the users.
Secondly, they compare four traditional classifiers, namely Random Forest, Support Vector Machine (SVM), Naïve Bayes and K-nearest neighbor, which are used to detect spammers. Among these classifiers, Random Forest is found to be the most effective, but it was evaluated only on an imbalanced data set (a data set with more regular users than spammers).
Alex Hai Wang [2] considered the 'follower-friend' relationship in his paper, modeling a 'directed social graph'. The author considers content-based and graph-based features to facilitate spam detection.
Graph Based Features
A social graph is modeled as a directed graph G = (V, A), where V is the set of nodes representing user accounts and A is the set of arcs connecting the nodes. An arc a = (i, j) represents user i following user j. A follower corresponds to the incoming links (in-links) of a node, i.e. people following you, whom you need not follow back. A friend corresponds to the outgoing links (out-links), i.e. people you are following. A mutual friend is a follower and a friend at the same time. When there is no connection between two users, they are considered strangers.
Fig 2.1 A Simple Twitter Graph
In the figure, user A is following user B, and user B and user C are following each other; i.e., user B and user C are mutual friends, and user A and user C are strangers. The graph-based features considered are the number of followers, the number of friends, and the reputation of a user. The classifier used in this paper to detect spam is the Naïve Bayes classifier [10]. It is based on Bayes' theorem, given by the equation
P(Y|X) = P(X|Y) P(Y) / P(X)   (2.2)
The Twitter account is represented as a feature vector X, and each account is assigned one of two classes Y, spam or non-spam; the assumption is that the features are conditionally independent.
This classifier is easy to implement and requires only a small amount of training data. However, the conditional independence assumption may lead to a loss of accuracy, since the classifier cannot model dependencies between features.
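A toy version of such a Naïve Bayes classifier can be sketched in pure Python (a minimal sketch: the feature names and training data are invented for illustration, and Laplace smoothing is added to avoid zero probabilities; a real study would use the account features described above):

```python
from collections import defaultdict

def train_nb(accounts):
    """Tiny Bernoulli Naive Bayes: count boolean feature values
    per class. `accounts` is a list of (features_dict, label)."""
    counts = defaultdict(lambda: defaultdict(int))
    labels = defaultdict(int)
    for feats, label in accounts:
        labels[label] += 1
        for name, value in feats.items():
            counts[label][(name, value)] += 1
    return counts, labels

def predict_nb(model, feats):
    """Pick the class maximizing P(Y) * prod P(x_i | Y),
    with add-one (Laplace) smoothing."""
    counts, labels = model
    total = sum(labels.values())
    best, best_p = None, -1.0
    for label, n in labels.items():
        p = n / total
        for name, value in feats.items():
            p *= (counts[label][(name, value)] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

# Invented training data over two boolean account features
data = [({"has_url": True,  "many_hashtags": True},  "spam"),
        ({"has_url": True,  "many_hashtags": True},  "spam"),
        ({"has_url": False, "many_hashtags": False}, "ham"),
        ({"has_url": False, "many_hashtags": True},  "ham")]
model = train_nb(data)
print(predict_nb(model, {"has_url": True, "many_hashtags": True}))  # spam
```

The independence assumption shows up directly in the product over features, which is also why the classifier cannot capture correlations between them.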
Twitter Account Features
Zi Chu et al. [13] review some of the classification features used to detect spammers. These include tweet-level features and account-level features. The tweet-level features include the spam content proportion, i.e. the tweet text is checked against a spam word list and the final landing URL is checked. The account-level features include the account profile, i.e. the short self-description text and homepage URL, which are checked for spam words.
Fabricio Benevenuto et al. [3] considered the problem of detecting spammers. In their paper, approximately 96% of legitimate users and 70% of spammers were correctly classified. As in [1], user-based and content-based attributes are considered. To measure detection accuracy, a confusion matrix is introduced.
Fig 2.2 An Example of Confusion Matrix
In this matrix, 'a' is the number of spam correctly classified, 'b' is the number of spam wrongly classified as non-spam, 'c' is the number of non-spam wrongly classified as spam, and 'd' is the number of non-spam correctly classified. For effective classification, some evaluation metrics are considered: precision, recall, and F-measure (Micro-F1, Macro-F1).
Evaluation Metrics:
Precision: the ratio of the number of users correctly classified as spammers to the total number of users predicted as spammers:
Precision, p = a / (a + c)   (2.3)
Recall: the ratio of the number of spammers correctly classified to the total number of actual spammers:
Recall, r = a / (a + b)   (2.4)
F-measure: the harmonic mean of precision and recall:
F-measure = 2pr / (p + r)   (2.5)
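Equations (2.3)-(2.5) translate directly into code. A minimal sketch; the example counts are invented to echo the roughly 96%/70% classification rates reported above, not the paper's actual confusion matrix:

```python
def metrics(a, b, c, d):
    """Precision, recall and F-measure from the confusion matrix of
    Fig 2.2: a = spam correctly classified, b = spam missed,
    c = non-spam flagged as spam, d = non-spam correctly classified."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative: 70 of 100 spammers caught, 4 of 100 legitimate flagged
p, r, f = metrics(a=70, b=30, c=4, d=96)
```

Note that `d` does not enter precision, recall, or F-measure; it matters only for accuracy-style metrics.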
The classifier used to detect spam is the SVM, a state-of-the-art classification method; this approach uses a non-linear SVM with the Radial Basis Function kernel, which allows the SVM to learn complex boundaries. The biggest limitations of the support vector approach lie in the choice of the kernel and its high algorithmic complexity. This approach mainly focuses on detecting spam instead of spammers, so that it can be useful in filtering spam: once a spammer is detected, it is easy to suspend that account and block the IP address, but spammers continue their work from other, new accounts.
Puneeta Sharma and Sampat Biswas [4] proposed two key components: (1) identifying the timestamp gap between two successive tweets and (2) identifying tweet content similarity. They found two common techniques used by spammers: (1) posting duplicate content with small modifications of the tweet; (2) posting spam within short intervals. Their spam identification approach includes BOT activity detection and a tweet similarity index. Twitter data can be filtered in various ways, by user id or by keyword; many spammers post spam messages using a BOT (a computer program), reducing the frequency between consecutive tweets. To calculate the timestamps between tweets, they first cluster tweets by user id and sort them by increasing timestamp.
Fig 2.3 BOT activity detection (cluster tweets by user id, compute the time difference between consecutive tweets; a gap below 10 s indicates spam, otherwise non-spam)
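The flow of Fig 2.3 can be sketched as follows (a minimal sketch; the function name, the tuple-based tweet representation, and the sample data are illustrative, while the 10-second gap comes from the figure):

```python
from collections import defaultdict

def flag_bot_activity(tweets, min_gap=10.0):
    """Cluster tweets by user id, sort by timestamp, and flag users
    whose gap between consecutive tweets ever drops below `min_gap`
    seconds (the BOT-detection step of Fig 2.3)."""
    by_user = defaultdict(list)
    for user_id, timestamp in tweets:
        by_user[user_id].append(timestamp)
    flagged = set()
    for user_id, times in by_user.items():
        times.sort()
        if any(t2 - t1 < min_gap for t1, t2 in zip(times, times[1:])):
            flagged.add(user_id)
    return flagged

tweets = [("bot", 0), ("bot", 3), ("bot", 6),   # 3 s apart
          ("human", 0), ("human", 120)]         # 2 min apart
print(flag_bot_activity(tweets))  # {'bot'}
```

Only the account posting within the 10-second window is flagged; as the paper notes below, sophisticated spammers defeat this by deliberately spacing their posts.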
Spammers can be classified as (1) desperate spammers and (2) sophisticated spammers. Desperate spammers use automatic programs to post multiple tweets with a small time difference between posts; sophisticated spammers create a time gap between tweets. Spammers mostly post duplicate tweets in trending topics, jumbling the words between tweets, reusing sets of words, including numbers in the topic, or appending commercial advertisements to the topic. The tweet similarity index approach determines the behavior of spammers and filters spam. Tweets are first clustered by user id, and each user's set of tweets is processed independently. Buckets of similar tweets are created by calculating the Jaccard and Levenshtein similarity coefficients, so that the most similar tweets end up together in clusters of similar text. Once all the tweets are collected, the size of each bucket is checked; if it is greater than one, the bucket is considered spam.
Fig 2.4 Tweet Similarity Index (cluster tweets by user id, compute Jaccard and Levenshtein distances, create buckets of similar tweets; a bucket of size greater than one indicates spam, otherwise non-spam)
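The bucketing procedure of Fig 2.4 can be sketched as follows (a minimal sketch for one user's tweets; the greedy bucketing strategy and the thresholds `j_min` and `l_max` are our illustrative choices, and the two measures themselves are defined formally in the paper):

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t
    (classic dynamic-programming formulation)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity of two tweets' token sets: |A∩B| / |A∪B|."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def spam_buckets(tweets, j_min=0.8, l_max=5):
    """Greedily bucket near-duplicate tweets; buckets with more than
    one tweet are treated as spam, as in Fig 2.4."""
    buckets = []
    for tw in tweets:
        for bucket in buckets:
            if (jaccard(tw, bucket[0]) >= j_min
                    or levenshtein(tw, bucket[0]) <= l_max):
                bucket.append(tw)
                break
        else:
            buckets.append([tw])
    return [b for b in buckets if len(b) > 1]

dup = spam_buckets(["win a free phone now",
                    "win a free phone today",
                    "lunch was great"])
print(len(dup))  # 1
```

The two near-duplicate promotional tweets fall into one bucket of size two and are flagged, while the unrelated tweet stays alone.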
Levenshtein distance
The Levenshtein distance is a string metric measuring the difference between two sequences of text. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one word into the other. The phrase 'edit distance' is often used to refer to the Levenshtein distance. The distance is zero if the strings are equal. For example, the Levenshtein distance between "sitter" and "sitting" is 3:
sitter → sittir (substitution of "i" for "e")
sittir → sittin (substitution of "n" for "r")
sittin → sitting (insertion of "g" at the end)
The Levenshtein distance is used to find duplicate tweets, i.e. if two tweets are duplicates, then the distance is zero.
Jaccard index
The Jaccard index, also called the Jaccard similarity coefficient, is used for comparing the diversity and similarity of sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|   (2.6)
The Jaccard distance measures the dissimilarity between sample sets and is obtained by subtracting the Jaccard coefficient from 1:
dJ(A, B) = 1 − J(A, B)   (2.7)
Dolvara Gunatilaka [9] discusses two privacy issues. The first is the user's identity, or user anonymity; the second concerns the user profile and personal information leakage.
User anonymity
In many social networking sites, users use their real name to represent their account. There are two methods to expose a user's anonymity: (1) the de-anonymization attack and (2) the neighborhood attack [15]. In the first, the user's anonymity can be revealed through history stealing and group membership information; in the second, the attacker finds the neighbors of the victim node.
Regarding user profiles and personal information, attackers are attracted by personal details such as name, date of birth, contact information, relationship status, current work and education background. Information can leak because of poor privacy settings: many profiles are made public, so anyone can view them. Next is
leakage of information through third-party applications. Social networking sites provide an Application Programming Interface (API) for third-party developers to create applications; once users access these applications, the third party can access their information automatically.
Social Worms
Among the worms discussed, the Twitter worm is one of the most popular. 'Twitter worm' is a term describing worms that spread through Twitter. There are many versions; two worms discussed in this paper are the following.
Profile Spy worm: this worm spreads by posting a link that downloads a third-party application called "Profile Spy" (a fake application). When users try to download the application, they must fill in some personal information, which allows the attacker to obtain it. Once an account is infected, it continuously tweets malicious messages to its followers.
Google worm: this worm uses a shortened Google URL that tricks users into clicking the link. The fake link redirects users to a fake anti-virus website, which displays a warning saying the computer is infected and lets the user download the fake antivirus, which is actually malicious code.
Sender Receiver Relationship
Jonghyuk Song et al. [7] propose a spam filtering technique based on the sender-receiver relationship. The paper addresses two problems in detecting spam. First, account features can be fabricated by spammers. Second, account features cannot be collected until a number of malicious messages have been reported for the account. Their spam filter therefore does not use account features but relational features, i.e. the connectivity and the distance between the sender and the receiver, which are difficult for spammers to manipulate.
Since Twitter limits a tweet to 140 characters, spammers cannot put much information in the text itself; for this reason, spammers resort to posting URLs leading to spam. Messages are classified as spam based on the sender; content filtering is not effective in Twitter because tweets contain only a small amount of text.
Restrictions in Twitter
Some of the restrictions considered in Twitter [9] are: the user must not follow a large number of users in a short time.
a. Unfollowing and following someone repeatedly.
b. A small number of followers compared to the number of accounts followed.
c. Duplicate tweets or updates.
d. Updates consisting only of links.
The distance between two users is calculated as follows [5][6]: when two users are directly connected by an edge, the distance is one, which means the two users are friends; when the distance is greater than one, they have common friends but are not friends themselves. The connectivity represents the strength of the relationship, and one way to measure it is to count the number of paths. Hence, the connectivity between a spammer and a legitimate user is weaker. The problem with this system is that it identifies messages as normal if they come from infected friends; sometimes attackers send spam messages from legitimate accounts after stealing passwords.
D. Karthika Renuka and T. Hamsapriya [8] note that unsolicited email, also called spam, is one of the fastest growing problems associated with the Internet. Among the many proposed techniques, Bayesian filtering is considered an effective one against spam; it works on the probability of words occurring in spam and legitimate mails. Many spam detection systems, however, use keyword lists to detect spam mails; in that case misspellings arise, and the blacklist needs to be constantly updated, which is difficult. For this purpose, a word stemming (or hashing) technique is proposed, which improves the efficiency of content-based filters; such filters are useless if they do not understand the meaning of the words. Two techniques are employed to find spam content.
The filter uses Bayes' theorem to detect spam content. Word stemming or word hashing technique [12]: this filter extracts the stem of a modified word so that the efficiency of detecting spam content can be improved; a rule-based word stemming algorithm is used for spam detection. Stemming is an algorithm that converts a word into a related base form, for example converting plurals into singulars or removing suffixes.
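The two techniques above can be combined: stemming normalises word variants before the Bayesian filter scores them. The sketch below is illustrative only; the suffix list, word counts, and smoothing are hypothetical simplifications, not the rules of [8], [10] or [12].

```python
import math
import re

def stem(word):
    """Minimal rule-based stemmer: strip a few common suffixes so
    variants like 'cheapest' and 'cheaply' map to the stem 'cheap'.
    The suffix list here is a hypothetical, much-reduced rule set."""
    word = word.lower()
    for suffix in ("ing", "est", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def spam_probability(message, spam_counts, ham_counts):
    """Bayes-style score P(spam | words), assuming independent words
    and equal priors. Counts are per-stem tallies that would come from
    labelled training mail; add-one smoothing keeps unseen stems from
    zeroing the product."""
    log_spam = log_ham = 0.0
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    for token in re.findall(r"[a-z]+", message.lower()):
        s = stem(token)
        log_spam += math.log((spam_counts.get(s, 0) + 1) / (spam_total + 2))
        log_ham += math.log((ham_counts.get(s, 0) + 1) / (ham_total + 2))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical training tallies (stem -> occurrence count).
spam_counts = {"cheap": 40, "offer": 30, "click": 30}
ham_counts = {"meeting": 50, "report": 30, "cheap": 5}

score = spam_probability("Cheapest offers, click now", spam_counts, ham_counts)
```

Because "Cheapest" and "offers" are reduced to the trained stems "cheap" and "offer", the message scores close to 1 even though neither surface form appears in the word list; a plain keyword filter would have missed both.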
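The sender-receiver relationship measures reviewed earlier, distance as the shortest path between two users and connectivity as the number of paths between them [7], can be sketched on a small friendship graph. The graph, user names, and path-length bound below are hypothetical illustrations, not data from the surveyed papers.

```python
from collections import deque

def distance(graph, a, b):
    """Shortest hop count between two users (breadth-first search).
    A distance of one means the users are direct friends."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return float("inf")  # no connection at all

def count_paths(graph, a, b, max_len):
    """Connectivity estimate: number of simple paths from a to b of
    length <= max_len. More paths means a stronger relationship."""
    def walk(node, visited, length):
        if node == b:
            return 1
        if length == max_len:
            return 0
        return sum(walk(nbr, visited | {nbr}, length + 1)
                   for nbr in graph.get(node, ()) if nbr not in visited)
    return walk(a, {a}, 0)

# Hypothetical mutual-follow graph; only "dave" follows the spammer back.
g = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol", "spammer"],
    "spammer": ["dave"],
}
```

Here `distance(g, "alice", "bob")` is 1 (friends) while `distance(g, "alice", "spammer")` is 3, and alice reaches bob by two short paths but the spammer by only one, matching the observation that spammer-to-legitimate connectivity is weak.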
3. CONCLUSIONS
Spammers are a major problem in any online social networking site. Once a spammer is detected it is easy to suspend his or her account or block the IP address, but spammers then try to spread spam from other accounts or IP addresses. Hence it is recommended to check tweets for spam content on the server: if any content matches the spam words present in the data set, the tweet is prevented from being displayed. Accuracy is evaluated in classifying the spam content. Many traditional classifiers exist for separating spammers from legitimate users, but many of them wrongly classify non-spammers as spammers. It is therefore more effective to check tweets for spam content.

REFERENCES
[1] M. McCord and M. Chuah, "Spam Detection on Twitter Using Traditional Classifiers". Lecture Notes in Computer Science, Volume 6906, pp. 175-186, September 2011.
[2] A. H. Wang, "Don't Follow Me: Spam Detection in Twitter". Proceedings of the 5th International Conference on Security and Cryptography (SECRYPT), July 2010.
[3] F. Benevenuto, G. Magno, T. Rodrigues and V. Almeida, "Detecting Spammers on Twitter". CEAS 2010: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 2010.
[4] P. Sharma and S. Biswas, "Identifying Spam in Twitter Trending Topics". American Association for Artificial Intelligence, 2011.
[5] "Levenshtein distance", http://en.wikipedia.org/wiki/Levenshtein_distance.
[6] "Jaccard index", http://en.wikipedia.org/wiki/Jaccard_index.
[7] J. Song, S. Lee and J. Kim, "Spam Filtering in Twitter Using Sender-Receiver Relationship". Recent Advances in Intrusion Detection, Lecture Notes in Computer Science, Volume 6961, pp. 301-317, 2011.
[8] D. Karthika Renuka and T. Hamsapriya, "Email Classification for Spam Detection Using Word Stemming".
International Journal of Computer Applications, 1(5), pp. 45-47, February 2010.
[9] "Reporting Spam on Twitter", http://support.twitter.com/articles/64986-reporting-spam-on-twitter.
[10] S. L. Ting, W. H. Ip and A. H. C. Tsang, "Is Naive Bayes a Good Classifier for Document Classification?". International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
[11] R. Malarvizhi and K. Saraswathi, "Content-Based Spam Filtering and Detection Algorithms: An Efficient Analysis & Comparison". International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, September 2013.
[12] N. S. Kumar, D. P. Rana and R. G. Mehta, "Detecting E-mail Spam Using Spam Word Associations". International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[13] Z. Chu, I. Widjaja and H. Wang, "Detecting Social Spam Campaigns on Twitter". Lecture Notes in Computer Science, Volume 7341, pp. 455-472, 2012.
[14] C. Grier, K. Thomas, V. Paxson and M. Zhang, "@spam: The Underground on 140 Characters or Less". Proceedings of the 17th ACM Conference on Computer and Communications Security, ACM, New York, NY, USA, 2010.
[15] B. Zhou and J. Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks". IEEE 24th International Conference on Data Engineering, April 2008.

This paper may be cited as:
Gowri, C. D. and Mohanraj, V., 2014. Spam Detection in Twitter - A Review. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 92-102.