ISSN: 1694-2507 (Print)
ISSN: 1694-2108 (Online)
International Journal of Computer Science
and Business Informatics
(IJCSBI.ORG)
VOL 14, NO 1
JULY 2014
Table of Contents VOL 14, NO 1 JULY 2014
Symmetric Image Encryption Algorithm Using 3D Rossler System........................................................1
Vishnu G. Kamat and Madhu Sharma
Node Monitoring with Fellowship Model against Black Hole Attacks in MANET.................................... 14
Rutuja Shah, M.Tech (I.T.-Networking), Lakshmi Rani, M.Tech (I.T.-Networking) and S. Sumathy, AP [SG]
Load Balancing using Peers in an E-Learning Environment ...................................................................... 22
Maria Dominic and Sagayaraj Francis
E-Transparency and Information Sharing in the Public Sector ................................................................ 30
Edison Lubua (PhD)
A Survey of Frequent Subgraphs and Subtree Mining Methods ............................................................. 39
Hamed Dinari and Hassan Naderi
A Model for Implementation of IT Service Management in Zimbabwean State Universities ................ 58
Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou
Present a Way to Find Frequent Tree Patterns using Inverted Index ..................................................... 66
Saeid Tajedi and Hasan Naderi
An Approach for Customer Satisfaction: Evaluation and Validation ....................................................... 79
Amina El Kebbaj and A. Namir
Spam Detection in Twitter – A Review...................................................................................................... 92
C. Divya Gowri and Professor V. Mohanraj
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 1
Symmetric Image Encryption
Algorithm Using 3D Rossler System
Vishnu G. Kamat
M Tech student in Information Security and Management
Department of IT, DIT University
Dehradun, India
Madhu Sharma
Assistant Professor
Department of Computer Science, DIT University
Dehradun, India
ABSTRACT
Recently, a great deal of research has been done in the field of image encryption using chaotic maps. In this paper, we propose a new symmetric block cipher algorithm using the 3D Rossler system. The algorithm builds on the approaches of Mohamed Amin et al. [Commun. Nonlinear Sci. Numer. Simulat. (2010)] and Vinod Patidar et al. [Commun. Nonlinear Sci. Numer. Simulat. (2009)]. The merits of these algorithms, namely the encryption structure and the diffusion scheme respectively, are combined with an approach that splits the key across the three dimensions for the encryption of color (RGB) images. The experimental results suggest an overall better performance of the algorithm.
Keywords
Image Encryption, Rossler System, Block Cipher, Security Analysis.
1. INTRODUCTION
Image encryption is relatively different from text encryption. An image is made up of pixels, and these are highly correlated, so different approaches are followed for the encryption of images [1-12]. One such approach is known as chaotic cryptography. In this approach, encryption uses chaotic maps, which generate good pseudo-random numbers. Cryptographic properties of these maps, such as sensitive dependence on initial parameters and ergodic, random-like behavior, make them ideal for designing secure cryptographic algorithms. Many scholars have proposed various chaos-based encryption schemes in recent years [4-12].
A scheme proposed by Mohamed Amin et al. [11] uses the Tent map as the chaotic map and is implemented for gray-scale images. They proposed a new approach of treating the plaintext as blocks of bits rather than blocks of pixels. Another scheme, proposed by Vinod Patidar et al. [12], uses chaotic standard and logistic maps and introduces a way of spreading the bits using diffusion to avoid redundancy. In this paper, we propose an algorithm which utilizes the merits of the mentioned schemes. The
algorithm uses the Rossler system for the chaotic key generation. We
demonstrate a way to split the 3 dimensions of the key for the 3 image
channels i.e. Red, Green and Blue. The algorithm in [11] is used as a base
structure and the diffusion concept from [12] is used to spread the effect of
adding the key. The symmetric Feistel structure, diffusion method and key
splitting of the encryption scheme provide better results.
The rest of the paper is organized as follows: Section 2 provides a brief
overview of the Rossler system. Section 3 provides the algorithmic details.
The results of the security analysis are shown in section 4. Lastly, Section 5
concludes the paper.
2. BRIEF OVERVIEW OF 3D ROSSLER SYSTEM
The Rossler system is a system of non-linear differential equations which has chaotic properties [13]; Otto Rossler defined the equations in 1976. In the iterated form used here, the equations are as given below
Xn+1 = -Yn-Zn
Yn+1 = Xn + αYn (1)
Zn+1 = β + Zn (Xn-γ)
where, α, β and γ are real parameters. Rossler system's behavior is
dependent on the values of the parameters α, β and γ. For different values of
these parameters the system displays considerable changes. It may be
chaotic, converge toward a fixed point, follow a periodic orbit or escape
towards infinity. The Rossler system displays chaotic behavior for the
values of α=0.432, β=2 and γ=4.
The chaotic behavior refers to the fact that, keeping the parameters constant, even a slight change in the initial value brings a significant change in the subsequent values. For example, taking X0 = -1, the value Z0 = 0.3 generates Z1 = 0.5, while changing Z0 to 0.6 generates Z1 = -1. The same chaotic rule applies to changes in the other two dimensions
(X and Y). This chaotic behavior is known as deterministic chaos, i.e. the
knowledge of initial values and parameter values can help in recreating the
same chaotic pattern. Hence the initial conditions have to be shared between
the entities using the system for encryption/decryption process.
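The sensitivity example above can be reproduced numerically with a single iteration of equation (1). Taking X0 = -1 is our assumption, chosen because it is the value for which Z0 = 0.3 gives Z1 = 0.5 and Z0 = 0.6 gives Z1 = -1 under the Z-equation; the function name is ours, a minimal sketch only:

```python
# One iteration of the iterated Rossler system (1) with the
# chaotic parameter values alpha = 0.432, beta = 2, gamma = 4.
ALPHA, BETA, GAMMA = 0.432, 2.0, 4.0

def rossler_step(x, y, z):
    """Return (X_{n+1}, Y_{n+1}, Z_{n+1}) from (X_n, Y_n, Z_n)."""
    return (-y - z, x + ALPHA * y, BETA + z * (x - GAMMA))

# Reproduce the text's example (X0 = -1 is an assumption that matches it):
# Z0 = 0.3 gives Z1 = 0.5, while Z0 = 0.6 gives Z1 = -1.
_, _, z1_a = rossler_step(-1.0, 0.0, 0.3)
_, _, z1_b = rossler_step(-1.0, 0.0, 0.6)
print(z1_a, z1_b)  # 0.5 -1.0
```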
3. PROPOSED ALGORITHM
In this section we provide details of our algorithm. The algorithm is
designed to work with color images (RGB). In this scheme the plaintext
(image) is taken as blocks of bits. The block size is 8w, where ‘w’ is the
word size which is 32 bits. Each block of data is divided and stored into 8
w-bit registers and operations are performed on them. The key length
depends on the number of rounds 'r', i.e. the key length is 4r + 8 bytes. The number of rounds can vary from 1 to 255. We have taken 'r' to be 12 for our experimentation.
The flowchart shown in Fig. 1 displays the various steps performed on the
image during the encryption process. The steps are explained in the
following subsections.
Figure 1. Flowchart of the Encryption Scheme
3.1 Padding
The processing of the image is done on blocks of data: 256 bits, i.e. 32 bytes, of data are encrypted/decrypted at a time using eight 32-bit registers. The image size should be a multiple of 256 bits to ensure that there is always a full block for encryption. Hence padding is added to make the input block 32 bytes long when the image size in bytes is not an integral multiple of 32. A padding of all zeros (1-31 bytes) is appended to the end of each row to make the number of bytes in each row a multiple of 32.
For example if the image is of dimensions 252 x 252 pixels, a 4 byte
padding of zeros is appended at the end of each row. The last byte of the
image then stores the number of bytes used as padding as a pixel value i.e. 4
in this case. This pixel value is used to remove the padding after decryption.
After retrieving the number of bytes padded ‘n’, all rows are checked to
determine if zeros exist in all the last ‘n’ bytes and in ‘n-1’ bytes of the last
row. The padding is then removed to generate the original image.
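The padding step above can be sketched minimally, assuming one channel is handled as a list of byte rows (the helper name is ours):

```python
def pad_rows(rows, block=32):
    """Zero-pad each row (a bytes object) of one channel to a multiple
    of `block` bytes. Returns the padded rows and the per-row pad
    length, which the scheme stores as a pixel value in the last byte
    of the image so it can be stripped after decryption."""
    pad = (-len(rows[0])) % block
    return [row + bytes(pad) for row in rows], pad

# A 252-byte channel row (252-pixel-wide image) receives 4 pad bytes.
rows, pad = pad_rows([bytes(252), bytes(252)])
print(len(rows[0]), pad)  # 256 4
```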
3.2 Key Generation
The key is generated by the 3D chaotic Rossler system as shown in (1). The
number of key bytes ‘t’ depends on the number of rounds ‘r’ i.e. t=4r+8. We
use the three equations separately. The random sequence generated by each
equation of the map is used as a key separately during the encryption
process of the red, green and blue channel of the image respectively. The
key generation concept is as shown below. The steps repeat ‘t’ number of
times to generate necessary key bytes.
a. Iterate Rossler system of equations (1) ‘r’ times where ‘r’ is the
number of rounds.
b. Use the decimal part of the X, Y, Z values to generate the key byte.
Xn = abs (Xn - integer part); // decimal part of x
Yn = abs (Yn - integer part); // decimal part of y
Zn = abs (Zn - integer part); // decimal part of z
c. The key byte for each channel (R, G, B) is taken from the X, Y, Z
values respectively by mapping each to a value between 0 and 255.
d. For the next set of key bytes, the number of iterations is changed to
a value obtained by performing an exclusive-or on the current set of
key bytes.
Iterations for next key byte = XOR (Xn, Yn, Zn);
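The steps above can be sketched in code. Several details are our assumptions where the paper is not explicit: the byte mapping int(frac * 256), forcing at least one iteration when the exclusive-or result is zero, and applying step (b)'s fractional-part reduction after every iteration (the paper states it at extraction time) to keep values within floating-point range:

```python
def frac(v):
    """abs(decimal part) of v, as in step (b)."""
    return abs(v - int(v))

def generate_key(x, y, z, r, alpha=0.432, beta=2.0, gamma=4.0):
    """Sketch of the key schedule: t = 4r + 8 bytes per channel.

    One byte per channel (R, G, B) is drawn from the X, Y, Z dimensions
    respectively on each extraction (step c); the next iteration count is
    the XOR of the current key bytes (step d)."""
    t = 4 * r + 8
    key_r, key_g, key_b = [], [], []
    iters = r                                   # step (a): iterate r times first
    for _ in range(t):
        for _ in range(max(1, iters)):          # assumption: at least 1 iteration
            x, y, z = -y - z, x + alpha * y, beta + z * (x - gamma)
            x, y, z = frac(x), frac(y), frac(z)  # step (b), applied every step
        kr, kg, kb = (int(v * 256) % 256 for v in (x, y, z))  # step (c)
        key_r.append(kr); key_g.append(kg); key_b.append(kb)
        iters = kr ^ kg ^ kb                    # step (d)
    return key_r, key_g, key_b

kr, kg, kb = generate_key(0.1, 0.2, 0.3, r=12)
```

The same initial values and parameters regenerate the same key stream, which is what allows the receiver to decrypt.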
3.3 Vertical and Horizontal Diffusion
The diffusion process explained in [12] is used in the algorithm. The
horizontal diffusion in our algorithm is used in a slightly different way i.e. it
is performed separately on each channel after the encryption of the channel
rather than using it on the entire image. The diffusion ensures spread of the
key additions for the channel. The horizontal diffusion moves in the forward
direction from the first pixel of a channel to the last. The second pixel is the
exclusive or of first and second pixel of a channel, the third pixel is the
exclusive-or of the new second pixel and the third pixel, and so on. Thus the first pixel of the channel remains unchanged.
The Vertical Diffusion is performed before and after the entire encryption
and horizontal diffusion is performed on the 3 channels of the image. In
Vertical Diffusion the channels are treated collectively. The processing
occurs from the last pixel of the image to the first pixel. It starts by
performing XOR of the green and blue values of the last pixel of the image
with the red value of the second last pixel to form the new red value of the
second last pixel. The green value of the second last pixel is formed by
performing XOR operation on the red and blue values of the last pixel. The
blue value of the second last pixel is formed by XOR operation on the red
and green values of the last pixel. This continues in the backward direction.
Thus the last pixel remains unchanged.
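The two diffusion passes can be sketched as follows. For vertical diffusion the text spells out only the new red value explicitly; we assume the green and blue channels mirror the red rule (XOR of the other two channels of the already-processed following pixel with the channel's own old value), which keeps the pass invertible. Function names are ours:

```python
def horizontal_diffusion(channel):
    """Forward running XOR over one channel; the first value is unchanged."""
    out = list(channel)
    for i in range(1, len(out)):
        out[i] ^= out[i - 1]       # each new value uses the previous NEW value
    return out

def undo_horizontal_diffusion(channel):
    """Inverse pass, run backwards so original neighbors are recovered."""
    out = list(channel)
    for i in range(len(out) - 1, 0, -1):
        out[i] ^= out[i - 1]
    return out

def vertical_diffusion(pixels):
    """Backward pass over (r, g, b) pixels; the last pixel is unchanged.

    Assumption: green and blue follow the red rule given in the text,
    i.e. each channel of pixel i is XORed with the other two channels
    of pixel i+1."""
    out = [list(p) for p in pixels]
    for i in range(len(out) - 2, -1, -1):
        r1, g1, b1 = out[i + 1]
        out[i][0] ^= g1 ^ b1       # new red of pixel i
        out[i][1] ^= r1 ^ b1       # new green of pixel i
        out[i][2] ^= r1 ^ g1       # new blue of pixel i
    return [tuple(p) for p in out]
```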
3.4 Encryption/Decryption Scheme
The encryption is performed on 256 bits (32 bytes) of data at a time using eight 32-bit registers. The algorithm is shown in Fig. 2. In the initial step, four bytes of the key are added to alternate registers using 2's complement addition. Then for 'r' rounds arithmetic operations are
performed on the image data. It uses a function ‘f’, the output of which is
used as the number of rotations to be performed on another block of data.
After the swapping operation of the last round, the last four key bytes are
added. The entire encryption structure is displayed in Fig. 3. For decryption
the algorithm follows reverse of the encryption process.
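The structure described above (key bytes added to alternate registers before and after 'r' rounds, a function 'f' whose output sets rotation amounts, and a register swap each round) can be illustrated with the skeleton below. This is a structural sketch only: the real round function 'f' and the exact swap pattern are those shown in Fig. 2 and Fig. 3, and the placeholder 'f' here is not the paper's. The key-byte layout consumes 4r + 8 bytes, matching the stated key length:

```python
MASK = 0xFFFFFFFF                      # keep values to 32 bits

def rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & MASK

def rotr(x, n):
    return rotl(x, -n)

def f(x):
    # Placeholder mixing function; the paper's actual f is in Fig. 2.
    return rotl((x * (2 * x + 1)) & MASK, 5)

def encrypt_block(regs, key, r):
    """regs: eight 32-bit words; key: 4r + 8 bytes (Section 3)."""
    s = list(regs)
    for i in range(4):                                  # initial 4 key bytes
        s[2 * i] = (s[2 * i] + key[i]) & MASK
    for rnd in range(r):
        rk = int.from_bytes(key[4 + 4 * rnd: 8 + 4 * rnd], "little")
        s[0] = rotl(s[0] ^ rk, f(s[1]))                 # f drives the rotation
        s = s[1:] + s[:1]                               # swap/rotate registers
    for i in range(4):                                  # final 4 key bytes
        s[2 * i] = (s[2 * i] + key[4 * r + 4 + i]) & MASK
    return s

def decrypt_block(regs, key, r):
    """Exact reverse of encrypt_block."""
    s = list(regs)
    for i in range(4):
        s[2 * i] = (s[2 * i] - key[4 * r + 4 + i]) & MASK
    for rnd in range(r - 1, -1, -1):
        s = s[-1:] + s[:-1]                             # undo the swap
        rk = int.from_bytes(key[4 + 4 * rnd: 8 + 4 * rnd], "little")
        s[0] = rotr(s[0], f(s[1])) ^ rk
    for i in range(4):
        s[2 * i] = (s[2 * i] - key[i]) & MASK
    return s
```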
Figure 2. Encryption Algorithm for each Channel (R, G, B)
Figure 3. The Image Encryption Structure
4. EXPERIMENTATION RESULTS
We performed security analysis on six 256 x 256 color (RGB) images as shown in Fig. 4. The statistical and differential analysis tests performed display very favorable results and demonstrate the strength and security of the algorithm. Results showing how the vulnerability in [11] is overcome are given in [14].
Figure 4. Plain images (clockwise from top left): Lena, Bridge, Lake, Plane, Peppers and Mandrill
4.1 Statistical Analysis
Statistical analysis is performed to determine the correlation between the
plain image and the cipher image. For an encryption system to be strong the
cipher image should not be correlated to the plain image and the cipher
image pixels should not have correlation among them. In this section we
provide the histogram and correlation analysis.
4.1.1 Histogram Analysis
When the encrypted image and the plain image do not show a high degree of correlation, we can consider the encryption to be secure from information leakage. Histograms plot the number of pixels at each intensity level, i.e. pixels having values 0-255, and thus display how the pixel values are distributed.
Fig. 5 depicts the histograms for the red, green and blue channels of the plain image 'lena' on the left side (top to bottom) and the histograms of the 'lena' image after encryption for the three channels respectively on the right side. They show that the encryption does not leave any concentration of a single pixel value.
Figure 5. Left Side: Histogram of 'lena' plain image for red, green and blue channels (top to bottom). Right Side: Histogram of encrypted 'lena' image for red, green and blue channels (top to bottom).
4.1.2 Correlation of Adjacent Pixels
In a plain image the adjacent pixels show a high degree of correlation in
horizontal, vertical and diagonal directions. The encrypted image should
have a very small degree of correlation among its adjacent pixels. We select
1000 random pairs of pixels from an image and the following formula gives
the correlation coefficient.
corr_xy = C(x, y) / ( √D(x) √D(y) )        (2)

where,

C(x, y) = (1/N) Σ_{i=1..N} (x_i − E(x)) (y_i − E(y))        (3)

D(x) = (1/N) Σ_{i=1..N} (x_i − E(x))²        (4)

E(x) = (1/N) Σ_{i=1..N} x_i        (5)

Here x_i and y_i form the i-th pair of adjacent pixels and N is the total number of pairs.
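Equations (2)-(5) translate directly into code; a small sketch, assuming the pixel pairs are given as plain Python lists (function names are ours):

```python
from math import sqrt

def expectation(v):                  # E(x), eq. (5)
    return sum(v) / len(v)

def deviation(v):                    # D(x), eq. (4)
    ex = expectation(v)
    return sum((vi - ex) ** 2 for vi in v) / len(v)

def covariance(x, y):                # C(x, y), eq. (3)
    ex, ey = expectation(x), expectation(y)
    return sum((xi - ex) * (yi - ey) for xi, yi in zip(x, y)) / len(x)

def corr(x, y):                      # corr_xy, eq. (2)
    return covariance(x, y) / sqrt(deviation(x) * deviation(y))

# Perfectly correlated pairs give a coefficient of 1.
print(corr([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0
```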
Table 1 shows the correlation coefficient values of the six plain images (Fig.
4) between horizontal, vertical and diagonal adjacent pixels. It can be noted
that the adjacent pixels are highly correlated.
Table 1. Correlation Values of Plain Images
Channels Plain Images Horizontal Vertical Diagonal
RED
Lena 0.9558 0.9781 0.9336
Bridge 0.8680 0.9070 0.8287
Lake 0.9234 0.9201 0.8886
Mandrill 0.8474 0.8032 0.7944
Peppers 0.9371 0.9392 0.9077
Plane 0.9205 0.9092 0.8546
GREEN
Lena 0.9401 0.9695 0.9180
Bridge 0.9055 0.9131 0.8700
Lake 0.9354 0.9272 0.8943
Mandrill 0.7285 0.6674 0.6487
Peppers 0.9657 0.9673 0.9451
Plane 0.8938 0.9174 0.8419
BLUE
Lena 0.9189 0.9495 0.8948
Bridge 0.9354 0.9411 0.9138
Lake 0.9377 0.9401 0.9099
Mandrill 0.8030 0.7914 0.7625
Peppers 0.9259 0.9330 0.8928
Plane 0.9179 0.8912 0.8563
Table 2 shows the correlation coefficient values for the red, green and blue channels of the cipher images formed by encrypting the plain images with the proposed encryption algorithm. The cipher images bear very little resemblance to the original images, and the adjacent pixels in the horizontal, vertical and diagonal directions are correlated to a very small degree.
Table 2. Correlation Values of Cipher Images
Channels Images Horizontal Vertical Diagonal
RED
Lena -0.0014 -0.0012 0.0004
Bridge -0.0040 -0.0066 -0.0010
Lake -0.0052 -0.0011 0.0018
Mandrill 0.0034 0.0001 0.0033
Peppers -0.0014 -0.0034 -0.0016
Plane -0.0024 -0.0043 0.0088
GREEN
Lena 0.0004 0.0067 -0.0026
Bridge -0.0053 -0.0017 0.0008
Lake 0.0044 -0.0025 0.0068
Mandrill -0.0031 -0.0041 0.0029
Peppers 0.0008 0.0027 0.0029
Plane 0.0026 -0.0003 0.0014
BLUE
Lena -0.0049 0.0014 -0.0005
Bridge 0.0023 0.0001 0.0037
Lake -0.0010 -0.0044 0.0002
Mandrill 0.0023 0.0001 -0.0014
Peppers -0.0016 -0.0006 0.0013
Plane 0.0040 -0.0007 0.0041
4.1.3 Correlation between plain and cipher image
The previous section examined correlation between adjacent pixels within the plain image or the cipher image. It is also necessary that there be no relevant correlation between the plain image and the corresponding cipher image. Rather than using pixel pairs from a single image, we use the pixels of the plain and cipher images at the same grid position.
The 2D correlation coefficients of the images are calculated by pairing the
three channels of the plain image with the three channels of the cipher
image. These form nine different pairs i.e. correlation between; red channel
of plain image and red channel of cipher image, red channel of plain image
and green channel of cipher image, red channel of plain image and blue
channel of cipher image; and so on for the green and blue channels of the
plain image. These are represented as CRR, CRG, CRB, CGR, CGG, CGB, CBR,
CBG, CBB; where for any Cij, i represents a channel (R,G,B) of plain image
and j represents a channel (R,G,B) of cipher image. The coefficient values
given in Table 3 depict that there is little or practically no correlation
between the plain image and its corresponding cipher image. The cipher
image thus displays characteristics of a random image.
Table 3. Correlation Values between Plain Image and Cipher Image
Images CRR CRG CRB CGR CGG CGB CBR CBG CBB
Lena -0.0033 0.0016 0.0047 -0.0026 -0.0008 0.0006 -0.0029 0.0003 -0.0021
Bridge -0.0029 0.0005 0.0003 -0.0020 -0.0006 0.0011 0.0008 0.0007 0.0010
Lake -0.0012 0.0002 0.0005 -0.0041 -0.0007 0.0033 -0.0050 -0.0021 0.0039
Mandrill -0.0019 -0.0004 -0.0024 -0.0035 0.0011 -0.0036 -0.0034 0.0005 -0.0036
Peppers -0.0030 -0.0059 -0.0022 -0.0033 -0.0024 -0.0012 -0.0042 -0.0007 0.0005
Plane 0.0072 0.0014 -0.0003 0.0068 0.0025 0.0015 0.0057 0.0033 0.0033
4.2 Differential Analysis
Differential analysis measures the amount of change that the encryption performs on the image. The encryption of two very similar images should not produce a similar distribution of pixels in the cipher images. In other words, the cipher images of two plain images differing in just a single pixel should not bear any pixel resemblance to each other. An adversary should not be able to extract any meaningful relationship between plaintext and ciphertext by comparing the two cipher texts of similar plaintexts.
NPCR (net pixel change rate) and UACI (unified average changing
intensity) are used as measures of differential analysis. NPCR indicates the
percentage of pixel change in the cipher image when a single pixel of plain
image is changed. UACI measures the average intensity of the change
between plain and cipher image.
Let us consider two cipher images X1 and X2, obtained from plain images P1 and P2 that differ in a single pixel. The pixel values at the grid position of the i-th row and j-th column of the cipher images are denoted X1(i, j) and X2(i, j). A bipolar array B is defined as follows:

B(i, j) = { 0, if X1(i, j) = X2(i, j)
          { 1, if X1(i, j) ≠ X2(i, j)        (6)
Values for NPCR and UACI are calculated as given in equations (7) and (8),
where W and H denote width and height of the cipher images, T denotes the
largest supported pixel value in the cipher images (255 in our case) and
abs() computes the absolute value. The NPCR and UACI values given in
Table 4 show that the encryption algorithm is secure against differential
attacks.
NPCR = ( Σ_{i,j} B(i, j) / (W x H) ) x 100%        (7)

UACI = ( 1 / (W x H) ) Σ_{i,j} ( abs(X1(i, j) − X2(i, j)) / T ) x 100%        (8)
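Equations (7) and (8) can be computed directly; a small sketch for cipher images given as 2D lists of pixel values (the function name is ours):

```python
def npcr_uaci(x1, x2, t=255):
    """NPCR and UACI, eqs. (7) and (8), for two equal-sized cipher
    images; T = 255 for 8-bit pixel data."""
    h, w = len(x1), len(x1[0])
    # B(i, j) from eq. (6): count positions where the ciphers differ.
    diff = sum(1 for i in range(h) for j in range(w) if x1[i][j] != x2[i][j])
    npcr = diff / (w * h) * 100
    uaci = sum(abs(x1[i][j] - x2[i][j]) / t
               for i in range(h) for j in range(w)) / (w * h) * 100
    return npcr, uaci

# One changed pixel out of four gives NPCR = 25%.
a = [[10, 20], [30, 40]]
b = [[10, 20], [30, 41]]
npcr, uaci = npcr_uaci(a, b)
print(npcr)  # 25.0
```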
Table 4. NPCR and UACI Values Obtained for Encryption of Six Plain Images and the Same Images with One Pixel Changed
Plain Images NPCR UACI
Lena 99.6333 33.4706
Bridge 99.5722 33.4403
Lake 99.5900 33.5313
Mandrill 99.6089 33.4595
Peppers 99.6185 33.4657
Plane 99.6206 33.4539
5. CONCLUSION
In this paper we proposed a new image encryption algorithm. The merits of recent research, judged on their results, were combined with a symmetric approach to encryption to provide a secure algorithm. The diffusion mechanism along with the Feistel structure makes the algorithm stronger. The 3D Rossler system of equations is used for random key generation, and splitting the three dimensions of the key across the three channels makes cryptanalysis to obtain the key more difficult. The experiments performed show that the algorithm generates favorable results.
REFERENCES
[1] Chang, C.-C., Hwang, M.-S. and Chen, T.-S., 2001. A New Encryption Algorithm for Image Cryptosystems. Journal of Systems and Software, Vol. 58, No. 2, pp. 83-91.
[2] Yano, K. and Tanaka, K., 2002. Image Encryption Scheme Based on a Truncated
Baker Transformation. IEICE Transactions on Fundamentals of Electronics,
Communications and Computer Sciences, Vol. E85-A, No. 9, pp. 2025-2035.
[3] Gao, T. and Chen, Z., 2008. Image Encryption Based on a New Total Shuffling Algorithm. Chaos, Solitons and Fractals, Vol. 38, No. 1, pp. 213-220.
[4] Chen, G., Mao, Y. and Chui, C.K., 2004. A Symmetric Image Encryption Based on 3D
Chaotic Cat Maps. Chaos, Solitons and Fractals, Vol. 21, pp. 749-761.
[5] Mao, Y., Chen, G. and Lian, S., 2004. A Novel Fast Image Encryption Scheme Based
on 3D Chaotic Baker Maps. International Journal of Bifurcation and Chaos, Vol. 14,
No. 10, pp. 3613-3624.
[6] Guan, Z.-H., Huang, F. and Guan, W., 2005. Chaos Based Image Encryption
Algorithm. Physics Letters A, Vol. 346, pp. 153-157.
[7] Zhang, L., Liao, X. and Wang, X., 2005. An Image Encryption Approach Based on
Chaotic Maps. Chaos, Solitons and Fractals, Vol. 24, pp. 759-765.
[8] Gao, H., Zhang, Y., Liag, S. and Li, D., 2006. A New Chaotic Algorithm for Image
Encryption. Chaos, Solitons and Fractals, Vol. 29, pp. 393-399.
[9] Pareek, N.K., Patidar, V. and Sud, K.K., 2006. Image Encryption Using Chaotic
Logistic Map. Image and Vision Computing, Vol. 24, pp. 926-934.
[10] Wong, K.-W., Kwok, B.S.-H. and Law, W.-S., 2008. A Fast Image Encryption Scheme Based on Chaotic Standard Map. Physics Letters A, Vol. 372, pp. 2645-2652.
[11] Amin, M., Faragallah, O.S. and Abd El-Latif, A.A., 2010. A Chaotic Block Cipher Algorithm for Image Cryptosystems. Communications in Nonlinear Science and Numerical Simulation, Vol. 15, pp. 3484-3497.
[12] Patidar, V., Pareek, N.K. and Sud, K.K., 2009. A New Substitution-Diffusion Based Image Cipher Using Chaotic Standard and Logistic Maps. Communications in Nonlinear Science and Numerical Simulation, Vol. 14, pp. 3056-3075.
[13] Rossler, O.E., 1976. An Equation for Continuous Chaos. Physics Letters A, Vol. 57, No. 5, pp. 397-398.
[14] Kamat, V.G. and Sharma, M., 2014. Enhanced Chaotic Block Cipher Algorithm for Image Cryptosystems. International Journal of Computer Science Engineering, Vol. 3, No. 2, pp. 117-124.
This paper may be cited as:
Kamat V. G. and Sharma M., 2014. Symmetric Image Encryption
Algorithm Using 3D Rossler System. International Journal of Computer
Science and Business Informatics, Vol. 14, No. 1, pp. 1-13.
Node Monitoring with Fellowship
Model against Black Hole Attacks in
MANET
Rutuja Shah, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University
Lakshmi Rani, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University
S. Sumathy, AP [SG]
School of Information Technology & Engineering, VIT University
Abstract
Security issues have increased considerably in mobile ad-hoc networks. Due to the absence of any centralized controller, the detection of problems and recovery from them is difficult. Packet drop attacks are among the attacks that degrade network performance. In this paper, we propose an effective node monitoring mechanism with a fellowship model against packet drop attacks, setting up an observance zone where suspected nodes are observed for their performance and behavior. Threshold limits are set to monitor the equivalence ratio of the number of packets received at a node to the number transmitted by the node inside mobile ad hoc networks. This fellowship model enforces a binding on the nodes to deliver essential services in order to receive services from neighboring nodes, thus improving the overall network performance.
Keywords: Black-hole attack, equivalence ratio, fair-chance scheme, observance zone, fellowship
model.
1. INTRODUCTION
Mobile ad-hoc networks are infrastructure-less, self-organized and self-configured networks of mobile devices connected by radio signals. There is no centralized controller for networking activities such as monitoring, modification and updating of the nodes inside the network, as shown in figure 1. Each node is free to move in any direction and hence can change its links to other nodes frequently. There have been serious security threats in MANETs in recent years. These usually lead to performance degradation, reduced throughput, congestion, delayed response time, buffer overflow, etc. Among them is a well-known attack on packets called the black-hole attack, which is a form of DoS (Denial of Service) attack. In this attack, a router relays packets to different nodes, but due to the presence of malicious nodes
these packets are susceptible to packet drop attacks. This hinders secure and reliable communication inside the network.
Figure 1. MANET Scenario
Section 2 addresses the seriousness of packet drop attacks and related work done so
far in this area. Section 3 elaborates our proposal and defending scheme for packet
drop attacks. Section 4 provides concluding remarks.
2. LITERATURE SURVEY
Packet drop loss in ad-hoc networks has gained importance because of self-serving nodes which fail to provide the basic facility of forwarding packets to neighboring nodes. This seriously hampers the functioning of the network. Generally there are two types of such nodes: selfish and malicious. Selfish nodes act to enhance their own performance, while malicious nodes continually act to degrade the functioning of the network. WATCHERS [1], from UC Davis, was presented to detect and remove routers that maliciously drop or misroute packets. It was based on the "principle of packet flow conservation", but it could not differentiate well between malicious and genuine nodes. Although robust against byzantine faults, it is not very effective in today's internet world at reducing packet loss.
The basic mechanism of packet drop loss is that nodes do not forward packets to other nodes, whether selfishly or maliciously. Packet drop loss can occur due to a black hole attack. Sometimes routers behave maliciously and drop packets selectively; such attacks are known as "grey hole attacks". In the case of routers, the attacks can be traced quickly, while in the case of nodes it is a cumbersome task. Many researchers have worked in this field and have tried to find
solutions to this attack [2-6]. Energy level is one of the parameters on which researchers have based their results. This idea works on the basis of the ratio of the fraction of energy committed by a node to the overall energy contributed towards the network. A node is retained inside the network on the basis of its energy level, and the energy level is decided by the activeness of the node in the network through mathematical computations. These computations [7] are too complicated to grasp and sometimes the results are catastrophic. It can be said that the computations are accurate, but they are very much prone to ambiguity in the case of ad-hoc networks.
A few techniques involve the use of routing table information, which is modified after detecting the MAC address of a malicious node that uses jamming-style DoS attacks, to cease its activities [8]. Another approach to reducing attacks used a trust-management strategy based on historical evidence [9]: a direct trust value (DTV) was used amongst neighboring nodes to monitor the behavior of nodes against black hole attacks, depending on their past. However, there is a high possibility that trust values may get compromised by malicious nodes, and the third party used for setting the trust values is also vulnerable to attacks. Recent methods include the introduction of a new protocol called RAEED (Robust formally Analyzed protocol for wirEless sEnsor networks Deployment) [10], which reduces this attack but not by a considerable percentage. To overcome the issues faced in implementing these strategies, there is a need for an effective mechanism to curb these attacks and make the network more secure.
3. PROPOSED APPROACH
In this paper, we put forth a mechanism to reduce these packet-drop attacks by implementing a "node monitoring with fellowship" technique. We introduce an obligation on the nodes inside a particular network to render services to the network. If services are not rendered, the node is expelled from the network. However, we keep a "fair-chance" scheme for all nodes, which helps determine whether a node is genuine or malicious.
3.1 Fellowship of Network
The prime parameter we use to address the packet drop attack issue is the "equivalence ratio": the count of incoming packets at a node, excluding those destined for that node, should equal the count of outgoing packets, excluding those originated at that node. If the counts are the same, there is uniform distribution and forwarding of packets among the nodes inside the network. However, if the counts differ, that particular node is kept under an "observance zone" in order to monitor its suspicious behavior. We suggest periodical reporting by all nodes of their equivalence ratio to neighboring nodes inside the network.
This helps decide whether to keep a particular node in the observance zone, which can be done with polling techniques amongst the nodes. Inside the observance
zone, the suspected node is given "fair-chance" treatment. That is, during the observance-zone period, the suspected node is required to submit a "status-message" to neighboring nodes to prove the genuineness of its performance inside the network. Genuine nodes will promptly provide their status-messages to neighboring nodes, because they are willing to stay inside the network and render services under obligation to the network. Malicious nodes, however, may or may not reply with status-messages, since their aim is to degrade network performance. But only a fair chance is given for such status-messages: a standard threshold level is set up unanimously amongst the neighboring nodes inside the network, and status-messages are entertained only up to that threshold. So even if malicious nodes produce fake status-messages in order to remain inside the network, the threshold limits ensure they cannot degrade network performance much. When the threshold is crossed, the neighboring nodes are intimated about the node under the observance zone, and a unanimous decision is taken to expel the suspected node from the network.
Under this scheme, a suspected node may be expelled from the network under two circumstances: it is either a genuine node which is underperforming, or a malicious node. In both cases, the suspected node needs to be expelled because it is degrading the network's performance. The "fair-chance" scheme ensures that genuine nodes are given a fair chance to justify themselves and to recover quickly, proving their genuineness to render services to the network under obligation.
3.2 Scenario Assumptions
Let the nodes inside the MANET be connected to each other through wireless links,
and let packets be transmitted and received between them. Let the nodes be named
alphabetically A, B, C, ... Z. Let node X be a malicious node that drops packets
(mounting a black hole attack) and hence has a poor equivalence ratio, while node Y
is a genuine node that also has a poor equivalence ratio due to network congestion or
other network issues. All nodes inside the network follow the principle of "node
monitoring with fellowship".
The data structures used are the following networking parameters:
1) equi_ratio: the equivalence ratio of a node.
2) observance_zone: the list of suspected nodes inside the observance zone.
3) threshold_value: the threshold value decided by the nodes inside the MANET.
4) status_message: the status messages exchanged amongst neighboring nodes.
Steps involved:
Step 1: Every node calculates its own equivalence ratio (equi_ratio) and periodically
shares it with its neighboring nodes (assumed to be at one-hop distance).
Step 2: All nodes unanimously agree upon a standard threshold level (in this case,
threshold_value = 3) through an exchange of messages using agreement protocols.
Step 3: Every node monitors its neighbors' equi_ratio; if any node has a notably poor
equi_ratio, that node is placed on the "observance zone" list through a mutual
exchange of messages among the nodes inside the network. Such a node may be
malicious, or genuine but underperforming.
Step 4: Once a suspected node is on the "observance zone" list, it must report a
status_message to its neighboring nodes to justify its performance and behavior.
Step 5: A malicious node (node X) may either fake its status_message to appear
genuine and stay inside the network, or simply avoid sending one, since it wishes to
continue its malicious activities. A genuine node (node Y) will send its
status_message to prove its genuineness and will try to improve its performance by
recovering from the network issues it faces while sending packets. In both cases, the
fair-chance scheme limits how often a node may justify itself through
status_messages: suspected nodes are allowed to send a status_message only up to
threshold_value times (here, 3 times). In short, both malicious nodes and
underperforming genuine nodes are kept under surveillance to observe their behavior.
Step 6: Nodes that cross the threshold_value limit are immediately expelled from the
network through an exchange of protocol messages between the neighboring nodes.
In this way, packet-drop attacks can be considerably reduced. Figure 2 explains the
workflow mechanism.
Figure 2. Flowchart of proposed mechanism
[Flowchart: the threshold_value is set unanimously and equi_ratio is exchanged with
neighboring nodes periodically. If a node's equi_ratio is acceptable, normal network
activities continue; if it is unacceptable, the suspected node is placed under
observance_zone and status_messages are exchanged. While the status_message
count is less than or equal to threshold_value, monitoring continues; once it rises
above threshold_value, the suspected node is expelled from the network.]
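The steps above can be sketched in code. The following is a minimal single-node Python sketch under the paper's assumptions; the Node class, its packet counters and the monitor() helper are hypothetical simplifications for illustration, not the authors' implementation:

```python
THRESHOLD_VALUE = 3  # agreed unanimously by the nodes (Step 2)

class Node:
    def __init__(self, name):
        self.name = name
        self.packets_in = 0       # incoming packets not destined for this node
        self.packets_out = 0      # outgoing packets not originated at this node
        self.status_messages = 0  # fair-chance justifications used so far

    def equi_ratio_ok(self):
        # equivalence ratio: forwarded-in count should equal forwarded-out count
        return self.packets_in == self.packets_out

def monitor(node, observance_zone):
    """One monitoring round as seen by the neighboring nodes (Steps 3-6)."""
    if not node.equi_ratio_ok():
        observance_zone.add(node.name)        # Step 3: suspect the node
    if node.name in observance_zone:
        node.status_messages += 1             # Step 4: demand a status_message
        if node.status_messages > THRESHOLD_VALUE:
            return "expel"                    # Step 6: fair chance exhausted
    return "keep"
```

For a black-hole node X that forwarded only 2 of the 10 packets it received, four monitoring rounds end in expulsion; a node with an equal in/out count is never placed in the zone.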
3.3 Advantages:
1. The fair-chance scheme ensures that innocent nodes can prove their genuineness.
2. No complex mathematical computation of energy levels is required at each node.
3. Periodical reporting ensures removal of both underperforming and malicious
nodes from the network.
4. Network performance in the MANET is upgraded.
3.4 Disadvantages
However, the scheme incurs the overhead of exchanging a larger number of messages
among the neighboring nodes. Optimizing the number of messages exchanged during
communication can be addressed in future research.
4. CONCLUSION
In this paper, we have proposed a novel scheme to reduce packet drop attacks and
enhance network performance. We anticipate that our "node-monitoring with
fellowship" model may increase the number of messages exchanged amongst
neighboring nodes during the agreement protocols; at the same time, it is robust
against attacks and thus increases the availability of nodes in mobile ad-hoc
networks. Minimizing packet drop loss yields better utilization of the channel and
resources and guaranteed QoS, which results in productive priority management and
considerably better controlled traffic through periodic surveillance of the nodes.
Future research will aim to reduce the exchange of messages amongst the nodes,
minimize the overhead and achieve optimization inside mobile ad-hoc networks.
This paper may be cited as:
Shah, R., Rani, L. and Sumathy, S. 2014. Node Monitoring with Fellowship Model
against Black Hole Attacks in MANET. International Journal of Computer Science
and Business Informatics, Vol. 14, No. 1, pp. 14-21.
Load Balancing using Peers in an
E-Learning Environment
Maria Dominic
Department of Computer Science,
Sacred Heart College, India
Sagayaraj Francis
Department of Computer Science and Engineering,
Pondicherry Engineering College, India
ABSTRACT
When an e-Learning system is installed on a server, numerous learners make use of it
and download various learning objects from it. Most of the time the request is for the
same learning object, so the server performs the same repetitive task of locating the
file and sending it to the requesting client, wasting the server's CPU time on a task it
has already performed. This paper provides a novel structure and an algorithm that
store, in a dynamic hash table, the details of the clients who have already downloaded
each learning object; when a new request comes in, the table is looked up and the
learning object is sent from such a client to the requestor, saving the server's CPU
time by harnessing the computing power of the clients.
Keywords
Learning Objects, e-Learning, Load Distribution, Load Balancing, Data Structure,
Peer-to-Peer Distribution.
1. INTRODUCTION
1.1 e-Learning
Education is defined as the conscious attempt to promote learning in others,
so that they acquire knowledge, skills and character [1]. To achieve this
mission, different pedagogies were used; later, the advent of new
information and communication technology tools and the popularity gained
by the internet enhanced the teaching and learning process and gave way to
the birth of e-learning [2]. This enabled learners to learn across time and
geographical barriers and allowed them to have individualized learning
paths [3]. E-Learning, or electronic learning, is perceived as a combination
of the internet, electronic form and networks to disseminate knowledge.
The key factors of e-learning are reuse, sharing of resources and
interoperability [4]. At present there are various organizations
providing e-learning tools with multiple functionalities; one such tool is
MOODLE (Modular Object Oriented Dynamic Learning Environment) [5],
which is used on our campus. This variety in turn created difficulty in
sharing learning objects between heterogeneous sites, and standards such as
SCORM and SCORM LOM [6], IMS and IMS DRI [7], AICC [8] and the
like were proposed by different organizations. In Berners-Lee's famous
architecture for the Semantic Web, ontologies are used for sharing and
interoperability, which can be used to build better e-learning systems [9].
To define components for e-learning systems, the methodology used is the
principle of composability in Service Oriented Architecture [10], since it
enables us to define the inter-relations between the different e-learning
components. The most popular model used nowadays in the teaching and
learning process is the Felder-Silverman learning style model [11]. The
e-Learning components are based on key topics, topic types, associations
and occurrences. A VLE (Virtual Learning Environment) is the software
which handles all the activities of learning. Learning objects are the
learning materials which promote visual, verbal, logical and musical
intelligence [12] through presentations, tutorials, problem solving and
projects. Multimedia, gaming and simulation promote kinaesthetic
intelligence. Interpersonal, intrapersonal and naturalistic intelligence are
promoted by means of chat, SMS, e-mail, forums, video and audio
conferencing, surveys, voting and search. Finally, assessment is used to test
the knowledge acquired by the learner, and the repository is the place which
holds all the learning materials.
This algorithm is useful when learners access the learning objects stored in
the repository. It reduces the server's load by directing a client to respond to
the requestor with a file it has already downloaded from the server.
1.2 Load Balancing
The emergence of large, fast networks with thousands of connected
computers created the challenge of sharing resources effectively among the
computers in the network. Load balancing is a critical issue in peer-to-peer
(P2P) networks [14]. Existing load balancing algorithms for heterogeneous
P2P networks are organized in a hierarchical fashion. As P2P systems
gained popularity, it became mandatory to manage huge volumes of data
while keeping the response time acceptable to users. Simultaneous requests
for data from multiple clients may cause some peers to become bottlenecks,
creating severe load imbalance and degrading the response time to the user.
To reduce these bottlenecks and the overhead on the server, there was a
need to harness the computing power of the peers [15]. Much work has
been done on harnessing
the computing power of the computers in the network for high performance
computing and scientific applications; faster access to data and reduced
computing time are still to be explored. In a P2P network the data is
de-clustered across the peers. When a popular piece of data is required
from across the peers, a bottleneck occurs, degrading the system response.
To handle this, a new strategy using a new structure and an algorithm is
proposed in this paper.
2. PROPOSED DATA STRUCTURE AND THE ALGORITHM
The objective of this architecture is to harness the computational power of
the clients in the network. The architecture is described with respect to the
clients available in the e-learning network. The network comprises Master
of Computer Applications students accessing learning materials for their
course. The degree programme runs for three years, so the clients are
categorized into three clusters, namely I MCA, II MCA and III MCA; we
call these class clusters. Every class cluster contains many clusters inside it,
which we call file clusters, one cluster per type of file, since learning
objects can be presentations, video, audio, pictures, animations, etc. [13].
An address table, named the file address table, holds the address of each
file cluster in the class cluster. When a request for a file is received, the
corresponding cluster is identified by reading the address from the address
table. The algorithm below represents the working logic of the concept.
The data structure is represented in Figure 1. Every file cluster holds a
Dynamic Hash Table (DHT), a linked list and a binary tree. The dynamic
hash table holds the addresses of the linked lists, which hold the names of
files already downloaded from the server. The hashing function used to
identify an index in the DHT is as follows:
1. Represent every character in the filename by its position in the
alphabet followed by its position in the filename.
E.g. file name abc.ppt = 112233; the value for a is 11, since its position
in the alphabet is 1 and its position in the file name is 1.
2. Sum all the digits calculated in step 1.
E.g. 112233 → 1+1+2+2+3+3 = 12.
3. Divide the sum by the length of the file name, so 12/3 = 4, which
becomes the index for the file in the DHT. These three steps are
formulated mathematically in equation (1).
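The three steps above can be sketched as a small Python function. This is an interpretation of the paper's scheme, not the authors' code; note that, following the worked example, the length used in step 3 is that of the base name without its extension:

```python
def hashed(filename):
    """DHT index per the paper's 3-step hashing: concatenate each
    character's alphabet position with its position in the name,
    sum the resulting digits, then divide by the name length."""
    base = filename.split(".")[0]                # "abc" for "abc.ppt", as in the example
    digits = ""
    for pos, ch in enumerate(base, start=1):
        alpha = ord(ch.lower()) - ord("a") + 1   # a=1, b=2, ...
        digits += str(alpha) + str(pos)          # e.g. first 'a' -> "11"
    digit_sum = sum(int(d) for d in digits)      # "112233" -> 12
    return digit_sum // len(base)                # 12 // 3 -> index 4
```

hashed("abc.ppt") gives 4, matching the worked example; hashed("bac.ppt") also gives 4, which is exactly the kind of collision that the linked list in Figure 1 resolves.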
[Figure: the address table holds the addresses of the PPT, audio and video file
clusters. Each file cluster (e.g. the cluster of presentational files) contains a
Dynamic Hashing Table whose slots (0, 1, 2, 3, ...) point to linked lists of
downloaded file names (e.g. abc.ppt → bac.ppt → NULL); each list node also points
to a binary tree whose nodes hold a client's IP and CPU usage time.]
Figure 1. Proposed Data Structure
Every index of the DHT holds the starting address of a linked list, each node of
which stores the name of a file that has already been downloaded. The linked list
structure is used to avoid index collisions between filenames that generate the
same index in the DHT: a collision is handled by creating a new node in the
linked list for the new file name. As shown in Figure 1, every node in the linked
list holds three values, namely the file name, the address of a binary tree, and the
address of the next node in the list. The nodes of the binary tree hold the IPs of
active clients and their current CPU processing status. The binary tree is used to
identify the client with the least-used CPU, which will transfer the file to the
requestor; this harnesses the computing power of the least-used CPU. The binary
tree structure is used to reduce the search time for the least-used client. If the file
has not been downloaded by any client, i.e. when the last node of the linked list is
reached, the file is transferred from the server.
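As a rough illustration (not the authors' implementation), the structures just described can be sketched in Python; the class and field names are hypothetical:

```python
class TreeNode:
    """Binary tree node: an active client's IP and its CPU usage time."""
    def __init__(self, ip, cpu_usage):
        self.ip = ip
        self.cpu_usage = cpu_usage
        self.left = None
        self.right = None

class ListNode:
    """Linked list node: file name, its client tree, and the next node."""
    def __init__(self, filename):
        self.filename = filename
        self.tree = None      # binary tree of clients holding this file
        self.next = None

class FileCluster:
    """A file cluster: DHT slots, each pointing at a linked list head."""
    def __init__(self, slots=10):
        self.dht = [None] * slots

    def insert(self, index, filename):
        # index collisions are resolved by prepending a new list node
        node = ListNode(filename)
        node.next = self.dht[index]
        self.dht[index] = node
        return node
```

With this layout, abc.ppt and bac.ppt (which hash to the same index) simply become two nodes in the same slot's linked list, as in Figure 1.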
Algorithm 1 SEND ( ) {
Request is directed to the file cluster
Address of the file cluster is taken from the address table
Index location of the file = HASHED (File Name)
If the index is out of bounds
{
The file has not been downloaded by any client
It is sent from the server to the client
}
Else
{
While (not end of linked list AND the node is not found)
{
If (Node.data == File Name)
{
Node found = true
Least-usage-time CPU IP = LEASTUSEDCPU ( )
}
}
If (Node found == true)
Send the requested file from that IP to the requestor
Else
{
The file has not been downloaded by any client
It is sent from the server to the client
}
}
}
End of SEND ( )
Algorithm 2 int LEASTUSEDCPU ( ) {
Leastusedcpu = IP of the first node
While (not end of binary tree)
{
Compare the CPU usage time of the current node
with that of the least-used node found so far
If the CPU usage time of the current node is lesser
Leastusedcpu = IP of the current node
}
return (Leastusedcpu)
}
End of LEASTUSEDCPU ( )
Algorithm 3 int HASHED (String Filename) {
Len = StringLength(Filename)
While (not end of string)
{
IndexString += (position of the character in the alphabet list
followed by its position in the Filename)
}
IndexInt = SumOfDigits(IndexString)
return (IndexInt / Len)
}
End of HASHED ( )
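Putting the three algorithms together, here is a minimal runnable Python sketch under simplifying assumptions: a plain dict stands in for the DHT, a list of (filename, clients) pairs for the linked list, and a flat list of (cpu_usage, ip) tuples for the binary tree, searched with min():

```python
def send(dht, index, filename):
    """Resolve a request: return the IP of the least-used client holding
    the file, or 'SERVER' if no client has downloaded it yet."""
    bucket = dht.get(index, [])          # the linked list for this DHT slot
    for name, clients in bucket:         # walk the list (collision handling)
        if name == filename and clients:
            # LEASTUSEDCPU: pick the client with the smallest CPU usage time
            return min(clients)[1]
    return "SERVER"                      # fall back to the server

# Hypothetical state: abc.ppt and bac.ppt both hash to index 4
dht = {4: [("abc.ppt", [(70, "10.0.0.2"), (35, "10.0.0.7")]),
           ("bac.ppt", [(55, "10.0.0.9")])]}
```

Here send(dht, 4, "abc.ppt") returns "10.0.0.7" (the least-loaded holder), while a request for a file no client has yet falls back to "SERVER".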
3. MATHEMATICAL FORMULATION
The mathematical formulation of the problem dealt with above is as follows:

index = ( Σ_{i=0}^{l} ( a(i+1) + j ) ) / l                  (1)

Σ_{K=1}^{n} index( Z_K(f3) ) = X(f2)   → go to (4), (5)     (2)

Σ_{K=1}^{n} index( Z_K(f3) ) ≠ X(f2)   → go to (6)          (3)

Min Σ_{m=1}^{n} Y = C(m, m+1)                               (4)

f2 = X⁻¹ Y(f1)                                              (5)

f2 = X⁻¹ S(f1)                                              (6)
where,
index is the index in the Dynamic Hash Table
l is the length of the file name
i is the character position in the file name
j = { 1,2,3,4,5….26 }
k is the number of nodes in the linked list
Z is the node in the linked list
f3 is the file name in the node in the linked list
f2 is the targeted file
m is the nodes in the Binary Tree
Y is the node in the Binary Tree with the minimum C
C is the CPU usage time of the specified IP
S is the Server
4. CONCLUSION
The main advantage of this architecture is that server time is saved by
harnessing the computational power of the clients who have already
downloaded a file, letting them send it across to the requestor. Another
advantage is faster file search, thanks to the dynamic hashing table and
binary tree structures. The algorithm is currently being implemented using
PHP, and its results will appear in further publications. Initial results
indicate a substantial reduction in the server's CPU processing time when
this algorithm is executed on the server.
REFERENCES
[1] Lavanya Rajendran, Ramachandran Veilumuthu., 2011. A Cost Effective Cloud
Service for E-Learning Video on Demand, European Journal of Scientific Research,
pp.569-579.
[2] Maria Dominic, Sagayaraj Francis, Philomenraj., 2013. A Study on Users on Moodle
through Sarasin Model, International Journal of Computer Engineering and
Technology, Volume 4, Issue 1, pp. 71-79.
[3] Maria Dominic, Sagayaraj Francis., 2013. Assessment of Popular E-Learning Systems
via Felder-Silverman Model and a Comprehensive E-Learning System,
International Journal of Modern Education and Computer Science, Hong Kong,
Volume 5, Issue 11, pp. 1-10.
[4] Zhang Guoli, Liu Wanjun, 2010. The Applied Research of Cloud Computing. Platform
Architecture in the E-Learning Area, IEEE.
[5] www.moodle.org
[6] SCORM(Sharable Courseware Object Reference Model), http://www.adlnet.org
[7] IMS Global Learning Consortium, Inc., “Instructional Management System (IMS)”,
http://www.imsglobal.org.
[8] http://www.aicc.org
[9] Uschold, Gruninger., 1996. Ontologies, Principles , Methods and Applications,
Knowledge Engineering Review, Volume 11, Issue 2.
[10]Papazoglou, Heuvel., 2007. Service Oriented Architectures: Approaches,
Technologies, and research issues, The VLDB Journal, , Volume 16, Issue 3, pp. 389-
415.
[11] Graf, Viola, Kinshuk., 2006. Representative Characteristics of Felder-Silverman
Learning Styles: an Empirical Model, IADIS, pp. 235-242.
[12]Lorna Uden, Ernesto Damiani., 2007. The Future of E-Learning: E-Learning
ecosystem, Proceeding of IEEE Conference on Digital ecosystems and Techniques,
Australia, pp. 113-117.
[13] Maria Dominic, Sagayaraj Francis., 2012. Mapping E-Learning System to Cloud
Computing, International Journal of Engineering Research and Technology, India,
Volume 1, Issue 6.
[14] Chyouhwa Chen, Kun-Cheng Tsai., 2008. The Server Reassignment Problem for
Load Balancing in Structured P2P Systems, IEEE Transactions on Parallel and
Distributed Systems, Volume 19, Issue 2.
[15]A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica., 2006. Load
Balancing in Structured P2P Systems. Proc. Second Int’l Workshop Peer-to-Peer
Systems (IPTPS ’03).
This paper may be cited as:
Dominic, M. and Francis, S. 2014. Load Balancing using Peers in an E-
Learning Environment. International Journal of Computer Science and
Business Informatics, Vol. 14, No. 1., pp. 22 -29.
E-Transparency and Information
Sharing in the Public Sector
Edison Lubua (PhD)
Mzumbe University, P.O. Box 20266,
Dar Es Salaam, 255, Tanzania
ABSTRACT
This paper determines the degree of information sharing in government institutions through e-
transparent tools. First the basis for the study is set through the background, problem statement
and objectives. The discussion then proceeds by focusing on ICT tools for information sharing.
An information sharing model is proposed and the extent of information sharing in the public
sector of Tanzania through online media is discussed; furthermore, the correlation that exists
between the extent of information sharing and factors such as accessibility, understandability,
usability and reliability is established. The paper concludes by providing recommendations on
information sharing and how it can be enhanced through e-transparency systems for public
service delivery in an open society.
Keywords
E-transparency, E-Governance, Information Sharing, Public Sector, ICT.
1. BACKGROUND OF THE STUDY
Generally, information services are an important pillar for any democratic
government. Citizens rely on information for making decisions which impact
upon their social, political and economic lives. In this regard, there are laws
which govern the right to access and disseminate information, locally and
internationally (Hakielimu, LHRC, REPOA, 2005). Locally, government
authority reflects international agreements through different pieces of legislation,
including the National Constitution (United Republic of Tanzania, 1995). The
constitution of Tanzania entitles every citizen to the right of access to information
and empowers citizens with the right to disseminate information.
In his study, Onyach-Olaa (2003) commended government authorities which
make an effort to enhance information sharing with citizens. A government has
to improve interaction with those it governs while treating information sharing
as a core function. Furthermore, information sharing and transparency in
government operations must become the culture of any democratic republic,
including Tanzania (Mkapa, 2003). Transparency in government operations
improves the confidence of citizens in their government, while reminding
government leaders that their decisions and their impact are visible to citizens
(Navarra, 2006). Traditionally, information services have been provided and
received through physical means; mostly, people use oral/listening and
writing/reading methods to issue and receive information. In many cases, the
traditional method of information sharing is characterised by delays, high cost,
low transparency and bureaucracy (Im & Jung, 2001); as a result, this method
allows accountability to be subverted (Lubua, 2014).
Arguably, the communication developments brought by the use of Information
and Communication Technology (ICT) tools provide a better platform for
information sharing. Instant communication is enabled through tools such as
email, online telephony, video conferencing, chat rooms and social websites. As
a result of these tools, challenges related to delays, high communication costs
and bureaucratic procedures are addressed.
Apart from the platform provided by online media in enhancing communication,
it is equally important to understand that the efficiency of information sharing is
directly related to the size of the network connecting individuals, groups of
people and organisations (Hatala & Lutta, 2009). The denser the network, the
more information is received; an organisation enjoys these benefits if it forms
strategic alliances with partners that allow a free flow of information to both
ends. This is the reason why the e-governance agency was instituted in
Tanzania.
The appropriate use of e-transparency tools is perhaps the best strategy for an
organisation to enhance information sharing with its stakeholders. The
organisation has to emphasize good qualities of information sharing such as
timely response, accessibility of systems, reliability of data, online security,
completeness of online procedures and openness in service processes. Basically,
this paper discusses the need for online information sharing in the public sector
and the extent to which government institutions apply online media for
information sharing and service provision. The study is based on the opinions of
clients who consume such services.
2. PROBLEM STATEMENT
Business competition compels organisations to invest in information systems to
improve the efficiency of their operations (Barua, Ravindran, & Whinston, 2007).
This investment is made possible through the knowledge of employees, suppliers,
customers, and other key stakeholders. In this regard the organization that shares
its information with stakeholders more efficiently earns a competitive advantage
(Drake, Steckler, & Koch, 2004).
Information sharing is an important resource which should be embraced in order
to enhance the performance of an organisation (Hatala & Lutta, 2009).
Depending on the type of organisation, the extent of information sharing is partly
influenced by organisational policies and practices. The management team,
employees and partners have to work together to foster organisational
information sharing, which guarantees the future existence of the organisation
(Drake, Steckler, & Koch, 2004).
The government of Tanzania acknowledges the importance of ICTs in promoting
information sharing in society. It uses methods such as conferences, workshops
and public portals to show its intention of maximizing information sharing. With
the growth in the number of ICT users, the degree of information sharing is
expected to increase. Therefore, this study intends to establish the extent to
which the use of ICTs has enhanced information sharing. Further, the study will
establish the correlation between the extent of information sharing and factors
which negatively influence the perception of users.
3. OBJECTIVES
This study is designed to cover the following objectives:
i. To determine the extent of information sharing through e-transparency in
the Tanzanian public sector.
ii. To establish the extent to which information usefulness,
understandability, reliability and accessibility influence information
sharing through e-transparency systems.
4. METHODOLOGY
This study was conducted through a mixed research method. First, a number of
works in the literature were reviewed to establish the study's relevance. Then the
Tanzania Revenue Authority's Customs Online System was identified as a case
study, followed by survey procedures. Data were collected from twenty (20)
clearing and forwarding companies that operate under the Customs regulations of
the Tanzania Revenue Authority; a total of 40 responses were received and
analysed. The study collected data from original sources to enhance validity and
relevance. The analytical models used include Spearman's correlation and
regression analysis.
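For reference, Spearman's rank correlation used in the analysis can be computed with a short Python sketch. The survey scores shown are made up for illustration, and the closed-form formula below assumes no tied ranks:

```python
def ranks(values):
    """Rank each value (1 = smallest); assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    """Spearman's rho: 1 - 6 * sum(d^2) / (n * (n^2 - 1)), no ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical scores: perceived accessibility vs. extent of sharing
accessibility = [1, 2, 3, 4, 5]
sharing       = [2, 1, 4, 3, 5]
```

For these made-up scores, spearman_rho(accessibility, sharing) evaluates to 0.8, a strong positive rank correlation.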
5. ICT TOOLS AND GOVERNMENT INFORMATION SHARING
Transparency is one of the pillars of good governance: it promotes openness in
conditions and activities, and it ensures that stakeholders have the information
necessary to make decisions for the progress of their business and their lives.
Information thus forms the cornerstone of transparency, especially in civic
organisations.
In the management of civic institutions, information dissemination provides
guidance and education to stakeholders on the different matters that influence
their lives, including political, socio-economic and cultural issues. The
availability of information is clearly influenced by the media used in the
capturing, storage and dissemination process. Since electronic media are
effective in raising the level of transparency in society, the government should
take advantage of these tools to build its relationship with citizens through
information sharing, and hence engage them in supporting planned public
development goals (Abu-Dhabi-Government, 2011; Lubua & Maharaj, 2012).
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 33
In the Republic of Tanzania, the use of ICT tools for communication and
information sharing increases on a daily basis; internet users increased by
450% between 2001 and 2010. Additionally, about 50% of the population of
Tanzania is reported to use either the internet or a mobile phone (Kasumuni,
2012). Given this increase, understanding the extent to which information from
government institutions is shared enables the government to know how
effectively these media are utilised to promote national development.
6. AN INFORMATION SHARING MODEL
This paper summarises information sharing using the model presented in Figure 1.
The abundance and availability of information mean that users need skills to
determine what it is that they want. The user of information therefore plays
the key role in effecting information sharing: the user must be able to use
relevant tools to search for information and to determine the relevance of
accessed data to his/her operations. The ability to use such tools is attained
through learning. Beyond knowing how to use search tools, the user must be
aware of the problem that they need to solve.
Figure 1: Information Sharing Model
Source: Research Data (2012)
The choice of information is dictated by the gap which has to be covered; when
this gap is expressed, it becomes a need. In responding to the need, the user
of information consults a source, which is either electronic or physical. It is
possible that the source may not have the type of information requested, or
that the information may not be satisfying. Regardless of the level of
satisfaction, the user of information takes action towards covering the gap.
Where the public seeks information from government institutions,
dissatisfaction may influence
members of the public to take action, even against the government; on the other
hand, satisfaction encourages more support for the government (Lubua, 2014).
The satisfied user applies the information to solve the problem identified in
the gap. A good example is a farmer searching for a good market for his/her
harvest; s/he will eventually use the information to choose a better market.
Similarly, the recent Arab uprisings represent a possible negative response by
users of information in cases of low satisfaction (Van Niekerk, Pillay &
Maharaj, 2011). The government should therefore respond adequately to inquiries
from citizens to reduce the possibility of a negative response. It must ensure
the adequate availability of information that addresses citizens’ daily
challenges.
7. INFORMATION SHARING USING E-TRANSPARENCY TOOLS IN
PUBLIC INSTITUTIONS
The introduction of ICT tools brings more opportunities for information sharing
in organisations by allowing users to receive and send information more easily
(Kilama, 2013). Stakeholders are also able to discuss issues of different
interests through tools such as social networks, chat rooms, e-mail systems and
video/teleconferencing, and organisations are able to solicit stakeholders’
opinions before making decisions (Im & Jung, 2001; Lubua & Maharaj, 2014).
Alongside the progress made in information sharing, there is a need to know the
extent to which government institutions apply online media for information
sharing. This study is based on the opinions of clients who consume online
information from a government institution.
Based on responses from clients of the Tanzania Revenue Authority, it was found
that 70% of respondents agree that the authority sufficiently shares its
information through online media. These respondents are clients of Custom
services who benefit from the Custom Online System (CULAS). The following
factors influenced the successful deployment of this system:
a.) Good ICT infrastructure
The ICT infrastructure of the Tanzania Revenue Authority is well established;
it is characterised by a good interface, reliable data backup systems, power
backups and a reliable internet connection. In addition, the revenue authority
is among the organisations benefiting from the high-capacity internet
connectivity of the National ICT Backbone (NICTBB). Nevertheless, the study
observed that not all respondents had access to the infrastructure of the
revenue authority; some lacked computers to access such systems. A computer
room for clients would be an important extension of the services offered by the
revenue authority in its Custom section, and would equally facilitate users who
are not based in Dar es Salaam but visit for Custom services.
b.) Technical Skills and Competency
The infrastructure of an information system requires competent staff to
maintain and operate its functions (Badillo-Amador, García-Sánchez, & Vila,
2005; Cohen, 2012). In many cases, the revenue authority uses its own staff to
run its operations; where advanced knowledge is required, the institution
partners with non-governmental organisations for technical services. To a large
extent, the revenue authority uses training to equip its employees.
Nevertheless, the study noted cases where training was not as effective as
expected. An analysis of the degree of association between training and the
skills possessed by staff, using the Pearson correlation model, observed an
insignificant association (r = 0.101, p = 0.316), except where a follow-up
programme was instituted (r = 0.292, p = 0.003). It is therefore necessary to
incorporate follow-up programmes after training for enhanced competency.
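As an illustration of the statistic used here, Pearson's r can be computed directly from paired observations. The sketch below uses invented data for illustration only (not the study's data):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient: r = cov(x, y) / (sd(x) * sd(y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical follow-up effort after training vs. an assessed skill
# score for ten staff members (illustrative numbers only).
follow_up_hours = [0, 1, 2, 2, 3, 4, 5, 6, 7, 8]
skill_score = [52, 50, 55, 58, 60, 63, 62, 68, 70, 74]
print(round(pearson_r(follow_up_hours, skill_score), 3))
```

A value close to 1 would indicate a strong positive association; the study's reported r = 0.292 is a much weaker association, and its significance is judged by the accompanying p-value.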
c.) Institutional Will
Installing a good ICT infrastructure has to be complemented by the willingness
of staff to use the new system exclusively for service provision. The
management of the Tanzania Revenue Authority Custom Department has dedicated
its online system as the only method for issuing services to clients. So far,
the experience of operational staff is reported to be outstanding; however, the
lack of important equipment such as computers for some employees, and
occasional system breakdowns, affect its use.
d.) Customer Satisfaction
Changes have to be managed carefully in order to avoid frustrating clients.
Together with implementing new changes for service provision, the Tanzania
Revenue Authority Custom Department established a help desk that attends to
clients’ queries about different applications of the new system. Additionally,
documentation is provided that describes the steps to be taken in using the
system. This study found that 95% of respondents recommend or strongly
recommend the use of the Tanzania Revenue Authority Custom Online System for
securing services from the institution; these results show that users’
satisfaction with the online system is high.
8. INFORMATION USEFULNESS, UNDERSTANDABILITY,
RELIABILITY AND ACCESSIBILITY AND THE EXTENT OF
INFORMATION SHARING
As shown in the previous section, respondents from the Tanzania Revenue
Authority have confidence in the extent to which the government institution
shares information with stakeholders through online media. While this extent is
influenced by a number of factors, this study is interested in the following:
information accessibility, information usefulness, information reliability and
information understandability. This part of the study identifies how information sharing is
influenced by these factors; a linear regression model was used to establish
the relationship between these variables, as shown in Table 1 below.
Table 1: Model Summary
Regression Model | R | R Square
1 | 0.724a | 0.524
a. Predictors: (Constant), Government online information is reliable,
Government online information is useful, the use of the Internet has enhanced
access to information, and Government online information is easily understood
According to data reported by clients of the Tanzania Revenue Authority, the
value of the coefficient of relatedness (R) is 0.724; this value suggests the
presence of correlation between the variables. At the Tanzania Revenue
Authority, information usefulness, understandability, reliability and
accessibility are important attributes of the information provided to users,
because the online system is the only means for users to access Custom
services. The appreciation of these variables influences the extent of
information sharing among stakeholders. Below is a brief explanation of how
these variables are enhanced at the Tanzania Revenue Authority.
a.) Information Accessibility
The Tanzania Revenue Authority’s Custom Online System provides users¹ with
credentials which provide access to the system. Within the system, users are able
to trace every stage of their application. Moreover, to ensure that the system
is constantly accessible to clients, the link to the online system is published
on the website and supported by servers which run constantly with the support
of information and power backups. Although accessibility is better than in
other public institutions, users reported cases where they failed to launch
their service applications due to extended system downtime.
b.) Data Reliability
The online system of the Tanzania Revenue Authority ensures reliability by
dedicating a few officials who are experts in Custom services to manage
clients’ queries and applications in the system. Furthermore, employees of the
revenue authority verify the information sent by clients before they effect a
transaction, to ensure the reliability of the information involved. This
ensures that only information which is both relevant and correct is provided to
consumers through the online media. Moreover, to ensure that the information
from users of the online system is reliable, the system guides users through
the different stages involved in an application for services. The system also dictates the format
¹ Who are clearing and forwarding experts.
of the information to be entered, to ensure consistency; further, it grants
users the opportunity to proofread their data entry before the information is
finally submitted.
c.) Information Usefulness
The Custom Online System is dedicated to the Customs Department only, and is
tailored to meet the needs of clearing and forwarding agents by simplifying
their tax-paying processes. The authority receives feedback from clients on
different aspects of the system, including its usefulness for its intended use.
Although many respondents agree that the information they receive is useful,
the study noted that a number of users were not comfortable with the use of the
English language for communication. Swahili is Tanzania’s national language;
its adequate use would improve users’ ability to understand information in
context, and hence improve usefulness.
d.) Information Understandability
The issue of understanding information provided through online systems is
critical; the diverse nature of the Tanzania Revenue Authority’s users implies
differences in analytical and language skills. While Tanzania uses Kiswahili as
its national language, English is used for academic and business operations.
Due to differences in education and analytical skills, some clients of the
Tanzania Revenue Authority need language assistance before they can understand
the content of information. Recognising this challenge, the revenue authority
maintains a dedicated helpdesk to clarify issues which users find difficult to
understand.
9. CONCLUSION
The purpose of the study was to establish the degree to which the Tanzanian
public sector uses ICTs to enhance transparency. The assessment was guided by
the fact that Tanzania advocates good governance, of which information sharing
is an important component. The study also recognises that ICTs play an
important role in the business sector by ensuring that clients access services
efficiently and with maximum transparency; the same experience could be adopted
by the government to raise citizens’ satisfaction with government services. The
study observed that many people are aware of the importance of ICTs in ensuring
transparency in government operations. However, there were several cases where
performance did not meet users’ expectations. Factors such as low system
reliability and the ineffectiveness of officials operating the system were
among those which affected the use of ICTs for enhanced transparent services.
While training was identified as important in equipping users with the required
technical skills, it was occasionally observed to be the opposite: training
requires follow-up to ensure that it meets expected goals. Equally, information
accessibility, reliability, usefulness and understandability have a great
impact on users’ experience of online media.
REFERENCES
[1] Badillo-Amador, L., García-Sánchez, A., & Vila, L. E. (2005). Mismatches In The Spanish
Labor Market: Education Vs. Competence Match. International Advances in Economic
Research, Vol 11, 93-109.
[2] Barua, A., Ravindran, S., & Whinston, A. (2007). Enabling Information Sharing Within
Organizations. Information Technology and Management, Vol. 3, 31-45.
[3] Cohen, J. (2012). Benefits Of On Job Training. Retrieved February 7, 2013, from
http://jobs.lovetoknow.com
[4] Drake, D., Steckler, N., & Koch, M. (2004). Information Sharing In And Across Government
Agencies: The Role And Influence Of Scientist, Politician, And Bureaucratic Subcultures.
Social Science Computer Review, 22(1), 67–84.
[5] HAKIELIMU, LHRC, REPOA. (2005). Access To Information In Tanzania: Is Still A
Challenge. Retrieved September 11, 2012, from
http://www.tanzaniagateway.org/docs/Tanzania_Information_Access_Challenge.pdf
[6] Hatala, J.-P., & Lutta, J. (2009). Managing Information Sharing Within an Organisation
Settings: A social Network Perspective. Retrieved September 13, 2012, from
http://www.performancexpress.org/wp-content/uploads/2011/11/Managing-Information-
Sharing.pdf
[7] Im, B., & Jung, J. (2001). Using ICT For Strengthening Government Transparency. Retrieved
May 10, 2011, from http://www.oecd.org/dataoecd/53/55/2537402.pdf
[8] Kilama, J. (2013). Impacts Of Social Networks In Citizen Involvements To Politics . Dar es
Salaam: Mzumbe University.
[9] Mkapa, B. (2003). Improving Public Communication Of The Government Policies And
Enhancing Media Relations. Bagamoyo.
[10]Navarra, D. D. (2006). Governance Architecture Of Global ICT Programme: The Case Of
Jordan. London: London School of Economics and Political Science.
[11]United Republic of Tanzania. (1995). The Constitution of United Republic of Tanzania. Dar
Es Salaam, Tanzania: Government Printer.
[12]Van Niekerk, B., Pillay, K., & Maharaj, M. (2011). Analyzing the Role of ICTs in the
Tunisian and Egyptian Unrest. International Journal of Communication, 5, 1406–1416.
This paper may be cited as:
Lubua, E. 2014. E-Transparency and Information Sharing in the Public Sector.
International Journal of Computer Science and Business Informatics, Vol. 14,
No. 1, pp. 30 -38.
A Survey of Frequent Subgraphs
and Subtree Mining Methods
Hamed Dinari and Hassan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
A graph is a basic data structure which can be used to model complex structures and the
relationships between them, such as XML documents, social networks, communication
networks, chemical informatics, biological networks, and the structure of web pages.
Frequent subgraph pattern mining is one of the most important fields in graph mining. In
light of its many applications, there has been extensive research in this area, in domains
such as analysis and processing of XML documents, document clustering and classification,
image and video indexing, graph indexing for graph querying, routing in computer networks,
web link analysis, drug design, and carcinogenesis. Several frequent pattern mining
algorithms have been proposed in recent years, and new ones are introduced regularly.
Because these algorithms use various methods on different datasets, pattern mining types,
and graph and tree representations, it is not easy to compare them in terms of features
and performance. This paper presents a brief report of an intensive investigation of
current frequent subgraph and subtree mining algorithms. The algorithms are also
categorised based on different features.
Keywords
Graph Mining, Subgraph, Frequent Pattern, Graph indexing.
1. INTRODUCTION
Today we are faced with ever-increasing volumes of data, much of which
naturally has a graph or tree structure. The process of extracting new and
useful knowledge from graph data is known as graph mining [1] [2]. Frequent
subgraph pattern mining [3] is an important part of graph mining. It is defined
as the process of extracting, from a database, patterns whose frequency is
greater than or equal to a user-defined threshold. Due to its wide utilisation
in various fields, including social network analysis [4] [5] [6], XML document
clustering and classification [7] [8], network intrusion detection [9] [10],
VLSI reverse engineering [11], behavioural modelling [12], the semantic web
[13], graph indexing [14] [15] [16] [17] [18], web log analysis [19], link
analysis [20], drug design [21] [22] [23], and classification of chemical
compounds [24] [25] [26], this field has been the subject of several works.
The present paper is an attempt to survey subtree and subgraph mining
algorithms; a comparison and classification of these algorithms according to
their different features is also made. The next section discusses the
literature review, followed by section three, which deals with the basic ideas
and concepts of graphs and trees. Frequent subgraph mining algorithms are
discussed in section four from different viewpoints, such as criteria for
representing graphs (adjacency matrix and adjacency list), generation of
subgraphs, number of replications, pattern growth-based and apriori-based
classifications, classification based on search method, classification based on
transactional and single-graph inputs, classification based on type of output,
and logic-based mining. Section five focuses on frequent subtree mining
algorithms from different angles, such as tree representation methods, type of
algorithm input, tree-based mining, and mining based on constraints on outputs.
2. RELATED WORKS
H. J. Patel, R. Prajapati, et al. [27] classified graph mining algorithms into
two types, apriori-based and pattern growth-based. K. Lakshmi and T. Meyyappan
[28] studied apriori-based and pattern growth-based algorithms, taking into
account aspects such as input/output type, how a graph is represented, how
candidates are generated, and how many times a candidate is repeated in the
graph dataset. In [29], D. Kavitha, B. V. Manikyala, et al. suggested a third
type of graph mining algorithm, inductive logic programming. A complete survey
of graph mining concepts and a very useful set of examples to ease
understanding of the concepts come next.
3. BASIC CONCEPTS
3.1 Graph
A graph G(V, E) is composed of a set of vertices (V) connected to each other by
a set of edges (E).
3.2 Tree
A tree T is a connected graph that has no cycle; in other words, there is one
and only one path between any two vertices.
3.3 Subgraph
A graph G′(V′, E′) is a subgraph of G(V, E) if its vertices and edges are
subsets of V and E respectively:
 V′ ⊆ V
 E′ ⊆ E
One may say that a subgraph of a graph is a pattern of that graph. Concerning
trees, two types of patterns can be defined:
3.3.1 Induced pattern
The definition is exactly the same as the definition of a subtree of a tree
(Figure 1.a, Figure 1.c): the vertices and edges of Figure 1.a can also be seen
in Figure 1.c.
3.3.2 Embedded pattern
Almost the same as an induced pattern, except that there may be one or more
intermediate vertices between the parent and child nodes of the pattern. For
example, vertex A in Figure 1.c is an ancestor of vertex D, and Figure 1.b
shows an embedded pattern of Figure 1.c.
Figure.1. An example of the Induced and embedded subtree pattern
3.3.3 Isomorphism
Two graphs are isomorphic if there is a one-to-one correspondence between their
vertices and edges that preserves adjacency.
3.3.4 Frequent Subgraph
Suppose a graph G and a set of graphs D = {g1, g2, g3, …, gn} are given; the
support of G is:

Support(G) = |{gi ∈ D : G is a subgraph of gi}| / |D|

A graph G in a dataset D is called frequent if its support is not less than a
predefined threshold.
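Under this definition, support is computed by counting the database graphs that contain the pattern. The sketch below uses a naive brute-force subgraph test on toy undirected graphs; the graph encoding and all names are illustrative assumptions, not a practical miner:

```python
from itertools import permutations

def contains(graph, pattern):
    """Naive non-induced subgraph test: brute force over injective node
    mappings. Exponential, but fine for the toy graphs used here."""
    (gn, ge), (pn, pe) = graph, pattern
    gn, pn = list(gn), list(pn)
    norm = {frozenset(e) for e in ge}          # undirected edge set
    for image in permutations(gn, len(pn)):
        m = dict(zip(pn, image))
        if all(frozenset((m[u], m[v])) in norm for u, v in pe):
            return True
    return False

def support(pattern, database):
    """Fraction of graphs in the transactional database containing the pattern."""
    return sum(1 for g in database if contains(g, pattern)) / len(database)

# Transactional database D = {g1, g2, g3}: a triangle, a 4-path and a square.
g1 = ({0, 1, 2}, {(0, 1), (1, 2), (0, 2)})
g2 = ({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3)})
g3 = ({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3), (3, 0)})
p3 = (["a", "b", "c"], {("a", "b"), ("b", "c")})  # path on three vertices

print(support(p3, [g1, g2, g3]))  # 1.0: every graph contains a 3-path
```

With a threshold of, say, 0.5, the 3-path pattern would be frequent in this database.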
4. AN OVERVIEW OF FREQUENT SUBGRAPH MINING
ALGORITHM ACCORDING TO DIFFERENT CRITERIA
This section discusses different criteria for classifying frequent graph mining
algorithms, including: graph representation, input type, constraint-based
mining, inductive logic programming, search strategy, and completeness of
output.
4.1 Graph Representation
4.1.1 Adjacency Matrix
A graph can be represented as an adjacency matrix; in this case the rows and
columns represent the vertices of the graph and the entries represent its edges
(i.e. when there is an edge between two vertices, the entry at the junction of
the corresponding row and column is filled with “1”, and otherwise with “0”).
Furthermore, the nodes are represented on the main diagonal of the matrix
(Figure 2). To represent the graph as a string, a combination of nodes and
edges in a particular order can be used; since every permutation of the nodes
may generate a different string, the maximum or minimum canonical adjacency
matrix (CAM) must be taken into account. An advantage of this is that two
isomorphic graphs will have the same maximum/minimum CAM.
Figure.2. Left side a graph and right side corresponding adjacency matrix
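The idea of a canonical code can be sketched by brute force for tiny graphs: take the minimum adjacency-matrix string over all vertex orderings. This is an illustrative sketch of the principle, not a practical CAM implementation (real miners use far cheaper canonicalisation):

```python
from itertools import permutations

def canonical_code(nodes, edges):
    """Minimum adjacency-matrix string over all vertex orderings.
    Two isomorphic graphs yield the same code; O(n!) so toy-sized only."""
    nodes = list(nodes)
    n = len(nodes)
    best = None
    for order in permutations(nodes):
        idx = {v: i for i, v in enumerate(order)}
        m = [["0"] * n for _ in range(n)]
        for u, v in edges:
            m[idx[u]][idx[v]] = m[idx[v]][idx[u]] = "1"  # undirected
        code = "".join("".join(row) for row in m)
        if best is None or code < best:
            best = code
    return best

# Two different drawings of the same 3-path get identical canonical codes.
a = canonical_code([0, 1, 2], {(0, 1), (1, 2)})
b = canonical_code(["x", "y", "z"], {("y", "x"), ("y", "z")})
print(a == b)  # True
```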
4.1.2 Adjacency List
Another way to represent a graph is an adjacency list. When the graph is
sparse, many zeros appear in the adjacency matrix, which is a great waste of
memory; to avoid this, an adjacency list is an answer, as it assigns memory
dynamically.
4.2 Subgraph Generation
Two subgraphs can be merged to generate a candidate subgraph, the result being
a new subgraph. However, given that many duplicate subgraphs might be generated
in the merging process, the way candidate subgraphs are generated is critical.
Among the available methods are extension and rightmost expansion. In the
latter case, subgraphs are expanded in one direction only and no duplicate
candidates are generated.
4.3 Frequency Counting
To check whether a generated candidate is frequent, the frequency of each must
be determined and compared with the support threshold. Data structures used to
count the frequency of each candidate include the embedding list and the TSP
tree.
5. A SURVEY OF FREQUENT SUBGRAPH MINING
ALGORITHMS
5.1 Classification Based on Algorithmic Approach
5.1.1 Apriori-Based (Breadth First Search)
This category of algorithms uses a generate-and-test method with breadth-first
search to find subgraphs in the database. Before the candidates of size k+1 are
generated, all frequent subgraphs of size k must be found; each candidate of
size k+1 is then obtained by joining two frequent subgraphs of size k. However,
this method considers all candidate subgraphs that can be generated, and their
maintenance and processing require plenty of time and memory, which hurts
performance [30] [2].
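The level-wise generate-and-test structure can be sketched on frequent itemsets, used here as a simple stand-in for subgraphs (for sets, the join and containment tests are trivial, whereas for graphs they require canonical forms and isomorphism checks):

```python
def apriori(transactions, min_count):
    """Level-wise (BFS) skeleton: size-(k+1) candidates are generated by
    joining frequent size-k patterns, then support-tested against the data."""
    def count(pattern):
        return sum(1 for t in transactions if pattern <= t)

    items = {frozenset([x]) for t in transactions for x in t}
    frequent = {p for p in items if count(p) >= min_count}
    result = set(frequent)
    k = 1
    while frequent:
        # join step: combine frequent size-k patterns into size-(k+1) candidates
        candidates = {p | q for p in frequent for q in frequent
                      if len(p | q) == k + 1}
        # test step: keep only candidates meeting the support threshold
        frequent = {c for c in candidates if count(c) >= min_count}
        result |= frequent
        k += 1
    return result

db = [frozenset("abc"), frozenset("abd"), frozenset("ab"), frozenset("cd")]
print(sorted("".join(sorted(p)) for p in apriori(db, 2)))
# ['a', 'ab', 'b', 'c', 'd']
```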
5.1.2 Pattern Growth-Based
In pattern growth-based methods a candidate subgraph of size k+1 is obtained by
extending a frequent pattern of size k. Since extending a frequent subgraph of
size k may generate several candidates of size k+1, the way a frequent subgraph
is expanded is critical to reducing the generation of duplicate subgraphs.
Table 1 lists apriori-based and pattern growth-based algorithms [2].
Table 1. Frequent Subgraph Mining Algorithms
Apriori-based: FARMER [31], FSG [3], HSIGRAM, GREW [32], FFSM [4], ISG,
SPIN [33], Dynamic GREW [34], AGM [35], MUSE [36], SUBDUE [37], AcGM [38],
DPMine, gFSG [39], MARGIN [40]
Pattern growth-based: gSpan [41], CloseGraph [42], Gaston [43], TSP [44],
MoFa [45], RP-FP [46], RP-GD [46], JPMiner [47], MSPAN, VSIGRAM [48], FPF [49],
Gapprox [50], HybridGMiner, FCPMiner [51], RING [52], SCMiner [53],
GraphSig [54], FP-GraphMiner [55], gPrune [56], CLOSECUT [57], FSMA [58]
5.2 Classification Based on Search Strategy
There are two search strategies for finding frequent subgraphs: breadth-first
search (BFS) and depth-first search (DFS).
5.3 Classification Based on Nature of the Input
Depending on the input type, the algorithms can be divided into two categories:
5.3.1 Single Graph Database
The database consists of a single large graph.
5.3.2 Transactional Graph Database
The database consists of a large number of small graphs. Figure 3 (left side:
g1, g2, g3) demonstrates a transactional graph database, together with two
frequent subgraphs and their frequencies (right side).
Figure.3. A database consisting of three graphs g1, g2, g3, and two subgraphs
with the frequency of each
5.4 Classification Based on Nature of the Output
5.4.1 Completeness of the Output
While some algorithms find all frequent patterns, others mine only part of
them. Completeness is closely related to performance: when the total size of
the dataset is very high, it is better to use algorithms that execute faster,
so that a reduction of performance is avoided, even though not all frequent
patterns are mined. Table 2 lists algorithms by completeness of output [29].
Table 2. Completeness of Output
Complete output: FARMER, gSpan, FFSM, Gaston, FSG, HSIGRAM
Incomplete output: SUBDUE, GREW, CloseGraph, ISG
5.4.2 Constraint-Based
As the size of a database increases, the number of frequent patterns also
increases. This makes maintenance and analysis more difficult, as more memory
space is needed. Reducing the number of frequent patterns without losing
information is achievable by mining and maintaining more comprehensive
patterns. Given that every subset of a frequent pattern also satisfies the
frequency condition, more comprehensive patterns can be obtained using the
following notions:
5.4.2.1 Maximal Pattern
Subgraph g1 is a maximal pattern if it is frequent and there is no frequent
super-pattern g2 such that g2 ⊃ g1.
5.4.2.2 Closed Pattern
Subgraph g1 is closed if it is frequent and there is no frequent super-pattern
g2 with g2 ⊃ g1 and support(g2) = support(g1). Table 3 lists maximal and closed
subgraph mining algorithms.
Table 3. Frequent Subgraph Mining (Constraint-based)
Maximal: SPIN, MARGIN, ISG, GREW
Closed: CloseGraph, CLOSECUT, TSP, RP-FP, RP-GD
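The maximal and closed definitions can be checked mechanically once all frequent patterns and their supports are known. The sketch below treats patterns as frozensets of edges; the data and names are illustrative only:

```python
def maximal_and_closed(supports):
    """Given {pattern (frozenset): support} for all frequent patterns,
    return (maximal, closed) sets per the definitions above."""
    pats = list(supports)
    # maximal: no frequent super-pattern at all
    maximal = {p for p in pats if not any(p < q for q in pats)}
    # closed: no frequent super-pattern with the same support
    closed = {p for p in pats
              if not any(p < q and supports[q] == supports[p] for q in pats)}
    return maximal, closed

# Toy frequent patterns; edges abbreviated as single letters.
supports = {
    frozenset("a"): 5,
    frozenset("ab"): 5,   # same support as {a}, so {a} is not closed
    frozenset("abc"): 3,
    frozenset("d"): 4,
}
mx, cl = maximal_and_closed(supports)
print(sorted("".join(sorted(p)) for p in mx))  # ['abc', 'd']
print(sorted("".join(sorted(p)) for p in cl))  # ['ab', 'abc', 'd']
```

Note that every maximal pattern is also closed, but not vice versa, which matches the output above.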
5.5 Logic-Based Mining
Also known as inductive logic programming, which also an area of machine
learning, mainly in biology. This method uses inductive logic to display
structured data. ILP core uses the logic to display for search and the basic
assumptions of that structured way (e.g. WARMR, FOIL, and C-PROGOL),
which is derived from background knowledge [29]. Table 4 lists the Pattern
Growth and Table 5 indicates apriori-based algorithms categorized from
different aspects [59] [27] [60] [61] [62] [28] [30] [63].
Table 4. Frequent Subgraph Mining Algorithms (Pattern Growth-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
gSpan | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
CloseGraph | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
Gaston | Set of graphs | Hash Table | Extension | DFS
TSP | Set of graphs | Adjacency Matrix | Extension | TSP Tree
MoFa | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
RP-FP | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
RP-GD | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
JPMiner | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
MSPAN | Set of graphs | Adjacency Matrix | Rightmost Extension | DFS
FP-GraphMiner | Set of graphs | BitCode | Extension | DFS
gPrune | Set of graphs | Adjacency Matrix | Iteration | M-DFSC
FSMA | Set of graphs | Incidence Matrix | Extension | Normalized Matrix
RING | Set of graphs | Invariant Vector | Extension | R-tree, DFS
GraphSig | Set of graphs | Feature Vector | Merge and Extension | DFS
Table 5. Frequent Subgraph Mining Algorithms (Apriori-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
SUBDUE | Single large graph | Adjacency Matrix | Level-wise Search | MDFS
FARMER | Set of graphs | Trie structure | Level-wise Search, ILP | Trie data structure
FSG | Set of graphs | Adjacency List | One Edge Extension | TID list
HSIGRAM | Single large graph | Adjacency Matrix | Iterative Merging | Maximal independent set
GREW | Single large graph | Sparse graph | Iterative Merging | Maximal independent set
FFSM | Set of graphs | Adjacency Matrix | Merging and Extension | Suboptimal CAM tree
ISG | Set of graphs | Edge Triple | Edge Triple Extension | TID list
SPIN | Set of graphs | Adjacency Matrix | Join Operation | Canonical Spanning Tree
Dynamic GREW | Set of graphs | Sparse graph | Iterative Merging | Suffix trees
AGM | Set of graphs | Adjacency Matrix | Vertex Extension | Canonical Labeling
MUSE | Set of graphs | Search Tree | Disjunctive Normal Form | DFS coding
MARGIN | Set of graphs | Lattice | Join | CAM
AcGM | Set of graphs | Adjacency Matrix | Join | CAM
gFSG | Set of graphs | Adjacency Matrix | Iterative Merging | Hashtree
Here several algorithms related to graph/tree mining are discussed in more
detail.
 Gp-Growth Algorithm
The algorithm consists of three main steps:
1. Candidate generation by join operation.
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 47
2. Using a new method of tree representation and a lookup table that allows
quick access to node information in the candidate generation phase without
having to re-read the trees in the database.
3. Using rightmost expansion for candidate generation, which guarantees that
no duplicate candidates are generated.
This algorithm uses a lookup table, implemented as a hash table, to store
information about the input trees. Each entry has a key part, the pair (T, pos),
where T identifies an input tree and pos is the node's number in a preorder
traversal, and a value part, the pair (l, s), where l is the node's label and s is
its scope. A new candidate is generated using the scope of each node: a node
is attached to another along the rightmost path, and only within the scope of
the node it is attached to; repeating this process yields the remaining frequent
patterns [64].
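A rough sketch of such a lookup table follows; the nested-tuple tree encoding and function names are hypothetical, not taken from [64]. A node's scope is taken here as the preorder number of the last node in its subtree.

```python
# Sketch of a GP-Growth-style lookup table:
# (tree_id, preorder_pos) -> (label, scope).

def build_lookup(trees):
    """trees: dict of tree_id -> (label, [children]) nested tuples."""
    table = {}

    def visit(tree_id, node, counter):
        label, children = node
        pos = counter[0]
        counter[0] += 1
        for child in children:
            visit(tree_id, child, counter)
        scope = counter[0] - 1  # last preorder position inside this subtree
        table[(tree_id, pos)] = (label, scope)

    for tid, root in trees.items():
        visit(tid, root, [0])
    return table

# toy database with one tree:  a(b, c(d))
lut = build_lookup({0: ("a", [("b", []), ("c", [("d", [])])])})
# root 'a' sits at preorder position 0 and has scope 3 (its subtree ends at node 3)
```

Storing (label, scope) per (tree, position) key is what lets candidate generation check rightmost-path extensions without rescanning the database trees.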
 FP-GraphMiner Algorithm
This algorithm uses the FP-growth method to find frequent subgraphs. Its input
is a set of graphs (a transactional database). First, a BitCode is defined for
each edge: for each graph in the database, the corresponding bit is '1' if the
edge occurs in that graph and '0' otherwise. A frequency table is then sorted
in ascending order of the BitCode assigned to each edge; afterward, an FP-tree
is constructed and frequent subgraphs are obtained through depth traversal [55].
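The BitCode step can be sketched as follows (a simplified illustration, not the implementation from [55]; graphs are reduced to bare edge sets):

```python
# Sketch of FP-GraphMiner's BitCode step: each edge gets one bit per
# input graph, '1' if the edge occurs in that graph and '0' otherwise.

def edge_bitcodes(graphs):
    """graphs: list of edge sets; returns a mapping edge -> bit string."""
    all_edges = set().union(*graphs)
    return {edge: "".join("1" if edge in g else "0" for g in graphs)
            for edge in all_edges}

g1 = {("a", "b"), ("b", "c")}
g2 = {("a", "b")}
codes = edge_bitcodes([g1, g2])
# ('a','b') occurs in both graphs -> '11'; ('b','c') only in the first -> '10'
```

The frequency table is then ordered by these bit strings before the FP-tree is built.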
6. FREQUENT SUBTREE MINING ALGORITHMS CLASSIFICATION
6.1 Tree Representation
A tree can be encoded as a sequence of nodes and edges. Some of the most
important ways of encoding trees are introduced below:
6.1.1 DLS (Depth Label Sequence)
Let T be a labeled ordered tree with node set V. For each node vi in V, the
depth-label pair (d(vi), l(vi)) is appended to a string s during a DFS
traversal of T. The depth-label sequence of T is then
{(d(v1), l(v1)), ..., (d(vk), l(vk))}. For instance, the DLS for the tree in
Figure 4 is:
{(0,a),(1,b),(2,e),(3,a),(1,c),(2,f),(3,b),(3,d),(2,a),(1,d),(2,f),(3,c)}
6.1.2 DFS-LS (Depth First Sequence - Label Sequence)
Given a labeled ordered tree T, labels are appended to a string s during a
DFS traversal of T, and on each backtrack a marker such as '-1', '$' or '/'
is appended to the string. The DFS-LS code for the tree T in Figure 4 is:
{abea$$$cfb$d$$a$$dfc$$$}
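The two encodings above can be sketched for a small labeled ordered tree as follows (the node representation is hypothetical; '$' is used as the backtrack marker, and this sketch also emits it when backtracking out of the root, a detail on which encodings vary):

```python
# Sketch of the DLS and DFS-LS encodings; node = (label, [children]).

def dls(node, depth=0):
    """Depth-label sequence: (depth, label) pairs in DFS order."""
    seq = [(depth, node[0])]
    for child in node[1]:
        seq.extend(dls(child, depth + 1))
    return seq

def dfs_ls(node):
    """Label sequence with '$' appended on each backtrack."""
    return node[0] + "".join(dfs_ls(c) for c in node[1]) + "$"

t = ("a", [("b", []), ("c", [("d", [])])])
dls(t)     # [(0, 'a'), (1, 'b'), (1, 'c'), (2, 'd')]
dfs_ls(t)  # 'ab$cd$$$'
```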
6.1.3 BFCS (Breadth First Canonical String)
Let T be an unordered tree. Several encoded strings can be generated using
the BFS method by changing the order of the children of each node; the BFCS
of T is the lexicographically smallest of these encoded strings. The BFCS of
the tree T in Figure 4 is:
{a$bcd$e$fa$f$a$bd$$c#}
6.1.4 CPS (Consolidated Prufer Sequence)
Let T be a labeled tree. The CPS encoding consists of two parts: NPS, an
extended Prufer sequence obtained from a traversal over unique vertex
numbers, and LS (Label Sequence), the sequence of labels obtained in a
postfix traversal as the leaves are removed. Together, NPS and LS give a
unique encoding of a labeled tree. For the tree in Figure 4, the NPS and LS
obtained are {ebaffccafda-} and {aebbdfaccfda}, respectively. To obtain the
NPS, a leaf is removed from the tree at each step and the label of its parent
is taken as output; this is repeated until only the root remains, and '-' is
appended to mark the end of the string. For the LS, the labels of the same
postfix traversal of the tree are taken as the sequence. Table 6 groups
algorithms by the tree encoding they use [65].
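The NPS construction can be sketched as repeated leaf removal. The vertex numbering and removal order below (ascending vertex number) are assumptions for illustration, since [65] defines its own numbering scheme:

```python
# Sketch of an NPS-style sequence: repeatedly remove a leaf (here, the
# lowest-numbered one) and record its parent's label; '-' marks the root.

def nps_labels(parent, labels, root):
    children = {}
    for v, p in parent.items():
        children.setdefault(p, set()).add(v)
    alive = set(parent) | {root}
    out = []
    while alive != {root}:
        leaf = min(v for v in alive if v != root and not children.get(v))
        out.append(labels[parent[leaf]])
        children[parent[leaf]].discard(leaf)
        alive.discard(leaf)
    return "".join(out) + "-"

# tiny tree: node 1 'a' (root) with children 2 'b' and 3 'c'; 3 has child 4 'd'
labels = {1: "a", 2: "b", 3: "c", 4: "d"}
parent = {2: 1, 3: 1, 4: 3}
nps_labels(parent, labels, 1)  # 'aca-'
```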
Figure.4. A Tree Example
Table 6. Frequent Subtree Mining Algorithms (Tree Representation)

Algorithm                 Tree Representation
uFreqt                    DLS
SLEUTH                    DFS-LS
Unot                      DLS
PathJoin                  FST-Forest
RootedTreeMiner [66]      BFCS
FREQT                     DLS
TreeMiner                 DFS-LS
Chopper                   DLS
XSPanner                  DLS
AMIOT                     DFS string
IMB3-Miner                DFS-LS
TRIPS                     CPS
FreeTreeMiner             BFCS
CMTreeMiner               DFS-LS
HybridTreeMiner [67]      BFCS
GP-Growth                 DFS-LS
6.2 Input Types
6.2.1 Rooted Ordered Trees
A rooted ordered tree is a tree in which a single node is designated as the
"root" and an order is defined among the children of each node, so that each
child is greater than or equal to the siblings placed at its left-hand side
and less than or equal to the ones placed at its right-hand side. If we relax
this definition so that no order among siblings is required, we obtain a
rooted unordered tree. Table 7 lists rooted ordered tree mining algorithms.
Table 7. Rooted Ordered Tree Mining Algorithms

Induced: FREQT [68], AMIOT [69], IMB3-Miner [70], TRIPS [65], TIDES [65]
Embedded: TreeMiner [71], Chopper [72], XSPanner [72], IMB3-Miner [70]
6.2.2 Rooted Unordered Trees
In this type of tree, one node is designated as the root, but there is no
particular order among the descendants of each node. Table 8 lists rooted
unordered tree mining algorithms.
Table 8. Rooted Unordered Tree Mining Algorithms

Induced: uFreqT [73], Unot [74], PathJoin [65], Rooted TreeMiner [75]
Embedded: TreeFinder [76], Cousin Pair [77], SLEUTH [78]
6.3 Tree-Based Data Mining
Frequent subtree mining algorithms can be categorized into two major
categories, apriori-based and pattern growth-based. Table 9 lists the apriori
and pattern growth algorithms for trees [79] [76] [80].
Table 9. Frequent Subtree Mining Algorithms

Apriori-based: TreeFinder, AMIOT, FreeTreeMiner, TreeMine [81], SLEUTH,
CMTreeMiner [82], Pattern Matcher [71], W3Miner [83], FTMiner [84],
CFFTree [85], IMB3-Miner, uFreqt, Unot, FREQT, TRIPS, TIDES, PathJoin
Pattern growth-based: XSPanner, Chopper, PrefixTreeESpan [86],
PCITMiner [87], F3TM [88], GP-Growth [64]

7. CONCLUSIONS AND FUTURE WORKS
Frequent subgraph mining algorithms were first examined from different
viewpoints: ways of representing a graph (e.g. adjacency matrix and
adjacency list), generation of subgraphs, frequency counting, classification
into apriori-based and pattern growth-based algorithms, search-based
classification, input-based classification (single graph vs. transactional
database), and output-based classification. Furthermore, mining based on
logic was discussed. Afterward, frequent subtree mining algorithms were
examined from different viewpoints: tree representation methods, input
types, tree-based mining, and mining based on constraints on the outputs.
Given the results, it is concluded that because pattern growth-based
algorithms do not generate candidate patterns, they involve less computation
and need less memory. Moreover, these algorithms are specifically designed
for trees and graphs and cannot be used for other purposes. On the other
hand, as they work on a variety of datasets, it is not easy to establish
trade-offs between them. In future studies, the same frequent patterns can
be used for similarity search, indexing, and classifying graphs and
documents. Parallel methods and technologies such as Hadoop may also be
needed when working with very large data volumes.
8. ACKNOWLEDGMENTS
The authors are thankful to Mohammad Reza Abbasifard for his support of
this investigation.
REFERENCES
[1] A.Rajaraman, J.D.Ullman, 2012. Mining of Massive Datasets, 2nd ed.
[2] J.Han, M.Kamber, 2006, Data Mining Concepts and Techniques. USA: Diane
Cerra.
[3] Kuramochi, Michihiro, and G.Karypis., 2004. An efficient algorithm for
discovering frequent subgraphs, in IEEE Transactions on Knowledge and
Data Engineering, pp. 1038-1051.
[4] J.Huan, W.Wang, J. Prins, 2003. Efficient Mining of Frequent Subgraphs in the
presence of isomorphism, in Third IEEE International Conference on Data
Mining (ICDM).
[5] (2013, Dec.) Trust Network Datasets - TrustLet. [Online].
http://www.trustlet.org
[6] L.YAN, J.WANG, 2011. Extracting regular behaviors from social media
networks, in Third International Conference on Multimedia Information
Networking and Security.
[7] Ivancsy,I. Renata, I.Vajk., 2009. Clustering XML documents using frequent
subtrees, Advances in Focused Retrieval, Vol. 3, pp. 436-445.
[8] J.Yuan, X.Li, L.Ma, 2008. An Improved XML Document Clustering Using
Path Features, in Fifth International Conference on Fuzzy Systems and
knowledge Discovery, Vol. 2.
[9] Lee, Wenke, and Salvatore J. Stolfo, 2000. A framework for constructing
features and models for intrusion detection systems, in ACM transactions on
Information and system security (TiSSEC), pp. 227-261.
[10] Ko, C, Logic induction of valid behavior specifications for intrusion detection
, 2000. in In IEEE Symposium on Security and Privacy (S&P), pp. 142–155.
[11] Yoshida, K. and Motoda, 1995. CLIP: Concept learning from inference
patterns, in Artificial Intelligence, pp. 63–92.
[12] Wasserman, S., Faust, K., and Iacobucci. D, 1994. Social network analysis :
Methods and applications. Cambridge university Press.
[13] Berendt, B., Hotho, A., and Stumme, G., 2002. semantic web mining, in In
Conference International Semantic Web (ISWC), pp. 264–278.
[14] S.C.Manekar, M.Narnaware, May 2013. Indexing Frequent Subgraphs in
Large graph Database using Parallelization, International Journal of Science
and Research (IJSR), Vol. 2 , No. 5.
[15] Peng, Tao, et al., 2010. A Graph Indexing Approach for Content-Based
Recommendation System, in IEEE Second International Conference on
Multimedia and Information Technology (MMIT), pp. 93-97.
[16] S.Sakr, E.Pardede, 2011. Graph Data Management: Techniques and
Applications, in Published in the United States of America by Information
Science Reference.
[17] Y.Xiaogang, T.Ye, P.Tao, C.Canfeng, M.Jian, 2010. Semantic-Based Graph
Index for Mobile Photo Search, in Second International Workshop on
Education Technology and Computer Science, pp. 193-197.
[18] Yildirim, Hilmi, and Mohammed Javeed Zaki., 2010. Graph indexing for
reachability queries, in 26th International Conference on Data Engineering
Workshops (ICDEW)IEEE, pp. 321-324.
[19] R.Ivancsy and I.Vajk, 2006. Frequent Pattern Mining in Web Log Data, in
Acta Polytechnica Hungarica, pp. 77-90.
[20] G.XU, Y.zhang, L.li, 2010. Web mining and Social Networking. melbourn:
Springer.
[21] S.Ranu, A.K. Singh, 2010. Indexing and mining topological patterns for drug,
in ACM, Data mining and knowlodge discovery, Berlin, Germany.
[22] (2013, Dec.) Drug Information Portal. [Online]. http://druginfo.nlm.nih.gov
[23] (2013, Dec.) DrugBank. [Online]. http://www.drugbank.ca
[24] Dehaspe,Toivonen, and King, R.D., 1998. Finding frequent substructures in
chemical compounds, in In Proc. of the 4th ACM International Conference on
Knowledge Discovery and Data Mining, pp.30-36.
[25] Kramer, S., De Raedt, L., and Helma, C., 2001. Molecular feature mining in
HIV data, in In Proc. of the 7th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining (KDD-01), pp. 136–143.
[26] Gonzalez, J., Holder, L.B. and Cook, 2001. Application of graph-based
concept learning to the predictive toxicology domain, in In Proc. of the
Predictive Toxicology Challenge Workshop.
[27] H.J.Patel, R.Prajapati, M.Panchal, M.Patel, Jan. 2013. A Survey of Graph
Pattern Mining Algorithm and Techniques, International Journal of
Application or Innovation in Engineering & Management (IJAIEM), Vol. 2,
No. 1.
[28] K.Lakshmi, T. Meyyappan, 2012. FREQUENT SUBGRAPH MINING
ALGORITHMS - A SURVEY AND FRAMEWORK FOR
CLASSIFICATION, computer science and information technology, pp. 189–
202.
[29] D.Kavitha, B.V.Manikyala Rao and V. Kishore Babu, 2011. A Survey on
Assorted Approaches to Graph Data Mining, in International Journal of
Computer Applications, pp. 43-46.
[30] C.C.Aggarwal,Wang, Haixun, 2010. Managing and Mining Graph Data.
Springer,.
[31] B.Wackersreuther, Bianca, et al. , 2010. Frequent subgraph discovery in
dynamic networks, in ACM, Proceedings of the Eighth Workshop on Mining
and Learning with Graphs, Washington DC USA, pp. 155-162.
[32] Kuramochi, Michihiro, and G.Karypis, 2004. Grew-a scalable frequent
subgraph discovery algorithm, in Fourth IEEE International Conference on
Data Mining (ICDM), pp. 439-442.
[33] Huan, Jun, SPIN: mining maximal frequent subgraphs from graph databases,
2004. in Proceedings of the tenth ACM SIGKDD international conference on
Knowledge discovery and data mining.
[34] Borgwardt, Karsten M., H-P. Kriegel, and P.Wackersreuther, 2006. Pattern
mining in frequent dynamic subgraphs, in Sixth International Conference on
Data Mining (ICDM), pp. 818-822.
[35] Inokuchi, Akihiro, T.Washio, and H.Motoda, 2000. An apriori-based
algorithm for mining frequent substructures from graph data, in Principles of
Data Mining and Knowledge Discovery, pp. 13-23, Springer Berlin
Heidelberg.
[36] Zou, Zhaonian, et al, 2009. Frequent subgraph pattern mining on uncertain
graph data, in Proceedings of the 18th ACM conference on Information and
knowledge management, pp. 583-592.
[37] Ketkar, N.S, Lawrence B.Holder, and D.J.Cook, 2005. Subdue: compression-
based frequent pattern discovery in graph data, in ACM, Proceedings of the
1st international workshop on open source data mining: frequent pattern
mining implementations, pp. 71-76.
[38] A. Inokuchi, T. Washio, and H. Motoda, 2003. Complete mining of frequent
patterns from graphs: Mining graph data, in Machine Learning, pp. 321-354.
[39] Kuramochi, Michihiro, and G.Karypis, 2007. Discovering frequent geometric
subgraphs, in Information Systems, pp. 1101-1120.
[40] Thomas, Lini T, Satyanarayana R. Valluri, and K.Karlapalem, 2006. Margin:
Maximal frequent subgraph mining, in IEEE Sixth International Conference
on Data Mining (ICDM), pp. 1097-1101.
[41] Yan, Xifeng, and J.Han, 2002. gspan: Graph-based substructure pattern
mining, in Proceedings International Conference on Data Mining.IEEE, pp.
721-724.
[42] Yan, Xifeng, and Jiawei Han, 2003. CloseGraph: mining closed frequent
graph patterns, in Proceedings of the ninth ACM SIGKDD international
conference on Knowledge discovery and data mining, pp. 286-295.
[43] Nijssen, Siegfried, and J.N. Kok., 2005. The gaston tool for frequent subgraph
mining, in Electronic Notes in Theoretical Computer Science, pp. 77-87.
[44] Hsieh, Hsun-Ping, and Cheng-Te Li, 2010. Mining temporal subgraph patterns
in heterogeneous information networks, in IEEE Second International
Conference on Social Computing (SocialCom), pp. 282-287.
[45] Wörlein, Marc, et al, 2005. A quantitative comparison of the subgraph miners
MoFa, gSpan, FFSM, and Gaston, in Knowledge Discovery in Databases:
PKDD , Springer Berlin Heidelberg, pp. 392-403.
[46] S.J.Suryawanshi,S.M.Kamalapur, Mar 2013. Algorithms for Frequent
Subgraph Mining, International Journal of Advanced Research in Computer
and Communication Engineering, Vol. 2, No. 3.
[47] Liu, Yong, Jianzhong Li, and Hong Gao, 2009. JPMiner: mining frequent
jump patterns from graph databases, in IEEE, Sixth International Conference
on Fuzzy Systems and Knowledge Discovery, pp. 114-118.
[48] Reinhardt, Steve, and G.Karypis, 2007. A multi-level parallel implementation
of a program for finding frequent patterns in a large sparse graph, in IEEE
International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-
8.
[49] Schreiber, Falk, and H.Schwobbermeyer., 2005. Frequency concepts and
pattern detection for the analysis of motifs in networks, in Transactions on
computational systems biology III, pp. 89-104, Springer Berlin Heidelberg.
[50] Chent, Chen, et al., 2007. gapprox: Mining frequent approximate patterns
from a massive network, in Seventh IEEE International Conference on Data
Mining (ICDM), pp. 445-450.
[51] Ke, Yiping, J.Cheng, and Jeffrey Xu Yu, 2009. Efficient discovery of frequent
correlated subgraph pairs, in Ninth IEEE International Conference on Data
Mining (ICDM), pp. 239-248.
[52] Zhang, Shijie, J.Yang, and Shirong Li, 2009. Ring: An integrated method for
frequent representative subgraph mining, in Ninth IEEE International
Conference on Data Mining (ICDM), pp. 1082-1087.
[53] Fromont, Elisa, Céline Robardet, and A.Prado, 2009. Constraint-based
subspace clustering, in International conference on data mining, pp. 26-37.
[54] Ranu, Sayan, and Ambuj K. Singh., 2009. Graphsig: A scalable approach to
mining significant subgraphs in large graph databases, in IEEE 25th
International Conference on Data Engineering (ICDE), pp. 844-855.
[55] R. Vijayalakshmi,R. Nadarajan, J.F.Roddick,M. Thilaga, 2011. FP-
GraphMiner, A Fast Frequent Pattern Mining Algorithm for Network Graphs,
Journal of Graph Algorithms and Applications, Vol. 15, pp. 753-776.
[56] Zhu, Feida, et al., 2007. gPrune: a constraint pushing framework for graph
pattern mining, in Advances in Knowledge Discovery and Data Mining, , pp.
388-400, Springer Berlin Heidelberg.
[57] Yan, Xifeng, X. Zhou, and Jiawei Han, 2005. Mining closed relational graphs
with connectivity constraints, in Proceedings of the eleventh ACM SIGKDD
international conference on Knowledge discovery in data mining, pp. 324-
333.
[58] Wu, Jia, and Ling Chen, 2008. A fast frequent subgraph mining algorithm, in
The 9th International Conference for Young Computer Scientists (ICYCS), pp.
82-87.
[59] Krishna, Varun, N. N. R. R. Suri, G. Athithan, 2011. A comparative survey of
algorithms for frequent subgraph discovery, Current Science(Bangalore), pp.
1980-1988.
[60] K.Lakshmi, T. Meyyappan, Apr. 2012. A COMPARATIVE STUDY OF
FREQUENT SUBGRAPH MINING ALGORITHMS, International Journal
of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 2.
[61] C.Jiang, F.Coenen, M.Zito, 2004. A Survey of Frequent Subgraph Mining
Algorithms, The Knowledge Engineering Review, pp. 1-31.
[62] M.Gholami, A.Salajegheh, Sep. 2012. A Survey on Algorithms of Mining
Frequent Subgraphs, International Journal of Engineering Inventions, Vol. 1,
No. 5, pp. 60-63.
[63] V.Singh, D.Garg, Jul. 2011. Survey of Finding Frequent Patterns in Graph
Mining: Algorithms and Techniques, International Journal of Soft Computing
and Engineering (IJSCE), Vol. 1, No. 3.
[64] Hussein, M.MA, T. H.Soliman, O.H. Karam, 2007. GP-Growth: A New
Algorithm for Mining Frequent Embedded Subtrees. 12th IEEE Symposium on
Computers and Communications.
[65] Tatikonda, Shirish, S.Parthasarathy,T.Kurc., 2006. TRIPS and TIDES: new
algorithms for tree mining, in Proceedings of the 15th ACM international
conference on Information and knowledge management.
[66] Tung, Jiun-Hung, 2006. MINT: Mining Frequent Rooted Induced Unordered
Tree without Candidate Generation.
[67] Chi, Yun, Y.Yang, and Richard R. Muntz., 2004. HybridTreeMiner: An
efficient algorithm for mining frequent rooted trees and free trees using
canonical forms, in Proceedings 16th International Conference on Scientific
and Statistical Database Management.
[68] T.Asai, H.Arimura, T.Uno, S.Nakano and K.Satoh, 2008. Efficient tree
mining using reverse search.
[69] S.Hido, and H. Kawano., 2005. AMIOT: Induced Ordered Tree Mining in
Tree-structured Databases, in Proceedings of the Fifth IEEE International
Conference on Data Mining (ICDM’05).
[70] H.Tan, T.S. Dillon, F.Hadzic, E.Chang, and L.Feng, 2006. IMB3-Miner:
Mining Induced/Embedded Subtrees by Constraining the Level of Embedding,
in Advances in Knowledge Discovery and Data Mining, Springer Berlin
Heidelberg, pp. 450–461.
[71] M.J.Zaki, 2002. Efficiently mining frequent trees in a forest, in In Proceedings
of the 8th International Conference on Knowledge Discovery and Data
Mining (ACM SIGKDD), pp. 71-80.
[72] C.Wang, M.Hong, J.Pei, H.Zhou, W.Wang, 2004. Efficient pattern-growth
methods for frequent tree pattern mining, in Advances in Knowledge
Discovery and Data Mining, Springer Berlin Heidelberg, pp. 441-451.
[73] S.Nijssen and J.N.Kok, 2003. Efficient Discovery of Frequent Unordered
Trees, in Proc. First Intl Workshop on Mining Graphs Trees and Sequences,
pp. 55-64.
[74] T. Asai, H. Arimura, T.Uno and S. Nakano., 2003. Discovering Frequent
Substructures in Large Unordered Trees, in procceding sixth conference on
Discovery Science, pp. 47-61.
[75] Y.Chi, Y.Yang, and R. Muntz., May 2004. Canonical Forms for Labeled
Trees and Their Applications in Frequent Subtree Mining, Knowledge and
Information Systems, No. 8.2, pp. 203-234.
[76] Chi, Yun, et al.,2005. Frequent subtree mining-an overview, in Fundamenta
Informaticae, pp. 161-198.
[77] Shasha, Dennis, J.Tsong-Li Wang and Sen Zhang.,2004. Unordered tree
mining with applications to phylogeny, in IEEE Proceedings 20th
International Conference on Data Engineering, pp. 708-719.
[78] M.J.Zaki., 2005. Efficiently Mining Frequent Embedded Unordered Trees, in
IOS Press, pp. 1-20.
[79] Jimenez, Aida, F.Berzal, J.Cubero., 2008. Mining induced and embedded
subtrees in ordered, unordered, and partially-ordered trees, in IEEE
Transactions on Knowledge and Data Engineering, Springer Berlin
Heidelberg, pp. 111-120.
[80] Jimenez, Aida,F. Berzal Juan-Carlos Cubero.,2006. Mining Different Kinds of
Trees: A Tree Mining Overview, in Data Mining.
[81] B.Bringmann.,2004. Matching in Frequent Tree Discovery, in Fourth IEEE
International Conference on Data Mining.
[82] Chi, Yun, et al. Mining.,2004. Cmtreeminer: Mining both closed and maximal
frequent subtrees, in Advances in Knowledge Discovery and Data , Springer
Berlin Heidelberg, pp. 63-73.
[83] AliMohammadzadeh, Rahman, et al., Aug 2006. Complete Discovery of
Weighted Frequent Subtrees in Tree-Structured Datasets, International
Journal of Computer Science and Network Security (IJCSNS ), Vol. 6, No. 8,
pp. 188-196.
[84] J.HU, X.Y.LI., Mar 2009. Association Rules Mining Including Weak-Support
Modes Using Novel Measures, WSEAS Transactions on Computers, Vol. 8,
No. 3, pp. 559-568.
[85] Zhao, Peixiang, and J.X.Yu.,2007. Mining closed frequent free trees in graph
databases, in Advances in Databases: Concepts, Systems and Applications,
Springer Berlin Heidelberg, pp. 91-102.
[86] Zou, Lei, et al.,2006. PrefixTreeESpan: A pattern growth algorithm for mining
embedded subtrees, in Web Information Systems (WISE), Springer Berlin
Heidelberg, pp. 499-505.
[87] Kutty, Sangeetha, R.Nayak, Y.Li., 2007. PCITMiner: prefix-based closed
induced tree miner for finding closed induced frequent subtrees, in
Proceedings of the sixth Australasian conference on Data mining and
analytics, Vol. 70, Australian Computer Society.
[88] Zhao, Peixiang, and J.X.Yu., 2008. Fast frequent free tree mining in graph
databases, in Springer World Wide Web, Hong Kong, pp. 71-92.
This paper may be cited as:
Dinari, H. and Naderi, H. 2014. A Survey of Frequent Subgraphs and
Subtree Mining Methods. International Journal of Computer Science and
Business Informatics. Vol. 14, No. 1, pp. 39-57.
A Model for Implementation of IT
Service Management in
Zimbabwean State Universities
Munyaradzi Zhou, Caroline Ruvinga,
Samuel Musungwini and Tinashe Gwendolyn Zhou
Department of Computer Science and Information Systems
Gweru, Zimbabwe
ABSTRACT
Several IT service management (ITSM) frameworks have been deployed and are being
adopted by companies and institutes without redefining the framework into a model
which suits their IT department's operating environment and requirements. An IT
service management model is proposed for Zimbabwean universities; it takes a
holistic approach through the integration of Operational Level Agreements (OLAs),
Service Level Agreements (SLAs) and IT Service Catalogues (ITSCs). The OLA is
considered the domain for describing IT service management, and its attainment is
driven by organizational management and IT section personnel in alignment with
the mission, vision and values of the organization. Explicitly defining OLAs will
aid management in identifying key services and processes in both qualitative and
quantitative form (SLAs). Once SLAs are defined, ITSCs can be formulated; these
are both customer and IT service provider centric and act as the nucleus of the
model. Redefining IT service management from this perspective will derive value
from IT service management frameworks and improve customer satisfaction.
Keywords: SLAs, OLAs, ITSCs, ITSM.
1. INTRODUCTION
IT service management is a modern concept adopted by the IT community to
improve IT service delivery and productivity, attain customer satisfaction and
control costs. IT service management integrates IT service provisioning between
service providers and end users to arrive at end-to-end service through the
implementation of measures such as Service Level Agreements (SLAs), Operational
Level Agreements (OLAs) and IT Service Catalogues (ITSCs) (Almeroth & Hasan,
2002). Service management frameworks such as Control Objectives for Information
and related Technology (COBIT) and the IT Infrastructure Library (ITIL) have
been developed in the IT industry, but they have not been tailored to a specific
IT section's operating environment and constraints. IT service is the nucleus of
the accomplishment of business processes at a university; it supports academic
research, learning and teaching. Universities offer IT services to staff,
researchers, students, visitors and partners on platforms such as electronic
learning (e-learning), library services, staff directories and email, and
learning resources which are crucial to learning, teaching and collaboration as
the community becomes global. The IT department must offer better services to
these stakeholders in a resource-constrained environment (staff and financial
resources) (University of Birmingham, 2014).
2. RELATED WORKS
An ITS service consists of three key elements, namely Service Level Agreements
(SLAs), Operational Level Agreements (OLAs) and Service Catalogue pages.
Operational Level Agreements (OLAs) are agreements between ITS teams, such as
the hardware, software and networking teams, on how they will collaborate to
ensure the appropriate service level is met for a particular service under the
supervision of a coordinator; an OLA defines the expectations and commitments
needed to deliver Service Level Agreements (SLAs) (University of California,
2012). Service Level Agreements (SLAs) are agreements between the Information
Technology Services (ITS) team or teams and their clients which define the
level of service the client should receive. An IT service catalogue is a
mapping database of an institute's available technological resources, products
and IT services, both on offer and about to be rolled out (Griffiths, Lawes, &
Sansbury, 2012; Moeller, 2013). The ITS service catalogue divides the services
offered at an institute into components, together with the policies, guidelines
and responsibilities of the parties involved, SLAs and delivery conditions
(Bon et al., 2007).
The service catalogue should be readily accessible to authorised users, allow
them to create service requests on behalf of themselves and others, and contain
facilities to approve service requests. IT service catalogues should be tested
by both IT and key users so that the product complies with the prescribed
technical functionality and usability metrics. The IT catalogue should be
developed in such a way that it facilitates effective communication between IT
management and the stakeholders involved, and acts as an effective tool for
good governance (Griffiths et al., 2012; Moeller, 2013).
Basically, an IT service catalogue is divided into a business service catalogue
and a technical service catalogue. A business service catalogue is client
centric and must meet users' requirements; thus the user community should be
engaged in requirement gathering and design. Alternatively, a technical service
catalogue is service provider centric and focuses on describing specific
services in IT terms, including service constructs and their interrelationships.
IT managerial and technical staff work processes are explicitly defined, and
access to the technical service catalogue is mainly restricted to
organizational staff (Troy, Rodrigo, & Bill, 2007).
An SLA should consist of the following elements: placement of services into
categories (sections of the catalogue); listing of each category as a service
catalogue section; establishing integrated/packaged/bundled service products;
identification of modular service products; definition of each service product;
establishing the service owner and supplier; defining procurement procedures
(how, and at what cost); specifying service level metrics (availability,
reliability, response); defining the limits of the service; and defining
customers' responsibilities. It thus provides a basis for managing the
relationship between the service provider and the customer, describing the
agreement between them for the service to be delivered, including how the
service is to be measured (Hiles, 2000).
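As an illustration only, these elements might be captured in a structured record such as the following; every field name and value here is invented for the example, not drawn from the paper or from any ITSM standard.

```python
# Hypothetical SLA record covering the elements listed above.
sla = {
    "service": "Student Email",                   # a defined service product
    "category": "Communication Services",         # catalogue section
    "owner": "ITS Software Team",                 # service owner
    "supplier": "External Mail Provider",         # supplier
    "procurement": {"how": "service desk request", "cost": "none to students"},
    "metrics": {                                  # service level metrics
        "availability": "99.5% per month",
        "reliability": "at most 2 unplanned outages per month",
        "response": "first response within 4 business hours",
    },
    "limits": "excludes personally owned devices",
    "customer_responsibilities": [
        "report faults through the service desk",
        "keep account credentials secure",
    ],
}
```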
A service must provide a bridge from the developers' and engineers' point of
view to the end user's perspective and identify the internal processes
necessary to offer and maintain the service. Service change management and
continuous process improvement are important in addressing stakeholders' needs
(University of California, 2012). A service lifecycle basically comprises:
service strategy, which focuses on defining, maintaining and implementing a
strategy; service design, which focuses on the methodology and architectural
design needed to offer the service; service transition, which focuses on
testing and integrating the services offered for quality and control
compliance; and service operation, which focuses on the smooth running of
daily IT services, together with continual improvement, which aligns the
lifecycle stages and offers room for best practices and improved value
delivery (Office of Government Commerce, 2010).
A Service Level Agreement (SLA) is a blueprint which governs service provision
parameters between the service provider and the client (University of
California, 2012). Mainly, an SLA consists of: the services being provided by
the IT service provider and how they will be delivered (these must meet the
user requirements and standards agreed upon by the parties involved and be
attainable, so communication is key in all processes); definitions of key
performance parameters; the assignment of IT service provider personnel and
users to measure specific performance using specific metrics (continuously
monitoring, managing and measuring service level commitments); and
identification of the rewards or penalties levied depending on whether service
delivery is effective or the services are not being rendered (SLA metrics
should have performance buffers to allow recovery from breaches) (Dube &
Gulati, 2005; Lahti & Peterson, 2007).
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 61
4. METHODOLOGY
The research questions in this study examine ITS personnel's service
delivery in relation to SLAs, OLAs and ITSCs. The research approach is the
way the researcher approaches the study: either gathering data and then
formulating a theory, or developing a theory and hypotheses and then
testing or validating them. An inductive approach was adopted, since it
allowed the researchers to develop a theory during analysis of the
collected data (Saunders, Lewis, & Thornhill, 2009). The researchers used
questionnaires since they facilitated saturation; the questionnaires were
distributed in proportion to the personnel in each ITS department team: 20
in the Hardware section, 7 in the Software section and 7 in the Networking
section. The response rates were 80%, 71.43% and 85.71% respectively. The
data was coded manually.
5. RESULTS
The hardware section team is not aware of any agreements with the software
team and the networking department that ensure the appropriate service
level is met for particular services within the ITS department. If OLAs
were in place, personnel felt that the ITS department director and/or
other senior officers should facilitate and maintain them, since such
agreements increase efficiency and allow work processes to be aligned with
organizational objectives.
The hardware section team is also not aware of any agreements with the
software and networking teams that define the level of service students
and staff members should receive; personnel felt such agreements should be
led by the chief technician. Personnel act on intuition when called upon
for work and tasks, or confine themselves to those in their job
descriptions. All respondents agreed that the adoption of SLAs would
improve service delivery to clients and would help in setting boundaries
on personnel's duties and in executing them with confidence; furthermore,
it results in process standardization and improved accuracy in the
execution of tasks. 10% of the respondents strongly agree, 60% agree, 15%
are neutral and 15% disagree that the use of SLAs will improve and
differentiate services by defining performance and its measures, which
will help in building actionable performance tracking and controls.
There is currently no policy on the IT services on offer and ready to be
delivered; respondents felt these should be monitored by the supervisors
responsible for the specific services being offered. In hardware
maintenance, personnel from other departments are called upon to carry out
all related activities on an ad-hoc basis. ITSCs offer a platform to
evaluate whether the services being offered meet the required standard.
Top management, such as directors and supervisors, are key stakeholders in
the implementation of IT service management.
The networking section team does not have any agreements with the software
and hardware teams to ensure the appropriate service level is met for
particular services in the ITS department. The service level that students
and staff should receive, such as the uptime and download speed available
on both the wireless and wired networks, is not defined. Staff portal
services and the students' electronic learning (E-Learning) accounts,
monitored by the software team, depend on network availability and server
capacity, which are the responsibility of the networking and hardware
sections respectively, even though there are no OLAs among the departments
concerned. Staff and students are consulted only informally on their
requirements for the services offered by the ITS department. Students and
staff members should be given a platform to request additional 'add-on'
functionalities for their E-Learning and staff portal accounts.
IT service management model
A university-wide IT service management model was developed. It consists
of the Operational Level Agreements, viewed as the cornerstone of IT
service management implementation; the Service Level Agreements, the
sub-domain linking OLAs and ITSCs; and the IT service catalogues, referred
to as the nucleus of IT service management. Leadership support from
personnel such as IT directors, project managers and chief IT technicians
is important, since they will initiate the setting of specific benchmarks
for performance measurement and facilitate an effective feedback mechanism
and communication. Top management will help in organizing seminars or
workshops, in the form of refresher courses or awareness campaigns, about
the execution of work processes.
Explicitly defining OLAs will aid management in identifying key services
and processes in both qualitative and quantitative form, while monitoring
them and taking corrective measures where necessary (SLAs). Once SLAs are
defined, ITSCs can be formulated: a measure that is both customer and IT
service provider centric and acts as the nucleus of the model. Services
should be end-user centric rather than reflect the provider's point of
view; for example, the website should be easy to navigate, and there must
be a distinction between administrative issues and the other information
displayed on the homepage. Support services, including how to access the
website using mobile phones and which mobile browsers are supported or
compatible, should be made available to clients. Additionally, key future
plans should be communicated, such as a general upgrade of the site
(including the time it is expected to be down during maintenance),
upgrading to a mobile site, modification of functionalities on the
webpage, and the phasing out of specific services. Figure 1 shows the
developed model.
Figure 1: IT service management implementation model
The model in Figure 1 comprises four linked components:
- Operational Level Agreements (IT service provider centric): definition
of the services required to deliver services; explicit definition of the
responsibilities of the IT service provider and the recipient.
- OLA driving forces: leadership support; setting specific performance
benchmarks; rewards and recognition, or penalties, in response to adopting
OLAs; education and awareness campaigns for ITS department section
personnel; ensuring an effective feedback mechanism and communication.
- Service Level Agreements: identify key services and processes to achieve
the required goal; define services in qualitative and quantitative form;
monitor the key services and processes while corrective measures are taken
where necessary.
- Service Catalogue (customer centric): details of service and product
offerings; reports on website availability (response time, uptime
percentage, etc.); support services (e.g. installation of preliminary
software, supported mobile browsers/compatible phone types); key policies;
terms and conditions; Service Level Agreements (SLAs); key future plans
(upgrading to mobile, modification of functionality, phasing out of a
service, etc.).
6. CONCLUSIONS
An enabling, collaborative approach to quality improvement should be
explored by the ITS teams, involving their clients (staff and students) so
that their needs are satisfied. In achieving ITSM, goals must be
benchmarked and reviewed by a monitoring and evaluation committee steered
by the project manager. The committee must ensure the availability of
human and financial resources, for example by lobbying for top management
support and for the training of employees. In addition, the committee
should facilitate a cyclical communication system with stakeholders and
top management so as to ensure their support and commitment, including
during the review process. The institution's goals, vision and mission
should be aligned with the ITSM strategy adopted. A service catalogue,
which acts as a blueprint for clients in understanding and making informed
decisions about the services they use or intend to use, must always be
made available to clients; it also acts as a benchmark for quality
assurance on the services the ITS department offers.
OLAs between the IT service provider and the procurement or other
departments, to obtain hardware or other resources within agreed times,
and between a service desk and a support group, to provide incident
resolution within agreed times, should be defined to ensure the
appropriate service level is met (Rudd, 2010). The adoption of OLAs will
result in better service delivery and better management of duties and
responsibilities. Universities must integrate the various IT teams within
departments across their campuses while explicitly defining the
implementation of SLAs, OLAs and ITSCs, and must also emphasise
performance reporting, facilitated by team leaders from all IT sections.
Additionally, institutions must identify the facilitating and clogging
conditions for successful ITSM, which can be done by conducting seminars
and/or workshops on relevant IT aspects. Conducting post-training
evaluation of deliberations on ITSM will help continuous improvement in
service delivery. Relating COBIT and ITIL to the IT service management
constructs (OLAs, SLAs and ITSCs) presents an interesting area for further
research.
REFERENCES
[1] Almeroth, K.C. and Hasan, M., 2002. Management of Multimedia on the
Internet: 5th IFIP/IEEE International Conference on Management of
Multimedia Networks and Services, MMNS 2002, Santa Barbara, CA, USA,
October 6-9, 2002, Proceedings. CA: Springer, p.356.
[2] Bon, J. van et al., 2007. IT Service Management: An Introduction. Van Haren
Publishing, p.514.
[3] Dube, D.P. and Gulati, V.P., 2005. Information System Audit and Assurance. Tata
McGraw-Hill Education, p.671.
[4] Griffiths, R., Lawes, A. and Sansbury, J., 2012. IT Service Management: A Guide for
ITIL Foundation Exam Candidates. BCS, The Chartered Institute for IT, p.200.
[5] Hiles, A., 2000. Service Level Agreements: Winning a Competitive Edge for Support &
Supply Services. Rothstein Associates Inc, p.287.
[6] Lahti, C.B. and Peterson, R., 2007. Sarbanes-Oxley IT Compliance Using Open Source
Tools. Syngress, p.466.
[7] Moeller, R.R., 2013. Executive’s Guide to IT Governance: Improving Systems
Processes with Service Management, COBIT, and ITIL. John Wiley & Sons, p.416.
[8] Office of Government Commerce, 2010. Introduction to the ITIL service lifecycle. The
Stationery Office, p.247.
[9] Rudd, C., 2010. ITIL V3 Planning to Implement Service Management. The Stationery
Office, p.320.
[10] Saunders, M., Lewis, P. and Thornhill, A., 2009. Research Methods for
Business Students. 5th ed. Essex, England: Pearson Education Limited.
[11] Troy, D.M., Rodrigo, F. and Bill, F., 2007. Defining IT Success
Through the Service Catalog: A Practical Guide about the Positioning,
Design and Deployment of an Actionable Catalog of IT Services. 1st ed. US:
Van Haren Publishing.
[12]University of Birmingham, 2014. IT Services - University of Birmingham. [Online]
Available at: <http://www.birmingham.ac.uk/university/professional/it/index.aspx>
[Accessed 18 Mar. 2014].
[13] University of California, 2012. ITS Service Management: Key Elements. [online]
Available at: <http://its.ucsc.edu/itsm/servicemgmt.html> [Accessed 18 Mar. 2014].
This paper may be cited as:
Zhou, M., Ruvinga, C., Musungwini, S. and Zhou, T. G., 2014. A Model for
Implementation of IT Service Management in Zimbabwean State Universities.
International Journal of Computer Science and Business Informatics, Vol.
14, No. 1, pp. 58-65.
Present a Way to Find Frequent
Tree Patterns using Inverted Index
Saeid Tajedi
Department of Computer Engineering
Lorestan Science and Research Branch, Islamic Azad University
Lorestan, Iran
Hasan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
Among all patterns occurring in a tree database, mining frequent trees is
of great importance. A frequent tree is one that occurs frequently in the
tree database. Frequent subtrees are not only important in themselves but
are also applicable to other tasks, such as tree clustering,
classification, bioinformatics, etc. In this paper, after reviewing
different methods of searching for frequent subtrees, a new method based
on an inverted index is proposed to mine frequent tree patterns. The
procedure has two phases: passive and active. In the passive phase, we
find the subtrees in the dataset, convert them to strings and store them
in the inverted index. In the active phase, we easily derive the desired
frequent subtrees from the inverted index. The proposed approach tries to
take advantage of times when the CPU is idle, so that CPU utilization is
at its highest in the evaluation results. In the active phase, frequent
subtree mining is performed using the inverted index rather than directly
on the dataset; as a result, the desired frequent subtrees are found in
the fastest possible time. Another feature of the proposed method is that,
unlike previous methods, adding a tree to the dataset does not require
repeating the previous steps; in other words, the method performs well on
dynamic trees. In addition, the proposed method is capable of interacting
with the user.
Keywords: Tree mining, inverted index, frequent pattern mining, tree patterns.
1. INTRODUCTION
Data mining or knowledge discovery deals with finding interesting patterns
or information that is hidden in large datasets. Recently, researchers have
started proposing techniques for analyzing structured and semi-structured
datasets. Such datasets can often be represented as graphs or trees. This has
led to the development of numerous graph mining and tree mining
algorithms in the literature. In this article we present an efficient algorithm
for mining trees.
Data mining has evolved from association rule mining and sequence mining
to tree mining and graph mining. Association rule mining and sequence
mining are one-dimensional structure mining, while tree mining and graph
mining are two-dimensional or higher structure mining. Applications of
tree mining arise in Web usage mining, mining semi-structured data,
bioinformatics, etc.
The basic and fundamental ideas of tree mining were first seriously
discussed in the early '90s and were developed over that decade; the
origin of these ideas lies in their applications, especially on the web.
First, some essential and basic concepts are described; then the proposed
method is presented; finally, the results are evaluated.
2. Related Works
2.1 Pre-Order Tree Traversal
There are several ways to traverse ordered trees; pre-order traversal is
one of the most important and most widely used. It proceeds like the
depth-first search algorithm: in a tree T, we start from the root, then
visit the left child and finally the right child; this is done recursively
on all nodes of the tree.
2.2 Post-Order Tree Traversal
This is also among the most important and widely used methods of
ordered-tree traversal. In this method, for a tree T we start from the
left child, then the right child and finally the root, and the operation
is performed recursively on all nodes of the tree.
Using either traversal, we can assign a number to each node that
represents the time at which the node is visited. If we use the post-order
traversal, that number is called the PON (post-order number).
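The two traversals and the PON numbering can be sketched as follows; the tree, node labels and helper names are illustrative, and unique labels are assumed for this toy example only.

```python
# Sketch of pre-order and post-order traversal of an ordered tree,
# with post-order numbers (PON) assigned to each node.
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def preorder(node, out):
    out.append(node.label)       # visit the root first
    for child in node.children:  # then the children, left to right
        preorder(child, out)
    return out

def postorder(node, out):
    for child in node.children:  # visit the children, left to right
        postorder(child, out)
    out.append(node.label)       # the root comes last
    return out

def assign_pon(root):
    """Map each label to its post-order number (1-based); assumes
    unique labels in this toy example."""
    return {label: i + 1 for i, label in enumerate(postorder(root, []))}

# Example tree: A with children B and C; C has children D and E.
t = Node("A", [Node("B"), Node("C", [Node("D"), Node("E")])])
print(preorder(t, []))   # ['A', 'B', 'C', 'D', 'E']
print(postorder(t, []))  # ['B', 'D', 'E', 'C', 'A']
print(assign_pon(t))     # {'B': 1, 'D': 2, 'E': 3, 'C': 4, 'A': 5}
```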
2.3 LMP and RMP
LMP is the acronym for Left-Most Path and denotes the path from the root
to the leftmost leaf; RMP is the acronym for Right-Most Path and denotes
the path from the root to the rightmost leaf.
2.4 Prüfer Sequence [23]
This algorithm was introduced in 1918 and is used to convert a tree to a
string. It works as follows: in a tree T, at every step the leaf with the
smallest label is removed and the label of its parent is added to the
Prüfer sequence. This process is repeated n-2 times, until 2 nodes remain.
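The classic n-2-step encoding described above can be sketched as follows; the edge list and node numbering are illustrative.

```python
from collections import defaultdict

def prufer_sequence(edges, n):
    """Classic Prüfer encoding of a labeled tree on nodes 1..n.
    At every step the leaf with the smallest label is removed and the
    label of its only neighbour (its parent, in a rooted view) is
    appended; this is repeated n-2 times, until two nodes remain."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        leaf = min(node for node in adj if len(adj[node]) == 1)
        neighbour = next(iter(adj[leaf]))
        seq.append(neighbour)
        adj[neighbour].discard(leaf)
        del adj[leaf]
    return seq

# Star 1,2,3 -> 4 plus edge 4-5: every removed leaf reports node 4.
print(prufer_sequence([(1, 4), (2, 4), (3, 4), (4, 5)], 5))  # [4, 4, 4]
```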
2.5 Label Sequence
The next concept is the label sequence, which is produced according to the
post-order traversal: the label of each node, as it is visited in
post-order, is appended to the sequence.
2.6 Support
Simply put, the support of a pattern S indicates how often S occurs across
the trees of the database:
Support(S) = |{T ∈ D : S occurs in T}| / |D|   (1)
where S is a tree pattern and D is a database of trees. This concept is
used to determine the number of occurrences of each subtree in a set of
trees.
2.7 Inverted Index [24]
An inverted index is a structure used to index frequent string elements in
a set of documents; it consists of two main parts: the dictionary and the
posting lists. Frequent string elements are stored uniquely in the
dictionary, together with the number of occurrences of each element across
all documents. Information about a frequent element, such as the names of
the documents containing it and the number of occurrences in each
document, is recorded in its posting list.
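A minimal sketch of this structure; the class and method names are our own, not from the paper.

```python
from collections import defaultdict

class InvertedIndex:
    """Minimal inverted index: the dictionary keeps each element's total
    occurrence count across all documents; the posting list records, per
    element, the documents containing it and the count inside each one."""
    def __init__(self):
        self.total = defaultdict(int)                          # element -> total count
        self.postings = defaultdict(lambda: defaultdict(int))  # element -> {doc: count}

    def add(self, doc, elements):
        for element in elements:
            self.total[element] += 1
            self.postings[element][doc] += 1

    def lookup(self, element):
        """Return (total occurrences, {document: occurrences})."""
        return self.total[element], dict(self.postings[element])

idx = InvertedIndex()
idx.add("D1", ["tree", "mining", "tree"])
idx.add("D2", ["tree", "index"])
print(idx.lookup("tree"))  # (3, {'D1': 2, 'D2': 1})
```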
3. An overview of research history
In recent years, much research on frequent subtree mining has been done.
Yongqiao Xiao et al. in 2003 used the Path Join algorithm and a compact
data structure called FST-Forest to find frequent subtrees [25]: first,
frequent rooted paths are found in all directions, and then these paths
are merged to reach the frequent subtrees. Shirish Tatikonda et al.
published an article in 2006 based on pattern growth [26]: all trees in
the tree database are converted to strings, using one of two methods,
Prüfer sequences or the DFS algorithm; then all strings containing a
subtree (pattern) S are scanned, seeking a new edge that can be added to
S. Concurrently with the generation of candidate subtrees, the threshold
values are evaluated to decide whether they are frequent. In 2009,
Federico Del Razo Lopez et al. presented an idea for relaxing the tight
constraints of non-fuzzy tree mining [27]. Their paper uses the principle
of partial inclusion: to say that a pattern S occurs in a tree T, it is
not necessary for all the pattern's nodes to exist in the tree. The
proposed algorithm uses the Apriori property for pruning undesirable
patterns.
4. The proposed approach
The procedure has two phases: passive and active. In the passive phase, we
first find all the subtrees of all the trees and store them in the
inverted index. In the active phase, we simply use the index to extract
frequent tree patterns.
4.1 Passive Phase
This phase has two stages. In the first stage, we find all subtrees of
every tree in the dataset and convert them to strings associated with that
tree; in the second stage, the strings produced in the first stage are
stored in the inverted index.
4.1.1 First stage of Passive phase
The first important point is that a node label can be repeated many times
within a tree, yet every node in every tree needs a unique identifier; to
solve this problem, we use the Prüfer sequence method. Each tree is
traversed in post-order and the Prüfer sequence algorithm, in effect,
works on the PONs; as a result, each node of a tree is marked with a
unique number.
The next issue is that the Prüfer sequence must cover all the nodes;
therefore the algorithm runs for n steps rather than n-2, and for the last
node (the root) the number 0 is recorded in place of a parent label.
Figure 1 shows an example of this method, where NPS denotes the Prüfer
sequence obtained using the post-order numbering.
Next, every subtree should be represented uniquely; to this end, we obtain
the CPS of each tree, which merges the Prüfer sequence and the label
sequence; in other words, CPS(T) = (NPS, LS)(T). A CPS uniquely represents
a rooted, labeled tree. As you can see in Figure 1, the tree T1 is
represented uniquely by these two complementary strings.
Figure 1. An example of the Prüfer Sequence and Label Sequence for T1 tree
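As a sketch of this encoding, the CPS of a small tree can be computed as below. We read the paper's examples (A0C2, A0C3B3, ...) as pairing each node's label with its parent's PON (0 for the root), listed root-first in reverse post-order; this pair format is our interpretation of the figures, not a definitive specification.

```python
class Node:
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def cps(root):
    """CPS sketch: assign post-order numbers (PON), then emit each node
    as its label followed by its parent's PON (0 for the root), in
    reverse post-order, as the paper's examples suggest."""
    order = []  # (node, parent) pairs collected in post-order
    def walk(node, parent):
        for child in node.children:
            walk(child, node)
        order.append((node, parent))
    walk(root, None)
    pon = {id(node): i + 1 for i, (node, _) in enumerate(order)}
    return "".join(
        f"{node.label}{pon[id(parent)] if parent else 0}"
        for node, parent in reversed(order)
    )

print(cps(Node("A", [Node("C")])))             # A0C2
print(cps(Node("A", [Node("B"), Node("C")])))  # A0C3B3
```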
Next, we must ensure that all subtrees of each tree are generated and that
each subtree is created only once; for this purpose, we use the LMP to
extend subtrees. That is, if we represent the tree T using its Prüfer
sequence and n is a subtree, then a node v to be added to n must lie on
the LMP of T; and since the PON underlies the Prüfer sequence, v must
simply come immediately after the last node of n, attached to it, in the
Prüfer sequence of T. This guarantees that each subtree is generated only
once and, when done for all nodes, all subtrees of each tree are produced.
We now introduce the algorithm. The proposed algorithm for generating
subtrees and converting them into strings is shown in Figure 2.
Insert CPS(T) into array A
For i = n downto 1 do
{
    subtree = A[i]
    Insert CPS(A[i]) into TreeString_i
    Sub(subtree, i, A, stack1, stack2)
}

Sub(subtree, index, A[], stack1, stack2)
{
    c = 0
    t = 0
    For j = 1 to index - 1 do
        If index in A[j] then
        {
            stack3 = stack1        // work on copies so the originals survive
            stack4 = stack2
            subtree2 = subtree
            While stack3 not empty
            {
                t++
                Pop x from stack3
                Pop y from stack4
                subtree2 = subtree2 + x
                If t > 0 then
                {
                    Insert CPS(subtree2) into TreeString_i
                    Sub(subtree2, y, A[], stack3, stack4)
                }
            }
            If c > 0 then
            {
                Push tempTree onto stack1
                Push tempIndex onto stack2
            }
            tempTree = A[j]
            tempIndex = j
            c++
            subtree = subtree + A[j]
            Insert CPS(subtree) into TreeString_i
            Sub(subtree, j, A[], stack1, stack2)
            While stack1 not empty
            {
                c--
                Pop x from stack1
                Pop y from stack2
                Insert CPS(subtree + x) into TreeString_i
                Sub(subtree + x, y, A[], stack1, stack2)
            }
        }
}
Figure 2. The algorithm for generating subtrees and converting them to strings
We now trace how the algorithm works with an example. We begin with the
first tree, and CPS(T) is stored in the array A; for T1 the completed
array is shown in Figure 3.
Figure 3. Production of the array using CPS (T)
In this step we identify all existing subtrees and store them in a string.
To do this, we start from the root node of T1, i.e. the last element of
the array, A0; the subtrees branching from this node are stored in the
string in turn. First, A0 itself is stored in the string according to the
algorithm; next, we run the Sub function. Since the index of the previous
node is 9, to find the subtrees with two nodes we scan the array from its
first element up to the element just before the previous node, i.e. index
8; whenever an element's value contains the index of the previous node
(9), it is added to the previous subtree (A) and the CPS of the resulting
subtree is inserted into the string of this tree. Here A0C2 and A0E2 are
stored in the string, and the same steps are repeated recursively for the
newly generated subtrees. Since both produced subtrees branch from one
node, the added node with the smaller index (taken from stack1) and its
index (taken from stack2) are extracted and added to the subtree with the
larger index, and its CPS is stored in the string; in this step A0E3C3 is
therefore also added, and the same is repeated for all subtrees produced
with a larger index in the next step. The process continues recursively in
this way until all subtrees branching from the first node of the array
have been stored in the string. The same procedure is then applied to the
next elements of the array until the string of the subtrees of the tree is
complete, and then to the subsequent trees, until for each tree a string
of all its subtrees has been created.
4.1.2 Second stage of Passive phase
In the second stage of this phase, we use the inverted index: the strings
created in the previous stage are inserted into it. The CPS and the number
of occurrences of each subtree across all the trees are stored in the
dictionary, and the names of the trees containing the subtree are stored
in the corresponding posting list.
Figure 4. Part of the Inverted Index made for the collection of trees T1, T2
As can be seen, the subtrees are stored in the dictionary and the parent
trees of the corresponding subtrees are stored in the posting lists.
4.2 Active Phase
In this phase, we simply use the inverted index built in the previous
phase to extract frequent tree patterns. Various types of queries about
frequent subtree mining can be answered quickly using the index. Below, we
examine several different query types.
4.2.1 Finding the occurrences of a desired pattern in the tree set
First, we obtain the CPS of the desired pattern, then search for it in the
dictionary of the inverted index, and easily extract the number of
occurrences and the names of the trees containing the desired pattern from
its posting list. For example, to find the number of occurrences of the
pattern S in the collection of trees T1, T2 in Figure 5, we search for
CPS(S), i.e. A0C3B3, in the inverted index; T1 and T2 will be the result.
Figure 5. Part of the Inverted Index made for the collection of trees T1, T2
4.2.2 Finding frequent subtrees with respect to the support
If we want to find the subtrees whose support is greater than a threshold,
we must find the subtrees whose number of occurrences, relative to the
total number of trees, exceeds the support. We can therefore search the
inverted index and easily find the subtrees whose posting-list length,
relative to the total number of trees, is at least equal to the support.
4.2.3 Finding frequent subtrees with respect to the support and a minimum
number of nodes
In this case, in addition to the support, the number of nodes is also a
criterion. We therefore search the inverted index and report only the
subtrees satisfying two conditions: first, the length of the subtree in
the dictionary is at least the minimum number of nodes; second, the length
of the corresponding posting list, relative to the total number of trees,
is at least equal to the support.
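The two queries above reduce to a scan of the index. A sketch, assuming the index maps each CPS string to the list of tree names in its posting list; the data, and the single-letter-label node count, are illustrative simplifications.

```python
def frequent_subtrees(index, total_trees, support, min_nodes=1):
    """Return the CPS strings whose posting-list length, relative to
    the total number of trees, is at least `support`, and whose subtree
    has at least `min_nodes` nodes. Each CPS pair is one label followed
    by a parent PON, so with single-letter labels the node count equals
    the number of alphabetic characters (an illustrative shortcut)."""
    result = []
    for cps_string, posting in index.items():
        nodes = sum(ch.isalpha() for ch in cps_string)
        if nodes >= min_nodes and len(posting) / total_trees >= support:
            result.append(cps_string)
    return result

# Toy index over two trees T1 and T2.
index = {
    "A0C3B3": ["T1", "T2"],
    "A0C2":   ["T1"],
    "B0":     ["T1", "T2"],
}
print(frequent_subtrees(index, 2, support=1.0))               # ['A0C3B3', 'B0']
print(frequent_subtrees(index, 2, support=1.0, min_nodes=2))  # ['A0C3B3']
```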
5. Evaluation
In this section, the proposed method is evaluated from various aspects. We
present an experimental evaluation of the proposed approach on synthetic
datasets. In the following discussion, dataset sizes are expressed in
terms of the number of trees, and in the graphs 'Algorithm' denotes the
proposed method. The names and details of the synthetic datasets are shown
in Table 1.
Table 1. Name and details of synthetic datasets
Name Description
DS1 -T 10 -V 100
DS2 -T 10 -V 50
As shown in Table 1, the synthetic datasets DS1 and DS2 were generated
using the PAFI [28] toolkit developed by Kuramochi and Karypis (PafiGen).
Since PafiGen can create only graphs, we extracted spanning trees from
these graphs and used them in our analysis. We also used minsup to analyze
the various factors: if the number of occurrences of a subtree is less
than the minsup value, the subtree is not indexed in the inverted index.
Minsup ranges from 1 to infinity, and its default value in the proposed
algorithm is 1. In addition, we use maxnode in the evaluations: maxnode
specifies the maximum number of nodes in each subtree in the inverted
index, so when the number of nodes of a subtree reaches the maxnode value,
the proposed algorithm halts the generation of its subtrees.
Maxnode ranges from 1 to infinity, and its default value is infinity.
5.1 Evaluating the performance of the proposed method
At the beginning, we evaluated the proposed algorithm on the two synthetic
datasets DS1 and DS2. The performance of the proposed algorithm for
frequent tree mining on the synthetic datasets is shown in Diagram 1; in
this experiment, minsup is equal to one and maxnode is infinity. Given
that subtrees are indexed in the passive phase, at times when the system
is idle, the mining time in the inverted index rises with a gentle slope
as the number of trees increases. This clearly shows that the introduced
algorithm is scalable.
Diagram 1: The performance of the algorithm on synthetic datasets
5.2 Evaluating the effect of minsup on the number of indexed patterns
We examine the effect of minsup on the number of indexed patterns in
Diagram 2. This experiment was done on the synthetic datasets DS1 and DS2
generated by PAFI, with size 50K; maxnode has its default value, i.e.
infinity. As can be seen in the diagram, the number of indexed patterns
increases exponentially as minsup decreases.
Diagram 2: Effect of minsup on the number of indexed patterns
5.3 Evaluating the effect of maxnode on memory usage
We examine the effect of the maximum number of nodes in the indexed
subtrees on memory usage in the passive phase. This experiment was done on
the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K;
minsup has its default value, i.e. 1. As can be seen, the memory usage of
the algorithm increases with the number of indexed nodes in each subtree.
Diagram 3: Effect of maxnode on memory usage
5.4 Evaluation of CPU utilization compared with TreeMiner
In Diagram 4, the proposed algorithm is compared with TreeMiner, which was
introduced by Zaki and is one of the best tree mining algorithms [29].
This experiment was done on the synthetic dataset DS1 generated by PAFI,
with size 50K. Since in the passive phase the proposed algorithm searches
for subtrees and adds them to the inverted index, CPU utilization is close
to 100 percent in most situations, as can be seen in the diagram, while
the average CPU utilization of the TreeMiner algorithm is approximately
90%.
Diagram 4: Comparison of CPU utilization between TreeMiner and the proposed algorithm
6. Conclusions and Recommendations
In this paper, a new method for frequent pattern mining based on the inverted index was introduced to overcome many of the disadvantages of previous methods. One problem with existing approaches is that they mainly act statically on the set of trees: if a new tree is added, all mining operations must be repeated from scratch. The proposed approach overcomes this problem with the inverted index. All trees are indexed in the passive phase, and if a new tree is added to the treeset at any stage, only that tree is indexed; there is no need to repeat the previous operations. The algorithm therefore performs well on dynamic collections of trees. Another advantage of this method over others is its scalability: as shown in Section 5.1, the algorithm's performance does not degrade as the treeset grows. As shown in Section 5.4, one of its most striking features is efficient use of the CPU. The method also supports user interaction.
As shown in Section 5.2, the number of indexed patterns increases exponentially as minsup decreases, while patterns with few occurrences are generally of no interest. Consequently, indexing in the passive phase can be sped up by choosing an appropriate value of minsup. As shown in Section 5.3, memory usage increases with the maximum number of nodes in the indexed subtrees, while subtrees with a very large number of nodes are usually of no interest. Consequently, memory usage can be managed by choosing an appropriate value of maxnode.
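The incremental indexing idea described above can be sketched in a few lines of Python. This is an illustrative toy, not the authors' implementation: the class name and the canonical string encodings used as pattern keys are assumptions made for the example.

```python
from collections import defaultdict

class TreePatternIndex:
    """Toy inverted index over tree patterns.

    Each subtree pattern is keyed by a canonical string encoding;
    its posting list records which trees of the treeset contain it.
    """

    def __init__(self):
        self.postings = defaultdict(set)  # pattern key -> set of tree ids

    def index_tree(self, tree_id, patterns):
        # Passive phase for ONE tree: add its enumerated subtree
        # patterns to the index. A tree added later touches only its
        # own patterns -- the rest of the treeset is never re-mined.
        for p in patterns:
            self.postings[p].add(tree_id)

    def frequent_patterns(self, minsup):
        # A pattern is frequent when its posting list covers at least
        # `minsup` trees.
        return {p: ids for p, ids in self.postings.items()
                if len(ids) >= minsup}

# Usage with hand-made canonical encodings for three small trees:
idx = TreePatternIndex()
idx.index_tree(1, ["A", "A/B", "A/C"])
idx.index_tree(2, ["A", "A/B"])
idx.index_tree(3, ["A", "A/C"])
print(sorted(idx.frequent_patterns(minsup=2)))  # ['A', 'A/B', 'A/C']
```

In the paper's setting, minsup and maxnode bound which subtrees are enumerated and indexed in the first place, which is what controls passive-phase time and memory; the sketch only shows the posting-list mechanics.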
REFERENCES
[1] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[2] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[3] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[4] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[5] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[6] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[7] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[8] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," Advances in Knowledge Discovery and Data
Mining, p. 15, 2014.
[9] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[10] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams", Advances in Knowledge Discovery and Data
Mining, 2014.
[11] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p.
12, 2013.
[12] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets
based on WIT-trees," International Journal of Advanced Computer Research, p. 9,
2013.
[13] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of
Semi Structured data: a Survey," International Journal of Advanced Computer
Research, p. 5, 2013.
[14] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large
networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[15] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear
Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[16] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for
Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and
Data Mining, p. 13, 2013.
[17] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining
Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic
Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[18] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets
from Sparse Data," Web-Age Information Management, p. 7, 2013.
[19] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent
pattern mining over data streams," International Journal of Advanced Computer
Research, p. 15, 2014.
[20] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering
Patterns of Reposting Behavior in Microblog," Advanced Data Mining and
Applications, p. 13, 2013.
[21] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering
weight conditions over data streams," International Journal of Advanced Computer
Research, 2014.
[22] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent
subgraphs," Proceedings of the 32nd symposium on Principles of database systems,
p. 12, 2013.
[23] H. Prüfer. Prüfer sequence. Available:
http://en.wikipedia.org/wiki/Pr%C3%BCfer_sequence
[24] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information
Retrieval. Cambridge, England: Cambridge University Press, 2008.
[25] Y. Xiao, J.-F. Yao, Z. Li, and M. H. Dunham, "Efficient data mining for maximal
frequent subtrees," Proceedings of 3rd IEEE International Conference on Data
Mining, p. 8, 2003.
[26] S. Tatikonda, S. Parthasarathy, and T. Kurc, "TRIPS and TIDES: New Algorithms for
Tree Mining," Proceedings of 15th ACM International Conference on Information
and Knowledge Management (CIKM), p. 12, 2006.
[27] F. D. R. Lopez, A.Laurent, P.Poncelet, and M.Teisseire, "FTMnodes: Fuzzy tree
mining based on partial inclusion," Advanced Data Mining and Applications, pp.
2224–2240, 2009.
[28] Kuramochi and Karypis. Available: http://glaros.dtc.umn.edu/gkhome/pafi/overview/
[29] M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest," Proceedings of the 8th
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(SIGKDD), Edmonton, Canada, p. 10, 2002.
This paper may be cited as:
Tajedi, S. and Naderi, H., 2014. Present a Way to Find Frequent Tree Patterns
using Inverted Index. International Journal of Computer Science and Business
Informatics, Vol. 14, No. 1, pp. 66-78.
An Approach for Customer Satisfaction:
Evaluation and Validation
Amina El Kebbaj and A. Namir
Laboratory of Modeling and Information Technology, Department of
Mathematics and Computer Science,
Faculty of Sciences Ben M'sik, Hassan2-Mohammedia University
Casablanca - 7955, Morocco
ABSTRACT
The main objective of this work is to develop a practical approach to improving customer satisfaction, which is generally regarded as the pillar of customer loyalty to the company. Today, customer satisfaction is a major challenge: listening to the customer and anticipating and properly managing his claims are cornerstones and fundamental values for the enterprise. In terms of the quality of the product, of skills and, above all, of the service provided to the customer, it is essential for organizations to differentiate themselves, especially in an increasingly competitive world, in order to ensure a higher level of customer satisfaction. Ignoring customer satisfaction can have harmful consequences for both economic performance and the organization's image. It is therefore crucial to develop new methods and approaches to the problem of customer dissatisfaction by improving the quality of the services provided to the customer. This work describes a simple and practical approach to modeling customer satisfaction in organizations in order to reduce the level of dissatisfaction; the approach respects the constraints of the organization and eliminates any action that can lead to loss of customers or degradation of the organization's image. Finally, the approach presented in this document is tested and evaluated.
Keywords: Approach, Evaluation, Quality, Satisfaction, Test of homogeneity, Validation.
1. INTRODUCTION
"Does the company have the most meaningful information at the right time to make the best possible business decisions?" is the question most companies want to answer. "The purpose of a company is to create and keep a customer" (Levitt, 1960): this declaration clearly identifies the important phases of the customer-management life cycle, namely acquiring customers and ensuring their loyalty. Companies are moving towards "customer-oriented" management and focus on the life cycle of their customers. According to Moisand (2002), the customer life cycle is defined as the time interval during which a customer's status changes from "new customer" to "lost/former customer".
In the context of a globalized and very competitive market, where departments have moved from a more classic, cost-centered level of management to a value-centered approach, the mission of decision-makers has evolved from proposing services and strategic partnerships to value creation. Achieving this goal requires having all the data needed to illuminate the past and clarify the present in order to predict the future, avoiding the gray areas caused by a lack of information. Business intelligence includes all the IT solutions (methods, facilities and tools) used to pilot the company and help make decisions.
This approach can be modeled by the three systems below:
1. Decision system: thinks, decides and controls;
2. Effective system: transforms and produces;
3. Information system: links the decision system with the effective system. Its main purposes are:
 Generating information
 Memorizing information
 Broadcasting information
 Processing information.
Figure 1. The information system
The information system is the subsystem of the organization responsible for collecting, storing, processing and broadcasting information to the effective system and the decision system. In the effective system, the information is a current view of business data (invoices, purchase orders, etc.); in the decision system, the information is more synthetic because it should support decision making (e.g., the list of the 3 least-sold products in January 2014). The information system thus links these two subsystems and must bring to all organizational actors of the company the information they need to act and
decide. The IS is therefore a representation of reality; it makes it possible to coordinate the activities of the company.
This work is situated in that spirit: it contributes to maximizing the company's customer satisfaction, that is, it proposes an approach that eliminates any form of customer loss inside an organization, then evaluates and validates that approach. Finally, it tests the homogeneity of the problem in order to measure customer satisfaction and conduct corrective actions based on two dimensions of quality:
 The "made" quality Qr: do the product, process or service conform to what was defined and expected? It comprises the different evaluations used to judge the achievement of process targets, to measure the effects and to check whether the desired results were achieved.
 The "perceived" quality Qp: what level of satisfaction is generated in the customer? It is defined by the excellence of the product (Zeithaml, 1988).
The ultimate goal is to have Qr = Qp.
Figure 2. Company's qualities
The introduction has defined the conceptual framework of the work and presented the issue addressed and the contributions in the domain of company governance. The remainder is composed of 3 sections: in the 2nd paragraph, we expose the approach, which is then statistically evaluated on concrete examples; in the 3rd paragraph, we test the homogeneity of the problem. The conclusion outlines this study and our contribution, and presents the various extensions and possible future works.
2. PROPOSED APPROACH
The Standish Group (Valery, 2001) conducted an international study evaluating the success and failure of IT projects. The data accumulated over the past ten years are based on a sample of 50,000 projects. This study identified three levels of evaluation of a project:
 Project success: characterized by a system delivered on time, at a cost within budget, and fully compliant with the specifications;
 Project failure: characterized by the cessation of the project;
 Partial success or partial failure: characterized by the late delivery of a system that is only partially responsive, especially in terms of business scope and specifications, at a cost of up to 200% of the original budget.
Only 29% of projects were successful, 53% were partial successes or partial failures, and 18% failed. The proportion of projects abandoned, over budget or late reaches 71%.
2.1 Statement
This study shows that customer satisfaction is not always reached; making perceived quality tend towards the desired quality presents a real challenge. Within the company, quality is increasingly focused on customer satisfaction. To win contracts, business leaders rely more on quality than on price advantages. Staff involvement, together with listening to the customer, is a key element for the success of a quality approach. The latter is the implementation of all the resources available to an establishment to provide a service that meets the needs and expectations of customers. From the customer's perspective, a warm welcome and quality service are "normal"; it is the lack of quality that penalizes him.
To attract the customer, we must establish standards within the company by identifying the market need. There are international standards, the ISO standards, that ensure safe, reliable, high-quality products and services. For companies, they are strategic tools for lowering costs, increasing productivity, and reducing waste and errors. Obtaining a certification is also the preferred way for companies to make the quality of their organization known to their customers and their suppliers.
2.2 Steps of the approach
Below are the 7 best practices for customer satisfaction:
a) To develop the team's skills: provide additional training on IT tools to raise the team's skills.
b) To make customer satisfaction a challenge for the whole company: the company can use the dissatisfaction of its customers to improve its products and services. Bill Gates, Microsoft CEO, said that "unhappy customers are the best sources of information", because customers who express dissatisfaction enable companies to identify and resolve service defects faster.
Dissatisfied customers are very expensive for companies: the cost of recruiting a new customer is usually five times higher than the cost of retaining an acquired customer. It is far better to work to keep one's customers than to recruit new ones to replace those who leave. Thus, according to Jacques-Antoine Granjon, founder of Vente-privee.com, the treatment of customer dissatisfaction should be considered not only as a cost but as an investment.
c) To motivate teams: to clearly mark the importance of customer satisfaction, some companies have introduced a variable component of pay for some employees, calculated on the basis of indicators related to customer satisfaction.
d) To facilitate customer contacts: there are 5 types of communication channels:
 Telephone: availability (24/7), time savings;
 Face to face: immediate response, human contact;
 E-mail: traceability (written proof);
 Website: simplicity;
 Postal mail.
e) To anticipate dissatisfaction: whatever the quality of claims processing, it may be better to get ahead of the claim and make a gesture to customers who have had a bad product experience, or where this risk exists, without waiting for claims to occur.
f) To measure customer satisfaction (evaluate to improve): today it is essential to regularly assess how well the final goal of customer satisfaction is achieved, for example by sending all customers who have experienced dissatisfaction, after the close of the case, a satisfaction survey designed by the customer service that measures the accessibility of the service, the reception, and the understanding and treatment of the dissatisfaction.
g) To reach out to customers on the Internet: the benefit may also be provided on the Internet by another customer or a social network (Twitter, Facebook, etc.). Make social media a true extension of customer service, with employees able to participate in discussions and respond directly to customer requests on these media.
3. EVALUATION AND VALIDATION OF THE APPROACH
Consider the case of a service company that manages the work of large potential clients such as "France Gas". The latter signed a contract with the host company specifying the clauses that must be respected, among them the rate of customer satisfaction, which should reach 92%; this percentage was established by agreement between the two parties, and if it is not met, a penalty is applied for customer dissatisfaction. A development team of the host company handles the realization of applications for "France Gas". This team should produce 22 applications monthly, and the dissatisfaction rate should not exceed 8% (2 applications per month). Client dissatisfaction is due to the following causes:
 The application does not answer the need or generates unexpected errors after delivery
 Timeout
To avoid these situations, companies have an interest in implementing a continuous improvement process whose ultimate goal is the elimination of all forms of waste, such as customer dissatisfaction. The problem to be solved is, for a period Pn, to maximize the number of satisfied customers. To evaluate the approach, we test it on a sample for evaluation and validation.
We start by stating our statistical hypotheses (H0 and H1):
 The first, the null hypothesis, noted H0: "Qr = Qp", where
Qr is the desired proportion of customer satisfaction and
Qp is the real percentage of satisfaction.
 The second, the alternative hypothesis, H1: "Qp < Qr".
3.1 Before the approach
3.1.1 Example 1: April 2013
The team was able to process only 10 simple applications. The customer sent feedback presenting his degree of satisfaction. There are 3 kinds of response: S (Satisfied), NS (Not Satisfied), N (Neutral).
Table 1. Customer's feedback of April 2013
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 PipRep 2.0 FR | NS | timeout
2 Contextor 2.8 FR | NS | timeout
3 Contextor 2.2.3 | S | ------
4 Hermes Horizon | S | ------
5 Agent SSR 2011 | NS | Application does not work correctly
6 Plugin SSR 2011 | NS | Application does not work correctly
7 Agent Altiris 2011 | S | ------
8 GECO 1.17.3 FR | NS | timeout
9 Nexthink collector | S | ------
10 Cosmocom 4 FR 1.0 | S | ------
Once the feedback is received, we calculate the monthly satisfaction percentages, as shown in the following table:
Table 2. Satisfaction rates of April 2013
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 5 | 50%
NS (unsatisfied) | 4 | 40%
N (neutral) | 1 | 10%
The table above can be modeled by the following figure:
Figure 3. Customer satisfaction of April 2013
PS(t0) = P(Xt0 = S) = 0.5
PNS(t0) = P(Xt0 = NS) = 0.4
PN(t0) = P(Xt0 = N) = 0.1
As Qr = 92%, and the hypotheses are H0: "Qr = Qp" and H1: "Qp < Qr", we use here a one-tailed (left) test. If
(Qp − Qr) / sqrt(Qr(1 − Qr)/n) > −tα
then we accept the hypothesis H0 and reject H1 with error risk α = 5%.
tα is calculated using the table of the normal distribution:
P(−tα ≤ T ≤ tα) = 1 − α = 0.95 => tα = 1.645 using the normal distribution table, and tα = 1.833 using the Student distribution table.
We have Qr = 92% and, from the example, f = 50%:
(f − Qr) / sqrt(Qr(1 − Qr)/n) = (0.5 − 0.92) / sqrt(0.92(1 − 0.92)/10) = −0.42/0.0857 = −4.9 < −1.645
So we accept the hypothesis H1: "Qp < Qr" and reject H0: "Qr = Qp" with error risk α = 5%. The observed difference is significant.
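The test in Example 1 can be reproduced with a few lines of Python. The helper below is an illustration, not part of the paper; the threshold 1.645 is the one-tailed normal value for α = 5% quoted above.

```python
from math import sqrt

def one_tailed_proportion_test(f, q_r, n, t_alpha=1.645):
    """Left-tailed test of H0: Qp = Qr against H1: Qp < Qr.

    f: observed satisfaction rate, q_r: target rate, n: sample size.
    Returns the test statistic and True when H0 is rejected at the
    level implied by t_alpha (1.645 for alpha = 5%).
    """
    z = (f - q_r) / sqrt(q_r * (1 - q_r) / n)
    return z, z < -t_alpha  # reject H0 when z falls below -t_alpha

# April 2013: f = 0.5, Qr = 0.92, n = 10
z, reject = one_tailed_proportion_test(0.5, 0.92, 10)
print(round(z, 1), reject)  # -4.9 True
```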
3.2 After the approach
3.2.1 Example 2: December 2013
The team treated 22 applications, as shown in the following table:
Table 3. Customer's feedback of December 2013
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 MSC_CASP69 | NS | timeout
2 MSC_MDX | NS | timeout
3 Woodmac | S | ------
4 Whoswho | S | ------
5 Adobe Air Installer | S | ------
6 WinZip | S | ------
7 MSC_SetupDemdet | S | ------
8 Jabber | S | ------
9 TrendMicro_Office | S | ------
10 ORG+ | S | ------
11 QlikView | S | ------
12 Q4-Engica | N | ------
13 TMS | N | ------
14 MSCLink_Core | S | ------
15 MIPS | S | ------
16 Rsclientprint | NS | Application does not work correctly
17 TextPad | S | ------
18 MSC_DMX | S | ------
19 MSC_MSCOMCT2 | NS | timeout
20 Add-in Excel | S | ------
21 Pre-req Excel | S | ------
22 Ios | S | ------
We calculate the monthly satisfaction percentages, as shown in the following table:
Table 4. Satisfaction rates of December 2013
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 16 | 72.72%
NS (unsatisfied) | 4 | 18.18%
N (neutral) | 2 | 9.09%
The table above can be modeled by the following figure:
Figure 4. Customer satisfaction of December 2013
PS(t0) = P(Xt0 = S) = 0.727
PNS(t0) = P(Xt0 = NS) = 0.181
PN(t0) = P(Xt0 = N) = 0.091
We have Qr = 92% and, from the example, f = 72%:
(f − P0) / sqrt(P0(1 − P0)/n) = (0.72 − 0.92) / sqrt(0.92(1 − 0.92)/22) = −0.2/0.182 = −1.09 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between f and P0 is due to sampling fluctuations.
3.2.2 Example 3: January 2014
The team treated 21 applications, as shown in the following table:
Table 5. Customer's feedback of January 2014
APPLICATIONS | SATISFACTION (S, NS, N) | REASONS OF DISSATISFACTION
1 Windows6.1-KB2574819 | S | ------
2 MigrationAssistantTool | NS | The installation must be silent
3 See Electrical Viewer 4 | S | ------
4 Adobe_Flash_Player | S | ------
5 MSC_DEPOT | S | ------
6 Colibri 2.0 | S | ------
7 Navision | S | ------
8 OFFICE 2013 | S | ------
9 Windows6.1-KB2592687 | S | ------
10 CheckPoint VPN | S | ------
11 Interlink_MSCLink | S | ------
12 CrystalReportsRuntime | N | ------
13 InterlinkComponentOne | S | ------
14 MSXML | S | ------
15 VisualC++Redistributable | S | ------
16 ReportViewer_2010 | NS | Application does not work correctly
17 .Net_Framework | S | ------
18 MSCLink_Core | S | ------
19 MSCLink_Configuration | NS | timeout
20 LDOC | S | ------
21 MigrationAssistantTool | S | ------
We calculate the monthly satisfaction percentages, as shown in the following table:
Table 6. Satisfaction rates of January 2014
Satisfaction type | Customer satisfaction | Satisfaction rate
S (satisfied) | 17 | 80.95%
NS (unsatisfied) | 3 | 14.28%
N (neutral) | 1 | 4.76%
The table above can be modeled by the following figure:
Figure 5. Customer satisfaction of January 2014
PS(t0) = P(Xt0 = S) = 0.81
PNS(t0) = P(Xt0 = NS) = 0.14
PN(t0) = P(Xt0 = N) = 0.05
We have Qr = 92% and, from the example, f = 80%:
(f − P0) / sqrt(P0(1 − P0)/n) = (0.8 − 0.92) / sqrt(0.92(1 − 0.92)/21) = −0.12/0.187 = −0.64 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between f and P0 is due to sampling fluctuations.
4. TEST OF HOMOGENEITY
We are faced with two samples for which it is usually not known whether they come from the same source population. We seek to test whether these samples share the same characteristic ℓ. Two values ℓ1 and ℓ2 are observed; the difference between them may be due either to sampling fluctuations or to a difference between the characteristics of the two original populations. That is to say, from the examination of two samples of sizes n1 and n2, extracted respectively from populations P1(M1; α1) and P2(M2; α2), these tests are used to decide between:
H0 = "ℓ1 = ℓ2" (we conclude homogeneity)
H1 = "ℓ1 ≠ ℓ2" (we conclude heterogeneity).
In our case we test the homogeneity of 2 proportions:
f1 = proportion of units having the character X in sample 1;
f2 = proportion of units having the character X in sample 2;
p1 = proportion of units having the character X in population 1;
p2 = proportion of units having the character X in population 2.
H0 = "P1 = P2 = P" and H1 = "P1 ≠ P2".
P is replaced by the estimator f = (n1 f1 + n2 f2)/(n1 + n2) = (22 × 0.72 + 21 × 0.81)/(22 + 21) = 0.764
x = (0.81 − 0.72) / sqrt(0.764 × 0.24 × (1/22 + 1/21)) = 0.02 > −1.645
So we conclude the homogeneity of the proposed solution. The population is homogeneous, and the observed difference is not significant; it is due to sampling fluctuations.
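The pooled estimate above can be checked with a short Python sketch. It uses the standard pooled two-proportion statistic, with f(1 − f) under the square root; it is an illustration only, and intermediate rounding may differ from the values printed in the text.

```python
from math import sqrt

def pooled_two_proportion_stat(f1, n1, f2, n2):
    """Pooled estimator f and test statistic for H0: p1 = p2."""
    f = (n1 * f1 + n2 * f2) / (n1 + n2)          # pooled proportion
    se = sqrt(f * (1 - f) * (1 / n1 + 1 / n2))   # pooled standard error
    return f, (f2 - f1) / se

# December 2013 (f1 = 0.72, n1 = 22) vs January 2014 (f2 = 0.81, n2 = 21)
f, x = pooled_two_proportion_stat(0.72, 22, 0.81, 21)
print(round(f, 3))  # 0.764
```

The statistic x stays well inside the acceptance region, consistent with the conclusion of homogeneity.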
5. CONCLUSIONS
The work presented develops a practical and pragmatic approach to maximize customer satisfaction in an organization over a given period. An approach has been proposed, and its evaluation and validation are described above. This work opens the way, in our view, to diverse research perspectives situated on two planes: deepening the realized research and extending the research domain. In terms of deepening the proposed work, it would be interesting first to use Markov chains to model the proposed approach statistically, and to propose or develop practical tools for implementing it. As for extending the research domain, it would be interesting to connect this approach to the governance of information systems and to drive the decision-making system, which consists of investigating options and comparing them in order to choose an action that helps in making decisions.
REFERENCES
[1] BUFFA, Elwood. Operations Management, 3rd Ed., NY, John Wiley & Sons, 1972.
[2] FRITZSIMMONS, James A. and Mona J. FRITZSIMMONS. Service Management: Operations, Strategy and Information Technology, 3rd Ed., NY, Irwin/McGraw-Hill, 2001.
[3] Z. Adhiri, S. Arezki, A. Namir. What is Application LifeCycle Management?, International Journal of Research and Reviews in Applicable Mathematics and Computer Science, ISSN: 2249-8931, December 2011.
[4] http://hal.archives-ouvertes.fr/docs/00/71/95/35/PDF/2010CLF10335.pdf
[5] STEVENSON, William J. Introduction to Management Science, 2nd Ed., Burr Ridge, IL, Richard D. Irwin, 1992.
[6] HILLIER, Frederick S., Mark S. HILLIER and Gerald J. LIEBERMAN. Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, New York, Irwin/McGraw-Hill, 2000.
[7] A. EL KEBBAJ and A. NAMIR. Modeling Customer's Satisfaction. Day of Science Engineers, Faculty of Science Ben M'Sik, Casablanca, July 29, 2013.
[8] http://www.projectsmart.co.uk/docs/chaos-report.pdf
[9] http://info.informatique.entreprise.over-blog.com/article-approche-du-systeme-d-information-dans-l-entreprise-69885381.html
[10] http://www.hamadiche.com/Cours/Stat/Cours5.pdf
[11] S. ARIZKI. ITGovA: Proposition of a New Approach to the Governance of Information Systems. PhD in Computer Science, defended at the Faculty of Sciences Ben M'Sik, Casablanca, 24/02/2013.
This paper may be cited as:
El Kebbaj, A. and Namir, A., 2014. An Approach for Customer Satisfaction: Evaluation and Validation. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 79-91.
Spam Detection in Twitter - A Review
C. Divya Gowri and Professor V. Mohanraj
Sona College of Technology, Salem
ABSTRACT
Social networking sites have become popular in recent years; among these sites, Twitter is one of the fastest growing. It plays the dual role of an Online Social Network (OSN) and a micro-blogging service. Spammers invade Twitter trending topics (popular topics discussed by Twitter users) to pollute the useful content. Social spamming is more successful than email spamming because it exploits the social relationships between users. Spam detection is important because Twitter is widely used for commercial advertisement, and spammers invade users' private information and damage their reputation. Spammers can be detected using content-based and user-based attributes. Traditional classifiers are required for spam detection. This paper focuses on the study of detecting spam in Twitter.
Keywords: Social Network Security, Spam Detection, Classification, Content-based Detection.
1. INTRODUCTION
Web-based social networking services connect people to share interests and activities across political, economic, and geographic borders. Online social networking sites like Twitter, Facebook, and MySpace have become popular in recent years. They allow users to meet new people, stay in touch with friends, and discuss everything including jokes, politics, news, etc. Using social networking sites, marketers can directly reach customers; this benefits not only the marketers but also the users, as they get more information about the organization and the product. Twitter [1] is one of these social networking sites. Twitter provides a micro-blogging service (the exchange of small elements of content such as short sentences, individual images, or video links) where users can post their messages, called tweets. A tweet is limited to 140 characters; only HTTP links and text are allowed. A Twitter user is identified by a user name and optionally by a real name. When user 'A' starts following other users, their tweets appear on A's page. User A can be followed back if the other user desires. Trending topics in Twitter are identified with hashtags ('#'). When a user likes a tweet, he/she can 'retweet' that message. Tweets are visible publicly by default, but senders can deliver a message only to their
International Journal of Computer Science and Business Informatics
IJCSBI.ORG
ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 93
followers. The „@‟ sign followed by username is a reply to other user. The
most common type of spamming in Twitter is through Tweets. Sometimes it
is via posting suspicious links.
Spam [14] can arrive in the form of direct tweets to your Twitter inbox.
Unfortunately, spammers use Twitter as a tool to post malicious links and
send spam messages to legitimate users. They also spread viruses or simply
damage the system's reputation. Twitter is widely used for commercial
advertisement, and spammers invade users' private information and damage
their reputation. Attackers advertise on Twitter, offering huge discounts
and free products; when users try to purchase these products, they are asked
to provide account information, which the attackers harvest and misuse.
Spam detection in any social networking site is therefore important.
2. RELATED WORKS
McCord et al. [1] propose user-based and content-based features to
facilitate spam detection.
User Based Features
The user-based features considered are the number of friends, the number of
followers, user behavior (e.g. the time periods and frequencies at which a
user tweets), and the reputation of the user (based on followers and
friends). The reputation of user j is given by

R(j) = n_i(j) / (n_i(j) + n_o(j))          (2.1)

where n_i(j) is the number of followers of user j and n_o(j) is the number
of friends user j has. According to the Twitter spam and abuse policy, 'if
the user has a small number of followers compared to the number of people
the user is following, then it may be considered a spam account'. Spammers
tend to be most active during the early morning hours, when regular users
tweet much less.
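The reputation measure in equation (2.1) and the follower/following heuristic can be sketched in a few lines of Python (an illustrative sketch only; the function names and the 0.5 cut-off are our own assumptions, not taken from [1]):

```python
def reputation(followers: int, friends: int) -> float:
    """R(j) = n_i(j) / (n_i(j) + n_o(j)), per equation (2.1).

    followers = n_i(j) (incoming links), friends = n_o(j) (outgoing links).
    """
    total = followers + friends
    return followers / total if total else 0.0

def looks_like_spam_account(followers: int, friends: int,
                            threshold: float = 0.5) -> bool:
    # Heuristic from the Twitter spam policy quoted above: far fewer
    # followers than followings suggests a spam account.  The threshold
    # value is an illustrative assumption.
    return reputation(followers, friends) < threshold
```

An account following 990 users but followed by only 10 has reputation 0.01 and would be flagged, while a balanced account (reputation 0.5) would not.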
Content Based Features
The content-based features [11] considered in this approach are the number
of Uniform Resource Locators (URLs), replies/mentions, keywords/word
weight, retweets, and hash tags. A retweet is a reposting of someone's
post; it is like a normal post carrying the original author's name, and it
shares the entire tweet with all of one's followers. Tweets containing '#'
mark the popular topics being discussed by users.
Secondly, they compare four traditional classifiers, namely Random Forest,
Support Vector Machine (SVM), Naïve Bayesian, and K-nearest neighbor, for
detecting spammers. Among these, Random Forest is found to be the most
effective, although it was evaluated on an imbalanced data set (a data set
with more regular users than spammers). Alex Hai Wang [2] considers the
follower-friend relationship and models a directed social graph, using
content-based and graph-based features to facilitate spam detection.
Graph Based Features
A social graph is modeled as a directed graph G = (V, A), where V is the
set of nodes representing user accounts and A is the set of arcs connecting
them. An arc a = (i, j) indicates that user i follows user j. Followers are
the incoming links (in-links) of a node, i.e. people following you, whom
you need not follow back. Friends are the outgoing links (out-links), i.e.
people you are following. A mutual friend is a follower and a friend at the
same time. When there is no connection between two users, they are
considered strangers.
Fig 2.1 A Simple Twitter Graph
In the figure above, user A follows user B, while users B and C follow each
other; thus B and C are mutual friends, and A and C are strangers. The
graph-based features considered are the number of followers, the number of
friends, and the reputation of a user.
The classifier used in this paper to detect spam is the Naïve Bayesian
classifier [10]. It is based on Bayes' theorem:

P(Y|X) = P(X|Y) P(Y) / P(X)          (2.2)

A Twitter account is represented as a feature vector X, and each account is
assigned one of two classes Y, spam or non-spam, under the assumption that
the features are conditionally independent. This classifier is easy to
implement and requires only a small training data set, but the
conditional-independence assumption may cost accuracy, since the classifier
cannot model dependencies between features.
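A minimal sketch of such a Naïve Bayesian text classifier, assuming bag-of-words features and add-one (Laplace) smoothing (the training data, function names, and smoothing choice below are illustrative assumptions, not details from [2]):

```python
from collections import Counter
import math

def train(tweets, labels):
    """Count word frequencies per class; the counts estimate P(x_i | Y)."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(tweets, labels):
        word_counts[label].update(text.lower().split())
    vocab = set().union(*word_counts.values())
    return word_counts, class_counts, vocab

def classify(text, word_counts, class_counts, vocab):
    """Pick argmax_Y of log P(Y) + sum_i log P(x_i | Y): equation (2.2)
    up to the constant P(X), with Laplace smoothing for unseen words."""
    total = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for label in class_counts:
        score = math.log(class_counts[label] / total)
        n = sum(word_counts[label].values())
        for w in text.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best
```

Note that the product of per-word likelihoods is exactly where the conditional-independence assumption enters: correlations between words are ignored.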
Twitter Account Features
Zi Chu et al. [13] review classification features for detecting spammers,
including tweet-level and account-level features. The tweet-level features
include the spam content proportion, i.e. the tweet text is checked against
a spam word list and the final landing URL is checked. The account-level
features include the account profile, i.e. the short self-description text
and homepage URL, which are checked for spam words.
Fabricio Benevenuto et al. [3] consider the problem of detecting spammers.
In their study approximately 96% of legitimate users and 70% of spammers
were correctly classified. As in [1], user-based and content-based
attributes are considered. To measure detection accuracy, a confusion
matrix is introduced.
Fig 2.2 An Example of a Confusion Matrix
In the matrix, 'a' is the number of spam accounts correctly classified, 'b'
is the number of spam accounts wrongly classified as non-spam, 'c' is the
number of non-spam accounts wrongly classified as spam, and 'd' is the
number of non-spam accounts correctly classified. For effective evaluation
of the classification, metrics such as precision, recall, and F-measure
(Micro-F1, Macro-F1) are considered.
Evaluation Metrics
Precision: the ratio of the number of spam accounts classified correctly to
the total number of accounts predicted as spam:

Precision, p = a / (a + c)          (2.3)

Recall: the ratio of the number of spam accounts classified correctly to
the total number of actual spam accounts:

Recall, r = a / (a + b)          (2.4)

F-measure: the harmonic mean of precision and recall:

F-measure = 2pr / (p + r)          (2.5)
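The three metrics follow directly from the confusion-matrix cells of Fig 2.2; a small Python sketch (the function name and sample counts are our own, chosen to echo the roughly 70% spammer recall reported in [3]):

```python
def metrics(a, b, c, d):
    """a: spam correctly classified, b: spam misclassified as non-spam,
    c: non-spam misclassified as spam, d: non-spam correctly classified,
    following the layout of the confusion matrix in Fig 2.2."""
    precision = a / (a + c)                                    # (2.3)
    recall = a / (a + b)                                       # (2.4)
    f_measure = 2 * precision * recall / (precision + recall)  # (2.5)
    return precision, recall, f_measure
```

For example, 100 real spammers of whom 70 are caught (a=70, b=30) with 4 false alarms (c=4, d=96) gives recall 0.70 and precision about 0.95.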
The classifier used to detect spam is the SVM, a state-of-the-art
classification method; this approach uses a non-linear SVM with the Radial
Basis Function (RBF) kernel, which allows the SVM to fit complex decision
boundaries. The biggest limitations of the support vector approach are the
choice of kernel and the high algorithmic complexity. This approach focuses
on detecting spam rather than spammers, so that it can be used for
filtering spam: once a spammer is detected it is easy to suspend that
account and block the IP address, but spammers simply continue their work
from new accounts.
Puneeta Sharma and Sampat Biswas [4] propose two key components: (1)
identifying the timestamp gap between two successive tweets and (2)
identifying tweet content similarity. They found two common techniques used
by spammers: (1) posting duplicate content with small modifications to the
tweet, and (2) posting spam within short intervals. Their spam
identification approach includes BOT activity detection and a tweet
similarity index. Twitter data can be filtered in various ways, e.g. by
user id or by keyword. Many spammers post spam messages using a BOT (a
computer program), which reduces the time between consecutive tweets. To
calculate the timestamp gap between tweets, they first cluster tweets by
user id and sort them by increasing timestamp.
Fig 2.3 BOT activity detection (flowchart: cluster tweets by user id,
calculate the time difference between consecutive tweets; a gap of less
than 10 seconds marks the tweets as spam, otherwise non-spam)
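The flow of Fig 2.3 can be sketched directly in Python (a minimal sketch; the data layout, function name, and the choice to flag whole users rather than individual tweets are our own assumptions, while the 10-second cut-off is the one shown in the figure):

```python
from itertools import groupby

def detect_bot_activity(tweets, gap_seconds=10):
    """tweets: list of (user_id, unix_timestamp) pairs.

    Flags a user as a suspected BOT when any two consecutive tweets are
    posted within gap_seconds of each other (10 s, per Fig 2.3)."""
    suspected = set()
    tweets = sorted(tweets)  # cluster by user id, then by timestamp
    for user, group in groupby(tweets, key=lambda t: t[0]):
        times = [ts for _, ts in group]
        if any(b - a < gap_seconds for a, b in zip(times, times[1:])):
            suspected.add(user)
    return suspected
```

Sorting the (user, timestamp) pairs performs both flowchart steps at once: tweets are grouped per user and ordered in time, so consecutive differences are the timestamp gaps.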
Spammers can be classified as (1) desperate spammers and (2) sophisticated
spammers. Desperate spammers use automated programs to post multiple tweets
with small time differences between posts. Sophisticated spammers introduce
a time gap between tweets. Spammers mostly post duplicate tweets in
trending topics, e.g. by jumbling the words of a tweet, using a fixed set
of words, including numbers in the topic, or appending commercial
advertisements to the topic. The tweet similarity index approach detects
this behavior of spammers and filters spam.
They first cluster tweets by user id and then process each user's set of
tweets independently. They create buckets of similar tweets by calculating
the Jaccard and Levenshtein similarity coefficients. As a result, the most
similar tweets end up in the same bucket, yielding clusters of similar
text. Once all tweets are collected, they check the size of each bucket; if
it is greater than one, the bucket is considered spam.
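A greedy sketch of the bucketing step for one user's tweets, using word-set Jaccard similarity (the 0.7 similarity threshold, the greedy first-fit strategy, and the function name are illustrative assumptions, not the exact procedure of [4]):

```python
def bucket_similar_tweets(tweets, threshold=0.7):
    """Each tweet joins the first bucket whose representative tweet is
    Jaccard-similar above `threshold`; otherwise it opens a new bucket.
    Returns only the buckets of size > 1, treated as spam clusters."""
    def jaccard(a, b):
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    buckets = []
    for t in tweets:
        for bucket in buckets:
            if jaccard(t, bucket[0]) >= threshold:
                bucket.append(t)
                break
        else:
            buckets.append([t])
    # Buckets holding more than one tweet contain near-duplicates.
    return [b for b in buckets if len(b) > 1]
```

Near-duplicate tweets (e.g. the same advertisement with one word appended) land in the same bucket, reproducing the "bucket size > 1 implies spam" rule of the paper.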
Fig 2.4 Tweet Similarity Index (flowchart: cluster tweets by user id,
calculate Jaccard and Levenshtein distances to create buckets of similar
tweets; a bucket of size greater than one is flagged as spam, otherwise
non-spam)
Levenshtein Distance
The Levenshtein distance is a string metric for measuring the difference
between two sequences of text. Informally, the Levenshtein distance between
two words is the minimum number of single-character edits (insertions,
deletions, and substitutions) required to change one word into the other.
The term edit distance is often used to refer to the Levenshtein distance.
The distance is zero if the strings are equal. For example, the Levenshtein
distance between "sitter" and "sitting" is 3:
sitter → sittir (substitution of "i" for "e")
sittir → sittin (substitution of "n" for "r")
sittin → sitting (insertion of "g" at the end)
The Levenshtein distance is used to find duplicate tweets: if two tweets
are duplicates, their distance is zero.
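The distance is computed with the classic two-row dynamic-programming algorithm; a compact Python version (the function name is our own):

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn s into t (dynamic programming)."""
    prev = list(range(len(t) + 1))  # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (cs != ct)))    # substitution
        prev = curr
    return prev[-1]
```

Running it on the example above confirms that "sitter" and "sitting" are three edits apart, while identical tweets score zero.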
Jaccard Index
The Jaccard index, also called the Jaccard similarity coefficient, is used
for comparing the similarity and diversity of sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|          (2.6)

The Jaccard distance measures dissimilarity between sample sets and is
obtained by subtracting the Jaccard coefficient from 1:

d_J(A, B) = 1 − J(A, B)          (2.7)
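Equations (2.6) and (2.7) translate directly into set operations (the convention of returning 1.0 for two empty sets is our own choice to avoid division by zero):

```python
def jaccard_index(a, b):
    """J(A, B) = |A ∩ B| / |A ∪ B|, per equation (2.6)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty sets count as identical
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """d_J(A, B) = 1 - J(A, B), per equation (2.7)."""
    return 1.0 - jaccard_index(a, b)
```

For tweet comparison the sets would typically be the sets of words in each tweet.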
Dolvara Gunatilaka [9] discusses two privacy issues. The first is the
user's identity, or user anonymity; the second is leakage of the user's
profile or personal information.
User Anonymity
In many social networking sites users use their real names for their
accounts. Two attacks can expose a user's anonymity: (1) the
de-anonymization attack and (2) the neighborhood attack [15]. In the first,
the user's anonymity is revealed through history stealing and group
membership information; in the second, the attacker identifies the
neighbors of the victim's node. Attackers are attracted by the personal
information in a user's profile, such as name, date of birth, contact
information, relationship status, current work, and educational background.
Information can leak because of poor privacy settings: many profiles are
made public, so anyone can view them. Information can also leak through
third-party applications. Social networking sites provide an Application
Programming Interface (API) for third-party developers to create
applications; once users access these applications, the third party can
access their information automatically.
Social Worms
The paper also discusses several social worms, among which the Twitter worm
is one of the most popular. 'Twitter worm' is a term for worms that spread
through Twitter. There are many versions; two worms discussed in the paper
are:
Profile Spy worm: This worm spreads by posting a link that downloads a fake
third-party application called "Profile Spy". When users try to download
the application, they must fill in personal information, which allows the
attacker to obtain it. Once an account is infected, it continuously tweets
malicious messages to its followers.
Google worm: This worm uses a shortened Google URL that tricks users into
clicking the link. The fake link redirects users to a fake anti-virus
website, which displays a warning that the computer is infected and lets
the user download the fake antivirus, which is actually malicious code.
Sender Receiver Relationship
Jonghyuk Song et al. [7] propose a spam filtering technique based on the
sender-receiver relationship. The paper addresses two problems in detecting
spam: first, account features can be fabricated by spammers; second,
account features cannot be collected until a number of malicious messages
have been reported from an account. Their spam filter therefore does not
consider account features; instead it uses relational features, namely the
connectivity and the distance between the sender and receiver, which are
difficult for spammers to manipulate. Since Twitter limits tweets to 140
characters, spammers cannot put much information into them, so they resort
to posting URLs that lead to spam. Messages are classified as spam based on
the sender; content filtering is not effective in Twitter because tweets
contain only a small amount of text.
Restrictions in Twitter
Some of the restrictions in Twitter [9] are:
a. Following a large number of users in a short time.
b. Unfollowing and following someone repeatedly.
c. A small number of followers compared to the number of accounts followed.
d. Duplicate tweets or updates.
e. Updates consisting only of links.
The distance between two users is calculated as follows [5][6]: when two
users are directly connected by an edge, the distance is one, meaning the
two users are friends; when the distance is greater than one, they have
common friends but are not friends themselves. Connectivity represents the
strength of the relationship and is measured by counting the number of
paths between the two users; the connectivity between a spammer and a
legitimate user is therefore weak. The problem with this system is that it
identifies messages as normal if they come from infected friends, and
attackers may send spam messages from legitimate accounts by stealing
passwords.
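The two relational features can be sketched on a small adjacency-list graph (a minimal sketch under our own assumptions: the graph representation, function names, and the two-hop limit on path counting are illustrative, not details from [7]):

```python
from collections import deque

def shortest_distance(graph, src, dst):
    """BFS hop count in a friendship graph given as an adjacency dict.
    Distance 1 means the users are friends; distance 2 means they share
    common friends but are not friends themselves."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        if node == dst:
            return d
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None  # the users are not connected at all

def count_paths(graph, src, dst, max_len=2):
    """Connectivity sketch: number of paths of up to max_len hops.
    More paths between two users means a stronger relationship."""
    if src == dst:
        return 1
    if max_len == 0:
        return 0
    return sum(count_paths(graph, nxt, dst, max_len - 1)
               for nxt in graph.get(src, ()) if nxt != src)
```

In a graph where user "a" reaches "d" through both "b" and "c", the distance is 2 and the connectivity (path count) is 2, whereas a spammer would typically have few or no paths to a target user.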
D. Karthika Renuka and T. Hamsapriya [8] address unsolicited email, also
called spam, one of the fastest growing problems associated with the
Internet. Among the many proposed techniques, Bayesian filtering is
considered an effective one; it works on the probability of words occurring
in spam and legitimate mails. Many spam detection systems use keywords to
detect spam mails, but misspellings arise, so the keyword blacklist must be
constantly updated, which is difficult. For this purpose a word stemming or
hashing technique is proposed, which improves the efficiency of the
content-based filter; such filters are useless if they do not understand
the meaning of the words. Two techniques are employed to find spam content:
Content-based spam filter [10]: This filter works on the words and phrases
of the email content, associating each word with a numeric value; if this
value crosses a certain threshold the mail is considered spam. It can
detect only valid words with correct spellings, and it uses Bayes' theorem
to detect spam content.
Word stemming or word hashing technique [12]: This filter extracts the stem
of a modified word so that the efficiency of detecting spam content is
improved. A rule-based word stemming algorithm is used for spam detection.
Stemming converts a word into a related base form, for example converting
plurals into singulars or removing suffixes.
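A toy rule-based suffix stripper in the spirit of the stemming step described above (the suffix list and minimum-stem-length rule are our own illustrative choices; this is not the full Porter algorithm or the exact rules of [8]):

```python
def stem(word: str) -> str:
    """Strip the longest matching suffix, keeping a stem of at least
    three characters, so variants map to one blacklist entry."""
    word = word.lower()
    for suffix in ("ingly", "edly", "ing", "ies", "es", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word
```

With such a rule, "products", "clicked", and "click" all reduce toward a common stem, so a spam word list needs only one entry per word family.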
3. CONCLUSIONS
Spammers are a major problem in any online social networking site. Once a
spammer is detected it is easy to suspend the account or block the IP
address, but spammers then spread spam from other accounts or IP addresses.
It is therefore recommended to check tweets for spam content on the server:
if any content matches the spam words in the data set, the tweet is
prevented from being displayed, and the accuracy of classifying the spam
content is evaluated. Many traditional classifiers exist for separating
spammers from legitimate users, but many of them wrongly classify
non-spammers as spammers. Hence it is more effective to check for spam
content in the tweets themselves.
REFERENCES
[1] M. McCord, M. Chuah, "Spam Detection on Twitter Using Traditional Classifiers". Lecture Notes in Computer Science, Volume 6906, pp. 175-186, September 2011.
[2] A. H. Wang, "Don't Follow Me: Spam Detection in Twitter". Proceedings of the 5th International Conference on Security and Cryptography (SECRYPT), July 2010.
[3] Fabricio Benevenuto, Gabriel Magno, Tiago Rodrigues, and Virgilio Almeida, "Detecting Spammers on Twitter". CEAS 2010: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 2010.
[4] Puneeta Sharma and Sampat Biswas, "Identifying Spam in Twitter Trending Topics". American Association for Artificial Intelligence, 2011.
[5] "Levenshtein distance", http://en.wikipedia.org/wiki/Levenshtein_distance.
[6] "Jaccard index", http://en.wikipedia.org/wiki/Jaccard_index.
[7] Jonghyuk Song, Sangho Lee and Jong Kim, "Spam Filtering in Twitter Using Sender-Receiver Relationship". Recent Advances in Intrusion Detection, Lecture Notes in Computer Science, Volume 6961, pp. 301-317, 2011.
[8] D. Karthika Renuka, T. Hamsapriya, "Email Classification for Spam Detection Using Word Stemming". International Journal of Computer Applications, 1(5), pp. 45-47, February 2010.
[9] "Reporting spam on Twitter", http://support.twitter.com/articles/64986-reporting-spam-on-twitter.
[10] S. L. Ting, W. H. Ip, Albert H. C. Tsang, "Is Naïve Bayes a Good Classifier for Document Classification?". International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
[11] R. Malarvizhi, K. Saraswathi, "Content-Based Spam Filtering and Detection Algorithms - An Efficient Analysis & Comparison". International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, September 2013.
[12] N. S. Kumar, D. P. Rana, R. G. Mehta, "Detecting E-mail Spam Using Spam Word Associations". International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[13] Zi Chu, Indra Widjaja, Haining Wang, "Detecting Social Spam Campaigns on Twitter". Lecture Notes in Computer Science, Volume 7341, pp. 455-472, 2012.
[14] Chris Grier, Kurt Thomas, Vern Paxson, Michael Zhang, "@spam: The Underground on 140 Characters or Less". Proceedings of the 17th ACM Conference on Computer and Communications Security, ACM, New York, NY, USA, 2010.
[15] Bin Zhou and Jian Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks". Data Engineering, IEEE 24th International Conference, April 2008.
This paper may be cited as:
Gowri, C. D. and Mohanraj, V., 2014. Spam Detection in Twitter - A
Review. International Journal of Computer Science and Business
Informatics, Vol. 14, No. 1, pp. 92-102.

Vol 14 No 1 - July 2014

  • 1.
    ISSN: 1694-2507 (Print) ISSN:1694-2108 (Online) International Journal of Computer Science and Business Informatics (IJCSBI.ORG) VOL 14, NO 1 JULY 2014
  • 2.
    Table of ContentsVOL 14, NO 1 JULY 2014 Symmetric Image Encryption Algorithm Using 3D Rossler System........................................................1 Vishnu G. Kamat and Madhu Sharma Node Monitoring with Fellowship Model against Black Hole Attacks in MANET.................................... 14 Rutuja Shah, M.Tech (I.T.-Networking), Lakshmi Rani, M.Tech (I.T.-Networking) and S. Sumathy, AP [SG] Load Balancing using Peers in an E-Learning Environment ...................................................................... 22 Maria Dominic and Sagayaraj Francis E-Transparency and Information Sharing in the Public Sector ................................................................ 30 Edison Lubua (PhD) A Survey of Frequent Subgraphs and Subtree Mining Methods ............................................................. 39 Hamed Dinari and Hassan Naderi A Model for Implementation of IT Service Management in Zimbabwean State Universities ................ 58 Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou Present a Way to Find Frequent Tree Patterns using Inverted Index ..................................................... 66 Saeid Tajedi and Hasan Naderi An Approach for Customer Satisfaction: Evaluation and Validation ....................................................... 79 Amina El Kebbaj and A. Namir Spam Detection in Twitter – A Review...................................................................................................... 92 C. Divya Gowri and Professor V. Mohanraj IJCSBI.ORG
  • 3.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 1 Symmetric Image Encryption Algorithm Using 3D Rossler System Vishnu G. Kamat M Tech student in Information Security and Management Department of IT, DIT University Dehradun, India Madhu Sharma Assistant Professor Department of Computer Science, DIT University Dehradun, India ABSTRACT Recently a lot of research has been done in the field of image encryption using chaotic maps. In this paper, we propose a new symmetric block cipher algorithm using the 3D Rossler system. The algorithm utilizes the approach used by Mohamed Amin et al. [Commun. Nonlinear Sci. Numer. Simulat, (2010)] and Vinod Patidar et al. [Commun Nonlinear SciNumerSimulat, (2009)]. The merits of these algorithms such as the encryption structure and the diffusion scheme respectively are combined with an approach to split the key for the three dimensions to use for encryption of color (RGB) images. The experimentation results suggest an overall better performance of the algorithm. Keywords Image Encryption, Rossler System, Block Cipher, Security Analysis. 1. INTRODUCTION Image encryption is relatively different from text encryption. Image is made up of pixels and they are highly correlated; so different approaches are followed for encryption of images [1-12]. One of the approaches is known as chaotic cryptography. In this approach, for encryption we use chaotic maps, which generate good pseudo-random numbers. Cryptographic properties of these maps such as, sensitive dependence on initial parameters, ergodic and random like behavior, make them ideal for use in designing secure cryptographic algorithms. Many scholars have proposed various chaos-based encryption schemes in recent years [4-12]. A scheme proposed by Mohamed Amin et al. [11] uses Tent map as the chaotic map and the scheme is implemented for gray scale images. 
They proposed a new approach of using the plaintext as blocks of bits rather than block of pixels. Another scheme proposed by Vinod Patidaret al.[12] uses chaotic standard and logistic maps and they introduce a way of spreading the bits using diffusion to avoid redundancy. In this paper, we propose an algorithm which utilizes the merits of the mentioned schemes. The
  • 4.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 2 algorithm uses the Rossler system for the chaotic key generation. We demonstrate a way to split the 3 dimensions of the key for the 3 image channels i.e. Red, Green and Blue. The algorithm in [11] is used as a base structure and the diffusion concept from [12] is used to spread the effect of adding the key. The symmetric Feistel structure, diffusion method and key splitting of the encryption scheme provide better results. The rest of the paper is organized as follows: Section 2 provides a brief overview of the Rossler system. Section 3 provides the algorithmic details. The results of the security analysis are shown in section 4. Lastly, Section 5 concludes the paper. 2. BRIEF OVERVIEW OF 3D ROSSLER SYSTEM Rossler system is a system of non-linear differential equations which has chaotic properties [13]. Otto Rossler defined these equations in 1976. The equations are as given below Xn+1 = -Yn-Zn Yn+1 = Xn + αYn (1) Zn+1 = β + Zn (Xn-γ) where, α, β and γ are real parameters. Rossler system's behavior is dependent on the values of the parameters α, β and γ. For different values of these parameters the system displays considerable changes. It may be chaotic, converge toward a fixed point, follow a periodic orbit or escape towards infinity. The Rossler system displays chaotic behavior for the values of α=0.432, β=2 and γ=4. The chaotic behavior refers to the fact that keeping the parameters constant, even a slight change in the initial value would bring a significant change in the subsequent values. For example the value of Z0 = 0.3 generates the value of Z1 = 0.5. After changing the value of Z0to0.6it generates the value of Z1 = -1. The same chaotic rule applies for the changes of other two dimensions (X and Y). This chaotic behavior is known as deterministic chaos, i.e. 
the knowledge of initial values and parameter values can help in recreating the same chaotic pattern. Hence the initial conditions have to be shared between the entities using the system for encryption/decryption process. 3. PROPOSED ALGORITHM In this section we provide details of our algorithm. The algorithm is designed to work with color images (RGB). In this scheme the plaintext (image) is taken as blocks of bits. The block size is 8w, where ‘w’ is the word size which is 32 bits. Each block of data is divided and stored into 8 w-bit registers and operations are performed on them. The key length
  • 5.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 3 depends on the number of rounds ‘r’ i.e. Key length is 4r+8. The number of rounds can vary from 1-255. We have taken ‘r’ to be 12 for our experimentation. The flowchart shown in Fig. 1 displays the various steps performed on the image during the encryption process. The steps are explained in the following subsections. Figure 1. Flowchart of the Encryption Scheme 3.1 Padding The processing of the image is done on block of data. 256 bits ie.32 bytes of data are encrypted/decrypted at a time using eight 32-bit registers. The image size should be a multiple of 256 bits to ensure that there is always a full block size for encryption. Hence padding is added so as to make the input block of size 32 bytes when the image size in bytes is not an integral multiple of 32. A padding of all zeros (1-31 bytes) is appended to the end of each row to make the bytes in each row a multiple of 32.
  • 6.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 4 For example if the image is of dimensions 252 x 252 pixels, a 4 byte padding of zeros is appended at the end of each row. The last byte of the image then stores the number of bytes used as padding as a pixel value i.e. 4 in this case. This pixel value is used to remove the padding after decryption. After retrieving the number of bytes padded ‘n’, all rows are checked to determine if zeros exist in all the last ‘n’ bytes and in ‘n-1’ bytes of the last row. The padding is then removed to generate the original image. 3.2 Key Generation The key is generated by the 3D chaotic Rossler system as shown in (1). The number of key bytes ‘t’ depends on the number of rounds ‘r’ i.e. t=4r+8. We use the three equations separately. The random sequence generated by each equation of the map is used as a key separately during the encryption process of the red, green and blue channel of the image respectively. The key generation concept is as shown below. The steps repeat ‘t’ number of times to generate necessary key bytes. a. Iterate Rossler system of equations (1) ‘r’ times where ‘r’ is the number of rounds. b. Use the decimal part of the X, Y, Z values to generate the key byte. Xn = abs (Xn - integer part); // decimal part of x Yn = abs (Yn - integer part); // decimal part of y Zn = abs (Zn - integer part); // decimal part of z c. Key byte for each dimension (R,G,B) is taken as X, Y, Z values respectively by mapping it to a value between 0-255. d. For the next set of key bytes the number of iterations is changed to a value obtained by performing exclusive-or on the current set of key bytes. Iterations for next key byte = XOR (Xn, Yn, Zn); 3.3 Vertical and Horizontal Diffusion The diffusion process explained in [12] is used in the algorithm. The horizontal diffusion in our algorithm is used in a slightly different way i.e. 
it is performed separately on each channel after the encryption of the channel rather than using it on the entire image. The diffusion ensures spread of the key additions for the channel. The horizontal diffusion moves in the forward direction from the first pixel of a channel to the last. The second pixel is the exclusive or of first and second pixel of a channel, the third pixel is the
  • 7.
    International Journal ofComputer Science and Business Informatics IJCSBI.ORG ISSN: 1694-2108 | Vol. 14, No. 1. JUNE-JULY 2014 5 exclusive or of the new second pixel and the third pixel, and so on. Thus the first pixel of the channel remains unchanged. The Vertical Diffusion is performed before and after the entire encryption and horizontal diffusion is performed on the 3 channels of the image. In Vertical Diffusion the channels are treated collectively. The processing occurs from the last pixel of the image to the first pixel. It starts by performing XOR of the green and blue values of the last pixel of the image with the red value of the second last pixel to form the new red value of the second last pixel. The green value of the second last pixel is formed by performing XOR operation on the red and blue values of the last pixel. The blue value of the second last pixel is formed by XOR operation on the red and green values of the last pixel. This continues in the backward direction. Thus the last pixel remains unchanged. 3.4 Encryption/Decryption Scheme The encryption is performed on 256 bits (32 bytes) of data at a time using eight 32-bit registers. The algorithm is shown in Fig. 2. In the initial step four bytes of the key are added to alternate registers. 2’s compliment addition is performed. Then for ‘r’ rounds arithmetic operations are performed on the image data. It uses a function ‘f’, the output of which is used as the number of rotations to be performed on another block of data. After the swapping operation of the last round, the last four key bytes are added. The entire encryption structure is displayed in Fig. 3. For decryption the algorithm follows reverse of the encryption process. Figure 2.Encryption Algorithm for each Channel (R,G,B)
Figure 3. The Image Encryption Structure
4. EXPERIMENTATION RESULTS
We performed security analysis on six 256 x 256 color (RGB) images, shown in Fig. 4. The statistical and differential analysis tests performed display very favorable results, demonstrating the strength and security of the algorithm. Results given in [14] demonstrate how the vulnerability in [11] is overcome.

Figure 4. Plain images (clockwise from top left): Lena, Bridge, Lake, Plane, Peppers and Mandrill

4.1 Statistical Analysis
Statistical analysis is performed to determine the correlation between the plain image and the cipher image. For an encryption system to be strong, the cipher image should not be correlated with the plain image, and the cipher-image pixels should not be correlated among themselves. In this section we provide the histogram and correlation analysis.

4.1.1 Histogram Analysis
When the encrypted image and the plain image do not show a high degree of correlation, the encryption can be considered secure from information leakage. Histograms plot the number of pixels at each intensity level (0-255), displaying how the pixel values are distributed. Fig. 5 depicts the histograms of the red, green and blue channels of the plain image 'lena' on the left (top to bottom) and the corresponding histograms of the encrypted 'lena' image for the three channels on the right. They show that the encryption leaves no concentration at any single pixel value.
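The per-channel histogram counts described here take only a few lines to compute; this is a sketch, with `channel_histograms` being our own name.

```python
import numpy as np

def channel_histograms(img):
    """Count pixels at each intensity level 0-255, per RGB channel."""
    return {name: np.bincount(img[..., i].ravel(), minlength=256)
            for i, name in enumerate("RGB")}
```

A flat histogram for the cipher image, with all 256 bins roughly equally filled, is the visual signature of a secure encryption in this test.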
Figure 5. Left: histograms of the 'lena' plain image for the red, green and blue channels (top to bottom). Right: histograms of the encrypted 'lena' image for the red, green and blue channels (top to bottom).
4.1.2 Correlation of Adjacent Pixels
In a plain image the adjacent pixels show a high degree of correlation in the horizontal, vertical and diagonal directions. The encrypted image should have a very small degree of correlation among its adjacent pixels. We select 1000 random pairs of adjacent pixels from an image and compute the correlation coefficient as

$\mathrm{corr}_{xy} = \frac{C(x,y)}{\sqrt{D(x)}\,\sqrt{D(y)}}$   (2)

where

$C(x,y) = \frac{1}{N}\sum_{i=1}^{N}(x_i - E(x))(y_i - E(y))$   (3)

$D(x) = \frac{1}{N}\sum_{i=1}^{N}(x_i - E(x))^2$   (4)

$E(x) = \frac{1}{N}\sum_{i=1}^{N} x_i$   (5)

Here x_i and y_i form the i-th pair of adjacent pixels and N is the total number of pairs. Table 1 shows the correlation coefficient values of the six plain images (Fig. 4) between horizontally, vertically and diagonally adjacent pixels. It can be noted that the adjacent pixels are highly correlated.

Table 1. Correlation Values of Plain Images

Channel  Image     Horizontal  Vertical  Diagonal
RED      Lena      0.9558      0.9781    0.9336
         Bridge    0.8680      0.9070    0.8287
         Lake      0.9234      0.9201    0.8886
         Mandrill  0.8474      0.8032    0.7944
         Peppers   0.9371      0.9392    0.9077
         Plane     0.9205      0.9092    0.8546
GREEN    Lena      0.9401      0.9695    0.9180
         Bridge    0.9055      0.9131    0.8700
         Lake      0.9354      0.9272    0.8943
         Mandrill  0.7285      0.6674    0.6487
         Peppers   0.9657      0.9673    0.9451
         Plane     0.8938      0.9174    0.8419
BLUE     Lena      0.9189      0.9495    0.8948
         Bridge    0.9354      0.9411    0.9138
         Lake      0.9377      0.9401    0.9099
         Mandrill  0.8030      0.7914    0.7625
         Peppers   0.9259      0.9330    0.8928
         Plane     0.9179      0.8912    0.8563
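The adjacent-pixel test of equations (2)-(5) can be sketched as below. This is a minimal sketch: the function and parameter names are ours, and uniform random sampling of the pair positions is an assumption (the paper only says 1000 random pairs are selected).

```python
import numpy as np

def adjacent_corr(channel, direction="horizontal", n_pairs=1000, seed=0):
    """Correlation coefficient of n_pairs randomly chosen adjacent pixel
    pairs in one channel, following equations (2)-(5)."""
    rng = np.random.default_rng(seed)
    h, w = channel.shape
    dr, dc = {"horizontal": (0, 1), "vertical": (1, 0), "diagonal": (1, 1)}[direction]
    rows = rng.integers(0, h - dr, n_pairs)   # keep the neighbour in bounds
    cols = rng.integers(0, w - dc, n_pairs)
    x = channel[rows, cols].astype(float)
    y = channel[rows + dr, cols + dc].astype(float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))     # C(x, y)
    return cov / np.sqrt(x.var() * y.var())            # corr_xy
```

On a natural image this returns values near 1 (as in Table 1); on a well-encrypted image it should return values near 0 (as in Table 2).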
Table 2 shows the correlation coefficient values for the red, green and blue channels of the cipher images formed by encrypting the plain images with the proposed encryption algorithm. The cipher images bear very little resemblance to the original images, and their adjacent pixels in the horizontal, vertical and diagonal directions are correlated only to a very small degree.

Table 2. Correlation Values of Cipher Images

Channel  Image     Horizontal  Vertical  Diagonal
RED      Lena      -0.0014     -0.0012   0.0004
         Bridge    -0.0040     -0.0066   -0.0010
         Lake      -0.0052     -0.0011   0.0018
         Mandrill  0.0034      0.0001    0.0033
         Peppers   -0.0014     -0.0034   -0.0016
         Plane     -0.0024     -0.0043   0.0088
GREEN    Lena      0.0004      0.0067    -0.0026
         Bridge    -0.0053     -0.0017   0.0008
         Lake      0.0044      -0.0025   0.0068
         Mandrill  -0.0031     -0.0041   0.0029
         Peppers   0.0008      0.0027    0.0029
         Plane     0.0026      -0.0003   0.0014
BLUE     Lena      -0.0049     0.0014    -0.0005
         Bridge    0.0023      0.0001    0.0037
         Lake      -0.0010     -0.0044   0.0002
         Mandrill  0.0023      0.0001    -0.0014
         Peppers   -0.0016     -0.0006   0.0013
         Plane     0.0040      -0.0007   0.0041

4.1.3 Correlation between Plain and Cipher Image
The previous section considered correlation between adjacent pixels within a plain or cipher image. It is also necessary that there be no relevant correlation between the plain image and its corresponding cipher image. Rather than using pixel pairs within a single image, we pair the pixels of the plain and cipher images at the same grid position. The 2D correlation coefficients are calculated by pairing the three channels of the plain image with the three channels of the cipher image. These form nine different pairs, i.e. the correlation between: the red channel of the plain image and the red channel of the cipher image, the red channel of the plain image and the green channel of the cipher image, the red channel of the plain image and the blue
channel of the cipher image; and so on for the green and blue channels of the plain image. These are denoted CRR, CRG, CRB, CGR, CGG, CGB, CBR, CBG and CBB, where for any Cij, i is a channel (R, G, B) of the plain image and j is a channel (R, G, B) of the cipher image. The coefficient values given in Table 3 show that there is little or practically no correlation between the plain image and its corresponding cipher image; the cipher image thus displays the characteristics of a random image.

Table 3. Correlation Values between Plain Image and Cipher Image

Image     CRR      CRG      CRB      CGR      CGG      CGB      CBR      CBG      CBB
Lena      -0.0033  0.0016   0.0047   -0.0026  -0.0008  0.0006   -0.0029  0.0003   -0.0021
Bridge    -0.0029  0.0005   0.0003   -0.0020  -0.0006  0.0011   0.0008   0.0007   0.0010
Lake      -0.0012  0.0002   0.0005   -0.0041  -0.0007  0.0033   -0.0050  -0.0021  0.0039
Mandrill  -0.0019  -0.0004  -0.0024  -0.0035  0.0011   -0.0036  -0.0034  0.0005   -0.0036
Peppers   -0.0030  -0.0059  -0.0022  -0.0033  -0.0024  -0.0012  -0.0042  -0.0007  0.0005
Plane     0.0072   0.0014   -0.0003  0.0068   0.0025   0.0015   0.0057   0.0033   0.0033

4.2 Differential Analysis
Differential analysis measures the amount of change the encryption introduces into the image. The cipher images of two very similar plain images should not have a similar distribution of pixels; in other words, the cipher images of two plain images differing in just a single pixel should bear no pixel resemblance to each other. An adversary should not be able to extract any meaningful relationship between plaintext and ciphertext by comparing the two ciphertexts of similar plaintexts. NPCR (net pixel change rate) and UACI (unified average changing intensity) are used as measures of differential analysis. NPCR indicates the percentage of pixels that change in the cipher image when a single pixel of the plain image is changed.
UACI measures the average intensity of the change between the two cipher images. Consider two cipher images X1 and X2, obtained from plain images P1 and P2 that differ in a single pixel. The pixel values at the grid position of the i-th row and j-th column of the cipher images are denoted X1(i,j) and X2(i,j). A bipolar array B is defined as

$B(i,j) = \begin{cases} 0, & \text{if } X_1(i,j) = X_2(i,j) \\ 1, & \text{if } X_1(i,j) \neq X_2(i,j) \end{cases}$   (6)
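The bipolar array above, together with the NPCR and UACI measures built from it in equations (7) and (8), can be sketched as follows; the function name is ours, and this is a minimal numpy sketch rather than the authors' code.

```python
import numpy as np

def npcr_uaci(x1, x2, t=255):
    """NPCR and UACI (in percent) between two same-shape cipher images."""
    x1 = x1.astype(np.int16)
    x2 = x2.astype(np.int16)
    b = (x1 != x2)                           # bipolar array B(i, j)
    npcr = b.mean() * 100.0                  # fraction of differing pixels
    uaci = (np.abs(x1 - x2) / t).mean() * 100.0
    return npcr, uaci
```

For 8-bit images, NPCR near 99.6% and UACI near 33.46% (the expected values for two independent random images) indicate good resistance to differential attacks, which matches Table 4.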
Values for NPCR and UACI are calculated as given in equations (7) and (8), where W and H denote the width and height of the cipher images, T denotes the largest supported pixel value in the cipher images (255 in our case) and abs() computes the absolute value. The NPCR and UACI values given in Table 4 show that the encryption algorithm is secure against differential attacks.

$\mathrm{NPCR} = \frac{\sum_{i,j} B(i,j)}{W \times H} \times 100\%$   (7)

$\mathrm{UACI} = \frac{1}{W \times H} \sum_{i,j} \frac{\mathrm{abs}(X_1(i,j) - X_2(i,j))}{T} \times 100\%$   (8)

Table 4. NPCR and UACI Values Obtained for Encryption of the 6 Plain Images and the Same Images with 1 Pixel Changed

Image     NPCR     UACI
Lena      99.6333  33.4706
Bridge    99.5722  33.4403
Lake      99.5900  33.5313
Mandrill  99.6089  33.4595
Peppers   99.6185  33.4657
Plane     99.6206  33.4539

5. CONCLUSION
In this paper we proposed a new image encryption algorithm. The merits of recent research, judged by their results, were combined with a symmetric encryption approach to produce a secure algorithm. The diffusion mechanism, together with the Feistel structure, strengthens the algorithm. The 3D Rossler system of equations is used for random key generation, and splitting the three dimensions of the key across the three channels makes cryptanalysis to recover the key more difficult. The experiments performed show that the algorithm generates favorable results.

REFERENCES
[1] Chang, C.-C., Hwang, M.-S. and Chen, T.-S., 2001. A New Encryption Algorithm for Image Cryptosystems. Journal of Systems and Software, Vol. 58, No. 2, pp. 83-91.
[2] Yano, K. and Tanaka, K., 2002. Image Encryption Scheme Based on a Truncated Baker Transformation. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, Vol. E85-A, No. 9, pp. 2025-2035.
[3] Gao, T. and Chen, Z., 2008. Image Encryption Based on a New Total Shuffling Algorithm. Chaos, Solitons and Fractals, Vol. 38, No. 1, pp. 213-220.
[4] Chen, G., Mao, Y. and Chui, C.K., 2004. A Symmetric Image Encryption Based on 3D Chaotic Cat Maps. Chaos, Solitons and Fractals, Vol. 21, pp. 749-761.
[5] Mao, Y., Chen, G. and Lian, S., 2004. A Novel Fast Image Encryption Scheme Based on 3D Chaotic Baker Maps. International Journal of Bifurcation and Chaos, Vol. 14, No. 10, pp. 3613-3624.
[6] Guan, Z.-H., Huang, F. and Guan, W., 2005. Chaos-Based Image Encryption Algorithm. Physics Letters A, Vol. 346, pp. 153-157.
[7] Zhang, L., Liao, X. and Wang, X., 2005. An Image Encryption Approach Based on Chaotic Maps. Chaos, Solitons and Fractals, Vol. 24, pp. 759-765.
[8] Gao, H., Zhang, Y., Liang, S. and Li, D., 2006. A New Chaotic Algorithm for Image Encryption. Chaos, Solitons and Fractals, Vol. 29, pp. 393-399.
[9] Pareek, N.K., Patidar, V. and Sud, K.K., 2006. Image Encryption Using Chaotic Logistic Map. Image and Vision Computing, Vol. 24, pp. 926-934.
[10] Wong, K.-W., Kwok, B.S.-H. and Law, W.-S., 2008. A Fast Image Encryption Scheme Based on Chaotic Standard Map. Physics Letters A, Vol. 372, pp. 2645-2652.
[11] Amin, M., Faragallah, O.S. and Abd El-Latif, A.A., 2010. A Chaotic Block Cipher Algorithm for Image Cryptosystems. Communications in Nonlinear Science and Numerical Simulation, Vol. 15, pp. 3484-3497.
[12] Patidar, V., Pareek, N.K. and Sud, K.K., 2009. A New Substitution-Diffusion Based Image Cipher Using Chaotic Standard and Logistic Maps. Communications in Nonlinear Science and Numerical Simulation, Vol. 14, pp. 3056-3075.
[13] Rossler, O.E., 1976. An Equation for Continuous Chaos. Physics Letters A, Vol. 57, No. 5, pp. 397-398.
[14] Kamat, V.G. and Sharma, M., 2014. Enhanced Chaotic Block Cipher Algorithm for Image Cryptosystems. International Journal of Computer Science Engineering, Vol. 3, No. 2, pp. 117-124.

This paper may be cited as: Kamat V. G. and Sharma M., 2014.
Symmetric Image Encryption Algorithm Using 3D Rossler System. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 1-13.
Node Monitoring with Fellowship Model against Black Hole Attacks in MANET

Rutuja Shah, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University

Lakshmi Rani, M.Tech (I.T.-Networking)
School of Information Technology & Engineering, VIT University

S. Sumathy, AP [SG]
School of Information Technology & Engineering, VIT University

Abstract
Security issues have increased considerably in mobile ad-hoc networks. In the absence of any centralized controller, detecting problems and recovering from them is difficult. Packet drop attacks are among the attacks that degrade network performance. In this paper, we propose an effective node monitoring mechanism with a fellowship model against packet drop attacks, by setting up an observance zone in which suspected nodes are observed for their performance and behavior. Threshold limits are set to monitor the equivalence ratio of the number of packets received at a node to the number transmitted by it inside the mobile ad-hoc network. The fellowship model obliges nodes to deliver essential services in order to receive services from neighboring nodes, thus improving overall network performance.

Keywords: Black-hole attack, equivalence ratio, fair-chance scheme, observance zone, fellowship model.

1. INTRODUCTION
Mobile ad-hoc networks are infrastructure-less, self-organized networks of mobile devices connected by radio signals. There is no centralized controller for networking activities such as monitoring, modification and updating of the nodes inside the network, as shown in Figure 1. Each node is free to move in any direction and hence may change its links to other nodes frequently. There have been serious security threats in MANETs in recent years.
These usually lead to performance degradation, lower throughput, congestion, delayed response times, buffer overflow, etc. Among them is a well-known attack on packets called the black-hole attack, which is a form of DoS (denial of service) attack. In it, a router relays packets to different nodes, but due to the presence of malicious nodes
these packets are susceptible to packet drop attacks, which hinder secure and reliable communication inside the network.

Figure 1. MANET Scenario

Section 2 addresses the seriousness of packet drop attacks and related work done so far in this area. Section 3 elaborates our proposed defence scheme for packet drop attacks. Section 4 provides concluding remarks.

2. LITERATURE SURVEY
Packet drop loss in ad-hoc networks gained importance because of self-serving nodes that fail to provide the basic service of forwarding packets to neighboring nodes, which seriously hampers the functioning of the network. Generally there are two types of misbehaving nodes: selfish and malicious. Selfish nodes act only to enhance their own performance, while malicious nodes continually act to degrade the functioning of the network. WATCHERS [1], from UC Davis, was presented to detect and remove routers that maliciously drop or misroute packets. WATCHERS was based on the "principle of packet flow conservation", but it could not differentiate well between malicious and genuine nodes; although robust against Byzantine faults, it is not very effective at reducing packet loss in today's Internet. The basic mechanism of packet drop loss is that nodes, selfishly or maliciously, do not forward packets to other nodes. Packet drop loss can occur due to a black hole attack. Sometimes routers behave maliciously, i.e. they do not forward packets; such attacks are known as grey hole attacks. In the case of routers the attacks can be traced quickly, while in the case of nodes it is a cumbersome task. Many researchers have worked in this field and have tried to find
solutions to this attack [2-6]. Energy level is one of the parameters on which researchers have based their results. The idea works on the ratio of the fraction of energy committed by a node to the overall energy contributed to the network. A node is retained inside the network on the basis of its energy level, which is decided by the activeness of the node in the network through mathematical computations. These computations [7] are too complicated to grasp, and sometimes the results are catastrophic; the computations may be accurate, but they are very prone to ambiguity in ad-hoc networks. Some techniques use routing-table information, which is modified after detecting the MAC address of a malicious node that uses a jamming-style DoS attack, in order to cease its activities [8]. Another approach to reducing attacks uses a historical-evidence trust management strategy [9]: a direct trust value (DTV) is used among neighboring nodes to monitor node behavior, based on their past, against black hole attacks. However, there is a high possibility that trust values get compromised by malicious nodes, and the third party used for setting trust values is also vulnerable to attacks. Recent methods include the introduction of a new protocol called RAEED (Robust formally Analyzed protocol for wirEless sEnsor networks Deployment) [10], which reduces this attack, but not by a considerable percentage. To overcome the issues faced in implementing these strategies, an effective mechanism is needed to curb these attacks and make the network more secure.

3. PROPOSED APPROACH
In this paper, we put forth a mechanism to reduce packet-drop attacks by implementing a "node monitoring with fellowship" technique.
We introduce an obligation on the nodes inside a particular network to render services to the network. If services are not rendered, the node is expelled from the network. However, we provide a "fair-chance" scheme for all nodes, which helps determine whether a node is genuine or malicious.

3.1 Fellowship of the Network
The prime parameter we use to address packet drop attacks is the "equivalence ratio": the count of incoming packets at a node, excluding those destined for that node, should equal the count of outgoing packets, excluding those originating at that node. If the counts are equal, packets are being uniformly distributed and forwarded among the nodes inside the network. If they are not, the node concerned is placed under an "observance zone" so that its suspicious behavior can be monitored. We suggest that all nodes periodically report their equivalence ratio to their neighboring nodes; this helps decide, through polling amongst the nodes, whether to keep a particular node in the observance zone. Inside the observance
zone, the suspected node is given "fair-chance" treatment. That is, during the observance-zone period, the suspected node is required to submit a "status message" to its neighboring nodes to prove the genuineness of its performance inside the network. Genuine nodes will promptly provide their status messages, because they are willing to stay inside the network and render services under their obligation to it. Malicious nodes, however, may or may not reply with status messages, since their aim is to degrade network performance. Only a fair chance is given for such status messages: a standard threshold level is set up unanimously among the neighboring nodes, and status messages are entertained only up to that threshold. So even if malicious nodes fake their own status messages in order to stay inside the network, the threshold limits ensure they cannot degrade network performance much. When the threshold is crossed, the neighboring nodes are informed about the node under the observance zone, and a unanimous decision is taken to expel that suspected node from the network. Under this scheme a suspected node may be expelled in two circumstances: it is either a genuine node that is underperforming, or a malicious node. In both cases the suspected node needs to be expelled, because it is degrading the performance of the network. The "fair-chance" scheme ensures that genuine nodes get a fair chance to justify themselves and to repair themselves quickly, proving their genuineness and willingness to render services to the network under obligation.

3.2 Scenario Assumptions
Let the nodes inside the MANET be connected to each other through wireless links.
Let packets be transmitted and received among the nodes. Let the nodes be named alphabetically A, B, C, ... through Z. Let node X be a malicious node that drops packets (mounting a black hole attack) and hence has a poor equivalence ratio, while node Y is a genuine node that has a poor equivalence ratio due to network congestion or some other network issue. All nodes inside the network follow the principle of "node monitoring with fellowship". The data structures used are the following networking parameters:
1) equi_ratio: the equivalence ratio of a node.
2) observance_zone: the list of suspected nodes inside the observance zone.
3) threshold_value: the threshold value decided by the nodes inside the MANET.
4) status_message: the status messages exchanged among neighboring nodes.
Steps involved:
Step 1: All nodes calculate their own equivalence ratio (equi_ratio) and share it with their neighboring nodes (say, those at one-hop distance) periodically.
Step 2: All nodes unanimously agree upon a standard threshold level (in this case, threshold_value = 3) through exchange of messages using agreement protocols.
Step 3: All nodes monitor their neighbors' equi_ratio; if any node has a notably poor equi_ratio, that node is placed on the "observance zone" list through mutual exchange of messages among the nodes in the network. Such nodes may be malicious, or genuine nodes with poor performance.
Step 4: Once a suspected node is on the observance-zone list, it must report a status_message to the neighboring nodes to justify its performance and behavior.
Step 5: A malicious node (node X) may either fake its status_message to feign genuineness and stay inside the network, or simply avoid sending a status_message, since it wishes to continue its malicious activities. A genuine node (node Y) will send its status_message to prove its genuineness and will try to improve its performance by repairing the network issues it faces while sending packets. In both cases, the fair-chance scheme limits how often a node may justify itself through status_message: a suspected node may send a status_message only up to threshold_value times (here, 3). In short, both malicious nodes and underperforming genuine nodes are kept under surveillance to observe their behavior.
Step 6: Nodes that cross the threshold_value limit are immediately expelled from the network through the exchange of messages between the neighboring nodes under the agreement protocols. In this way, packet-drop attacks can be considerably reduced. Figure 2 explains the workflow mechanism.
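The steps above can be sketched as a single monitoring round. This is a hedged sketch under stated assumptions: only equi_ratio, observance_zone, threshold_value and status_message come from the paper; the Node fields, the cut-off for an "acceptable" equi_ratio and the exact update rules are our own illustrative choices.

```python
from dataclasses import dataclass

THRESHOLD_VALUE = 3    # step 2: unanimously agreed threshold_value
MIN_EQUI_RATIO = 0.9   # assumed cut-off for an "acceptable" equi_ratio

@dataclass
class Node:
    name: str
    forwarded: int = 0        # relay packets sent onward (not originated here)
    received: int = 0         # relay packets received (not destined here)
    status_messages: int = 0  # fair-chance justifications used so far

    @property
    def equi_ratio(self):
        # step 1: ratio of forwarded to received relay traffic
        return self.forwarded / self.received if self.received else 1.0

def monitor(nodes):
    """One monitoring round over a node's neighbours (steps 3-6)."""
    observance_zone = [n for n in nodes if n.equi_ratio < MIN_EQUI_RATIO]
    expelled = []
    for node in observance_zone:      # step 4: demand a status_message
        node.status_messages += 1
        if node.status_messages > THRESHOLD_VALUE:
            expelled.append(node)     # step 6: expel once the threshold is crossed
    return observance_zone, expelled
```

In a real MANET each node would run this logic locally and the observance-zone and expulsion decisions would be reached by the agreement protocols the paper mentions, not by a single centralized call.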
Figure 2. Flowchart of the proposed mechanism: set the threshold_value unanimously and exchange equi_ratio with neighboring nodes periodically; check whether the equi_ratio is acceptable; if unacceptable, place the suspected node under observance_zone and exchange status_message; while the count stays at or below threshold_value, continue normal network activities; once above threshold_value, the suspected node is expelled from the network.
3.3 Advantages
1. The fair-chance scheme protects innocent nodes by letting them prove their genuineness.
2. No complex mathematical computation of energy levels at each node.
3. Periodic reporting ensures removal of both underperforming and malicious nodes from the network.
4. Improved network performance in the MANET.

3.4 Disadvantages
There is an overhead of exchanging a larger number of messages among the neighboring nodes. Optimizing the number of messages exchanged during communication can be addressed in future research.

4. CONCLUSION
In this paper, we have proposed a novel scheme to reduce packet drop attacks and enhance network performance. We anticipate that our "node monitoring with fellowship" model may increase the number of messages exchanged among neighboring nodes during the agreement protocols, but at the same time it is robust against attacks and thus increases the availability of nodes in mobile ad-hoc networks. Minimizing packet drop loss yields better utilization of the channel and other resources, guaranteed QoS, productive priority management, and considerably better-controlled traffic through periodic surveillance of nodes. Future research will aim to reduce the message exchange among the nodes, minimize the overhead and achieve optimization inside mobile ad-hoc networks.

5. REFERENCES
[1] K. A. Bradley, S. Cheung, N. Puketza, B. Mukherjee and R. A. Olsson, Detecting Disruptive Routers: A Distributed Network Monitoring Approach, in the 1998 IEEE Symposium on Security and Privacy, May 1998.
[2] Y. C. Hu, A. Perrig and D. B. Johnson, Ariadne: A Secure On-demand Routing Protocol for Ad Hoc Networks, presented at the International Conference on Mobile Computing and Networking, Atlanta, Georgia, USA, pp. 12-23, 2002.
[3] P. Papadimitratos and Z. J.
Haas, Secure Routing for Mobile Ad hoc Networks, presented at the SCS Communication Networks and Distributed Systems Modeling and Simulation Conference, San Antonio, TX, January 2002.
[4] K. Sanzgiri, B. Dahill, B. N. Levine, C. Shields and E. M. Belding-Royer, A Secure Routing Protocol for Ad Hoc Networks, presented at the 10th IEEE International Conference on Network Protocols (ICNP'02), Paris, pp. 78-89, 2002.
[5] V. Balakrishnan and V. Varadharajan, Designing Secure Wireless Mobile Ad hoc Networks, presented at the Proceedings of the 19th IEEE International Conference on Advanced Information Networking and Applications (AINA 2005), Taiwan, pp. 5-8, March 2005.
[6] V. Balakrishnan and V. Varadharajan, Packet Drop Attack: A Serious Threat to Operational Mobile Ad hoc Networks, presented at the Proceedings of the International Conference on Networks and Communication Systems (NCS 2005), Krabi, pp. 89-95, April 2005.
[7] V. Balakrishnan and V. Varadharajan, Short Paper: Fellowship in Mobile Ad hoc Networks, presented at the Proceedings of the First International Conference on Security and Privacy for Emerging Areas in Communications Networks (SECURECOMM'05), IEEE.
[8] M. Raza and S. I. Hyder, A Forced Routing Information Modification Model for Preventing Black Hole Attacks in Wireless Ad Hoc Networks, presented at the 9th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, pp. 418-422, January 2012.
[9] Bo Yang, R. Yamamoto and Y. Tanaka, Historical Evidence Based Trust Management Strategy against Black Hole Attacks in MANET, published in the 14th International Conference on Advanced Communication Technology (ICACT), 2012, pp. 394-399.
[10] K. Saghar, D. Kendall and A. Bouridane, Application of Formal Modeling to Detect Black Hole Attacks in Wireless Sensor Network Routing Protocols, presented at the 11th International Bhurban Conference on Applied Sciences and Technology (IBCAST), Islamabad, pp. 191-194, January 2014.

This paper may be cited as: Shah, R., Rani, L. and Sumathy, S. 2014. Node Monitoring with Fellowship Model against Black Hole Attacks in MANET. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 14-21.
Load Balancing using Peers in an E-Learning Environment

Maria Dominic
Department of Computer Science, Sacred Heart College, India

Sagayaraj Francis
Department of Computer Science and Engineering, Pondicherry Engineering College, India

ABSTRACT
When an e-learning system is installed on a server, numerous learners make use of it and download various learning objects from the server. Most of the time the same learning object is requested and downloaded, so the server repeatedly performs the same task of locating the file and sending it to the requesting client. This wastes the server's precious CPU time on a task it has already performed. This paper provides a novel structure and an algorithm that store, in a dynamic hash table, the details of the clients that have already downloaded each learning object; when a new request comes in, the table is consulted and the learning object is sent from such a client to the requestor, saving the server's CPU time by harnessing the computing power of the clients.

Keywords
Learning Objects, e-Learning, Load Distribution, Load Balancing, Data Structure, Peer-Peer Distribution.

1. INTRODUCTION
1.1 e-Learning
Education is defined as the conscious attempt to promote learning in others, to acquire knowledge, skills and character [1]. Different pedagogies were used to achieve this mission; later, with the advent of new information and communication technology tools and the popularity gained by the Internet, they were used to enhance the teaching-learning process, giving birth to e-learning [2]. This enabled learners to learn across time and geographical barriers and allowed them to follow individualized learning paths [3].
E-learning, or electronic learning, is commonly perceived as a combination of the Internet, electronic content and networks to disseminate knowledge. The key factors of e-learning are reuse, resource sharing and interoperability [4]. At present there are various organizations
providing e-learning tools of multiple functionalities, one of which is MOODLE (Modular Object-Oriented Dynamic Learning Environment) [5], used on our campus. This in turn created difficulty in sharing learning objects between heterogeneous sites, and standards such as SCORM & SCORM LOM [6], IMS & IMS DRI [7], AICC [8] and the like were proposed by different organizations. In Berners-Lee's famous architecture for the Semantic Web, ontologies are used for sharing and interoperability, and they can be used to build better e-learning systems [9]. To define components for e-learning systems, the methodology used is the principle of composability in Service-Oriented Architecture [10], since it enables us to define the inter-relations between the different e-learning components. The most popular model used nowadays in the teaching-learning process is the Felder-Silverman learning style model [11]. The e-learning components are based on key topics, topic types, associations and occurrences. A VLE (Virtual Learning Environment) is the software that handles all learning activities. Learning objects are the learning materials that consciously attempt to promote visual, verbal, logical and musical intelligence [12] through presentations, tutorials, problem solving and projects. Multimedia, gaming and simulation promote kinaesthetic intelligence, while interpersonal, intrapersonal and naturalistic intelligence are promoted by means of chat, SMS, e-mail, forums, video and audio conferencing, surveys, voting and search. Finally, assessment is used to test the knowledge acquired by the learner, and the repository is the place that holds all the learning materials. This algorithm is useful when learners access the learning objects stored in the repository.
It reduces the load on the server by directing a client that has already downloaded a file from the server to respond to the requestor with that file.

1.2 Load Balancing
The emergence of large, fast networks with thousands of connected computers posed the challenge of sharing resources effectively among the computers in the network. Load balancing is a critical issue in peer-to-peer (P2P) networks [14]. The existing load balancing algorithms for heterogeneous P2P networks are organized in a hierarchical fashion. As P2P systems gained popularity, it became mandatory to manage huge volumes of data while keeping response times acceptable to users. Requests for the same data from multiple clients at the same instant may cause some of the peers to become bottlenecks, creating severe load imbalance and degrading the response time experienced by users. To reduce these bottlenecks and the overhead on the server, there was a need to harness the computing power of the peers [15]. Much work has been done on harnessing
the computing power of the computers in the network for high-performance computing and scientific applications; faster access to data and reduced computing time are still to be explored. In a P2P network the data is de-clustered across the peers in the network. When a popular piece of data is requested from across the peers, a bottleneck occurs and system response degrades. To handle this, a new strategy using a new data structure and algorithm is proposed in this paper.

2. PROPOSED DATA STRUCTURE AND THE ALGORITHM
The objective of this architecture is to harness the computational power of the clients in the network. The architecture is described with respect to the clients in our e-learning network, which comprises Master of Computer Applications students accessing learning materials for their course. The degree programme lasts three years, so the clients are categorized into three clusters, namely I MCA, II MCA and III MCA, which we call class clusters. Every class cluster contains many clusters inside it, which we call file clusters: one cluster for each type of file, since learning objects can be made up of presentations, video, audio, pictures, animation, etc. [13]. An address table, named the file address table, holds the address of each file cluster in the class cluster. When a request for a file is received, the corresponding cluster is identified by reading its address from the address table. The following algorithm represents the working logic of the concept, and the data structure is represented in Figure 1. Every file cluster holds a Dynamic Hash Table (DHT), a linked list and a binary tree. The dynamic hash table holds the address of the linked list, which holds the file names that have already been downloaded from the server.
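As an illustrative sketch only (the authors' implementation is in PHP, and the class and method names below are our own), the file-cluster structures just described can be modelled as follows. The DHT is modelled as a dictionary whose chains resolve collisions, each chain entry stands in for a linked-list node, and the binary tree of active clients is approximated by a minimum search over (IP, CPU usage time) records; the hash follows the scheme defined in the next subsection.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ClientNode:
    """Stands in for a binary-tree node: an active client's IP and CPU usage time."""
    ip: str
    cpu_usage_time: float  # lower means less loaded

@dataclass
class FileEntry:
    """Stands in for a linked-list node: a downloaded file and the clients holding it."""
    filename: str
    clients: List[ClientNode] = field(default_factory=list)

class FileCluster:
    """One file cluster: a DHT index mapped to a chain of FileEntry nodes."""

    def __init__(self) -> None:
        self.dht: Dict[int, List[FileEntry]] = {}  # chaining resolves index collisions

    @staticmethod
    def _hashed(filename: str) -> int:
        # Paper's hash: per character of the stem, concatenate the alphabet
        # position and the file-name position, sum the digits, divide by length.
        stem = filename.split(".")[0].lower()
        digits = "".join(f"{ord(c) - 96}{i}" for i, c in enumerate(stem, 1))
        return sum(int(d) for d in digits) // len(stem)

    def record_download(self, filename: str, ip: str, cpu_usage_time: float) -> None:
        """Register that a client has downloaded the file from the server."""
        chain = self.dht.setdefault(self._hashed(filename), [])
        for entry in chain:
            if entry.filename == filename:
                break
        else:  # collision or first download: new node appended to the chain
            entry = FileEntry(filename)
            chain.append(entry)
        entry.clients.append(ClientNode(ip, cpu_usage_time))

    def least_used_client(self, filename: str) -> Optional[str]:
        """IP of the least-loaded client holding the file, or None (serve from server)."""
        for entry in self.dht.get(self._hashed(filename), []):
            if entry.filename == filename:
                return min(entry.clients, key=lambda c: c.cpu_usage_time).ip
        return None
```

For example, if abc.ppt has been recorded for clients 10.0.0.1 (CPU usage time 40.0) and 10.0.0.2 (CPU usage time 12.5), `least_used_client("abc.ppt")` returns "10.0.0.2"; a file no client holds yields None, signalling the fallback to the server.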
The hashing function used to identify an index in the DHT is as follows:
1. Represent every character in the file name by its position in the alphabet followed by its position in the file name. E.g. the file name abc.ppt gives 112233: the value for a is 11, since its position in the alphabet is 1 and its position in the file name is 1.
2. Sum all the digits produced by step 1. E.g. 112233 gives 1+1+2+2+3+3 = 12.
3. Divide the sum by the length of the file name: 12/3 = 4, which becomes the index for the file in the DHT.
The above three steps are formulated mathematically in equation (1).
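The three steps can be sketched in Python as follows. This is a hedged reading of the scheme: we assume the extension is excluded, since the worked example maps abc.ppt to 112233 rather than hashing the full name, and that the division in step 3 is integer division.

```python
def hashed(filename: str) -> int:
    """Compute the DHT index for a file name, following the paper's three steps."""
    # Assumption: only the stem before the extension is hashed (abc.ppt -> "abc").
    stem = filename.split(".")[0].lower()
    concatenated = ""
    for position, ch in enumerate(stem, start=1):
        alphabet_pos = ord(ch) - ord("a") + 1        # step 1: a=1, b=2, ...
        concatenated += f"{alphabet_pos}{position}"  # 'a' at position 1 -> "11"
    digit_sum = sum(int(d) for d in concatenated)    # step 2: 1+1+2+2+3+3 = 12
    return digit_sum // len(stem)                    # step 3: 12 // 3 = 4
```

Here `hashed("abc.ppt")` returns 4; note that `hashed("bac.ppt")` also returns 4, illustrating the collision between abc.ppt and bac.ppt that Figure 1 resolves with the linked list.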
Figure 1. Proposed Data Structure. The address table holds the addresses of the file clusters (e.g. the clusters of presentational, audio and video files); each file cluster contains a dynamic hashing table whose entries point to a linked list of downloaded file names (e.g. abc.ppt, bac.ppt, NULL), and each list node points to a binary tree of client records (IP, CPU usage time).
Every index of the DHT holds the starting address of a linked list in which every node stores a file name that has already been downloaded. The linked list is used to avoid index collisions between file names generating the same index in the DHT: the collision is resolved by creating a new node in the linked list for the new file name. As shown in Figure 1, every node in the linked list holds three values, namely the file name, the address of a binary tree and the address of the next node in the list. The nodes of the binary tree hold the active clients' IPs and their current CPU processing status. The binary tree is used to identify the client whose CPU is least used, which then transfers the file to the requestor; this harnesses the computing power of the least-used CPU, and the tree structure reduces the search time for the least-used client. If the file has not been downloaded by any client, i.e. when the last node of the linked list is reached, then the file is transferred from the server.

Algorithm 1
SEND ( )
{
    Direct the request to the file cluster
    Take the address of the file cluster from the address table
    Index = HASHED (File Name)
    If (the index is out of bounds)
    {
        // the file has not been downloaded by any client
        Send it from the server to the client
    }
    Else
    {
        While (not end of linked list AND node not found)
        {
            If (Node.data == File Name)
            {
                Node found = true
                IP = LEASTUSEDCPU (Node's binary tree)
            }
        }
        If (Node found == true)
            Send the requested file from IP to the requestor
        Else
        {
            // the file has not been downloaded by any client
            Send it from the server to the client
        }
    }
}
End of SEND ( )

Algorithm 2
int LEASTUSEDCPU ( )
{
    LeastUsedCPU = IP of the first node
    While (not end of binary tree)
    {
        Compare the CPU usage time of the current node with that of LeastUsedCPU
        If (the CPU usage time of the current node is lower)
            LeastUsedCPU = IP of the current node
    }
    return (LeastUsedCPU)
}
End of LEASTUSEDCPU ( )

Algorithm 3
int HASHED (String FileName)
{
    Len = StringLength (FileName)
    While (not end of string)
    {
        DigitString += Concatenate (position of the character in the alphabet, position of the character in the file name)
    }
    IndexInt = SumOfDigits (DigitString)
    return (IndexInt / Len)
}
End of HASHED ( )

3. MATHEMATICAL FORMULATION
The problem dealt with above is formulated mathematically as follows:

index = ( Σ_{i=1}^{l} σ(j_i, i) ) / l                          (1)

Z_k(f3) = X(f2) for some k ∈ {1, …, n}  →  apply (4), (5)      (2)
Z_k(f3) ≠ X(f2) for all k ∈ {1, …, n}  →  apply (6)            (3)

Y : C(Y) = min_{m=1}^{n} C(m)                                   (4)

f2 = X⁻¹ Y (f1)                                                 (5)

f2 = X⁻¹ S (f1)                                                 (6)

where:
index is the index in the Dynamic Hash Table;
l is the length of the file name;
i is the character position in the file name;
j_i ∈ {1, 2, 3, …, 26} is the position of the i-th character in the alphabet;
σ(j_i, i) is the sum of the decimal digits of j_i and i;
n is the number of nodes in the linked list, indexed by k;
Z_k is the k-th node in the linked list;
f3 is the file name in a node of the linked list;
f2 is the targeted file;
m ranges over the nodes in the binary tree;
Y is the node in the binary tree with the minimum C;
C is the CPU usage time of the specified IP;
S is the server.

4. CONCLUSION
The main advantage of this architecture is that server time is saved by harnessing the computational power of the clients who have already downloaded a file to send it on to the requestor. Another advantage of the architecture is the file search, which is accelerated by the dynamic hashing table and binary tree structures. The algorithm is currently being implemented in PHP, and its results will be published in future work. Initial results indicate a substantial reduction in the server's CPU processing time when this algorithm is executed on the server.
REFERENCES
[1] Lavanya Rajendran, Ramachandran Veilumuthu, 2011. A Cost Effective Cloud Service for E-Learning Video on Demand. European Journal of Scientific Research, pp. 569-579.
[2] Maria Dominic, Sagayaraj Francis, Philomenraj, 2013. A Study on Users on Moodle through Sarasin Model. International Journal of Computer Engineering and Technology, Volume 4, Issue 1, pp. 71-79.
[3] Maria Dominic, Sagayaraj Francis, 2013. Assessment of Popular E-Learning Systems via Felder-Silverman Model and a Comprehensive E-Learning System. International Journal of Modern Education and Computer Science, Hong Kong, Volume 5, Issue 11, pp. 1-10.
[4] Zhang Guoli, Liu Wanjun, 2010. The Applied Research of Cloud Computing Platform Architecture in the E-Learning Area. IEEE.
[5] www.moodle.org
[6] SCORM (Sharable Courseware Object Reference Model), http://www.adlnet.org
[7] IMS Global Learning Consortium, Inc., "Instructional Management System (IMS)", http://www.imsglobal.org
[8] http://www.aicc.org
[9] Uschold, Gruninger, 1996. Ontologies: Principles, Methods and Applications. Knowledge Engineering Review, Volume 11, Issue 2.
[10] Papazoglou, Heuvel, 2007. Service Oriented Architectures: Approaches, Technologies, and Research Issues. The VLDB Journal, Volume 16, Issue 3, pp. 389-415.
[11] Graf, Viola, Kinshuk, 2006. Representative Characteristics of Felder-Silverman Learning Styles: An Empirical Model. IADIS, pp. 235-242.
[12] Lorna Uden, Ernesto Damiani, 2007. The Future of E-Learning: E-Learning Ecosystem. Proceedings of the IEEE Conference on Digital Ecosystems and Technologies, Australia, pp. 113-117.
[13] Maria Dominic, Sagayaraj Francis, 2012. Mapping E-Learning System to Cloud Computing. International Journal of Engineering Research and Technology, India, Volume 1, Issue 6.
[14] Chyouhwa Chen, Kun-Cheng Tsai, 2008.
The Server Reassignment Problem for Load Balancing in Structured P2P Systems. IEEE Transactions on Parallel and Distributed Systems, Volume 19, Issue 2.
[15] A. Rao, K. Lakshminarayanan, S. Surana, R. Karp, and I. Stoica, 2006. Load Balancing in Structured P2P Systems. Proc. Second Int'l Workshop on Peer-to-Peer Systems (IPTPS '03).

This paper may be cited as:
Dominic, M. and Francis, S. 2014. Load Balancing using Peers in an E-Learning Environment. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 22-29.
E-Transparency and Information Sharing in the Public Sector
Edison Lubua (PhD)
Mzumbe University, P.O. Box 20266, Dar Es Salaam, 255, Tanzania

ABSTRACT
This paper determines the degree of information sharing in government institutions through e-transparency tools. First the basis for the study is set through the background, problem statement and objectives. The discussion then focuses on ICT tools for information sharing. An information sharing model is proposed and the extent of information sharing through online media in the public sector of Tanzania is discussed; furthermore, the correlation between the extent of information sharing and factors such as accessibility, understandability, usability and reliability is established. The paper concludes by providing recommendations on information sharing and on how it can be enhanced through e-transparency systems for public service delivery in an open society.

Keywords
E-transparency, E-Governance, Information Sharing, Public Sector, ICT.

1. BACKGROUND OF THE STUDY
Generally, information services are an important pillar of any democratic government. Citizens rely on information to make decisions which affect their social, political and economic lives. In this regard, there are laws which govern the right to access and disseminate information, locally and internationally (Hakielimu, LHRC, REPOA, 2005). Locally, government authority reflects international agreements through different pieces of legislation, including the National Constitution (United Republic of Tanzania, 1995). The Constitution of Tanzania entitles every citizen to the right of access to information and empowers citizens with the right to disseminate information. In his study, Onyach-Olaa (2003) commended government authorities which make an effort to enhance information sharing with citizens.
The government has to improve its interaction with those it governs while treating information sharing as a core function. Furthermore, information sharing and transparency in government operations must become the culture of any democratic republic, including Tanzania (Mkapa, 2003). Transparency in government operations improves the confidence of citizens in their government, while reminding government leaders that their decisions and the associated impact are visible to citizens (Navarra, 2006). Traditionally, information services have been provided and received through physical means; mostly, people use oral/listening and writing/reading methods to issue and receive information. In many cases, the traditional method of information sharing is characterised by delays, high cost,
low transparency and bureaucracy (Im & Jung, 2001); as a result, this method allows accountability to be subverted (Lubua, 2014). Arguably, the communication developments brought by the use of Information and Communication Technology (ICT) tools provide a better platform for information sharing. Instant communication is enabled through tools such as e-mail, online telephony, video conferencing, chat rooms and social websites. As a result of these tools, challenges relating to delays, high communication costs and bureaucratic procedures are addressed. Apart from the platform provided by online media in enhancing communication, it is equally important to understand that the efficiency of information sharing is directly related to the size of the network connecting individuals, groups of people and organisations (Hatala & Lutta, 2009). The greater the intensity of networks, the more information is received; an organisation enjoys these benefits if it forms strategic alliances with partners which allow a free flow of information to both ends. This is the reason why the e-governance agency was instituted in Tanzania. The appropriate use of e-transparency tools is perhaps the best strategy for an organisation to enhance information sharing with its stakeholders. The organisation has to emphasize good qualities of information sharing such as timely response, accessibility of systems, reliability of data, online security, completeness of online procedures and openness in service processes. Basically, this paper discusses different issues, including the need for online information sharing in the public sector and the extent to which government institutions apply online media for information sharing and service provision. The study is based on opinions from clients who are consumers of such services.

2.
PROBLEM STATEMENT
Business competition compels organisations to invest in information systems to improve the efficiency of their operations (Barua, Ravindran, & Whinston, 2007). This investment is made possible through the knowledge of employees, suppliers, customers, and other key stakeholders. In this regard, the organisation that shares its information with stakeholders more efficiently earns a competitive advantage (Drake, Steckler, & Koch, 2004). Information sharing is an important resource which should be embraced in order to enhance the performance of an organisation (Hatala & Lutta, 2009). Depending on the type of organisation, the extent of information sharing is partly influenced by organisational policies and practices. The management team, employees and partners have to work together to foster organisational information sharing, which guarantees the future existence of the organisation (Drake, Steckler, & Koch, 2004).
The government of Tanzania acknowledges the importance of ICTs in promoting information sharing in society. It uses methods such as conferences, workshops and public portals to demonstrate its intention of maximizing information sharing. With the growth in the number of ICT users, the degree of information sharing is expected to increase. Therefore, this study intends to establish the extent to which the use of ICTs has enhanced information sharing. Further, the study will also establish the correlation between the extent of information sharing and factors which negatively influence the perception of users.

3. OBJECTIVES
This study is designed to cover the following objectives:
i. To determine the extent of information sharing through e-transparency in the Tanzanian public sector.
ii. To establish the extent to which information usefulness, understandability, reliability and accessibility influence information sharing through e-transparency systems.

4. METHODOLOGY
This study was conducted using a mixed research method. First, the study reviewed the relevant literature to establish its relevance. Then, the Tanzania Revenue Authority's Custom Online System was identified as the case for study, followed by survey procedures. Data were collected from twenty (20) clearing and forwarding companies that operate under the Custom regulations of the Tanzania Revenue Authority; the study received and analysed a total of 40 responses. The study collected data from original sources to enhance validity and relevance. The analytical models used include Spearman's Correlation Model and Regression Analysis.

5.
ICT TOOLS AND GOVERNMENT INFORMATION SHARING
Transparency is one of the pillars of good governance; it promotes openness about conditions and activities, and eventually ensures that stakeholders have the information necessary to make the decisions needed for the progress of their businesses and lives. Information thus forms the cornerstone of transparency, especially in civic organisations. In the management of civic institutions, information dissemination provides guidance and education to stakeholders in the different matters that influence their lives, including political, socio-economic and cultural issues. The availability of information is clearly influenced by the media used in the capturing, storage and dissemination process. Since electronic media are effective in raising the level of transparency in society, the government should take advantage of these tools to build its relationship with citizens through information sharing, and hence engage them in supporting planned public development goals (Abu-Dhabi-Government, 2011; Lubua & Maharaj, 2012).
In the Republic of Tanzania, the use of ICT tools for communication and information sharing increases on a daily basis; the number of internet users increased by 450% between 2001 and 2010. Additionally, about 50% of the population of Tanzania is reported to use either the internet or a mobile phone (Kasumuni, 2012). Given this increase, understanding the extent to which information from government institutions is shared enables the government to know how effectively the media are being utilized to promote national development.

6. AN INFORMATION SHARING MODEL
This paper summarises information sharing using the model presented in Figure 1. The abundance and availability of information mean that the user needs skill to determine what it is that they want. In this case, the user of information has the key role to play in effecting information sharing. The user must be able to use the relevant tools to search for information and be able to determine the relevance of the accessed data to his/her operations. The ability to use such tools is attained through learning. Beyond knowing how to use the tools for searching for information, the user must be aware of the problem that they need to solve.

Figure 1: Information Sharing Model. Source: Research Data (2012)

The choice of information is dictated by the gap which has to be covered. When this gap is expressed, it becomes a need. In responding to the need, the user of information consults a source, which is either electronic or physical. It is possible that the source may not have the type of information requested, or that the information may not be satisfying. Regardless of the level of satisfaction, the user of information takes action towards covering the gap.
(Figure 1 elements: Information User → Information Needs → Information Source → No Information in the Source → Satisfaction/Dissatisfaction → Action.)
In case the public seeks information from government institutions, dissatisfaction may influence
members of the public to take action, even against the government; on the other hand, satisfaction builds more support for the government (Lubua, 2014). The satisfied user of the information applies it to solve the problem identified in the gap. A good example is a farmer who was searching for a good market for his/her harvest; s/he will eventually use the information to choose a better market. Conversely, the recent Arab uprising represents a possible negative response by users of information in a situation of low satisfaction (Maharaj & Vannikerk, 2011). The government should therefore respond adequately to inquiries from citizens to reduce the possibility of a negative response. It must ensure the adequate availability of information that addresses citizens' daily challenges.

7. INFORMATION SHARING USING E-TRANSPARENCY TOOLS IN PUBLIC INSTITUTIONS
The introduction of ICT tools brings more opportunities for information sharing in the organisation by allowing users to receive and send information more easily (Kilama, 2013). Stakeholders are also able to discuss issues of different interests through tools such as social networks, chat rooms, e-mail systems and video/teleconferencing, and in some cases the organisation is able to solicit stakeholders' opinions before making decisions (Im & Jung, 2001; Lubua & Maharaj, 2014). Together with the progress made in information sharing, there is a need to know the extent to which government institutions apply online media for information sharing. This study is based on opinions from clients who are consumers of online information from a government institution.
Based on the responses from clients of the Tanzania Revenue Authority, it was found that 70% of respondents agree that the authority sufficiently shares its information through online media. These respondents are clients of Custom services who benefit from the Custom Online System (CULAS). The following factors influenced the successful deployment of this system:

a.) Good ICT infrastructure
The ICT infrastructure of the Tanzania Revenue Authority is well established; it is characterised by a good interface, reliable data backup systems, power backups and a reliable internet connection. In addition, the revenue authority is among the organizations benefiting from the high-capacity internet connectivity of the National ICT Backbone (NICTBB). Nevertheless, the study observed that not all respondents had access to the infrastructure of the revenue authority; some lacked computers to access the systems. A computer room for clients would be an important extension of the services offered by the revenue authority in its custom section; it would equally facilitate users who are not based in Dar Es Salaam but visit for custom services.
b.) Technical Skills and Competency
The infrastructure of an information system requires competent staff to maintain and operate its functions (Badillo-Amador, García-Sánchez, & Vila, 2005; Cohen, 2012). In many cases, the revenue authority uses its own staff to run its operations; where advanced knowledge is required, the institution partners with non-governmental organisations to obtain technical services. To a large extent, the revenue authority uses training to equip its employees. Nevertheless, the study noted cases where training was not as effective as expected. In fact, the analysis of the degree of association between training and the skills possessed by staff, using the Pearson Correlation Model, observed an insignificant association (r = 0.101, p = 0.316), except where a follow-up programme was instituted (r = 0.292, p = 0.003). It is therefore necessary to incorporate follow-up programmes after training for enhanced competency.

c.) Institutional Will
Installing a good ICT infrastructure has to be complemented by the willingness of members of staff to use the new system exclusively for service provision. The management of the Tanzania Revenue Authority Custom Department dedicates its online system as the only method for the issuance of services to clients. So far the experience of operational staff is reported to be outstanding. However, the lack of important equipment such as computers for some employees, and occasional system breakdowns, affect its use.

d.) Customer Satisfaction
Changes have to be managed carefully in order to avoid frustrating clients. Together with implementing new changes to service provision, the Tanzania Revenue Authority Custom Department established a help desk that attends to queries from clients about different applications of the new system.
Additionally, documentation is provided that addresses the steps to be taken in using the system. The study discovered that 95% of respondents recommend or strongly recommend the use of the Tanzania Revenue Authority Custom Online System for securing services from the institution; these results show that users' satisfaction with the online system is high.

8. INFORMATION USEFULNESS, UNDERSTANDABILITY, RELIABILITY AND ACCESSIBILITY AND THE EXTENT OF INFORMATION SHARING
As shown in the previous section, respondents from the Tanzania Revenue Authority have confidence in the extent to which the government institution shares information with stakeholders through online media. While this extent is influenced by a number of factors, this study is interested in the following: information accessibility, information usefulness, information reliability and information understandability. This part of the study identifies how information sharing is
influenced by these factors, and a linear regression model is used to demonstrate the relationship. Linear regression analysis was used to establish the relationship between these variables, as shown in Table 1 below.

Table 1: Model Summary
Regression Model    R        R Square
1                   0.724a   0.524
a. Predictors: (Constant), Government online information is reliable; Government online information is useful; The use of the internet has enhanced access to information; Government online information is easily understood

According to the data reported by clients of the Tanzania Revenue Authority, the value of the coefficient of relatedness (R) is 0.724; this value suggests the presence of correlation between the variables. At the Tanzania Revenue Authority, information usefulness, understandability, reliability and accessibility are important attributes of the information provided to users, because the online system is the only means for users to access custom services. The appreciation of these variables influences the extent of information sharing among stakeholders. Below is a brief explanation of how these variables are supported at the Tanzania Revenue Authority.

a.) Information Accessibility
The Tanzania Revenue Authority's Custom Online System provides users with credentials which give access to the system. Within the system, users are able to trace every stage of their application. Moreover, to ensure that the system is constantly accessible to clients, the link to the online system is published on the website and supported by servers which run constantly with the support of information and power backups. Although accessibility is better than in other public institutions, users report cases where they failed to lodge their service applications due to extended system downtime.

b.)
Data Reliability
The online system of the Tanzania Revenue Authority ensures reliability by dedicating a few officials who are experts in custom services to manage the queries and applications submitted by clients to the system. Furthermore, employees of the revenue authority verify the information sent by clients before effecting a transaction, to ensure the reliability of the information involved. This ensures that only information which is both relevant and correct is provided to consumers through the online media. Moreover, to ensure that the information from users of the online system (who are clearing and forwarding experts) is reliable, the system provides guidance to users on the different stages involved in an application for services. The system also dictates the format
of the information to be entered, to ensure consistency; further, it grants the user the opportunity to proofread their data entry before the information is finally submitted.

c.) Information Usefulness
The Custom Online System is dedicated to the Customs department only, and is tailored to meet the needs of clearing and forwarding agents by simplifying their tax-paying processes. The authority receives feedback from clients on different aspects of the system, including its usefulness for its intended use. Although many respondents agree that the information they receive is useful, the study noted that a number of users were not comfortable with the use of the English language for communication. Swahili is Tanzania's national language; its adequate use would improve the ability of users to understand the information in context, and hence improve usefulness.

d.) Information Understandability
The issue of understanding the information provided through online systems is critical; the fact that the users of the Tanzania Revenue Authority are of a diverse nature suggests differences in analytical and language skills. While Tanzania uses Kiswahili as the national language, English is used for academic and business operations. Due to differences in education and analytical skills, some clients of the Tanzania Revenue Authority need to consult on language before they understand the content of the information. Recognising this challenge, the revenue authority has a dedicated helpdesk to clarify issues which users find difficult to understand.

9. CONCLUSION
The purpose of the study was to establish the degree to which the Tanzanian public sector uses ICTs to enhance transparency. The assessment was guided by the fact that Tanzania advocates good governance, of which information sharing is an important component.
The study also recognises that ICTs play an important role in the business sector in ensuring that clients access services efficiently and with maximum transparency. The same business experience could be adopted by the government to raise citizens' satisfaction with government services. The study observed that many people are aware of the importance of ICTs in ensuring transparency in government operations. However, there were several cases where the performance of government operations did not meet users' expectations. Factors such as low system reliability and the ineffectiveness of officials operating the system were among those which affected the use of ICTs for enhanced transparent services. While training was identified as important in equipping users with the required technical skills, it was occasionally observed to fall short; training requires follow-up to ensure that it meets its goals. Equally, information accessibility, reliability, usefulness and understandability have a great impact on users' experience of online media.
This paper may be cited as: Lubua, E. 2014. E-Transparency and Information Sharing in the Public Sector. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 30-38.
A Survey of Frequent Subgraphs and Subtree Mining Methods
Hamed Dinari and Hassan Naderi
Department of Computer Engineering
Iran University of Science and Technology
Tehran, Iran
ABSTRACT
A graph is a basic data structure which can be used to model complex structures and the relationships between them, such as XML documents, social networks, communication networks, chemical informatics, biological networks, and the structure of web pages. Frequent subgraph pattern mining is one of the most important fields in graph mining. In light of its many applications, there has been extensive research in this area, covering the analysis and processing of XML documents, document clustering and classification, image and video indexing, graph indexing for graph querying, routing in computer networks, web link analysis, drug design, and carcinogenesis. Several frequent pattern mining algorithms have been proposed in recent years, and new ones are introduced regularly. Because these algorithms use various methods on different datasets, pattern mining types, and graph and tree representations, it is not easy to compare them in terms of features and performance. This paper presents a brief report of an intensive investigation of current frequent subgraph and subtree mining algorithms. The algorithms are also categorized based on different features.
Keywords
Graph mining, subgraph, frequent pattern, graph indexing.
1. INTRODUCTION
Today we are faced with ever-increasing volumes of data, much of which naturally has a graph or tree structure. The process of extracting new and useful knowledge from graph data is known as graph mining [1] [2]. Frequent subgraph pattern mining [3] is an important part of graph mining.
It is defined as the process of extracting from a database those patterns whose frequency is greater than or equal to a user-defined threshold. Due to its wide utilization in various fields, including social network analysis [4] [5] [6], XML document clustering and classification [7] [8], network intrusion detection [9] [10], VLSI reverse engineering [11], behavioral modeling [12], the semantic web [13], graph indexing [14] [15] [16] [17] [18], web log analysis [19], link analysis [20], drug design [21] [22] [23], and classification of chemical compounds [24] [25] [26], this field has been the subject of several works.
The present paper is an attempt to survey subtree and subgraph mining algorithms. A comparison and classification of these algorithms, according to their different features, is also made. The next section discusses the literature review, followed by section three, which deals with the basic ideas and concepts of graphs and trees. Frequent subgraph mining algorithms are discussed in section four from different viewpoints, such as criteria for representing graphs (adjacency matrix and adjacency list), generation of subgraphs, frequency counting, pattern-growth-based and apriori-based classification, classification based on search method, classification based on transactional and single-graph inputs, classification based on type of output, and mining based on logic. The fifth section focuses on frequent subtree mining algorithms from different angles, such as tree representation methods, type of algorithm input, tree-based mining, and mining based on constraints on outputs.
2. RELATED WORKS
H.J. Patel, R. Prajapati, et al. [27] classified graph mining algorithms into two types: apriori-based and pattern-growth-based. K. Lakshmi and T. Meyyappan [28] studied apriori-based and pattern-growth-based algorithms, taking into account aspects such as input/output type, how a graph is represented, how candidates are generated, and how many times a candidate is repeated in the graph dataset. In [29], D. Kavitha, B.V. Manikyala Rao, et al. suggested a third type of graph mining algorithm, known as inductive logic programming. A complete survey of graph mining concepts, with a set of examples to ease understanding of the concepts, comes next.
3. BASIC CONCEPTS
3.1 Graph
A graph G(V, E) is composed of a set of vertices (V) connected to each other by a set of edges (E).
3.2 Tree
A tree T is a connected graph that has no cycle.
In other words, there is one and only one path between any two vertices.
3.3 Subgraph
A graph G'(V', E') is a subgraph of G(V, E) if its vertices and edges are subsets of V and E respectively:
• V' ⊆ V
• E' ⊆ E
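The subgraph relation above can be illustrated with a minimal sketch (not from the paper; the graph literals are illustrative). Graphs are stored as a pair of a vertex set and an edge set, and the check is literal set containment; note that frequent subgraph miners must instead test subgraph isomorphism, which allows vertices to be matched by label and structure and is far more expensive.

```python
# Minimal sketch of the subgraph definition: G'(V', E') is a subgraph of
# G(V, E) when V' ⊆ V and E' ⊆ E. Graphs are (vertex_set, edge_set) pairs.

def is_subgraph(g_sub, g):
    """Check V' ⊆ V and E' ⊆ E for undirected graphs."""
    (v_sub, e_sub), (v, e) = g_sub, g
    norm = lambda edges: {frozenset(edge) for edge in edges}  # (u,w) == (w,u)
    return v_sub <= v and norm(e_sub) <= norm(e)

# Illustrative graphs: a 4-cycle G and a 2-edge path H contained in it.
G = ({"a", "b", "c", "d"}, {("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")})
H = ({"a", "b", "c"}, {("a", "b"), ("c", "b")})

print(is_subgraph(H, G))  # True: every vertex and edge of H occurs in G
```

This check only verifies containment of identically named vertices; the isomorphism test of Section 3.3.3, where only labels and structure must match, is the computationally hard part of frequent subgraph mining.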
One may say that a subgraph of a graph is a pattern of that graph. Concerning trees, two types of patterns can be defined:
3.3.1 Induced pattern
The definition is exactly the same as the definition of a subtree of a tree (Figure 1.a, Figure 1.c): the vertices and edges of Figure 1.a can be seen in Figure 1.c as well.
3.3.2 Embedded pattern
Almost the same as an induced pattern, except that there may be one or more supplementary vertices between a parent and a child node of the pattern. For example, vertex A in Figure 1.c is an ancestor of vertex D, and Figure 1.b shows an embedded pattern of Figure 1.c.
Figure.1. An example of the induced and embedded subtree pattern
3.3.3 Isomorphism
Two graphs are isomorphic if there is a one-to-one correspondence between their vertices and between their edges.
3.3.4 Frequent Subgraph
Suppose a graph G and a set of graphs D = {g1, g2, g3, …, gn} are given. The support of G is

support(G) = |{gi ∈ D : G is a subgraph of gi}| / |D|

A graph G in a dataset D is called frequent if its support is not less than a predefined threshold.
4. AN OVERVIEW OF FREQUENT SUBGRAPH MINING ALGORITHMS ACCORDING TO DIFFERENT CRITERIA
This section discusses different criteria for classification of frequent graph mining algorithms, including: graph representation, input type, constraint-based mining, inductive logic programming, search strategy, and completeness/incompleteness of outputs.
4.1 Graph Representation
4.1.1 Adjacency Matrix
A graph can be represented as an adjacency matrix, in which the rows and columns represent the vertices of the graph and the entries represent its edges (i.e. when there is an edge between two vertices, the entry at the junction of the corresponding row and column is filled with “1” and otherwise with
”0”). Furthermore, the vertices appear on the main diagonal of the matrix (Figure 2). To represent the graph as a string, a combination of nodes and edges in a particular order can be used; since every permutation of the nodes may generate a different string, a maximum or minimum canonical adjacency matrix (CAM) must be adopted. An advantage of this is that two isomorphic graphs will have the same maximum/minimum CAM.
Figure.2. Left side a graph and right side corresponding adjacency matrix
4.1.2 Adjacency List
Another way to represent a graph is the adjacency list. When the graph is sparse, many zero entries are generated in the adjacency matrix, which is a great waste of memory; the adjacency list avoids this by assigning memory dynamically.
4.2 Subgraph Generation
Two subgraphs can be joined to generate a candidate subgraph, the result being a new subgraph. However, given that many duplicate subgraphs might be generated in the joining process, the way candidate subgraphs are generated is critical. Among the available methods are extension and rightmost extension. In the latter case, subgraphs are expanded in one direction only, and no duplicate candidate is generated.
4.3 Frequency Counting
To decide whether a generated candidate is frequent, the frequency of each candidate must be determined and compared with the support threshold. Data structures used to count the frequency of each candidate include the embedding list and the TSP tree.
5. A SURVEY OF FREQUENT SUBGRAPH MINING ALGORITHMS
5.1 Classification Based on Algorithmic Approach
5.1.1 Apriori-Based (Breadth First Search)
This category of algorithms uses a generate-and-test method and level-wise search to find subgraphs in the graphs that constitute the database.
Therefore, before subgraphs of length k+1 ((k+1)-candidates) are generated, all frequent subgraphs of length k must be found. Each candidate of length k+1 is then obtained by connecting two frequent subgraphs with
length of k. However, in this method every possible candidate subgraph is generated, and maintaining and processing them needs plenty of time and memory, which hurts performance [30] [2].
5.1.2 Pattern Growth-Based
In pattern-growth-based methods, a candidate subgraph of length k+1 is obtained by extending a frequent pattern of length k. Since extending a frequent subgraph of length k may generate several candidates of length k+1, the way a frequent subgraph is expanded is critical in reducing the generation of duplicate subgraphs. Table 1 lists apriori and pattern growth algorithms [2].
Table 1. Frequent Subgraph Mining Algorithms
Apriori: FARMER [31], FSG [3], HSIGRAM, GREW [32], FFSM [4], ISG, SPIN [33], Dynamic GREW [34], AGM [35], MUSE [36], SUBDUE [37], AcGM [38], DPMine, gFSG [39], MARGIN [40]
Pattern Growth: gSpan [41], CloseGraph [42], Gaston [43], TSP [44], MoFa [45], RP-FP [46], RP-GD [46], JPMiner [47], MSPAN, VSIGRAM [48], FPF [49], Gapprox [50], HybridGMiner, FCPMiner [51], RING [52], SCMiner [53], GraphSig [54], FP-GraphMiner [55], gPrune [56], CLOSECUT [57], FSMA [58]
5.2 Classification Based on Search Strategy
There are two search strategies for finding frequent subgraphs: breadth-first search (BFS) and depth-first search (DFS).
5.3 Classification Based on Nature of the Input
Depending on the input type of the algorithms, two categories can be distinguished:
5.3.1 Single Graph Database
The database consists of a single large graph.
5.3.2 Transactional Graph Database
The database consists of a large number of small graphs. Figure 3 shows a transactional graph database (left side: g1, g2, and g3) and the frequency of two frequent subgraphs (right side).
Figure.3. A database consisting of three graphs g1, g2, g3 and two subgraphs and the frequency of each
5.4 Classification Based on Nature of the Output
5.4.1 Completeness of the Output
While some algorithms find all frequent patterns, others mine only part of the frequent patterns. Completeness is closely related to performance: when the total size of the dataset is very large, it may be better to use algorithms that are faster in execution, so that performance does not degrade, even though not all frequent patterns are mined. Table 2 lists algorithms by completeness of output [29].
Table 2. Completeness of Output
Complete output: FARMER, gSpan, FFSM, Gaston, FSG, HSIGRAM
Incomplete output: SUBDUE, GREW, CloseGraph, ISG
5.4.2 Constraint-Based
As the size of the database grows, the number of frequent patterns increases. This makes storage and analysis more difficult, as more memory space is needed. Reducing the number of frequent patterns without losing information is achievable by mining and maintaining more comprehensive patterns. Since every subpattern of a frequent pattern is itself frequent, the
whole set of subpatterns need not be stored; to obtain more compact results, the following notions are used:
5.4.2.1 Maximal Pattern
A subgraph g1 is a maximal pattern if it is frequent and there is no frequent super-pattern g2 such that g2 ⊃ g1.
5.4.2.2 Closed Pattern
A subgraph g1 is closed if it is frequent and there is no frequent super-pattern g2 with g2 ⊃ g1 and support(g2) = support(g1). Table 3 lists maximal and closed subgraph mining algorithms.
Table 3. Frequent Subgraph Mining (Constrained)
Maximal: SPIN, MARGIN, ISG, GREW
Closed: CloseGraph, CLOSECUT, TSP, RP-FP, RP-GD
5.5 Logic-Based Mining
Also known as inductive logic programming (ILP), an area of machine learning applied mainly in biology, this approach uses inductive logic to represent structured data. The core of ILP uses logic-based representations for search, with basic assumptions about the structure derived from background knowledge (e.g. WARMR, FOIL, and C-PROGOL) [29]. Table 4 lists the pattern-growth-based and Table 5 the apriori-based algorithms, categorized from different aspects [59] [27] [60] [61] [62] [28] [30] [63].
Table 4. Frequent Subgraph Mining Algorithms (Pattern Growth-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
gSpan | Set of graphs | Adjacency matrix | Rightmost extension | DFS
CloseGraph | Set of graphs | Adjacency matrix | Rightmost extension | DFS
Gaston | Set of graphs | Hash table | Extension | DFS
TSP | Set of graphs | Adjacency matrix | Extension | TSP tree
MoFa | Set of graphs | Adjacency matrix | Rightmost extension | DFS
RP-FP | Set of graphs | Adjacency matrix | Rightmost extension | DFS
RP-GD | Set of graphs | Adjacency matrix | Rightmost extension | DFS
JPMiner | Set of graphs | Adjacency matrix | Rightmost extension | DFS
MSPAN | Set of graphs | Adjacency matrix | Rightmost extension | DFS
FP-GraphMiner | Set of graphs | BitCode | Extension | DFS
gPrune | Set of graphs | Adjacency matrix | Iteration extension | M-DFSC
FSMA | Set of graphs | Incidence matrix | Extension | Normalized matrix
RING | Set of graphs | Invariant vector | Extension | R-tree, DFS
GraphSig | Set of graphs | Feature vector | Merge and extension | DFS
Table 5. Frequent Subgraph Mining Algorithms (Apriori-based)
Algorithm | Input Type | Graph Representation | Subgraph Generation | Frequency Counting
SUBDUE | Single large graph | Adjacency matrix | Level-wise search | MDFS
FARMER | Set of graphs | Trie structure | Level-wise search, ILP | Trie data structure
FSG | Set of graphs | Adjacency list | One-edge extension | TID list
HSIGRAM | Single large graph | Adjacency matrix | Iterative merging | Maximal independent set
GREW | Single large graph | Sparse graph | Iterative merging | Maximal independent set
FFSM | Set of graphs | Adjacency matrix | Merging and extension | Suboptimal canonical adjacency matrix tree
ISG | Set of graphs | Edge triple | Edge triple extension | TID list
SPIN | Set of graphs | Adjacency matrix | Join operation | Canonical spanning tree
Dynamic GREW | Set of graphs | Sparse graph | Iterative merging | Suffix trees
AGM | Set of graphs | Adjacency matrix | Vertex extension | Canonical labeling
MUSE | Set of graphs | Search tree | Disjunctive normal form | DFS coding
MARGIN | Set of graphs | Lattice | Join | CAM
AcGM | Set of graphs | Adjacency matrix | Join | CAM
gFSG | Set of graphs | Adjacency matrix | Iterative merging | Hash tree
Here several algorithms related to graph/tree mining are discussed in more detail.
• Gp-Growth Algorithm
The algorithm consists of three main steps:
1. Candidate generation by join operation.
2. Use of a new method for tree representation, with a lookup table that allows quick access to node information in the candidate generation phase without having to re-read the trees of the database.
3. Use of rightmost expansion for candidate generation, which guarantees that no duplicate candidates are generated.
This algorithm uses a lookup table, implemented as a hash table, to store information about the input trees. The key part is represented as the pair (T, pos), where T is the identifier of the input tree and pos is the node's number in a preorder traversal; the value part is represented as (l, s), where l is the label and s is the scope of the node. A new candidate is generated using the scope of each node: a node is added along the rightmost path, within the scope of the node it is attached to, and by repeating this process further frequent patterns are found [64].
• Fp-Graph Miner Algorithm
This algorithm uses the FP-growth method to find frequent subgraphs; its input is a set of graphs (a transactional database). First a BitCode is defined for each edge over the set of graphs: when the edge occurs in a graph, the corresponding bit is ‘1’ and otherwise ‘0’. A frequency table is then sorted in ascending order based on the BitCode of each edge; afterward, an FP-tree is constructed and frequent subgraphs are obtained through depth-first traversal [55].
6. FREQUENT SUBTREE MINING ALGORITHMS CLASSIFICATION
6.1 Tree Representation
A tree can be encoded as a sequence of nodes and edges. Some of the most important ways of encoding trees are introduced below:
6.1.1 DLS (Depth Label Sequence)
Let T be a labeled ordered tree; a depth-label pair (d(v), l(v)), giving the depth and label of each node, is defined for every node in V.
During a DFS traversal of T, the pairs (d(vi), l(vi)) are appended to a string s, giving the depth-label sequence of T as {(d(v1), l(v1)), …, (d(vk), l(vk))}. For instance, the DLS for the tree in Figure 4 is:
{(0,a),(1,b),(2,e),(3,a),(1,c),(2,f),(3,b),(3,d),(2,a),(1,d),(2,f),(3,c)}
6.1.2 DFS-LS (Depth First Sequence - Label Sequence)
Given a labeled ordered tree T, labels are appended to a string s during a DFS traversal of T; on each backtrack, ‘-1’ or ‘$’ or ‘/’ is added
to the string. The DFS-LS code for the tree T in Figure 4 is {abea$$$cfb$d$$a$$dfc$$$}.
6.1.3 BFCS (Breadth First Canonical String)
Let T be an unordered tree. Several encoded strings can be generated using the BFS method by changing the order of the children of a node; the BFCS of T is the lexicographically smallest of these encoded strings. The BFCS of the tree in Figure 4 is {a$bcd$e$fa$f$a$bd$$c#}.
6.1.4 CPS (Consolidated Prufer Sequence)
Let T be a labeled tree. The CPS encoding consists of two parts: NPS, an extended Prufer sequence obtained using vertex numbers as unique labels; and LS (Label Sequence), the sequence of labels obtained in postfix order as the leaves are removed. Together, NPS and LS give a unique encoding of the labeled tree. For the tree in Figure 4 they are, respectively: {ebaffccafda-} and {aebbdfaccfda}. To obtain NPS, a leaf is removed from the tree at each step and the label of its parent is output; this is repeated until only the root remains, and ‘-’ is appended to mark the end of the string. LS is the corresponding postfix traversal of the tree. Table 6 lists the tree representations used by different algorithms [65].
Figure.4. A Tree Example
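The DLS and DFS-LS encodings of Sections 6.1.1 and 6.1.2 can be sketched as follows; the tree literal below reconstructs the Figure 4 example from the sequences given in the text, and ‘$’ is used as the backtrack symbol (a sketch, not the authors' implementation):

```python
# Sketch of the DLS and DFS-LS encodings for the Figure 4 tree.
# A node is a (label, children) pair; '$' marks each DFS backtrack.

def dls(node, depth=0, out=None):
    """Depth-label sequence: (depth, label) pairs in preorder."""
    out = [] if out is None else out
    label, children = node
    out.append((depth, label))
    for child in children:
        dls(child, depth + 1, out)
    return out

def dfs_ls(node):
    """DFS label sequence; a '$' is appended after each subtree."""
    label, children = node
    return label + "".join(dfs_ls(child) + "$" for child in children)

# The Figure 4 tree, reconstructed from the DLS listed in the text.
tree = ("a", [("b", [("e", [("a", [])])]),
              ("c", [("f", [("b", []), ("d", [])]), ("a", [])]),
              ("d", [("f", [("c", [])])])])

print(dfs_ls(tree))  # abea$$$cfb$d$$a$$dfc$$$ — the DFS-LS string in the text
```

Note that the root itself gets no trailing ‘$’, which matches the string given in Section 6.1.2, and `dls(tree)` reproduces the depth-label pairs listed in Section 6.1.1.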
Table 6. Frequent Subtree Mining Algorithms (Tree Representation)
Algorithm | Tree Representation
uFreqt | DLS
SLEUTH | DFS-LS
Unot | DLS
Path Join | FST-Forest
RootedTreeMiner [66] | BFCS
FREQT | DLS
TreeMiner | DFS-LS
Chopper | DLS
XSPanner | DLS
AMIOT | DFS string
IMB3Miner | DFS-LS
TRIPS | CPS
FreeTreeMiner | BFCS
CMTreeMiner | DFS-LS
HybridTreeMiner [67] | BFCS
GP-Growth | DFS-LS
6.2 Input Types
6.2.1 Rooted Ordered Trees
A rooted ordered tree is a tree in which a single node is designated as the root and there is an order among the children of each node, so that each child is greater than or equal to the siblings placed at its left-hand side and less than or equal to those placed at its right-hand side. If we relax this definition so that no order among siblings is required, we have a rooted unordered tree. Table 7 lists rooted ordered tree mining algorithms.
Table 7. Rooted Ordered Tree Mining Algorithms
Induced: FREQT [68], AMIOT [69], IMB3Miner [70], TRIPS [65], TIDES [65]
Embedded: TreeMiner [71], Chopper [72], XSPanner [72], IMB3-Miner
6.2.2 Rooted Unordered Trees
In this type of tree, a node is designated as the root, but there is no particular order among the children of each node. Table 8 lists rooted unordered tree mining algorithms.
Table 8. Rooted Unordered Tree Mining Algorithms
Induced: uFreqt [73], Unot [74], PathJoin [65], Rooted TreeMiner [75]
Embedded: TreeFinder [76], Cousin Pair [77], SLEUTH [78]
6.3 Tree-Based Data Mining
Frequent subtree mining algorithms can be categorized into two major categories, apriori-based and pattern-growth-based. Table 9 lists the apriori and pattern growth algorithms for trees [79] [76] [80].
Table 9. Frequent Subtree Mining Algorithms
Apriori: TreeFinder, AMIOT, FreeTreeMiner, TreeMiner [81], SLEUTH, CMTreeMiner [82], Pattern Matcher [71], W3Miner [83], FTMiner [84], CFFTree [85], IMB3-Miner, uFreqt, Unot
Pattern Growth: FREQT, TRIPS, TIDES, Path Join, XSPanner, Chopper, PrefixTreeISpan [86], PCITMiner [87], F3TM [88], GP-Growth [64]
7. CONCLUSIONS AND FUTURE WORKS
Frequent subgraph mining algorithms were first examined from different viewpoints, such as ways of representing a graph (e.g. adjacency matrix and adjacency list), generation of subgraphs, frequency counting, pattern-growth-based and apriori-based classification, search-based classification, input-based classification (single, transactional), and output-based classification. Mining based on logic was also discussed. Afterward, frequent subtree mining algorithms were examined from different viewpoints, such as tree representation methods, type of input, tree-based mining, and mining based on constraints on outputs. Given the results, it is concluded that, because pattern-growth methods avoid generating every possible candidate pattern, they require less computation and a smaller memory size. Moreover, these algorithms are specifically designed for trees and graphs and cannot be used for other purposes. On the other hand, as they work on a variety of datasets, it is not easy to find trade-offs between them. The same frequent patterns can be used for similarity search, indexing, and classifying graphs and documents in future studies. Parallel methods and technologies such as Hadoop may also be needed when working with excessive data volumes.
8. ACKNOWLEDGMENTS
The authors are thankful to Mohammad Reza Abbasifard for his support of the investigations.
REFERENCES
[1] A. Rajaraman, J. D. Ullman, 2012. Mining of Massive Datasets, 2nd ed.
[2] J. Han, M. Kamber, 2006. Data Mining: Concepts and Techniques. USA: Diane Cerra.
[3] Kuramochi, Michihiro, and G. Karypis, 2004. An efficient algorithm for discovering frequent subgraphs, IEEE Transactions on Knowledge and Data Engineering, pp. 1038-1051.
[4] J. Huan, W. Wang, J. Prins, 2003. Efficient mining of frequent subgraphs in the presence of isomorphism, in Third IEEE International Conference on Data Mining (ICDM).
[5] (2013, Dec.) Trust Network Datasets - TrustLet. [Online]. http://www.trustlet.org
[6] L. Yan, J. Wang, 2011. Extracting regular behaviors from social media networks, in Third International Conference on Multimedia Information Networking and Security.
[7] Ivancsy, Renata, I. Vajk, 2009. Clustering XML documents using frequent subtrees, Advances in Focused Retrieval, Vol. 3, pp. 436-445.
[8] J. Yuan, X. Li, L. Ma, 2008. An improved XML document clustering using path features, in Fifth International Conference on Fuzzy Systems and Knowledge Discovery, Vol. 2.
[9] Lee, Wenke, and Salvatore J. Stolfo, 2000.
A framework for constructing features and models for intrusion detection systems, ACM Transactions on Information and System Security (TISSEC), pp. 227-261.
[10] Ko, C., 2000. Logic induction of valid behavior specifications for intrusion detection, in IEEE Symposium on Security and Privacy (S&P), pp. 142-155.
[11] Yoshida, K. and Motoda, 1995. CLIP: Concept learning from inference patterns, Artificial Intelligence, pp. 63-92.
[12] Wasserman, S., Faust, K., and Iacobucci, D., 1994. Social Network Analysis: Methods and Applications. Cambridge University Press.
[13] Berendt, B., Hotho, A., and Stumme, G., 2002. Semantic web mining, in International Semantic Web Conference (ISWC), pp. 264-278.
[14] S. C. Manekar, M. Narnaware, May 2013. Indexing frequent subgraphs in large graph databases using parallelization, International Journal of Science and Research (IJSR), Vol. 2, No. 5.
[15] Peng, Tao, et al., 2010. A graph indexing approach for content-based recommendation systems, in IEEE Second International Conference on Multimedia and Information Technology (MMIT), pp. 93-97.
[16] S. Sakr, E. Pardede, 2011. Graph Data Management: Techniques and Applications. Information Science Reference.
[17] Y. Xiaogang, T. Ye, P. Tao, C. Canfeng, M. Jian, 2010. Semantic-based graph index for mobile photo search, in Second International Workshop on Education Technology and Computer Science, pp. 193-197.
[18] Yildirim, Hilmi, and Mohammed Javeed Zaki, 2010. Graph indexing for reachability queries, in 26th International Conference on Data Engineering Workshops (ICDEW), IEEE, pp. 321-324.
[19] R. Ivancsy and I. Vajk, 2006. Frequent pattern mining in web log data, Acta Polytechnica Hungarica, pp. 77-90.
[20] G. Xu, Y. Zhang, L. Li, 2010. Web Mining and Social Networking. Melbourne: Springer.
[21] S. Ranu, A. K. Singh, 2010. Indexing and mining topological patterns for drug, in ACM, Data Mining and Knowledge Discovery, Berlin, Germany.
[22] (2013, Dec.) Drug Information Portal. [Online]. http://druginfo.nlm.nih.gov
[23] (2013, Dec.) DrugBank. [Online]. http://www.drugbank.ca
[24] Dehaspe, Toivonen, and King, R. D., 1998.
Finding frequent substructures in chemical compounds, in Proc. of the 4th ACM International Conference on Knowledge Discovery and Data Mining, pp. 30-36.
[25] Kramer, S., De Raedt, L., and Helma, C., 2001. Molecular feature mining in HIV data, in Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-01), pp. 136-143.
[26] Gonzalez, J., Holder, L. B. and Cook, 2001. Application of graph-based concept learning to the predictive toxicology domain, in Proc. of the
Predictive Toxicology Challenge Workshop.
[27] H. J. Patel, R. Prajapati, M. Panchal, M. Patel, Jan. 2013. A survey of graph pattern mining algorithms and techniques, International Journal of Application or Innovation in Engineering & Management (IJAIEM), Vol. 2, No. 1.
[28] K. Lakshmi, T. Meyyappan, 2012. Frequent subgraph mining algorithms - a survey and framework for classification, Computer Science and Information Technology, pp. 189-202.
[29] D. Kavitha, B. V. Manikyala Rao and V. Kishore Babu, 2011. A survey on assorted approaches to graph data mining, International Journal of Computer Applications, pp. 43-46.
[30] C. C. Aggarwal, Wang, Haixun, 2010. Managing and Mining Graph Data. Springer.
[31] B. Wackersreuther, et al., 2010. Frequent subgraph discovery in dynamic networks, in ACM, Proceedings of the Eighth Workshop on Mining and Learning with Graphs, Washington DC, USA, pp. 155-162.
[32] Kuramochi, Michihiro, and G. Karypis, 2004. GREW - a scalable frequent subgraph discovery algorithm, in Fourth IEEE International Conference on Data Mining (ICDM), pp. 439-442.
[33] Huan, Jun, 2004. SPIN: mining maximal frequent subgraphs from graph databases, in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[34] Borgwardt, Karsten M., H-P. Kriegel, and P. Wackersreuther, 2006. Pattern mining in frequent dynamic subgraphs, in Sixth International Conference on Data Mining (ICDM), pp. 818-822.
[35] Inokuchi, Akihiro, T. Washio, and H. Motoda, 2000. An apriori-based algorithm for mining frequent substructures from graph data, in Principles of Data Mining and Knowledge Discovery, pp. 13-23, Springer Berlin Heidelberg.
[36] Zou, Zhaonian, et al., 2009. Frequent subgraph pattern mining on uncertain graph data, in Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp.
583-592. [37] Ketkar, N.S, Lawrence B.Holder, and D.J.Cook, 2005. Subdue: compression- based frequent pattern discovery in graph data, in ACM, Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, pp. 71-76. [38] A. Inokuchi, T. Washio, and H. Motoda, 2003. Complete mining of frequent patterns from graphs: Mining graph data, in Machine Learning, pp. 321-354.
[39] M. Kuramochi and G. Karypis, 2007. Discovering frequent geometric subgraphs. Information Systems, pp. 1101-1120.
[40] L. T. Thomas, S. R. Valluri and K. Karlapalem, 2006. MARGIN: Maximal frequent subgraph mining. In IEEE Sixth International Conference on Data Mining (ICDM), pp. 1097-1101.
[41] X. Yan and J. Han, 2002. gSpan: Graph-based substructure pattern mining. In Proceedings of the IEEE International Conference on Data Mining, pp. 721-724.
[42] X. Yan and J. Han, 2003. CloseGraph: mining closed frequent graph patterns. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 286-295.
[43] S. Nijssen and J. N. Kok, 2005. The Gaston tool for frequent subgraph mining. Electronic Notes in Theoretical Computer Science, pp. 77-87.
[44] H.-P. Hsieh and C.-T. Li, 2010. Mining temporal subgraph patterns in heterogeneous information networks. In IEEE Second International Conference on Social Computing (SocialCom), pp. 282-287.
[45] M. Wörlein et al., 2005. A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In Knowledge Discovery in Databases: PKDD, Springer Berlin Heidelberg, pp. 392-403.
[46] S. J. Suryawanshi and S. M. Kamalapur, Mar. 2013. Algorithms for Frequent Subgraph Mining. International Journal of Advanced Research in Computer and Communication Engineering, Vol. 2, No. 3.
[47] Y. Liu, J. Li and H. Gao, 2009. JPMiner: mining frequent jump patterns from graph databases. In IEEE Sixth International Conference on Fuzzy Systems and Knowledge Discovery, pp. 114-118.
[48] S. Reinhardt and G. Karypis, 2007. A multi-level parallel implementation of a program for finding frequent patterns in a large sparse graph. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 1-8.
[49] F. Schreiber and H. Schwöbbermeyer, 2005. Frequency concepts and pattern detection for the analysis of motifs in networks. In Transactions on Computational Systems Biology III, Springer Berlin Heidelberg, pp. 89-104.
[50] C. Chen et al., 2007. gApprox: Mining frequent approximate patterns from a massive network. In Seventh IEEE International Conference on Data Mining (ICDM), pp. 445-450.
[51] Y. Ke, J. Cheng and J. X. Yu, 2009. Efficient discovery of frequent correlated subgraph pairs. In Ninth IEEE International Conference on Data Mining (ICDM), pp. 239-248.
[52] S. Zhang, J. Yang and S. Li, 2009. RING: An integrated method for frequent representative subgraph mining. In Ninth IEEE International Conference on Data Mining (ICDM), pp. 1082-1087.
[53] E. Fromont, C. Robardet and A. Prado, 2009. Constraint-based subspace clustering. In International Conference on Data Mining, pp. 26-37.
[54] S. Ranu and A. K. Singh, 2009. GraphSig: A scalable approach to mining significant subgraphs in large graph databases. In IEEE 25th International Conference on Data Engineering (ICDE), pp. 844-855.
[55] R. Vijayalakshmi, R. Nadarajan, J. F. Roddick and M. Thilaga, 2011. FP-GraphMiner: A Fast Frequent Pattern Mining Algorithm for Network Graphs. Journal of Graph Algorithms and Applications, Vol. 15, pp. 753-776.
[56] F. Zhu et al., 2007. gPrune: a constraint pushing framework for graph pattern mining. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 388-400.
[57] X. Yan, X. Zhou and J. Han, 2005. Mining closed relational graphs with connectivity constraints. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pp. 324-333.
[58] J. Wu and L. Chen, 2008. A fast frequent subgraph mining algorithm. In The 9th International Conference for Young Computer Scientists (ICYCS), pp. 82-87.
[59] V. Krishna, N. N. R. R. Suri and G. Athithan, 2011. A comparative survey of algorithms for frequent subgraph discovery. Current Science (Bangalore), pp. 1980-1988.
[60] K. Lakshmi and T. Meyyappan, Apr. 2012. A Comparative Study of Frequent Subgraph Mining Algorithms. International Journal of Information Technology Convergence and Services (IJITCS), Vol. 2, No. 2.
[61] C. Jiang, F. Coenen and M. Zito, 2004. A Survey of Frequent Subgraph Mining Algorithms. The Knowledge Engineering Review, pp. 1-31.
[62] M. Gholami and A. Salajegheh, Sep. 2012. A Survey on Algorithms of Mining Frequent Subgraphs. International Journal of Engineering Inventions, Vol. 1, No. 5, pp. 60-63.
[63] V. Singh and D. Garg, Jul. 2011. Survey of Finding Frequent Patterns in Graph Mining: Algorithms and Techniques. International Journal of Soft Computing and Engineering (IJSCE), Vol. 1, No. 3.
[64] M. M. A. Hussein, T. H. Soliman and O. H. Karam, 2007. GP-Growth: A New Algorithm for Mining Frequent Embedded Subtrees. In 12th IEEE Symposium on Computers and Communications.
[65] S. Tatikonda, S. Parthasarathy and T. Kurc, 2006. TRIPS and TIDES: new algorithms for tree mining. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management.
[66] J.-H. Tung, 2006. MINT: Mining Frequent Rooted Induced Unordered Trees without Candidate Generation.
[67] Y. Chi, Y. Yang and R. R. Muntz, 2004. HybridTreeMiner: An efficient algorithm for mining frequent rooted trees and free trees using canonical forms. In Proceedings of the 16th International Conference on Scientific and Statistical Database Management.
[68] T. Asai, H. Arimura, T. Uno, S. Nakano and K. Satoh, 2008. Efficient tree mining using reverse search.
[69] S. Hido and H. Kawano, 2005. AMIOT: Induced Ordered Tree Mining in Tree-structured Databases. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM'05).
[70] H. Tan, T. S. Dillon, F. Hadzic, E. Chang and L. Feng, 2006. IMB3-Miner: Mining Induced/Embedded Subtrees by Constraining the Level of Embedding. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 450-461.
[71] M. J. Zaki, 2002. Efficiently mining frequent trees in a forest. In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (ACM SIGKDD), pp. 71-80.
[72] C. Wang, M. Hong, J. Pei, H. Zhou and W. Wang, 2004. Efficient pattern-growth methods for frequent tree pattern mining. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 441-451.
[73] S. Nijssen and J. N. Kok, 2003. Efficient Discovery of Frequent Unordered Trees. In Proceedings of the First International Workshop on Mining Graphs, Trees and Sequences, pp. 55-64.
[74] T. Asai, H. Arimura, T. Uno and S. Nakano, 2003. Discovering Frequent Substructures in Large Unordered Trees. In Proceedings of the Sixth Conference on Discovery Science, pp. 47-61.
[75] Y. Chi, Y. Yang and R. Muntz, May 2004. Canonical Forms for Labeled Trees and Their Applications in Frequent Subtree Mining. Knowledge and Information Systems, No. 8.2, pp. 203-234.
[76] Y. Chi et al., 2005. Frequent subtree mining - an overview. Fundamenta Informaticae, pp. 161-198.
[77] D. Shasha, J. Tsong-Li Wang and S. Zhang, 2004. Unordered tree mining with applications to phylogeny. In IEEE Proceedings of the 20th International Conference on Data Engineering, pp. 708-719.
[78] M. J. Zaki, 2005. Efficiently Mining Frequent Embedded Unordered Trees. IOS Press, pp. 1-20.
[79] A. Jimenez, F. Berzal and J. C. Cubero, 2008. Mining induced and embedded subtrees in ordered, unordered, and partially-ordered trees. In IEEE Transactions on Knowledge and Data Engineering, pp. 111-120.
[80] A. Jimenez, F. Berzal and J. C. Cubero, 2006. Mining Different Kinds of Trees: A Tree Mining Overview. In Data Mining.
[81] B. Bringmann, 2004. Matching in Frequent Tree Discovery. In Fourth IEEE International Conference on Data Mining.
[82] Y. Chi et al., 2004. CMTreeMiner: Mining both closed and maximal frequent subtrees. In Advances in Knowledge Discovery and Data Mining, Springer Berlin Heidelberg, pp. 63-73.
[83] R. AliMohammadzadeh et al., Aug. 2006. Complete Discovery of Weighted Frequent Subtrees in Tree-Structured Datasets. International Journal of Computer Science and Network Security (IJCSNS), Vol. 6, No. 8, pp. 188-196.
[84] J. Hu and X. Y. Li, Mar. 2009. Association Rules Mining Including Weak-Support Modes Using Novel Measures. WSEAS Transactions on Computers, Vol. 8, No. 3, pp. 559-568.
[85] P. Zhao and J. X. Yu, 2007. Mining closed frequent free trees in graph databases. In Advances in Databases: Concepts, Systems and Applications, Springer Berlin Heidelberg, pp. 91-102.
[86] L. Zou et al., 2006. PrefixTreeESpan: A pattern growth algorithm for mining embedded subtrees. In Web Information Systems (WISE), Springer Berlin Heidelberg, pp. 499-505.
[87] S. Kutty, R. Nayak and Y. Li, 2007. PCITMiner: prefix-based closed induced tree miner for finding closed induced frequent subtrees. In Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, Vol. 70, Australian Computer Society.
[88] P. Zhao and J. X. Yu, 2008. Fast frequent free tree mining in graph databases. World Wide Web, Springer, pp. 71-92.

This paper may be cited as: Dinari, H. and Naderi, H., 2014. A Survey of Frequent Subgraphs and Subtree Mining Methods. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 39-57.
A Model for Implementation of IT Service Management in Zimbabwean State Universities
Munyaradzi Zhou, Caroline Ruvinga, Samuel Musungwini and Tinashe Gwendolyn Zhou
Department of Computer Science and Information Systems, Gweru, Zimbabwe

ABSTRACT
Several IT service management (ITSM) frameworks have been deployed and adopted by companies and institutes without redefining the framework into a model that suits their IT department's operating environment and requirements. An IT service management model is proposed for Zimbabwean universities; it takes a holistic approach by integrating Operational Level Agreements (OLAs), Service Level Agreements (SLAs) and IT Service Catalogues (ITSCs). The OLA is treated as the domain for describing IT service management, and its attainment is driven by organizational management and IT section personnel in alignment with the mission, vision and values of the organization. Explicitly defining OLAs will aid management in identifying key services and processes in both qualitative and quantitative form (SLAs). Once SLAs are defined, ITSCs can be formulated; these are both customer and IT-service-provider centric and act as the nucleus of the model. Redefining IT service management from this perspective will result in deriving value from ITSM frameworks and in customer satisfaction.

Keywords: SLAs, OLAs, ITSCs, ITSM.

1. INTRODUCTION
IT service management is a modern concept adopted by the IT community for improved IT service delivery and productivity, aiming to attain customer satisfaction and control costs. IT service management integrates IT service provisioning between service providers and end users to arrive at an end-to-end service through measures such as Service Level Agreements (SLAs), Operational Level Agreements (OLAs) and IT Service Catalogues (ITSCs) (Almeroth & Hasan, 2002). Service management frameworks such as Control Objectives for Information and related Technology (COBIT) and the IT Infrastructure Library (ITIL) have been developed in the IT industry, but they have not been tailored to a specific IT section given its operating environment and
constraints. IT service is the nucleus of the business processes at a university: it supports academic research, learning and teaching. Universities offer IT services to staff, researchers, students, visitors and partners on platforms such as electronic learning (e-learning), library services, the staff directory and email, and learning resources, all of which are crucial to learning, teaching and collaboration as the community becomes global. The IT department must offer better services to these stakeholders in a resource-constrained environment (staff and financial resources) (University of Birmingham, 2014).

2. RELATED WORKS
An ITS service consists of three key elements: Service Level Agreements (SLAs), Operational Level Agreements (OLAs) and service catalogue pages. Operational Level Agreements (OLAs) are agreements between ITS teams, such as the hardware, software and networking teams, on how they will collaborate to ensure the appropriate service level is met for a particular service under the supervision of a coordinator; an OLA defines the expectations and commitments needed to deliver Service Level Agreements (SLAs) (University of California, 2012). Service Level Agreements (SLAs) are agreements between the Information Technology Services (ITS) team or teams and their clients which define the level of service the client should receive. An IT service catalogue is a database mapping an institute's available technological resources, products and IT services, both on offer and about to be rolled out (Griffiths, Lawes, & Sansbury, 2012; Moeller, 2013). The ITS service catalogue divides the services offered at an institute into components, together with the policies, guidelines and responsibilities of the parties involved, SLAs and delivery conditions (Bon et al., 2007). The service catalogue should be readily accessible to authorised users, enable them to create service requests on behalf of themselves and others, and contain facilities for approving service requests. IT service catalogues should be tested by both IT and key users so that the product complies with the prescribed technical functionality and usability metrics. The catalogue should be developed so that it facilitates effective communication between IT management and the stakeholders involved and acts as an effective tool for good governance (Griffiths et al., 2012; Moeller, 2013). Basically, an IT service catalogue is divided into a business service catalogue and a technical service catalogue. A business service catalogue is client centric and must meet users' requirements, so the user community should be engaged in requirements gathering and design. A technical service catalogue, by contrast, is service-provider centric and focuses on describing specific services in IT terms, including service constructs and their
interrelationships. The work processes of IT managerial and technical staff are explicitly defined, and access to the technical service catalogue is mainly restricted within the organization (Troy, Rodrigo, & Bill, 2007). An SLA should consist of the following elements: placement of services into categories (sections of the catalogue); listing of each category as a service catalogue section; establishing integrated/packaged/bundled service products; identification of modular service products; definition of each service product; establishing the service owner and supplier; defining procurement procedures (how, and at what cost); specifying service level metrics (availability, reliability, response); defining the limits of the service; and defining the customer's responsibilities. It thus provides a basis for managing the relationship between the service provider and the customer, describing the agreement between them for the service to be delivered, including how the service is to be measured (Hiles, 2000). A service must bridge the developers' and engineers' point of view and the end-user's perspective, and it identifies the internal processes necessary to offer and maintain the services. Service change management and continuous process improvement are important in addressing stakeholders' needs (University of California, 2012).
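The SLA elements listed above can be thought of as fields of a single record per service. The following is a minimal sketch of such a record; all field names and example values are illustrative assumptions, not taken from the paper or from any ITIL specification.

```python
from dataclasses import dataclass, field

# Hypothetical record capturing the SLA elements named in the text:
# service definition, catalogue category, owner, service level metrics,
# limits of service, and customer responsibilities.
@dataclass
class ServiceLevelAgreement:
    service: str                       # definition of the service product
    category: str                      # catalogue section the service falls under
    owner: str                         # established service owner/supplier
    availability_pct: float            # service level metric: availability
    max_response_hours: float          # service level metric: response
    service_limits: list = field(default_factory=list)
    customer_responsibilities: list = field(default_factory=list)

sla = ServiceLevelAgreement(
    service="E-Learning platform",
    category="Learning services",
    owner="Software section",
    availability_pct=99.5,
    max_response_hours=4.0,
    service_limits=["business-hours support only"],
    customer_responsibilities=["report incidents via the service desk"],
)
print(sla.service, sla.availability_pct)
```

A structured record like this makes the measurement clause of the SLA concrete: each metric field names what is measured, and the lists bound the relationship between provider and customer.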
A service lifecycle basically focuses first on defining a service strategy and then maintaining and implementing it; second on service design, which covers the methodology and architectural design for offering the service; third on service transition, which covers testing and integration of the services offered for quality and control compliance; and finally on service operation, which covers the smooth running of daily IT services, with continual improvement aligning the lifecycle stages and so offering room for best practices and improved value delivery (Office of Government Commerce, 2010). A Service Level Agreement (SLA) is a blueprint which governs service provision parameters between the service provider and the client (University of California, 2012). An SLA mainly consists of: the services being provided by the IT service provider and how they will be delivered (they must meet user requirements and the standards agreed upon by the parties involved, and be attainable, so communication is key throughout); definitions of key performance parameters; assignment of IT service provider personnel and users to measure performance using specific metrics (continuously monitoring, managing and measuring service level commitments); and identification of rewards or penalties levied depending on whether service delivery is effective or the services are failing (SLA metrics should have performance buffers to allow recovery from breaches) (Dube & Gulati, 2005; Lahti & Peterson, 2007).
4. METHODOLOGY
The research questions in this study examine ITS personnel's service delivery in relation to SLAs, OLAs and ITSCs. The research approach is the way the researcher approaches the research: either data is gathered and a theory formulated, or a theory and hypotheses are developed and then tested or validated. An inductive approach was adopted since it allowed the researchers to develop a theory during analysis of the collected data (Saunders, Lewis, & Thornhill, 2009). The researchers used questionnaires since they facilitated saturation; the questionnaires were distributed in proportion to the personnel in each ITS department team: 20 in the hardware section, 7 in the software section and 7 in the networking section. The response rates were 80%, 71.43% and 85.71% respectively. The data was coded manually.

5. RESULTS
The hardware section team is not aware of any agreements with the software team and the networking team which ensure that the appropriate service level is met for particular services within the ITS department. If OLA agreements are put in place, personnel felt that the ITS department director and/or other senior officers should facilitate and maintain them, since they increase efficiency and allow work processes to be aligned with organizational objectives. The hardware section team is also not aware of any agreements with the software and networking teams which define the level of service students and staff members should receive; these, they felt, should be led by the chief technician. Personnel act on intuition when called upon to perform work and tasks, or restrict themselves to those in their job descriptions.
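The response rates reported in the methodology follow directly from the questionnaire counts. The returned counts (16, 5 and 6) are inferred here from the stated percentages; the paper gives only the distributed counts and the rates.

```python
# Check the reported response rates against the questionnaire counts.
# Returned counts are inferred from the stated percentages, not given
# explicitly in the paper.
distributed = {"hardware": 20, "software": 7, "networking": 7}
returned = {"hardware": 16, "software": 5, "networking": 6}

rates = {team: round(100 * returned[team] / distributed[team], 2)
         for team in distributed}
print(rates)  # {'hardware': 80.0, 'software': 71.43, 'networking': 85.71}
```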
All respondents agreed that the adoption of SLAs will improve service delivery to clients and help set boundaries on personnel's duties and how they execute them with confidence. Furthermore, it results in process standardization and improved accuracy in the execution of tasks. 10% of respondents strongly agree, 60% agree, 15% are neutral and 15% disagree that the use of SLAs will improve and differentiate services by defining performance and its measures, which will help in building actionable performance tracking and controls. There is no policy on the IT services currently on offer and ready to be delivered; respondents felt these should be monitored by the supervisors responsible for the specific services offered. In hardware maintenance, personnel from other departments are called upon to carry out all related activities on an ad-hoc basis. ITSCs offer a platform to evaluate whether the services being offered meet the required standard. Top
management, such as directors and supervisors, are key stakeholders in the implementation of IT service management. The networking section team does not have any agreements with the software and hardware teams to ensure the appropriate service level is met for particular services in the ITS department. The service level which students and staff should receive is not defined, such as the uptime and download speed available on both the wireless and wired networks. The staff portal services and the students' electronic learning (e-learning) accounts monitored by the software team depend on network availability and server capacity, which are the responsibilities of the networking and hardware sections respectively, even though there are no OLAs among the departments concerned. Staff and students are only informally consulted on their requirements for the services offered by the ITS department. Students and staff members should be given a platform to request additional 'add-on' functionality for their e-learning and staff portal accounts.

IT service management model
A university-wide IT service management model was developed which consists of the Operational Level Agreements, viewed as the cornerstone of IT service management implementation; the Service Level Agreements, the sub-domain linking OLAs and ITSCs; and finally the IT service catalogues, referred to as the nucleus of IT service management. Leadership support from personnel such as IT directors, project managers and chief IT technicians is important, since they will initiate the setting of specific benchmarks for performance measurement and facilitate effective feedback mechanisms and communication. Top management will help in organizing seminars or workshops in the form of refresher courses or awareness campaigns about the execution of work processes.
Explicitly defining OLAs will aid management in identifying key services and processes in both qualitative and quantitative form, while monitoring them and taking corrective measures where necessary (SLAs). Once SLAs are defined, ITSCs can be formulated; these are both customer and IT-service-provider centric and act as the nucleus of the model. The services offered should be end-user centric rather than framed from the provider's point of view: for example, the website should be easy to navigate, and there must be a distinction between administrative issues and the other information displayed on the homepage. Support services, including how to access the website using mobile phones and which mobile browsers are supported or compatible, should be made available to clients. Additionally, key future plans such as a general upgrade of the site (the time it is expected to be down during maintenance should be communicated),
upgrading to a mobile site, modification of functionality on the webpage, and the phasing out of specific services should be communicated. Figure 1 shows the developed model.

Figure 1: IT service management implementation model. The model comprises:
OPERATIONAL LEVEL AGREEMENTS (IT service provider centric): definition of the services required to deliver services; explicitly defined responsibilities of the IT service provider and recipient.
SERVICE LEVEL AGREEMENTS: identify key services and processes to achieve the required goal; define services in qualitative and quantitative form; monitor the key services and processes while corrective measures are taken where necessary.
SERVICE CATALOGUE (customer centric): details of the services and products on offer; reports on website availability (response time, uptime percentage, etc.); support services (e.g. installation of preliminary software, mobile browser support/types of compatible mobile phones); key policies; terms and conditions; Service Level Agreements (SLAs); key future plans (upgrading to mobile, modification of functionality, phasing out of a service, etc.).
OLA DRIVING FORCES: leadership support; setting specific performance benchmarks; rewards and recognition, or penalties, in response to adopting OLAs; education and awareness campaigns for ITS department section personnel; ensuring an effective feedback mechanism and communication.
6. CONCLUSIONS
An enabling, collaborative approach to quality improvement should be explored by the ITS teams while involving their clients (staff and students) so that their needs are satisfied. In achieving ITSM, goals must be benchmarked and reviewed by a monitoring and evaluation committee steered by the project manager. The committee must ensure the availability of human and financial resources, for example by lobbying for top management support and for the training of employees. In addition, the committee should facilitate a cyclical communication system with stakeholders and top management so as to ensure their support and commitment even during the review process. The institution's goals, vision and mission should be aligned with the ITSM strategy adopted. A service catalogue, which acts as a blueprint for clients in understanding and making informed decisions about the services they use or intend to use, must always be made available to clients; it also acts as a benchmark for quality assurance on the services the ITS department offers. OLAs between the IT service provider and the procurement or other departments, to obtain hardware or other resources in agreed times, and between a service desk and a support group, to provide incident resolution in agreed times, should be defined to ensure the appropriate service level is met (Rudd, 2010). Adoption of OLAs will result in better service delivery and better management of duties and responsibilities. Universities must integrate the various IT teams within departments across their campuses while explicitly defining the implementation of SLAs, OLAs and ITSCs, and must also emphasise performance reporting facilitated by team leaders from all IT sections.
Additionally, institutes must identify the facilitating and hindering conditions for successful ITSM; this can be supported by conducting seminars and/or workshops on relevant IT aspects. Conducting post-training evaluation of ITSM deliberations will help in the continuous improvement of service delivery. Relating COBIT and ITIL to the IT service management constructs (OLAs, SLAs and ITSCs) presents an interesting area for further research.

REFERENCES
[1] Almeroth, K. C. and Hasan, M., 2002. Management of Multimedia on the Internet: 5th IFIP/IEEE International Conference on Management of Multimedia Networks and Services, MMNS 2002, Santa Barbara, CA, USA, October 6-9, 2002. CA: Springer, p. 356.
[2] Bon, J. van et al., 2007. IT Service Management: An Introduction. Van Haren Publishing, p. 514.
[3] Dube, D. P. and Gulati, V. P., 2005. Information System Audit and Assurance. Tata McGraw-Hill Education, p. 671.
[4] Griffiths, R., Lawes, A. and Sansbury, J., 2012. IT Service Management: A Guide for ITIL Foundation Exam Candidates. BCS, The Chartered Institute for IT, p. 200.
[5] Hiles, A., 2000. Service Level Agreements: Winning a Competitive Edge for Support & Supply Services. Rothstein Associates Inc, p. 287.
[6] Lahti, C. B. and Peterson, R., 2007. Sarbanes-Oxley IT Compliance Using Open Source Tools. Syngress, p. 466.
[7] Moeller, R. R., 2013. Executive's Guide to IT Governance: Improving Systems Processes with Service Management, COBIT, and ITIL. John Wiley & Sons, p. 416.
[8] Office of Government Commerce, 2010. Introduction to the ITIL Service Lifecycle. The Stationery Office, p. 247.
[9] Rudd, C., 2010. ITIL V3 Planning to Implement Service Management. The Stationery Office, p. 320.
[10] Saunders, M., Lewis, P. and Thornhill, A., 2009. Research Methods for Business Students. 5th ed. Pearson Education Limited, Essex, England.
[11] Troy, D. M., Rodrigo, F. and Bill, F., 2007. Defining IT Success Through the Service Catalog: A Practical Guide about the Positioning, Design and Deployment of an Actionable Catalog of IT Services. 1st ed. US: Van Haren Publishing.
[12] University of Birmingham, 2014. IT Services - University of Birmingham. [Online] Available at: <http://www.birmingham.ac.uk/university/professional/it/index.aspx> [Accessed 18 Mar. 2014].
[13] University of California, 2012. ITS Service Management: Key Elements. [Online] Available at: <http://its.ucsc.edu/itsm/servicemgmt.html> [Accessed 18 Mar. 2014].

This paper may be cited as: Zhou, M., Ruvinga, C., Musungwini, S. and Zhou, T. G., 2014. A Model for Implementation of IT Service Management in Zimbabwean State Universities. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 58-65.
Present a Way to Find Frequent Tree Patterns using Inverted Index
Saeid Tajedi
Department of Computer Engineering, Lorestan Science and Research Branch, Islamic Azad University, Lorestan, Iran
Hasan Naderi
Department of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

ABSTRACT
Among all the patterns occurring in a tree database, mining frequent trees is of great importance. A frequent tree is one that occurs frequently in the tree database. Frequent subtrees are not only important in themselves but are also applicable to other tasks, such as tree clustering, classification, bioinformatics, etc. In this paper, after reviewing different methods of searching for frequent subtrees, a new method based on an inverted index is proposed for exploring frequent tree patterns. The procedure runs in two phases: passive and active. In the passive phase, we find subtrees in the dataset, convert them to strings and store them in the inverted index. In the active phase, we easily derive the desired frequent subtrees from the inverted index. The proposed approach tries to take advantage of times when the CPU is idle, so that CPU utilization is at its highest in the evaluation results. Because frequent subtree mining in the active phase is performed against the inverted index rather than directly against the dataset, the desired frequent subtrees are found in the fastest possible time. Another feature of the proposed method is that, unlike previous methods, adding a tree to the dataset does not require repeating the previous steps; in other words, the method performs well on dynamic trees. In addition, the proposed method is capable of interacting with the user.

Keywords: Tree Mining, Inverted Index, Frequent Pattern Mining, Tree Patterns.

1. INTRODUCTION
Data mining, or knowledge discovery, deals with finding interesting patterns or information hidden in large datasets. Recently, researchers have started proposing techniques for analyzing structured and semi-structured datasets. Such datasets can often be represented as graphs or trees. This has led to the development of numerous graph mining and tree mining algorithms in the literature. In this article we present an efficient algorithm for mining trees.
Data mining has evolved from association rule mining and sequence mining to tree mining and graph mining. Association rule mining and sequence mining are one-dimensional structure mining, while tree mining and graph mining are two-dimensional or higher structure mining. Applications of tree mining arise in Web usage mining, mining semi-structured data, bioinformatics, and elsewhere. The basic and fundamental ideas of tree mining were first seriously discussed in roughly the early '90s and were completed during that decade; the origin of these ideas lies in their applications, especially on the Web. First, some essential and basic concepts are described; then the proposed method is presented, and finally the results are evaluated.

2. Related Works
2.1 Pre-Order Tree Traversal
There are several ways to traverse an ordered tree; pre-order traversal is one of the most important and most widely used. It proceeds like the depth-first search algorithm: starting from the root of a tree T, we visit the root, then the left child, and finally the right child; this is done recursively on all nodes of the tree.

2.2 Post-Order Tree Traversal
This is also among the most important and widely used traversals of ordered trees. Here we first visit the left child of a tree T, then the right child, and finally the root, again recursively on all nodes of the tree. Using either traversal, we can assign a number to each node that represents the time at which the node is visited. When post-order traversal is used, this number is called the PON (post-order number).
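As a quick illustration of the two traversals, here is a minimal recursive sketch for ordered trees given as nested (label, children) tuples; the representation and function names are our own, for illustration only:

```python
def preorder(node, out=None):
    """Pre-order: visit the root, then the children left to right (DFS order)."""
    if out is None:
        out = []
    label, children = node
    out.append(label)
    for child in children:
        preorder(child, out)
    return out

def postorder(node, out=None):
    """Post-order: visit the children left to right, then the root.
    The 1-based position of a label in the result is the node's PON."""
    if out is None:
        out = []
    label, children = node
    for child in children:
        postorder(child, out)
    out.append(label)
    return out

# Example tree: root A with children B (which has child D) and C.
t = ("A", [("B", [("D", [])]), ("C", [])])
# preorder(t)  -> ["A", "B", "D", "C"]
# postorder(t) -> ["D", "B", "C", "A"]
```

The 1-based position of a label in the post-order result is exactly the PON described above; in the example, D gets PON 1 and the root A gets PON 4.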
2.3 RMP and LMP
LMP is the acronym for Left-Most Path, the path from the root to the leftmost leaf; RMP is the acronym for Right-Most Path, the path from the root to the rightmost leaf.

2.4 Prüfer Sequence [23]
This algorithm was introduced in 1918 and is used to convert a tree into a string. It works as follows: given a tree T, at every step the leaf with the smallest label is removed and the label of its parent is appended to the Prüfer sequence. This process is repeated n-2 times, until 2 nodes remain.
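A minimal sketch of the classical Prüfer encoding for an unrooted labeled tree given as an edge list (the function name and tree representation are illustrative; the paper later extends the idea to n steps using post-order numbers):

```python
from collections import defaultdict

def prufer_sequence(n, edges):
    """Classical Pruefer encoding: repeatedly remove the leaf with the
    smallest label and record its parent's label, until 2 nodes remain."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seq = []
    for _ in range(n - 2):
        # The leaf (degree-1 node) with the smallest label is removed next.
        leaf = min(node for node in adj if len(adj[node]) == 1)
        parent = next(iter(adj[leaf]))
        seq.append(parent)
        adj[parent].remove(leaf)
        del adj[leaf]
    return seq

# Star with center 4: leaves 1, 2, 3 are removed in label order.
print(prufer_sequence(4, [(1, 4), (2, 4), (3, 4)]))  # -> [4, 4]
```

A tree on n nodes always produces a sequence of length n-2; for the path 1-2-3-4 the result is [2, 3].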
2.5 Label Sequence
The next concept is the label sequence. This sequence is produced according to the post-order traversal: as each node is visited in post-order, its label is appended to the sequence.

2.6 Support
Informally, the support of a pattern S indicates how often S occurs across a database of trees:

    support(S, D) = |{T in D : S occurs in T}| / |D|    (1)

where S is a tree pattern and D is a database of trees. This concept is used to determine in how many of the trees each subtree occurs.

2.7 Inverted Index [24]
An inverted index is a structure used to index frequent string elements in a set of documents. It consists of two main parts: a dictionary and posting lists. Each frequent string element is stored uniquely in the dictionary, together with its total number of occurrences across all documents. Information about each element, such as the names of the documents containing it and the number of occurrences in each document, is stored in its posting list.

3. An overview of research history
In recent years, much research on frequent subtree mining has been done. Yongqiao Xiao et al. in 2003 used the Path Join algorithm and a compact data structure, called FST-Forest, to find frequent subtrees [25]. In this approach, frequent root paths are first found in all directions, and frequent subtrees are then obtained by joining these paths. Shirish Tatikonda et al. published an article in 2006 based on pattern growth [26]: all trees in the tree database are converted to strings, using two different methods (Prüfer sequences and the DFS algorithm); then, scanning all strings containing a subtree (pattern) S, a new edge that can be added to S is sought. Concurrently with the generation of candidate subtrees, their counts are checked against the threshold to decide whether they are frequent.
In 2009, Federico Del Razo Lopez et al. presented an idea for relaxing the tight constraints of non-fuzzy tree mining [27]. The paper uses the principle of partial inclusion: to say that a pattern S occurs in a tree T, it is not necessary for all of the pattern's nodes to exist in the tree. The proposed algorithm uses the Apriori property to prune undesirable patterns.
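Before turning to the proposed approach, the dictionary / posting-list structure of Section 2.7 can be sketched as follows (a simplified in-memory version; names are illustrative, and real systems add compression and other refinements):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: mapping of document name -> list of terms.
    Returns {term: {doc_name: occurrence_count}}: the outer keys form
    the dictionary, and each inner mapping plays the role of the
    posting list (which documents contain the term, and how often)."""
    index = defaultdict(dict)
    for name, terms in docs.items():
        for term in terms:
            index[term][name] = index[term].get(name, 0) + 1
    return index

docs = {"d1": ["a", "b", "a"], "d2": ["b", "c"]}
idx = build_inverted_index(docs)
# idx["a"] == {"d1": 2}; idx["b"] == {"d1": 1, "d2": 1}
```

The total occurrence count kept in the dictionary of Section 2.7 is simply the sum of the per-document counts in the posting list.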
4. The proposed approach
The procedure runs in two phases: passive and active. In the passive phase, we first find all subtrees of all trees and store them in an inverted index. In the active phase, we simply use the index to extract frequent tree patterns.

4.1 Passive Phase
This phase has two stages. In the first stage, we find all subtrees of every tree in the dataset and convert each to a string associated with that tree; in the second stage, the strings produced in the first stage are stored in the inverted index.

4.1.1 First stage of the passive phase
The first important point is that a node label can be repeated many times within a tree, yet every node must be identified uniquely; to solve this, we use the Prüfer sequence method. Each tree is traversed in post-order, and the Prüfer sequence algorithm works on the resulting PONs; as a result, each node of a tree is marked with a unique number. The next issue is that the Prüfer sequence must cover all the nodes; therefore, the algorithm runs for n steps rather than n-2, and the number 0 is used in place of the parent label of the last node (the root). Figure 1 shows an example of this method, where NPS denotes the Prüfer sequence obtained using post-order numbering. Next, every subtree should be represented uniquely; to this end, we compute the CPS for each tree. The CPS merges the Prüfer sequence and the label sequence: CPS(T) = (NPS, LS)(T). A CPS uniquely represents a rooted, labeled tree. As Figure 1 shows, the tree T1 can be represented uniquely by these two complementary strings.

Figure 1. An example of the Prüfer sequence and label sequence for tree T1
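Under one plausible reading of Figure 1, the NPS and LS of a rooted ordered tree (again given as nested (label, children) tuples) can be computed as below. The exact layout of the CPS strings in the paper's figures may differ, so treat this as an assumption-laden sketch:

```python
def cps(root):
    """Compute NPS (each node's parent's post-order number, 0 for the
    root) and LS (node labels), both listed in post-order, for a rooted
    ordered tree given as nested (label, [children]) tuples."""
    pon = {}        # object id of node -> post-order number
    order = []      # (label, parent object id) in post-order
    counter = [0]

    def visit(node, parent_id):
        label, children = node
        node_id = id(node)
        for child in children:
            visit(child, node_id)
        counter[0] += 1          # assign PON when leaving the node
        pon[node_id] = counter[0]
        order.append((label, parent_id))

    visit(root, None)
    nps = [pon.get(p, 0) for _, p in order]   # root's "parent" is 0
    ls = [label for label, _ in order]
    return nps, ls

# Root A with ordered children C and B (three nodes).
t = ("A", [("C", []), ("B", [])])
nps, ls = cps(t)
# post-order is C, B, A -> ls == ["C", "B", "A"], nps == [3, 3, 0]
```

Pairing each label with its parent's PON yields per-node fragments such as C3, B3 and A0, which is consistent with CPS strings like A0C3B3 used later in the paper.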
The next requirement is to ensure that, for each tree, all subtrees are generated and each subtree is generated only once. For this purpose, we use the LMP to extend subtrees. If the tree T is represented by its Prüfer sequence and n is a subtree, then a node v to be added to n must lie on the LMP of T; and since the PON underlies the Prüfer sequence, v must come immediately after the last node of n in the Prüfer sequence of T. This guarantees that each subtree is generated only once, and when it is done for all nodes, every subtree of each tree is produced. The proposed algorithm for generating the subtrees and converting them to strings can be seen in Figure 2.

Insert CPS(T) into array A
for i = n downto 1 do {
    subtree = A[i]
    insert CPS(A[i]) into TreeString_i
    Sub(subtree, i, A, stack1, stack2)
}

Sub(subtree, index, A[], stack1, stack2) {
    c = 0
    t = 0
    for j = 1 to index - 1 do
        if index in A[j] then {
            stack3 = stack1
            stack4 = stack2
            subtree2 = subtree
            while stack3 not empty {
                t++
                pop x from stack3
                pop y from stack4
                subtree2 = subtree2 + x
                if t > 0 then {
                    insert CPS(subtree2) into TreeString_i
                    Sub(subtree2, y, A[], stack3, stack4)
                }
            }
            if c > 0 then {
                push tempTree onto stack1
                push tempIndex onto stack2
            }
            tempTree = A[j]
            tempIndex = j
            c++
            subtree = subtree + A[j]
            insert CPS(subtree) into TreeString_i
            Sub(subtree, j, A[], stack1, stack2)
        }
    while stack1 not empty {
        c--
        pop x from stack1
        pop y from stack2
        insert CPS(subtree + x) into TreeString_i
        Sub(subtree + x, y, A[], stack1, stack2)
    }
}

Figure 2. The algorithm for generating the subtrees and converting them to strings

In the following we examine how the algorithm works with an example. We begin with the first tree and store CPS(T) in the array A; for T1 the array is filled as shown in Figure 3.

Figure 3. Production of the array using CPS(T)

In this step we identify all existing subtrees and store them in a string. To do this, we start from the root node of T1, that is, the last element of the array, A0, and the subtrees branching from this node are stored in the string in turn. First, A0 is stored in the string according to the algorithm; next we run the Sub function. Given that the index of the previous node is 9, to find the subtrees with two nodes we scan from the first element of the array up to the element preceding the previous node, i.e. index 8. Whenever an element's value contains the index of the previous node (9), it is appended to the previous subtree, and the CPS of the newly found subtree is inserted into the tree's string; here A0C2 and A0E2 are stored in the string, and the same steps are repeated recursively for the newly generated subtrees. Since both produced subtrees branch from one node, the subtree with the smaller index and its index are popped from stack1 and stack2 respectively, appended to the subtree with the larger index, and the resulting CPS is stored in the string; in this step A0E3C3 is therefore also added, and this is repeated for all produced subtrees with larger indices in the next step.
Similarly, the work continues recursively until all subtrees branching from the first node of the array are stored in the string. The same procedure is then applied to the next elements of the array until the string of subtrees of the tree is complete, and we then proceed to the next trees until, for each tree, a string covering all of its subtrees has been created.
4.1.2 Second stage of the passive phase
In the second stage of this phase we use the inverted index: the strings created in the previous stage are inserted into it. The CPS of each subtree and its number of occurrences across all trees are stored in the dictionary, and the names of the trees containing the subtree are stored in the corresponding posting list.

Figure 4. Part of the inverted index built for the tree collection T1, T2

As can be seen, the subtrees are stored in the dictionary and the parent trees of the corresponding subtrees are stored in the posting lists.

4.2 Active Phase
In this phase we simply use the inverted index built in the previous phase to extract frequent tree patterns. Various kinds of queries about frequent subtrees can be answered quickly using it. We now examine several types of queries.

4.2.1 Find the occurrences of a desired pattern in the tree set
First we obtain the CPS of the desired pattern, then search for it in the dictionary of the inverted index, and easily extract the number of occurrences and the names of the trees containing the pattern from its posting list. For example, to find the occurrences of the pattern S in the tree collection T1, T2 of Figure 5, we search for CPS(S), i.e. A0C3B3, in the inverted index; the result is T1 and T2.
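The second stage of the passive phase and the active-phase queries can be sketched as follows, assuming each tree has already been reduced to the CPS strings of its subtrees (the class and method names are illustrative, not the paper's):

```python
from collections import defaultdict

class SubtreeIndex:
    """Dictionary maps a subtree's CPS string to its posting list:
    the names of the trees that contain that subtree."""
    def __init__(self):
        self.postings = defaultdict(set)
        self.n_trees = 0

    def add_tree(self, tree_name, cps_strings):
        # Indexing a new tree touches only its own subtree strings,
        # so a dynamic tree set needs no re-mining of old trees.
        self.n_trees += 1
        for s in cps_strings:
            self.postings[s].add(tree_name)

    def occurrences(self, cps_string):
        # Query of Section 4.2.1: which trees contain the pattern?
        return sorted(self.postings.get(cps_string, set()))

    def frequent(self, min_support):
        # Query of Section 4.2.2: keep subtrees whose posting-list
        # length relative to the total number of trees reaches the
        # support threshold.  Section 4.2.3 would additionally filter
        # on the node count decoded from each CPS string.
        return sorted(s for s, trees in self.postings.items()
                      if len(trees) / self.n_trees >= min_support)

idx = SubtreeIndex()
idx.add_tree("T1", ["A0", "A0C3B3"])
idx.add_tree("T2", ["A0", "A0C3B3", "D0"])
# idx.occurrences("A0C3B3") -> ["T1", "T2"]
```

With support 1.0 only A0 and A0C3B3 survive; lowering the threshold to 0.5 also admits D0, mirroring the posting-list-length test described above.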
Figure 5. Part of the inverted index built for the tree collection T1, T2

4.2.2 Find frequent subtrees given a support threshold
If we want to find the subtrees whose support is greater than a threshold, we must find the subtrees whose number of occurrences relative to the total number of trees exceeds the support. So we can search the inverted index and easily find the subtrees whose posting-list length relative to the total number of trees is at least equal to the support.

4.2.3 Find frequent subtrees given a support threshold and a minimum number of nodes
In this case, the number of nodes is a criterion in addition to the support, so we search the inverted index and return only the subtrees satisfying two conditions: first, the subtree's length in the dictionary is greater than the minimum number of nodes; second, the length of the corresponding posting list relative to the total number of trees is at least equal to the support.

5. Evaluation
In this section, the proposed method is evaluated from various aspects. We present an experimental evaluation of the proposed approach on synthetic datasets. In the following discussion, dataset sizes are expressed as numbers of trees. In the graphs, the label "Algorithm" denotes the proposed method. The names and details of the synthetic datasets are shown in Table 1.

Table 1. Names and details of the synthetic datasets

Name | Description
DS1  | -T 10 -V 100
DS2  | -T 10 -V 50

As shown in Table 1, the synthetic datasets DS1 and DS2 were generated with the PAFI [28] toolkit developed by Kuramochi and Karypis (PafiGen). Since PafiGen can create only graphs, we extracted spanning trees from these graphs for use in our analysis. We also used minsup to analyze the various factors: if a subtree's number of occurrences is less than the minsup value, the subtree is not indexed in the inverted index.
The minsup value ranges from 1 to infinity; its default value in the proposed algorithm is 1. We also use maxnode in the evaluations: maxnode specifies the maximum number of nodes per subtree in the inverted index. When the number of nodes in a subtree reaches the maxnode value, the proposed algorithm stops producing its subtrees.
The maxnode value ranges from 1 to infinity, and its default value is infinity.

5.1 Evaluating the performance of the proposed method
We first evaluated our proposed algorithm on the two synthetic datasets DS1 and DS2. The performance of the proposed algorithm for frequent tree mining on the synthetic datasets is shown in Diagram 1. In this experiment, minsup equals one and maxnode equals infinity. Since the subtrees are indexed in the passive phase at times when the system is idle, the mining time on the inverted index rises with a gentle slope as the number of trees increases, which clearly shows that the introduced algorithm is scalable.

Diagram 1: The performance of the algorithm on the synthetic datasets (mining time vs. number of trees, 10K-50K, for DS1 and DS2)

5.2 Evaluating the effect of minsup on the number of indexed patterns
We examine the effect of minsup on the number of indexed patterns in Diagram 2. This experiment was done on the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K. In this experiment, maxnode has its default value, i.e. infinity. As the diagram shows, the number of indexed patterns increases exponentially as minsup decreases.

Diagram 2: Effect of minsup on the number of indexed patterns (indexed patterns, log scale, vs. minsup, for DS1 and DS2)
5.3 Evaluating the effect of maxnode on memory usage
We examine the effect of the maximum number of nodes in the indexed subtrees on memory usage in the passive phase. This experiment was done on the synthetic datasets DS1 and DS2 generated by PAFI, with size 50K. In this experiment, minsup has its default value, i.e. 1. As can be seen, the memory usage of the algorithm increases as the number of indexed nodes per subtree increases.

Diagram 3: Effect of maxnode on memory usage (virtual memory in MB vs. maximum number of nodes per subtree, for DS1 and DS2)

5.4 Evaluation of CPU utilization compared with TreeMiner
Diagram 4 compares the proposed algorithm with TreeMiner, which was introduced by Zaki and is one of the best algorithms for tree mining [29]. This experiment was done on the synthetic dataset DS1 generated by PAFI, with size 50K. Since in the passive phase the proposed algorithm searches for subtrees and adds them to the inverted index, CPU utilization is close to 100 percent in most situations, as the diagram shows, while the average CPU utilization of the TreeMiner algorithm is approximately 90%.

Diagram 4: Comparison of CPU utilization between TreeMiner and the proposed algorithm (CPU utilization in % vs. number of trees, 10K-50K)
6. Conclusions and Recommendations
In this paper, a new method for frequent pattern mining based on the inverted index was introduced to overcome many of the disadvantages of previous methods. One problem with existing approaches is that they mainly operate statically on the set of trees: if a new tree is added, all mining operations must be redone from scratch. This problem is overcome by the inverted index in the proposed approach: all trees are indexed in the passive phase, and if a new tree is added to the tree set at any stage, only that tree is indexed and there is no need to repeat the previous operations. This gives the algorithm high performance on a collection of dynamic trees. Another advantage of this method over other methods is its scalability: as shown in Section 5.1, the performance of the algorithm does not degrade as the tree set grows. As shown in Section 5.4, one of the most striking features of this algorithm is its efficient use of the CPU. The method also supports user interaction. As shown in Section 5.2, the number of indexed patterns increases exponentially as minsup decreases, while patterns with low occurrence counts generally do not matter to us; as a result, we can speed up indexing in the passive phase by choosing an appropriate minsup value. As shown in Section 5.3, memory usage increases with the maximum number of nodes in the indexed subtrees, while subtrees with very large numbers of nodes usually do not matter to us; as a result, we can manage memory usage by choosing an appropriate maxnode value.

REFERENCES
[1] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets based on WIT-trees," International Journal of Advanced Computer Research, p. 9, 2013.
[2] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of Semi Structured data: a Survey," International Journal of Advanced Computer Research, p. 5, 2013.
[3] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[4] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[5] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and Data Mining, p. 13, 2013.
[6] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[7] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets from Sparse Data," Web-Age Information Management, p. 7, 2013.
[8] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent pattern mining over data streams," Advances in Knowledge Discovery and Data Mining, p. 15, 2014.
[9] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering Patterns of Reposting Behavior in Microblog," Advanced Data Mining and Applications, p. 13, 2013.
[10] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering weight conditions over data streams," Advances in Knowledge Discovery and Data Mining, 2014.
[11] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p. 12, 2013.
[12] B. Vo, F. Coenen, and B. Le, "A new method for mining Frequent Weighted Itemsets based on WIT-trees," International Journal of Advanced Computer Research, p. 9, 2013.
[13] L. A. Deshpande and R. S. Prasad, "Efficient Frequent Pattern Mining Techniques of Semi Structured data: a Survey," International Journal of Advanced Computer Research, p. 5, 2013.
[14] A. M. Kibriya and J. Ramon, "Nearly exact mining of frequent trees in large networks," Data Mining and Knowledge Discovery (DMKD), p. 27, 2013.
[15] G. Pyun, U. Yun, and K. H. Ryu, "Efficient frequent pattern mining based on Linear Prefix tree," International Journal of Advanced Computer Research, p. 15, 2014.
[16] C. K.-S. Leung and S. K. Tanbeer, "PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data," Advances in Knowledge Discovery and Data Mining, p. 13, 2013.
[17] A. Fariha, C. F. Ahmed, C. K.-S. Leung, S. M. Abdullah, and L. Cao, "Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs," Advances in Knowledge Discovery and Data Mining, p. 12, 2013.
[18] J. J. Cameron, A. Cuzzocrea, F. Jiang, and C. K. Leung, "Mining Frequent Itemsets from Sparse Data," Web-Age Information Management, p. 7, 2013.
[19] G. Lee, U. Yun, and K. H. Ryu, "Sliding window based weighted maximal frequent pattern mining over data streams," International Journal of Advanced Computer Research, p. 15, 2014.
[20] H. He, Z. Yu, B. Guo, X. Lu, and J. Tian, "Tree-Based Mining for Discovering Patterns of Reposting Behavior in Microblog," Advanced Data Mining and Applications, p. 13, 2013.
[21] U. Yun, G. Lee, and K. H. Ryu, "Mining maximal frequent patterns by considering weight conditions over data streams," International Journal of Advanced Computer Research, 2014.
[22] B. Kimelfeld and P. G. Kolaitis, "The complexity of mining maximal frequent subgraphs," Proceedings of the 32nd symposium on Principles of database systems, p. 12, 2013.
[23] H. Prüfer. Prüfer sequence. Available: http://en.wikipedia.org/wiki/Pr%C3%BCfer_sequence
[24] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge, England: Cambridge University Press, 2008.
[25] Y. Xiao, J.-F. Yao, Z. Li, and M. H. Dunham, "Efficient data mining for maximal frequent subtrees," Proceedings of 3rd IEEE International Conference on Data Mining, p. 8, 2003.
[26] S. Tatikonda, S. Parthasarathy, and T. Kurc, "TRIPS and TIDES: New Algorithms for Tree Mining," Proceedings of 15th ACM International Conference on Information and Knowledge Management (CIKM), p. 12, 2006.
[27] F. D. R. Lopez, A. Laurent, P. Poncelet, and M. Teisseire, "FTMnodes: Fuzzy tree mining based on partial inclusion," Advanced Data Mining and Applications, pp. 2224-2240, 2009.
[28] M. Kuramochi and G. Karypis, PAFI. Available: http://glaros.dtc.umn.edu/gkhome/pafi/overview/
[29] M. J. Zaki, "Efficiently Mining Frequent Trees in a Forest," Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD), Edmonton, Canada, p. 10, 2002.

This paper may be cited as: Tajedi, S. and Naderi, H., 2014. Present a Way to Find Frequent Tree Patterns using Inverted Index. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 66-78.
An Approach for Customer Satisfaction: Evaluation and Validation

Amina El Kebbaj and A. Namir
Laboratory of Modeling and Information Technology, Department of Mathematics and Computer Science, Faculty of Sciences Ben M'sik, Hassan II-Mohammedia University, Casablanca - 7955, Morocco

ABSTRACT
The main objective of this work is to develop a practical approach to improving customer satisfaction, which is generally regarded as the pillar of customer loyalty to the company. Today, customer satisfaction is a major challenge. In fact, listening to the customer and anticipating and properly managing his claims are cornerstones and fundamental values for the enterprise. In terms of the quality of the product, the skills, and above all the service provided to the customer, it is essential for organizations to differentiate themselves, especially in an ever more competitive world, in order to ensure a higher level of customer satisfaction. Ignoring or failing to take into account customer satisfaction can have harmful consequences for both economic performance and the organization's image. It is therefore crucial to develop new methods and approaches to the problem of customer dissatisfaction by improving the quality of the services provided to the customer. This work describes a simple and practical approach to modeling customer satisfaction in organizations in order to reduce the level of dissatisfaction; the approach respects the constraints of the organization and eliminates any action that can lead to loss of customers or degradation of the organization's image. Finally, the approach presented in this document is tested and evaluated.

Keywords: Approach, evaluation, quality, satisfaction, test of homogeneity, validation.

1. INTRODUCTION
"Does the company have the most meaningful information at the right time to make the best possible business decisions?" is the question most companies want to answer. "The purpose of a company is to create and keep a customer" (Levitt, 1960): this statement clearly identifies the important phases of the customer-management life cycle, namely acquiring customers and ensuring their loyalty. Companies are moving towards "customer-oriented" management and focus on the life cycle of their customers. According to Moisand (2002), the life cycle of the customer is defined as the time interval during which a customer's status changes from "new customer" to "lost/former customer".
In the context of a globalized and very competitive market, where departments have moved from classical, cost-centered management to a value-centered approach, the mission of decision-makers has evolved from proposing services and strategic partnerships to value creation. To achieve this goal it is necessary to have all the data needed to shed light on the past and clarify the present, in order to predict the future while avoiding gray areas (caused by a lack of information). Business intelligence includes all the IT solutions (methods, facilities and tools) used to pilot the company and help make decisions. This approach can be modeled by the three systems below:
1. Decision system: think, decide and control;
2. Effective system: transform and produce;
3. Information system: links the decision system with the effective system. Its main purposes are:
 Generating information
 Memorizing information
 Broadcasting information
 Processing information.

Figure 1. The information system

The information system is the subsystem of the organization responsible for collecting, storing, processing and broadcasting information in the effective system and the decision system. In the effective system, the information is a current view of business data (invoices, purchase orders, etc.); in the decision system, the information is more synthetic because it should support decision making (for example, the list of the 3 products least sold in January 2014). So the information system links these two subsystems and must bring to all organizational actors of the company the information they need to act and
decide. The IS is thus a representation of reality; it serves to coordinate the activities of the company. This work is situated in this spirit: it consists in contributing to maximizing the company's customer satisfaction, that is, proposing an approach that eliminates any form of customer loss inside an organization, then evaluating and validating the approach, and finally testing the homogeneity of the problem in order to measure customer satisfaction and conduct corrective actions based on two dimensions of quality:
 The "made" quality Q_r: do the product, process or service conform to what was defined as expected? It comprises the various evaluations used to judge the achievement of the target processes, to measure the effects and to check whether the desired results were achieved.
 The "perceived" quality Q_p: what level of satisfaction is generated for the customer? It is defined by the excellence of the product (Zeithaml, 1988).
The ultimate goal is to have Q_r = Q_p.

Figure 2. The company's qualities

The introduction has defined the conceptual framework of the work. It presented the issue addressed and the contributions in the domain of company governance. The remainder is composed of 3 sections: in Section 2, we present the approach, which is then statistically evaluated on concrete examples; in Section 3, we test the homogeneity of the problem; the conclusion summarizes this study and our contribution, and outlines various extensions and possible future work.
2. PROPOSED APPROACH
The Standish Group (Valery, 2001) conducted an international study evaluating the success and failure of IT projects. The data accumulated over the past ten years are based on a sample of 50,000 projects. This study identified three levels of evaluation of a project:
 The success of a project: characterized by a system delivered on time, at a cost within budget and fully compliant with the specifications;
 The failure of a project: characterized by the cessation of the project;
 Finally, the partial success or partial failure of a project: characterized by the late delivery of a system that is only partially responsive, especially in terms of business scope and specifications, at a cost of up to 200% of the original budget.
Only 29% of projects were successful, 53% were partial successes or partial failures, and 18% failed. The proportion of projects abandoned, over budget or late reaches 71%.

2.1 Statement
This study shows that customer satisfaction is not always achieved; making perceived quality tend towards the desired quality presents a real challenge. Within the company, quality is increasingly focused on customer satisfaction. To win contracts, business leaders rely more on quality than on price advantages. Staff involvement, together with listening to the customer, is a key element in the success of a quality approach. The latter is the implementation of all the resources available to an establishment to provide a service that meets the needs and expectations of customers. From the customer's perspective, a warm welcome and quality service are "normal"; it is the lack of quality that penalizes him. To attract the customer, we must establish standards within the company by identifying the market's needs.
There are international standards that ensure safe, reliable and high-quality products and services: the ISO standards. For companies, they are strategic tools for lowering costs, increasing productivity, and reducing waste and errors; obtaining a certification is the preferred way of demonstrating the quality of their organization to their customers and their suppliers.
2.2 Steps of the approach
Below are the 7 best practices for customer satisfaction:
a) To develop the team's skills: provide additional training on IT tools to raise the team's skills.
b) To make customer satisfaction a challenge for the whole company: the company can use the dissatisfaction of its customers to improve its products and services. Bill Gates of Microsoft said that "the unhappy customers are the best sources of information", because customers who express dissatisfaction enable companies to identify and resolve service defects faster. Dissatisfied customers are very expensive for companies: the cost of recruiting a new customer is usually five times higher than the cost of retaining an acquired one, so it is far better to work to keep existing customers than to recruit new ones to replace those who leave. According to Jacques-Antoine Granjon, founder of Vente-privee.com, the treatment of customer dissatisfaction should therefore not be considered only as a cost but as an investment.
c) To motivate teams: to mark clearly the importance of customer satisfaction, some companies have introduced a variable part in the pay of some employees, calculated on the basis of indicators related to customer satisfaction.
d) To facilitate customer contacts: there are 5 types of communication channels:
 Telephone: availability (24/7), time saving;
 Face to face: immediate response, human contact;
 E-mail: traceability (written proof);
 Website: simplicity;
 Postal mail.
e) To anticipate dissatisfaction: whatever the quality of claims processing, it may be better to anticipate a claim and make a gesture to customers who had a bad product experience, or where this risk exists, without waiting for them to complain.
f) To measure customer satisfaction (evaluate to improve): today it is essential to regularly assess the level of achievement of the final goal, customer satisfaction.
For example, after the close of each case, a satisfaction survey designed by the customer service can be sent to all customers who experienced dissatisfaction, measuring the accessibility of the service, the reception, and the understanding and treatment of the dissatisfaction.
g) To reach out to customers on the Internet: the benefit may also be provided on the Internet by another customer or a social network (Twitter, Facebook, ...). Make social media a true extension of customer service, with employees able to participate in discussions and respond directly to customer requests on these media.
3. EVALUATION AND VALIDATION OF THE APPROACH
Consider the case of a service company that manages the work of a large potential customer, "France Gas". The latter signed a contract with the host company specifying the clauses that must be respected; among them is the customer satisfaction rate, which should reach 92%. This percentage is established by agreement between the two parties and, if it is not met, a penalty is applied for customer dissatisfaction. A development team of the host company supports the realization of applications for "France Gas". This team should produce 22 applications monthly, and the dissatisfaction rate should not exceed 8% (about 2 applications per month). Customer dissatisfaction is due to the following causes:
 The application does not answer the need, or generates unexpected errors after delivery;
 Timeout (late delivery).
To avoid these situations, companies have an interest in implementing a continuous improvement process whose ultimate goal is the elimination of all forms of waste, such as customer dissatisfaction. The problem to be solved is, for a period Pn, to maximize the number of satisfied customers. To evaluate the approach, we test it on a sample. We start by stating our statistical hypotheses (H0 and H1):
 The first, the null hypothesis H0: "Qr = Qp", where Qr is the desired proportion of customer satisfaction and Qp is the real percentage of satisfaction.
 The second, the alternative hypothesis H1: "Qp < Qr".
3.1 Before the approach
3.1.1 Example 1: April 2013
The team was able to process only 10 applications that month. Each customer sent feedback presenting his degree of satisfaction. There are 3 kinds of response: S (Satisfied), NS (Not Satisfied), N (Neutral).
Table 1. Customer's feedback of April 2013
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. PipRep 2.0 FR | NS | timeout
2. Contextor 2.8 FR | NS | timeout
3. Contextor 2.2.3 | S |
4. Hermes Horizon | S |
5. Agent SSR 2011 | NS | application does not work correctly
6. Plugin SSR 2011 | NS | application does not work correctly
7. Agent Altiris 2011 | S |
8. GECO 1.17.3 FR | NS | timeout
9. Nexthink Collector | S |
10. Cosmocom 4 FR 1.0 | S |
Once the feedback is received, we compute the monthly satisfaction rates, as shown in the following table:
Table 2. Satisfaction rates of April 2013
S (satisfied): 5 (50%)
NS (unsatisfied): 4 (40%)
N (neutral): 1 (10%)
This table can be represented by the following figure:
Figure 3. Customer satisfaction of April 2013 (pie chart: S 50%, NS 40%, N 10%)
PS(t0) = P(Xt0 = S) = 0.5
PNS(t0) = P(Xt0 = NS) = 0.4
PN(t0) = P(Xt0 = N) = 0.1
With Qr = 92% and the hypotheses H0: "Qr = Qp" and H1: "Qp < Qr", we use a one-tailed (left) test. If
(f − Qr) / √(Qr(1 − Qr)/n) > −tα
then we accept the hypothesis H0 and reject H1 with error risk α = 5%. The critical value tα is read from the table of the normal distribution: P(−tα ≤ T ≤ tα) = 1 − α = 0.95 gives tα = 1.645, while the Student distribution table gives tα = 1.833 for n = 10. With Qr = 92% and, from the example, f = 50%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.5 − 0.92) / √(0.92(1 − 0.92)/10) = −0.42 / 0.0857 = −4.9 < −1.645
So we accept the hypothesis H1: "Qp < Qr" and reject H0: "Qr = Qp" with error risk α = 5%: the observed difference is significant.
3.2 After the approach
3.2.1 Example 2: December 2013
The team treated 22 applications, as shown in the following table:
Table 3. Customer's feedback of December 2013
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. MSC_CASP69 | NS | timeout
2. MSC_MDX | NS | timeout
3. Woodmac | S |
4. Whoswho | S |
5. Adobe Air Installer | S |
6. WinZip | S |
7. MSC_SetupDemdet | S |
8. Jabber | S |
9. TrendMicro_Office | S |
10. ORG+ | S |
11. QlikView | S |
12. Q4-Engica | N |
13. TMS | N |
14. MSCLink_Core | S |
15. MIPS | S |
16. Rsclientprint | NS | application does not work correctly
17. TextPad | S |
18. MSC_DMX | S |
19. MSC_MSCOMCT2 | NS | timeout
20. Add-in Excel | S |
21. Pre-req Excel | S |
22. Ios | S |
We compute the monthly satisfaction rates, as shown in the following table:
Table 4. Satisfaction rates of December 2013
S (satisfied): 16 (72.72%)
NS (unsatisfied): 4 (18.18%)
N (neutral): 2 (9.09%)
This table can be represented by the following figure:
Figure 4. Customer satisfaction of December 2013 (pie chart: S 73%, NS 18%, N 9%)
PS(t0) = P(Xt0 = S) = 0.727
PNS(t0) = P(Xt0 = NS) = 0.181
PN(t0) = P(Xt0 = N) = 0.091
With Qr = 92% and, from the example, f = 72%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.72 − 0.92) / √(0.92(1 − 0.92)/22) = −1.09 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The difference observed between Qp and Qr is due to sampling fluctuations.
3.2.2 Example 3: January 2014
The team treated 21 applications, as shown in the following table:
Table 5. Customer's feedback of January 2014
Application | Satisfaction (S, NS, N) | Reason of dissatisfaction
1. Windows6.1-KB2574819 | S |
2. MigrationAssistantTool | NS | the installation must be silent
3. See Electrical Viewer 4 | S |
4. Adobe_Flash_Player | S |
5. MSC_DEPOT | S |
6. Colibri 2.0 | S |
7. Navision | S |
8. OFFICE 2013 | S |
9. Windows6.1-KB2592687 | S |
10. CheckPoint VPN | S |
11. Interlink_MSCLink | S |
12. CrystalReportsRuntime | N |
13. InterlinkComponentOne | S |
14. MSXML | S |
15. VisualC++Redistributable | S |
16. ReportViewer_2010 | NS | application does not work correctly
17. .Net_Framework | S |
18. MSCLink_Core | S |
19. MSCLink_Configuration | NS | timeout
20. LDOC | S |
21. MigrationAssistantTool | S |
We compute the monthly satisfaction rates, as shown in the following table:
Table 6. Satisfaction rates of January 2014
S (satisfied): 17 (80.95%)
NS (unsatisfied): 3 (14.28%)
N (neutral): 1 (4.76%)
This table can be represented by the following figure:
Figure 5. Customer satisfaction of January 2014 (pie chart: S 81%, NS 14%, N 5%)
PS(t0) = P(Xt0 = S) = 0.81
PNS(t0) = P(Xt0 = NS) = 0.14
PN(t0) = P(Xt0 = N) = 0.05
With Qr = 92% and, from the example, f = 80%:
(f − Qr) / √(Qr(1 − Qr)/n) = (0.8 − 0.92) / √(0.92(1 − 0.92)/21) = −0.64 > −1.645
With the Student distribution we have tα = 1.721, so this is also verified. So we accept the hypothesis H0: "Qr = Qp" and reject H1: "Qp < Qr" with error risk α = 5%. The observed difference is due to sampling fluctuations.
4. TEST OF HOMOGENEITY
We are faced with two samples for which it is most often not known whether they come from the same source population, and we seek to test whether they share the same characteristic ℓ. Two values ℓ1 and ℓ2 are observed; the difference between them may be due either to sampling fluctuations or to a difference between the characteristics of the two original populations. That is to say, from the examination of two samples of sizes n1 and n2, extracted respectively from populations P1(M1; α1) and P2(M2; α2), these tests are used to decide between:
H0: "ℓ1 = ℓ2" (we conclude homogeneity);
H1: "ℓ1 ≠ ℓ2" (we conclude heterogeneity).
In our case we test the homogeneity of 2 proportions:
f1 = proportion of units having the character X in sample 1;
f2 = proportion of units having the character X in sample 2;
p1 = proportion of units having the character X in population 1;
p2 = proportion of units having the character X in population 2.
We test H0: "p1 = p2 = p" against H1: "p1 ≠ p2". The common proportion p is replaced by the pooled estimator:
f = (n1 f1 + n2 f2) / (n1 + n2) = (22 × 0.72 + 21 × 0.81) / (22 + 21) = 0.764
x = (0.81 − 0.72) / √(0.764 × (1 − 0.764) × (1/22 + 1/21)) ≈ 0.69
which does not exceed the 5% critical value. So we conclude the homogeneity of the proposed solution: the population is homogeneous, and the observed difference is not significant, being due to sampling fluctuations.
5. CONCLUSIONS
The work done develops a practical and pragmatic approach to maximize customer satisfaction in an organization over a given period. An approach has been proposed, and its evaluation and validation are described above. This work opens the way towards diverse research perspectives situated on two planes: a deepening of the realized research, and an extension of the research domain. In terms of deepening the proposed work, it would be interesting first to use Markov chains to model the proposed approach statistically, and to propose or develop practical tools for its implementation. As for extending the research domain, it would be interesting to connect this approach to the governance of information systems and to a decision-making system that investigates the options and compares them in order to choose an action supporting the decision.
REFERENCES
[1] Buffa, E. Operations Management, 3rd Ed., New York, John Wiley & Sons, 1972.
[2] Fitzsimmons, J. A. and Fitzsimmons, M. J. Service Management: Operations, Strategy and Information Technology, 3rd Ed., New York, Irwin/McGraw-Hill, 2001.
[3] Adhiri, Z., Arezki, S. and Namir, A. What is Application LifeCycle Management?, International Journal of Research and Reviews in Applicable Mathematics and Computer Science, ISSN: 2249-8931, December 2011.
[4] http://hal.archives-ouvertes.fr/docs/00/71/95/35/PDF/2010CLF10335.pdf
[5] Stevenson, W. J. Introduction to Management Science, 2nd Ed., Burr Ridge, IL, Richard D. Irwin, 1992.
[6] Hillier, F. S., Hillier, M. S. and Lieberman, G. J. Introduction to Management Science: A Modeling and Case Studies Approach with Spreadsheets, New York, Irwin/McGraw-Hill, 2000.
[7] El Kebbaj, A. and Namir, A. Modeling customer's satisfaction. Day of Science Engineers, Faculty of Science Ben M'Sik, Casablanca, July 29, 2013.
[8] http://www.projectsmart.co.uk/docs/chaos-report.pdf
[9] http://info.informatique.entreprise.over-blog.com/article-approche-du-systeme-d-information-dans-l-entreprise-69885381.html
[10] http://www.hamadiche.com/Cours/Stat/Cours5.pdf
[11] Arezki, S. ITGovA: proposal of a new approach to the governance of information systems. PhD in Computer Science, defended at the Faculty of Sciences of Ben M'Sik, Casablanca, 24/02/2013.
This paper may be cited as: El Kebbaj, A. and Namir, A., 2014. An Approach for Customer Satisfaction: Evaluation and Validation. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 79-91.
Spam Detection in Twitter - A Review
C. Divya Gowri and Professor V. Mohanraj
Sona College of Technology, Salem
ABSTRACT
Social networking sites have become popular in recent years; among them, Twitter is one of the fastest growing. It plays the dual role of Online Social Network (OSN) and micro-blogging service. Spammers invade Twitter trending topics (popular topics discussed by Twitter users) to pollute the useful content. Social spamming is more successful than email spamming because it exploits the social relationships between users. Spam detection is important because Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Spammers can be detected using content-based and user-based attributes, and traditional classifiers are required for spam detection. This paper focuses on the study of spam detection in Twitter.
Keywords: Social Network Security, Spam Detection, Classification, Content-based Detection.
1. INTRODUCTION
Web-based social networking services connect people who share interests and activities across political, economic, and geographic borders. Online social networking sites like Twitter, Facebook, and MySpace have become popular in recent years. They allow users to meet new people, stay in touch with friends, and discuss everything including jokes, politics, news, etc. Using social networking sites, marketers can reach customers directly; this benefits not only the marketers but also the users, who get more information about the organization and the product. Twitter [1] is one of these social networking sites. Twitter provides a micro-blogging service (the exchange of small elements of content such as short sentences, individual images, or video links) where users can post their messages, called tweets.
A tweet is limited to 140 characters; only HTTP links and text are allowed. A Twitter user is identified by a user name and, optionally, a real name. When a user 'A' starts following other users, their tweets appear on A's page, and A can be followed back if the other user desires. Trending topics in Twitter can be identified with hash tags ('#'). When a user likes a tweet, he/she can 'retweet' that message. Tweets are visible publicly by default, but senders can deliver messages only to their
followers. The '@' sign followed by a username marks a reply to another user. The most common type of spamming in Twitter is through tweets, sometimes via posting suspicious links. Spam [14] can also arrive in the form of direct tweets to your Twitter inbox. Unfortunately, spammers use Twitter as a tool to post malicious links and send spam messages to legitimate users. They also spread viruses or simply compromise the system's reputation. Twitter is widely used for commercial advertisement, and spammers invade the privacy of users and damage their reputation. Attackers advertise on Twitter, offering products with huge discounts or free items; when users try to purchase these products, they are asked to provide account information, which the attackers retrieve and misuse. Therefore, spam detection in any social networking site is important.
2. RELATED WORKS
McCord et al. [1] proposed user-based and content-based features to facilitate spam detection.
User Based Features
The user-based features considered are the number of friends, the number of followers, user behaviors (e.g. the time periods and frequencies at which a user tweets) and the reputation of the user (based on followers and friends). The reputation of a user j is given by the equation
R(j) = ni(j) / (ni(j) + no(j))   (2.1)
where ni(j) represents the number of followers of user j and no(j) represents the number of friends user j has. According to the Twitter Spam and Abuse Policy, 'if a user has a small number of followers compared to the number of people the user is following, then it may be considered a spam account'. Spammers tend to be most active during the early morning hours, while regular users tweet much less then.
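The reputation feature of equation (2.1) is straightforward to compute; a minimal sketch (the function name and the follower/friend counts are illustrative):

```python
def reputation(followers, friends):
    """Reputation R(j) = ni(j) / (ni(j) + no(j)) from equation (2.1):
    followers divided by followers plus friends."""
    total = followers + friends
    return followers / total if total else 0.0

# A likely spam account: follows many users, has few followers
print(reputation(10, 990))   # 0.01
# A balanced, regular-looking account
print(reputation(500, 500))  # 0.5
```

Accounts that follow far more users than follow them back score close to zero, which is exactly the pattern the Twitter Spam and Abuse Policy flags.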
Content Based Features
The content-based features [11] considered in this approach are the number of Uniform Resource Locators (URLs), replies/mentions, keywords/word weight, retweets, and hash tags. A retweet is a reposting of someone's post; it is like a normal post carrying the original author's name, and it helps share the entire tweet with all of one's followers. Tweets containing '#' mark the popular topics being discussed by the users.
Secondly, they compare four traditional classifiers, namely Random Forest, Support Vector Machine (SVM), Naïve Bayes and K-nearest neighbor, which are used to detect spammers. Among these classifiers, Random Forest is found to be the most effective, but it was evaluated only on an imbalanced data set (a data set with more regular users than spammers).
Alex Hai Wang [2] considered the 'follower-friend' relationship in his paper, modeling a 'directed social graph'. The author considers content-based and graph-based features to facilitate spam detection.
Graph Based Features
A social graph is modeled as a directed graph G = (V, A), where V is the set of nodes representing user accounts and A is the set of arcs connecting the nodes. An arc a = (i, j) represents user i following user j. A follower corresponds to the incoming links (in-links) of a node, i.e. people following you, whom you need not follow back. A friend corresponds to the outgoing links (out-links), i.e. people you are following. A mutual friend is a follower and a friend at the same time. When there is no connection between two users, they are considered strangers.
Fig 2.1 A Simple Twitter Graph
In the figure, user A is following user B, and user B and user C are following each other; i.e., user B and user C are mutual friends, and user A and user C are strangers. The graph-based features considered are the number of followers, the number of friends, and the reputation of a user. The classifier used in this paper to detect spam is the Naïve Bayes classifier [10]. It is based on Bayes' theorem, given by the equation
P(Y|X) = P(X|Y) P(Y) / P(X)   (2.2)
The Twitter account is represented as a feature vector X, and each account is assigned one of two classes Y, spam or non-spam; the assumption is that the features are conditionally independent.
This classifier is easy to implement and requires only a small amount of training data. However, the conditional independence assumption may lead to a loss of accuracy, since the classifier cannot model dependencies between features.
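A toy version of such a Naïve Bayes classifier can be sketched in pure Python (a minimal sketch: the feature names and training data are invented for illustration, and Laplace smoothing is added to avoid zero probabilities; a real study would use the account features described above):

```python
from collections import defaultdict

def train_nb(accounts):
    """Tiny Bernoulli Naive Bayes: count boolean feature values
    per class. `accounts` is a list of (features_dict, label)."""
    counts = defaultdict(lambda: defaultdict(int))
    labels = defaultdict(int)
    for feats, label in accounts:
        labels[label] += 1
        for name, value in feats.items():
            counts[label][(name, value)] += 1
    return counts, labels

def predict_nb(model, feats):
    """Pick the class maximizing P(Y) * prod P(x_i | Y),
    with add-one (Laplace) smoothing."""
    counts, labels = model
    total = sum(labels.values())
    best, best_p = None, -1.0
    for label, n in labels.items():
        p = n / total
        for name, value in feats.items():
            p *= (counts[label][(name, value)] + 1) / (n + 2)
        if p > best_p:
            best, best_p = label, p
    return best

# Invented training data over two boolean account features
data = [({"has_url": True,  "many_hashtags": True},  "spam"),
        ({"has_url": True,  "many_hashtags": True},  "spam"),
        ({"has_url": False, "many_hashtags": False}, "ham"),
        ({"has_url": False, "many_hashtags": True},  "ham")]
model = train_nb(data)
print(predict_nb(model, {"has_url": True, "many_hashtags": True}))  # spam
```

The independence assumption shows up directly in the product over features, which is also why the classifier cannot capture correlations between them.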
Twitter Account Features
Zi Chu et al. [13] review some of the classification features used to detect spammers. These include tweet-level features and account-level features. The tweet-level features include the spam content proportion, i.e. the tweet text is checked against a spam word list and the final landing URL is checked. The account-level features include the account profile, i.e. the short self-description text and homepage URL, which are checked for spam words.
Fabricio Benevenuto et al. [3] considered the problem of detecting spammers. In their paper, approximately 96% of legitimate users and 70% of spammers were correctly classified. As in [1], user-based and content-based attributes are considered. To measure detection accuracy, a confusion matrix is introduced.
Fig 2.2 An Example of Confusion Matrix
In this matrix, 'a' is the number of spam correctly classified, 'b' is the number of spam wrongly classified as non-spam, 'c' is the number of non-spam wrongly classified as spam, and 'd' is the number of non-spam correctly classified. For effective classification, some evaluation metrics are considered: precision, recall, and F-measure (Micro-F1, Macro-F1).
Evaluation Metrics:
Precision: the ratio of the number of users correctly classified as spammers to the total number of users predicted as spammers:
Precision, p = a / (a + c)   (2.3)
Recall: the ratio of the number of spammers correctly classified to the total number of actual spammers:
Recall, r = a / (a + b)   (2.4)
F-measure: the harmonic mean of precision and recall:
F-measure = 2pr / (p + r)   (2.5)
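Equations (2.3)-(2.5) translate directly into code. A minimal sketch; the example counts are invented to echo the roughly 96%/70% classification rates reported above, not the paper's actual confusion matrix:

```python
def metrics(a, b, c, d):
    """Precision, recall and F-measure from the confusion matrix of
    Fig 2.2: a = spam correctly classified, b = spam missed,
    c = non-spam flagged as spam, d = non-spam correctly classified."""
    precision = a / (a + c)
    recall = a / (a + b)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Illustrative: 70 of 100 spammers caught, 4 of 100 legitimate flagged
p, r, f = metrics(a=70, b=30, c=4, d=96)
```

Note that `d` does not enter precision, recall, or F-measure; it matters only for accuracy-style metrics.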
The classifier used to detect spam is the SVM, a state-of-the-art classification method; this approach uses a non-linear SVM with the Radial Basis Function kernel, which allows the SVM to learn complex boundaries. The biggest limitations of the support vector approach lie in the choice of the kernel and its high algorithmic complexity. This approach mainly focuses on detecting spam instead of spammers, so that it can be useful in filtering spam: once a spammer is detected, it is easy to suspend that account and block the IP address, but spammers continue their work from other, new accounts.
Puneeta Sharma and Sampat Biswas [4] proposed two key components: (1) identifying the timestamp gap between two successive tweets and (2) identifying tweet content similarity. They found two common techniques used by spammers: (1) posting duplicate content with small modifications of the tweet; (2) posting spam within short intervals. Their spam identification approach includes BOT activity detection and a tweet similarity index. Twitter data can be filtered in various ways, by user id or by keyword; many spammers post spam messages using a BOT (a computer program), reducing the frequency between consecutive tweets. To calculate the timestamps between tweets, they first cluster tweets by user id and sort them by increasing timestamp.
Fig 2.3 BOT activity detection (cluster tweets by user id, compute the time difference between consecutive tweets; a gap below 10 s indicates spam, otherwise non-spam)
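The flow of Fig 2.3 can be sketched as follows (a minimal sketch; the function name, the tuple-based tweet representation, and the sample data are illustrative, while the 10-second gap comes from the figure):

```python
from collections import defaultdict

def flag_bot_activity(tweets, min_gap=10.0):
    """Cluster tweets by user id, sort by timestamp, and flag users
    whose gap between consecutive tweets ever drops below `min_gap`
    seconds (the BOT-detection step of Fig 2.3)."""
    by_user = defaultdict(list)
    for user_id, timestamp in tweets:
        by_user[user_id].append(timestamp)
    flagged = set()
    for user_id, times in by_user.items():
        times.sort()
        if any(t2 - t1 < min_gap for t1, t2 in zip(times, times[1:])):
            flagged.add(user_id)
    return flagged

tweets = [("bot", 0), ("bot", 3), ("bot", 6),   # 3 s apart
          ("human", 0), ("human", 120)]         # 2 min apart
print(flag_bot_activity(tweets))  # {'bot'}
```

Only the account posting within the 10-second window is flagged; as the paper notes below, sophisticated spammers defeat this by deliberately spacing their posts.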
Spammers can be classified as (1) desperate spammers and (2) sophisticated spammers. Desperate spammers use automatic programs to post multiple tweets with a small time difference between posts; sophisticated spammers create a time gap between tweets. Spammers mostly post duplicate tweets in trending topics, jumbling the words between tweets, reusing sets of words, including numbers in the topic, or appending commercial advertisements to the topic. The tweet similarity index approach determines the behavior of spammers and filters spam. Tweets are first clustered by user id, and each user's set of tweets is processed independently. Buckets of similar tweets are created by calculating the Jaccard and Levenshtein similarity coefficients, so that the most similar tweets end up together in clusters of similar text. Once all the tweets are collected, the size of each bucket is checked; if it is greater than one, the bucket is considered spam.
Fig 2.4 Tweet Similarity Index (cluster tweets by user id, compute Jaccard and Levenshtein distances, create buckets of similar tweets; a bucket of size greater than one indicates spam, otherwise non-spam)
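The bucketing procedure of Fig 2.4 can be sketched as follows (a minimal sketch for one user's tweets; the greedy bucketing strategy and the thresholds `j_min` and `l_max` are our illustrative choices, and the two measures themselves are defined formally in the paper):

```python
def levenshtein(s, t):
    """Minimum number of single-character edits turning s into t
    (classic dynamic-programming formulation)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity of two tweets' token sets: |A∩B| / |A∪B|."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b)

def spam_buckets(tweets, j_min=0.8, l_max=5):
    """Greedily bucket near-duplicate tweets; buckets with more than
    one tweet are treated as spam, as in Fig 2.4."""
    buckets = []
    for tw in tweets:
        for bucket in buckets:
            if (jaccard(tw, bucket[0]) >= j_min
                    or levenshtein(tw, bucket[0]) <= l_max):
                bucket.append(tw)
                break
        else:
            buckets.append([tw])
    return [b for b in buckets if len(b) > 1]

dup = spam_buckets(["win a free phone now",
                    "win a free phone today",
                    "lunch was great"])
print(len(dup))  # 1
```

The two near-duplicate promotional tweets fall into one bucket of size two and are flagged, while the unrelated tweet stays alone.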
Levenshtein distance
The Levenshtein distance is a string metric measuring the difference between two sequences of text. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions, and substitutions) required to change one word into the other. The phrase 'edit distance' is often used to refer to the Levenshtein distance. The distance is zero if the strings are equal. For example, the Levenshtein distance between "sitter" and "sitting" is 3:
sitter → sittir (substitution of "i" for "e")
sittir → sittin (substitution of "n" for "r")
sittin → sitting (insertion of "g" at the end)
The Levenshtein distance is used to find duplicate tweets, i.e. if two tweets are duplicates, then the distance is zero.
Jaccard index
The Jaccard index, also called the Jaccard similarity coefficient, is used for comparing the diversity and similarity of sample sets:
J(A, B) = |A ∩ B| / |A ∪ B|   (2.6)
The Jaccard distance measures the dissimilarity between sample sets and is obtained by subtracting the Jaccard coefficient from 1:
dJ(A, B) = 1 − J(A, B)   (2.7)
Dolvara Gunatilaka [9] discusses two privacy issues. The first is the user's identity, or user anonymity; the second concerns the user profile and personal information leakage.
User anonymity
In many social networking sites, users use their real name to represent their account. There are two methods to expose a user's anonymity: (1) the de-anonymization attack and (2) the neighborhood attack [15]. In the first, the user's anonymity can be revealed through history stealing and group membership information; in the second, the attacker finds the neighbors of the victim node.
Regarding user profiles and personal information, attackers are attracted by personal details such as name, date of birth, contact information, relationship status, current work and education background. Information can leak because of poor privacy settings: many profiles are made public, so anyone can view them. Next is
leakage of information through third-party applications. Social networking sites provide an Application Programming Interface (API) for third-party developers to create applications; once users access these applications, the third party can access their information automatically.
Social Worms
Among the worms discussed, the Twitter worm is one of the most popular. 'Twitter worm' is a term describing worms that spread through Twitter. There are many versions; two worms discussed in this paper are the following.
Profile Spy worm: this worm spreads by posting a link that downloads a third-party application called "Profile Spy" (a fake application). When users try to download the application, they must fill in some personal information, which allows the attacker to obtain it. Once an account is infected, it continuously tweets malicious messages to its followers.
Google worm: this worm uses a shortened Google URL that tricks users into clicking the link. The fake link redirects users to a fake anti-virus website, which displays a warning saying the computer is infected and lets the user download the fake antivirus, which is actually malicious code.
Sender Receiver Relationship
Jonghyuk Song et al. [7] propose a spam filtering technique based on the sender-receiver relationship. The paper addresses two problems in detecting spam. First, account features can be fabricated by spammers. Second, account features cannot be collected until a number of malicious messages have been reported for the account. Their spam filter therefore does not use account features but relational features, i.e. the connectivity and the distance between the sender and the receiver, which are difficult for spammers to manipulate.
Since Twitter limits a tweet to 140 characters, spammers cannot put much information in the text itself; for this reason, spammers resort to posting URLs leading to spam. Messages are classified as spam based on the sender; content filtering is not effective in Twitter because tweets contain only a small amount of text.
Restrictions in Twitter
Some of the restrictions considered in Twitter [9] are: the user must not follow a large number of users in a short time.
a. Unfollowing and following someone repeatedly.
b. A small number of followers compared to the number of accounts followed.
c. Duplicate tweets or updates.
d. Updates consisting only of links.
The distance between two users is calculated as follows [5][6]: when two users are directly connected by an edge, the distance is one, which means the two users are friends; when the distance is greater than one, they have common friends but are not friends themselves. The connectivity represents the strength of the relationship, and one way to measure it is to count the number of paths. Hence, the connectivity between a spammer and a legitimate user is weaker. The problem with this system is that it identifies messages as normal if they come from infected friends; sometimes attackers send spam messages from legitimate accounts after stealing passwords.
D. Karthika Renuka and T. Hamsapriya [8] note that unsolicited email, also called spam, is one of the fastest growing problems associated with the Internet. Among the many proposed techniques, Bayesian filtering is considered an effective one against spam; it works on the probability of words occurring in spam and legitimate mails. Many spam detection systems, however, use keyword lists to detect spam mails; in that case misspellings arise, and the blacklist needs to be constantly updated, which is difficult. For this purpose, a word stemming (or hashing) technique is proposed, which improves the efficiency of content-based filters; such filters are useless if they do not understand the meaning of the words. Two techniques are employed to find spam content.
The filter uses Bayes' theorem to detect spam content. Word stemming or word hashing technique [12]: this filter extracts the stem of a modified word so that the efficiency of detecting spam content can be improved; a rule-based word stemming algorithm is used for spam detection. Stemming is an algorithm that converts a word into a related base form, for example converting plurals into singulars or removing suffixes.
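The two techniques above can be combined: stemming normalises word variants before the Bayesian filter scores them. The sketch below is illustrative only; the suffix list, word counts, and smoothing are hypothetical simplifications, not the rules of [8], [10] or [12].

```python
import math
import re

def stem(word):
    """Minimal rule-based stemmer: strip a few common suffixes so
    variants like 'cheapest' and 'cheaply' map to the stem 'cheap'.
    The suffix list here is a hypothetical, much-reduced rule set."""
    word = word.lower()
    for suffix in ("ing", "est", "ly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def spam_probability(message, spam_counts, ham_counts):
    """Bayes-style score P(spam | words), assuming independent words
    and equal priors. Counts are per-stem tallies that would come from
    labelled training mail; add-one smoothing keeps unseen stems from
    zeroing the product."""
    log_spam = log_ham = 0.0
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    for token in re.findall(r"[a-z]+", message.lower()):
        s = stem(token)
        log_spam += math.log((spam_counts.get(s, 0) + 1) / (spam_total + 2))
        log_ham += math.log((ham_counts.get(s, 0) + 1) / (ham_total + 2))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

# Hypothetical training tallies (stem -> occurrence count).
spam_counts = {"cheap": 40, "offer": 30, "click": 30}
ham_counts = {"meeting": 50, "report": 30, "cheap": 5}

score = spam_probability("Cheapest offers, click now", spam_counts, ham_counts)
```

Because "Cheapest" and "offers" are reduced to the trained stems "cheap" and "offer", the message scores close to 1 even though neither surface form appears in the word list; a plain keyword filter would have missed both.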
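The sender-receiver relationship measures reviewed earlier, distance as the shortest path between two users and connectivity as the number of paths between them [7], can be sketched on a small friendship graph. The graph, user names, and path-length bound below are hypothetical illustrations, not data from the surveyed papers.

```python
from collections import deque

def distance(graph, a, b):
    """Shortest hop count between two users (breadth-first search).
    A distance of one means the users are direct friends."""
    seen = {a}
    queue = deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        if node == b:
            return d
        for nbr in graph.get(node, ()):
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return float("inf")  # no connection at all

def count_paths(graph, a, b, max_len):
    """Connectivity estimate: number of simple paths from a to b of
    length <= max_len. More paths means a stronger relationship."""
    def walk(node, visited, length):
        if node == b:
            return 1
        if length == max_len:
            return 0
        return sum(walk(nbr, visited | {nbr}, length + 1)
                   for nbr in graph.get(node, ()) if nbr not in visited)
    return walk(a, {a}, 0)

# Hypothetical mutual-follow graph; only "dave" follows the spammer back.
g = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "carol"],
    "carol": ["alice", "bob", "dave"],
    "dave": ["carol", "spammer"],
    "spammer": ["dave"],
}
```

Here `distance(g, "alice", "bob")` is 1 (friends) while `distance(g, "alice", "spammer")` is 3, and alice reaches bob by two short paths but the spammer by only one, matching the observation that spammer-to-legitimate connectivity is weak.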
3. CONCLUSIONS
Spammers are a major problem in any online social networking site. Once a spammer is detected it is easy to suspend his or her account or block the IP address, but spammers then try to spread spam from other accounts or IP addresses. Hence it is recommended to check tweets for spam content on the server: if any content matches the spam words present in the data set, the tweet is prevented from being displayed. Accuracy is evaluated in classifying the spam content. Many traditional classifiers exist for separating spammers from legitimate users, but many of them wrongly classify non-spammers as spammers. It is therefore more effective to check tweets for spam content.

REFERENCES
[1] M. McCord and M. Chuah, "Spam Detection on Twitter Using Traditional Classifiers". Lecture Notes in Computer Science, Volume 6906, pp. 175-186, September 2011.
[2] A. H. Wang, "Don't Follow Me: Spam Detection in Twitter". Proceedings of the 5th International Conference on Security and Cryptography (SECRYPT), July 2010.
[3] F. Benevenuto, G. Magno, T. Rodrigues and V. Almeida, "Detecting Spammers on Twitter". CEAS 2010: Seventh Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference, July 2010.
[4] P. Sharma and S. Biswas, "Identifying Spam in Twitter Trending Topics". American Association for Artificial Intelligence, 2011.
[5] "Levenshtein distance", http://en.wikipedia.org/wiki/Levenshtein_distance.
[6] "Jaccard index", http://en.wikipedia.org/wiki/Jaccard_index.
[7] J. Song, S. Lee and J. Kim, "Spam Filtering in Twitter Using Sender-Receiver Relationship". Recent Advances in Intrusion Detection, Lecture Notes in Computer Science, Volume 6961, pp. 301-317, 2011.
[8] D. Karthika Renuka and T. Hamsapriya, "Email Classification for Spam Detection Using Word Stemming".
International Journal of Computer Applications, 1(5), pp. 45-47, February 2010.
[9] "Reporting Spam on Twitter", http://support.twitter.com/articles/64986-reporting-spam-on-twitter.
[10] S. L. Ting, W. H. Ip and A. H. C. Tsang, "Is Naive Bayes a Good Classifier for Document Classification?". International Journal of Software Engineering and Its Applications, Vol. 5, No. 3, July 2011.
[11] R. Malarvizhi and K. Saraswathi, "Content-Based Spam Filtering and Detection Algorithms: An Efficient Analysis & Comparison". International Journal of Engineering Trends and Technology (IJETT), Volume 4, Issue 9, September 2013.
[12] N. S. Kumar, D. P. Rana and R. G. Mehta, "Detecting E-mail Spam Using Spam Word Associations". International Journal of Emerging Technology and Advanced Engineering, Volume 2, Issue 4, April 2012.
[13] Z. Chu, I. Widjaja and H. Wang, "Detecting Social Spam Campaigns on Twitter". Lecture Notes in Computer Science, Volume 7341, pp. 455-472, 2012.
[14] C. Grier, K. Thomas, V. Paxson and M. Zhang, "@spam: The Underground on 140 Characters or Less". Proceedings of the 17th ACM Conference on Computer and Communications Security, ACM, New York, NY, USA, 2010.
[15] B. Zhou and J. Pei, "Preserving Privacy in Social Networks Against Neighborhood Attacks". IEEE 24th International Conference on Data Engineering, April 2008.

This paper may be cited as:
Gowri, C. D. and Mohanraj, V., 2014. Spam Detection in Twitter - A Review. International Journal of Computer Science and Business Informatics, Vol. 14, No. 1, pp. 92-102.