CERTIFICATE
This is to certify that this project titled “Query Pattern Access and Fuzzy Clustering Based
Intrusion Detection System” submitted by Shivam Gupta (2K16/CO/295), Shivam Maini
(2K16/CO/299), Shubham (2K16/CO/309) and Simran Seth (2K16/CO/317) in partial
fulfilment of the requirements for the award of the Bachelor of Technology degree in Computer
Engineering (COE) at Delhi Technological University is an authentic work carried out by the
students under my supervision and guidance.
To the best of my knowledge, the matter embodied in the thesis has not been submitted to any
other university or institute for the award of any degree or diploma.
Ms. Indu Singh
(Assistant Professor)
Department of CSE
Delhi Technological University
DECLARATION
We hereby certify that the work presented in the project entitled “Query Pattern
Access and Fuzzy Clustering Based Intrusion Detection System”, submitted in fulfilment of the
requirements for the award of the degree of Bachelor of Technology to the
Department of Computer Engineering, Delhi Technological University, is an authentic record
of our own work carried out during the period from April 2018 to February 2019, under the supervision
of Ms. Indu Singh (Assistant Professor, CSE Department).
The matter presented in this report has not been submitted by us for the award of any other
degree of this or any other Institute/University.
Shivam Gupta (2K16/CO/295) Shivam Maini (2K16/CO/299)
Shubham (2K16/CO/309) Simran Seth (2K16/CO/317)
ACKNOWLEDGEMENTS
“The successful completion of any task would be incomplete without acknowledging the people
who made it all possible and whose constant guidance and encouragement secured our
success.”
We owe a debt of gratitude to our guide Ms. Indu Singh (Assistant Professor, CSE
Department) for instilling in us the idea of a creative project, helping us in undertaking
this project and for being there whenever we needed her assistance.
We also place on record our sense of gratitude to one and all who, directly or indirectly, have lent
their helping hand in this venture.
We feel proud and privileged in expressing our deep sense of gratitude to all those who have
helped us in presenting this project.
Last but never the least, we thank our parents for always being with us, in every sense.
PROBLEM STATEMENT
The aim of the project is to build an intrusion detection system that provides the following
functionalities:
• The designed system must be able to detect any anomalous behaviour by any user,
raise an alarm and take the necessary response actions against such behaviour.
• The system must be robust to natural variations in legitimate user behaviour.
• The system must detect and prevent insider frauds in a credit card company.
• The system must provide a higher level of access control to critical data items (such as the CVV).
• The designed system should be free from the vulnerabilities of outsider attacks such
as session hijacking, session fixation, data theft etc.
• The system must block all transactions that do not fall under the user’s jurisdiction by
maintaining user behaviour logs.
MOTIVATION
An Anomaly-Based Intrusion Detection System is a system for detecting computerised
intrusions and misuse by monitoring system activity and classifying it as either normal or
anomalous. The classification is based on heuristics or rules, rather than patterns or signatures,
and will detect any type of misuse that differs significantly from normal system operation.
Earlier IDSs relied on hand-coded rules designed by security experts and network
administrators. However, given the requirements and complexities of today’s network
environments, we need a systematic and automated IDS development process rather than purely
knowledge-based engineering approaches that rely only on intuition and experience.
This encouraged us to study some Data Mining based frameworks for Intrusion Detection.
These frameworks use data mining algorithms to compute activity patterns from system audit
data and extract predictive features from the patterns. Machine learning algorithms are then
applied to the audit records that are processed according to the feature definitions to generate
intrusion detection rules.
The Data Mining based approaches that we have studied can be divided into two main
categories:
1. Supervised Learning
a. Association Rule Mining
2. Unsupervised Learning
a. Clustering
OBJECTIVE
The main purpose of our paper is to monitor user access. Our Intrusion Detection System (IDS)
pays special attention to certain semantically critical data elements along with those elements
which can be used to infer them. We present an innovative approach to combine a user’s
historic and present access pattern, and hence classify the incoming transaction as malicious or
non-malicious. Using Fuzzy C-Means, we partition the users into fuzzy clusters. Each of these
clusters contains a set of rules in their cluster profiles. New transactions are checked in the
detection phase using these clusters. The main advantage of our IDS lies in its ability to prevent
inference attacks on Critical Data Elements and take into account the user’s historic behaviour.
ABSTRACT
Hackers and malicious insiders perpetually try to steal, manipulate and corrupt sensitive data
elements and an organization’s database servers are often the primary targets of these attacks.
In the broadest sense, misuse (witting or unwitting) by authorized database users, database
administrators, or network/systems managers constitutes the potential insider threats that our project
intends to address. Insider threats are more menacing because, in contrast to outsiders (hackers
or unauthorised users), insiders have authorised access to the database and have knowledge
about the critical nuances of the database. Database security involves using a multitude of
information security controls to protect databases against breaches of confidentiality, integrity
and availability (CIA). QPAFCS (Query Pattern Access and Fuzzy Clustering System)
involves a plethora of controls: technical, procedural/administrative and physical. We
hence intend to propose an Intrusion Detection System (IDS) that monitors a database
management system and prevents inference attacks on sensitive attributes, by means of auditing
user access patterns.
Keywords: Intrusion Detection, Fuzzy Clustering, User Access Pattern, Insider Attacks,
Dubiety Score
1. INTRODUCTION
Data protection from insider threats is essential to most organizations. Attacks from insiders
could be more damaging than those from outsiders, since in most cases insiders have full or
partial access to the data; therefore, traditional mechanisms for data protection, such as
authentication and access control, cannot be solely used to protect against insiders. Since recent
work has shown that insider attacks are accompanied by changes in the access patterns of users,
user access pattern mining [1] is a suitable approach for the detection of these attacks. It creates
profiles of the normal access patterns of users using past logs of user accesses. New accesses
are later checked against these profiles and mismatches indicate potential attacks.
Access control [2] is a security technique that regulates who can view or use resources in a
computing environment. There are diverse access control systems that perform authorization,
identification, authentication and access approval. Intrusion Detection Systems [3] scrutinise
and unearth surreptitious activities perpetrated by malevolent users. IDSs work by either looking
for signatures of known attacks or deviations of normal activity. Normally, IDS undergo a
training phase with intrusion free data wherein they maintain a log of benign transactions.
Pattern matching [4] is then used to detect whether or not an action is malign. This is called
anomaly-based detection [5]. When attacks are detected using their known “signatures” from
previous knowledge of the attack, it is called signature-based detection [6]. These malicious
actions once detected are then either blocked or probed depending upon the organisation’s
policy. However, IDS need to be dynamic, robust and quick. Different architectures for IDS
function differently and have different measures of performance. Every organisation needs to
make sure that the IDS it uses satisfies its requisites.
Several AD techniques have been proposed to detect anomalous data accesses. Some rely on
the analysis of input queries based on the syntax. Although these approaches are
computationally efficient, they are unable to detect anomalies in scenarios like the following
one. Consider a clerk in an organization who issues queries to a relational database that
typically selects a few rows from specific tables. An access from this clerk that selects all or
most of the rows of these tables should be considered anomalous with respect to the daily
access pattern of the clerk. However, approaches based on syntax only are not able to classify
such access as anomalous. Thus, syntactic approaches have to be extended to take into account
the semantic features of queries such as the number of result rows. An important requirement
is that queries should be inspected before their execution in order to prevent malicious queries
from making changes to the database.
From the technical perspective, the main purpose is to ensure the effective enforcement of
security regulations. Audit is an important technique of examining whether user behaviours in
a system are in conformance with security policies. Many methods audit database processing
by comparing a user’s SQL query expression against some predefined patterns so as to detect
an anomaly. But a malicious query may be dressed up to look benign so as to evade such
syntactic detection. To overcome this shortcoming, the data-centric method further audits
whether the data a user query actually accessed involves any banned information.
However, such an audit concerns a concrete policy rather than an overall view of multiple security
policies. It requires clear audit commands that are articulated by experienced professionals and
much interactive analysis. Since in practice an anomaly pattern cannot be articulated in
advance, it is difficult to detect such fraud with the current audit methods.
Anomaly detection technology is used to identify abnormal behaviours that are statistical
outliers. Some probabilistic methods learn normal patterns, against which they detect
anomalies. But these methods assume that very few users deviate from normal patterns. If
a large number of users are anomalous, the learned normal pattern becomes skewed. These works do
not examine user behaviour from either a historical or an incremental view, which may
overlook some malicious behaviour. Furthermore, if a group of people collude together, it is
difficult to find them with the current methods.
We tackle the insider threat problem using different approaches. We take into consideration
the fact that certain data elements are more critical to the database as compared with other data
elements. Thus, we pay special attention to the security of such critical data elements. We also
recognise the presence of data attributes in a system which can be manipulated to indirectly
influence the crucial data attributes. We address the threat to our critical data elements using
such attributes also.
We investigate a suspected user also from the diachronic view by analysing his/her historical
behaviour. We store a measure denoting how suspicious a user has been. The greater this
measure, the greater the chances of the query being malicious. This measure also solves the
problem of gradually escalating malicious threats, since the historical statistic accumulates
past results.
The main purpose of our project (QPAFCS) is to recognise user access pattern. Our Intrusion
Detection System (IDS) pays special attention to certain semantically critical data elements,
along with those elements which can be used to infer them. We present an innovative approach
to combine a user’s historic and present access pattern and hence classify the incoming
transaction as malicious/non-malicious. Using FCM, we partition the users into fuzzy clusters.
Each of these clusters contains a set of rules in their cluster profiles. In the detection phase,
new transactions are checked against rules in these clusters, and then a suitable action is taken
depending upon the nature of transaction. The main advantage of our IDS lies in its ability to
prevent inference attacks on Critical Data Elements.
The remainder of this work is organized as follows. In Sect. 2, we present prior research related
to this work. Section 3 introduces the fuzzy clustering and belief update framework. Section 4
discusses the approach using examples. In Sect. 5, we discuss how to apply our method in a
practical system. Experimental evaluation is discussed in Sect. 6.
2. RELATED WORK
Numerous researchers are currently working in the field of Network Intrusion Detection
Systems, but only a few have proposed research work in Database IDSs. Several systems for
Intrusion Detection in operating systems and networks have been developed; however, they are
not adequate for protecting databases from intruders [11]. ID systems in databases work at the
query level, transaction level and user (role) level. Bertino et al. described the challenges of
ensuring data confidentiality, integrity and availability and the need for database security, wherein
the need for database IDSs to tackle insider threats was discussed.
Panda et al. [19] propose to employ a data mining approach for determining data dependencies
in the database system. The classification rules reflecting data dependencies are deduced
directly from the database log. These rules represent what data items probably need to be read
before an update operation and what data items are most likely to be written following this
update operation. Transactions that are not compliant to the data dependencies generated are
flagged as anomalous transactions.
Database IDSs include Temporal Analysis of queries and data dependencies among attributes,
queries and transactions. Lee et al. [28] proposed a Temporal Analysis based intrusion detection
method which incorporated time signatures and recorded the update gap of temporal attributes.
Any anomaly in the update pattern of an attribute was reported as an intrusion in the proposed
approach. The breakthrough introduction of association rule mining by Aggarwal et al. [22]
helped in finding data dependencies among data attributes, which was incorporated in the field
of intrusion detection in databases.
During the initial development of data dependency association rule mining, DEMIDS, a misuse
detection system for relational database systems, was proposed by Chung et al. [7]. Profiles
which specified user access patterns were derived from the audit log, and distance metrics were
further applied for recognizing data items. These were used together in order to represent the
expanse of users. But once the number of users for a single system becomes substantial,
maintaining profiles becomes a redundant procedure. Another flaw was that the system assumed
domain information about a given schema.
Hu et al. [16] presented a data mining-based intrusion detection system, which used static
analysis of the database audit log to mine dependencies among attributes at the transaction level and
represented those dependencies as sets of read and write operations on each data item. In
another approach proposed by Hu et al., techniques of sequential pattern mining were
applied on the training log in order to identify frequent sequences at the transaction level. This
approach helped in identifying groups of malicious transactions which individually complied
with the user behavior. The approach was further improved by Hu et al. by clustering legitimate
user transactions into user tasks for the discovery of inter-transaction data dependencies.
The proposed method extends the approach by assigning weights to all the operations on data
attributes. The transactions which did not follow the data dependencies were marked as
malicious. The major disadvantage of user-assigned weights is the fact that they are static and
unrelated to other data attributes. Kamra et al. [27] employed a clustering technique on an
RBAC model to form profiles based on attribute access which represented normal user
behavior. An alarm is raised when anomalous behavior for that role profile is observed.
Bezdek, Ehrlich and Full (1984) proposed the Fuzzy C-Means algorithm. The basic idea behind
this approach is to express the similarity a data point shares with each of the clusters
with the help of a membership function. This measure of similarity
lies between zero and one, signifies the extent of similarity between the data point and the
cluster, and is termed the membership value. The main aim of this technique is to construct
fuzzy partitions of a particular data set.
Y. Yu et al. [29] illustrated a fuzzy logic-based Anomaly Intrusion Detection System. A Naive
Bayes Classifier is used to classify an input event as normal or anomalous. The basis of the
classifier is formed by the independent frequency of each system call from a process under normal
conditions. The ratio of the probability of a sequence coming from a process to the probability of it
not coming from the process serves as the input of a fuzzy system for the classification.
A hybrid approach was described by Doroudian et al. [26] to identify intrusions at both the
transaction and inter-transaction levels. At the transaction level, a set of predefined expected
transactions was specified to the system, and a sequential rule mining algorithm was applied
at the inter-transaction level to find dependencies between the identified transactions. The
drawback of such a system is that sequences with frequencies lower than the threshold value
are neglected. Therefore, infrequent sequences were completely overlooked by the system,
irrespective of their importance. As a result, the True Positive Rate falls for the system.
The above drawback was overcome by Sohrabi et al. [20], who proposed a novel approach,
ODARDM, in which rules were formulated for lower frequency item sets as well. These rules
were extracted using leverage as the rule value measure, which helped capture the interesting data
dependencies. As a result, the True Positive Rate increased while the False Positive Rate
decreased. In recent developments, Rao et al. [21] presented a Query Access detection
approach using Principal Component Analysis and Random Forest to reduce data
dimensionality and produce only relevant and uncorrelated data. As the dimensionality is
reduced, both the system performance and the True Positive Rate increase.
In 2009, Majumdar et al. [15] proposed a comprehensive database intrusion detection system
that integrates different types of evidence using an extended Dempster-Shafer theory. Besides
combining evidence, they also incorporate learning in their system through the application of prior
knowledge and observed data on suspicious users. In 2016, Bertino et al. [14] tackled the
insider threat problem from a data-driven systemic view. User actions are recorded as historical
log data in a system, and the evaluation investigates the data that users actually process. From
the horizontal view, users are grouped together according to their responsibilities, and a normal
pattern is learned from the group behaviours. They also investigate a suspected user from the
diachronic view by comparing his/her historical behaviours with the historical average of the
same group.
Anomaly detection has been an important research problem in security analysis, therefore
development of methods that can detect malicious insider behavior with high accuracy and low
false alarm rates is vital [10]. In this problem layout, McGough et al. [8] designed a system to identify
anomalous user behavior by comparing an individual user’s activities against their own
routine profile, as well as against the organization’s rules. They applied two independent
approaches, machine learning and a Statistical Analyzer, to the data. Results from these two
parts were then combined to form a consensus, which was mapped to a risk score. Their system
showed high accuracy, low false positives and minimal effect on the existing computing and
network resources in terms of memory and CPU usage.
Bhattacharjee et al. proposed a graph-based method that investigates user behavior from two
perspectives: (a) anomaly with reference to the normal activities of the individual user
observed over a prolonged period of time, and (b) the relationship between a user and
his colleagues with similar roles/profiles. They utilized the CMU-CERT dataset in an unsupervised
manner. In their model, the Boykov-Kolmogorov algorithm was used and the result compared with
different algorithms including Single Model One-Class SVM, Individual Profile Analysis, k-
User Clustering and Maximum Clique (MC). Their proposed model was evaluated using the
Area-Under-Curve (AUC) metric, which showed impressive improvement compared to other
algorithms [9].
T. Rashid et al. noted that the parameter learning task in HMMs is to find, given an output
sequence or a set of such sequences, the best set of state transition and emission probabilities.
The task is usually to derive the maximum likelihood estimate of the parameters of the HMM
given the set of output sequences. No tractable algorithm is known for solving this problem
exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch
algorithm or the Baldi-Chauvin algorithm. The Baum-Welch algorithm is a special case of the
expectation-maximization algorithm. If the HMMs are used for time series prediction, more
sophisticated Bayesian inference methods, like Markov chain Monte Carlo (MCMC) sampling,
have proven favorable over finding a single maximum likelihood model, both in terms of
accuracy and stability [12]. Log data are considered high-dimensional data which contain
irrelevant and redundant features. Feature selection methods can be applied to reduce
dimensionality, decrease training time and enhance learning performance.
3. OUR APPROACH
3.1 Basic Notations
Large organisations deal with tremendous amounts of data whose security is of prime interest.
The data in databases comprises attributes describing real-life objects called entities. The
attributes have varying levels of sensitivity, i.e. not all attributes are equally important to the
integrity of the database. As an example, signatures and other biometric data are highly
sensitive data attributes for a financial organisation like a bank, in comparison to others like
name, gender etc. So, unauthorised access to the crucial attributes is of greater concern. Only
certain employees may have access to such data elements, and access by all others must be
blocked instantaneously to ensure confidentiality and consistency of data.
Our proposed model QPAFCS(Query Pattern Access and Fuzzy Clustering System) pays
special attention to sensitive data attributes and they have been referred to as CDE (Critical
Data Elements) in the text. The attributes that can be used to indirectly infer CDEs are also
critical to the functioning of the organisation. For instance, account number of a user may be
used to access the signatures and other crucial details about him. Such attributes have been
referred to as DAE (Directly Associated Elements) in the text.
We propose a two-phase detection and prevention model that clusters users based on similarity
of their attribute access patterns and the types of queries performed by them, i.e. our model
tries to track the user access pattern of each user and further classify it as normal or malicious.
The superiority of our model lies in its ability to prevent unauthorised retrieving and
modification of most sensitive data elements (CDEs). Our model also makes sure that the query
pattern for access of CDEs is specific and fixed for a particular user to avoid data breaches, i.e.
the user is associated with his regular access behaviour. Any deviation from the regular
arrangement may lead to a depreciation of the user’s confidence score and may be representative of
the user’s malicious intent. The following terminologies are used:
Definition 1 (Transaction) A set of queries executed by a user. Each transaction is represented
by a unique transaction ID and also carries the user’s ID. Hence <Uid, Tid> acts as a unique
identification key for each set of query patterns. Each transaction T is denoted as
<Uid, Tid, <q1, q2, … qn>>
where
qi denotes the ith query, i ∈ [1 … n]
For example, suppose a user has id 1001. He/she then executes the following set of SQL
queries:
q1: SELECT a,b,c
FROM R1,R2
WHERE R1.A>R2.B
q2: SELECT P
FROM R5
WHERE R5.P==10
Then this is said to be a transaction of the form:
t=<1001,67,<q1,q2>>
Definition 2 (Query) A query is a standard database management system token/request for
inserting and retrieving data or information from a database table or combination of tables. We
define query as a read or write request on an attribute of the relation. A query is represented as
<O(D1), O(D2), … O(Dn)>
where,
D1, D2, … Dn ∈ Rs
where Rs is the relation schema and the Di are its attributes. O represents an operation, i.e. a read
or write operation, O ∈ {R, W}.
For example, examine the following transaction:
start transaction
select balance from Account where Account_Number='9001';
select balance from Account where Account_Number='9002';
update Account set balance=balance-900 where Account_Number='9001' ;
update Account set balance=balance+900 where Account_Number='9002' ;
commit; //if all SQL queries succeed
rollback; //if any of SQL queries failed or error
The query corresponding to this transaction is:
<<R(Account_Number),R(balance)>, <R(Account_Number),R(balance)>,
<R(Account_Number),R(balance),W(balance)>,
<R(Account_Number),R(balance),W(balance)>>
Definition 3 (Read Sequence) A read sequence is defined as
{R(x1), R(x2), … O(xn)}
where O represents an operation, i.e. a read or write operation, O ∈ {R, W}. The read
sequence represents that the transaction may need to read all data items x1, x2, …, xn-1 before
the transaction performs operation O on data item xn.
For example, consider the following update statement in a transaction:
Update Table1 set x = a + b + c where d = 90;
In this statement, before updating x, values of a, b, c and d must be read and then the new value of x is calculated. So <R(a), R(b), R(c), R(d), W(x)> ∈ RS(x).
Definition 4 (Write Sequence) A write sequence is defined as
{O(x1), W(x2), … W(xn)}
where O represents an operation, i.e. a read or write operation, O ∈ {R, W}. The write
sequence represents that the transaction may need to write the data items x2, …, xn in this order
after the transaction operates on data item x1.
For example, consider the following update statements in one transaction.
Update Table1 set x = a + b + c where a=50;
Update Table1 set y = x + u where x=60;
Update Table1 set z = x + w + v where w=80;
Using the above example, it can be noted that <W(x), W(y), W(z)> is one write sequence of data item x, that is <W(x), W(y), W(z)> ∈ WS(x), where WS(x) denotes the write sequence set of x.
Definition 5 (Read Rules (RR)) Read rules are the association rules generated from Read
sequences whose confidence is greater than the user defined threshold (Ψconf). A read rule is
represented as
{R(x1), R(x2) ...} ⇒ O(x).
For all sequential patterns <R(x1), R(x2), …, R(xn-1), O(xn)> in the read sequence set, generate
the read rules with the format {R(x1), R(x2), …} ⇒ O(xn). If the confidence of the rule is larger
than the minimum confidence (Ψconf), then it is added to the answer set of read rules, which
implies that before operating on xn, we need to read x1, x2, …, xn-1.
For example, the read rule corresponding to the read sequence <R(a), R(b), R(c), R(d), W(x)> is:
{R(a), R(b), R(c), R(d)} ⇒ W(x)
Definition 6 (Write Rules (WR)) Write rules are the association rules generated from write
sequences whose confidence is greater than the user defined threshold (Ψconf). A write rule is
represented as
O(x) ⇒ {W(x1), W(x2) …}
For all sequential patterns <O(x), W(x1), W(x2), …, W(xk)> in the write sequence set, generate the
write rules with the format O(x) ⇒ {W(x1), W(x2), …, W(xk)}. If the confidence of the rule is larger
than the minimum confidence (Ψconf), then it is added to the set of write rules, which depicts that
after updating x, data items x1, x2, …, xk must be updated by the same transaction.
For example, the write rule corresponding to the write sequence <W(x), W(y), W(z)> is
W(x) ⇒ {W(y), W(z)}.
Definition 7 (Critical Data Elements (CDE)) They are semantically defined data elements
crucial to the functioning of the system. They are the data attributes of prime significance
having direct correlation to the integrity of the system. In a vertically hierarchical organisation,
these are the attributes accessed only by the top level management, and the access by lower
levels of hierarchy is strictly protected.
Type of Attribute               Sensitivity Level
Critical Data Elements          Highest
Directly Associated Elements    Medium
Normal Attributes               Low
Table 3.1 Types of attributes and their sensitivity levels
CDEs are tokens of behaviour that our model uses for the recognition of malicious activity by
users of the system.
Definition 8 (Critical Rules (CR)) A set of rules that contain a Critical Data Element in its
antecedent or consequent.
CR = {ζ | (ζ ∈ RR ∨ ζ ∈ WR) ∧ x ∈ CDE ∧ (ζ = ({R(x1), R(x2), …} ⇒ O(x)) ∨ ζ = (O(x) ⇒ {W(x1), W(x2), …}))}
We propose a method of user access pattern recognition using the Critical Rules. CRs
recognise the actions and goals of users from a series of observations of the users’ actions and
the environmental conditions, i.e. the user query pattern associated with the Critical Data Elements.
Definition 9 (Directly Associated elements (DAE)) The attributes except those present in
CDE, which are either part of antecedents or consequents of Critical Rules.
DAE = {μi | μi ∈ CR ∧ μi ∉ CDE}.
The query patterns as perceived by our model QPAFCS are explored using DAEs, which represent
the first level of access to the CDEs. A user’s behaviour is represented by a set of first-order
statements (derived from queries) called an attribute hierarchy, encoded in first-order logic, which
defines abstraction, decomposition and functional relationships between types of access
arrangements. The unit transactions accessing CDEs are decomposed into an attribute hierarchy
comprising DAEs, which further represents the user’s most sensitive retrieval pattern.
Example:
• R(b) → R(a)
• R(b), R(c) → R(a)
If a is a CDE, then the set {b,c} represents DAEs.
Definition 10 (Dubiety Score(φ)) A measure of anomaly exhibited by a user in the past based
on his historic transactional data. This score summarizes the user’s historic malicious access
attempts. Dubiety Score attempts to quantify the personnel vulnerability that the organisation
faces because of a particular user.
Dubiety Score is indicative of the amount of deviation between the user’s access pattern and
his designated role. Dubiety Score combined with the deviation of user’s present query from
his normal behaviour pattern, yields the output of the proposed IDS.
For our paper:
0 ≤ φ ≤ 1 (1)
The higher the Dubiety Score, the greater the evidence against the user following the assigned
role, i.e. the greater the suspected malicious intent (rogue behaviour).
Definition 11 (Dubiety Table) A table maintaining the record of dubiety scores of each user.
It contains two attributes: UserID and Dubiety Score.
The initial Dubiety scores are set to 1.
Uid φ
1001 1
1002 1
1003 1
1004 1
1005 1
Table 3.2 Initial Dubiety Table
The dubiety table is updated each time a user performs query.
For example, let user 1001’s deviation from his normal query pattern be quantified as ds = 0.81.
Then the updated dubiety table is as shown below, where
ds = deviation from the normal query pattern, and
φi = the initial dubiety score,
and the entry of the queried user is updated to √(ds · φi).
Uid     φ = √(ds · φi)
1001 0.9
1002 1
1003 1
1004 1
1005 1
Table 3.3 Updated Dubiety Table
The updated dubiety table is then stored in memory for further processing.
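A minimal Python sketch of this update, assuming the dubiety table is kept as an in-memory dictionary (the layout and function name are illustrative, not part of the original design):

import math

# Dubiety table: user ID -> dubiety score, initialised to 1 for every user.
dubiety = {1001: 1.0, 1002: 1.0, 1003: 1.0, 1004: 1.0, 1005: 1.0}

def update_dubiety(uid, ds):
    """Update a user's dubiety score after a query.

    ds is the quantified deviation of the query from the user's normal
    pattern; the new score is the geometric mean sqrt(ds * phi_i).
    """
    dubiety[uid] = math.sqrt(ds * dubiety[uid])
    return dubiety[uid]

# User 1001 deviates by ds = 0.81: sqrt(0.81 * 1) = 0.9, as in Table 3.3.
update_dubiety(1001, 0.81)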
3.2 Learning Phase
We start our learning phase by reading the training dataset into memory and extracting
useful patterns out of it. Our system requires a non-malicious training dataset composed of
transactions executed by trusted users. The model aims at generating user profiles from the
transaction logs and quantifying deviation from normal behaviour, i.e. this phase aims to
recognise and characterise the user activity pattern on the basis of their query arrangement.
The following are the various components of the architecture of the proposed model:
Fig 3(a) Learning Phase Architecture
COMPONENTS OF ARCHITECTURE:
Training data: A transaction log is a sequential record of all changes made to the database
while the actual data is contained in a separate file. The transaction log contains enough
information to undo all changes made to the data file as part of any individual transaction. The
log records the start of a transaction, all the changes considered to be a part of it, and then the
final commit or rollback of the transaction. Each database has at least one physical transaction
log and one data file that is exclusive to the database for which it was created. Our initial input
to the learning phase algorithm is the transaction log, with only authorised and consistent
transactions. This data is free of any unauthorised activity and is used to form user profiles,
role profiles etc. based on normal user transactions. The logs are scanned, and the following
elements are extracted:
a. SQL Queries
b. The user executing a given query
SQL query parser: This is a tool that takes SQL queries as input, parses them and produces
sequences (read and write) corresponding to the SQL query as output. The query parser also
assigns a unique Transaction ID. The final output consists of three columns: TID (Transaction
ID), UID (User ID) and the read/write sequence generated by the parsing algorithm.
As an example, examine the following transaction performed by user U1001:
start transaction
select balance from Account where Account_Number='9001';
commit; //if all SQL queries succeed
rollback; //if any of SQL queries failed or error
The parser generates a unique Transaction ID say T1234 followed by parsing the transaction.
The parser finally yields:
< T1234,U1001,<R(Account_number),R(balance)>>
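A highly simplified parser sketch in Python is shown below; it handles only the plain SELECT/UPDATE shapes used in the examples (a real parser would use a full SQL grammar), and the regular expressions, function names and transaction-ID scheme are our own illustrative choices:

import itertools
import re

_tid_counter = itertools.count(1234)  # illustrative transaction IDs

def parse_query(sql):
    """Map one simple SELECT/UPDATE statement to a read/write sequence.

    Very naive: every identifier found in the WHERE predicate or the SET
    expression is treated as a read column.
    """
    sql = sql.strip().rstrip(';')
    select = re.match(r"select\s+(.*?)\s+from\s+.*?(?:where\s+(.*))?$", sql, re.I)
    update = re.match(r"update\s+\w+\s+set\s+(\w+)\s*=\s*(.*?)\s+where\s+(.*)$", sql, re.I)
    if update:
        target, expr, where = update.groups()
        reads = re.findall(r"[A-Za-z_]\w*", expr + ' ' + where)
        return [f'R({c})' for c in dict.fromkeys(reads)] + [f'W({target})']
    if select:
        cols = [c.strip() for c in select.group(1).split(',')]
        pred = re.findall(r"[A-Za-z_]\w*", select.group(2) or '')
        return [f'R({c})' for c in dict.fromkeys(pred + cols)]
    return []

def parse_transaction(uid, queries):
    """Assign a fresh transaction ID and parse every query of the transaction."""
    tid = f'T{next(_tid_counter)}'
    return (tid, uid, [parse_query(q) for q in queries])

# Reproduces the example above:
# ('T1234', 'U1001', [['R(Account_Number)', 'R(balance)']])
parse_transaction('U1001', ["select balance from Account where Account_Number='9001'"])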
Frequent sequences generator: After the SQL query parser generates the sequences, the
generated sequences are pre-processed. Weights are then assigned to data items; for instance,
the CDEs are given greater weight as compared to DAEs and other normal attributes. Finally,
these pre-processed sequences are given as inputs to the frequent sequences generator. It
uses the PrefixSpan algorithm to generate frequent sequences out of the input sequences
corresponding to each UID.
Rule generator: The frequent sequences are given as inputs to the rule generator module
which uses association rule mining to generate read rules and write rules out of the frequent
sequences.
As an example, if the input frequent sequences are:
1. <R(m),R(n),R(o),W(a)>
2. <R(m),R(n),W(o),W(a)>
3. <R(m),W(n),W(o),W(a)>
4. <W(a),R(b),W(o)>
5. <R(a),R(b),R(m),W(a)>
6. <R(a),R(b),W(m),W(b)>
S.No.  Frequent Sequence         Associated Rule
1      <R(m),R(n),R(o),W(a)>     R(m),R(n),R(o) → W(a)
2      <R(m),R(n),W(o),W(a)>     R(m),R(n),W(o) → W(a)
3      <R(m),W(n),W(o),W(a)>     R(m),W(n),W(o) → W(a)
4      <W(a),R(b),W(o)>          W(a),R(b) → W(o)
5      <R(a),R(b),R(m),W(a)>     R(a),R(b),R(m) → W(a)
6      <R(a),R(b),W(m),W(b)>     R(a),R(b),W(m) → W(b)
Table 3.4 Rule generation for the given example
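As a sketch, read-rule generation with a confidence threshold (Ψconf, called min_conf below) might look like this in Python; the prefix-based support computation is a simplification of what a full sequential rule miner would do:

from collections import Counter

def read_rules(frequent_sequences, min_conf=0.6):
    """Generate read rules {R(x1), ..., R(xn-1)} => O(xn) from frequent
    sequences, keeping only rules whose confidence exceeds min_conf.

    confidence(A => B) = support(A followed by B) / support(A)
    """
    seq_support = Counter(tuple(s) for s in frequent_sequences)
    rules = []
    for seq, sup in seq_support.items():
        antecedent, consequent = seq[:-1], seq[-1]
        # Support of the antecedent alone: every frequent sequence whose
        # prefix equals the antecedent (a simplification).
        ante_sup = sum(c for s, c in seq_support.items()
                       if s[:len(antecedent)] == antecedent)
        conf = sup / ante_sup if ante_sup else 0.0
        if conf > min_conf:
            rules.append((antecedent, consequent, conf))
    return rules

# Example from Table 3.4: <R(m),R(n),R(o),W(a)> yields the candidate rule
# R(m),R(n),R(o) => W(a).
rules = read_rules([['R(m)', 'R(n)', 'R(o)', 'W(a)'],
                    ['R(m)', 'R(n)', 'W(o)', 'W(a)']], min_conf=0.4)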
DAE generator: In our approach, we semantically define a class of data items known as
Critical Data Elements or CDEs. These CDEs and the mined rules are given as input to our DAE
(Directly Associated Element) generator, which marks as DAE every element that is present in
either the antecedent or the consequent of a rule involving at least one of the CDEs.

Algorithm 1: DAE Generator
Data: CDE, Set DAE = {}, RR = Set of Read Rules, WR = Set of Write Rules
Result: The set of Directly Associated Elements DAE
Function: DAE_Generator(CDE, RR, WR)
    for Ω ∈ RR ∪ WR do
        for α ∈ Ω do
            if α ∈ CDE then
                for β ∈ Ω do
                    DAE ← DAE ∪ {β}
                end
            end
        end
    end
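For concreteness, here is a minimal Python sketch of the same procedure; the rule representation (sets of operation strings) and the helper attr() are our own illustrative choices:

def dae_generator(cde, read_rules, write_rules):
    """Collect Directly Associated Elements: every non-CDE attribute that
    appears in a rule (antecedent or consequent) involving at least one CDE.

    Rules are represented as frozensets of operations such as {'R(b)', 'W(a)'};
    attr() strips the operation to recover the attribute name.
    """
    attr = lambda op: op[2:-1]              # 'R(b)' -> 'b'
    dae = set()
    for rule in read_rules | write_rules:
        if any(attr(op) in cde for op in rule):
            dae |= {attr(op) for op in rule if attr(op) not in cde}
    return dae

# With CDE = {'a'} and the rules R(b) -> R(a) and R(b), R(c) -> R(a),
# the DAE set is {'b', 'c'}, matching the example in Definition 9.
dae = dae_generator({'a'},
                    {frozenset({'R(b)', 'R(a)'}),
                     frozenset({'R(b)', 'R(c)', 'R(a)'})},
                    set())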
User vector generator: Using the frequent sequences for the given audit period, it generates the user vectors. A user vector is of the form
BID = <UID, w1, w2, w3, … wn>
where wi = |O(ai)|.
|O(ai)| represents the total number of times the user with the given UID performs operation O ∈ {R, W} on the attribute ai in the pre-decided audit period. An audit period τ refers to a period of time such as one year, a time window τ = [t1, t2], or the recent 10 months. The user vector is representative of the user’s activity.
Each wi represents how frequently a user performs the operation on the particular data item. It can also be used in a normalized form, as in our proposed model QPAFCS:
UVID = <UID, <p(a1), p(a2), p(a3), … p(an)>>
where

$$p(a_k) = \frac{w_k}{\sum_{w_j \in B_i} w_j}$$

p(ak) is defined as the probability of accessing the attribute ak.
A value of p(ak) close to 1 means that the user accesses the given attribute frequently.
Cluster generator: It takes user vectors and rules as input and generates fuzzy clusters.
Users are clustered into different fuzzy clusters based on the similarity of their user vectors[29].
A cluster profile would include
Ci = <CID, {R}>
where, CID represents the cluster centroid, and
{R} is a set of rules which is formed by taking the union of all the rules that the members of
the given fuzzy cluster abide by.
We have used Fuzzy C-Means [26] clustering to create the clusters. Each user belongs to a cluster to
a certain degree wij, where wij represents the membership coefficient of the ith user (ui) with the jth cluster.
The centre of a cluster (α) is the mean of all points, weighted by their membership
coefficients [28]. Mathematically,
$$w_{ij} = \frac{1}{\sum_{k=1}^{C}\left(\frac{\lVert u_i - \alpha_j \rVert}{\lVert u_i - \alpha_k \rVert}\right)^{\frac{2}{m-1}}}$$

$$\alpha_k = \frac{\sum_{u} w_k(u)^m\, u}{\sum_{u} w_k(u)^m}$$

The objective function that is minimized to create the clusters is defined as:

$$\arg\min \sum_{i=1}^{n}\sum_{j=1}^{C} w_{ij}^{m}\, \lVert u_i - \alpha_j \rVert^{2}$$

where
n is the total number of users,
C is the number of clusters, and
m is the fuzzifier.
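The two update steps above can be sketched in Python as follows; this is a minimal illustration using plain Euclidean distance and assumed defaults (C = 4, m = 2), whereas QPAFCS pairs FCM with the modified Jensen-Shannon distance defined next:

import numpy as np

def fcm(users, C=4, m=2.0, iters=100, tol=1e-5, rng=np.random.default_rng(0)):
    """Minimal Fuzzy C-Means over normalised user vectors (n x d).

    Alternates the two update steps given above: memberships w_ij from the
    current centroids, then centroids alpha_j from the memberships.
    """
    n, d = users.shape
    alpha = users[rng.choice(n, C, replace=False)]        # initial centroids
    for _ in range(iters):
        dist = np.linalg.norm(users[:, None, :] - alpha[None, :, :], axis=2)
        dist = np.fmax(dist, 1e-12)                       # avoid divide-by-zero
        # w_ij = 1 / sum_k (||u_i - a_j|| / ||u_i - a_k||)^(2/(m-1))
        w = 1.0 / np.sum((dist[:, :, None] / dist[:, None, :]) ** (2 / (m - 1)),
                         axis=2)
        # alpha_j = sum_i w_ij^m u_i / sum_i w_ij^m
        wm = w ** m
        new_alpha = (wm.T @ users) / wm.sum(axis=0)[:, None]
        if np.linalg.norm(new_alpha - alpha) < tol:
            break
        alpha = new_alpha
    return w, alpha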
The dissimilarity/distance function used in the formation of fuzzy clusters is the modified
Jensen-Shannon distance [27], which is illustrated as follows. Given two user vectors [13]
UVx = <Ux, <px(a1), px(a2), px(a3), … px(an)>> and
UVy = <Uy, <py(a1), py(a2), py(a3), … py(an)>>
of equal length n, the modified Jensen-Shannon distance is computed as

$$D(UV_x \parallel UV_y) = \frac{1}{2}\sum_{i=1}^{n}\left[\left(1 + p_x(a_i)\,w(a_i)\right)\log_2\frac{1 + p_x(a_i)\,w(a_i)}{1 + p_y(a_i)\,w(a_i)} + \left(1 + p_y(a_i)\,w(a_i)\right)\log_2\frac{1 + p_y(a_i)\,w(a_i)}{1 + p_x(a_i)\,w(a_i)}\right]$$

where w(ai) is the semantic weight associated with the attribute ai.
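A direct transcription of this formula into Python might look as follows; the function name and array-based interface are illustrative:

import numpy as np

def modified_js_distance(px, py, weights):
    """Modified Jensen-Shannon distance between two user vectors.

    px, py are the attribute-access probabilities p(a_i) of the two users
    and weights holds the semantic weights w(a_i) (CDEs weighted highest).
    Each probability is shifted by 1 after weighting, so the logs stay
    finite even when an attribute is never accessed.
    """
    x = 1.0 + np.asarray(px) * np.asarray(weights)
    y = 1.0 + np.asarray(py) * np.asarray(weights)
    return np.sum(x * np.log2(x / y) + y * np.log2(y / x)) / 2.0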
User profile generator: This module takes user vectors and the cluster profiles as input and
generates user profiles. A user profile is of the form
Ui=<UID, < p(a1), p(a2), p(a3) … p(ak) >, < c1, c2, … cC > >
where
UID is a unique ID given to each user,
<p(a1), p(a2), p(a3), … p(an)> is a vector containing the probability of the user accessing each
particular attribute, and
< c1, c2, … cC > is a vector representing the membership coefficients of the given user for C
different clusters.
As an example, consider a system with 4 fuzzy clusters and 4 attributes. The table below illustrates the profile of user U1001.

Inputs:  cluster memberships <C1, C2, C3, C4> = <0.2, 0.2, 0.2, 0.4>
         user vector = <U1001, 0.2, 0.1, 0.9, 0.6>
Output:  user profile = <U1001, <0.2, 0.1, 0.9, 0.6>, <0.2, 0.2, 0.2, 0.4>>
Table 3.5 User profile for the given example
3.3 Testing Phase
In Section 3.2, the learning phase was described, in which the system is trained using non-
malicious, or benign, transactions. Now the trained model can be used to detect malicious
transactions. In this phase, a test query is obtained as input and compared with the model’s
perception of the user’s access pattern, and the model perpetually evaluates whether the test transaction
is malicious. It is first checked whether the user is trying to access a CDE. If yes, the transaction
is allowed only if the given user has accessed that CDE before. Next, it is checked whether any DAE
is being accessed. A user can perform a write operation on a DAE iff it has previously been written by
the same user; otherwise the transaction is termed malicious. Next, we check whether the
transaction abides by the rules that are generally followed by similar users.
COMPONENTS OF THE TESTING PHASE:
Rule generator: This module takes the sequence as generated by the SQL query parser and
gives the rule that the input transaction follows. This can be a read rule or a write rule and
indicates the operations done by the user, data attributes accessed by the user and the order in
which they are accessed. Now this rule can be checked for maliciousness.
CDE Detector: The semantically critical elements, referred to in our approach as CDEs, are
detected in this module. The read/write rule corresponding to the incoming transaction is
checked for the presence of CDEs. If the rule being checked for maliciousness contains a CDE,
then it is dealt with using the following policy:
a. If a read operation has been performed on any CDE, i.e. r(CDE) is present in the rule, and
UV[i][r(CDE)] = 0 and UV[i][w(CDE)] = 0 for the given user, then the transaction is
termed malicious.
b. If a write operation has been performed on any CDE, i.e. w(CDE) is encountered, and
UV[i][w(CDE)] = 0 for the given user, then the transaction is termed malicious.

Algorithm 2: CDE Detector
Data: Set of rules (ϒ) from the test transaction, set χCDE, UID, user profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to CDEs
for Ѓ ∈ ϒ do
    for ϱ ∈ Ѓ do
        if ϱ ∈ χCDE then
            if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
            if r(ϱ) ∈ Ѓ and ϴ[UID][r(ϱ)] == 0 and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
        end
    end
end
Fig 3(b) Architecture of Testing Phase.
DAE Detector: This module addresses the issue of inference attacks on CDEs. As discussed earlier, certain data elements can be used to access the CDEs, i.e. first-order inference. This module uses the rules mined in the learning phase to determine which elements can be used to directly infer the CDEs.
Our system seeks to prevent inference attacks by especially monitoring the DAEs. We lay emphasis on write operations on DAEs. If a write operation has been performed on any DAE, i.e. w(DAE) is present in the rule to be checked, and UV[i][w(DAE)] = 0 for the given user, then the transaction is termed malicious.

Algorithm 3: DAE Detector
Data: Set of rules (ϒ) from the test transaction, set χDAE, UID, user profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to DAEs
for Ѓ ∈ ϒ do
    for ϱ ∈ Ѓ do
        if ϱ ∈ χDAE then
            if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then
                Raise Alarm;
            end
        end
    end
end
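The two detectors can be summarised in one Python sketch; the profile layout (a per-user dictionary of operation counts) is an assumption made for illustration:

def check_rule(rule, uid, profile, cde, dae):
    """Apply the CDE and DAE policies to one rule from a test transaction.

    profile[uid] maps operations such as 'W(a)' to how often the user
    performed them during the audit period (0 means never). Returns
    'malicious' as soon as one policy is violated, else 'pass'.
    """
    attr = lambda op: op[2:-1]
    for op in rule:
        a, seen = attr(op), profile[uid]
        if a in cde:
            # Write on a never-written CDE, or read on a CDE the user has
            # neither read nor written before, raises an alarm.
            if op.startswith('W') and seen.get(f'W({a})', 0) == 0:
                return 'malicious'
            if op.startswith('R') and seen.get(f'R({a})', 0) == 0 \
                    and seen.get(f'W({a})', 0) == 0:
                return 'malicious'
        elif a in dae and op.startswith('W') and seen.get(f'W({a})', 0) == 0:
            return 'malicious'
    return 'pass'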
Dubiety Score Calculator and Analyser: If the transaction has not been found malicious by
the previous two modules, we check whether the transaction is malicious based on the previous
history of the user and the behaviour pattern of all similar users (via the modified Jensen-Shannon
distance). To do so, we maintain a record of the actions of all users by keeping the measure of
the Dubiety Score (φi).
The deviation of a user’s new transaction from his normal access pattern is referred to as dubiety, and the relative measure of dubiety is the Dubiety Score. Our IDS keeps a log of the DS (Dubiety Score) in a separate table. A user who is a potential threat tends to have a high dubiety score. Another intuition that our system follows is that any transaction a user makes matches significantly with either the transactions the same user or similar users have made in the past.
We use a measure ds to keep track of the maximum similarity of the given rule. We combine ds with φi to get the final measure of the dubiety score φf for the given user. We define two thresholds ФLT and ФUT. ФUT represents the upper limit for the dubiety score of a non-malicious user, whereas ФLT denotes the lower limit. This means that if φf for a user comes out to be greater than ФUT, the user is malicious. On the other hand, a φf value less than ФLT denotes a benign user.
• If the incoming rule (R1) is a write rule, then the consequent of the incoming rule is matched with the corresponding rules in the clusters of which the user is a part. A user is said to be a part of the ith cluster iff μi > δ, where μi is the fuzzy membership coefficient of the given user for the ith cluster and δ is a user-defined threshold.
• If the incoming rule (R1) is a read rule, then the antecedent of the incoming rule is matched with the corresponding rules in the clusters of which the user is a part.
• In order to quantitatively measure the similarity between two rules, we use the modified Jaccard distance [30] (see Algorithm 4 and the sketch after it):

$$JD = 1 - \frac{\delta_1\,\lvert R_1 \cap R_2 \rvert - \delta_2\,\left(\lvert R_1 \cup R_2 \rvert - \lvert R_1 \cap R_2 \rvert\right)}{\lvert R_1 \cup R_2 \rvert}$$

computed against every rule R2 of each cluster i with μi > δ, i ∈ [1, k].
Algorithm 4: Modified Jaccard Distance
Data: Rules R1, R2; δ1, δ2; sets χR1, χR2
Result: Distance between the two rules (Ԏ)
Function jcDistance(R1, R2)
    for Ω ∈ R1 do
        χR1 ← χR1 ∪ {Ω};
    end
    for Ω' ∈ R2 do
        χR2 ← χR2 ∪ {Ω'};
    end
    Ԏ = 1 − [δ1·|χR1 ∩ χR2| − δ2·(|χR1 ∪ χR2| − |χR1 ∩ χR2|)] / |χR1 ∪ χR2|;
    return Ԏ;
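A Python transcription of Algorithm 4, under the assumption that rules are represented as sets of operations:

def modified_jaccard_distance(r1, r2, d1=0.70, d2=0.20):
    """Modified Jaccard distance between two rules (sets of operations).

    JD = 1 - (d1*|R1 n R2| - d2*(|R1 u R2| - |R1 n R2|)) / |R1 u R2|
    """
    inter = len(r1 & r2)
    union = len(r1 | r2)
    return 1.0 - (d1 * inter - d2 * (union - inter)) / union

# Worked example from Section 5: R1 = {R(c), R(b), R(a)} and
# R2 = {R(d), R(b), R(a)} share 2 of 4 operations, giving
# JD = 1 - (1.4 - 0.4)/4 = 0.75.
jd = modified_jaccard_distance({'R(c)', 'R(b)', 'R(a)'},
                               {'R(d)', 'R(b)', 'R(a)'})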
• The minimum value of JD is regarded as ds. φi is fetched directly from the dubiety table. The final dubiety score for the given user is calculated as
φf = √(ds · φi)
• If φf < ФLT, the transaction is termed non-malicious. In this case, the current dubiety score in the dubiety table for the given user is reduced by a factor known as the “amelioration factor” (Å); thus, φi is updated as φi = Å·φi.
• If ФLT ≤ φf < ФUT, the transaction is termed non-malicious and the dubiety table entry for the given user is updated with φf.
• If φf ≥ ФUT, the transaction is termed malicious.
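The threshold logic of this module can be sketched as follows; the amelioration factor Å is not fixed in the text, so the value 0.99 below is inferred from the 1003 row of Table 3.9 and should be treated as an assumption:

import math

def classify_transaction(uid, ds, dubiety, phi_lt=0.3, phi_ut=0.6, A=0.99):
    """Combine the rule distance ds with the stored score and apply the two
    thresholds. A is the assumed amelioration factor; 0.99 reproduces the
    1003 row of Table 3.9 (0.2 -> 0.198).
    """
    phi_f = math.sqrt(ds * dubiety[uid])
    if phi_f >= phi_ut:                      # malicious: block, keep score
        return 'malicious'
    if phi_f < phi_lt:                       # clearly benign: decay the score
        dubiety[uid] = A * dubiety[uid]
    else:                                    # benign but noted
        dubiety[uid] = phi_f
    return 'non-malicious'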
As an example, let the initial dubiety table be:
Uid φ
1001 0.9
1002 0.8
1003 0.2
1004 0.6
1005 0.7
Table 3.6 Initial dubiety table
Let the minimum value of ds corresponding to each user be:
Uid ds
1001 0.2
1002 0.3
1003 0.2
1004 0.6
1005 0.3
Table 3.7 Minimum ds values for various Users
The calculated dubiety score table:
Uid   φf = √(ds · φi)
1001 0.42
1002 0.49
1003 0.2
1004 0.6
1005 0.46
Table 3.8 calculated dubiety scores table
Taking ФLT=0.3 and ФUT=0.6
Uid    φf     Nature of Transaction    Updated φf
1001   0.42   Non-malicious            0.42
1002   0.49   Non-malicious            0.49
1003   0.2    Non-malicious            0.198
1004   0.6    Malicious                0.6
1005   0.46   Non-malicious            0.46
Table 3.9 Summary of transactions of various users
The malicious transactions are blocked in a straightforward fashion and the non-malicious
transactions are processed. The updated dubiety table is stored in the database.
4. DISCUSSION
With regard to a typical credit card company dataset, some examples of critical data elements
(CDEs) are:
1. CVV (denoted by a)
Card verification value (CVV) is a combination of features used in credit, debit and
automated teller machine (ATM) cards for the purpose of establishing the owner's identity
and minimizing the risk of fraud. The CVV is also known as the card verification code
(CVC) or card security code (CSC).
When properly used, the CVV is highly effective against some forms of fraud. For example,
if the data in the magnetic stripe is changed, a stripe reader will indicate a "damaged card"
error. The flat-printed CVV is (or should be) routinely required for telephone or Internet-
based purchases because it implies that the person placing the order has physical possession
of the card. Some merchants check the flat-printed CVV even when transactions are
conducted in person.
CVV technology cannot protect against all forms of fraud. If a card is stolen or the legitimate
user is tricked into divulging vital account information to a fraudulent merchant,
unauthorized charges against the account can result. A common method of stealing credit
card data is phishing, in which a criminal sends out legitimate-looking email in an attempt to
gather personal and financial information from recipients. Once the criminal has possession
of the CVV in addition to personal data from a victim, widespread fraud against that victim,
including identity theft, can occur.
The following are directly associated elements (DAEs) to the CVV:
a. Credit card number (denoted by b)
b. Name of card holder (denoted by c)
c. Card expiry date (denoted by d)
The credit card number, name of the card holder and card expiry date are elements that are read before
the CVV and hence used to validate the CVV entered by the user. Hence the above-mentioned
attributes have been classified as DAEs by our system.
Some normal data attributes are:
1. Gender of Customer (denoted by e)
2. Credit Limit (denoted by f)
3. Customer’s phone number (denoted by g)
These are attributes that have been collected for fraud detection; they are not directly
used to access the CDEs but are crucial for the process.
Some examples of transactions for our proposed approach:
• R(b) → R(a)
• R(b), R(c) → R(a)
5. EXAMPLE OF OUR APPROACH
1. JC Distance
R1: R(c), R(b) → R(a)
R2: R(d), R(b) → R(a)
The modified JC distance between R1 and R2, where the hyperparameters are δ1 = 0.70 and δ2 = 0.20, is calculated as

$$JD = 1 - \frac{\delta_1\,\lvert R_1 \cap R_2 \rvert - \delta_2\,\left(\lvert R_1 \cup R_2 \rvert - \lvert R_1 \cap R_2 \rvert\right)}{\lvert R_1 \cup R_2 \rvert}$$

|R1 ∩ R2| = 2
|R1 ∪ R2| = 4
JC Distance = 1 − (0.70·2 − 0.20·2)/4 = 0.75
2. User Profile Vector
B1 = <U1, <0.7, 0.1, 0.6, 0.2, 0.4, 0.0, 0.2, 0.0>, <0.2, 0.3, 0.1, 0.2, 0.167, 0.033> >
Here the values in the second tuple <0.7, …0.0> represent the probability of User U1
accessing particular attributes, for instance 0.7 denotes that there is a 70% probability that
U1 accesses the first attribute.
The values in the third tuple represent the membership of user U1 in the k fuzzy
clusters, where k = 6 in our case.
3. Dubiety Score
Suppose the dubiety score φi for user U1 is 0.8 and
the JC distance of the test transaction from its cluster is ds = 0.6. Then
φf = √(ds · φi) = √(0.6 · 0.8) = 0.69.
Setting our hyperparameter ФUT to 0.65, we observe that φf > ФUT. Hence the test transaction
is malicious, and an alarm is raised.
6. EXPERIMENTATION
In this section, we describe the method of evaluation of the proposed algorithm. Firstly, we
describe our dataset. We then calculate various accuracy measures considering different
parameters as reference.
6.1 Description of dataset
This paper is about anomaly detection in user behaviours. An ideal dataset should be obtained
from a practical system with concrete job functions, but in practice such data is very sensitive for almost
every organization or company.
The performance of the algorithm was analysed by carrying out several experiments on
a credit card company dataset adhering to the TPC-C benchmark [18]. The TPC-C schema is
composed of a mixture of read-only and read/write transactions that replicate the activities
found in complex OLTP application environments. The database schema, data population,
transactions, and implementation rules were designed to broadly represent modern OLTP
systems. We used two audit logs: one for training the model and the second for testing it. The
training log comprised normal user transactions, and the testing log consisted of a mixture of
normal as well as malicious user transactions. Although there are unusual records in the real
dataset, we also injected some anomalies for detection. The injected anomalies were made to differ
from the normal behaviour pattern in several aspects. In total, about 20,000 transactions
were used, of which about 99% were non-malicious and less than 1% malicious. Fig. 6(a) shows the distribution of malicious and benign data in the dataset used:
Fig 6(a) Frequency of data items and their access frequency
The details of CDEs, DAEs and normal data items have already been given in Section 3, and
examples have been discussed in Section 5.
The access pattern data shows that CDEs are rarely accessed, and only by a few
user roles; hence, protection of CDEs from malicious access is of greater significance as
compared to DAEs and normal data elements.
6.2 Cluster Analysis
When the number of users/user roles exceeds a given limit, it becomes exceedingly difficult
for the IDS to keep track of individual user access patterns and hence detect anomalies. This is
the reason that clustering is a computationally efficient solution for better
performance of the IDS. We prefer fuzzy clustering over hard clustering. Fuzzy clustering (also
referred to as soft clustering) is a form of clustering in which each data point can belong to
more than one cluster. In non-fuzzy clustering (also known as hard clustering), data is divided
into distinct clusters, where each data point can belong to exactly one cluster. In fuzzy
clustering, data points can potentially belong to multiple clusters. Membership grades are
assigned to each of the data points (tags). These membership grades indicate the degree to which
data points belong to each cluster. Thus, points on the edge of a cluster, with lower membership
grades, may be in the cluster to a lesser degree than points in the centre of the cluster. When we
evaluate the various performance measures keeping the number of clusters as a reference
parameter, it is observed that a particular cluster count is the most efficient in predicting
results.
Fig 6(b) Variation of performance with number of clusters
Fig 6(b) depicts variation in precision, recall, TNR, accuracy with change in number of clusters.
From the graph, we can see that:
• TNR does not vary with the number of clusters, i.e. TNR is invariant.
• The precision is always greater than 0.94 and is more or less constant.
• Recall reaches its optimum value when the number of fuzzy clusters is greater than 3.
• Accuracy also reaches its optimum value when the number of clusters is greater than 3.
6.3 Distances and thresholds
In Section 3.2, we described the modified Jensen-Shannon distance as a measure to calculate the
distance between two user vectors of the same length. In probability theory and statistics, the
Jensen-Shannon divergence is a method of measuring the similarity between two probability
distributions. It is also known as the information radius (IRad) or total divergence to the average.
It is based on the Kullback-Leibler divergence, with some notable (and useful) differences,
including that it is symmetric and always a finite value. The square root of the Jensen-
Shannon divergence is a metric often referred to as the Jensen-Shannon distance. We preferred to
use the modified Jensen-Shannon distance to give weights to data attributes and avoid the curse of
dimensionality. The variation of the modified Jensen-Shannon distance with the Euclidean distance is
shown in Fig. 6(g).
In Section 3.3, we defined the modified Jaccard distance to quantitatively measure the
similarity between two rules. The Jaccard index, also known as Intersection over Union or the
Jaccard similarity coefficient, is a statistical measure used for comparing the similarity and
diversity of sample sets. The Jaccard coefficient measures similarity between finite sample
sets, and is defined as the size of the intersection divided by the size of the union of the sample
sets. The variation of the modified Jaccard index with the Jaccard index is shown in Fig. 6(h).
The variation of precision, recall, TNR, accuracy with the various thresholds, namely 𝛿1, 𝛿2,
фUT , фLT that were defined in section 3 is shown in the following figures:
Fig 6(c) shows the variation of Precision, recall, TNR, accuracy with 𝛿1. It can be observed
from the graph that Precision, TNR and Accuracy increase with the increase in value of 𝛿1,
while the value of Recall decreases with increase in value of 𝛿1.
Fig 6(e) shows the variation of Precision, recall, TNR, accuracy with 𝛿2. It can be observed
from the graph that the value of Precision, TNR and Accuracy starts decreasing when the value
of 𝛿2 increases beyond a certain value. Recall, on the other hand, increases for higher values
of 𝛿2.
Fig 6(d) shows the variation of Precision, recall, TNR, accuracy with фUT. It can be observed
from the graph that the value of Precision first decreases and then exponentially increases with
the increase in value of фUT. An identical trend is followed by Accuracy. A somewhat similar
trend is followed by TNR, except that it does not decrease initially. On the contrary, the value
of Recall decreases with the increase in value of фUT.
Fig 6(f) shows the variation of Precision, recall, TNR, accuracy with фLT. It can be observed
from the graph that the values of all the parameters fluctuate a little but remain more or less
constant with the increase in value of фLT.
With regard to the dataset we have used, the following inferences can be drawn from the graphs:
1. Value of 𝛿1 should be close to 0.65 for optimum performance.
2. Value of 𝛿2 should be close to 0.55 for optimum performance.
3. Value of фUT should be close to 0.59 for optimum performance.
4. Value of фLT should be close to 0.2 for optimum performance.
6.4 Comparison with related methods
Using a set of standard performance measures, we compare our approaches with other related
works. Our approaches are:
Approach 1. Our approach, using the modified Jensen-Shannon distance and the modified
Jaccard index.
Approach 2. Using the unmodified Jaccard index with the Jensen-Shannon distance.
Approach 3. Using the Euclidean distance with the unmodified Jaccard index.
The performance measures used for comparing the approaches are shown in Table 6.1.
S.No.   Performance Measure   Formula
1       TNR                   TN / (TN + FP)
2       Precision             TP / (TP + FP)
3       Accuracy              (TP + TN) / (TN + FP + TP + FN)
4       F1 Score              (2 × Precision × Recall) / (Precision + Recall)
5       PPV                   TP / (TP + FP)
6       ACC                   (TP + TN) / (TP + TN + FP + FN)
7       NPV                   TN / (TN + FN)
8       FDR                   FP / (FP + TP)
9       FOR                   FN / (TN + FN)
10      BM                    TPR + TNR − 1
11      FPR                   FP / (FP + TN)
12      FNR                   FN / (FN + TP)
13      MK                    PPV + NPV − 1
14      MCC                   (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
Table 6.1 Performance Measures
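Since all of these measures derive from the four confusion-matrix counts, they are straightforward to compute programmatically. The sketch below evaluates every formula in Table 6.1 from given TP, TN, FP and FN counts; the counts in the example call are illustrative.

```python
import math

def performance_measures(tp, tn, fp, fn):
    """Compute the measures of Table 6.1 from confusion-matrix counts."""
    tpr = tp / (tp + fn)   # recall (sensitivity)
    tnr = tn / (tn + fp)   # true negative rate (specificity)
    ppv = tp / (tp + fp)   # precision (positive predictive value)
    npv = tn / (tn + fn)   # negative predictive value
    acc = (tp + tn) / (tp + tn + fp + fn)
    return {
        "TNR": tnr,
        "Precision": ppv,
        "Accuracy": acc,
        "F1 Score": 2 * ppv * tpr / (ppv + tpr),
        "PPV": ppv,
        "ACC": acc,
        "NPV": npv,
        "FDR": fp / (fp + tp),
        "FOR": fn / (tn + fn),
        "BM": tpr + tnr - 1,   # bookmaker informedness
        "FPR": fp / (fp + tn),
        "FNR": fn / (fn + tp),
        "MK": ppv + npv - 1,   # markedness
        "MCC": (tp * tn - fp * fn) / math.sqrt(
            (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Example with illustrative counts:
print(performance_measures(tp=81, tn=96, fp=4, fn=19))
```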
In Table 6.2 we have compared the three approaches with each other.
Sensitivity Measures   Approach 1   Approach 2   Approach 3
PPV 0.96 0.73 0.74
TPR 0.81 0.95 1.00
ACC 0.89 0.80 0.83
F1 Score 0.88 0.83 0.85
NPV 0.83 0.93 1.00
FDR 0.04 0.27 0.26
FOR 0.17 0.07 0.00
BM 0.77 0.60 0.65
FPR 0.03 0.34 0.34
TNR 0.96 0.65 0.65
FNR 0.19 0.05 0.00
MK 0.79 0.66 0.74
MCC 0.78 0.63 0.70
Table 6.2 Comparison of our approaches
From the table, the following observations can be made.
If we compare Approach 1 with Approach 2, we can observe that:
 The TNR and precision of Approach 1 are far better than those of Approach 2.
 Approach 1 also has better accuracy than Approach 2.
 Approach 1 has much lower FPR and FDR scores than Approach 2.
 Among the other performance measures, the MK and MCC values of Approach 1 are also
better than those of Approach 2.
 Approach 2, on the other hand, has better TPR, NPV and FOR measures than Approach 1.
 Approach 1 and Approach 2 have similar F1 scores.
In measures such as FPR and TNR, where Approach 1 performs well, Approach 2 performs
rather poorly. However, in measures such as TPR and NPV, where Approach 2 performs better,
Approach 1 still does well; for example, both approaches have similar NPV scores, with
Approach 2 performing slightly better. As Approach 1 performs far better than Approach 2 in
most of the measures, we conclude that the overall performance of Approach 1 is better than
that of Approach 2.
If we compare Approach 1 with Approach 3, we observe that:
 The TNR and precision of Approach 1 are far better than those of Approach 3.
 Approach 1 also has better accuracy than Approach 3.
 Approach 1 also has much lower FPR and FDR scores than Approach 3.
 Among the other performance measures, the MK and MCC values of Approach 1 are also
slightly better than those of Approach 3.
 Approach 3, on the other hand, has better TPR, NPV and FOR measures than Approach 1;
in fact, it has the best values for these parameters in the entire table.
 Approach 1 and Approach 3 also have similar F1 scores.
In measures such as TNR and precision, where Approach 1 has some of the best scores in the
entire table, Approach 3 performs rather poorly, and it lags far behind in FPR and FDR. On the
other hand, in the measures in which Approach 3 performs better than Approach 1, Approach 1
still performs quite well; for example, both approaches have good NPV scores, with Approach 3
performing better. Similar trends are observed for all other measures except FNR, where
Approach 3 is far superior. Considering all of the above, even though Approach 3 has the best
values for some performance measures, its poor performance on the others is a clear
disadvantage, so Approach 1 is better than Approach 3 overall.
Table 6.3, shown after the following observations, compares our approaches with various other
related works.
If we compare our approach with other related approaches, we observe the following:
 In comparison to Hu and Panda, our approach performs better with respect to all the
performance measures considered for comparison.
 In comparison to the work of Mostafa et al., our approach performs better with respect to
all the performance measures considered for comparison.
 In comparison to the work of Hashemi et al., even though our approach scores slightly
lower in measures such as TNR and precision, it scores considerably better on the remaining
performance measures.
 In comparison to the work of Mina Sohrabi et al., our approach performs better with respect
to all the performance measures present in the table.
 In comparison to the work of Majumdar et al., our approach performs better with respect to
all the performance measures considered for comparison.
 In comparison to the work of UP Rao et al., our approach performs better with respect to
all the measures considered in the table.
 In comparison to the work of Elisa Bertino et al., our approach gives better TNR and
precision scores, as well as comparatively better FDR and FPR scores. In the other measures,
except TPR and recall, both approaches score similarly. Since our work is chiefly concerned
with protecting Critical Data Elements in a dataset, high TNR and precision scores are more
desirable than the other performance measures; and since our approach also performs quite
well on those other measures, the better TNR and precision scores easily compensate for the
lower recall values.
Sensitivity Measures | Approach 1 | Approach 2 | Approach 3 | Hu & Panda | Hashemi et al. | Mostafa et al. | Mina Sohrabi et al. | Majumdar et al. (2006) | Elisa Bertino et al. | UP Rao et al. (2016)
PPV 0.96 0.73 0.74 0.88 0.97 0.94 0.93 0.88 0.94 0.61
TPR 0.81 0.95 1.00 0.73 0.71 0.75 0.66 0.70 0.91 0.70
ACC 0.89 0.80 0.83 0.81 0.84 0.85 0.80 0.80 0.93 0.64
F1 Score 0.88 0.83 0.85 0.79 0.82 0.83 0.77 0.78 0.92 0.65
NPV 0.83 0.93 1.00 0.77 0.77 0.79 0.73 0.75 0.91 0.68
FDR 0.04 0.27 0.26 0.12 0.03 0.06 0.07 0.13 0.06 0.39
FOR 0.17 0.07 0.00 0.23 0.23 0.21 0.27 0.25 0.09 0.32
BM 0.77 0.60 0.65 0.63 0.69 0.70 0.60 0.60 0.85 0.35
FPR 0.03 0.34 0.34 0.10 0.02 0.05 0.05 0.10 0.06 0.45
TNR 0.96 0.65 0.65 0.90 0.98 0.95 0.94 0.90 0.94 0.65
FNR 0.19 0.05 0.00 0.28 0.29 0.25 0.35 0.30 0.09 0.30
MK 0.79 0.66 0.74 0.65 0.74 0.73 0.66 0.63 0.85 0.29
MCC 0.78 0.63 0.70 0.63 0.72 0.71 0.63 0.61 0.85 0.29
Table 6.3 Comparison of our approaches with related works
7. CONCLUSION AND FUTURE WORK
In this paper we have attempted to detect malicious transactions from the perspective that
certain data elements hold more critical information than others. Inference attacks against such
data elements are blocked by taking into account the user's access pattern as well as their
historic behaviour. A user who consistently behaves normally is gradually allowed to improve
his dubiety score. The approach was analysed with respect to different performance parameters
through experiments. We conclude that the approach works efficiently in determining the
nature of a transaction.
We plan to extend our approach from 2-level inference control to n-level inference control,
whereby nth-order statements will be encoded into an attribute hierarchy; the n-level attribute
tree/graph will then be manipulated to form fuzzy clusters, and incoming transactions will be
checked against the nth access level. Automatic classification of attributes as critical data
elements from their semantics may also be considered a future research topic.
8. REFERENCES
1. I-Yuan Lin; Xin-Mao Huang; Ming-Syan Chen “Capturing user access patterns in the Web
for data mining” Proceedings 11th International Conference on Tools with Artificial
Intelligence, IEEE pp 9-11 Nov. 1999
2. R.S. Sandhu; P. Samarati “Access control: principle and practice” Published in: IEEE
Communications Magazine (Volume: 32, Issue: 9, Sept. 1994)
3. Denning, D.E. (1987) An Intrusion Detection Model. IEEE Transactions on Software
Engineering, Vol. SE-13, 222-232.
4. Knuth, Donald E., James H. Morris, Jr, and Vaughan R. Pratt. "Fast pattern matching in
strings." SIAM journal on computing 6.2 (1977): 323-350.
5. Wang, Ke. “Anomalous Payload-Based Network Intrusion Detection”. Recent Advances
in Intrusion Detection. Springer Berlin. doi:10.1007/978-3-540-30143-1_11
6. Douligeris, Christos; Serpanos, Dimitrios N. (2007-02-09). Network Security: Current
Status and Future Directions. John Wiley & Sons. ISBN 9780470099735.
7. Christina Yip Chung, Michael Gertz and Karl Levitt (2000), “DEMIDS: a misuse detection
system for database systems”, Integrity and internal control information systems: strategic
views on the need for control, Kluwer Academic Publishers, Norwell, MA.
8. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief, et al.,
"Insider Threats: Identifying Anomalous Human Behaviour in Heterogeneous Systems
Using Beneficial Intelligent Software (Ben-ware)", presented at the Proceedings of the 7th
ACM CCS International Workshop on Managing Insider Security Threats, Denver,
Colorado, USA, 2015.
9. S. D. Bhattacharjee, J. Yuan, Z. Jiaqi, and Y.-P. Tan, "Context-aware graph-based analysis
for detecting anomalous activities", presented at the Multimedia and Expo (ICME), 2017
IEEE International Conference on, 2017.
10. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, "Automated insider threat detection
system using user and role-based profile assessment", IEEE Systems Journal, vol. 11, pp.
503-512, 2015.
11. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese, "Validating an Insider
Threat Detection System: A Real Scenario Perspective", presented at the 2016 IEEE
Security and Privacy Workshops (SPW), 2016.
12. T. Rashid, I. Agrafiotis, and J. R. C. Nurse, "A New Take on Detecting Insider Threats:
Exploring the Use of Hidden Markov Models", presented at the Proceedings of the 8th
ACM CCSInternational Workshop on Managing Insider Security Threats, Vienna, Austria,
2016.
13. Zamanian Z., Feizollah A., Anuar N.B., Kiah L.B.M., Srikanth K., Kumar S. (2019) User
Profiling in Anomaly Detection of Authorization Logs. In: Alfred R., Lim Y., Ibrahim A.,
Anthony P. (eds) Computational Science and Technology. Lecture Notes in Electrical
Engineering, vol 481. Springer, Singapore
14. Yuqing Sun, Haoran Xu, Elisa Bertino, and Chao Sun. 2016. A Data-Driven Evaluation for
Insider Threats. Data Science and Engineering Vol. 1, 2 (2016), 73--85.
doi:10.1007/s41019-016-0009-x
15. S. Panigrahi, S. Sural and A. K. Majumdar, "Detection of intrusive activity in databases by
combining multiple evidences and belief update," 2009 IEEE Symposium on
Computational Intelligence in Cyber Security, Nashville, TN, 2009, pp. 83-90. doi:
10.1109/CICYBS.2009.4925094
16. Yi Hu, Brajendra Panda, A data mining approach for database intrusion detection, SAC '04
Proceedings of the 2004 ACM symposium on Applied computing Pages 711-716,
doi:10.1145/967900.968048
17. Abhinav Srivastava , Shamik Sural , A. K. Majumdar, Weighted intra-transactional rule
mining for database intrusion detection, Proceedings of the 10th Pacific-Asia conference
on Advances in Knowledge Discovery and Data Mining, April 09-12, 2006, Singapore
doi:10.1007/11731139_71
18. TPC-C benchmark: http://www.tpc.org/tpcc/default.asp
19. Mina Sohrabi, M. M. Javidi, S. Hashemi, “Detecting intrusion transactions in database
systems: a novel approach”, Journal of Intelligent Information Systems 42:619-644, doi: 10.1007,
Springer, 2014.
20. U.P. Rao et al., “Weighted Role Based Data Dependency Approach for Intrusion Detection
in Database”, International Journal of Network Security, Vol.19, No.3, PP.358-370, May
2017 (doi: 10.6633/IJNS.201703.19(3).05).
21. R. Agrawal, T. Imielinski, and A. Swami, “Mining Association Rules between Sets of Items
in Large Databases”, in Proceedings of the 1993 ACM SIGMOD International Conference
on Management of data, 1993.
22. Sattar Hashemi, Ying Yang, Davoud Zabihzadeh and Mohammadreza Kangavari,
“Detecting intrusion transactions in databases using data item dependencies and anomaly
analysis”, Article in Expert Systems 25(5):460-473. November 2008 doi: 10.1111/j.1468-
0394.2008.00467.
23. Mostafa Doroudian, Hamid Reza Shahriari, “A Hybrid Approach for Database Intrusion
Detection at Transaction and Inter-transaction Levels”, 6th Conference on Information and
Knowledge Technology (IKT 2014), May 28-30, 2014, Shahrood University of
Technology, Tehran, Iran.
24. E. Bertino, A. Kamra, E. Terzi and A. Vakali (2005), "Intrusion detection in RBAC
administered databases", in Proceedings of the Applied Computer Security Applications
Conference (ACSAC).
25. Lee, V. C.S., Stankovic, J. A., Son, S. H. Intrusion Detection in Real-time Database
Systems Via Time Signatures. In Proceedings of the Sixth IEEE Real Time Technology
and Applications Symposium, 2000.
26. Weina Wang, Yunjie Zhang, Yi Li and Xiaona Zhang (2006), "The Global Fuzzy C-Means
Clustering Algorithm", 2006 6th World Congress on Intelligent Control and Automation,
Dalian, 2006, pp. 3604- 3607.
27. Fuglede, Bent; Topsøe, Flemming (2004). "Jensen-Shannon divergence and Hilbert space
embedding - IEEE Conference Publication".
28. Dunn, J. C. (1973-01-01). "A Fuzzy Relative of the ISODATA Process and Its Use in
Detecting Compact Well-Separated Clusters". Journal of Cybernetics. 3 (3): 32–57.
doi:10.1080/01969727308546046. ISSN 0022-0280.
29. A. Mangalampalli and V. Pudi (2009), "Fuzzy association rule mining algorithm for fast
and efficient performance on very large datasets", 2009 IEEE International Conference on
Fuzzy Systems, Jeju Island, 2009, pp. 1163-1168
30. Vorontsov, I.E., Kulakovskiy, I.V. & Makeev, V.J. Algorithms Mol Biol (2013) 8: 23.
“Jaccard index based similarity measure to compare transcription factor binding site
models” doi: 10.1186/1748-7188-8-23
TABLE OF CONTENTS
 DECLARATION
 CERTIFICATE
 ACKNOWLEDGEMENT
 LIST OF ABBREVIATIONS USED
 LIST OF FIGURES
 LIST OF TABLES
 PROBLEM STATEMENT
 MOTIVATION
 OBJECTIVE
 ABSTRACT
 INTRODUCTION
 LITERATURE OVERVIEW
 OUR PROPOSED APPROACH
 RESULT AND EXPLANATION
 CONCLUSION AND FUTURE WORK
 REFERENCES
LIST OF ABBREVIATIONS USED
1. CDE: Critical Data Elements
2. DAE: Directly Accessed Elements
3. IDS: Intrusion Detection System
4. SPM: Sequential Pattern Mining
5. FPM: Frequent Pattern Mining
6. ARM: Association Rule Mining
7. DT: Dubiety Table
8. FPR: False Positive Rate
9. TPR: True Positive Rate
10. FNR: False Negative Rate
11. TNR: True Negative Rate
12. UV: User Vector
13. Uid: User ID
14. Cid: Cluster ID
15. JCD: Jaccard Distance
16. JSD: Jensen-Shannon Distance
17. KL: Kullback-Leibler
LIST OF FIGURES
1. Fig 3 (a) Architecture of Learning Phase.
2. Fig 3 (b) Architecture of Testing Phase.
3. Fig 6 (a) Distribution of data in the dataset.
4. Fig 6 (b) Variation of performance measures with number of clusters.
5. Fig 6 (c) Variation of precision, recall, TNR, accuracy with 𝛿1.
6. Fig 6 (d) Variation of precision, recall, TNR, accuracy with фUT.
7. Fig 6 (e) Variation of precision, recall, TNR, accuracy with 𝛿2.
8. Fig 6 (f) Variation of precision, recall, TNR, accuracy with фLT.
9. Fig 6 (g) Variation of modified Jensen-Shannon distance with Euclidean distance.
10. Fig 6 (h) Variation of modified Jaccard index with Jaccard index.
LIST OF TABLES
1. Table 3.1 Types of attributes and their sensitivity levels
2. Table 3.2 Initial Dubiety Table
3. Table 3.3 Updated Dubiety Table
4. Table 3.4 Rule Generator for given Example
5. Table 3.5 User profile for the given Example
6. Table 3.6 Initial dubiety table
7. Table 3.7 Minimum ds values for various Users
8. Table 3.8 Calculated dubiety scores
9. Table 3.9 Summary of transactions of various users
10. Table 6.1 Performance Measures
11. Table 6.2 Comparison of our approaches
12. Table 6.3 Comparison of our approaches with related works

More Related Content

What's hot

Behavioural biometrics and cognitive security authentication comparison study
Behavioural biometrics and cognitive security authentication comparison studyBehavioural biometrics and cognitive security authentication comparison study
Behavioural biometrics and cognitive security authentication comparison studyacijjournal
 
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARD
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARDINTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARD
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARDIJCI JOURNAL
 
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORD
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORDAN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORD
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORDIJNSA Journal
 
Graphical Password Authentication using Images Sequence
Graphical Password Authentication using Images SequenceGraphical Password Authentication using Images Sequence
Graphical Password Authentication using Images SequenceIRJET Journal
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)theijes
 
Anomaly Threat Detection System using User and Role-Based Profile Assessment
Anomaly Threat Detection System using User and Role-Based Profile AssessmentAnomaly Threat Detection System using User and Role-Based Profile Assessment
Anomaly Threat Detection System using User and Role-Based Profile Assessmentijtsrd
 
IRJET-Biostatistics in Indian Banks: An Enhanced Security Approach
IRJET-Biostatistics in Indian Banks: An Enhanced Security ApproachIRJET-Biostatistics in Indian Banks: An Enhanced Security Approach
IRJET-Biostatistics in Indian Banks: An Enhanced Security ApproachIRJET Journal
 
55994241 cissp-cram
55994241 cissp-cram55994241 cissp-cram
55994241 cissp-crambsnl007
 
ipas implicit password authentication system ieee 2011
ipas implicit password authentication system ieee 2011ipas implicit password authentication system ieee 2011
ipas implicit password authentication system ieee 2011prasanna9
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD Editor
 
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICS
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICSENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICS
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICSIJNSA Journal
 
Context based access control systems for mobile devices
Context based access control systems for mobile devicesContext based access control systems for mobile devices
Context based access control systems for mobile devicesLeMeniz Infotech
 
Dynamic Access Control for RBAC-administered web-based Databases
Dynamic Access Control for RBAC-administered web-based DatabasesDynamic Access Control for RBAC-administered web-based Databases
Dynamic Access Control for RBAC-administered web-based DatabasesThitichai Sripan
 
Prevention of SQL injection in E- Commerce
Prevention of SQL injection in E- CommercePrevention of SQL injection in E- Commerce
Prevention of SQL injection in E- Commerceijceronline
 
Ieee project-2014-2015-context-based-access-control-systems
Ieee project-2014-2015-context-based-access-control-systemsIeee project-2014-2015-context-based-access-control-systems
Ieee project-2014-2015-context-based-access-control-systemsSteph Cliche
 
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...IRJET-An Economical and Secured Approach for Continuous and Transparent User ...
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...IRJET Journal
 

What's hot (20)

Behavioural biometrics and cognitive security authentication comparison study
Behavioural biometrics and cognitive security authentication comparison studyBehavioural biometrics and cognitive security authentication comparison study
Behavioural biometrics and cognitive security authentication comparison study
 
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARD
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARDINTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARD
INTRUSION DETECTION IN MULTITIER WEB APPLICATIONS USING DOUBLEGUARD
 
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORD
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORDAN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORD
AN EFFICIENT IDENTITY BASED AUTHENTICATION PROTOCOL BY USING PASSWORD
 
Graphical Password Authentication using Images Sequence
Graphical Password Authentication using Images SequenceGraphical Password Authentication using Images Sequence
Graphical Password Authentication using Images Sequence
 
The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)The International Journal of Engineering and Science (The IJES)
The International Journal of Engineering and Science (The IJES)
 
Anomaly Threat Detection System using User and Role-Based Profile Assessment
Anomaly Threat Detection System using User and Role-Based Profile AssessmentAnomaly Threat Detection System using User and Role-Based Profile Assessment
Anomaly Threat Detection System using User and Role-Based Profile Assessment
 
IRJET-Biostatistics in Indian Banks: An Enhanced Security Approach
IRJET-Biostatistics in Indian Banks: An Enhanced Security ApproachIRJET-Biostatistics in Indian Banks: An Enhanced Security Approach
IRJET-Biostatistics in Indian Banks: An Enhanced Security Approach
 
55994241 cissp-cram
55994241 cissp-cram55994241 cissp-cram
55994241 cissp-cram
 
ipas implicit password authentication system ieee 2011
ipas implicit password authentication system ieee 2011ipas implicit password authentication system ieee 2011
ipas implicit password authentication system ieee 2011
 
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
IJERD (www.ijerd.com) International Journal of Engineering Research and Devel...
 
SQL injection
SQL injectionSQL injection
SQL injection
 
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICS
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICSENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICS
ENHANCED AUTHENTICATION FOR WEB-BASED SECURITY USING KEYSTROKE DYNAMICS
 
Kx3518741881
Kx3518741881Kx3518741881
Kx3518741881
 
Context based access control systems for mobile devices
Context based access control systems for mobile devicesContext based access control systems for mobile devices
Context based access control systems for mobile devices
 
Dynamic Access Control for RBAC-administered web-based Databases
Dynamic Access Control for RBAC-administered web-based DatabasesDynamic Access Control for RBAC-administered web-based Databases
Dynamic Access Control for RBAC-administered web-based Databases
 
Assessment and Mitigation of Risks Involved in Electronics Payment Systems
Assessment and Mitigation of Risks Involved in Electronics Payment Systems Assessment and Mitigation of Risks Involved in Electronics Payment Systems
Assessment and Mitigation of Risks Involved in Electronics Payment Systems
 
Prevention of SQL injection in E- Commerce
Prevention of SQL injection in E- CommercePrevention of SQL injection in E- Commerce
Prevention of SQL injection in E- Commerce
 
Ieee project-2014-2015-context-based-access-control-systems
Ieee project-2014-2015-context-based-access-control-systemsIeee project-2014-2015-context-based-access-control-systems
Ieee project-2014-2015-context-based-access-control-systems
 
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...IRJET-An Economical and Secured Approach for Continuous and Transparent User ...
IRJET-An Economical and Secured Approach for Continuous and Transparent User ...
 
Idps
IdpsIdps
Idps
 

Similar to Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System

A Comprehensive Review On Intrusion Detection System And Techniques
A Comprehensive Review On Intrusion Detection System And TechniquesA Comprehensive Review On Intrusion Detection System And Techniques
A Comprehensive Review On Intrusion Detection System And TechniquesKelly Taylor
 
Intrusion Detection System using Data Mining
Intrusion Detection System using Data MiningIntrusion Detection System using Data Mining
Intrusion Detection System using Data MiningIRJET Journal
 
IRJET - A Secure Approach for Intruder Detection using Backtracking
IRJET -  	  A Secure Approach for Intruder Detection using BacktrackingIRJET -  	  A Secure Approach for Intruder Detection using Backtracking
IRJET - A Secure Approach for Intruder Detection using BacktrackingIRJET Journal
 
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...IJRTEMJOURNAL
 
information security (Audit mechanism, intrusion detection, password manageme...
information security (Audit mechanism, intrusion detection, password manageme...information security (Audit mechanism, intrusion detection, password manageme...
information security (Audit mechanism, intrusion detection, password manageme...Zara Nawaz
 
Phi 235 social media security users guide presentation
Phi 235 social media security users guide presentationPhi 235 social media security users guide presentation
Phi 235 social media security users guide presentationAlan Holyoke
 
Self Monitoring System to Catch Unauthorized Activity
Self Monitoring System to Catch Unauthorized ActivitySelf Monitoring System to Catch Unauthorized Activity
Self Monitoring System to Catch Unauthorized ActivityIRJET Journal
 
What is penetration testing and why is it important for a business to invest ...
What is penetration testing and why is it important for a business to invest ...What is penetration testing and why is it important for a business to invest ...
What is penetration testing and why is it important for a business to invest ...Alisha Henderson
 
Remote Access Policy Is A Normal Thing
Remote Access Policy Is A Normal ThingRemote Access Policy Is A Normal Thing
Remote Access Policy Is A Normal ThingKaren Oliver
 
Identity and Access Intelligence
Identity and Access IntelligenceIdentity and Access Intelligence
Identity and Access IntelligenceTim Bell
 
Why IAM is the Need of the Hour
Why IAM is the Need of the HourWhy IAM is the Need of the Hour
Why IAM is the Need of the HourTechdemocracy
 
Comprehensive Analysis of Contemporary Information Security Challenges
Comprehensive Analysis of Contemporary Information Security ChallengesComprehensive Analysis of Contemporary Information Security Challenges
Comprehensive Analysis of Contemporary Information Security Challengessidraasif9090
 
Augment Method for Intrusion Detection around KDD Cup 99 Dataset
Augment Method for Intrusion Detection around KDD Cup 99 DatasetAugment Method for Intrusion Detection around KDD Cup 99 Dataset
Augment Method for Intrusion Detection around KDD Cup 99 DatasetIRJET Journal
 
Aujas incident management webinar deck 08162016
Aujas incident management webinar deck 08162016Aujas incident management webinar deck 08162016
Aujas incident management webinar deck 08162016Karl Kispert
 
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...IJCSIS Research Publications
 
Ids 013 detection approaches
Ids 013 detection approachesIds 013 detection approaches
Ids 013 detection approachesjyoti_lakhani
 
A Database System Security Framework
A Database System Security FrameworkA Database System Security Framework
A Database System Security FrameworkMaria Perkins
 

Similar to Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System (20)

A Comprehensive Review On Intrusion Detection System And Techniques
A Comprehensive Review On Intrusion Detection System And TechniquesA Comprehensive Review On Intrusion Detection System And Techniques
A Comprehensive Review On Intrusion Detection System And Techniques
 
Intrusion Detection System using Data Mining
Intrusion Detection System using Data MiningIntrusion Detection System using Data Mining
Intrusion Detection System using Data Mining
 
46 102-112
46 102-11246 102-112
46 102-112
 
IRJET - A Secure Approach for Intruder Detection using Backtracking
IRJET -  	  A Secure Approach for Intruder Detection using BacktrackingIRJET -  	  A Secure Approach for Intruder Detection using Backtracking
IRJET - A Secure Approach for Intruder Detection using Backtracking
 
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...
Requirement Based Intrusion Detection in Addition to Prevention Via Advanced ...
 
information security (Audit mechanism, intrusion detection, password manageme...
information security (Audit mechanism, intrusion detection, password manageme...information security (Audit mechanism, intrusion detection, password manageme...
information security (Audit mechanism, intrusion detection, password manageme...
 
Phi 235 social media security users guide presentation
Phi 235 social media security users guide presentationPhi 235 social media security users guide presentation
Phi 235 social media security users guide presentation
 
Self Monitoring System to Catch Unauthorized Activity
Self Monitoring System to Catch Unauthorized ActivitySelf Monitoring System to Catch Unauthorized Activity
Self Monitoring System to Catch Unauthorized Activity
 
What is penetration testing and why is it important for a business to invest ...
What is penetration testing and why is it important for a business to invest ...What is penetration testing and why is it important for a business to invest ...
What is penetration testing and why is it important for a business to invest ...
 
Remote Access Policy Is A Normal Thing
Remote Access Policy Is A Normal ThingRemote Access Policy Is A Normal Thing
Remote Access Policy Is A Normal Thing
 
Identity and Access Intelligence
Identity and Access IntelligenceIdentity and Access Intelligence
Identity and Access Intelligence
 
Why IAM is the Need of the Hour
Why IAM is the Need of the HourWhy IAM is the Need of the Hour
Why IAM is the Need of the Hour
 
Comprehensive Analysis of Contemporary Information Security Challenges
Comprehensive Analysis of Contemporary Information Security ChallengesComprehensive Analysis of Contemporary Information Security Challenges
Comprehensive Analysis of Contemporary Information Security Challenges
 
Augment Method for Intrusion Detection around KDD Cup 99 Dataset
Augment Method for Intrusion Detection around KDD Cup 99 DatasetAugment Method for Intrusion Detection around KDD Cup 99 Dataset
Augment Method for Intrusion Detection around KDD Cup 99 Dataset
 
306 310
306 310306 310
306 310
 
306 310
306 310306 310
306 310
 
Aujas incident management webinar deck 08162016
Aujas incident management webinar deck 08162016Aujas incident management webinar deck 08162016
Aujas incident management webinar deck 08162016
 
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...
A Hybrid Intrusion Detection System for Network Security: A New Proposed Min ...
 
Ids 013 detection approaches
Ids 013 detection approachesIds 013 detection approaches
Ids 013 detection approaches
 
A Database System Security Framework
A Database System Security FrameworkA Database System Security Framework
A Database System Security Framework
 

Recently uploaded

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...OnePlan Solutions
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxnada99848
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...gurkirankumar98700
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideChristina Lin
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfPower Karaoke
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptkotipi9215
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - InfographicHr365.us smith
 

Recently uploaded (20)

办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...Advancing Engineering with AI through the Next Generation of Strategic Projec...
Advancing Engineering with AI through the Next Generation of Strategic Projec...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...Call Girls In Mukherjee Nagar 📱  9999965857  🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
Call Girls In Mukherjee Nagar 📱 9999965857 🤩 Delhi 🫦 HOT AND SEXY VVIP 🍎 SE...
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
software engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptxsoftware engineering Chapter 5 System modeling.pptx
software engineering Chapter 5 System modeling.pptx
 
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
(Genuine) Escort Service Lucknow | Starting ₹,5K To @25k with A/C 🧑🏽‍❤️‍🧑🏻 89...
 
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop SlideBuilding Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
Building Real-Time Data Pipelines: Stream & Batch Processing workshop Slide
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
The Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdfThe Evolution of Karaoke From Analog to App.pdf
The Evolution of Karaoke From Analog to App.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
chapter--4-software-project-planning.ppt
chapter--4-software-project-planning.pptchapter--4-software-project-planning.ppt
chapter--4-software-project-planning.ppt
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Asset Management Software - Infographic
Asset Management Software - InfographicAsset Management Software - Infographic
Asset Management Software - Infographic
 

Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System

  • 1. CERTIFICATE This is to certify that this project titled “Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System” submitted by Shivam Gupta (2K16/CO/295), Shivam Maini (2K16/CO/299), Shubham (2K16/CO/309) and Simran Seth (2K16/CO/317) in partial fulfilment for the requirements for the award of Bachelor of Technology degree in Computer Engineering (COE) at Delhi Technological University is an authentic work carried out by the students under my supervision and guidance. To the best of my knowledge, the matter embodied in the thesis has not been submitted to any other university or institute for the award of any degree or diploma. Ms. Indu Singh (Assistant Professor) Department of CSE Delhi Technological University
  • 2. DECLARATION We hereby certify that the work which is presented in the project entitles “Query Pattern Access and Fuzzy Clustering Based Intrusion Detection System” in fulfilment for the requirement for the award of the degree of Bachelor of Technology and submitted to the Department of Computer Engineering, Delhi Technological University is an authentic record of our own carried out during a period from April 2018 to February 2019, under the supervision of Ms. Indu Singh (Assistant Professor, CSE Department). The matter presented in this report has not been submitted by me for the award of any other degree of this or any other Institute/University. Shivam Gupta (2K16/CO/295) Shivam Maini (2K16/CO/299) Shubham (2K16/CO/309) Simran Seth (2K16/CO/317)
  • 3. ACKNOWLEDGEMENTS “The successful completion of any task would be incomplete without accomplishing the people who made it all possible and whose constant guidance and encouragement secured us the success.” We owe a debt of gratitude to our guide Ms. Indu Singh (Assistant Professor CSE Department) for incorporating in us the idea of a creative project, helping us in undertaking this project and for being there whenever we need her assistance. I also place on record, my sense of gratitude to one and all, who directly or indirectly have lent their helping hand in this venture. I feel proud and privileged in expressing my deep sense of gratitude to all those who have helped me in presenting this project. Last but never the least, we thank our parents for always being with us, in every sense.
  • 4. PROBLEM STATEMENT The aim of the project is to build an intrusion detection system that provides the following functionalities:  The designed system must be able to detect any anomalous behaviour by any user and raise an alarm and take necessary response against such behaviour.  Our system must be robust to user behaviour.  Detect and prevent insider frauds in a credit card company.  Provide higher level of access control to critical data items (like CVV)  The designed system should be free from any vulnerabilities of an outsider attack such as session hijacking, session fixation, data theft etc.  The system must block all transactions that don’t fall under the user’s jurisdiction by maintaining user behaviour logs.
  • 5. MOTIVATION An Anomaly-Based Intrusion Detection System is a system for detecting computerised intrusions and misuse by monitoring system activity and classifying it as either normal or anomalous. The classification is based on heuristics or rules, rather than patterns or signatures, and will detect any type of misuse that differs significantly from normal system operation. Earlier, IDSs relied on some hand coded rules designed by security experts and network administrators. However, given the requirements and the complexities of today’s network environments, we need a systematic and automated IDS development process rather than the pure knowledge based and engineering approaches which rely only on intuition and experience. This encouraged us to study some Data Mining based frameworks for Intrusion Detection. These frameworks use data mining algorithms to compute activity patterns from system audit data and extract predictive features from the patterns. Machine learning algorithms are then applied to the audit records that are processed according to the feature definitions to generate intrusion detection rules. The Data Mining based approaches that we have studied can be divided into two main categories :- 1. Supervised Learning a. Association Rule Mining 2. Unsupervised Learning a. Clustering
  • 6. OBJECTIVE The main purpose of our paper is to monitor user access. Our Intrusion Detection System (IDS) pays special attention to certain semantically critical data elements along with those elements which can be used to infer them. We present an innovative approach to combine a user’s historic and present access pattern, and hence classify the incoming transaction as malicious or non-malicious. Using Fuzzy C-Means, we partition the users into fuzzy clusters. Each of these clusters contains a set of rules in their cluster profiles. New transactions are checked in the detection phase using these clusters. The main advantage of our IDS lies in its ability to prevent inference attacks on Critical Data Elements and take into account the user’s historic behaviour.
  • 7. ABSTRACT Hackers and malicious insiders perpetually try to steal, manipulate and corrupt sensitive data elements and an organization’s database servers are often the primary targets of these attacks. In the broadest sense, misuse (witting or unwitting) by authorized database users, database administrators, or network/systems managers are potential insider threats that our project intends to address. Insider threats are more menacing because in contrast to outsiders (hackers or unauthorised users), insiders have authorised access to the database and have knowledge about the critical nuances of the database. Database security involves using multitude of information security controls to protect databases against breach of confidentiality, integrity and availability (CIA). QPAFCS (Query Pattern Access and Fuzzy Clustering System) involves plethora of controls such as technical, procedural/administrative and physical. We hence intend to propose an Intrusion Detection System (IDS) that monitors a database management system and prevents inference attacks on sensitive attributes, by means of auditing user access patterns. Keywords: Intrusion Detection, Fuzzy Clustering, User Access Pattern, Insider Attacks, Dubiety Score
  • 8. 1. INTRODUCTION Data protection from insider threats is essential to most organizations. Attacks from insiders could be more damaging than those from outsiders, since in most cases insiders have full or partial access to the data; therefore, traditional mechanisms for data protection, such as authentication and access control, cannot be solely used to protect against insiders. Since recent work has shown that insider attacks are accompanied by changes in the access patterns of users, user access pattern mining [1] is a suitable approach for the detection of these attacks. It creates profiles of the normal access patterns of users using past logs of users accesses. New accesses are later checked against these profiles and mismatches indicate potential attacks. A security technique called Access control [2] can regulate who can view or use resources in a computing environment. There are diverse access control systems that perform authorization, identification, authentication and access approval. Intrusion Detection Systems [3]. scrutinise and unearth surreptitious activities perpetrated by malevolent users. IDS work by either looking for signatures of known attacks or deviations of normal activity. Normally, IDS undergo a training phase with intrusion free data wherein they maintain a log of benign transactions. Pattern matching [4] is then used to detect whether or not an action is malign. This is called anomaly-based detection [5]. When errors are detected using their known “signatures” from previous knowledge of the attack, it is called signature-based detection [6]. These malicious actions once detected are then either blocked or probed depending upon the organisation’s policy. However, IDS need to be dynamic, robust and quick. Different architectures for IDS function differently and have different measures of performance. Every organisation needs to make sure that the IDS it uses satisfies its requisites. Several AD techniques have been proposed to detect anomalous data accesses. Some rely on the analysis of input queries based on the syntax. Although these approaches are computationally efficient, they are unable to detect anomalies in scenarios like the following one. Consider a clerk in an organization who issues queries to a relational database that typically selects a few rows from specific tables. An access from this clerk that selects all or most of the rows of these tables should be considered anomalous with respect to the daily access pattern of the clerk. However, approaches based on syntax only are not able to classify such access as anomalous. Thus, syntactic approaches have to be extended to take into account the semantic features of queries such as the number of result rows. An important requirement is that queries should be inspected before their execution in order to prevent malicious queries from making changes to the database. From the technical perspective, the main purpose is to ensure the effective enforcement of security regulations. Audit is an important technique of examining whether user behaviours in a system are in conformance with security policies. Many methods audit a database processing by comparing a user SQL query expression against some predefined patterns so as to find out an anomaly. But a malicious query may be made up as good looking so as to evade such syntactic detection. To overcome this shortcoming, the data-centric method further audits whether the data a user query actually accessed has involved any banned information. 
However, such audit concerns a concrete policy rather than the overall view of multiple security policies. It requires clear audit commands that are articulated by experienced professionals and much interactive analysis. Since in practice an anomaly pattern cannot be articulated in advance, it is difficult to detect such fraud by the current audit method.
  • 9. The anomaly detection technology is used to identify abnormal behaviours that are statistical outliers. Some probabilistic methods learned normal patterns, against which they detected an anomaly. But these methods assume very few users are deviated from normal patterns. In case there are a number of anomalous users, the normal pattern would be diverged. These works do not examine user behaviour from either a historical or an incremental view, which may overlook some malicious behaviour. Furthermore, if a group of people collude together, it is difficult to find them by the current methods. We tackle the insider threat problem using different approaches. We take into consideration the fact that certain data elements are more critical to the database as compared with other data elements. Thus, we pay special attention to the security of such critical data elements. We also recognise the presence of data attributes in a system which can be manipulated to indirectly influence the crucial data attributes. We address the threat to our critical data elements using such attributes also. We investigate a suspected user also from the diachronic view by analysing his/her historical behaviour. We store a measure denoting how suspicious a user has been. The greater this measure, the greater the chances of the query being malicious. This measure also solves the problem of gradually malicious threat since the historical statics measures the accumulative results. The main purpose of our project (QPAFCS) is to recognise user access pattern. Our Intrusion Detection System (IDS) pays special attention to certain semantically critical data elements, along with those elements which can be used to infer them. We present an innovative approach to combine a user’s historic and present access pattern and hence classify the incoming transaction as malicious/non-malicious. Using FCM, we partition the users into fuzzy clusters. Each of these clusters contains a set of rules in their cluster profiles. In the detection phase, new transactions are checked against rules in these clusters, and then a suitable action is taken depending upon the nature of transaction. The main advantage of our IDS lies in its ability to prevent inference attacks on Critical Data Elements. The remainder of this work is organized as follows. In Sect. 2, we present prior research related to this work. Section 3 introduces the fuzzy clustering and belief update framework. Section 4 discusses the approach using examples. In Sect. 5, we discuss how to apply our method into practical system. Experimental evaluation is discussed in Section. 6.
  • 10. 2. RELATED WORK Numerous researchers are currently working in the field of Network Intrusion Detection Systems, but only a few have proposed research work in Database IDSs. Several systems for Intrusion Detection in operating systems and networks have been developed, however they are not adequate in protecting the database from intruders.[11] ID system in databases work at query level, transaction level and user (role) level. Bertino et. al. described the challenges to ensure data confidentiality, integrity and availability and the need of database security wherein the need of database IDSs to tackle insider threats was discussed. Panda et. al. [19] propose to employ data mining approach for determining data dependencies in the database system. The classification rules reflecting data dependencies are deduced directly from the database log. These rules represent what data items probably need to be read before an update operation and what data items are most likely to be written following this update operation. Transactions that are not compliant to the data dependencies generated are flagged as anomalous transactions. Database IDSs include Temporal Analysis of queries and Data dependencies among attributes, queries and transaction. Lee et al. [28] proposed a Temporal Analysis based intrusion detection method which incorporated time signatures and recorded update gap of temporal attributes. Any anomaly in update pattern of the attribute was reported as an intrusion in the proposed approach. The breakthrough introduction to association rule mining by Aggarwal et. al. [22] helped in finding data dependencies among data attributes, which was incorporated in the field of intrusion detection in Databases. During the initial development of data dependency association rule mining, DEMIDS, a misuse detection system for relational database systems was proposed by Chung et. al. [7] Profiles which specified user access pattern were derived from the audit log and Distance Metrics were further applied for recognizing data items These were used together in order to represent the expanse of users. But once the number of users for a single system becomes substantial, maintaining profiles becomes a redundant procedure. Another flaw was the system assuming domain information about a given schema. Hu et. al. [16] presented a data mining-based intrusion detection system, which used the static analysis of database audit log to mine dependencies among attributes at transaction level and represented those dependencies as sets of reading and writing operations on each data item. In another approach proposed by Hu et. al., techniques of sequential pattern mining have been applied on the training log, in order to identify frequent sequences at the transaction level. This approach helped in identifying a group of malicious transactions, which individually complied with the user behavior. The approach was improved in by Hu et. al. by clustering legitimate user transaction into user tasks for discovery of inter-transaction data dependencies. The method proposed extends the approach by assigning weights to all the operations on data attributes. The transactions which didn’t follow the data dependencies were marked as
malicious. The major disadvantage of user-assigned weights is that they are static and unrelated to other data attributes. Kamra et al. [27] employed a clustering technique on an RBAC model to form profiles based on attribute access which represented normal user behavior; an alarm is raised when behavior anomalous to a role profile is observed. Bezdek, Ehrlich and Full (1984) proposed the Fuzzy C-Means algorithm. The basic idea behind this approach is to express the similarity a data point shares with each of the clusters through a membership function. This measure of similarity lies between zero and one, signifies the extent of similarity between the data point and the cluster, and is termed the membership value. The main aim of this technique is to construct fuzzy partitions of a given data set. Y. Yu et al. [29] illustrated a fuzzy logic-based anomaly intrusion detection system. A Naive Bayes classifier is used to classify an input event as normal or anomalous; the classifier is based on the independent frequency of each system call from a process under normal conditions. The ratio of the probability of a sequence originating from a process to the probability of it not originating from that process serves as the input of a fuzzy system for the classification. A hybrid approach was described by Doroudian et al. [26] to identify intrusions at both the transaction and inter-transaction levels. At the transaction level, a set of predefined expected transactions was specified to the system, and a sequential rule mining algorithm was applied at the inter-transaction level to find dependencies between the identified transactions. The drawback of such a system is that sequences with frequencies lower than the threshold value are neglected; infrequent sequences are completely overlooked by the system, irrespective of their importance, and the True Positive Rate of the system therefore falls. This drawback was overcome by Sohrabi et al. [20], who proposed a novel approach, ODARDM, in which rules were formulated for lower-frequency item sets as well. These rules were extracted using leverage as the rule value measure, which helped retain the interesting data dependencies. As a result, the True Positive Rate increased while the False Positive Rate decreased. In recent developments, Rao et al. [21] presented a query access detection approach using Principal Component Analysis and Random Forest to reduce data dimensionality and produce only relevant and uncorrelated data; as the dimensionality is reduced, both system performance and the True Positive Rate increase. In 2009, Majumdar et al. [15] proposed a comprehensive database intrusion detection system that integrates different types of evidence using an extended Dempster-Shafer theory. Besides combining evidence, they also incorporated learning in their system through the application of prior knowledge and observed data on suspicious users. In 2016, Bertino et al. [14] tackled the insider threat problem from a data-driven systemic view. User actions are recorded as historical log data in a system, and the evaluation investigates the data that users actually process. From the horizontal view, users are grouped together according to their responsibilities and a normal pattern is learned from the group behaviours. They also investigate a suspected user from the diachronic view by comparing his/her historical behaviours with the historical average of the same group.
Anomaly detection has been an important research problem in security analysis; the development of methods that can detect malicious insider behavior with high accuracy and a low false alarm rate is therefore vital [10]. In this problem setting, McGough et al. [8] designed a system to identify anomalous user behavior by comparing each individual user's activities against their own routine profile, as well as against the organization's rules. They applied two independent approaches, machine learning and a statistical analyzer, to the data. The results from these two parts were then combined to form a consensus, which was mapped to a risk score. Their system showed high accuracy, a low false positive rate and minimal effect on the existing computing and network resources in terms of memory and CPU usage. Bhattacharjee et al. proposed a graph-based method that investigates user behavior from two perspectives: (a) anomaly with reference to the normal activities of an individual user observed over a prolonged period of time, and (b) the relationship between a user and his colleagues with similar roles/profiles. They utilized the CMU-CERT dataset in an unsupervised manner. In their model, the Boykov-Kolmogorov algorithm was used and the results were compared with different algorithms including Single Model One-Class SVM, Individual Profile Analysis, k-User Clustering and Maximum Clique (MC). Their proposed model was evaluated using the Area-Under-Curve (AUC) metric and showed an impressive improvement compared to the other algorithms [9]. T. Rashid et al. observed that the parameter learning task in HMMs is to find, given an output sequence or a set of such sequences, the best set of state transition and emission probabilities; the task is usually to derive the maximum likelihood estimate of the parameters of the HMM given the set of output sequences. No tractable algorithm is known for solving this problem exactly, but a local maximum likelihood can be derived efficiently using the Baum-Welch algorithm or the Baldi-Chauvin algorithm; the Baum-Welch algorithm is a special case of the expectation-maximization algorithm. If HMMs are used for time series prediction, more sophisticated Bayesian inference methods, such as Markov chain Monte Carlo (MCMC) sampling, have proven favorable over finding a single maximum likelihood model, both in terms of accuracy and stability [12]. Log data are high-dimensional and contain irrelevant and redundant features; feature selection methods can be applied to reduce dimensionality, decrease training time and enhance learning performance.
3. OUR APPROACH

3.1 Basic Notations

Large organisations deal with tremendous amounts of data whose security is of prime interest. The data in databases comprises attributes describing real-life objects called entities. The attributes have varying levels of sensitivity, i.e. not all attributes are equally important to the integrity of the database. As an example, signatures and other biometric data are highly sensitive attributes for a financial organisation such as a bank, in comparison to others like name, gender etc. Unauthorised access to these crucial attributes is therefore of greater concern: only certain employees may have access to such data elements, and access by all others must be blocked instantaneously to ensure the confidentiality and consistency of the data. Our proposed model QPAFCS (Query Pattern Access and Fuzzy Clustering System) pays special attention to sensitive data attributes, which are referred to in the text as CDEs (Critical Data Elements). The attributes that can be used to indirectly infer CDEs are also critical to the functioning of the organisation; for instance, the account number of a user may be used to access his signature and other crucial details. Such attributes are referred to in the text as DAEs (Directly Associated Elements). We propose a two-phase detection and prevention model that clusters users based on the similarity of their attribute access patterns and the types of queries they perform, i.e. our model tracks the access pattern of each user and classifies it as normal or malicious. The strength of our model lies in its ability to prevent unauthorised retrieval and modification of the most sensitive data elements (CDEs). Our model also ensures that the query pattern for access to CDEs is specific and fixed for a particular user, i.e. the user associates himself with his regular access behaviour; any deviation from the regular arrangement may lower the user's confidence and may indicate malicious intent. The following terminologies are used:

Definition 1 (Transaction) A set of queries executed by a user. Each transaction is represented by a unique transaction ID and also carries the user's ID; hence <Uid, Tid> acts as a unique identification key for each set of query patterns. Each transaction T is denoted as <Uid, Tid, <q1, q2, ... qn>> where qi denotes the ith query, i ∈ [1 ... n]. For example, suppose a user with ID 1001 executes the following set of SQL queries:

q1: SELECT a,b,c FROM R1,R2 WHERE R1.A>R2.B
q2: SELECT P FROM R5
WHERE R5.P==10

Then this is said to be a transaction of the form t = <1001, 67, <q1, q2>>.

Definition 2 (Query) A query is a standard database management system request for inserting data into, or retrieving data from, a database table or combination of tables. We define a query as a read or write request on an attribute of the relation. A query is represented as <O(D1), O(D2), ... O(Dn)> where D1, D2, ... Dn ∈ Rs, Rs is the relation schema and the Di are attributes. O represents the operation, i.e. a read or write operation, O ∈ {R, W}. For example, examine the following transaction:

start transaction
select balance from Account where Account_Number='9001';
select balance from Account where Account_Number='9002';
update Account set balance=balance-900 where Account_Number='9001';
update Account set balance=balance+900 where Account_Number='9002';
commit;   //if all SQL queries succeed
rollback; //if any SQL query failed or raised an error

The query sequence corresponding to this transaction is:

<<R(Account_Number),R(balance)>, <R(Account_Number),R(balance)>, <R(Account_Number),R(balance),W(balance)>, <R(Account_Number),R(balance),W(balance)>>

Definition 3 (Read Sequence) A read sequence is defined as {R(x1), R(x2), ... O(xn)} where O represents a read or write operation, O ∈ {R, W}. The read sequence represents that the transaction may need to read all data items x1, x2, ..., xn-1 before it performs operation O on data item xn. For example, consider the following update statement in a transaction:

Update Table1 set x = a + b + c where d = 90;

In this statement, before x is updated, the values of a, b, c and d must be read, and only then is the new value of x calculated. So <R(a), R(b), R(c), R(d), W(x)> ∈ RS(x), where RS(x) denotes the read sequence set of x.
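To make the notation concrete, the following minimal Python sketch (illustrative only; the helper name is ours, not part of the actual QPAFCS implementation) models a query as an ordered list of (operation, attribute) pairs and rebuilds the read sequence of the UPDATE statement above:

# A query is modelled as a list of ('R'|'W', attribute) pairs.
def read_sequence_of_update(read_attrs, written_attr):
    # All attributes in the SET expression and WHERE clause are read
    # before the target attribute is written (Definition 3).
    return [('R', a) for a in read_attrs] + [('W', written_attr)]

# UPDATE Table1 SET x = a + b + c WHERE d = 90;
seq = read_sequence_of_update(['a', 'b', 'c', 'd'], 'x')
print(seq)  # [('R','a'), ('R','b'), ('R','c'), ('R','d'), ('W','x')]

# A transaction couples a user ID, a transaction ID and its queries
# (Definition 1): <Uid, Tid, <q1, q2, ... qn>>
transaction = ('U1001', 'T67', [seq])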
Definition 4 (Write Sequence) A write sequence is defined as {O(x1), W(x2), ... W(xn)} where O represents a read or write operation, O ∈ {R, W}. It represents that the transaction may need to write data items x2, ..., xn in this order after it operates on data item x1. For example, consider the following update statements in one transaction:

Update Table1 set x = a + b + c where a=50;
Update Table1 set y = x + u where x=60;
Update Table1 set z = x + w + v where w=80;

From the above example it can be noted that <W(x), W(y), W(z)> is one write sequence of data item x, that is <W(x), W(y), W(z)> ∈ WS(x), where WS(x) denotes the write sequence set of x.

Definition 5 (Read Rules (RR)) Read rules are the association rules generated from read sequences whose confidence is greater than the user-defined threshold (Ψconf). A read rule is represented as {R(x1), R(x2), ...} ⇒ O(x). For every sequential pattern <R(x1), R(x2), ..., R(xn-1), O(xn)> in the read sequence set, generate the read rule {R(x1), R(x2), ..., R(xn-1)} ⇒ O(xn). If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the answer set of read rules, which implies that before operating on xn, we need to read x1, x2, ..., xn-1. For example, the read rule corresponding to the read sequence <R(a), R(b), R(c), R(d), W(x)> is {R(a), R(b), R(c), R(d)} ⇒ W(x).

Definition 6 (Write Rules (WR)) Write rules are the association rules generated from write sequences whose confidence is greater than the user-defined threshold (Ψconf). A write rule is represented as O(x) ⇒ {W(x1), W(x2), ...}. For every sequential pattern <O(x), W(x1), W(x2), ..., W(xk)> in the write sequence set, generate the write rule O(x) ⇒ {W(x1), W(x2), ..., W(xk)}. If the confidence of the rule is larger than the minimum confidence (Ψconf), it is added to the set of write rules, which depicts that after updating x, data items x1, x2, ..., xk must be updated by the same transaction. For example, the write rule corresponding to the write sequence
<W(x), W(y), W(z)> is W(x) ⇒ {W(y), W(z)}.

Definition 7 (Critical Data Elements (CDE)) These are semantically defined data elements crucial to the functioning of the system: data attributes of prime significance having a direct correlation to the integrity of the system. In a vertically hierarchical organisation, these are the attributes accessed only by the top-level management, and access by lower levels of the hierarchy is strictly protected.

Type of Attribute | Sensitivity Level
Critical Data Elements | Highest
Directly Associated Elements | Medium
Normal Attributes | Low

Table 3.1 Types of attributes and their sensitivity levels

CDEs are the tokens of behaviour that our model uses for recognising malicious activity by users of the system.

Definition 8 (Critical Rules (CR)) The set of rules that contain a Critical Data Element in their antecedent or consequent:

CR = {ζ | (ζ ∈ RR ∨ ζ ∈ WR) ∧ x ∈ CDE ∧ (ζ = {R(x1), R(x2), ...} ⇒ O(x) ∨ ζ = O(x) ⇒ {W(x1), W(x2), ...})}

We propose a method of user access pattern recognition using the Critical Rules. CRs capture the actions and goals of users from a series of observations of the users' actions and the environmental conditions, i.e. the user query patterns associated with the Critical Data Elements.

Definition 9 (Directly Associated Elements (DAE)) The attributes, other than those in CDE, which are part of the antecedents or consequents of Critical Rules:

DAE = {μi | μi ∈ CR ∧ μi ∉ CDE}

The query patterns as perceived by our model QPAFCS are explored using DAEs, which represent the first level of access to the CDEs. A user's behaviour is represented by a set of first-order statements (derived from queries) called an attribute hierarchy, encoded in first-order logic, which defines abstraction, decomposition and functional relationships between types of access arrangements. The unit transactions accessing CDEs are decomposed into an attribute hierarchy comprising DAEs, which represents the user's most sensitive retrieval pattern. Example:

 R(b) → R(a)
 R(b), R(c) → R(a)

If a is a CDE, then the set {b, c} represents the DAEs.

Definition 10 (Dubiety Score (φ)) A measure of the anomaly exhibited by a user in the past, based on his historic transactional data. This score summarizes the user's historic malicious access attempts. The Dubiety Score attempts to quantify the personnel vulnerability that the organisation faces because of a particular user.
The Dubiety Score is indicative of the amount of deviation between the user's access pattern and his designated role. The Dubiety Score, combined with the deviation of the user's present query from his normal behaviour pattern, yields the output of the proposed IDS. For our paper:

0 ≤ φ ≤ 1    (1)

The higher the Dubiety Score, the stronger the evidence against the user following his assigned role, that is, the stronger the indication of malicious intent, i.e. rogue behaviour.

Definition 11 (Dubiety Table) A table maintaining the record of the dubiety score of each user. It contains two attributes, UserID and Dubiety Score. The initial dubiety scores are set to 1.

Uid | φ
1001 | 1
1002 | 1
1003 | 1
1004 | 1
1005 | 1

Table 3.2 Initial Dubiety Table

The dubiety table is updated each time a user performs a query. For example, let user 1001's deviation from a normal query be quantified as ds = 0.81. The updated dubiety table is shown below, where ds is the deviation from the normal query and φi the initial dubiety score.

Uid | √(ds · φi)
1001 | 0.9
1002 | 1
1003 | 1
1004 | 1
1005 | 1

Table 3.3 Updated Dubiety Table

The updated dubiety table is then stored in memory for further processing.
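The update rule can be reproduced in a few lines of Python; this is an illustrative sketch only, with the dubiety table held as a plain dict:

import math

dubiety = {1001: 1.0, 1002: 1.0, 1003: 1.0, 1004: 1.0, 1005: 1.0}

def update_dubiety(table, uid, ds):
    # ds quantifies the deviation of the user's latest query; the new
    # score is the geometric mean sqrt(ds * phi_i) used in Table 3.3.
    table[uid] = math.sqrt(ds * table[uid])
    return table[uid]

print(update_dubiety(dubiety, 1001, 0.81))  # 0.9, as in Table 3.3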
3.2 Learning Phase

We start the learning phase by reading the training dataset into memory and extracting useful patterns from it. Our system requires a non-malicious training dataset composed of transactions executed by trusted users. The model aims at generating user profiles from the transaction logs and quantifying deviation from normal behaviour, i.e. this phase recognises and characterises each user's activity pattern on the basis of the arrangement of their queries. The components of the architecture of the proposed model are as follows:

Fig 3(a) Learning Phase Architecture

COMPONENTS OF ARCHITECTURE:

Training data: A transaction log is a sequential record of all changes made to the database, while the actual data is contained in a separate file. The transaction log contains enough information to undo all changes made to the data file as part of any individual transaction. The log records the start of a transaction, all the changes considered to be part of it, and the final commit or rollback of the transaction. Each database has at least one physical transaction log and one data file that is exclusive to the database for which it was created. The initial input to the learning phase algorithm is the transaction log, containing only authorised and consistent transactions. This data is free of any unauthorised activity and is used to form user profiles, role profiles etc. based on normal user transactions. The logs are scanned and the following elements are extracted:

a. SQL queries
b. The user executing a given query

SQL query parser: This is a tool that takes SQL queries as input, parses them and produces the read and write sequences corresponding to each SQL query as output. The query parser also
assigns a unique transaction ID. The final output consists of three columns: TID (Transaction ID), UID (User ID) and the read/write sequence generated by the parsing algorithm. As an example, consider the following transaction performed by user U1001:

start transaction
select balance from Account where Account_Number='9001';
commit;   //if all SQL queries succeed
rollback; //if any SQL query failed or raised an error

The parser generates a unique transaction ID, say T1234, and then parses the transaction, finally yielding:

<T1234, U1001, <R(Account_Number), R(balance)>>

Frequent sequences generator: After the SQL query parser generates the sequences, they are pre-processed: weights are assigned to data items, with CDEs given greater weight than DAEs and other normal attributes. These pre-processed sequences are then given as input to the frequent sequences generator, which uses the PrefixSpan algorithm to generate frequent sequences from the input sequences corresponding to each UID.

Rule generator: The frequent sequences are given as input to the rule generator module, which uses association rule mining to generate read rules and write rules from the frequent sequences (a runnable sketch follows the table below). As an example, if the input frequent sequences are:

1. <R(m),R(n),R(o),W(a)>
2. <R(m),R(n),W(o),W(a)>
3. <R(m),W(n),W(o),W(a)>
4. <W(a),R(b),W(o)>
5. <R(a),R(b),R(m),W(a)>
6. <R(a),R(b),W(m),W(b)>

S.No. | Frequent Sequence | Associated Rule
1 | <R(m),R(n),R(o),W(a)> | R(m),R(n),R(o) → W(a)
2 | <R(m),R(n),W(o),W(a)> | R(m),R(n),W(o) → W(a)
3 | <R(m),W(n),W(o),W(a)> | R(m),W(n),W(o) → W(a)
4 | <W(a),R(b),W(o)> | W(a),R(b) → W(o)
5 | <R(a),R(b),R(m),W(a)> | R(a),R(b),R(m) → W(a)
6 | <R(a),R(b),W(m),W(b)> | R(a),R(b),W(m) → W(b)

Table 3.4 Rule generator output for the given example
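The rule-generation step can be sketched as follows; this is a simplified Python illustration (not the actual implementation), where, for brevity, confidence is computed over the mined frequent sequences themselves rather than over the full sequence database:

from collections import Counter

def generate_rules(frequent_seqs, psi_conf=0.5):
    # Each sequence <o1, ..., o_{n-1}, o_n> yields the candidate rule
    # {o1, ..., o_{n-1}} => o_n (Definitions 5 and 6); it is kept only
    # if its confidence exceeds the threshold psi_conf.
    seqs = [tuple(s) for s in frequent_seqs]
    support = Counter(seqs)
    rules = []
    for s in set(seqs):
        antecedent = s[:-1]
        ante_count = sum(c for t, c in support.items()
                         if t[:len(antecedent)] == antecedent)
        confidence = support[s] / ante_count
        if confidence > psi_conf:
            rules.append((antecedent, s[-1], confidence))
    return rules

seqs = [[('R','m'), ('R','n'), ('R','o'), ('W','a')],
        [('R','m'), ('R','n'), ('W','o'), ('W','a')]]
for ante, cons, conf in generate_rules(seqs, psi_conf=0.3):
    print(ante, '=>', cons, round(conf, 2))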
DAE generator: In our approach, we semantically define a class of data items known as Critical Data Elements, or CDEs. These CDEs and the mined rules are given as input to our DAE (Directly Associated Element) generator, which marks as DAE every element present in the antecedent or consequent of a rule that involves at least one CDE (see Algorithm 1 and the sketch that follows it).

User vector generator: Using the frequent sequences for the given audit period, this module generates the user vectors. A user vector is of the form BID = <UID, w1, w2, w3, ... wn> where wi = |O(ai)|, the total number of times the user with the given UID performs operation O ∈ {R, W} on attribute ai in the pre-decided audit period. An audit period τ refers to a period of time such as one year, a time window τ = [t1, t2], or the most recent 10 months. The user vector is representative of the user's activity: each wi represents how frequently the user performs the operation on the particular data item. It can also be used in normalized form, as in our proposed model QPAFCS:

UVID = <UID, <p(a1), p(a2), p(a3), ... p(an)>>

\[ p(a_k) = \frac{w_k}{\sum_{w_j \in B_i} w_j} \]

p(ak) is the probability of accessing attribute ak; a value of p(ak) close to 1 means that the user accesses the given attribute frequently.

Algorithm 1: DAE Generator
Data: CDE, Set DAE = {}, RR = Set of Read Rules, WR = Set of Write Rules
Result: The set of Directly Associated Elements DAE
Function DAE_Generator(CDE, RR, WR)
  for Ω ∈ RR ∪ WR do
    for α ∈ Ω do
      if α ∈ CDE then
        for β ∈ Ω do
          DAE ← DAE ∪ {β}
        end
      end
    end
  end
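A runnable Python version of Algorithm 1 might look as follows (a sketch under our own naming conventions; rules are (antecedent, consequent) pairs of (operation, attribute) tuples, and, per Definition 9, the CDEs themselves are excluded from the result):

def dae_generator(cde, rules):
    dae = set()
    for antecedent, consequent in rules:
        attrs = {attr for _, attr in list(antecedent) + [consequent]}
        if attrs & cde:            # the rule involves a critical element
            dae |= (attrs - cde)   # every other attribute becomes a DAE
    return dae

rules = [((('R','b'),), ('R','a')),
         ((('R','b'), ('R','c')), ('R','a'))]
print(dae_generator({'a'}, rules))  # {'b', 'c'}, as in the Definition 9 example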
Cluster generator: This module takes user vectors and rules as input and generates fuzzy clusters. Users are clustered into different fuzzy clusters based on the similarity of their user vectors [29]. A cluster profile is of the form Ci = <CID, {R}>, where CID represents the cluster centroid and {R} is the set of rules formed by taking the union of all the rules that the members of the given fuzzy cluster abide by. We use Fuzzy C-Means clustering [26] to create the clusters. Each user belongs to a cluster to a certain degree wij, where wij represents the membership coefficient of the ith user (ui) with the jth cluster. The centre of a cluster (α) is the mean of all points, weighted by their membership coefficients [28]. Mathematically,

\[ w_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert u_i - \alpha_j \rVert}{\lVert u_i - \alpha_k \rVert} \right)^{\frac{2}{m-1}}} \qquad \alpha_j = \frac{\sum_u w_j(u)^m \, u}{\sum_u w_j(u)^m} \]

The objective function minimized to create the clusters is:

\[ \arg\min \sum_{i=1}^{n} \sum_{j=1}^{C} w_{ij}^m \, \lVert u_i - \alpha_j \rVert^2 \]

where n is the total number of users, C is the number of clusters, and m is the fuzzifier. The dissimilarity/distance function used in the formation of the fuzzy clusters is the modified Jensen-Shannon distance [27], illustrated as follows. Given two user vectors [13] UVx = <Ux, <px(a1), px(a2), px(a3), ... px(an)>> and UVy = <Uy, <py(a1), py(a2), py(a3), ... py(an)>> of equal length n, the modified Jensen-Shannon distance is computed as

\[ D(UV_x \parallel UV_y) = \frac{1}{2} \sum_{i=1}^{n} \left[ \big(1 + p_x(a_i)\,w(a_i)\big) \log_2 \frac{1 + p_x(a_i)\,w(a_i)}{1 + p_y(a_i)\,w(a_i)} + \big(1 + p_y(a_i)\,w(a_i)\big) \log_2 \frac{1 + p_y(a_i)\,w(a_i)}{1 + p_x(a_i)\,w(a_i)} \right] \]

where w(ai) is the semantic weight associated with attribute ai.
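A compact sketch of this clustering step is given below (illustrative Python, not the production code: the data, the attribute weights and the convergence settings are made up, and the centroid update uses the weighted-mean rule from above):

import numpy as np

def modified_js_distance(px, py, w):
    a, b = 1 + px * w, 1 + py * w
    return np.sum(a * np.log2(a / b) + b * np.log2(b / a)) / 2

def fcm(users, attr_w, c=2, m=2.0, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.dirichlet(np.ones(c), size=len(users))     # memberships w_ij
    for _ in range(iters):
        # centroid update: weighted mean of the user vectors
        centroids = (W.T ** m @ users) / (W.T ** m).sum(axis=1, keepdims=True)
        # membership update: w_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
        d = np.array([[modified_js_distance(u, a, attr_w) for a in centroids]
                      for u in users]) + 1e-12
        W = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2 / (m - 1)), axis=2)
    return W, centroids

users = np.array([[0.7, 0.1, 0.2], [0.6, 0.2, 0.2], [0.1, 0.8, 0.1]])
attr_w = np.array([2.0, 1.0, 1.0])   # heavier semantic weight on a CDE column
W, centroids = fcm(users, attr_w)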
User profile generator: This module takes the user vectors and the cluster profiles as input and generates user profiles. A user profile is of the form Ui = <UID, <p(a1), p(a2), p(a3), ... p(an)>, <c1, c2, ... cC>>, where UID is a unique ID given to each user, <p(a1), p(a2), p(a3), ... p(an)> is a 1-D vector containing the probability of the user accessing each attribute, and <c1, c2, ... cC> is a vector representing the membership coefficients of the given user for the C different clusters. As an example, consider a system with 4 fuzzy clusters and 4 attributes; the table below illustrates the profile of user U1001.

User vector | <U1001, 0.2, 0.1, 0.9, 0.6>
Membership coefficients (C1, C2, C3, C4) | 0.2, 0.2, 0.2, 0.4
User profile | <U1001, <0.2, 0.1, 0.9, 0.6>, <0.2, 0.2, 0.2, 0.4>>

Table 3.5 User profile for the given example

3.3 Testing Phase

Section 3.2 described the learning phase, in which the system is trained using non-malicious (benign) transactions. The trained model can now be used to detect malicious transactions. In this phase, a test query is obtained as input and compared with the model's perception of the user's access pattern, and the model evaluates whether the test transaction is malicious. It is first checked whether the user is trying to access a CDE; if so, the transaction is allowed only if the given user has accessed that CDE before. Next, it is checked whether any DAE is being accessed: a user can perform a write operation on a DAE only if he has previously written to it; otherwise the transaction is termed malicious. Finally, we check whether the transaction abides by the rules that are generally followed by similar users.

PHASES OF THE TESTING PHASE:

Rule generator: This module takes the sequence generated by the SQL query parser and produces the rule that the input transaction follows. This can be a read rule or a write rule, and it indicates the operations performed by the user, the data attributes accessed, and the order in which they are accessed. This rule can then be checked for maliciousness.

CDE Detector: The semantically critical elements, referred to in our approach as CDEs, are detected in this module. The read/write rule corresponding to the incoming transaction is checked for the presence of CDEs. If the rule being checked contains a CDE, it is dealt with using the following policy:

a. If a read operation has been performed on any CDE, i.e. r(CDE) is present in the rule, and UV[i][r(CDE)] = 0 and UV[i][w(CDE)] = 0 for the given user, then the transaction is termed malicious.
b. If a write operation has been performed on any CDE, i.e. w(CDE) is encountered, and UV[i][w(CDE)] = 0 for the given user, then the transaction is termed malicious.
Fig 3(b) Architecture of the Testing Phase

Algorithm 2: CDE Detector
Data: Set of rules (ϒ) from the test transaction, set χCDE, UID, user profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to CDEs
for Ѓ ∈ ϒ do
  for ϱ ∈ Ѓ do
    if ϱ ∈ χCDE then
      if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then Raise Alarm; end
      if r(ϱ) ∈ Ѓ and ϴ[UID][r(ϱ)] == 0 and ϴ[UID][w(ϱ)] == 0 then Raise Alarm; end
    end
  end
end

DAE Detector: This module addresses the issue of inference attacks on CDEs. As discussed earlier, certain data elements can be used to access the CDEs, i.e. first-order inference. This module uses the rules mined in the learning phase to determine which elements can be used to directly infer the CDEs. Our system seeks to prevent inference attacks by specially monitoring the DAEs, with emphasis on write operations: if a write operation has been performed on any DAE, i.e. w(DAE) is present in the rule being checked, and UV[i][w(DAE)] = 0 for the given user, then the transaction is termed malicious.

Algorithm 3: DAE Detector
Data: Set of rules (ϒ) from the test transaction, set χDAE, UID, user profile (ϴ)
Result: Checks whether the test transaction is malicious or normal with respect to DAEs
for Ѓ ∈ ϒ do
  for ϱ ∈ Ѓ do
    if ϱ ∈ χDAE then
      if w(ϱ) ∈ Ѓ and ϴ[UID][w(ϱ)] == 0 then Raise Alarm; end
    end
  end
end
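Both detectors can be condensed into one Python check, sketched below under our own data layout (the user profile is a dict counting the user's past reads and writes per attribute; the names are illustrative):

def is_malicious(rule_ops, profile, cde, dae):
    # rule_ops: ('R'|'W', attribute) pairs from the rule under test.
    for op, attr in rule_ops:
        if attr in cde:
            if op == 'W' and profile.get(('W', attr), 0) == 0:
                return True          # first-ever write to a CDE
            if op == 'R' and profile.get(('R', attr), 0) == 0 \
                         and profile.get(('W', attr), 0) == 0:
                return True          # user has never touched this CDE
        if attr in dae and op == 'W' and profile.get(('W', attr), 0) == 0:
            return True              # first-ever write to a DAE
    return False

profile = {('R', 'a'): 5, ('W', 'b'): 2}
print(is_malicious([('R', 'a'), ('W', 'b')], profile, {'a'}, {'b'}))  # False
print(is_malicious([('W', 'a')], profile, {'a'}, {'b'}))              # True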
Dubiety Score Calculator and Analyser: If the transaction has not been found malicious by the previous two modules, we check whether it is malicious based on the previous history of the user and the behaviour pattern of all similar users. To do so, we maintain a record of the actions of all users by keeping the Dubiety Score (φi). The deviation of a user's new transaction from his normal access pattern is referred to as dubiety, and its relative measure is the Dubiety Score. Our IDS keeps a log of the DS (Dubiety Score) in a separate table. A user who is a potential threat tends to have a high dubiety score. Another intuition that our system follows is that any transaction a user makes should match significantly with the transactions that the same user, or similar users, have made in the past. We use a measure ds to keep track of the maximum similarity of the given rule, and combine ds with φi to obtain the final dubiety score φf for the given user. We define two thresholds, ФLT and ФUT: ФUT represents the upper limit for the dubiety score of a non-malicious user, whereas ФLT denotes the lower limit. If φf for a user comes out to be greater than ФUT, the user is malicious; a φf value less than ФLT denotes a benign user.

 If the incoming rule (R1) is a write rule, the consequent of the incoming rule is matched with the corresponding rules in each cluster of which the user is a part. A user is said to be part of the ith cluster iff μi > δ, i ∈ [1, k], where μi is the fuzzy membership coefficient of the given user for the ith cluster and δ is a user-defined threshold.
 If the incoming rule (R1) is a read rule, the antecedent of the incoming rule is matched with the corresponding rules in each cluster of which the user is a part.
 To quantitatively measure the similarity between two rules, we use the modified Jaccard distance [30]:

\[ JD = 1 - \frac{\delta_1 \, |R_1 \cap R_2| - \delta_2 \, \big(|R_1 \cup R_2| - |R_1 \cap R_2|\big)}{|R_1 \cup R_2|} \]

Algorithm 4: Modified Jaccard Distance
Data: Rules R1, R2; δ1, δ2; sets χR1, χR2
Result: Distance between the two rules (Ԏ)
Function jcDistance(R1, R2)
  for Ω ∈ R1 do χR1 ← χR1 ∪ {Ω}; end
  for Ω' ∈ R2 do χR2 ← χR2 ∪ {Ω'}; end
  Ԏ = 1 − (δ1·|χR1 ∩ χR2| − δ2·(|χR1 ∪ χR2| − |χR1 ∩ χR2|)) / |χR1 ∪ χR2|;
  return Ԏ;
 The minimum value of JD over all matched rules is regarded as ds. φi is fetched directly from the dubiety table. The final dubiety score for the given user is calculated as φf = √(ds · φi).
 If φf < ФLT, the transaction is termed non-malicious. In this case, the current dubiety score in the dubiety table for the given user is reduced by a factor known as the amelioration factor (Å); thus φi is updated as φi = Å·φi.
 If ФLT ≤ φf < ФUT, the transaction is termed non-malicious and the dubiety table entry for the given user is updated with φf.
 If φf ≥ ФUT, the transaction is termed malicious.

As an example, let the initial dubiety table be:

Uid | φ
1001 | 0.9
1002 | 0.8
1003 | 0.2
1004 | 0.6
1005 | 0.7

Table 3.6 Initial dubiety table

Let the minimum value of ds corresponding to each user be:

Uid | ds
1001 | 0.2
1002 | 0.3
1003 | 0.2
1004 | 0.6
1005 | 0.3

Table 3.7 Minimum ds values for various users
The calculated dubiety scores are:

Uid | φf = √(ds · φi)
1001 | 0.42
1002 | 0.49
1003 | 0.2
1004 | 0.6
1005 | 0.46

Table 3.8 Calculated dubiety scores

Taking ФLT = 0.3 and ФUT = 0.6:

Uid | φf | Nature of transaction | Updated φ
1001 | 0.42 | Non-malicious | 0.42
1002 | 0.49 | Non-malicious | 0.49
1003 | 0.2 | Non-malicious | 0.198
1004 | 0.6 | Malicious | 0.6
1005 | 0.46 | Non-malicious | 0.46

Table 3.9 Summary of transactions of various users

The malicious transactions are blocked outright, the non-malicious transactions are processed, and the updated dubiety table is stored in the database.
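The whole decision procedure of this module can be sketched in Python as follows (an illustrative reading of the steps above; δ1, δ2, the two thresholds and the amelioration factor are the hyperparameters named in the text, with made-up values):

import math

def modified_jaccard(r1, r2, d1=0.70, d2=0.20):
    s1, s2 = set(r1), set(r2)
    inter, union = len(s1 & s2), len(s1 | s2)
    return 1 - (d1 * inter - d2 * (union - inter)) / union

def classify(test_rule, cluster_rules, phi_i,
             phi_lt=0.3, phi_ut=0.6, amelioration=0.99):
    ds = min(modified_jaccard(test_rule, r) for r in cluster_rules)
    phi_f = math.sqrt(ds * phi_i)
    if phi_f >= phi_ut:
        return 'malicious', phi_i                 # blocked; score kept
    if phi_f < phi_lt:
        return 'non-malicious', amelioration * phi_i
    return 'non-malicious', phi_f

test_rule = [('R','c'), ('R','b'), ('R','a')]
cluster_rules = [[('R','d'), ('R','b'), ('R','a')]]
print(classify(test_rule, cluster_rules, phi_i=0.8))  # ('malicious', 0.8)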
4. DISCUSSION

With regard to a typical credit card company dataset, some examples of critical data elements (CDEs) are:

1. CVV (denoted by a). The card verification value (CVV) is a combination of features used in credit, debit and automated teller machine (ATM) cards for the purpose of establishing the owner's identity and minimizing the risk of fraud. The CVV is also known as the card verification code (CVC) or card security code (CSC). When properly used, the CVV is highly effective against some forms of fraud. For example, if the data in the magnetic stripe is changed, a stripe reader will indicate a "damaged card" error. The flat-printed CVV is (or should be) routinely required for telephone or Internet-based purchases, because it implies that the person placing the order has physical possession of the card. Some merchants check the flat-printed CVV even when transactions are conducted in person. CVV technology cannot protect against all forms of fraud, however. If a card is stolen, or the legitimate user is tricked into divulging vital account information to a fraudulent merchant, unauthorized charges against the account can result. A common method of stealing credit card data is phishing, in which a criminal sends out legitimate-looking email in an attempt to gather personal and financial information from recipients. Once the criminal has possession of the CVV in addition to personal data from a victim, widespread fraud against that victim, including identity theft, can occur.

The following are directly associated elements (DAEs) of the CVV:

a. Credit card number (denoted by b)
b. Name of card holder (denoted by c)
c. Card expiry date (denoted by d)

The credit card number, name of card holder and card expiry date are read before the CVV and hence used to validate the CVV entered by the user; these attributes have therefore been classified as DAEs by our system.

Some normal data attributes are:

1. Gender of customer (denoted by e)
2. Credit limit (denoted by f)
3. Customer's phone number (denoted by g)

These attributes have been collected for fraud detection; they are not directly used to access the CDE, but are crucial for the process. Some example rules under our proposed approach:

 R(b) → R(a)
 R(b), R(c) → R(a)
5. EXAMPLE OF OUR APPROACH

1. JC Distance

R1: R(c), R(b) → R(a)
R2: R(d), R(b) → R(a)

With hyperparameters δ1 = 0.70 and δ2 = 0.20, the modified JC distance between R1 and R2 is calculated as

\[ JD = 1 - \frac{\delta_1 \, |R_1 \cap R_2| - \delta_2 \, \big(|R_1 \cup R_2| - |R_1 \cap R_2|\big)}{|R_1 \cup R_2|} \]

Here |R1 ∩ R2| = 2 (the shared operations R(b) and R(a)) and |R1 ∪ R2| = 4, so JD = 1 − (0.70·2 − 0.20·2)/4 = 0.75.

2. User Profile Vector

B1 = <U1, <0.7, 0.1, 0.6, 0.2, 0.4, 0.0, 0.2, 0.0>, <0.2, 0.3, 0.1, 0.2, 0.167, 0.033>>

Here the values in the second tuple <0.7, ..., 0.0> represent the probabilities of user U1 accessing particular attributes; for instance, 0.7 denotes that there is a 70% probability that U1 accesses the first attribute. The values in the third tuple represent the memberships of user U1 in the k fuzzy clusters, where k = 6 in our case.

3. Dubiety Score

Suppose the dubiety score φi for user U1 is 0.8 and the minimum JC distance of the test transaction within its clusters is ds = 0.6. Then

φf = √(ds · φi) = √(0.6 · 0.8) ≈ 0.69

Setting our hyperparameter ФUT to 0.65, we observe that φf > ФUT. Hence the test transaction is malicious, and an alarm is raised.
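These worked numbers can be checked in a few lines of Python (with the hyperparameters exactly as stated above):

import math

d1, d2, inter, union = 0.70, 0.20, 2, 4       # |R1 ∩ R2| = 2, |R1 ∪ R2| = 4
print(1 - (d1 * inter - d2 * (union - inter)) / union)   # 0.75

print(round(math.sqrt(0.6 * 0.8), 2))         # 0.69 > phi_UT = 0.65 -> alarm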
6. EXPERIMENTATION

In this section we describe the evaluation of the proposed algorithm. We first describe our dataset, and then calculate various accuracy measures considering different parameters as reference.

6.1 Description of dataset

This work concerns anomaly detection of user behaviours. An ideal dataset would be obtained from a practical system with concrete job functions, but in practice such data is highly sensitive for almost every organization or company. The performance of the algorithm was therefore analysed by carrying out several experiments on a credit card company dataset adhering to the TPC-C benchmark [18]. The TPC-C schema is composed of a mixture of read-only and read/write transactions that replicate the activities found in complex OLTP application environments. The database schema, data population, transactions and implementation rules were designed to broadly represent modern OLTP systems. We used two audit logs: one for training the model and a second for testing it. The training log comprised normal user transactions, while the testing log consisted of a mixture of normal and malicious user transactions. Although there are unusual records in the real dataset, we also injected some anomalies for detection; the injected anomalies differ from the normal behaviour pattern in several respects. In total, about 20,000 transactions were used, of which about 99% were non-malicious and less than 1% malicious. Fig. 6(a) shows the distribution of malicious and benign data in the dataset used.

Fig 6(a) Distribution of data in the dataset

The details of CDEs, DAEs and normal data items have already been given in Section 3, and examples are discussed in Section 5.
The access pattern data shows that CDEs are rarely accessed, and only by a few user roles; hence, protection of CDEs from malicious access is of greater significance than that of DAEs and normal data elements.

6.2 Cluster Analysis

When the number of users/user roles exceeds a given limit, it becomes exceedingly difficult for the IDS to keep track of individual user access patterns and hence to detect anomalies. This is why clustering is a better and computationally more efficient route to good IDS performance. We prefer fuzzy clustering over hard clustering. Fuzzy clustering (also referred to as soft clustering) is a form of clustering in which each data point can belong to more than one cluster, whereas in non-fuzzy (hard) clustering, data is divided into distinct clusters and each data point belongs to exactly one cluster. In fuzzy clustering, membership grades are assigned to each data point, indicating the degree to which the point belongs to each cluster. Thus, points on the edge of a cluster, with lower membership grades, belong to the cluster to a lesser degree than points in the centre of the cluster. When we evaluate the various performance measures keeping the number of clusters as the reference parameter, we observe that a particular cluster count is the most efficient in predicting results.

Fig 6(b) Variation of performance with number of clusters

Fig 6(b) depicts the variation in precision, recall, TNR and accuracy with the number of clusters. From the graph, we can see that:

 TNR does not vary with the number of clusters, i.e. TNR is invariant.
 Precision is always greater than 0.94 and is more or less constant.
 Recall reaches its optimum value when the number of fuzzy clusters is greater than 3.
 Accuracy also reaches its optimum value when the number of clusters is greater than 3.
6.3 Distances and thresholds

In Section 3.2 we described the modified Jensen-Shannon distance as a measure of the distance between two user vectors of the same length. In probability theory and statistics, the Jensen-Shannon divergence is a method of measuring the similarity between two probability distributions; it is also known as the information radius (IRad) or total divergence to the average. It is based on the Kullback-Leibler divergence, with some notable (and useful) differences: it is symmetric, and it always has a finite value. Concretely, JSD(P ∥ Q) = ½ D_KL(P ∥ M) + ½ D_KL(Q ∥ M), where M = ½(P + Q). The square root of the Jensen-Shannon divergence is a metric often referred to as the Jensen-Shannon distance. We preferred the modified Jensen-Shannon distance in order to give weights to data attributes and avoid the curse of dimensionality. The variation of the modified Jensen-Shannon distance with the Euclidean distance is shown in Fig 6(g).

In Section 3.3 we defined the modified Jaccard distance to quantitatively measure the similarity between two rules. The Jaccard index, also known as Intersection over Union or the Jaccard similarity coefficient, is a statistical measure used for comparing the similarity and diversity of sample sets; the Jaccard coefficient measures similarity between finite sample sets and is defined as the size of the intersection divided by the size of the union of the sample sets. The variation of the modified Jaccard index with the Jaccard index is shown in Fig 6(h).

The variation of precision, recall, TNR and accuracy with the various thresholds defined in Section 3, namely δ1, δ2, фUT and фLT, is shown in the following figures. Fig 6(c) shows the variation with δ1: precision, TNR and accuracy increase with δ1, while recall decreases. Fig 6(e) shows the variation with δ2: precision, TNR and accuracy start decreasing once δ2 increases beyond a certain value, while recall increases for higher values of δ2. Fig 6(d) shows the variation with фUT: precision first decreases and then increases sharply as фUT grows, and accuracy follows an identical trend; TNR follows a similar trend, except that it does not decrease initially; recall, on the contrary, decreases as фUT increases. Fig 6(f) shows the variation with фLT: all parameters fluctuate a little but remain more or less constant as фLT increases.

With regard to the dataset we have used, the following inferences can be drawn from the graphs:

1. δ1 should be close to 0.65 for optimum performance.
2. δ2 should be close to 0.55 for optimum performance.
3. фUT should be close to 0.59 for optimum performance.
4. фLT should be close to 0.2 for optimum performance.
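As an illustration of the comparison behind Fig 6(g), the following sketch evaluates the modified Jensen-Shannon distance from Section 3.2 and the plain Euclidean distance on the same pair of user vectors (the vectors and weights are made up for the example):

import numpy as np

def modified_js(px, py, w):
    a, b = 1 + px * w, 1 + py * w
    return np.sum(a * np.log2(a / b) + b * np.log2(b / a)) / 2

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
w = np.array([2.0, 1.0, 1.0])        # heavier weight on a critical attribute
print(modified_js(p, q, w), np.linalg.norm(p - q))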
6.4 Comparison with related methods

Table 6.1 lists the performance measures used for comparison. Using these measures, we compare our approaches with other related works. Our approaches are:

Approach 1: our approach, using the modified Jensen-Shannon distance and the modified Jaccard index.
Approach 2: using the unmodified Jaccard index with the Jensen-Shannon distance.
Approach 3: using the Euclidean distance with the unmodified Jaccard index.
S.No. | Performance Measure | Formula
1 | TNR | TN / (TN + FP)
2 | Precision | TP / (TP + FP)
3 | Accuracy | (TP + TN) / (TN + FP + TP + FN)
4 | F1 Score | 2 · Precision · Recall / (Precision + Recall)
5 | PPV | TP / (TP + FP)
6 | ACC | (TP + TN) / (TP + TN + FP + FN)
7 | NPV | TN / (TN + FN)
8 | FDR | FP / (FP + TP)
9 | FOR | FN / (TN + FN)
10 | BM | TPR + TNR − 1
11 | FPR | FP / (FP + TN)
12 | FNR | FN / (FN + TP)
13 | MK | PPV + NPV − 1
14 | MCC | (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN))

Table 6.1 Performance measures

In Table 6.2 we compare the three approaches with each other.

Sensitivity Measure | Approach 1 | Approach 2 | Approach 3
PPV | 0.96 | 0.73 | 0.74
TPR | 0.81 | 0.95 | 1.00
ACC | 0.89 | 0.80 | 0.83
F1 Score | 0.88 | 0.83 | 0.85
NPV | 0.83 | 0.93 | 1.00
FDR | 0.04 | 0.27 | 0.26
FOR | 0.17 | 0.07 | 0.00
BM | 0.77 | 0.60 | 0.65
FPR | 0.03 | 0.34 | 0.34
TNR | 0.96 | 0.65 | 0.65
FNR | 0.19 | 0.05 | 0.00
MK | 0.79 | 0.66 | 0.74
MCC | 0.78 | 0.63 | 0.70

Table 6.2 Comparison of our approaches
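Before turning to the comparisons, note that all of these measures follow mechanically from the four confusion-matrix counts; the Python sketch below (with illustrative counts, not our experimental ones) computes the full set of Table 6.1:

import math

def measures(tp, tn, fp, fn):
    tpr = tp / (tp + fn); tnr = tn / (tn + fp)
    ppv = tp / (tp + fp); npv = tn / (tn + fn)
    acc = (tp + tn) / (tp + tn + fp + fn)
    f1 = 2 * ppv * tpr / (ppv + tpr)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {'TNR': tnr, 'Precision/PPV': ppv, 'Accuracy/ACC': acc, 'F1': f1,
            'NPV': npv, 'FDR': 1 - ppv, 'FOR': 1 - npv, 'BM': tpr + tnr - 1,
            'FPR': 1 - tnr, 'FNR': 1 - tpr, 'MK': ppv + npv - 1, 'MCC': mcc}

print(measures(tp=81, tn=96, fp=4, fn=19))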
From Table 6.2, the following observations can be made. Comparing Approach 1 with Approach 2:

 The TNR and precision of Approach 1 are considerably better than those of Approach 2.
 Approach 1 also has better accuracy than Approach 2.
 Approach 1 has much lower FPR and FDR scores than Approach 2.
 Amongst the other performance measures, the MK and MCC values of Approach 1 are also better than those of Approach 2.
 Approach 2, on the other hand, has better TPR, NPV and FOR measures than Approach 1.
 Both approaches have similar F1 scores.

In measures like FPR and TNR, where Approach 1 performs well, Approach 2 performs rather poorly. However, in measures like TPR and NPV, where Approach 2 performs better, Approach 1 still performs well; for example, both approaches have similar NPV scores, with Approach 2 slightly ahead. As Approach 1 performs far better than Approach 2 in most measures, we conclude that the overall performance of Approach 1 is better.

Comparing Approach 1 with Approach 3:

 The TNR and precision of Approach 1 are considerably better than those of Approach 3.
 Approach 1 also has better accuracy than Approach 3.
 Approach 1 has much lower FPR and FDR scores than Approach 3.
 Amongst the other performance measures, the MK and MCC values of Approach 1 are also slightly better than those of Approach 3.
 Approach 3, on the other hand, has better TPR, NPV and FOR measures than Approach 1; in fact, it has the best values for these parameters in the entire table.
 Both approaches have similar F1 scores.

In measures like TNR and precision, where Approach 1 has among the best scores in the entire table, Approach 3 performs rather poorly, and Approach 3 also lags far behind in FPR and FDR. On the other hand, in the measures in which Approach 3 performs better, Approach 1 also performs quite well; for example, both approaches have good NPV scores, with Approach 3 ahead, and similar trends are observed for all other such measures except FNR, where Approach 3 is far superior. Considering all of the above, even though Approach 3 has the best values for some performance measures, its poor performance in the other measures is a clear disadvantage, and Approach 1 is therefore better than Approach 3 overall. Table 6.3 shows a comparison of our approaches with various other related works.
Sensitivity Measure | Approach 1 | Approach 2 | Approach 3 | Hu & Panda | Hashemi et al. | Mostafa et al. | Mina Sohrabi et al. | Majumdar et al. (2006) | Elisa Bertino et al. | UP Rao et al. (2016)
PPV | 0.96 | 0.73 | 0.74 | 0.88 | 0.97 | 0.94 | 0.93 | 0.88 | 0.94 | 0.61
TPR | 0.81 | 0.95 | 1.00 | 0.73 | 0.71 | 0.75 | 0.66 | 0.70 | 0.91 | 0.70
ACC | 0.89 | 0.80 | 0.83 | 0.81 | 0.84 | 0.85 | 0.80 | 0.80 | 0.93 | 0.64
F1 Score | 0.88 | 0.83 | 0.85 | 0.79 | 0.82 | 0.83 | 0.77 | 0.78 | 0.92 | 0.65
NPV | 0.83 | 0.93 | 1.00 | 0.77 | 0.77 | 0.79 | 0.73 | 0.75 | 0.91 | 0.68
FDR | 0.04 | 0.27 | 0.26 | 0.12 | 0.03 | 0.06 | 0.07 | 0.13 | 0.06 | 0.39
FOR | 0.17 | 0.07 | 0.00 | 0.23 | 0.23 | 0.21 | 0.27 | 0.25 | 0.09 | 0.32
BM | 0.77 | 0.60 | 0.65 | 0.63 | 0.69 | 0.70 | 0.60 | 0.60 | 0.85 | 0.35
FPR | 0.03 | 0.34 | 0.34 | 0.10 | 0.02 | 0.05 | 0.05 | 0.10 | 0.06 | 0.45
TNR | 0.96 | 0.65 | 0.65 | 0.90 | 0.98 | 0.95 | 0.94 | 0.90 | 0.94 | 0.65
FNR | 0.19 | 0.05 | 0.00 | 0.28 | 0.29 | 0.25 | 0.35 | 0.30 | 0.09 | 0.30
MK | 0.79 | 0.66 | 0.74 | 0.65 | 0.74 | 0.73 | 0.66 | 0.63 | 0.85 | 0.29
MCC | 0.78 | 0.63 | 0.70 | 0.63 | 0.72 | 0.71 | 0.63 | 0.61 | 0.85 | 0.29

Table 6.3 Comparison of our approaches with related works

Comparing our approach with the other related approaches, we observe that:

 In comparison to Hu & Panda, our approach works better with respect to all the performance measures considered.
 In comparison to the work of Mostafa et al., our approach performs better with respect to all the performance measures considered.
 In comparison to the work of Hashemi et al., even though our approach scores slightly lower in measures like TNR and precision, it scores much better with respect to the rest of the performance measures.
 Compared with the work of Mina Sohrabi et al., our approach performs better with respect to all the performance measures present in the table.
 In comparison to the work of Majumdar et al., our approach performs better with respect to all the performance measures we have considered.
 Compared with the work of UP Rao et al., our approach performs better in all the measures considered in the table.
 In comparison to the work of Elisa Bertino et al., our approach gives better TNR and precision scores, and comparatively better FDR and FPR scores. In the other measures, except TPR and recall, both approaches have somewhat similar scores. Since our work is mostly concerned with protecting Critical Data Items in a dataset, higher TNR and precision scores are more desirable than the other performance measures; and since our approach performs quite well with respect to the other measures too, the better TNR and precision scores easily make up for the lower recall values.
7. CONCLUSION AND FUTURE WORK

In this work we have detected malicious transactions from the perspective that certain data elements hold more critical information than others. Inference attacks against such data elements are blocked by taking into account the user's access pattern as well as his historic behaviour; a user who consistently behaves normally is gradually allowed to improve (i.e. lower) his dubiety score. The approach was analysed with respect to different performance parameters by conducting experiments, and we conclude that it works efficiently in determining the nature of a transaction. We plan to extend our approach from 2-level inference control to n-level inference control, whereby nth-order statements will be encoded into the attribute hierarchy and the n-level attribute tree/graph will be manipulated to form fuzzy clusters, with incoming transactions checked against the nth access level. Automatic manipulation of semantics to classify attributes as critical data elements may also be considered as a future research topic.
8. REFERENCES

1. I-Yuan Lin, Xin-Mao Huang, Ming-Syan Chen, "Capturing user access patterns in the Web for data mining", Proceedings 11th International Conference on Tools with Artificial Intelligence, IEEE, 9-11 Nov. 1999.
2. R.S. Sandhu, P. Samarati, "Access control: principle and practice", IEEE Communications Magazine, Volume 32, Issue 9, Sept. 1994.
3. Denning, D.E. (1987), "An Intrusion Detection Model", IEEE Transactions on Software Engineering, Vol. SE-13, pp. 222-232.
4. Knuth, Donald E., James H. Morris, Jr, and Vaughan R. Pratt, "Fast pattern matching in strings", SIAM Journal on Computing 6.2 (1977): 323-350.
5. Wang, Ke, "Anomalous Payload-Based Network Intrusion Detection", Recent Advances in Intrusion Detection, Springer Berlin. doi:10.1007/978-3-540-30143-1_11
6. Douligeris, Christos; Serpanos, Dimitrios N. (2007), Network Security: Current Status and Future Directions, John Wiley & Sons. ISBN 9780470099735.
7. Christina Yip Chung, Michael Gertz and Karl Levitt (2000), "DEMIDS: a misuse detection system for database systems", Integrity and Internal Control in Information Systems: Strategic Views on the Need for Control, Kluwer Academic Publishers, Norwell, MA.
8. A. S. McGough, D. Wall, J. Brennan, G. Theodoropoulos, E. Ruck-Keene, B. Arief, et al., "Insider Threats: Identifying Anomalous Human Behaviour in Heterogeneous Systems Using Beneficial Intelligent Software (Ben-ware)", in Proceedings of the 7th ACM CCS International Workshop on Managing Insider Security Threats, Denver, Colorado, USA, 2015.
9. S. D. Bhattacharjee, J. Yuan, Z. Jiaqi, and Y.-P. Tan, "Context-aware graph-based analysis for detecting anomalous activities", in Multimedia and Expo (ICME), 2017 IEEE International Conference on, 2017.
10. P. A. Legg, O. Buckley, M. Goldsmith, and S. Creese, "Automated insider threat detection system using user and role-based profile assessment", IEEE Systems Journal, vol. 11, pp. 503-512, 2015.
11. I. Agrafiotis, A. Erola, J. Happa, M. Goldsmith, and S. Creese, "Validating an Insider Threat Detection System: A Real Scenario Perspective", in 2016 IEEE Security and Privacy Workshops (SPW), 2016.
12. T. Rashid, I. Agrafiotis, and J. R. C. Nurse, "A New Take on Detecting Insider Threats: Exploring the Use of Hidden Markov Models", in Proceedings of the 8th ACM CCS International Workshop on Managing Insider Security Threats, Vienna, Austria, 2016.
13. Zamanian Z., Feizollah A., Anuar N.B., Kiah L.B.M., Srikanth K., Kumar S. (2019), "User Profiling in Anomaly Detection of Authorization Logs", in Alfred R., Lim Y., Ibrahim A., Anthony P. (eds), Computational Science and Technology, Lecture Notes in Electrical Engineering, vol 481, Springer, Singapore.
14. Yuqing Sun, Haoran Xu, Elisa Bertino, and Chao Sun (2016), "A Data-Driven Evaluation for Insider Threats", Data Science and Engineering, Vol. 1, No. 2, pp. 73-85. doi:10.1007/s41019-016-0009-x
15. S. Panigrahi, S. Sural and A. K. Majumdar, "Detection of intrusive activity in databases by combining multiple evidences and belief update", 2009 IEEE Symposium on Computational Intelligence in Cyber Security, Nashville, TN, 2009, pp. 83-90. doi:10.1109/CICYBS.2009.4925094
16. Yi Hu, Brajendra Panda, "A data mining approach for database intrusion detection", SAC '04: Proceedings of the 2004 ACM Symposium on Applied Computing, pp. 711-716. doi:10.1145/967900.968048
17. Abhinav Srivastava, Shamik Sural, A. K. Majumdar, "Weighted intra-transactional rule mining for database intrusion detection", Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, April 9-12, 2006, Singapore. doi:10.1007/11731139_71
18. TPC-C benchmark: http://www.tpc.org/tpcc/default.asp
19. Mina Sohrabi, M. M. Javidi, S. Hashemi, "Detecting intrusion transactions in database systems: a novel approach", Journal of Intelligent Information Systems 42:619-644, Springer, 2014.
20. UP Rao et al., "Weighted Role Based Data Dependency Approach for Intrusion Detection in Database", International Journal of Network Security, Vol. 19, No. 3, pp. 358-370, May 2017. doi:10.6633/IJNS.201703.19(3).05
21. R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", in Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, 1993.
22. Sattar Hashemi, Ying Yang, Davoud Zabihzadeh and Mohammadreza Kangavari, "Detecting intrusion transactions in databases using data item dependencies and anomaly analysis", Expert Systems 25(5):460-473, November 2008. doi:10.1111/j.1468-0394.2008.00467
23. Mostafa Doroudian, Hamid Reza Shahriari, "A Hybrid Approach for Database Intrusion Detection at Transaction and Inter-transaction Levels", 6th Conference on Information and Knowledge Technology (IKT 2014), May 28-30, 2014, Shahrood University of Technology, Tehran, Iran.
24. E. Bertino, A. Kamra, E. Terzi and A. Vakali (2005), "Intrusion detection in RBAC-administered databases", in Proceedings of the Applied Computer Security Applications Conference (ACSAC).
25. Lee, V. C. S., Stankovic, J. A., Son, S. H., "Intrusion Detection in Real-time Database Systems Via Time Signatures", in Proceedings of the Sixth IEEE Real Time Technology and Applications Symposium, 2000.
26. Weina Wang, Yunjie Zhang, Yi Li and Xiaona Zhang (2006), "The Global Fuzzy C-Means Clustering Algorithm", 2006 6th World Congress on Intelligent Control and Automation, Dalian, 2006, pp. 3604-3607.
27. Fuglede, Bent; Topsøe, Flemming (2004), "Jensen-Shannon divergence and Hilbert space embedding", IEEE Conference Publication.
28. Dunn, J. C. (1973), "A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters", Journal of Cybernetics 3(3): 32-57. doi:10.1080/01969727308546046. ISSN 0022-0280.
29. A. Mangalampalli and V. Pudi (2009), "Fuzzy association rule mining algorithm for fast and efficient performance on very large datasets", 2009 IEEE International Conference on Fuzzy Systems, Jeju Island, 2009, pp. 1163-1168.
30. Vorontsov, I.E., Kulakovskiy, I.V. and Makeev, V.J. (2013), "Jaccard index based similarity measure to compare transcription factor binding site models", Algorithms for Molecular Biology 8:23. doi:10.1186/1748-7188-8-23
TABLE OF CONTENTS

 DECLARATION
 CERTIFICATE
 ACKNOWLEDGEMENTS
 LIST OF ABBREVIATIONS USED
 LIST OF FIGURES
 LIST OF TABLES
 PROBLEM STATEMENT
 MOTIVATION
 OBJECTIVE
 ABSTRACT
 INTRODUCTION
 LITERATURE OVERVIEW
 OUR PROPOSED APPROACH
 RESULT AND EXPLANATION
 CONCLUSION AND FUTURE WORK
 REFERENCES
LIST OF ABBREVIATIONS USED

1. CDE: Critical Data Elements
2. DAE: Directly Associated Elements
3. IDS: Intrusion Detection System
4. SPM: Sequential Pattern Mining
5. FPM: Frequent Pattern Mining
6. ARM: Association Rule Mining
7. DT: Dubiety Table
8. FPR: False Positive Rate
9. TPR: True Positive Rate
10. FNR: False Negative Rate
11. TNR: True Negative Rate
12. UV: User Vector
13. Uid: User ID
14. Cid: Cluster ID
15. JCD: Jaccard Distance
16. JSD: Jensen-Shannon Distance
17. KL: Kullback-Leibler
LIST OF FIGURES

1. Fig 3(a) Architecture of the Learning Phase
2. Fig 3(b) Architecture of the Testing Phase
3. Fig 6(a) Distribution of data in the dataset
4. Fig 6(b) Variation of performance measures with number of clusters
5. Fig 6(c) Variation of precision, recall, TNR, accuracy with δ1
6. Fig 6(d) Variation of precision, recall, TNR, accuracy with фUT
7. Fig 6(e) Variation of precision, recall, TNR, accuracy with δ2
8. Fig 6(f) Variation of precision, recall, TNR, accuracy with фLT
9. Fig 6(g) Variation of modified Jensen-Shannon distance with Euclidean distance
10. Fig 6(h) Variation of modified Jaccard index with Jaccard index
LIST OF TABLES

1. Table 3.1 Types of attributes and their sensitivity levels
2. Table 3.2 Initial Dubiety Table
3. Table 3.3 Updated Dubiety Table
4. Table 3.4 Rule generator output for the given example
5. Table 3.5 User profile for the given example
6. Table 3.6 Initial dubiety table
7. Table 3.7 Minimum ds values for various users
8. Table 3.8 Calculated dubiety scores
9. Table 3.9 Summary of transactions of various users
10. Table 6.1 Performance measures
11. Table 6.2 Comparison of our approaches
12. Table 6.3 Comparison of our approaches with related works